Merged

Commits
63 commits
6eae5e4
implement batch processing of new versions to archive
qqmyers Dec 21, 2020
8313404
add listonly and limit options, count commandEx as failure
qqmyers Dec 21, 2020
70d923a
send list in response for listonly
qqmyers Dec 21, 2020
96d3723
fix query
qqmyers Dec 21, 2020
cb9f374
case sensitive in query
qqmyers Dec 21, 2020
76e2396
param to only archive latest version
qqmyers Dec 21, 2020
2e8d990
off by one in limit
qqmyers Dec 21, 2020
b796833
documentation
qqmyers Dec 23, 2020
006a4ba
Update doc/sphinx-guides/source/installation/config.rst
qqmyers Jan 8, 2021
bba8ba0
Update doc/sphinx-guides/source/installation/config.rst
qqmyers Jan 8, 2021
011c97a
Update doc/sphinx-guides/source/installation/config.rst
qqmyers Jan 8, 2021
1a1c28c
updates per review
qqmyers Jan 8, 2021
8a0ad71
Merge branch 'IQSS/7493-batch_archiving_API_call' of https://github.c…
qqmyers Jan 8, 2021
7b5aead
Merge remote-tracking branch 'IQSS/develop' into IQSS/7493-batch_arch…
qqmyers Jan 29, 2021
fd32dfd
Merge remote-tracking branch 'IQSS/develop' into IQSS/7493-batch_arch…
qqmyers Feb 23, 2021
805ff95
Merge remote-tracking branch 'IQSS/develop' into IQSS/7493-batch_arch…
qqmyers Apr 7, 2021
e1415f9
Merge remote-tracking branch 'IQSS/develop' into IQSS/7493-batch_arch…
qqmyers Apr 13, 2021
ef9a0b9
Merge remote-tracking branch 'IQSS/develop' into IQSS/7493-batch_arch…
qqmyers Aug 12, 2021
9443e04
Merge remote-tracking branch 'IQSS/develop' into IQSS/7493-batch_arch…
qqmyers Feb 2, 2022
242befa
Merge remote-tracking branch 'IQSS/develop' into IQSS/7493-batch_arch…
qqmyers Feb 15, 2022
7047d00
Merge remote-tracking branch 'IQSS/develop' into TDL/7493-batch_archi…
qqmyers Feb 23, 2022
4549f0c
TDL Bundle text
qqmyers Feb 23, 2022
56ff7bc
fix thread use of requestscoped service
qqmyers Feb 24, 2022
46f8554
update doc to match api call name
qqmyers Feb 24, 2022
33b85f4
adjust to use a space per dataverse (alias)
qqmyers Feb 24, 2022
a5edb8c
Merge remote-tracking branch 'IQSS/develop' into TDL/7493-batch_archi…
qqmyers Feb 24, 2022
e205f4b
custom version
qqmyers Feb 24, 2022
5e625a5
Merge remote-tracking branch 'IQSS/develop' into
qqmyers Mar 7, 2022
53b9803
Merge remote-tracking branch 'IQSS/develop' into
qqmyers Mar 18, 2022
30877d7
Use try with resources to close connections for non-200 status
qqmyers Mar 31, 2022
4f354c9
munge to insure valid spaceName
qqmyers Mar 31, 2022
eebee96
add STANDARD cookie spec
qqmyers Mar 31, 2022
3c90bbe
modify zero-file behaviour to include empty manifest
qqmyers Mar 31, 2022
446063e
set non-null copy location in failure cases to avoid retries
qqmyers Mar 31, 2022
5b2ef34
Refactor/fix popup logic
qqmyers Mar 30, 2022
917d2d9
fix return - not clear why Eclipse didn't flag this
qqmyers Mar 30, 2022
04e4ead
Use download popup's license test code, cleanup
qqmyers Mar 30, 2022
9a3913d
logging/comment updates per review
qqmyers Mar 30, 2022
8e49621
Try to avoid GC causing connection close
qqmyers Mar 31, 2022
7fc6ba4
Clearer logging
qqmyers Mar 31, 2022
26bebc8
avoid spacename with 'final . followed by a number'
qqmyers Apr 1, 2022
45d6e29
Update to use join, try to provide rollback
qqmyers Apr 2, 2022
acb61d2
Add thread control and avoid .- in spaceName which is also prohibited
qqmyers Apr 4, 2022
ab8325b
add method to change thread pool size
qqmyers Apr 4, 2022
5cbfd4a
avoid local var
qqmyers Apr 4, 2022
8f9e7eb
don't close response in 200 case
qqmyers Apr 5, 2022
14cec22
add version to datacite file
qqmyers Apr 5, 2022
345c97a
add _ before version
qqmyers Apr 5, 2022
1be42f5
count success/fail correctly
qqmyers Apr 5, 2022
338c058
Merge remote-tracking branch 'IQSS/develop' into
qqmyers Apr 6, 2022
12e284c
Merge remote-tracking branch 'IQSS/develop' into TDL/7493-batch_archi…
qqmyers Apr 9, 2022
4c87e1e
removing TDL specific changes/obsolete changes
qqmyers Apr 9, 2022
3526d66
change batch call to POST
qqmyers Apr 9, 2022
e4470bf
remove unused import
qqmyers Apr 9, 2022
819447b
remove broader changes
qqmyers Apr 9, 2022
48dd0e9
remove all but thread code, add Google/Local archivers
qqmyers Apr 9, 2022
4a75c93
Merge remote-tracking branch 'IQSS/develop' into TDL/7493-make_BagGen…
qqmyers Apr 12, 2022
3dfb0c3
document setting/function
qqmyers Apr 12, 2022
cd7602b
documentation update per review
qqmyers Apr 13, 2022
36aa64c
add required space
qqmyers Apr 14, 2022
109a4a1
Update doc/sphinx-guides/source/developers/workflows.rst
qqmyers Apr 14, 2022
f6cea7c
typo - remove :
qqmyers Apr 14, 2022
32f3a58
Merge branch 'TDL/7493-make_BagGenerate_threads_configurable_only' of…
qqmyers Apr 14, 2022
12 changes: 7 additions & 5 deletions doc/sphinx-guides/source/admin/integrations.rst
@@ -12,7 +12,7 @@ Getting Data In
A variety of integrations are oriented toward making it easier for your researchers to deposit data into your Dataverse installation.

GitHub
+++++++
++++++

Dataverse integration with GitHub is implemented via a Dataverse Uploader GitHub Action. It is a reusable, composite workflow for uploading a git repository or subdirectory into a dataset on a target Dataverse installation. The action is customizable, allowing users to choose to replace a dataset, add to the dataset, publish it or leave it as a draft version on Dataverse. The action provides some metadata to the dataset, such as the origin GitHub repository, and it preserves the directory tree structure.

@@ -157,12 +157,14 @@ Archivematica

Sponsored by the `Ontario Council of University Libraries (OCUL) <https://ocul.on.ca/>`_, this technical integration enables users of Archivematica to select datasets from connected Dataverse installations and process them for long-term access and digital preservation. For more information and list of known issues, please refer to Artefactual's `release notes <https://wiki.archivematica.org/Archivematica_1.8_and_Storage_Service_0.13_release_notes>`_, `integration documentation <https://www.archivematica.org/en/docs/archivematica-1.8/user-manual/transfer/dataverse/>`_, and the `project wiki <https://wiki.archivematica.org/Dataverse>`_.

DuraCloud/Chronopolis
+++++++++++++++++++++
.. _rda-bagit-archiving:

RDA BagIt (BagPack) Archiving
+++++++++++++++++++++++++++++

A Dataverse installation can be configured to submit a copy of published Datasets, packaged as `Research Data Alliance conformant <https://www.rd-alliance.org/system/files/Research%20Data%20Repository%20Interoperability%20WG%20-%20Final%20Recommendations_reviewed_0.pdf>`_ zipped `BagIt <https://tools.ietf.org/html/draft-kunze-bagit-17>`_ bags to the `Chronopolis <https://libraries.ucsd.edu/chronopolis/>`_ via `DuraCloud <https://duraspace.org/duracloud/>`_
A Dataverse installation can be configured to submit a copy of published Datasets, packaged as `Research Data Alliance conformant <https://www.rd-alliance.org/system/files/Research%20Data%20Repository%20Interoperability%20WG%20-%20Final%20Recommendations_reviewed_0.pdf>`_ zipped `BagIt <https://tools.ietf.org/html/draft-kunze-bagit-17>`_ bags to the `Chronopolis <https://libraries.ucsd.edu/chronopolis/>`_ via `DuraCloud <https://duraspace.org/duracloud/>`_, to a local file system, or to `Google Cloud Storage <https://cloud.google.com/storage>`_.

For details on how to configure this integration, look for "DuraCloud/Chronopolis" in the :doc:`/installation/config` section of the Installation Guide.
For details on how to configure this integration, see :ref:`BagIt Export` in the :doc:`/installation/config` section of the Installation Guide.

Future Integrations
-------------------
7 changes: 4 additions & 3 deletions doc/sphinx-guides/source/developers/workflows.rst
@@ -178,9 +178,9 @@ Available variables are:
archiver
++++++++

A step that sends an archival copy of a Dataset Version to a configured archiver, e.g. the DuraCloud interface of Chronopolis. See the `DuraCloud/Chronopolis Integration documentation <http://guides.dataverse.org/en/latest/admin/integrations.html#id15>`_ for further detail.
A step that sends an archival copy of a Dataset Version to a configured archiver, e.g. the DuraCloud interface of Chronopolis. See :ref:`rda-bagit-archiving` for further detail.

Note - the example step includes two settings required for any archiver and three (DuraCloud*) that are specific to DuraCloud.
Note - the example step includes two settings required for any archiver, three (DuraCloud*) that are specific to DuraCloud, and the optional BagGeneratorThreads setting that controls parallelism when creating the Bag.

.. code:: json

@@ -196,7 +196,8 @@
":ArchiverSettings": "string",
":DuraCloudHost":"string",
":DuraCloudPort":"string",
":DuraCloudContext":"string"
":DuraCloudContext":"string",
":BagGeneratorThreads":"string"
}
}

25 changes: 19 additions & 6 deletions doc/sphinx-guides/source/installation/config.rst
@@ -932,7 +932,7 @@ The minimal configuration to support an archiver integration involves adding a m

\:ArchiverSettings - the archiver class can access required settings including existing Dataverse installation settings and dynamically defined ones specific to the class. This setting is a comma-separated list of those settings. For example\:

``curl http://localhost:8080/api/admin/settings/:ArchiverSettings -X PUT -d ":DuraCloudHost, :DuraCloudPort, :DuraCloudContext"``
``curl http://localhost:8080/api/admin/settings/:ArchiverSettings -X PUT -d ":DuraCloudHost, :DuraCloudPort, :DuraCloudContext, :BagGeneratorThreads"``

The DPN archiver defines three custom settings, one of which is required (the others have defaults):

@@ -942,6 +942,12 @@ The DPN archiver defines three custom settings, one of which is required (the ot

:DuraCloudPort and :DuraCloudContext are also defined if you are not using the defaults ("443" and "duracloud" respectively). (Note\: these settings are only in effect if they are listed in the \:ArchiverSettings. Otherwise, they will not be passed to the DuraCloud Archiver class.)

It can also use one setting that is common to all archivers: :BagGeneratorThreads

``curl http://localhost:8080/api/admin/settings/:BagGeneratorThreads -X PUT -d '8'``

By default, the Bag generator zips two datafiles at a time when creating the Bag. This setting can be used to lower that to 1 (e.g. to decrease system load) or to raise it (e.g. to 4 or 8) to speed the processing of many small files.
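The effect of the thread count can be sketched with a small, self-contained example. This is a hypothetical ``ZipSketch`` class, not Dataverse's actual ``BagGenerator``: a fixed-size pool processes datafiles concurrently, and the pool size is parsed from a setting string with a fallback to the default of 2, mirroring the behavior described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch, not Dataverse's BagGenerator: a thread pool sized by a
// :BagGeneratorThreads-style setting processes datafiles concurrently.
public class ZipSketch {

    static final int DEFAULT_THREADS = 2;

    // Parse the setting value, falling back to the default on a missing or
    // unparseable value, mirroring the behavior the guide describes.
    static int parseThreads(String setting) {
        if (setting == null) {
            return DEFAULT_THREADS;
        }
        try {
            return Integer.parseInt(setting.trim());
        } catch (NumberFormatException nfe) {
            return DEFAULT_THREADS;
        }
    }

    public static void main(String[] args) throws Exception {
        int threads = parseThreads(System.getProperty("bagGeneratorThreads"));
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<String>> zipped = new ArrayList<>();
        for (int i = 1; i <= 6; i++) {
            final int n = i;
            // Each task stands in for zipping one datafile into the Bag.
            zipped.add(pool.submit(() -> "datafile-" + n + ".zip"));
        }
        for (Future<String> f : zipped) {
            System.out.println(f.get());
        }
        pool.shutdown();
    }
}
```

Raising the pool size mainly helps when a version contains many small files, since each file is an independent task; with a handful of large files the extra threads sit idle.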

Archivers may require JVM options as well. For the Chronopolis archiver, the username and password associated with your organization's Chronopolis/DuraCloud account should be configured in Payara:

``./asadmin create-jvm-options '-Dduracloud.username=YOUR_USERNAME_HERE'``
@@ -963,9 +969,9 @@ ArchiverClassName - the fully qualified class to be used for archiving. For exam

\:ArchiverSettings - the archiver class can access required settings including existing Dataverse installation settings and dynamically defined ones specific to the class. This setting is a comma-separated list of those settings. For example\:

``curl http://localhost:8080/api/admin/settings/:ArchiverSettings -X PUT -d ":BagItLocalPath"``
``curl http://localhost:8080/api/admin/settings/:ArchiverSettings -X PUT -d ":BagItLocalPath, :BagGeneratorThreads"``

:BagItLocalPath is the file path that you've set in :ArchiverSettings.
:BagItLocalPath is the file path that you've set in :ArchiverSettings. See the DuraCloud Configuration section for a description of :BagGeneratorThreads.

.. _Google Cloud Configuration:

@@ -976,9 +982,9 @@ The Google Cloud Archiver can send Dataverse Project Bags to a bucket in Google'

``curl http://localhost:8080/api/admin/settings/:ArchiverClassName -X PUT -d "edu.harvard.iq.dataverse.engine.command.impl.GoogleCloudSubmitToArchiveCommand"``

``curl http://localhost:8080/api/admin/settings/:ArchiverSettings -X PUT -d ":GoogleCloudBucket, :GoogleCloudProject"``
``curl http://localhost:8080/api/admin/settings/:ArchiverSettings -X PUT -d ":GoogleCloudBucket, :GoogleCloudProject, :BagGeneratorThreads"``

The Google Cloud Archiver defines two custom settings, both are required. The credentials for your account, in the form of a json key file, must also be obtained and stored locally (see below):
The Google Cloud Archiver defines two custom settings, both of which are required. It can also use the :BagGeneratorThreads setting as described in the DuraCloud Configuration section above. The credentials for your account, in the form of a JSON key file, must also be obtained and stored locally (see below):

In order to use the Google Cloud Archiver, you must have a Google account. You will need to create a project and bucket within that account and provide those values in the settings:

@@ -2400,6 +2406,13 @@ For example, the LocalSubmitToArchiveCommand only uses the :BagItLocalPath setti

``curl -X PUT -d ':BagItLocalPath' http://localhost:8080/api/admin/settings/:ArchiverSettings``

:BagGeneratorThreads
++++++++++++++++++++

An archiver setting shared by several implementations (e.g. DuraCloud, Google, and Local) that makes Bag generation use fewer or more threads than the default of 2 when zipping datafiles.

``curl http://localhost:8080/api/admin/settings/:BagGeneratorThreads -X PUT -d '8'``
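If you later need to inspect or clear the value, the same admin settings endpoint supports GET and DELETE (shown here against a local installation; adjust the host and port for yours):

```shell
# Read the current value of the setting (if any)
curl http://localhost:8080/api/admin/settings/:BagGeneratorThreads

# Delete the setting so Bag generation falls back to the default of 2 threads
curl -X DELETE http://localhost:8080/api/admin/settings/:BagGeneratorThreads
```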

:DuraCloudHost
++++++++++++++
:DuraCloudPort
@@ -2415,7 +2428,7 @@ These three settings define the host, port, and context used by the DuraCloudSub
This is the local file system path to be used with the LocalSubmitToArchiveCommand class. It is recommended to use an absolute path. See the :ref:`Local Path Configuration` section above.

:GoogleCloudBucket
++++++++++++++++++
++++++++++++++++++
:GoogleCloudProject
+++++++++++++++++++

@@ -11,6 +11,7 @@
import edu.harvard.iq.dataverse.engine.command.RequiredPermissions;
import edu.harvard.iq.dataverse.engine.command.exception.CommandException;
import edu.harvard.iq.dataverse.settings.SettingsServiceBean;
import edu.harvard.iq.dataverse.util.bagit.BagGenerator;
import edu.harvard.iq.dataverse.workflow.step.WorkflowStepResult;

import java.util.Date;
@@ -24,6 +25,7 @@ public abstract class AbstractSubmitToArchiveCommand extends AbstractCommand<Dat
private final DatasetVersion version;
private final Map<String, String> requestedSettings = new HashMap<String, String>();
private static final Logger logger = Logger.getLogger(AbstractSubmitToArchiveCommand.class.getName());
private static final int DEFAULT_THREADS = 2;

public AbstractSubmitToArchiveCommand(DataverseRequest aRequest, DatasetVersion version) {
super(aRequest, version.getDataset());
@@ -67,6 +69,18 @@ public DatasetVersion execute(CommandContext ctxt) throws CommandException {
*/
abstract public WorkflowStepResult performArchiveSubmission(DatasetVersion version, ApiToken token, Map<String, String> requestedSetttings);

protected int getNumberOfBagGeneratorThreads() {
if (requestedSettings.get(BagGenerator.BAG_GENERATOR_THREADS) != null) {
try {
return Integer.valueOf(requestedSettings.get(BagGenerator.BAG_GENERATOR_THREADS));
} catch (NumberFormatException nfe) {
logger.warning("Can't parse the value of setting " + BagGenerator.BAG_GENERATOR_THREADS
+ " as an integer - using default: " + DEFAULT_THREADS);
}
}
return DEFAULT_THREADS;
}

@Override
public String describe() {
return super.describe() + "DatasetVersion: [" + version.getId() + " (v"
@@ -42,6 +42,7 @@ public class DuraCloudSubmitToArchiveCommand extends AbstractSubmitToArchiveComm
private static final String DURACLOUD_HOST = ":DuraCloudHost";
private static final String DURACLOUD_CONTEXT = ":DuraCloudContext";


public DuraCloudSubmitToArchiveCommand(DataverseRequest aRequest, DatasetVersion version) {
super(aRequest, version);
}
@@ -128,6 +129,7 @@ public void run() {
try (PipedOutputStream out = new PipedOutputStream(in)){
// Generate bag
BagGenerator bagger = new BagGenerator(new OREMap(dv, false), dataciteXml);
bagger.setNumConnections(getNumberOfBagGeneratorThreads());
bagger.setAuthenticationKey(token.getTokenString());
bagger.generateBag(out);
} catch (Exception e) {
@@ -121,6 +121,7 @@ public void run() {
try (PipedOutputStream out = new PipedOutputStream(in)) {
// Generate bag
BagGenerator bagger = new BagGenerator(new OREMap(dv, false), dataciteXml);
bagger.setNumConnections(getNumberOfBagGeneratorThreads());
bagger.setAuthenticationKey(token.getTokenString());
bagger.generateBag(out);
} catch (Exception e) {
@@ -58,6 +58,7 @@ public WorkflowStepResult performArchiveSubmission(DatasetVersion dv, ApiToken t
new File(localPath + "/" + spaceName + "-datacite.v" + dv.getFriendlyVersionNumber() + ".xml"),
dataciteXml, StandardCharsets.UTF_8);
BagGenerator bagger = new BagGenerator(new OREMap(dv, false), dataciteXml);
bagger.setNumConnections(getNumberOfBagGeneratorThreads());
bagger.setAuthenticationKey(token.getTokenString());
zipName = localPath + "/" + spaceName + "v" + dv.getFriendlyVersionNumber() + ".zip";
bagger.generateBag(new FileOutputStream(zipName + ".partial"));
@@ -115,6 +115,7 @@ public class BagGenerator {
private boolean usetemp = false;

private int numConnections = 8;
public static final String BAG_GENERATOR_THREADS = ":BagGeneratorThreads";
Member:

Instead of having a String here, can we keep all the database settings centralized in the Key enum at SettingsServiceBean?

I tried this locally and the following seems to work:

public static final String BAG_GENERATOR_THREADS = SettingsServiceBean.Key.BagGeneratorThreads.toString();

(I had to import SettingsServiceBean.)

I see that DuraCloudSubmitToArchiveCommand also has strings like :DuraCloudPort, :DuraCloudHost, and :DuraCloudContext hard-coded, but we could centralize those later.

Member Author:

I've purposely avoided that so far with the hope that at some point these archivers will be packaged separately. Right now they are loaded via reflection (you specify the classname in a property) and there should be no references to the specific Archiver classes or their properties in the main codebase. (As you note though we do have their properties in the master list in the guides so far.)
(I hope it isn't too much work for these classes and their dependencies to be pulled out into separate jars, but it isn't something I know how to do off-hand - perhaps @poikilotherm?)

Member:

Interesting. I guess I was under the impression that since these archiver classes/commands are in the main code base that they have to be.

It would be nice to figure out how to extract them to a different git repo. It would be a great success story of modularity in Dataverse.

Once they are truly extracted from the main code base, I suppose their documentation should be extracted too, just like we do with external tools.

It sounds like we're not there yet. That's fine. Thanks for the cleanup you did.


private OREMap oremap;

@@ -1080,4 +1081,9 @@ public void setAuthenticationKey(String tokenString) {
apiKey = tokenString;
}

public void setNumConnections(int numConnections) {
this.numConnections = numConnections;
logger.fine("BagGenerator will use " + numConnections + " threads");
}

}