diff --git a/doc/release-notes/12167-ore-bag-archiving-changes.md b/doc/release-notes/12167-ore-bag-archiving-changes.md new file mode 100644 index 00000000000..a10dbdce1df --- /dev/null +++ b/doc/release-notes/12167-ore-bag-archiving-changes.md @@ -0,0 +1,54 @@ +## Archiving, OAI-ORE, and BagIt Export + +This release includes multiple updates to the OAI-ORE metadata export and the process of creating archival bags, improving performance, fixing bugs, and adding significant new functionality. + +### General Archiving Improvements +- Multiple performance and scaling improvements have been made for creating archival bags for large datasets, including: + - The duration of archiving tasks triggered from the version table or API is no longer limited by the transaction time limit. + - Temporary storage space requirements have increased by `1/:BagGeneratorThreads` of the zipped bag size. (This is a consequence of changes to avoid timeout errors on larger files/datasets.) + - The size of individual data files and the total dataset size that will be included in an archival bag can now be limited. Admins can choose whether files above these limits are transferred along with, but outside, the zipped bag (creating a complete archival copy) or are only referenced (using the concept of a "holey" bag, listing the oversized files and the Dataverse URLs from which they can be retrieved in a `fetch.txt` file). In the holey bag case, an active service on the archiving platform must retrieve the oversized files (using appropriate credentials as needed) to make a complete copy. + - Superusers can now see a pending status in the dataset version table while archiving is active. + - Workflows are now triggered outside the transactions related to publication, ensuring that workflow locks and status updates are always recorded. 
+ - Potential conflicts between archiving/workflows, indexing, and metadata exports after publication have been resolved, avoiding cases where the status/last update times for these actions were not recorded. +- A bug has been fixed where superusers would incorrectly see the "Submit" button to launch archiving from the dataset page version table. +- The local, S3, and Google archivers have been updated to support deleting a version's existing archival files, allowing the bag for that version to be re-created. +- For archivers that support file deletion, it is now possible to recreate an archival bag after "Update Current Version" has been used (replacing the original bag). By default, Dataverse will mark the current version's archive as out-of-date, but will not automatically re-archive it. + - A new 'obsolete' status has been added to indicate when an archival bag exists for a version but it was created prior to an "Update Current Version" change. +- Improvements have been made to file retrieval for bagging, including retries on errors and when download requests are being throttled. + - A bug causing `:BagGeneratorThreads` to be ignored has been fixed, and the default has been reduced to 2. +- Retrieval of files for inclusion in an archival bag is no longer counted as a download. +- It is now possible to require that all previous versions have been successfully archived before archiving of a newly published version can succeed. (This is intended to support use cases where deduplication of files between dataset versions will be done and is a step towards supporting the Oxford Common File Layout (OCFL).) +- The pending status has been changed to use the same JSON format as other statuses. + +### OAI-ORE Export Updates +- The export now uses URIs for checksum algorithms, conforming with JSON-LD requirements. +- A bug causing failures with deaccessioned versions has been fixed. 
This occurred when the deaccession note ("Deaccession Reason" in the UI) was null, which is permissible via the API. +- The `https://schema.org/additionalType` value has been updated to "Dataverse OREMap Format v1.0.2" to reflect format changes. + +### Archival Bag (BagIt) Updates +- The `bag-info.txt` file now correctly includes information for dataset contacts, fixing a bug where nothing was included when multiple contacts were defined. (Multiple contacts were always included in the OAI-ORE file in the bag; only the `bag-info.txt` file was affected.) +- Values used in the `bag-info.txt` file that may be multi-line (i.e. with embedded CR or LF characters) are now properly indented and wrapped per the BagIt specification (`Internal-Sender-Identifier`, `External-Description`, `Source-Organization`, `Organization-Address`). +- The dataset name is no longer used as a subdirectory within the `data/` directory, reducing issues with unzipping long paths on some filesystems. +- For dataset versions with no files, the empty `manifest-<algorithm>.txt` file will now use the algorithm from the `:FileFixityChecksumAlgorithm` setting instead of defaulting to MD5. +- A new key, `Dataverse-Bag-Version`, has been added to `bag-info.txt` with the value "1.0" to allow for tracking changes to Dataverse's archival bag generation over time. +- When using the `holey` bag option discussed above, the required `fetch.txt` file will be included. + + +### New Configuration Settings + +This release introduces several new settings to control archival and bagging behavior. + +- `:ArchiveOnlyIfEarlierVersionsAreArchived` (Default: `false`) + When set to `true`, dataset versions must be archived in order. That is, all prior versions of a dataset must be archived before the latest version can be archived. 
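As a minimal sketch (assuming a standard installation with the admin API reachable at localhost:8080), this database setting can be enabled via the usual settings endpoint:

```shell
# Require that versions be archived in order (hypothetical local instance URL)
curl -X PUT -d true http://localhost:8080/api/admin/settings/:ArchiveOnlyIfEarlierVersionsAreArchived
```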
+ +The following JVM options (MicroProfile Config Settings) control bag size and holey bag support: +- `dataverse.bagit.zip.holey` +- `dataverse.bagit.zip.max-data-size` +- `dataverse.bagit.zip.max-file-size` + +- `dataverse.bagit.archive-on-version-update` (Default: `false`) + Indicates whether archival bag creation should be triggered (if configured) when a version that was already successfully archived is updated, i.e., via the Update-Current-Version publication option. Setting the flag to `true` only works if the archiver being used supports deleting existing archival bags. + + ### Backward Incompatibility + + The name of the archival zipped bag produced by the LocalSubmitToArchiveCommand archiver now includes a '.' character before the version number, mirroring the names used by other archivers, e.g. the name will be doi-10-5072-fk2-fosg5q.v1.0.zip rather than doi-10-5072-fk2-fosg5qv1.0.zip \ No newline at end of file diff --git a/doc/sphinx-guides/source/admin/big-data-administration.rst b/doc/sphinx-guides/source/admin/big-data-administration.rst index c4a98a6987a..c1d2a02c4a2 100644 --- a/doc/sphinx-guides/source/admin/big-data-administration.rst +++ b/doc/sphinx-guides/source/admin/big-data-administration.rst @@ -302,6 +302,7 @@ There are a broad range of options (that are not turned on by default) for impro - :ref:`:DisableSolrFacetsWithoutJsession` - disables facets for users who have disabled cookies (e.g. for bots) - :ref:`:DisableUncheckedTypesFacet` - only disables the facet showing the number of collections, datasets, files matching the query (this facet is potentially less useful than others) - :ref:`:StoreIngestedTabularFilesWithVarHeaders` - by default, Dataverse stores ingested files without headers and dynamically adds them back at download time. 
Once this setting is enabled, Dataverse will leave the headers in place (for newly ingested files), reducing the cost of downloads +- :ref:`dataverse.bagit.zip.max-file-size`, :ref:`dataverse.bagit.zip.max-data-size`, and :ref:`dataverse.bagit.zip.holey` - options to control the size and temporary storage requirements when generating archival Bags - see :ref:`BagIt Export` Scaling Infrastructure diff --git a/doc/sphinx-guides/source/installation/config.rst b/doc/sphinx-guides/source/installation/config.rst index 07084d8b126..f829b57d3e6 100644 --- a/doc/sphinx-guides/source/installation/config.rst +++ b/doc/sphinx-guides/source/installation/config.rst @@ -2259,10 +2259,22 @@ These archival Bags include all of the files and metadata in a given dataset ver The Dataverse Software offers an internal archive workflow which may be configured as a PostPublication workflow via an admin API call to manually submit previously published Datasets and prior versions to a configured archive such as Chronopolis. The workflow creates a `JSON-LD `_ serialized `OAI-ORE `_ map file, which is also available as a metadata export format in the Dataverse Software web interface. +The size of the zipped archival Bag can now be limited for all archivers. Files that don't fit within that limit can either be transferred separately (placed so that they are correctly positioned according to the BagIt specification when the zipped bag is unzipped in place) or just referenced for later download (using the BagIt concept of a 'holey' bag with a list of files in a ``fetch.txt`` file). These settings allow for managing large datasets by excluding files over a certain size or total data size, which can be useful for archivers with size limitations or to reduce transfer times. See the :ref:`dataverse.bagit.zip.max-file-size`, :ref:`dataverse.bagit.zip.max-data-size`, and :ref:`dataverse.bagit.zip.holey` JVM options for more details. 
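For illustration (the instance URL, file IDs, sizes, and file names below are hypothetical), each line of the ``fetch.txt`` in a holey bag gives a URL, a length in bytes, and the path the file would occupy inside the bag, per the BagIt specification (RFC 8493)::

    https://demo.dataverse.org/api/access/datafile/101 52428800000 data/survey-video.mp4
    https://demo.dataverse.org/api/access/datafile/102 73400320000 data/simulation-output.nc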
+ At present, archiving classes include the DuraCloudSubmitToArchiveCommand, LocalSubmitToArchiveCommand, GoogleCloudSubmitToArchive, and S3SubmitToArchiveCommand, which all extend the AbstractSubmitToArchiveCommand and use the configurable mechanisms discussed below. (A DRSSubmitToArchiveCommand, which works with Harvard's DRS, also exists and, while specific to DRS, is a useful example of how Archivers can support single-version-only semantics and support archiving only from specified collections (with collection-specific parameters).) All current options support the :ref:`Archival Status API` calls and the same status is available in the dataset page version table (for contributors/those who could view the unpublished dataset, with more detail available to superusers). +Two settings that can be used with all current Archivers are: + +- \:BagGeneratorThreads - the number of threads to use when adding data files to the zipped bag. The default is 2. Values of 4 or more may increase performance on larger machines but may cause problems if file access is throttled. +- \:ArchiveOnlyIfEarlierVersionsAreArchived - when true, requires dataset versions to be archived in order by confirming that all prior versions have been successfully archived before allowing a new version to be archived. The default is false. + +These must be included in the \:ArchiverSettings for the Archiver to work. + +Archival Bags are created per dataset version. By default, if a version is republished (via the superuser-only 'Update Current Version' publication option in the UI/API), a new archival bag is not created for the version. +If the archiver used is capable of deleting existing bags (the Google, S3, and File Archivers), superusers can trigger a manual update of the archival bag, and, if the :ref:`dataverse.bagit.archive-on-version-update` flag is set to true, this will be done automatically when 'Update Current Version' is used. + .. 
_Duracloud Configuration: Duracloud Configuration @@ -3715,6 +3727,14 @@ The email for your institution that you'd like to appear in bag-info.txt. See :r Can also be set via *MicroProfile Config API* sources, e.g. the environment variable ``DATAVERSE_BAGIT_SOURCEORG_EMAIL``. +.. _dataverse.bagit.archive-on-version-update: + +dataverse.bagit.archive-on-version-update ++++++++++++++++++++++++++++++++++++++++++ + +Indicates whether archival bag creation should be triggered (if configured) when a version is updated and was already successfully archived, +i.e., via the Update-Current-Version publication option. Setting the flag to true only works if the archiver being used supports deleting existing archival bags. + .. _dataverse.files.globus-monitoring-server: dataverse.files.globus-monitoring-server @@ -3868,6 +3888,21 @@ This can instead be restricted to only superusers who can publish the dataset us Example: ``dataverse.coar-notify.relationship-announcement.notify-superusers-only=true`` +.. _dataverse.bagit.zip.holey: + +``dataverse.bagit.zip.holey`` + A boolean that, if true, will cause the BagIt archiver to create a "holey" bag. In a holey bag, files that are not included in the bag are listed in the ``fetch.txt`` file with a URL from which they can be downloaded. This is used in conjunction with ``dataverse.bagit.zip.max-file-size`` and/or ``dataverse.bagit.zip.max-data-size``. Default: false. + +.. _dataverse.bagit.zip.max-data-size: + +``dataverse.bagit.zip.max-data-size`` + The maximum total (uncompressed) size of data files (in bytes) to include in a BagIt zip archive. If the total size of the dataset files exceeds this limit, files will be excluded from the zipped bag (starting from the largest) until the total size is under the limit. Excluded files will be handled as defined by ``dataverse.bagit.zip.holey`` - just listed if that setting is true, or transferred separately and placed next to the zipped bag otherwise. When not set, there is no limit. + .. 
_dataverse.bagit.zip.max-file-size: + +``dataverse.bagit.zip.max-file-size`` + The maximum (uncompressed) size of a single file (in bytes) to include in a BagIt zip archive. Any file larger than this will be excluded. Excluded files will be handled as defined by ``dataverse.bagit.zip.holey`` - just listed if that setting is true, or transferred separately and placed next to the zipped bag otherwise. When not set, there is no limit. + .. _feature-flags: Feature Flags @@ -4031,9 +4066,6 @@ dataverse.feature.only-update-datacite-when-needed Only contact DataCite to update a DOI after checking to see if DataCite has outdated information (for efficiency, lighter load on DataCite, especially when using file DOIs). - - - .. _:ApplicationServerSettings: Application Server Settings @@ -5342,6 +5374,11 @@ This setting specifies which storage system to use by identifying the particular For examples, see the specific configuration above in :ref:`BagIt Export`. +:ArchiveOnlyIfEarlierVersionsAreArchived +++++++++++++++++++++++++++++++++++++++++ + +This setting, if true, only allows creation of an archival Bag for a dataset version if all prior versions have been successfully archived. The default is false (any version can be archived independently as long as other settings allow it). + :ArchiverSettings +++++++++++++++++ diff --git a/src/main/java/edu/harvard/iq/dataverse/DataFile.java b/src/main/java/edu/harvard/iq/dataverse/DataFile.java index 45604a5472b..8a08cd15029 100644 --- a/src/main/java/edu/harvard/iq/dataverse/DataFile.java +++ b/src/main/java/edu/harvard/iq/dataverse/DataFile.java @@ -109,18 +109,22 @@ public class DataFile extends DvObject implements Comparable { * The list of types should be limited to the list above in the technote * because the string gets passed into MessageDigest.getInstance() and you * can't just pass in any old string. + * + * The URIs are used in the OAI-ORE export. They are taken from the associated XML Digital Signature standards. 
*/ public enum ChecksumType { - MD5("MD5"), - SHA1("SHA-1"), - SHA256("SHA-256"), - SHA512("SHA-512"); + MD5("MD5", "http://www.w3.org/2001/04/xmldsig-more#md5"), + SHA1("SHA-1", "http://www.w3.org/2000/09/xmldsig#sha1"), + SHA256("SHA-256", "http://www.w3.org/2001/04/xmlenc#sha256"), + SHA512("SHA-512", "http://www.w3.org/2001/04/xmlenc#sha512"); private final String text; + private final String uri; - private ChecksumType(final String text) { + private ChecksumType(final String text, final String uri) { this.text = text; + this.uri = uri; } public static ChecksumType fromString(String text) { @@ -131,13 +135,30 @@ public static ChecksumType fromString(String text) { } } } - throw new IllegalArgumentException("ChecksumType must be one of these values: " + Arrays.asList(ChecksumType.values()) + "."); + throw new IllegalArgumentException( + "ChecksumType must be one of these values: " + Arrays.asList(ChecksumType.values()) + "."); + } + + public static ChecksumType fromUri(String uri) { + if (uri != null) { + for (ChecksumType checksumType : ChecksumType.values()) { + if (uri.equals(checksumType.uri)) { + return checksumType; + } + } + } + throw new IllegalArgumentException( + "ChecksumType must be one of these values: " + Arrays.asList(ChecksumType.values()) + "."); } @Override public String toString() { return text; } + + public String toUri() { + return uri; + } } //@Expose diff --git a/src/main/java/edu/harvard/iq/dataverse/DatasetPage.java b/src/main/java/edu/harvard/iq/dataverse/DatasetPage.java index a2da482258d..dfac5f75771 100644 --- a/src/main/java/edu/harvard/iq/dataverse/DatasetPage.java +++ b/src/main/java/edu/harvard/iq/dataverse/DatasetPage.java @@ -39,6 +39,7 @@ import edu.harvard.iq.dataverse.engine.command.impl.UpdateDatasetVersionCommand; import edu.harvard.iq.dataverse.export.ExportService; import edu.harvard.iq.dataverse.util.cache.CacheFactoryBean; +import edu.harvard.iq.dataverse.util.json.JsonUtil; import io.gdcc.spi.export.ExportException; 
import io.gdcc.spi.export.Exporter; import edu.harvard.iq.dataverse.ingest.IngestRequest; @@ -102,6 +103,8 @@ import jakarta.faces.view.ViewScoped; import jakarta.inject.Inject; import jakarta.inject.Named; +import jakarta.json.Json; +import jakarta.json.JsonObjectBuilder; import jakarta.persistence.OptimisticLockException; import org.apache.commons.lang3.StringUtils; @@ -385,7 +388,9 @@ public void setSelectedHostDataverse(Dataverse selectedHostDataverse) { private boolean showIngestSuccess; private Boolean archivable = null; - private Boolean versionArchivable = null; + private Boolean checkForArchivalCopy; + private Boolean supportsDelete; + private HashMap<Long, Boolean> versionArchivable = new HashMap<>(); private Boolean someVersionArchived = null; public boolean isShowIngestSuccess() { @@ -2990,27 +2995,38 @@ public String updateCurrentVersion() { String className = settingsService.get(SettingsServiceBean.Key.ArchiverClassName.toString()); AbstractSubmitToArchiveCommand archiveCommand = ArchiverUtil.createSubmitToArchiveCommand(className, dvRequestService.getDataverseRequest(), updateVersion); if (archiveCommand != null) { - // Delete the record of any existing copy since it is now out of date/incorrect - updateVersion.setArchivalCopyLocation(null); - /* - * Then try to generate and submit an archival copy. Note that running this - * command within the CuratePublishedDatasetVersionCommand was causing an error: - * "The attribute [id] of class - * [edu.harvard.iq.dataverse.DatasetFieldCompoundValue] is mapped to a primary - * key column in the database. Updates are not allowed." To avoid that, and to - * simplify reporting back to the GUI whether this optional step succeeded, I've - * pulled this out as a separate submit(). 
- */ - try { - updateVersion = commandEngine.submit(archiveCommand); - if (!updateVersion.getArchivalCopyLocationStatus().equals(DatasetVersion.ARCHIVAL_STATUS_FAILURE)) { - successMsg = BundleUtil.getStringFromBundle("datasetversion.update.archive.success"); - } else { - errorMsg = BundleUtil.getStringFromBundle("datasetversion.update.archive.failure"); + //There is an archiver configured, so now decide what to do: + // If a successful copy exists, don't automatically update, just note the old copy is obsolete (and enable the superadmin button in the display to allow a manual update if desired) + // If pending or an obsolete copy exists, do nothing (nominally if a pending run succeeds and we're updating the current version here, it should be marked as obsolete - ignoring for now since updates within the time an archiving run is pending should be rare) + // If a failure or null, rerun archiving now. If a failure is due to an existing copy in the repo, we'll fail again + String status = updateVersion.getArchivalCopyLocationStatus(); + if((status==null) || status.equals(DatasetVersion.ARCHIVAL_STATUS_FAILURE) || (JvmSettings.BAGIT_ARCHIVE_ON_VERSION_UPDATE.lookupOptional(Boolean.class).orElse(false) && archiveCommand.canDelete())){ + // Delete the record of any existing copy since it is now out of date/incorrect + JsonObjectBuilder job = Json.createObjectBuilder(); + job.add(DatasetVersion.ARCHIVAL_STATUS, DatasetVersion.ARCHIVAL_STATUS_PENDING); + updateVersion.setArchivalCopyLocation(JsonUtil.prettyPrint(job.build())); + //Persist to db now + datasetVersionService.persistArchivalCopyLocation(updateVersion); + /* + * Then try to generate and submit an archival copy. Note that running this + * command within the CuratePublishedDatasetVersionCommand was causing an error: + * "The attribute [id] of class + * [edu.harvard.iq.dataverse.DatasetFieldCompoundValue] is mapped to a primary + * key column in the database. Updates are not allowed." 
To avoid that, and to + * simplify reporting back to the GUI whether this optional step succeeded, I've + * pulled this out as a separate submit(). + */ + try { + commandEngine.submitAsync(archiveCommand); + JsfHelper.addSuccessMessage(BundleUtil.getStringFromBundle("datasetversion.archive.inprogress")); + } catch (CommandException ex) { + errorMsg = BundleUtil.getStringFromBundle("datasetversion.update.archive.failure") + " - " + ex.toString(); + logger.severe(ex.getMessage()); } - } catch (CommandException ex) { - errorMsg = BundleUtil.getStringFromBundle("datasetversion.update.archive.failure") + " - " + ex.toString(); - logger.severe(ex.getMessage()); + } else if(status.equals(DatasetVersion.ARCHIVAL_STATUS_SUCCESS)) { + //Not automatically replacing the old archival copy as creating it is expensive + updateVersion.setArchivalStatusOnly(DatasetVersion.ARCHIVAL_STATUS_OBSOLETE); + datasetVersionService.persistArchivalCopyLocation(updateVersion); } } } @@ -6094,33 +6110,33 @@ public void refreshPaginator() { /** * This method can be called from *.xhtml files to allow archiving of a dataset - * version from the user interface. It is not currently (11/18) used in the IQSS/develop - * branch, but is used by QDR and is kept here in anticipation of including a - * GUI option to archive (already published) versions after other dataset page - * changes have been completed. + * version from the user interface. * * @param id - the id of the datasetversion to archive. 
* @param force - if true, re-create the archival bag even when one already exists (only effective if the configured archiver supports deleting existing bags).
*/ - public void archiveVersion(Long id) { + public void archiveVersion(Long id, boolean force) { if (session.getUser() instanceof AuthenticatedUser) { DatasetVersion dv = datasetVersionService.retrieveDatasetVersionByVersionId(id).getDatasetVersion(); String className = settingsWrapper.getValueForKey(SettingsServiceBean.Key.ArchiverClassName, null); AbstractSubmitToArchiveCommand cmd = ArchiverUtil.createSubmitToArchiveCommand(className, dvRequestService.getDataverseRequest(), dv); if (cmd != null) { try { - DatasetVersion version = commandEngine.submit(cmd); - if (!version.getArchivalCopyLocationStatus().equals(DatasetVersion.ARCHIVAL_STATUS_FAILURE)) { + String status = dv.getArchivalCopyLocationStatus(); + if (status == null || (force && cmd.canDelete())) { + + // Set initial pending status + JsonObjectBuilder job = Json.createObjectBuilder(); + job.add(DatasetVersion.ARCHIVAL_STATUS, DatasetVersion.ARCHIVAL_STATUS_PENDING); + dv.setArchivalCopyLocation(JsonUtil.prettyPrint(job.build())); + //Persist now + datasetVersionService.persistArchivalCopyLocation(dv); + commandEngine.submitAsync(cmd); + logger.info( - "DatasetVersion id=" + version.getId() + " submitted to Archive, status: " + dv.getArchivalCopyLocationStatus()); - } else { - logger.severe("Error submitting version " + version.getId() + " due to conflict/error at Archive"); - } - if (version.getArchivalCopyLocation() != null) { + "DatasetVersion id=" + dv.getId() + " submitted to Archive, status: " + dv.getArchivalCopyLocationStatus()); setVersionTabList(resetVersionTabList()); this.setVersionTabListForPostLoad(getVersionTabList()); - JsfHelper.addSuccessMessage(BundleUtil.getStringFromBundle("datasetversion.archive.success")); - } else { - JsfHelper.addErrorMessage(BundleUtil.getStringFromBundle("datasetversion.archive.failure")); + JsfHelper.addSuccessMessage(BundleUtil.getStringFromBundle("datasetversion.archive.inprogress")); } } catch (CommandException ex) { logger.log(Level.SEVERE, "Unexpected 
Exception calling submit archive command", ex); @@ -6154,41 +6170,61 @@ public boolean isArchivable() { return archivable; } - public boolean isVersionArchivable() { - if (versionArchivable == null) { + public boolean isVersionArchivable(Long id) { + Boolean thisVersionArchivable = versionArchivable.get(id); + if (thisVersionArchivable == null) { // If this dataset isn't in an archivable collection return false - versionArchivable = false; + thisVersionArchivable = false; + boolean requiresEarlierVersionsToBeArchived = settingsWrapper.isTrueForKey(SettingsServiceBean.Key.ArchiveOnlyIfEarlierVersionsAreArchived, false); if (isArchivable()) { - boolean checkForArchivalCopy = false; // Otherwise, we need to know if the archiver is single-version-only // If it is, we have to check for an existing archived version to answer the // question String className = settingsWrapper.getValueForKey(SettingsServiceBean.Key.ArchiverClassName, null); if (className != null) { try { - Class clazz = Class.forName(className); - Method m = clazz.getMethod("isSingleVersion", SettingsWrapper.class); - Object[] params = { settingsWrapper }; - checkForArchivalCopy = (Boolean) m.invoke(null, params); - + DatasetVersion targetVersion = dataset.getVersions().stream() + .filter(v -> v.getId().equals(id)).findFirst().orElse(null); + if (requiresEarlierVersionsToBeArchived) {// Find the specific version by id + DatasetVersion priorVersion = DatasetUtil.getPriorVersion(targetVersion); + + if (priorVersion== null || (isVersionArchivable(priorVersion.getId()) + && ArchiverUtil.isVersionArchived(priorVersion))) { + thisVersionArchivable = true; + } + } + if (checkForArchivalCopy == null) { + //Only check once + Class clazz = Class.forName(className); + Method m = clazz.getMethod("isSingleVersion", SettingsWrapper.class); + Method m2 = clazz.getMethod("supportsDelete"); + Object[] params = { settingsWrapper }; + checkForArchivalCopy = (Boolean) m.invoke(null, params); + supportsDelete = (Boolean) 
m2.invoke(null); + } if (checkForArchivalCopy) { // If we have to check (single version archiving), we can't allow archiving if // one version is already archived (or attempted - any non-null status) - versionArchivable = !isSomeVersionArchived(); + thisVersionArchivable = !isSomeVersionArchived(); } else { - // If we allow multiple versions or didn't find one that has had archiving run - // on it, we can archive, so return true - versionArchivable = true; + // If we didn't find one that has had archiving run + // on it, or archiving per version is supported and either + // the status is null or the archiver can delete prior runs and status isn't success, + // we can archive, so return true + // Find the specific version by id + String status = targetVersion.getArchivalCopyLocationStatus(); + thisVersionArchivable = (status == null) || ((!status.equals(DatasetVersion.ARCHIVAL_STATUS_SUCCESS) && (!status.equals(DatasetVersion.ARCHIVAL_STATUS_PENDING)) && supportsDelete)); } } catch (ClassNotFoundException | IllegalAccessException | IllegalArgumentException | InvocationTargetException | NoSuchMethodException | SecurityException e) { - logger.warning("Failed to call isSingleVersion on configured archiver class: " + className); + logger.warning("Failed to call methods on configured archiver class: " + className); e.printStackTrace(); } } } + versionArchivable.put(id, thisVersionArchivable); } - return versionArchivable; + return thisVersionArchivable; } public boolean isSomeVersionArchived() { diff --git a/src/main/java/edu/harvard/iq/dataverse/DatasetServiceBean.java b/src/main/java/edu/harvard/iq/dataverse/DatasetServiceBean.java index a58dad4f4c7..8b820fbc7a4 100644 --- a/src/main/java/edu/harvard/iq/dataverse/DatasetServiceBean.java +++ b/src/main/java/edu/harvard/iq/dataverse/DatasetServiceBean.java @@ -1140,4 +1140,19 @@ public void saveStorageQuota(Dataset target, Long allocation) { } em.flush(); } + + @TransactionAttribute(TransactionAttributeType.REQUIRES_NEW) 
+ public void setLastExportTimeInNewTransaction(Long datasetId, Date lastExportTime) { + try { + Dataset currentDataset = find(datasetId); + if (currentDataset != null) { + currentDataset.setLastExportTime(lastExportTime); + merge(currentDataset); + } else { + logger.log(Level.SEVERE, "Could not find Dataset with id={0} to set the last export time in a new transaction.", datasetId); + } + } catch (Exception e) { + logger.log(Level.SEVERE, "Failed to set the last export time in a new transaction for dataset id=" + datasetId, e); + } + } } diff --git a/src/main/java/edu/harvard/iq/dataverse/DatasetVersion.java b/src/main/java/edu/harvard/iq/dataverse/DatasetVersion.java index 93b0ccfef61..92bab58e8d6 100644 --- a/src/main/java/edu/harvard/iq/dataverse/DatasetVersion.java +++ b/src/main/java/edu/harvard/iq/dataverse/DatasetVersion.java @@ -132,6 +132,7 @@ public enum VersionState { public static final String ARCHIVAL_STATUS_PENDING = "pending"; public static final String ARCHIVAL_STATUS_SUCCESS = "success"; public static final String ARCHIVAL_STATUS_FAILURE = "failure"; + public static final String ARCHIVAL_STATUS_OBSOLETE = "obsolete"; @Id @GeneratedValue(strategy = GenerationType.IDENTITY) @@ -231,8 +232,9 @@ public enum VersionState { @Transient private DatasetVersionDifference dvd; + //The Json version of the archivalCopyLocation string @Transient - private JsonObject archivalStatus; + private JsonObject archivalCopyLocationJson; public Long getId() { return this.id; } @@ -383,25 +385,25 @@ public String getArchivalCopyLocation() { public String getArchivalCopyLocationStatus() { populateArchivalStatus(false); - if(archivalStatus!=null) { - return archivalStatus.getString(ARCHIVAL_STATUS); + if(archivalCopyLocationJson!=null) { + return archivalCopyLocationJson.getString(ARCHIVAL_STATUS); } return null; } public String getArchivalCopyLocationMessage() { populateArchivalStatus(false); - if(archivalStatus!=null) { - return 
archivalStatus.getString(ARCHIVAL_STATUS_MESSAGE); + if(archivalCopyLocationJson!=null && archivalCopyLocationJson.containsKey(ARCHIVAL_STATUS_MESSAGE)) { + return archivalCopyLocationJson.getString(ARCHIVAL_STATUS_MESSAGE); } return null; } private void populateArchivalStatus(boolean force) { - if(archivalStatus ==null || force) { + if(archivalCopyLocationJson ==null || force) { if(archivalCopyLocation!=null) { try { - archivalStatus = JsonUtil.getJsonObject(archivalCopyLocation); - } catch(Exception e) { + archivalCopyLocationJson = JsonUtil.getJsonObject(archivalCopyLocation); + } catch (Exception e) { logger.warning("DatasetVersion id: " + id + " has a non-JsonObject value, parsing error: " + e.getMessage()); logger.fine(archivalCopyLocation); } @@ -414,6 +416,15 @@ public void setArchivalCopyLocation(String location) { populateArchivalStatus(true); } + // Convenience method to just change the status without changing the location + public void setArchivalStatusOnly(String status) { + populateArchivalStatus(false); + JsonObjectBuilder job = Json.createObjectBuilder(archivalCopyLocationJson); + job.add(DatasetVersion.ARCHIVAL_STATUS, status); + archivalCopyLocationJson = job.build(); + archivalCopyLocation = JsonUtil.prettyPrint(archivalCopyLocationJson); + } + public String getDeaccessionLink() { return deaccessionLink; } diff --git a/src/main/java/edu/harvard/iq/dataverse/DatasetVersionServiceBean.java b/src/main/java/edu/harvard/iq/dataverse/DatasetVersionServiceBean.java index 60df1fd3dfd..a5dd724104f 100644 --- a/src/main/java/edu/harvard/iq/dataverse/DatasetVersionServiceBean.java +++ b/src/main/java/edu/harvard/iq/dataverse/DatasetVersionServiceBean.java @@ -28,11 +28,14 @@ import jakarta.ejb.EJB; import jakarta.ejb.EJBException; import jakarta.ejb.Stateless; +import jakarta.ejb.TransactionAttribute; +import jakarta.ejb.TransactionAttributeType; import jakarta.inject.Named; import jakarta.json.Json; import jakarta.json.JsonObjectBuilder; import 
jakarta.persistence.EntityManager; import jakarta.persistence.NoResultException; +import jakarta.persistence.OptimisticLockException; import jakarta.persistence.PersistenceContext; import jakarta.persistence.Query; import jakarta.persistence.TypedQuery; @@ -1333,4 +1336,24 @@ public Long getDatasetVersionCount(Long datasetId, boolean canViewUnpublishedVer return em.createQuery(cq).getSingleResult(); } + + + /** + * Update the archival copy location for a specific version of a dataset. + * Archiving can be long-running and other parallel updates to the datasetversion have likely occurred + * so this method will just re-find the version rather than risking an + * OptimisticLockException and then having to retry in yet another transaction (since the OLE rolls this one back). + * + * @param dv + * The dataset version whose archival copy location we want to update. Must not be {@code null}. + */ + @TransactionAttribute(TransactionAttributeType.REQUIRES_NEW) + public void persistArchivalCopyLocation(DatasetVersion dv) { + DatasetVersion currentVersion = find(dv.getId()); + if (currentVersion != null) { + currentVersion.setArchivalCopyLocation(dv.getArchivalCopyLocation()); + } else { + logger.log(Level.SEVERE, "Could not find DatasetVersion with id={0} to retry persisting archival copy location after OptimisticLockException.", dv.getId()); + } + } } diff --git a/src/main/java/edu/harvard/iq/dataverse/EjbDataverseEngine.java b/src/main/java/edu/harvard/iq/dataverse/EjbDataverseEngine.java index 4d6d59cb013..4fa85a543d8 100644 --- a/src/main/java/edu/harvard/iq/dataverse/EjbDataverseEngine.java +++ b/src/main/java/edu/harvard/iq/dataverse/EjbDataverseEngine.java @@ -31,6 +31,9 @@ import java.util.Map; import java.util.Set; + +import jakarta.ejb.AsyncResult; +import jakarta.ejb.Asynchronous; import jakarta.ejb.EJB; import jakarta.ejb.Stateless; import jakarta.inject.Named; @@ -45,6 +48,7 @@ import java.util.Arrays; import java.util.EnumSet; import java.util.Stack; +import 
java.util.concurrent.Future; import java.util.logging.Level; import java.util.logging.Logger; import jakarta.annotation.Resource; @@ -348,6 +352,27 @@ public <R> R submit(Command<R> aCommand) throws CommandException { logSvc.log(logRec); } } + + /** + * Submits a command for asynchronous execution. + * The command will be executed in a separate thread and won't block the caller. + * + * @param <R> The return type of the command + * @param aCommand The command to execute + * @return A Future representing the pending result + * @throws CommandException if the command cannot be submitted + */ + @Asynchronous + public <R> Future<R> submitAsync(Command<R> aCommand) throws CommandException { + try { + logger.log(Level.INFO, "Submitting async command: {0}", aCommand.getClass().getSimpleName()); + R result = submit(aCommand); + return new AsyncResult<>(result); + } catch (Exception e) { + logger.log(Level.SEVERE, "Async command execution failed: " + aCommand.getClass().getSimpleName(), e); + throw e; + } + } protected void completeCommand(Command command, Object r, Stack called) { diff --git a/src/main/java/edu/harvard/iq/dataverse/FileMetadataVersionsHelper.java b/src/main/java/edu/harvard/iq/dataverse/FileMetadataVersionsHelper.java index 4d408a72c8c..cc632054642 100644 --- a/src/main/java/edu/harvard/iq/dataverse/FileMetadataVersionsHelper.java +++ b/src/main/java/edu/harvard/iq/dataverse/FileMetadataVersionsHelper.java @@ -1,6 +1,7 @@ package edu.harvard.iq.dataverse; import edu.harvard.iq.dataverse.authorization.Permission; +import edu.harvard.iq.dataverse.dataset.DatasetUtil; import edu.harvard.iq.dataverse.engine.command.DataverseRequest; import jakarta.ejb.EJB; import jakarta.ejb.Stateless; @@ -95,18 +96,7 @@ private FileMetadata getPreviousFileMetadata(FileMetadata fileMetadata, FileMeta //TODO: this could use some refactoring to cut down on the number of for loops!
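An aside, not part of the patch: the fire-and-return-a-`Future` behavior that `submitAsync` introduces in `EjbDataverseEngine` above can be sketched without an EJB container using a plain `ExecutorService`. The class and method names below are hypothetical:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical, container-free sketch of the @Asynchronous submitAsync pattern:
// the caller gets a Future immediately while the long-running command executes
// on another thread, so archiving no longer runs inside the caller's transaction.
public class AsyncSubmitSketch {
    // Daemon thread so the JVM can exit without an explicit shutdown
    private static final ExecutorService POOL = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    public static <R> Future<R> submitAsync(Callable<R> command) {
        return POOL.submit(command); // returns immediately, like @Asynchronous
    }

    // Convenience wrapper so callers need not handle Future's checked exceptions
    public static <R> R await(Future<R> future) {
        try {
            return future.get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

In the patch itself the container supplies the worker thread via `@Asynchronous` and wraps the return value in `AsyncResult`; this sketch only mirrors the calling contract.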
private FileMetadata getPreviousFileMetadata(FileMetadata fileMetadata, DatasetVersion currentversion) { List allfiles = allRelatedFiles(fileMetadata); - boolean foundCurrent = false; - DatasetVersion priorVersion = null; - for (DatasetVersion versionLoop : fileMetadata.getDatasetVersion().getDataset().getVersions()) { - if (foundCurrent) { - priorVersion = versionLoop; - break; - } - if (versionLoop.equals(currentversion)) { - foundCurrent = true; - } - - } + DatasetVersion priorVersion = DatasetUtil.getPriorVersion(fileMetadata.getDatasetVersion()); if (priorVersion != null && priorVersion.getFileMetadatasSorted() != null) { for (FileMetadata fmdTest : priorVersion.getFileMetadatasSorted()) { for (DataFile fileTest : allfiles) { diff --git a/src/main/java/edu/harvard/iq/dataverse/api/Admin.java b/src/main/java/edu/harvard/iq/dataverse/api/Admin.java index 18f28569d7d..10aadde57b6 100644 --- a/src/main/java/edu/harvard/iq/dataverse/api/Admin.java +++ b/src/main/java/edu/harvard/iq/dataverse/api/Admin.java @@ -2067,6 +2067,7 @@ public Response submitDatasetVersionToArchive(@Context ContainerRequestContext c if(dv==null) { return error(Status.BAD_REQUEST, "Requested version not found."); } + //ToDo - allow forcing with a non-success status for archivers that supportsDelete() if (dv.getArchivalCopyLocation() == null) { String className = settingsService.getValueForKey(SettingsServiceBean.Key.ArchiverClassName); // Note - the user is being sent via the createDataverseRequest(au) call to the @@ -2132,7 +2133,7 @@ public Response archiveAllUnarchivedDatasetVersions(@Context ContainerRequestCon try { AuthenticatedUser au = getRequestAuthenticatedUserOrDie(crc); - + //ToDo - allow forcing with a non-success status for archivers that supportsDelete() List dsl = datasetversionService.getUnarchivedDatasetVersions(); if (dsl != null) { if (listonly) { diff --git a/src/main/java/edu/harvard/iq/dataverse/api/Datasets.java 
b/src/main/java/edu/harvard/iq/dataverse/api/Datasets.java index b6688a8143b..738955e259b 100644 --- a/src/main/java/edu/harvard/iq/dataverse/api/Datasets.java +++ b/src/main/java/edu/harvard/iq/dataverse/api/Datasets.java @@ -1278,27 +1278,35 @@ public Response publishDataset(@Context ContainerRequestContext crc, @PathParam( DatasetVersion updateVersion = ds.getLatestVersion(); AbstractSubmitToArchiveCommand archiveCommand = ArchiverUtil.createSubmitToArchiveCommand(className, createDataverseRequest(user), updateVersion); if (archiveCommand != null) { - // Delete the record of any existing copy since it is now out of date/incorrect - updateVersion.setArchivalCopyLocation(null); - /* - * Then try to generate and submit an archival copy. Note that running this - * command within the CuratePublishedDatasetVersionCommand was causing an error: - * "The attribute [id] of class - * [edu.harvard.iq.dataverse.DatasetFieldCompoundValue] is mapped to a primary - * key column in the database. Updates are not allowed." To avoid that, and to - * simplify reporting back to the GUI whether this optional step succeeded, I've - * pulled this out as a separate submit(). 
- */ - try { - updateVersion = commandEngine.submit(archiveCommand); - if (!updateVersion.getArchivalCopyLocationStatus().equals(DatasetVersion.ARCHIVAL_STATUS_FAILURE)) { - successMsg = BundleUtil.getStringFromBundle("datasetversion.update.archive.success"); - } else { - successMsg = BundleUtil.getStringFromBundle("datasetversion.update.archive.failure"); + String status = updateVersion.getArchivalCopyLocationStatus(); + if ((status == null) || status.equals(DatasetVersion.ARCHIVAL_STATUS_FAILURE)) { + // Delete the record of any existing copy since it is now out of + // date/incorrect + JsonObjectBuilder job = Json.createObjectBuilder(); + job.add(DatasetVersion.ARCHIVAL_STATUS, DatasetVersion.ARCHIVAL_STATUS_PENDING); + updateVersion.setArchivalCopyLocation(JsonUtil.prettyPrint(job.build())); + datasetVersionSvc.persistArchivalCopyLocation(updateVersion); + /* + * Then try to generate and submit an archival copy. Note that running this + * command within the CuratePublishedDatasetVersionCommand was causing an error: + * "The attribute [id] of class + * [edu.harvard.iq.dataverse.DatasetFieldCompoundValue] is mapped to a primary + * key column in the database. Updates are not allowed." To avoid that, and to + * simplify reporting back to the GUI whether this optional step succeeded, I've + * pulled this out as a separate submit(). 
+ */ + try { + commandEngine.submitAsync(archiveCommand); + successMsg = BundleUtil.getStringFromBundle("datasetversion.archive.inprogress"); + } catch (CommandException ex) { + successMsg = BundleUtil.getStringFromBundle("datasetversion.update.archive.failure") + + " - " + ex.toString(); + logger.severe(ex.getMessage()); } - } catch (CommandException ex) { - successMsg = BundleUtil.getStringFromBundle("datasetversion.update.archive.failure") + " - " + ex.toString(); - logger.severe(ex.getMessage()); + } else if (status.equals(DatasetVersion.ARCHIVAL_STATUS_SUCCESS)) { + // Not automatically replacing the old archival copy as creating it is expensive + updateVersion.setArchivalStatusOnly(DatasetVersion.ARCHIVAL_STATUS_OBSOLETE); + datasetVersionSvc.persistArchivalCopyLocation(updateVersion); } } } catch (CommandException ex) { @@ -1388,17 +1396,25 @@ public Response publishMigratedDataset(@Context ContainerRequestContext crc, Str */ String errorMsg = null; Optional prePubWf = wfService.getDefaultWorkflow(TriggerType.PrePublishDataset); - + DataverseRequest dataverseRequest = createDataverseRequest(user); try { // ToDo - should this be in onSuccess()? 
May relate to todo above if (prePubWf.isPresent()) { + // Create the workflow lock BEFORE starting the workflow + DatasetLock workflowLock = new DatasetLock(DatasetLock.Reason.Workflow, user); + workflowLock.setDataset(ds); + datasetSvc.addDatasetLock(ds, workflowLock); + + // Build context with the lock attached + WorkflowContext context = new WorkflowContext(dataverseRequest, ds, TriggerType.PrePublishDataset, !contactPIDProvider); + context.setLockId(ds.getLockFor(DatasetLock.Reason.Workflow).getId()); // Start the workflow, the workflow will call FinalizeDatasetPublication later wfService.start(prePubWf.get(), - new WorkflowContext(createDataverseRequest(user), ds, TriggerType.PrePublishDataset, !contactPIDProvider), + context, false); } else { FinalizeDatasetPublicationCommand cmd = new FinalizeDatasetPublicationCommand(ds, - createDataverseRequest(user), !contactPIDProvider); + dataverseRequest, !contactPIDProvider); ds = commandEngine.submit(cmd); } } catch (CommandException ex) { diff --git a/src/main/java/edu/harvard/iq/dataverse/dataset/DatasetUtil.java b/src/main/java/edu/harvard/iq/dataverse/dataset/DatasetUtil.java index 2ce5471a523..79451a61a84 100644 --- a/src/main/java/edu/harvard/iq/dataverse/dataset/DatasetUtil.java +++ b/src/main/java/edu/harvard/iq/dataverse/dataset/DatasetUtil.java @@ -740,4 +740,21 @@ public static String getLocaleCurationStatusLabelFromString(String label) { } return localizedName; } + + // Find the prior version - relies on version sorting by major/minor numbers + public static DatasetVersion getPriorVersion(DatasetVersion version) { + boolean foundCurrent = false; + DatasetVersion priorVersion = null; + for (DatasetVersion versionLoop : version.getDataset().getVersions()) { + if (foundCurrent) { + priorVersion = versionLoop; + break; + } + if (versionLoop.equals(version)) { + foundCurrent = true; + } + + } + return priorVersion; + } } diff
--git a/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/AbstractSubmitToArchiveCommand.java b/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/AbstractSubmitToArchiveCommand.java index 29c27d0396d..137d41e2c97 100644 --- a/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/AbstractSubmitToArchiveCommand.java +++ b/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/AbstractSubmitToArchiveCommand.java @@ -2,8 +2,9 @@ import edu.harvard.iq.dataverse.DataCitation; import edu.harvard.iq.dataverse.Dataset; +import edu.harvard.iq.dataverse.DatasetFieldConstant; +import edu.harvard.iq.dataverse.DatasetLock.Reason; import edu.harvard.iq.dataverse.DatasetVersion; -import edu.harvard.iq.dataverse.DvObject; import edu.harvard.iq.dataverse.SettingsWrapper; import edu.harvard.iq.dataverse.authorization.Permission; import edu.harvard.iq.dataverse.authorization.users.ApiToken; @@ -15,23 +16,34 @@ import edu.harvard.iq.dataverse.engine.command.exception.CommandException; import edu.harvard.iq.dataverse.pidproviders.doi.datacite.DOIDataCiteRegisterService; import edu.harvard.iq.dataverse.settings.SettingsServiceBean; +import edu.harvard.iq.dataverse.settings.SettingsServiceBean.Key; +import edu.harvard.iq.dataverse.util.ListSplitUtil; import edu.harvard.iq.dataverse.util.bagit.BagGenerator; import edu.harvard.iq.dataverse.util.bagit.OREMap; +import edu.harvard.iq.dataverse.workflow.step.Failure; +import edu.harvard.iq.dataverse.util.json.JsonLDTerm; import edu.harvard.iq.dataverse.workflow.step.WorkflowStepResult; +import jakarta.ejb.TransactionAttribute; +import jakarta.ejb.TransactionAttributeType; +import jakarta.json.JsonObject; +import jakarta.json.Json; +import jakarta.json.JsonObjectBuilder; import java.io.IOException; import java.io.PipedInputStream; import java.io.PipedOutputStream; import java.security.DigestInputStream; import java.util.HashMap; +import java.util.List; import java.util.Map; import java.util.logging.Logger; 
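A note on the `DatasetUtil.getPriorVersion` helper added in the previous hunk: it relies on the dataset's version list being sorted newest-first by major/minor number, so the prior version is simply the element after the current one. A minimal stand-alone sketch of that lookup over plain strings (class and parameter names are hypothetical):

```java
import java.util.List;

// Sketch of the prior-version lookup: assumes the list is ordered by
// descending major/minor version number, as the dataset's version list is.
public class PriorVersionSketch {
    public static String getPriorVersion(List<String> versionsNewestFirst, String current) {
        boolean foundCurrent = false;
        for (String v : versionsNewestFirst) {
            if (foundCurrent) {
                return v; // first entry after the current one is the prior version
            }
            if (v.equals(current)) {
                foundCurrent = true;
            }
        }
        return null; // current is the oldest version (or absent from the list)
    }
}
```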
@RequiredPermissions(Permission.PublishDataset) public abstract class AbstractSubmitToArchiveCommand extends AbstractCommand<DatasetVersion> { - private final DatasetVersion version; - private final Map<String, String> requestedSettings = new HashMap<String, String>(); + protected final DatasetVersion version; + protected final Map<String, String> requestedSettings = new HashMap<String, String>(); + protected String spaceName = null; protected boolean success=false; private static final Logger logger = Logger.getLogger(AbstractSubmitToArchiveCommand.class.getName()); private static final int MAX_ZIP_WAIT = 20000; @@ -43,16 +55,24 @@ public AbstractSubmitToArchiveCommand(DataverseRequest aRequest, DatasetVersion } @Override + @TransactionAttribute(TransactionAttributeType.REQUIRED) public DatasetVersion execute(CommandContext ctxt) throws CommandException { - + + // Check for locks while we're still in a transaction + Dataset dataset = version.getDataset(); + if (dataset.getLockFor(Reason.finalizePublication) != null + || dataset.getLockFor(Reason.FileValidationFailed) != null) { + throw new CommandException("Dataset is locked and cannot be archived", this); + } + String settings = ctxt.settings().getValueForKey(SettingsServiceBean.Key.ArchiverSettings); - String[] settingsArray = settings.split(","); - for (String setting : settingsArray) { - setting = setting.trim(); - if (!setting.startsWith(":")) { - logger.warning("Invalid Archiver Setting: " + setting); + List<String> settingsList = ListSplitUtil.split(settings); + for (String settingName : settingsList) { + Key setting = Key.parse(settingName); + if (setting == null) { + logger.warning("Invalid Archiver Setting: " + settingName); } else { - requestedSettings.put(setting, ctxt.settings().get(setting)); + requestedSettings.put(settingName, ctxt.settings().getValueForKey(setting)); } } @@ -62,10 +82,94 @@ public DatasetVersion execute(CommandContext ctxt) throws CommandException { //No un-expired token token = ctxt.authentication().generateApiTokenForUser(user); } -
performArchiveSubmission(version, token, requestedSettings); + if (!preconditionsMet(version, token, requestedSettings)) { + JsonObjectBuilder statusObjectBuilder = Json.createObjectBuilder(); + statusObjectBuilder.add(DatasetVersion.ARCHIVAL_STATUS, DatasetVersion.ARCHIVAL_STATUS_FAILURE); + statusObjectBuilder.add(DatasetVersion.ARCHIVAL_STATUS_MESSAGE, + "Successful archiving of earlier versions is required."); + version.setArchivalCopyLocation(statusObjectBuilder.build().toString()); + } else { + + String dataCiteXml = getDataCiteXml(version); + OREMap oreMap = new OREMap(version, false); + JsonObject ore = oreMap.getOREMap(); + Map terms = getJsonLDTerms(oreMap); + performArchivingAndPersist(ctxt, version, dataCiteXml, ore, terms, token, requestedSettings); + } return ctxt.em().merge(version); } + // While we have a transaction context, get the terms needed to create the baginfo file + public static Map getJsonLDTerms(OREMap oreMap) { + Map terms = new HashMap(); + terms.put(DatasetFieldConstant.datasetContact, oreMap.getContactTerm()); + terms.put(DatasetFieldConstant.datasetContactName, oreMap.getContactNameTerm()); + terms.put(DatasetFieldConstant.datasetContactEmail, oreMap.getContactEmailTerm()); + terms.put(DatasetFieldConstant.description, oreMap.getDescriptionTerm()); + terms.put(DatasetFieldConstant.descriptionText, oreMap.getDescriptionTextTerm()); + + return terms; + } + + /** + * Note that this method may be called from the execute method above OR from a + * workflow in which execute() is never called and therefore in which all + * variables must be sent as method parameters. (Nominally version is set in the + * constructor and could be dropped from the parameter list.) 
+ * @param ctxt + * + * @param version - the DatasetVersion to archive + * @param token - an API Token for the user performing this action + * @param requestedSettings - a map of the names/values for settings required by this archiver (sent because this class is not part of the EJB context (by design) and has no direct access to service beans). + */ + public boolean preconditionsMet(DatasetVersion version, ApiToken token, Map requestedSettings) { + // Check if earlier versions must be archived first + String requireEarlierArchivedValue = requestedSettings.get(SettingsServiceBean.Key.ArchiveOnlyIfEarlierVersionsAreArchived.toString()); + boolean requireEarlierArchived = Boolean.parseBoolean(requireEarlierArchivedValue); + if (requireEarlierArchived) { + + Dataset dataset = version.getDataset(); + List versions = dataset.getVersions(); + + boolean foundCurrent = false; + + // versions are ordered, all versions after the current one have lower + // major/minor version numbers + for (DatasetVersion versionInLoop : versions) { + if (foundCurrent) { + // Once foundCurrent is true, we are looking at prior versions + // Check if this earlier version has been successfully archived + String archivalStatus = versionInLoop.getArchivalCopyLocationStatus(); + if (archivalStatus == null || !archivalStatus.equals(DatasetVersion.ARCHIVAL_STATUS_SUCCESS) +// || !archivalStatus.equals(DatasetVersion.ARCHIVAL_STATUS_OBSOLETE) + ) { + return false; + } + } + if (versionInLoop.equals(version)) { + foundCurrent = true; + } + + } + } + return true; + } + + @TransactionAttribute(TransactionAttributeType.NOT_SUPPORTED) + public WorkflowStepResult performArchivingAndPersist(CommandContext ctxt, DatasetVersion version, String dataCiteXml, JsonObject ore, Map terms, ApiToken token, Map requestedSetttings) { + // This runs OUTSIDE any transaction + BagGenerator.setNumConnections(getNumberOfBagGeneratorThreads()); + WorkflowStepResult wfsr = performArchiveSubmission(version, dataCiteXml, ore, 
terms, token, requestedSettings); + persistResult(ctxt, version); + return wfsr; + } + + @TransactionAttribute(TransactionAttributeType.REQUIRES_NEW) + private void persistResult(CommandContext ctxt, DatasetVersion versionWithStatus) { + // New transaction just for this quick operation + ctxt.datasetVersion().persistArchivalCopyLocation(versionWithStatus); + } + /** * This method is the only one that should be overwritten by other classes. Note * that this method may be called from the execute method above OR from a @@ -74,10 +178,14 @@ public DatasetVersion execute(CommandContext ctxt) throws CommandException { * constructor and could be dropped from the parameter list.) * * @param version - the DatasetVersion to archive + * @param dataCiteXml + * @param ore + * @param terms * @param token - an API Token for the user performing this action * @param requestedSettings - a map of the names/values for settings required by this archiver (sent because this class is not part of the EJB context (by design) and has no direct access to service beans). 
*/ - abstract public WorkflowStepResult performArchiveSubmission(DatasetVersion version, ApiToken token, Map requestedSetttings); + abstract public WorkflowStepResult performArchiveSubmission(DatasetVersion version, String dataCiteXml, JsonObject ore, Map<String, JsonLDTerm> terms, ApiToken token, Map<String, String> requestedSettings); + protected int getNumberOfBagGeneratorThreads() { if (requestedSettings.get(BagGenerator.BAG_GENERATOR_THREADS) != null) { @@ -97,7 +205,7 @@ public String describe() { + version.getFriendlyVersionNumber()+")]"; } - String getDataCiteXml(DatasetVersion dv) { + public String getDataCiteXml(DatasetVersion dv) { DataCitation dc = new DataCitation(dv); Map metadata = dc.getDataCiteMetadata(); return DOIDataCiteRegisterService.getMetadataFromDvObject(dv.getDataset().getGlobalId().asString(), metadata, @@ -105,13 +213,13 @@ String getDataCiteXml(DatasetVersion dv) { public Thread startBagThread(DatasetVersion dv, PipedInputStream in, DigestInputStream digestInputStream2, - String dataciteXml, ApiToken token) throws IOException, InterruptedException { + String dataciteXml, JsonObject ore, Map<String, JsonLDTerm> terms, ApiToken token) throws IOException, InterruptedException { Thread bagThread = new Thread(new Runnable() { public void run() { try (PipedOutputStream out = new PipedOutputStream(in)) { // Generate bag - BagGenerator bagger = new BagGenerator(new OREMap(dv, false), dataciteXml); - bagger.setNumConnections(getNumberOfBagGeneratorThreads()); + BagGenerator.setNumConnections(getNumberOfBagGeneratorThreads()); + BagGenerator bagger = new BagGenerator(ore, dataciteXml, terms); bagger.setAuthenticationKey(token.getTokenString()); bagger.generateBag(out); success = true; @@ -183,4 +291,32 @@ public static boolean isSingleVersion(SettingsWrapper settingsWrapper) { public static boolean isSingleVersion(SettingsServiceBean settingsService) { return false; } + + /** Whether the archiver can delete existing archival files (and thus can retry when the existing files are incomplete/obsolete)
+ * A static version supports calls via reflection while the instance method supports inheritance for use on actual command instances (see DatasetPage for both use cases). + * @return + */ + public static boolean supportsDelete() { + return false; + } + + public boolean canDelete() { + return supportsDelete(); + } + + protected String getDataCiteFileName(String spaceName, DatasetVersion dv) { + return spaceName + "_datacite.v" + dv.getFriendlyVersionNumber(); + } + + protected String getFileName(String spaceName, DatasetVersion dv) { + return spaceName + ".v" + dv.getFriendlyVersionNumber(); + } + + protected String getSpaceName(Dataset dataset) { + if (spaceName == null) { + spaceName = dataset.getGlobalId().asString().replace(':', '-').replace('/', '-').replace('.', '-') + .toLowerCase(); + } + return spaceName; + } } diff --git a/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/DRSSubmitToArchiveCommand.java b/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/DRSSubmitToArchiveCommand.java index 78e8454255b..1a49a68b097 100644 --- a/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/DRSSubmitToArchiveCommand.java +++ b/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/DRSSubmitToArchiveCommand.java @@ -4,13 +4,19 @@ import edu.harvard.iq.dataverse.DatasetVersion; import edu.harvard.iq.dataverse.Dataverse; import edu.harvard.iq.dataverse.SettingsWrapper; +import edu.harvard.iq.dataverse.DatasetLock.Reason; import edu.harvard.iq.dataverse.authorization.Permission; import edu.harvard.iq.dataverse.authorization.users.ApiToken; +import edu.harvard.iq.dataverse.authorization.users.AuthenticatedUser; import edu.harvard.iq.dataverse.branding.BrandingUtil; import edu.harvard.iq.dataverse.engine.command.Command; +import edu.harvard.iq.dataverse.engine.command.CommandContext; import edu.harvard.iq.dataverse.engine.command.DataverseRequest; import edu.harvard.iq.dataverse.engine.command.RequiredPermissions; +import 
edu.harvard.iq.dataverse.engine.command.exception.CommandException; import edu.harvard.iq.dataverse.settings.SettingsServiceBean; +import edu.harvard.iq.dataverse.util.bagit.OREMap; +import edu.harvard.iq.dataverse.util.json.JsonLDTerm; import edu.harvard.iq.dataverse.util.json.JsonUtil; import edu.harvard.iq.dataverse.workflow.step.Failure; import edu.harvard.iq.dataverse.workflow.step.WorkflowStepResult; @@ -34,6 +40,8 @@ import java.util.Set; import java.util.logging.Logger; +import jakarta.ejb.TransactionAttribute; +import jakarta.ejb.TransactionAttributeType; import jakarta.json.Json; import jakarta.json.JsonObject; import jakarta.json.JsonObjectBuilder; @@ -77,13 +85,82 @@ public class DRSSubmitToArchiveCommand extends S3SubmitToArchiveCommand implemen private static final String TRUST_CERT = "trust_cert"; private static final String TIMEOUT = "timeout"; + private String archivableAncestorAlias; + public DRSSubmitToArchiveCommand(DataverseRequest aRequest, DatasetVersion version) { super(aRequest, version); } @Override - public WorkflowStepResult performArchiveSubmission(DatasetVersion dv, ApiToken token, - Map requestedSettings) { + @TransactionAttribute(TransactionAttributeType.REQUIRED) + public DatasetVersion execute(CommandContext ctxt) throws CommandException { + + + // Check for locks while we're still in a transaction + Dataset dataset = version.getDataset(); + if (dataset.getLockFor(Reason.finalizePublication) != null + || dataset.getLockFor(Reason.FileValidationFailed) != null) { + throw new CommandException("Dataset is locked and cannot be archived", this); + } + + String settings = ctxt.settings().getValueForKey(SettingsServiceBean.Key.ArchiverSettings); + String[] settingsArray = settings.split(","); + for (String setting : settingsArray) { + setting = setting.trim(); + if (!setting.startsWith(":")) { + logger.warning("Invalid Archiver Setting: " + setting); + } else { + requestedSettings.put(setting, ctxt.settings().get(setting)); + } + } + + // 
Compute archivable ancestor while we're in a transaction and entities are managed + JsonObject drsConfigObject = null; + try { + drsConfigObject = JsonUtil.getJsonObject(requestedSettings.get(DRS_CONFIG)); + } catch (Exception e) { + logger.warning("Unable to parse " + DRS_CONFIG + " setting as a Json object"); + } + + if (drsConfigObject != null) { + JsonObject adminMetadata = drsConfigObject.getJsonObject(ADMIN_METADATA); + if (adminMetadata != null) { + JsonObject collectionsObj = adminMetadata.getJsonObject(COLLECTIONS); + if (collectionsObj != null) { + Set collections = collectionsObj.keySet(); + Dataverse ancestor = dataset.getOwner(); + // Compute this while entities are still managed + archivableAncestorAlias = getArchivableAncestor(ancestor, collections); + } + } + } + + AuthenticatedUser user = getRequest().getAuthenticatedUser(); + ApiToken token = ctxt.authentication().findApiTokenByUser(user); + if (token == null) { + //No un-expired token + token = ctxt.authentication().generateApiTokenForUser(user); + } + if (!preconditionsMet(version, token, requestedSettings)) { + JsonObjectBuilder statusObjectBuilder = Json.createObjectBuilder(); + statusObjectBuilder.add(DatasetVersion.ARCHIVAL_STATUS, DatasetVersion.ARCHIVAL_STATUS_FAILURE); + statusObjectBuilder.add(DatasetVersion.ARCHIVAL_STATUS_MESSAGE, + "Successful archiving of earlier versions is required."); + version.setArchivalCopyLocation(statusObjectBuilder.build().toString()); + } else { + + String dataCiteXml = getDataCiteXml(version); + OREMap oreMap = new OREMap(version, false); + JsonObject ore = oreMap.getOREMap(); + Map terms = getJsonLDTerms(oreMap); + performArchivingAndPersist(ctxt, version, dataCiteXml, ore, terms, token, requestedSettings); + } + return ctxt.em().merge(version); + } + + @Override + public WorkflowStepResult performArchiveSubmission(DatasetVersion dv, String dataciteXml, JsonObject ore, + Map terms, ApiToken token, Map requestedSettings) { logger.fine("In 
DRSSubmitToArchiveCommand..."); JsonObject drsConfigObject = null; @@ -97,7 +174,7 @@ public WorkflowStepResult performArchiveSubmission(DatasetVersion dv, ApiToken t Set collections = adminMetadata.getJsonObject(COLLECTIONS).keySet(); Dataset dataset = dv.getDataset(); Dataverse ancestor = dataset.getOwner(); - String alias = getArchivableAncestor(ancestor, collections); + String alias = archivableAncestorAlias; // Use the pre-computed alias instead of calling getArchivableAncestor again String spaceName = getSpaceName(dataset); String packageId = getFileName(spaceName, dv); @@ -113,7 +190,7 @@ public WorkflowStepResult performArchiveSubmission(DatasetVersion dv, ApiToken t JsonObject collectionConfig = adminMetadata.getJsonObject(COLLECTIONS).getJsonObject(alias); - WorkflowStepResult s3Result = super.performArchiveSubmission(dv, token, requestedSettings); + WorkflowStepResult s3Result = super.performArchiveSubmission(dv, dataciteXml, ore, terms, token, requestedSettings); JsonObjectBuilder statusObject = Json.createObjectBuilder(); statusObject.add(DatasetVersion.ARCHIVAL_STATUS, DatasetVersion.ARCHIVAL_STATUS_FAILURE); @@ -242,7 +319,7 @@ public WorkflowStepResult performArchiveSubmission(DatasetVersion dv, ApiToken t logger.severe("DRS Ingest Failed for: " + packageId + " - response does not include status and message"); return new Failure( - "DRS Archiver fail in Ingest call \" - response does not include status and message"); + "DRS Archiver fail in Ingest call - response does not include status and message"); } } else { logger.severe("DRS Ingest Failed for: " + packageId + " with status code: " + code); diff --git a/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/DuraCloudSubmitToArchiveCommand.java b/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/DuraCloudSubmitToArchiveCommand.java index fe4a25091d7..57a4a68a44a 100644 --- a/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/DuraCloudSubmitToArchiveCommand.java +++ 
b/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/DuraCloudSubmitToArchiveCommand.java @@ -2,7 +2,6 @@ import edu.harvard.iq.dataverse.Dataset; import edu.harvard.iq.dataverse.DatasetVersion; -import edu.harvard.iq.dataverse.DatasetLock.Reason; import edu.harvard.iq.dataverse.authorization.Permission; import edu.harvard.iq.dataverse.authorization.users.ApiToken; import edu.harvard.iq.dataverse.engine.command.DataverseRequest; @@ -10,13 +9,21 @@ import static edu.harvard.iq.dataverse.settings.SettingsServiceBean.Key.DuraCloudContext; import static edu.harvard.iq.dataverse.settings.SettingsServiceBean.Key.DuraCloudHost; import static edu.harvard.iq.dataverse.settings.SettingsServiceBean.Key.DuraCloudPort; + +import edu.harvard.iq.dataverse.util.bagit.BagGenerator; +import edu.harvard.iq.dataverse.util.json.JsonLDTerm; import edu.harvard.iq.dataverse.workflow.step.Failure; import edu.harvard.iq.dataverse.workflow.step.WorkflowStepResult; +import java.io.File; +import java.io.FileOutputStream; import java.io.IOException; +import java.io.InputStream; import java.io.PipedInputStream; import java.io.PipedOutputStream; import java.nio.charset.StandardCharsets; +import java.nio.file.Files; +import java.nio.file.Path; import java.security.DigestInputStream; import java.security.MessageDigest; import java.security.NoSuchAlgorithmException; @@ -49,8 +56,8 @@ public DuraCloudSubmitToArchiveCommand(DataverseRequest aRequest, DatasetVersion } @Override - public WorkflowStepResult performArchiveSubmission(DatasetVersion dv, ApiToken token, - Map requestedSettings) { + public WorkflowStepResult performArchiveSubmission(DatasetVersion dv, String dataciteXml, jakarta.json.JsonObject ore, + Map terms, ApiToken token, Map requestedSettings) { String port = requestedSettings.get(DURACLOUD_PORT) != null ? 
requestedSettings.get(DURACLOUD_PORT) : DEFAULT_PORT; @@ -64,173 +71,198 @@ public WorkflowStepResult performArchiveSubmission(DatasetVersion dv, ApiToken t // This will make the archivalCopyLocation non-null after a failure which should // stop retries - if (dataset.getLockFor(Reason.finalizePublication) == null - && dataset.getLockFor(Reason.FileValidationFailed) == null) { - // Use Duracloud client classes to login - ContentStoreManager storeManager = new ContentStoreManagerImpl(host, port, dpnContext); - Credential credential = new Credential(System.getProperty("duracloud.username"), - System.getProperty("duracloud.password")); - storeManager.login(credential); + // Use Duracloud client classes to login + ContentStoreManager storeManager = new ContentStoreManagerImpl(host, port, dpnContext); + Credential credential = new Credential(System.getProperty("duracloud.username"), + System.getProperty("duracloud.password")); + storeManager.login(credential); + /* + * Aliases can contain upper case characters which are not allowed in space + * names. Similarly, aliases can contain '_' which isn't allowed in a space + * name. The line below replaces any upper case chars with lowercase and + * replaces any '_' with '.-' . The '-' after the dot assures we don't break the + * rule that + * "The last period in a space name may not immediately be followed by a number". + * (Although we could check, it seems better to just add '.-' all the time. As + * written, the replaceAll will also change any chars not valid in a spaceName to + * '.' which would avoid code breaking if the alias constraints change. That + * said, this line may map more than one alias to the same spaceName, e.g. + * "test" and "Test" aliases both map to the "test" space name. This does not + * break anything but does potentially put bags from more than one collection in + * the same space. 
+ */ + String spaceName = dataset.getOwner().getAlias().toLowerCase().replaceAll("[^a-z0-9-]", ".dcsafe"); + //This archiver doesn't use the standard spaceName, but does use it to generate the file name + String baseFileName = getFileName(getSpaceName(dataset), dv); + + ContentStore store; + // Set a failure status that will be updated if we succeed + JsonObjectBuilder statusObject = Json.createObjectBuilder(); + statusObject.add(DatasetVersion.ARCHIVAL_STATUS, DatasetVersion.ARCHIVAL_STATUS_FAILURE); + statusObject.add(DatasetVersion.ARCHIVAL_STATUS_MESSAGE, "Bag not transferred"); + + Path tempBagFile = null; + + try { /* - * Aliases can contain upper case characters which are not allowed in space - * names. Similarly, aliases can contain '_' which isn't allowed in a space - * name. The line below replaces any upper case chars with lowercase and - * replaces any '_' with '.-' . The '-' after the dot assures we don't break the - * rule that - * "The last period in a aspace may not immediately be followed by a number". - * (Although we could check, it seems better to just add '.-' all the time.As - * written the replaceAll will also change any chars not valid in a spaceName to - * '.' which would avoid code breaking if the alias constraints change. That - * said, this line may map more than one alias to the same spaceName, e.g. - * "test" and "Test" aliases both map to the "test" space name. This does not - * break anything but does potentially put bags from more than one collection in - * the same space. 
+ * If there is a failure in creating a space, it is likely that a prior version + * has not been fully processed (snapshot created, archiving completed and files + * and space deleted - currently manual operations done at the project's + * duracloud website) */ - String spaceName = dataset.getOwner().getAlias().toLowerCase().replaceAll("[^a-z0-9-]", ".dcsafe"); - String baseFileName = dataset.getGlobalId().asString().replace(':', '-').replace('/', '-') - .replace('.', '-').toLowerCase() + "_v" + dv.getFriendlyVersionNumber(); - - ContentStore store; - //Set a failure status that will be updated if we succeed - JsonObjectBuilder statusObject = Json.createObjectBuilder(); - statusObject.add(DatasetVersion.ARCHIVAL_STATUS, DatasetVersion.ARCHIVAL_STATUS_FAILURE); - statusObject.add(DatasetVersion.ARCHIVAL_STATUS_MESSAGE, "Bag not transferred"); - - try { - /* - * If there is a failure in creating a space, it is likely that a prior version - * has not been fully processed (snapshot created, archiving completed and files - * and space deleted - currently manual operations done at the project's - * duracloud website) - */ - store = storeManager.getPrimaryContentStore(); - // Create space to copy archival files to - if (!store.spaceExists(spaceName)) { - store.createSpace(spaceName); - } - String dataciteXml = getDataCiteXml(dv); - - MessageDigest messageDigest = MessageDigest.getInstance("MD5"); - try (PipedInputStream dataciteIn = new PipedInputStream(); - DigestInputStream digestInputStream = new DigestInputStream(dataciteIn, messageDigest)) { - // Add datacite.xml file - - Thread dcThread = new Thread(new Runnable() { - public void run() { - try (PipedOutputStream dataciteOut = new PipedOutputStream(dataciteIn)) { - - dataciteOut.write(dataciteXml.getBytes(StandardCharsets.UTF_8)); - dataciteOut.close(); - success=true; - } catch (Exception e) { - logger.severe("Error creating datacite.xml: " + e.getMessage()); - // TODO Auto-generated catch block - 
e.printStackTrace(); - } + store = storeManager.getPrimaryContentStore(); + // Create space to copy archival files to + if (!store.spaceExists(spaceName)) { + store.createSpace(spaceName); + } + + MessageDigest messageDigest = MessageDigest.getInstance("MD5"); + try (PipedInputStream dataciteIn = new PipedInputStream(); + DigestInputStream digestInputStream = new DigestInputStream(dataciteIn, messageDigest)) { + // Add datacite.xml file + + Thread dcThread = new Thread(new Runnable() { + public void run() { + try (PipedOutputStream dataciteOut = new PipedOutputStream(dataciteIn)) { + + dataciteOut.write(dataciteXml.getBytes(StandardCharsets.UTF_8)); + dataciteOut.close(); + success = true; + } catch (Exception e) { + logger.severe("Error creating datacite.xml: " + e.getMessage()); + // TODO Auto-generated catch block + e.printStackTrace(); } - }); - dcThread.start(); - // Have seen Pipe Closed errors for other archivers when used as a workflow - // without this delay loop - int i = 0; - while (digestInputStream.available() <= 0 && i < 100) { - Thread.sleep(10); - i++; } - String checksum = store.addContent(spaceName, baseFileName + "_datacite.xml", digestInputStream, - -1l, null, null, null); - logger.fine("Content: datacite.xml added with checksum: " + checksum); - dcThread.join(); - String localchecksum = Hex.encodeHexString(digestInputStream.getMessageDigest().digest()); - if (!success || !checksum.equals(localchecksum)) { - logger.severe("Failure on " + baseFileName); - logger.severe(success ? 
checksum + " not equal to " + localchecksum : "failed to transfer to DuraCloud"); + }); + dcThread.start(); + // Have seen Pipe Closed errors for other archivers when used as a workflow + // without this delay loop + int i = 0; + while (digestInputStream.available() <= 0 && i < 100) { + Thread.sleep(10); + i++; + } + String checksum = store.addContent(spaceName, baseFileName + "_datacite.xml", digestInputStream, + -1l, null, null, null); + logger.fine("Content: datacite.xml added with checksum: " + checksum); + dcThread.join(); + String localchecksum = Hex.encodeHexString(digestInputStream.getMessageDigest().digest()); + if (!success || !checksum.equals(localchecksum)) { + logger.severe("Failure on " + baseFileName); + logger.severe(success ? checksum + " not equal to " + localchecksum + : "failed to transfer to DuraCloud"); + try { + store.deleteContent(spaceName, baseFileName + "_datacite.xml"); + } catch (ContentStoreException cse) { + logger.warning(cse.getMessage()); + } + return new Failure("Error in transferring DataCite.xml file to DuraCloud", + "DuraCloud Submission Failure: incomplete metadata transfer"); + } + + // Store BagIt file + success = false; + String fileName = baseFileName + ".zip"; + + // Add BagIt ZIP file + // Although DuraCloud uses SHA-256 internally, its API uses MD5 to verify the + // transfer + Path bagFile = null; + + tempBagFile = Files.createTempFile("dataverse-bag-", ".zip"); + logger.fine("Creating bag in temporary file: " + tempBagFile.toString()); + // Generate bag + BagGenerator bagger = new BagGenerator(ore, dataciteXml, terms); + bagger.setAuthenticationKey(token.getTokenString()); + + // Generate bag to temporary file using the provided ore JsonObject + try (FileOutputStream fos = new FileOutputStream(tempBagFile.toFile())) { + if (!bagger.generateBag(fos)) { + throw new IOException("Bag generation failed"); + } + } + + // Store BagIt file + long bagSize = Files.size(tempBagFile); + logger.fine("Bag created successfully,
size: " + bagSize + " bytes"); + + // Now upload the bag file + messageDigest = MessageDigest.getInstance("MD5"); + try (InputStream is = Files.newInputStream(tempBagFile); + DigestInputStream bagDigestInputStream = new DigestInputStream(is, messageDigest)) { + checksum = store.addContent(spaceName, fileName, bagDigestInputStream, + tempBagFile.toFile().length(), "application/zip", null, null); + localchecksum = Hex.encodeHexString(bagDigestInputStream.getMessageDigest().digest()); + + if (checksum != null && checksum.equals(localchecksum)) { + logger.fine("Content: " + fileName + " added with checksum: " + checksum); + success = true; + } else { + logger.severe("Failure on " + fileName); + logger.severe(checksum + " not equal to " + localchecksum); try { + store.deleteContent(spaceName, fileName); store.deleteContent(spaceName, baseFileName + "_datacite.xml"); } catch (ContentStoreException cse) { logger.warning(cse.getMessage()); } - return new Failure("Error in transferring DataCite.xml file to DuraCloud", - "DuraCloud Submission Failure: incomplete metadata transfer"); - } - - // Store BagIt file - success = false; - String fileName = baseFileName + ".zip"; - - // Add BagIt ZIP file - // Although DuraCloud uses SHA-256 internally, it's API uses MD5 to verify the - // transfer - - messageDigest = MessageDigest.getInstance("MD5"); - try (PipedInputStream in = new PipedInputStream(100000); - DigestInputStream digestInputStream2 = new DigestInputStream(in, messageDigest)) { - Thread bagThread = startBagThread(dv, in, digestInputStream2, dataciteXml, token); - checksum = store.addContent(spaceName, fileName, digestInputStream2, -1l, null, null, null); - bagThread.join(); - if (success) { - logger.fine("Content: " + fileName + " added with checksum: " + checksum); - localchecksum = Hex.encodeHexString(digestInputStream2.getMessageDigest().digest()); - } - if (!success || !checksum.equals(localchecksum)) { - logger.severe("Failure on " + fileName); - logger.severe(success ?
checksum + " not equal to " + localchecksum : "failed to transfer to DuraCloud"); - try { - store.deleteContent(spaceName, fileName); - store.deleteContent(spaceName, baseFileName + "_datacite.xml"); - } catch (ContentStoreException cse) { - logger.warning(cse.getMessage()); - } - return new Failure("Error in transferring Zip file to DuraCloud", - "DuraCloud Submission Failure: incomplete archive transfer"); - } + return new Failure("Error in transferring Zip file to DuraCloud", + "DuraCloud Submission Failure: incomplete archive transfer"); } + } - logger.fine("DuraCloud Submission step: Content Transferred"); + logger.fine("DuraCloud Submission step: Content Transferred"); - // Document the location of dataset archival copy location (actually the URL - // where you can - // view it as an admin) - StringBuffer sb = new StringBuffer("https://"); - sb.append(host); - if (!port.equals("443")) { - sb.append(":" + port); - } - sb.append("/duradmin/spaces/sm/"); - sb.append(store.getStoreId()); - sb.append("/" + spaceName + "/" + fileName); - statusObject.add(DatasetVersion.ARCHIVAL_STATUS, DatasetVersion.ARCHIVAL_STATUS_SUCCESS); - statusObject.add(DatasetVersion.ARCHIVAL_STATUS_MESSAGE, sb.toString()); - - logger.fine("DuraCloud Submission step complete: " + sb.toString()); - } catch (ContentStoreException | IOException e) { - // TODO Auto-generated catch block - logger.warning(e.getMessage()); - e.printStackTrace(); - return new Failure("Error in transferring file to DuraCloud", - "DuraCloud Submission Failure: archive file not transferred"); - } catch (InterruptedException e) { - logger.warning(e.getLocalizedMessage()); - e.printStackTrace(); + // Document the location of dataset archival copy location (actually the URL + // where you can + // view it as an admin) + StringBuffer sb = new StringBuffer("https://"); + sb.append(host); + if (!port.equals(DEFAULT_PORT)) { + sb.append(":" + port); } - } catch (ContentStoreException e) { + 
sb.append("/duradmin/spaces/sm/"); + sb.append(store.getStoreId()); + sb.append("/" + spaceName + "/" + fileName); + statusObject.add(DatasetVersion.ARCHIVAL_STATUS, DatasetVersion.ARCHIVAL_STATUS_SUCCESS); + statusObject.add(DatasetVersion.ARCHIVAL_STATUS_MESSAGE, sb.toString()); + + logger.fine("DuraCloud Submission step complete: " + sb.toString()); + } catch (ContentStoreException | IOException e) { + // TODO Auto-generated catch block logger.warning(e.getMessage()); e.printStackTrace(); - String mesg = "DuraCloud Submission Failure"; - if (!(1 == dv.getVersion()) || !(0 == dv.getMinorVersionNumber())) { - mesg = mesg + ": Prior Version archiving not yet complete?"; - } - return new Failure("Unable to create DuraCloud space with name: " + baseFileName, mesg); - } catch (NoSuchAlgorithmException e) { - logger.severe("MD5 MessageDigest not available!"); + return new Failure("Error in transferring file to DuraCloud", + "DuraCloud Submission Failure: archive file not transferred"); + } catch (InterruptedException e) { + logger.warning(e.getLocalizedMessage()); + e.printStackTrace(); } - finally { - dv.setArchivalCopyLocation(statusObject.build().toString()); + } catch (ContentStoreException e) { + logger.warning(e.getMessage()); + e.printStackTrace(); + String mesg = "DuraCloud Submission Failure"; + if (!(1 == dv.getVersion()) || !(0 == dv.getMinorVersionNumber())) { + mesg = mesg + ": Prior Version archiving not yet complete?"; + } + return new Failure("Unable to create DuraCloud space with name: " + baseFileName, mesg); + } catch (NoSuchAlgorithmException e) { + logger.severe("MD5 MessageDigest not available!"); + } catch (Exception e) { + logger.warning(e.getLocalizedMessage()); + e.printStackTrace(); + return new Failure("Error in transferring file to DuraCloud", + "DuraCloud Submission Failure: internal error"); + } finally { + if (tempBagFile != null) { + try { + Files.deleteIfExists(tempBagFile); + } catch (IOException e) { + logger.warning("Failed to 
delete temporary bag file: " + tempBagFile + " : " + e.getMessage()); + } } - } else { - logger.warning( - "DuraCloud Submision Workflow aborted: Dataset locked for finalizePublication, or because file validation failed"); - return new Failure("Dataset locked"); + dv.setArchivalCopyLocation(statusObject.build().toString()); } return WorkflowStepResult.OK; } else { diff --git a/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/FinalizeDatasetPublicationCommand.java b/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/FinalizeDatasetPublicationCommand.java index 1ef68ae4853..7cc5bb47d97 100644 --- a/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/FinalizeDatasetPublicationCommand.java +++ b/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/FinalizeDatasetPublicationCommand.java @@ -28,6 +28,7 @@ import edu.harvard.iq.dataverse.privateurl.PrivateUrl; import edu.harvard.iq.dataverse.settings.SettingsServiceBean; import edu.harvard.iq.dataverse.util.BundleUtil; +import edu.harvard.iq.dataverse.workflow.WorkflowContext; import edu.harvard.iq.dataverse.workflow.WorkflowContext.TriggerType; import java.awt.datatransfer.StringSelection; @@ -245,14 +246,11 @@ public Dataset execute(CommandContext ctxt) throws CommandException { //Remove any pre-pub workflow lock (not needed as WorkflowServiceBean.workflowComplete() should already have removed it after setting the finalizePublication lock?) ctxt.datasets().removeDatasetLocks(ds, DatasetLock.Reason.Workflow); - //Should this be in onSuccess()? 
ctxt.workflows().getDefaultWorkflow(TriggerType.PostPublishDataset).ifPresent(wf -> { - try { - ctxt.workflows().start(wf, buildContext(ds, TriggerType.PostPublishDataset, datasetExternallyReleased), false); - } catch (CommandException ex) { - ctxt.datasets().removeDatasetLocks(ds, DatasetLock.Reason.Workflow); - logger.log(Level.SEVERE, "Error invoking post-publish workflow: " + ex.getMessage(), ex); - } + // Create the workflow lock BEFORE starting the workflow + DatasetLock workflowLock = new DatasetLock(DatasetLock.Reason.Workflow, (AuthenticatedUser) getRequest().getUser()); + workflowLock.setDataset(ds); + ctxt.datasets().addDatasetLock(ds, workflowLock); }); Dataset readyDataset = ctxt.em().merge(ds); @@ -288,6 +286,22 @@ public boolean onSuccess(CommandContext ctxt, Object r) { } catch (Exception e) { logger.warning("Failure to send dataset published messages for : " + dataset.getId() + " : " + e.getMessage()); } + + final Dataset ds = dataset; + ctxt.workflows().getDefaultWorkflow(TriggerType.PostPublishDataset).ifPresent(wf -> { + // Build context with the lock attached + WorkflowContext context = buildContext(ds, TriggerType.PostPublishDataset, datasetExternallyReleased); + context.setLockId(ds.getLockFor(DatasetLock.Reason.Workflow).getId()); + try { + ctxt.workflows().start(wf, context, false); + } catch (CommandException e) { + logger.log(Level.SEVERE, "Error invoking post-publish workflow: " + e.getMessage(), e); + } + }); + // Metadata export: + ctxt.datasets().reExportDatasetAsync(dataset); + + ctxt.index().asyncIndexDataset(dataset, true); //re-indexing dataverses that have additional subjects if (!dataversesToIndex.isEmpty()){ @@ -303,23 +317,6 @@ public boolean onSuccess(CommandContext ctxt, Object r) { } } - // Metadata export: - - try { - ExportService instance = ExportService.getInstance(); - instance.exportAllFormats(dataset); - dataset = ctxt.datasets().merge(dataset); - } catch (Exception ex) { - // Something went wrong! 
- // Just like with indexing, a failure to export is not a fatal - // condition. We'll just log the error as a warning and keep - // going: - logger.log(Level.WARNING, "Finalization: exception caught while exporting: "+ex.getMessage(), ex); - // ... but it is important to only update the export time stamp if the - // export was indeed successful. - } - ctxt.index().asyncIndexDataset(dataset, true); - return retVal; } diff --git a/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/GoogleCloudSubmitToArchiveCommand.java b/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/GoogleCloudSubmitToArchiveCommand.java index 7dfb9f07e19..43769dbdb49 100644 --- a/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/GoogleCloudSubmitToArchiveCommand.java +++ b/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/GoogleCloudSubmitToArchiveCommand.java @@ -7,7 +7,6 @@ import com.google.cloud.storage.StorageException; import com.google.cloud.storage.StorageOptions; import edu.harvard.iq.dataverse.Dataset; -import edu.harvard.iq.dataverse.DatasetLock.Reason; import edu.harvard.iq.dataverse.DatasetVersion; import edu.harvard.iq.dataverse.authorization.Permission; import edu.harvard.iq.dataverse.authorization.users.ApiToken; @@ -16,18 +15,27 @@ import edu.harvard.iq.dataverse.settings.JvmSettings; import static edu.harvard.iq.dataverse.settings.SettingsServiceBean.Key.GoogleCloudBucket; import static edu.harvard.iq.dataverse.settings.SettingsServiceBean.Key.GoogleCloudProject; +import edu.harvard.iq.dataverse.util.bagit.BagGenerator; +import edu.harvard.iq.dataverse.util.bagit.BagGenerator.FileEntry; +import edu.harvard.iq.dataverse.util.json.JsonLDTerm; import edu.harvard.iq.dataverse.workflow.step.Failure; import edu.harvard.iq.dataverse.workflow.step.WorkflowStepResult; import org.apache.commons.codec.binary.Hex; +import org.apache.commons.compress.parallel.InputStreamSupplier; import jakarta.json.Json; +import jakarta.json.JsonObject; import 
jakarta.json.JsonObjectBuilder; import java.io.File; import java.io.FileInputStream; +import java.io.FileOutputStream; import java.io.IOException; +import java.io.InputStream; import java.io.PipedInputStream; import java.io.PipedOutputStream; import java.nio.charset.StandardCharsets; +import java.nio.file.Files; +import java.nio.file.Path; import java.security.DigestInputStream; import java.security.MessageDigest; import java.util.Map; @@ -44,132 +52,219 @@ public GoogleCloudSubmitToArchiveCommand(DataverseRequest aRequest, DatasetVersi super(aRequest, version); } + public static boolean supportsDelete() { + return true; + } + @Override + public boolean canDelete() { + return supportsDelete(); + } + @Override - public WorkflowStepResult performArchiveSubmission(DatasetVersion dv, ApiToken token, Map requestedSettings) { + public WorkflowStepResult performArchiveSubmission(DatasetVersion dv, String dataciteXml, JsonObject ore, + Map terms, ApiToken token, Map requestedSettings) { logger.fine("In GoogleCloudSubmitToArchiveCommand..."); String bucketName = requestedSettings.get(GOOGLECLOUD_BUCKET); String projectName = requestedSettings.get(GOOGLECLOUD_PROJECT); logger.fine("Project: " + projectName + " Bucket: " + bucketName); if (bucketName != null && projectName != null) { Storage storage; - //Set a failure status that will be updated if we succeed + // Set a failure status that will be updated if we succeed JsonObjectBuilder statusObject = Json.createObjectBuilder(); statusObject.add(DatasetVersion.ARCHIVAL_STATUS, DatasetVersion.ARCHIVAL_STATUS_FAILURE); statusObject.add(DatasetVersion.ARCHIVAL_STATUS_MESSAGE, "Bag not transferred"); - + String cloudKeyFile = JvmSettings.FILES_DIRECTORY.lookup() + File.separator + "googlecloudkey.json"; - + + // Create temporary file for bag + Path tempBagFile = null; + try (FileInputStream cloudKeyStream = new FileInputStream(cloudKeyFile)) { storage = StorageOptions.newBuilder() - 
.setCredentials(ServiceAccountCredentials.fromStream(cloudKeyStream)) - .setProjectId(projectName) - .build() - .getService(); + .setCredentials(ServiceAccountCredentials.fromStream(cloudKeyStream)).setProjectId(projectName) + .build().getService(); Bucket bucket = storage.get(bucketName); Dataset dataset = dv.getDataset(); - if (dataset.getLockFor(Reason.finalizePublication) == null) { - - String spaceName = dataset.getGlobalId().asString().replace(':', '-').replace('/', '-') - .replace('.', '-').toLowerCase(); - - String dataciteXml = getDataCiteXml(dv); - MessageDigest messageDigest = MessageDigest.getInstance("MD5"); - try (PipedInputStream dataciteIn = new PipedInputStream(); - DigestInputStream digestInputStream = new DigestInputStream(dataciteIn, messageDigest)) { - // Add datacite.xml file - - Thread dcThread = new Thread(new Runnable() { - public void run() { - try (PipedOutputStream dataciteOut = new PipedOutputStream(dataciteIn)) { - - dataciteOut.write(dataciteXml.getBytes(StandardCharsets.UTF_8)); - dataciteOut.close(); - success = true; - } catch (Exception e) { - logger.severe("Error creating datacite.xml: " + e.getMessage()); - // TODO Auto-generated catch block - e.printStackTrace(); - // throw new RuntimeException("Error creating datacite.xml: " + e.getMessage()); - } - } - }); - dcThread.start(); - // Have seen Pipe Closed errors for other archivers when used as a workflow - // without this delay loop - int i = 0; - while (digestInputStream.available() <= 0 && i < 100) { - Thread.sleep(10); - i++; - } - Blob dcXml = bucket.create(spaceName + "/datacite.v" + dv.getFriendlyVersionNumber() + ".xml", digestInputStream, "text/xml", Bucket.BlobWriteOption.doesNotExist()); - - dcThread.join(); - String checksum = dcXml.getMd5ToHexString(); - logger.fine("Content: datacite.xml added with checksum: " + checksum); - String localchecksum = Hex.encodeHexString(digestInputStream.getMessageDigest().digest()); - if (!success || !checksum.equals(localchecksum)) 
{ - logger.severe("Failure on " + spaceName); - logger.severe(success ? checksum + " not equal to " + localchecksum : "datacite.xml transfer did not succeed"); - try { - dcXml.delete(Blob.BlobSourceOption.generationMatch()); - } catch (StorageException se) { - logger.warning(se.getMessage()); + + String spaceName = getSpaceName(dataset); + + // Check for and delete existing files for this version + String dataciteFileName = spaceName + "/" + getDataCiteFileName(spaceName, dv) + ".xml"; + String bagFileName = spaceName + "/" + getFileName(spaceName,dv) + ".zip"; + + logger.fine("Checking for existing files in archive..."); + + try { + Blob existingDatacite = bucket.get(dataciteFileName); + if (existingDatacite != null && existingDatacite.exists()) { + logger.fine("Found existing datacite.xml, deleting: " + dataciteFileName); + existingDatacite.delete(); + logger.fine("Deleted existing datacite.xml"); + } + } catch (StorageException se) { + logger.warning("Error checking/deleting existing datacite.xml: " + se.getMessage()); + } + + try { + Blob existingBag = bucket.get(bagFileName); + if (existingBag != null && existingBag.exists()) { + logger.fine("Found existing bag file, deleting: " + bagFileName); + existingBag.delete(); + logger.fine("Deleted existing bag file"); + } + } catch (StorageException se) { + logger.warning("Error checking/deleting existing bag file: " + se.getMessage()); + } + + // Upload datacite.xml + MessageDigest messageDigest = MessageDigest.getInstance("MD5"); + try (PipedInputStream dataciteIn = new PipedInputStream(); + DigestInputStream digestInputStream = new DigestInputStream(dataciteIn, messageDigest)) { + // Add datacite.xml file + + Thread dcThread = new Thread(new Runnable() { + public void run() { + try (PipedOutputStream dataciteOut = new PipedOutputStream(dataciteIn)) { + + dataciteOut.write(dataciteXml.getBytes(StandardCharsets.UTF_8)); + dataciteOut.close(); + success = true; + } catch (Exception e) { + logger.severe("Error 
creating datacite.xml: " + e.getMessage()); + e.printStackTrace(); } - return new Failure("Error in transferring DataCite.xml file to GoogleCloud", - "GoogleCloud Submission Failure: incomplete metadata transfer"); } + }); + dcThread.start(); + // Have seen Pipe Closed errors for other archivers when used as a workflow + // without this delay loop + int i = 0; + while (digestInputStream.available() <= 0 && i < 100) { + Thread.sleep(10); + i++; + } + Blob dcXml = bucket.create(dataciteFileName, digestInputStream, "text/xml", + Bucket.BlobWriteOption.doesNotExist()); - // Store BagIt file - success = false; - String fileName = spaceName + ".v" + dv.getFriendlyVersionNumber() + ".zip"; - - // Add BagIt ZIP file - // Google uses MD5 as one way to verify the - // transfer - messageDigest = MessageDigest.getInstance("MD5"); - try (PipedInputStream in = new PipedInputStream(100000); - DigestInputStream digestInputStream2 = new DigestInputStream(in, messageDigest)) { - Thread bagThread = startBagThread(dv, in, digestInputStream2, dataciteXml, token); - Blob bag = bucket.create(spaceName + "/" + fileName, digestInputStream2, "application/zip", - Bucket.BlobWriteOption.doesNotExist()); - if (bag.getSize() == 0) { - throw new IOException("Empty Bag"); - } - bagThread.join(); - - checksum = bag.getMd5ToHexString(); - logger.fine("Bag: " + fileName + " added with checksum: " + checksum); - localchecksum = Hex.encodeHexString(digestInputStream2.getMessageDigest().digest()); - if (!success || !checksum.equals(localchecksum)) { - logger.severe(success ? 
checksum + " not equal to " + localchecksum - : "bag transfer did not succeed"); - try { - bag.delete(Blob.BlobSourceOption.generationMatch()); - } catch (StorageException se) { - logger.warning(se.getMessage()); - } - return new Failure("Error in transferring Zip file to GoogleCloud", - "GoogleCloud Submission Failure: incomplete archive transfer"); - } + dcThread.join(); + String checksum = dcXml.getMd5ToHexString(); + logger.fine("Content: datacite.xml added with checksum: " + checksum); + String localchecksum = Hex.encodeHexString(digestInputStream.getMessageDigest().digest()); + if (!success || !checksum.equals(localchecksum)) { + logger.severe("Failure on " + spaceName); + logger.severe(success ? checksum + " not equal to " + localchecksum + : "datacite.xml transfer did not succeed"); + try { + dcXml.delete(Blob.BlobSourceOption.generationMatch()); + } catch (StorageException se) { + logger.warning(se.getMessage()); } + return new Failure("Error in transferring DataCite.xml file to GoogleCloud", + "GoogleCloud Submission Failure: incomplete metadata transfer"); + } + } - logger.fine("GoogleCloud Submission step: Content Transferred"); + tempBagFile = Files.createTempFile("dataverse-bag-", ".zip"); + logger.fine("Creating bag in temporary file: " + tempBagFile.toString()); - // Document the location of dataset archival copy location (actually the URL - // where you can view it as an admin) - // Changed to point at bucket where the zip and datacite.xml are visible + BagGenerator bagger = new BagGenerator(ore, dataciteXml, terms); + bagger.setAuthenticationKey(token.getTokenString()); + // Generate bag to temporary file using the provided ore JsonObject + try (FileOutputStream fos = new FileOutputStream(tempBagFile.toFile())) { + if (!bagger.generateBag(fos)) { + throw new IOException("Bag generation failed"); + } + } + + // Store BagIt file + long bagSize = Files.size(tempBagFile); + logger.fine("Bag created successfully, size: " + bagSize + " bytes"); + + if 
(bagSize == 0) { + throw new IOException("Generated bag file is empty"); + } + + // Upload bag file and calculate checksum during upload + messageDigest = MessageDigest.getInstance("MD5"); + String localChecksum; + + try (FileInputStream fis = new FileInputStream(tempBagFile.toFile()); + DigestInputStream dis = new DigestInputStream(fis, messageDigest)) { + + logger.fine("Uploading bag to GoogleCloud: " + bagFileName); + + Blob bag = bucket.create(bagFileName, dis, "application/zip", + Bucket.BlobWriteOption.doesNotExist()); + + if (bag.getSize() == 0) { + throw new IOException("Uploaded bag has zero size"); + } + + // Get checksum after upload completes + localChecksum = Hex.encodeHexString(dis.getMessageDigest().digest()); + String remoteChecksum = bag.getMd5ToHexString(); + + logger.fine("Bag: " + bagFileName + " uploaded"); + logger.fine("Local checksum: " + localChecksum); + logger.fine("Remote checksum: " + remoteChecksum); + + if (!localChecksum.equals(remoteChecksum)) { + logger.severe("Bag checksum mismatch!"); + logger.severe("Local: " + localChecksum + " != Remote: " + remoteChecksum); + try { + bag.delete(Blob.BlobSourceOption.generationMatch()); + } catch (StorageException se) { + logger.warning(se.getMessage()); + } + return new Failure("Error in transferring Zip file to GoogleCloud", + "GoogleCloud Submission Failure: bag checksum mismatch"); + } + } + + logger.fine("GoogleCloud Submission step: Content Transferred Successfully"); - StringBuffer sb = new StringBuffer("https://console.cloud.google.com/storage/browser/"); - sb.append(bucketName + "/" + spaceName); - statusObject.add(DatasetVersion.ARCHIVAL_STATUS, DatasetVersion.ARCHIVAL_STATUS_SUCCESS); - statusObject.add(DatasetVersion.ARCHIVAL_STATUS_MESSAGE, sb.toString()); - + // Now upload any files that were too large for the bag + for (FileEntry entry : bagger.getOversizedFiles()) { + String childPath = entry.getChildPath(entry.getChildTitle()); + String fileKey = spaceName + "/" + childPath; + 
logger.fine("Uploading oversized file to GoogleCloud: " + fileKey); + messageDigest = MessageDigest.getInstance("MD5"); + InputStreamSupplier supplier = bagger.getInputStreamSupplier(entry.getDataUrl()); + try (InputStream is = supplier.get(); + DigestInputStream dis = new DigestInputStream(is, messageDigest)) { + Blob oversizedFileBlob = bucket.create(fileKey, dis, Bucket.BlobWriteOption.doesNotExist()); + if (oversizedFileBlob.getSize() == 0) { + throw new IOException("Uploaded oversized file has zero size: " + fileKey); + } + localChecksum = Hex.encodeHexString(dis.getMessageDigest().digest()); + String remoteChecksum = oversizedFileBlob.getMd5ToHexString(); + logger.fine("Oversized file: " + fileKey + " uploaded"); + logger.fine("Local checksum: " + localChecksum); + logger.fine("Remote checksum: " + remoteChecksum); + if (!localChecksum.equals(remoteChecksum)) { + logger.severe("Oversized file checksum mismatch!"); + logger.severe("Local: " + localChecksum + " != Remote: " + remoteChecksum); + try { + oversizedFileBlob.delete(Blob.BlobSourceOption.generationMatch()); + } catch (StorageException se) { + logger.warning(se.getMessage()); + } + return new Failure("Error in transferring oversized file to GoogleCloud", + "GoogleCloud Submission Failure: oversized file transfer incomplete"); + } + } catch (IOException e) { + logger.warning("Failed to upload oversized file: " + childPath + " : " + e.getMessage()); + return new Failure("Error uploading oversized file to Google Cloud: " + childPath); } - } else { - logger.warning("GoogleCloud Submision Workflow aborted: Dataset locked for pidRegister"); - return new Failure("Dataset locked"); } + + // Document the location of dataset archival copy location (actually the URL + // to the bucket). 
+ statusObject.add(DatasetVersion.ARCHIVAL_STATUS, DatasetVersion.ARCHIVAL_STATUS_SUCCESS); + statusObject.add(DatasetVersion.ARCHIVAL_STATUS_MESSAGE, + String.format("https://storage.cloud.google.com/%s/%s", bucketName, spaceName)); + } catch (Exception e) { logger.warning(e.getLocalizedMessage()); e.printStackTrace(); @@ -177,11 +272,19 @@ public void run() { e.getLocalizedMessage() + ": check log for details"); } finally { + if (tempBagFile != null) { + try { + Files.deleteIfExists(tempBagFile); + } catch (IOException e) { + logger.warning("Failed to delete temporary bag file: " + tempBagFile + " : " + e.getMessage()); + } + } dv.setArchivalCopyLocation(statusObject.build().toString()); } return WorkflowStepResult.OK; } else { - return new Failure("GoogleCloud Submission not configured - no \":GoogleCloudBucket\" and/or \":GoogleCloudProject\"."); + return new Failure( + "GoogleCloud Submission not configured - no \":GoogleCloudBucket\" and/or \":GoogleCloudProject\"."); } } diff --git a/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/LocalSubmitToArchiveCommand.java b/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/LocalSubmitToArchiveCommand.java index 462879f2ec9..a594ac02cfb 100644 --- a/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/LocalSubmitToArchiveCommand.java +++ b/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/LocalSubmitToArchiveCommand.java @@ -2,7 +2,6 @@ import edu.harvard.iq.dataverse.Dataset; import edu.harvard.iq.dataverse.DatasetVersion; -import edu.harvard.iq.dataverse.DatasetLock.Reason; import edu.harvard.iq.dataverse.authorization.Permission; import edu.harvard.iq.dataverse.authorization.users.ApiToken; import edu.harvard.iq.dataverse.engine.command.Command; @@ -10,7 +9,8 @@ import edu.harvard.iq.dataverse.engine.command.RequiredPermissions; import static edu.harvard.iq.dataverse.settings.SettingsServiceBean.Key.BagItLocalPath; import edu.harvard.iq.dataverse.util.bagit.BagGenerator; -import 
edu.harvard.iq.dataverse.util.bagit.OREMap; +import edu.harvard.iq.dataverse.util.bagit.BagGenerator.FileEntry; +import edu.harvard.iq.dataverse.util.json.JsonLDTerm; import edu.harvard.iq.dataverse.workflow.step.Failure; import edu.harvard.iq.dataverse.workflow.step.WorkflowStepResult; @@ -19,10 +19,12 @@ import java.util.logging.Logger; import jakarta.json.Json; +import jakarta.json.JsonObject; import jakarta.json.JsonObjectBuilder; import java.io.File; import java.io.FileOutputStream; +import java.io.InputStream; import org.apache.commons.io.FileUtils; @@ -34,15 +36,23 @@ public class LocalSubmitToArchiveCommand extends AbstractSubmitToArchiveCommand public LocalSubmitToArchiveCommand(DataverseRequest aRequest, DatasetVersion version) { super(aRequest, version); } + + public static boolean supportsDelete() { + return true; + } + @Override + public boolean canDelete() { + return supportsDelete(); + } @Override - public WorkflowStepResult performArchiveSubmission(DatasetVersion dv, ApiToken token, - Map requestedSettings) { - logger.fine("In LocalCloudSubmitToArchive..."); + public WorkflowStepResult performArchiveSubmission(DatasetVersion dv, String dataciteXml, JsonObject ore, + Map terms, ApiToken token, Map requestedSettings) { + logger.fine("In LocalSubmitToArchive..."); String localPath = requestedSettings.get(BagItLocalPath.toString()); String zipName = null; - //Set a failure status that will be updated if we succeed + // Set a failure status that will be updated if we succeed JsonObjectBuilder statusObject = Json.createObjectBuilder(); statusObject.add(DatasetVersion.ARCHIVAL_STATUS, DatasetVersion.ARCHIVAL_STATUS_FAILURE); statusObject.add(DatasetVersion.ARCHIVAL_STATUS_MESSAGE, "Bag not transferred"); @@ -51,42 +61,87 @@ public WorkflowStepResult performArchiveSubmission(DatasetVersion dv, ApiToken t Dataset dataset = dv.getDataset(); - if (dataset.getLockFor(Reason.finalizePublication) == null - && dataset.getLockFor(Reason.FileValidationFailed) == 
null) { - - String spaceName = dataset.getGlobalId().asString().replace(':', '-').replace('/', '-') - .replace('.', '-').toLowerCase(); - - String dataciteXml = getDataCiteXml(dv); - - FileUtils.writeStringToFile( - new File(localPath + "/" + spaceName + "-datacite.v" + dv.getFriendlyVersionNumber() + ".xml"), - dataciteXml, StandardCharsets.UTF_8); - BagGenerator bagger = new BagGenerator(new OREMap(dv, false), dataciteXml); - bagger.setNumConnections(getNumberOfBagGeneratorThreads()); - bagger.setAuthenticationKey(token.getTokenString()); - zipName = localPath + "/" + spaceName + "v" + dv.getFriendlyVersionNumber() + ".zip"; - //ToDo: generateBag(File f, true) seems to do the same thing (with a .tmp extension) - since we don't have to use a stream here, could probably just reuse the existing code? - bagger.generateBag(new FileOutputStream(zipName + ".partial")); - - File srcFile = new File(zipName + ".partial"); - File destFile = new File(zipName); - - if (srcFile.renameTo(destFile)) { - logger.fine("Localhost Submission step: Content Transferred"); - statusObject.add(DatasetVersion.ARCHIVAL_STATUS, DatasetVersion.ARCHIVAL_STATUS_SUCCESS); - statusObject.add(DatasetVersion.ARCHIVAL_STATUS_MESSAGE, "file://" + zipName); + String spaceName = getSpaceName(dataset); + + // Define file paths + String dataciteFileName = localPath + "/" + getDataCiteFileName(spaceName, dv) + ".xml"; + zipName = localPath + "/" + getFileName(spaceName, dv) + ".zip"; + + // Check for and delete existing files for this version + logger.fine("Checking for existing files in archive..."); + + File existingDatacite = new File(dataciteFileName); + if (existingDatacite.exists()) { + logger.fine("Found existing datacite.xml, deleting: " + dataciteFileName); + if (existingDatacite.delete()) { + logger.fine("Deleted existing datacite.xml"); + } else { + logger.warning("Failed to delete existing datacite.xml: " + dataciteFileName); + } + } + + File existingBag = new File(zipName); + if 
(existingBag.exists()) { + logger.fine("Found existing bag file, deleting: " + zipName); + if (existingBag.delete()) { + logger.fine("Deleted existing bag file"); + } else { + logger.warning("Failed to delete existing bag file: " + zipName); + } + } + + // Also check for and delete the .partial file if it exists + File existingPartial = new File(zipName + ".partial"); + if (existingPartial.exists()) { + logger.fine("Found existing partial bag file, deleting: " + zipName + ".partial"); + if (existingPartial.delete()) { + logger.fine("Deleted existing partial bag file"); } else { - logger.warning("Unable to move " + zipName + ".partial to " + zipName); + logger.warning("Failed to delete existing partial bag file: " + zipName + ".partial"); + } + } + + // Write datacite.xml file + FileUtils.writeStringToFile(new File(dataciteFileName), dataciteXml, StandardCharsets.UTF_8); + logger.fine("Datacite XML written to: " + dataciteFileName); + + // Generate bag + BagGenerator bagger = new BagGenerator(ore, dataciteXml, terms); + bagger.setAuthenticationKey(token.getTokenString()); + + boolean bagSuccess = bagger.generateBag(new FileOutputStream(zipName + ".partial")); + + if (!bagSuccess) { + logger.severe("Bag generation failed for " + zipName); + return new Failure("Local Submission Failure", "Bag generation failed"); + } + // Now download any files that were too large for the bag + for (FileEntry entry : bagger.getOversizedFiles()) { + String childPath = entry.getChildPath(entry.getChildTitle()); + File destFile = new File(localPath, + spaceName + "v" + dv.getFriendlyVersionNumber() + "/" + childPath); + logger.fine("Downloading oversized file to " + destFile.getAbsolutePath()); + destFile.getParentFile().mkdirs(); + try (InputStream is = bagger.getInputStreamSupplier(entry.getDataUrl()).get()) { + FileUtils.copyInputStreamToFile(is, destFile); } + } + + File srcFile = new File(zipName + ".partial"); + File destFile = new File(zipName); + + if 
(srcFile.renameTo(destFile)) { + logger.fine("Localhost Submission step: Content Transferred to " + zipName); + statusObject.add(DatasetVersion.ARCHIVAL_STATUS, DatasetVersion.ARCHIVAL_STATUS_SUCCESS); + statusObject.add(DatasetVersion.ARCHIVAL_STATUS_MESSAGE, "file://" + zipName); } else { - logger.warning( - "Localhost Submision Workflow aborted: Dataset locked for finalizePublication, or because file validation failed"); - return new Failure("Dataset locked"); + logger.severe("Unable to move " + zipName + ".partial to " + zipName); + return new Failure("Local Submission Failure", "Unable to rename partial file to final file"); } } catch (Exception e) { logger.warning("Failed to archive " + zipName + " : " + e.getLocalizedMessage()); e.printStackTrace(); + return new Failure("Local Submission Failure", e.getLocalizedMessage() + ": check log for details"); } finally { dv.setArchivalCopyLocation(statusObject.build().toString()); } diff --git a/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/PublishDatasetCommand.java b/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/PublishDatasetCommand.java index 915ef6ea2a1..8282aa076ca 100644 --- a/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/PublishDatasetCommand.java +++ b/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/PublishDatasetCommand.java @@ -12,11 +12,13 @@ import edu.harvard.iq.dataverse.engine.command.exception.IllegalCommandException; import edu.harvard.iq.dataverse.util.BundleUtil; import edu.harvard.iq.dataverse.workflow.Workflow; +import edu.harvard.iq.dataverse.workflow.WorkflowContext; import edu.harvard.iq.dataverse.workflow.WorkflowContext.TriggerType; import jakarta.persistence.OptimisticLockException; import java.util.Optional; +import java.util.logging.Level; import java.util.logging.Logger; import static java.util.stream.Collectors.joining; import static edu.harvard.iq.dataverse.engine.command.impl.PublishDatasetResult.Status; @@ -106,20 +108,21 @@ public 
PublishDatasetResult execute(CommandContext ctxt) throws CommandException } } - //ToDo - should this be in onSuccess()? May relate to todo above Optional prePubWf = ctxt.workflows().getDefaultWorkflow(TriggerType.PrePublishDataset); - if ( prePubWf.isPresent() ) { + if (prePubWf.isPresent()) { // We start a workflow try { + // Create the workflow lock BEFORE starting the workflow + DatasetLock workflowLock = new DatasetLock(DatasetLock.Reason.Workflow, (AuthenticatedUser) getRequest().getUser()); + workflowLock.setDataset(theDataset); + ctxt.datasets().addDatasetLock(theDataset, workflowLock); theDataset = ctxt.em().merge(theDataset); ctxt.em().flush(); - ctxt.workflows().start(prePubWf.get(), - buildContext(theDataset, TriggerType.PrePublishDataset, datasetExternallyReleased), true); + return new PublishDatasetResult(theDataset, Status.Workflow); } catch (OptimisticLockException e) { throw new CommandException(e.getMessage(), e, this); } - } else{ // We will skip trying to register the global identifiers for datafiles // if "dependent" file-level identifiers are requested, AND the naming @@ -131,7 +134,7 @@ public PublishDatasetResult execute(CommandContext ctxt) throws CommandException // than the configured limit number of files, then call Finalize // asychronously (default is 10) // ... - // Additionaly in 4.9.3 we have added a system variable to disable + // Additionally in 4.9.3 we have added a system variable to disable // registering file PIDs on the installation level. 
boolean registerGlobalIdsForFiles = ctxt.systemConfig().isFilePIDsEnabledForCollection(getDataset().getOwner()) && @@ -257,10 +260,23 @@ public boolean onSuccess(CommandContext ctxt, Object r) { dataset = ((PublishDatasetResult) r).getDataset(); } + final Dataset ds = dataset; + if (dataset != null) { + Optional prePubWf = ctxt.workflows().getDefaultWorkflow(TriggerType.PrePublishDataset); - //A pre-publication workflow will call FinalizeDatasetPublicationCommand itself when it completes - if (! prePubWf.isPresent() ) { + // A pre-publication workflow will call FinalizeDatasetPublicationCommand itself when it completes + if (prePubWf.isPresent()) { + WorkflowContext context = buildContext(ds, TriggerType.PrePublishDataset, datasetExternallyReleased); + context.setLockId(ds.getLockFor(DatasetLock.Reason.Workflow).getId()); + try { + ctxt.workflows().start(prePubWf.get(), context, true); + } catch (CommandException e) { + logger.log(Level.SEVERE, "Error invoking pre-publish workflow: " + e.getMessage(), e); + return false; + } + } + else { logger.fine("From onSuccess, calling FinalizeDatasetPublicationCommand for dataset " + dataset.getGlobalId().asString()); ctxt.datasets().callFinalizePublishCommandAsynchronously(dataset.getId(), ctxt, request, datasetExternallyReleased); } diff --git a/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/S3SubmitToArchiveCommand.java b/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/S3SubmitToArchiveCommand.java index 65531d775c8..17be53a458f 100644 --- a/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/S3SubmitToArchiveCommand.java +++ b/src/main/java/edu/harvard/iq/dataverse/engine/command/impl/S3SubmitToArchiveCommand.java @@ -2,24 +2,28 @@ import edu.harvard.iq.dataverse.Dataset; import edu.harvard.iq.dataverse.DatasetVersion; -import edu.harvard.iq.dataverse.DatasetLock.Reason; import edu.harvard.iq.dataverse.authorization.Permission; import edu.harvard.iq.dataverse.authorization.users.ApiToken; 
import edu.harvard.iq.dataverse.engine.command.DataverseRequest; import edu.harvard.iq.dataverse.engine.command.RequiredPermissions; import static edu.harvard.iq.dataverse.settings.SettingsServiceBean.Key.S3ArchiverConfig; import edu.harvard.iq.dataverse.util.bagit.BagGenerator; -import edu.harvard.iq.dataverse.util.bagit.OREMap; +import edu.harvard.iq.dataverse.util.bagit.BagGenerator.FileEntry; import edu.harvard.iq.dataverse.util.json.JsonUtil; +import edu.harvard.iq.dataverse.util.json.JsonLDTerm; import edu.harvard.iq.dataverse.workflow.step.Failure; import edu.harvard.iq.dataverse.workflow.step.WorkflowStepResult; -import java.io.ByteArrayInputStream; import java.io.File; -import java.io.FileInputStream; +import java.io.IOException; +import java.io.InputStream; import java.nio.charset.StandardCharsets; +import java.util.List; import java.util.Map; import java.util.concurrent.CompletableFuture; +import java.util.concurrent.ExecutorService; +import java.util.concurrent.Executors; +import java.util.logging.Level; import java.util.logging.Logger; import jakarta.annotation.Resource; @@ -28,6 +32,7 @@ import jakarta.json.JsonObject; import jakarta.json.JsonObjectBuilder; +import org.apache.commons.compress.parallel.InputStreamSupplier; import org.eclipse.microprofile.config.Config; import org.eclipse.microprofile.config.ConfigProvider; @@ -38,25 +43,25 @@ import software.amazon.awssdk.auth.credentials.ProfileCredentialsProvider; import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider; import software.amazon.awssdk.core.async.AsyncRequestBody; -import software.amazon.awssdk.core.sync.RequestBody; import software.amazon.awssdk.regions.Region; import software.amazon.awssdk.services.s3.S3AsyncClient; import software.amazon.awssdk.services.s3.S3AsyncClientBuilder; -import software.amazon.awssdk.services.s3.S3Client; -import software.amazon.awssdk.services.s3.model.GetObjectAttributesRequest; -import 
software.amazon.awssdk.services.s3.model.GetObjectAttributesResponse; -import software.amazon.awssdk.services.s3.model.ObjectAttributes; +import software.amazon.awssdk.services.s3.model.DeleteObjectRequest; +import software.amazon.awssdk.services.s3.model.DeleteObjectResponse; +import software.amazon.awssdk.services.s3.model.HeadObjectRequest; +import software.amazon.awssdk.services.s3.model.NoSuchKeyException; import software.amazon.awssdk.services.s3.model.PutObjectRequest; import software.amazon.awssdk.services.s3.model.PutObjectResponse; -import software.amazon.awssdk.services.s3.S3ClientBuilder; -import software.amazon.awssdk.services.s3.S3Configuration; import software.amazon.awssdk.http.async.SdkAsyncHttpClient; import software.amazon.awssdk.http.nio.netty.NettyNioAsyncHttpClient; import software.amazon.awssdk.utils.StringUtils; import software.amazon.awssdk.transfer.s3.S3TransferManager; import software.amazon.awssdk.transfer.s3.model.CompletedFileUpload; +import software.amazon.awssdk.transfer.s3.model.CompletedUpload; import software.amazon.awssdk.transfer.s3.model.FileUpload; +import software.amazon.awssdk.transfer.s3.model.Upload; import software.amazon.awssdk.transfer.s3.model.UploadFileRequest; +import software.amazon.awssdk.transfer.s3.model.UploadRequest; @RequiredPermissions(Permission.PublishDataset) public class S3SubmitToArchiveCommand extends AbstractSubmitToArchiveCommand { @@ -70,16 +75,24 @@ public class S3SubmitToArchiveCommand extends AbstractSubmitToArchiveCommand { private static final Config config = ConfigProvider.getConfig(); protected S3AsyncClient s3 = null; private S3TransferManager tm = null; - private String spaceName = null; + protected String bucketName = null; public S3SubmitToArchiveCommand(DataverseRequest aRequest, DatasetVersion version) { super(aRequest, version); } + + public static boolean supportsDelete() { + return true; + } + @Override + public boolean canDelete() { + return supportsDelete(); + } @Override - public 
WorkflowStepResult performArchiveSubmission(DatasetVersion dv, ApiToken token, - Map requestedSettings) { + public WorkflowStepResult performArchiveSubmission(DatasetVersion dv, String dataciteXml, JsonObject ore, + Map terms, ApiToken token, Map requestedSettings) { logger.fine("In S3SubmitToArchiveCommand..."); JsonObject configObject = null; @@ -98,76 +111,163 @@ public WorkflowStepResult performArchiveSubmission(DatasetVersion dv, ApiToken t JsonObjectBuilder statusObject = Json.createObjectBuilder(); statusObject.add(DatasetVersion.ARCHIVAL_STATUS, DatasetVersion.ARCHIVAL_STATUS_FAILURE); statusObject.add(DatasetVersion.ARCHIVAL_STATUS_MESSAGE, "Bag not transferred"); - + ExecutorService executor = Executors.newCachedThreadPool(); + try { Dataset dataset = dv.getDataset(); - if (dataset.getLockFor(Reason.finalizePublication) == null) { + spaceName = getSpaceName(dataset); + + // Define keys for datacite.xml and bag file + String dcKey = spaceName + "/" + getDataCiteFileName(spaceName, dv) + ".xml"; + String bagKey = spaceName + "/" + getFileName(spaceName, dv) + ".zip"; - spaceName = getSpaceName(dataset); - String dataciteXml = getDataCiteXml(dv); - // Add datacite.xml file - String dcKey = spaceName + "/" + getDataCiteFileName(spaceName, dv) + ".xml"; + // Check for and delete existing files for this version + logger.fine("Checking for existing files in archive..."); - PutObjectRequest putRequest = PutObjectRequest.builder() + try { + HeadObjectRequest headDcRequest = HeadObjectRequest.builder() .bucket(bucketName) .key(dcKey) .build(); - CompletableFuture putFuture = s3.putObject(putRequest, - AsyncRequestBody.fromString(dataciteXml, StandardCharsets.UTF_8)); + s3.headObject(headDcRequest).join(); + + // If we get here, the object exists, so delete it + logger.fine("Found existing datacite.xml, deleting: " + dcKey); + DeleteObjectRequest deleteDcRequest = DeleteObjectRequest.builder() + .bucket(bucketName) + .key(dcKey) + .build(); + + CompletableFuture 
deleteDcFuture = s3.deleteObject(deleteDcRequest); + DeleteObjectResponse deleteDcResponse = deleteDcFuture.join(); + + if (deleteDcResponse.sdkHttpResponse().isSuccessful()) { + logger.fine("Deleted existing datacite.xml"); + } else { + logger.warning("Failed to delete existing datacite.xml: " + dcKey); + } + } catch (Exception e) { + if (e.getCause() instanceof NoSuchKeyException) { + logger.fine("No existing datacite.xml found"); + } else { + logger.warning("Error checking/deleting existing datacite.xml: " + e.getMessage()); + } + } + + try { + HeadObjectRequest headBagRequest = HeadObjectRequest.builder() + .bucket(bucketName) + .key(bagKey) + .build(); + + s3.headObject(headBagRequest).join(); + + // If we get here, the object exists, so delete it + logger.fine("Found existing bag file, deleting: " + bagKey); + DeleteObjectRequest deleteBagRequest = DeleteObjectRequest.builder() + .bucket(bucketName) + .key(bagKey) + .build(); - // Wait for the put operation to complete - PutObjectResponse putResponse = putFuture.join(); + CompletableFuture deleteBagFuture = s3.deleteObject(deleteBagRequest); + DeleteObjectResponse deleteBagResponse = deleteBagFuture.join(); - if (!putResponse.sdkHttpResponse().isSuccessful()) { - logger.warning("Could not write datacite xml to S3"); - return new Failure("S3 Archiver failed writing datacite xml file"); + if (deleteBagResponse.sdkHttpResponse().isSuccessful()) { + logger.fine("Deleted existing bag file"); + } else { + logger.warning("Failed to delete existing bag file: " + bagKey); + } + } catch (Exception e) { + if (e.getCause() instanceof NoSuchKeyException) { + logger.fine("No existing bag file found"); + } else { + logger.warning("Error checking/deleting existing bag file: " + e.getMessage()); } + } + + // Add datacite.xml file + PutObjectRequest putRequest = PutObjectRequest.builder() + .bucket(bucketName) + .key(dcKey) + .build(); + + CompletableFuture putFuture = s3.putObject(putRequest, + 
AsyncRequestBody.fromString(dataciteXml, StandardCharsets.UTF_8)); + + // Wait for the put operation to complete + PutObjectResponse putResponse = putFuture.join(); + + if (!putResponse.sdkHttpResponse().isSuccessful()) { + logger.warning("Could not write datacite xml to S3"); + return new Failure("S3 Archiver failed writing datacite xml file"); + } + + // Store BagIt file + String fileName = getFileName(spaceName, dv); + + // Generate bag + BagGenerator bagger = new BagGenerator(ore, dataciteXml, terms); + bagger.setAuthenticationKey(token.getTokenString()); + if (bagger.generateBag(fileName, false)) { + File bagFile = bagger.getBagFile(fileName); + + UploadFileRequest uploadFileRequest = UploadFileRequest.builder() + .putObjectRequest(req -> req.bucket(bucketName).key(bagKey)).source(bagFile.toPath()) + .build(); + + FileUpload fileUpload = tm.uploadFile(uploadFileRequest); - // Store BagIt file - String fileName = getFileName(spaceName, dv); + CompletedFileUpload uploadResult = fileUpload.completionFuture().join(); - String bagKey = spaceName + "/" + fileName + ".zip"; - // Add BagIt ZIP file - // Google uses MD5 as one way to verify the - // transfer + if (uploadResult.response().sdkHttpResponse().isSuccessful()) { + logger.fine("S3 Submission step: Content Transferred"); - // Generate bag - BagGenerator bagger = new BagGenerator(new OREMap(dv, false), dataciteXml); - bagger.setAuthenticationKey(token.getTokenString()); - if (bagger.generateBag(fileName, false)) { - File bagFile = bagger.getBagFile(fileName); + List bigFiles = bagger.getOversizedFiles(); - UploadFileRequest uploadFileRequest = UploadFileRequest.builder() - .putObjectRequest(req -> req.bucket(bucketName).key(bagKey)).source(bagFile.toPath()) - .build(); + for (FileEntry entry : bigFiles) { + String childPath = entry.getChildPath(entry.getChildTitle()); + String fileKey = spaceName + "/" + childPath; + InputStreamSupplier supplier = bagger.getInputStreamSupplier(entry.getDataUrl()); + try 
(InputStream is = supplier.get()) { - FileUpload fileUpload = tm.uploadFile(uploadFileRequest); + PutObjectRequest filePutRequest = PutObjectRequest.builder().bucket(bucketName) + .key(fileKey).build(); - CompletedFileUpload uploadResult = fileUpload.completionFuture().join(); + UploadRequest uploadRequest = UploadRequest.builder().putObjectRequest(filePutRequest) + .requestBody(AsyncRequestBody.fromInputStream(is, entry.getSize(), executor)) + .build(); - if (uploadResult.response().sdkHttpResponse().isSuccessful()) { - logger.fine("S3 Submission step: Content Transferred"); + Upload upload = tm.upload(uploadRequest); + CompletedUpload completedUpload = upload.completionFuture().join(); - statusObject.add(DatasetVersion.ARCHIVAL_STATUS, DatasetVersion.ARCHIVAL_STATUS_SUCCESS); - statusObject.add(DatasetVersion.ARCHIVAL_STATUS_MESSAGE, - String.format("https://%s.s3.amazonaws.com/%s", bucketName, bagKey)); - } else { - logger.severe("Error sending file to S3: " + fileName); - return new Failure("Error in transferring Bag file to S3", - "S3 Submission Failure: incomplete transfer"); + if (completedUpload.response().sdkHttpResponse().isSuccessful()) { + logger.fine("Successfully uploaded oversized file: " + fileKey); + } else { + logger.warning("Failed to upload oversized file: " + fileKey); + return new Failure("Error uploading oversized file to S3: " + fileKey); + } + } catch (IOException e) { + logger.log(Level.WARNING, "Failed to get input stream for oversized file: " + fileKey, + e); + return new Failure("Error getting input stream for oversized file: " + fileKey); + } } + + statusObject.add(DatasetVersion.ARCHIVAL_STATUS, DatasetVersion.ARCHIVAL_STATUS_SUCCESS); + statusObject.add(DatasetVersion.ARCHIVAL_STATUS_MESSAGE, + String.format("https://%s.s3.amazonaws.com/%s", bucketName, bagKey)); } else { - logger.warning("Could not write local Bag file " + fileName); - return new Failure("S3 Archiver fail writing temp local bag"); + logger.severe("Error sending file 
to S3: " + fileName); + return new Failure("Error in transferring Bag file to S3", + "S3 Submission Failure: incomplete transfer"); } - } else { - logger.warning( - "S3 Archiver Submision Workflow aborted: Dataset locked for publication/pidRegister"); - return new Failure("Dataset locked"); + logger.warning("Could not write local Bag file " + fileName); + return new Failure("S3 Archiver fail writing temp local bag"); } + } catch (Exception e) { logger.warning(e.getLocalizedMessage()); e.printStackTrace(); @@ -175,6 +275,7 @@ public WorkflowStepResult performArchiveSubmission(DatasetVersion dv, ApiToken t e.getLocalizedMessage() + ": check log for details"); } finally { + executor.shutdown(); if (tm != null) { tm.close(); } @@ -183,24 +284,8 @@ public WorkflowStepResult performArchiveSubmission(DatasetVersion dv, ApiToken t return WorkflowStepResult.OK; } else { return new Failure( - "S3 Submission not configured - no \":S3ArchivalProfile\" and/or \":S3ArchivalConfig\" or no bucket-name defined in config."); - } - } - - protected String getDataCiteFileName(String spaceName, DatasetVersion dv) { - return spaceName + "_datacite.v" + dv.getFriendlyVersionNumber(); - } - - protected String getFileName(String spaceName, DatasetVersion dv) { - return spaceName + ".v" + dv.getFriendlyVersionNumber(); - } - - protected String getSpaceName(Dataset dataset) { - if (spaceName == null) { - spaceName = dataset.getGlobalId().asString().replace(':', '-').replace('/', '-').replace('.', '-') - .toLowerCase(); + "S3 Submission not configured - no \":S3ArchivalProfile\" and/or \":S3ArchivalConfig\" or no bucket-name defined in config."); } - return spaceName; } private S3AsyncClient createClient(JsonObject configObject) { diff --git a/src/main/java/edu/harvard/iq/dataverse/harvest/server/OAIRecordServiceBean.java b/src/main/java/edu/harvard/iq/dataverse/harvest/server/OAIRecordServiceBean.java index cc15d4c978b..b31268725b0 100644 --- 
a/src/main/java/edu/harvard/iq/dataverse/harvest/server/OAIRecordServiceBean.java +++ b/src/main/java/edu/harvard/iq/dataverse/harvest/server/OAIRecordServiceBean.java @@ -26,6 +26,7 @@ import static jakarta.ejb.TransactionAttributeType.REQUIRES_NEW; import jakarta.inject.Named; import jakarta.persistence.EntityManager; +import jakarta.persistence.OptimisticLockException; import jakarta.persistence.PersistenceContext; import jakarta.persistence.TypedQuery; import jakarta.persistence.TemporalType; @@ -262,7 +263,11 @@ public void exportAllFormatsInNewTransaction(Dataset dataset) throws ExportExcep try { ExportService exportServiceInstance = ExportService.getInstance(); exportServiceInstance.exportAllFormats(dataset); - dataset = datasetService.merge(dataset); + // Use em.merge directly - otherwise the jakarta OptimisticLockException we want to catch will be wrapped + dataset = em.merge(dataset); + em.flush(); + } catch (OptimisticLockException ole) { + datasetService.setLastExportTimeInNewTransaction(dataset.getId(), dataset.getLastExportTime()); } catch (Exception e) { logger.log(Level.FINE, "Caught unknown exception while trying to export", e); throw new ExportException(e.getMessage()); diff --git a/src/main/java/edu/harvard/iq/dataverse/settings/JvmSettings.java b/src/main/java/edu/harvard/iq/dataverse/settings/JvmSettings.java index 05390ba8a8c..cf74fc62337 100644 --- a/src/main/java/edu/harvard/iq/dataverse/settings/JvmSettings.java +++ b/src/main/java/edu/harvard/iq/dataverse/settings/JvmSettings.java @@ -276,6 +276,11 @@ public enum JvmSettings { BAGIT_SOURCE_ORG_NAME(SCOPE_BAGIT_SOURCEORG, "name"), BAGIT_SOURCEORG_ADDRESS(SCOPE_BAGIT_SOURCEORG, "address"), BAGIT_SOURCEORG_EMAIL(SCOPE_BAGIT_SOURCEORG, "email"), + SCOPE_BAGIT_ZIP(SCOPE_BAGIT, "zip"), + BAGIT_ZIP_MAX_FILE_SIZE(SCOPE_BAGIT_ZIP, "max-file-size"), + BAGIT_ZIP_MAX_DATA_SIZE(SCOPE_BAGIT_ZIP, "max-data-size"), + BAGIT_ZIP_HOLEY(SCOPE_BAGIT_ZIP, "holey"), + BAGIT_ARCHIVE_ON_VERSION_UPDATE(SCOPE_BAGIT, "archive-on-version-update"), // 
STORAGE USE SETTINGS SCOPE_STORAGEUSE(PREFIX, "storageuse"), diff --git a/src/main/java/edu/harvard/iq/dataverse/settings/SettingsServiceBean.java b/src/main/java/edu/harvard/iq/dataverse/settings/SettingsServiceBean.java index 98578eed8d7..36306b1df37 100644 --- a/src/main/java/edu/harvard/iq/dataverse/settings/SettingsServiceBean.java +++ b/src/main/java/edu/harvard/iq/dataverse/settings/SettingsServiceBean.java @@ -489,6 +489,12 @@ Whether Harvesting (OAI) service is enabled */ ArchiverClassName, + /* + * Only create an archival Bag for a dataset version if all prior versions have + * been successfully archived + */ + ArchiveOnlyIfEarlierVersionsAreArchived, + /** * Custom settings for each archiver. See list below. */ @@ -806,16 +812,13 @@ public static SettingsServiceBean.Key parse(String key) { // Cut off the ":" we verified is present before String normalizedKey = key.substring(1); - // Iterate through all the known keys and return on match (case sensitive!) // We are case sensitive here because Dataverse implicitely uses case sensitive keys everywhere! - for (SettingsServiceBean.Key k : SettingsServiceBean.Key.values()) { - if (k.name().equals(normalizedKey)) { - return k; - } + try { + return SettingsServiceBean.Key.valueOf(normalizedKey); + } catch (IllegalArgumentException e) { + // Fall through on no match - return null for invalid keys + return null; } - - // Fall through on no match - return null; } } diff --git a/src/main/java/edu/harvard/iq/dataverse/util/ArchiverUtil.java b/src/main/java/edu/harvard/iq/dataverse/util/ArchiverUtil.java index 18ec6243d5a..7d03004f3f7 100644 --- a/src/main/java/edu/harvard/iq/dataverse/util/ArchiverUtil.java +++ b/src/main/java/edu/harvard/iq/dataverse/util/ArchiverUtil.java @@ -71,5 +71,16 @@ public static boolean isSomeVersionArchived(Dataset dataset) { return someVersionArchived; } + + /** + * Checks if a version has been successfully archived. 
+     *
+     * @param version the version to check
+     * @return true if the version has been successfully archived, false otherwise
+     */
+    public static boolean isVersionArchived(DatasetVersion version) {
+        String status = version.getArchivalCopyLocationStatus();
+        return status != null && status.equals(DatasetVersion.ARCHIVAL_STATUS_SUCCESS);
+    }
 }
\ No newline at end of file
diff --git a/src/main/java/edu/harvard/iq/dataverse/util/bagit/BagGenerator.java b/src/main/java/edu/harvard/iq/dataverse/util/bagit/BagGenerator.java
index f24ebdb8655..1459e2989da 100644
--- a/src/main/java/edu/harvard/iq/dataverse/util/bagit/BagGenerator.java
+++ b/src/main/java/edu/harvard/iq/dataverse/util/bagit/BagGenerator.java
@@ -4,12 +4,15 @@
 import java.io.ByteArrayInputStream;
 import java.io.File;
 import java.io.FileOutputStream;
+import java.io.FilterInputStream;
 import java.io.IOException;
 import java.io.InputStream;
 import java.io.InputStreamReader;
+import java.io.InterruptedIOException;
 import java.io.OutputStream;
 import java.io.PrintWriter;
 import java.net.MalformedURLException;
+import java.net.SocketTimeoutException;
 import java.net.URI;
 import java.net.URISyntaxException;
 import java.nio.charset.StandardCharsets;
@@ -20,10 +23,13 @@
 import java.text.SimpleDateFormat;
 import java.util.ArrayList;
 import java.util.Calendar;
+import java.util.Collections;
 import java.util.HashMap;
 import java.util.HashSet;
 import java.util.Iterator;
 import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
 import java.util.Set;
 import java.util.TreeSet;
 import java.util.Map.Entry;
@@ -33,9 +39,10 @@
 import java.util.concurrent.TimeUnit;
 import java.util.logging.Level;
 import java.util.logging.Logger;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
 import java.util.zip.ZipEntry;
 
-import edu.harvard.iq.dataverse.util.BundleUtil;
 import org.apache.commons.codec.digest.DigestUtils;
 import org.apache.commons.compress.archivers.zip.ParallelScatterZipCreator;
 import org.apache.commons.compress.archivers.zip.ScatterZipOutputStream;
@@ -44,25 +51,24 @@
 import org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream;
 import org.apache.commons.compress.archivers.zip.ZipFile;
 import org.apache.commons.compress.parallel.InputStreamSupplier;
-import org.apache.commons.compress.utils.IOUtils;
-import org.apache.commons.text.WordUtils;
-import org.apache.http.client.ClientProtocolException;
-import org.apache.http.client.config.CookieSpecs;
-import org.apache.http.client.config.RequestConfig;
-import org.apache.http.client.methods.CloseableHttpResponse;
-import org.apache.http.client.methods.HttpGet;
-import org.apache.http.config.Registry;
-import org.apache.http.config.RegistryBuilder;
-import org.apache.http.conn.socket.ConnectionSocketFactory;
-import org.apache.http.conn.socket.PlainConnectionSocketFactory;
-import org.apache.http.conn.ssl.NoopHostnameVerifier;
-import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
-import org.apache.http.conn.ssl.TrustSelfSignedStrategy;
-import org.apache.http.ssl.SSLContextBuilder;
-import org.apache.http.util.EntityUtils;
-import org.apache.http.impl.client.CloseableHttpClient;
-import org.apache.http.impl.client.HttpClients;
-import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
+import org.apache.hc.client5.http.ClientProtocolException;
+import org.apache.hc.client5.http.classic.methods.HttpGet;
+import org.apache.hc.client5.http.config.RequestConfig;
+import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
+import org.apache.hc.client5.http.impl.classic.CloseableHttpResponse;
+import org.apache.hc.client5.http.impl.classic.HttpClients;
+import org.apache.hc.client5.http.impl.io.PoolingHttpClientConnectionManager;
+import org.apache.hc.client5.http.protocol.HttpClientContext;
+import org.apache.hc.client5.http.socket.ConnectionSocketFactory;
+import org.apache.hc.client5.http.socket.PlainConnectionSocketFactory;
+import org.apache.hc.client5.http.ssl.NoopHostnameVerifier;
+import org.apache.hc.client5.http.ssl.SSLConnectionSocketFactory;
+import org.apache.hc.client5.http.ssl.TrustSelfSignedStrategy;
+import org.apache.hc.core5.http.HttpEntity;
+import org.apache.hc.core5.http.config.Registry;
+import org.apache.hc.core5.http.config.RegistryBuilder;
+import org.apache.hc.core5.ssl.SSLContextBuilder;
+import org.apache.hc.core5.util.Timeout;
 import org.json.JSONArray;
 import com.google.gson.JsonArray;
 import com.google.gson.JsonElement;
@@ -72,17 +78,33 @@
 import com.google.gson.JsonSyntaxException;
 
 import edu.harvard.iq.dataverse.DataFile;
+import edu.harvard.iq.dataverse.DatasetFieldConstant;
 import edu.harvard.iq.dataverse.DataFile.ChecksumType;
 import edu.harvard.iq.dataverse.pidproviders.PidUtil;
 import edu.harvard.iq.dataverse.settings.JvmSettings;
 import static edu.harvard.iq.dataverse.settings.SettingsServiceBean.Key.BagGeneratorThreads;
-import edu.harvard.iq.dataverse.util.json.JsonLDTerm;
-import java.util.Optional;
+import edu.harvard.iq.dataverse.util.SystemConfig;
+import edu.harvard.iq.dataverse.util.json.JsonLDTerm;
+import jakarta.enterprise.inject.spi.CDI;
+
+/**
+ * Creates an archival zipped Bag for long-term storage. It is intended to
+ * include all the information needed to reconstruct the dataset version in a
+ * new Dataverse instance.
+ *
+ * Note that the Dataverse-Bag-Version written in the generateInfoFile() method
+ * should be updated any time the content/structure of the bag is changed.
+ *
+ */
 public class BagGenerator {
 
     private static final Logger logger = Logger.getLogger(BagGenerator.class.getCanonicalName());
-
+
+    static final String CRLF = "\r\n";
+
+    protected static final int MAX_RETRIES = 5;
+
     private ParallelScatterZipCreator scatterZipCreator = null;
     private ScatterZipOutputStream dirs = null;
@@ -92,10 +114,11 @@ public class BagGenerator {
     private HashMap<String, String> pidMap = new LinkedHashMap<String, String>();
     private HashMap<String, String> checksumMap = new LinkedHashMap<String, String>();
-    private int timeout = 60;
-    private RequestConfig config = RequestConfig.custom().setConnectTimeout(timeout * 1000)
-            .setConnectionRequestTimeout(timeout * 1000).setSocketTimeout(timeout * 1000)
-            .setCookieSpec(CookieSpecs.STANDARD).build();
+    private int timeout = 300;
+    private RequestConfig config = RequestConfig.custom()
+            .setConnectionRequestTimeout(Timeout.ofSeconds(timeout))
+            .setResponseTimeout(Timeout.ofSeconds(timeout))
+            .build();
     protected CloseableHttpClient client;
     private PoolingHttpClientConnectionManager cm = null;
@@ -120,13 +143,43 @@ public class BagGenerator {
     private boolean usetemp = false;
 
-    private int numConnections = 8;
-    public static final String BAG_GENERATOR_THREADS = BagGeneratorThreads.toString();
+    private Map<String, JsonLDTerm> terms;
 
-    private OREMap oremap;
+    private static int numConnections = 2;
+    public static final String BAG_GENERATOR_THREADS = BagGeneratorThreads.toString();
 
     static PrintWriter pw = null;
+
+    // Size limits and holey Bags
+    private long maxDataFileSize = Long.MAX_VALUE;
+    private long maxTotalDataSize = Long.MAX_VALUE;
+    private long currentBagDataSize = 0;
+    private StringBuilder fetchFileContent = new StringBuilder();
+    private boolean usingFetchFile = false;
+    private boolean createHoleyBag = false;
+    private List<FileEntry> oversizedFiles = new ArrayList<>();
+
+    // Bag-info.txt field labels
+    private static final String CONTACT_NAME = "Contact-Name: ";
+    private static final String CONTACT_EMAIL = "Contact-Email: ";
+    private static final String SOURCE_ORGANIZATION = "Source-Organization: ";
+    private static final String ORGANIZATION_ADDRESS = "Organization-Address: ";
+    private static final String ORGANIZATION_EMAIL = "Organization-Email: ";
+    private static final String EXTERNAL_DESCRIPTION = "External-Description: ";
+    private static final String BAGGING_DATE = "Bagging-Date: ";
+    private static final String EXTERNAL_IDENTIFIER = "External-Identifier: ";
+    private static final String BAG_SIZE = "Bag-Size: ";
+    private static final String PAYLOAD_OXUM = "Payload-Oxum: ";
+    private static final String INTERNAL_SENDER_IDENTIFIER = "Internal-Sender-Identifier: ";
+
+    /** THIS NUMBER SHOULD CHANGE ANY TIME THE BAG CONTENTS ARE CHANGED */
+    private static final String DATAVERSE_BAG_VERSION = "Dataverse-Bag-Version: 1.0";
+
+    // Implement exponential backoff with jitter
+    static final long baseWaitTimeMs = 1000; // Start with 1 second
+    static final long maxWaitTimeMs = 30000; // Cap at 30 seconds
+
     /**
      * This BagGenerator creates a BagIt version 1.0
      * (https://tools.ietf.org/html/draft-kunze-bagit-16) compliant bag that is also
@@ -139,19 +192,27 @@ public class BagGenerator {
      * and zipping are done in parallel, using a connection pool. The required space
      * on disk is ~ n+1/n of the final bag size, e.g. 125% of the bag size for a
      * 4-way parallel zip operation.
-     * @throws Exception
-     * @throws JsonSyntaxException
+     * @param oremapObject - OAI-ORE Map file as a JSON object
+     * @param dataciteXml - DataCite XML file as a string
+     * @param terms - Map of schema.org/terms to their corresponding JsonLDTerm objects
+     *
+     * @throws Exception
+     * @throws JsonSyntaxException
      */
-    public BagGenerator(OREMap oreMap, String dataciteXml) throws JsonSyntaxException, Exception {
-        this.oremap = oreMap;
-        this.oremapObject = oreMap.getOREMap();
-        //(JsonObject) new JsonParser().parse(oreMap.getOREMap().toString());
+    public BagGenerator(jakarta.json.JsonObject oremapObject, String dataciteXml, Map<String, JsonLDTerm> terms) throws JsonSyntaxException, Exception {
+        this.oremapObject = oremapObject;
         this.dataciteXml = dataciteXml;
+        this.terms = terms;
         try {
-            // Using Dataverse, all the URLs to be retrieved should be on the current server, so allowing self-signed certs and not verifying hostnames are useful in testing and
-            // shouldn't be a significant security issue. This should not be allowed for arbitrary OREMap sources.
+            /*
+             * Using Dataverse, all the URLs to be retrieved should be on the current
+             * server, so allowing self-signed certs and not verifying hostnames are useful
+             * in testing and shouldn't be a significant security issue. This should not be
+             * allowed for arbitrary OREMap sources.
+             *
+             */
             SSLContextBuilder builder = new SSLContextBuilder();
             try {
                 builder.loadTrustMaterial(null, new TrustSelfSignedStrategy());
@@ -159,33 +220,45 @@ public BagGenerator(OREMap oreMap, String dataciteXml) throws JsonSyntaxExceptio
                 e.printStackTrace();
             }
 
-            SSLConnectionSocketFactory sslConnectionFactory = new SSLConnectionSocketFactory(builder.build(), NoopHostnameVerifier.INSTANCE);
+            SSLConnectionSocketFactory sslConnectionFactory = new SSLConnectionSocketFactory(
+                    builder.build(),
+                    NoopHostnameVerifier.INSTANCE
+            );
             Registry<ConnectionSocketFactory> registry = RegistryBuilder.<ConnectionSocketFactory>create()
-                    .register("http", PlainConnectionSocketFactory.getSocketFactory())
+                    .register("http", PlainConnectionSocketFactory.getSocketFactory())
                     .register("https", sslConnectionFactory).build();
             cm = new PoolingHttpClientConnectionManager(registry);
             cm.setDefaultMaxPerRoute(numConnections);
             cm.setMaxTotal(numConnections > 20 ? numConnections : 20);
-            client = HttpClients.custom().setConnectionManager(cm).setDefaultRequestConfig(config).build();
+            client = HttpClients.custom()
+                    .setConnectionManager(cm)
+                    .setDefaultRequestConfig(config)
+                    .build();
             scatterZipCreator = new ParallelScatterZipCreator(Executors.newFixedThreadPool(numConnections));
         } catch (NoSuchAlgorithmException | KeyManagementException e) {
-            logger.warning("Aint gonna work");
+            logger.warning("Failed to initialize HTTP client");
             e.printStackTrace();
         }
+        initializeHoleyBagLimits();
+    }
+
+    private void initializeHoleyBagLimits() {
+        this.maxDataFileSize = JvmSettings.BAGIT_ZIP_MAX_FILE_SIZE.lookupOptional(Long.class).orElse(Long.MAX_VALUE);
+        this.maxTotalDataSize = JvmSettings.BAGIT_ZIP_MAX_DATA_SIZE.lookupOptional(Long.class).orElse(Long.MAX_VALUE);
+        this.createHoleyBag = JvmSettings.BAGIT_ZIP_HOLEY.lookupOptional(Boolean.class).orElse(false);
+        logger.fine("BagGenerator size limits - maxDataFileSize: " + maxDataFileSize
+                + ", maxTotalDataSize: " + maxTotalDataSize
+                + ", createHoleyBag: " + createHoleyBag);
     }
 
     public void setIgnoreHashes(boolean val) {
         ignorehashes = val;
     }
-
-    public void setDefaultCheckSumType(ChecksumType type) {
-        hashtype=type;
-    }
-
+
     public static void println(String s) {
         System.out.println(s);
         System.out.flush();
@@ -203,18 +276,18 @@ public static void println(String s) {
      * @return success true/false
      */
     public boolean generateBag(OutputStream outputStream) throws Exception {
-
         File tmp = File.createTempFile("qdr-scatter-dirs", "tmp");
         dirs = ScatterZipOutputStream.fileBased(tmp);
-        // The oremapObject is javax.json.JsonObject and we need com.google.gson.JsonObject for the aggregation object
-        aggregation = (JsonObject) new JsonParser().parse(oremapObject.getJsonObject(JsonLDTerm.ore("describes").getLabel()).toString());
+        // The oremapObject is javax.json.JsonObject and we need
+        // com.google.gson.JsonObject for the aggregation object
+        aggregation = (JsonObject) JsonParser
+                .parseString(oremapObject.getJsonObject(JsonLDTerm.ore("describes").getLabel()).toString());
         String pidUrlString = aggregation.get("@id").getAsString();
-        String pidString=PidUtil.parseAsGlobalID(pidUrlString).asString();
-        bagID = pidString + "v."
-                + aggregation.get(JsonLDTerm.schemaOrg("version").getLabel()).getAsString();
-
+        String pidString = PidUtil.parseAsGlobalID(pidUrlString).asString();
+        bagID = pidString + "v." + aggregation.get(JsonLDTerm.schemaOrg("version").getLabel()).getAsString();
+
         logger.info("Generating Bag: " + bagID);
         try {
             // Create valid filename from identifier and extend path with
@@ -240,7 +313,15 @@ public boolean generateBag(OutputStream outputStream) throws Exception {
             resourceUsed = new Boolean[aggregates.size() + 1];
             // Process current container (the aggregation itself) and its
             // children
-            processContainer(aggregation, currentPath);
+            // Recursively collect all files from the entire tree, start with an empty set of processedContainers
+            List<FileEntry> allFiles = new ArrayList<>();
+            collectAllFiles(aggregation, currentPath, allFiles, false);
+
+            // Sort files by size (smallest first)
+            Collections.sort(allFiles);
+
+            // Process all files in sorted order
+            processAllFiles(allFiles);
         }
         // Create manifest files
         // pid-mapping.txt - a DataOne recommendation to connect ids and
@@ -249,7 +330,7 @@ public boolean generateBag(OutputStream outputStream) throws Exception {
         boolean first = true;
         for (Entry<String, String> pidEntry : pidMap.entrySet()) {
             if (!first) {
-                pidStringBuffer.append("\r\n");
+                pidStringBuffer.append(CRLF);
             } else {
                 first = false;
             }
@@ -264,13 +345,22 @@ public boolean generateBag(OutputStream outputStream) throws Exception {
         first = true;
         for (Entry<String, String> sha1Entry : checksumMap.entrySet()) {
             if (!first) {
-                sha1StringBuffer.append("\r\n");
+                sha1StringBuffer.append(CRLF);
             } else {
                 first = false;
             }
             String path = sha1Entry.getKey();
             sha1StringBuffer.append(sha1Entry.getValue() + " " + path);
         }
+        if(hashtype == null) { // No files - still want to send an empty manifest to nominally comply with BagIT specification requirement.
+            try {
+                // Use the current type if we can retrieve it
+                hashtype = CDI.current().select(SystemConfig.class).get().getFileFixityChecksumAlgorithm();
+            } catch (Exception e) {
+                // Default to MD5 if we can't
+                hashtype = DataFile.ChecksumType.MD5;
+            }
+        }
         if (!(hashtype == null)) {
             String manifestName = "manifest-";
             if (hashtype.equals(DataFile.ChecksumType.SHA1)) {
@@ -286,7 +376,7 @@ public boolean generateBag(OutputStream outputStream) throws Exception {
             }
             createFileFromString(manifestName, sha1StringBuffer.toString());
         } else {
-            logger.warning("No Hash values (no files?) sending empty manifest to nominally comply with BagIT specification requirement");
+            logger.warning("No Hash value defined sending empty manifest-md5 to nominally comply with BagIT specification requirement");
             createFileFromString("manifest-md5.txt", "");
         }
         // bagit.txt - Required by spec
@@ -312,6 +402,8 @@ public boolean generateBag(OutputStream outputStream) throws Exception {
 
         logger.fine("Creating bag: " + bagName);
 
+        writeFetchFile();
+
         ZipArchiveOutputStream zipArchiveOutputStream = new ZipArchiveOutputStream(outputStream);
 
         /*
@@ -358,7 +450,6 @@ public boolean generateBag(OutputStream outputStream) throws Exception {
 
     public boolean generateBag(String bagName, boolean temp) {
         usetemp = temp;
-        FileOutputStream bagFileOS = null;
         try {
             File origBagFile = getBagFile(bagName);
             File bagFile = origBagFile;
@@ -367,82 +458,78 @@ public boolean generateBag(String bagName, boolean temp) {
                 logger.fine("Writing to: " + bagFile.getAbsolutePath());
             }
             // Create an output stream backed by the file
-            bagFileOS = new FileOutputStream(bagFile);
-            if (generateBag(bagFileOS)) {
-                //The generateBag call sets this.bagName to the correct value
-                validateBagFile(bagFile);
-                if (usetemp) {
-                    logger.fine("Moving tmp zip");
-                    origBagFile.delete();
-                    bagFile.renameTo(origBagFile);
+            try (FileOutputStream bagFileOS = new FileOutputStream(bagFile)) {
+                if (generateBag(bagFileOS)) {
+                    // The generateBag call sets this.bagName to the correct value
+                    validateBagFile(bagFile);
+                    if (usetemp) {
+                        logger.fine("Moving tmp zip");
+                        origBagFile.delete();
+                        bagFile.renameTo(origBagFile);
+                    }
+                    return true;
+                } else {
+                    return false;
                 }
-                return true;
-            } else {
-                return false;
             }
         } catch (Exception e) {
-            logger.log(Level.SEVERE,"Bag Exception: ", e);
+            logger.log(Level.SEVERE, "Bag Exception: ", e);
             e.printStackTrace();
             logger.warning("Failure: Processing failure during Bagit file creation");
             return false;
-        } finally {
-            IOUtils.closeQuietly(bagFileOS);
         }
     }
 
     public void validateBag(String bagId) {
         logger.info("Validating Bag");
-        ZipFile zf = null;
-        InputStream is = null;
         try {
             File bagFile = getBagFile(bagId);
-            zf = new ZipFile(bagFile);
-            ZipArchiveEntry entry = zf.getEntry(getValidName(bagId) + "/manifest-sha1.txt");
-            if (entry != null) {
-                logger.info("SHA1 hashes used");
-                hashtype = DataFile.ChecksumType.SHA1;
-            } else {
-                entry = zf.getEntry(getValidName(bagId) + "/manifest-sha512.txt");
+            try (ZipFile zf = ZipFile.builder().setFile(bagFile).get()) {
+                ZipArchiveEntry entry = zf.getEntry(getValidName(bagId) + "/manifest-sha1.txt");
                 if (entry != null) {
-                    logger.info("SHA512 hashes used");
-                    hashtype = DataFile.ChecksumType.SHA512;
+                    logger.info("SHA1 hashes used");
+                    hashtype = DataFile.ChecksumType.SHA1;
                 } else {
-                    entry = zf.getEntry(getValidName(bagId) + "/manifest-sha256.txt");
+                    entry = zf.getEntry(getValidName(bagId) + "/manifest-sha512.txt");
                     if (entry != null) {
-                        logger.info("SHA256 hashes used");
-                        hashtype = DataFile.ChecksumType.SHA256;
+                        logger.info("SHA512 hashes used");
+                        hashtype = DataFile.ChecksumType.SHA512;
                     } else {
-                        entry = zf.getEntry(getValidName(bagId) + "/manifest-md5.txt");
+                        entry = zf.getEntry(getValidName(bagId) + "/manifest-sha256.txt");
                         if (entry != null) {
-                            logger.info("MD5 hashes used");
-                            hashtype = DataFile.ChecksumType.MD5;
+                            logger.info("SHA256 hashes used");
+                            hashtype = DataFile.ChecksumType.SHA256;
+                        } else {
+                            entry = zf.getEntry(getValidName(bagId) + "/manifest-md5.txt");
+                            if (entry != null) {
+                                logger.info("MD5 hashes used");
+                                hashtype = DataFile.ChecksumType.MD5;
+                            }
                         }
                     }
                 }
+                if (entry == null)
+                    throw new IOException("No manifest file found");
+                try (InputStream is = zf.getInputStream(entry)) {
+                    BufferedReader br = new BufferedReader(new InputStreamReader(is));
+                    String line = br.readLine();
+                    while (line != null) {
+                        logger.fine("Hash entry: " + line);
+                        int breakIndex = line.indexOf(' ');
+                        String hash = line.substring(0, breakIndex);
+                        String path = line.substring(breakIndex + 1);
+                        logger.fine("Adding: " + path + " with hash: " + hash);
+                        checksumMap.put(path, hash);
+                        line = br.readLine();
+                    }
+                }
             }
-            if (entry == null)
-                throw new IOException("No manifest file found");
-            is = zf.getInputStream(entry);
-            BufferedReader br = new BufferedReader(new InputStreamReader(is));
-            String line = br.readLine();
-            while (line != null) {
-                logger.fine("Hash entry: " + line);
-                int breakIndex = line.indexOf(' ');
-                String hash = line.substring(0, breakIndex);
-                String path = line.substring(breakIndex + 1);
-                logger.fine("Adding: " + path + " with hash: " + hash);
-                checksumMap.put(path, hash);
-                line = br.readLine();
-            }
-            IOUtils.closeQuietly(is);
             logger.info("HashMap Map contains: " + checksumMap.size() + " entries");
             checkFiles(checksumMap, bagFile);
         } catch (IOException io) {
-            logger.log(Level.SEVERE,"Could not validate Hashes", io);
+            logger.log(Level.SEVERE, "Could not validate Hashes", io);
         } catch (Exception e) {
-            logger.log(Level.SEVERE,"Could not validate Hashes", e);
-        } finally {
-            IOUtils.closeQuietly(zf);
+            logger.log(Level.SEVERE, "Could not validate Hashes", e);
         }
         return;
     }
@@ -465,7 +552,7 @@ public File getBagFile(String bagID) throws Exception {
 
     private void validateBagFile(File bagFile) throws IOException {
         // Run a confirmation test - should verify all files and hashes
-
+
         // Check files calculates the hashes and file sizes and reports on
         // whether hashes are correct
         checkFiles(checksumMap, bagFile);
@@ -479,26 +566,31 @@ public static String getValidName(String bagName) {
         return bagName.replaceAll("\\W", "-");
     }
 
-    private void processContainer(JsonObject item, String currentPath) throws IOException {
+    // Collect all files recursively and process containers to create dirs in the zip
+    private void collectAllFiles(JsonObject item, String currentPath, List<FileEntry> allFiles, boolean addTitle)
+            throws IOException {
         JsonArray children = getChildren(item);
-        HashSet<String> titles = new HashSet<String>();
-        String title = null;
-        if (item.has(JsonLDTerm.dcTerms("Title").getLabel())) {
-            title = item.get("Title").getAsString();
-        } else if (item.has(JsonLDTerm.schemaOrg("name").getLabel())) {
-            title = item.get(JsonLDTerm.schemaOrg("name").getLabel()).getAsString();
+
+        if (addTitle) { //For any sub-collections (non-Dataverse)
+            String title = null;
+            if (item.has(JsonLDTerm.dcTerms("Title").getLabel())) {
+                title = item.get("Title").getAsString();
+            } else if (item.has(JsonLDTerm.schemaOrg("name").getLabel())) {
+                title = item.get(JsonLDTerm.schemaOrg("name").getLabel()).getAsString();
+            }
+            logger.fine("Collecting files from " + title + "/ at path " + currentPath);
+            currentPath = currentPath + title + "/";
         }
-        logger.fine("Adding " + title + "/ to path " + currentPath);
-        currentPath = currentPath + title + "/";
+        // Mark this container as processed
+        String containerId = item.get("@id").getAsString();
+
+        // Create directory and update tracking for this container
         int containerIndex = -1;
         try {
            createDir(currentPath);
-            // Add containers to pid map and mark as 'used', but no sha1 hash
-            // value
-            containerIndex = getUnusedIndexOf(item.get("@id").getAsString());
+            containerIndex = getUnusedIndexOf(containerId);
            resourceUsed[containerIndex] = true;
-            pidMap.put(item.get("@id").getAsString(), currentPath);
-
+            pidMap.put(containerId, currentPath);
        } catch (InterruptedException | IOException | ExecutionException e) {
            e.printStackTrace();
            logger.severe(e.getMessage());
@@ -506,8 +598,8 @@ private void processContainer(JsonObject item, String currentPath) throws IOExce
                resourceUsed[containerIndex] = false;
            }
            throw new IOException("Unable to create bag");
-
        }
+
        for (int i = 0; i < children.size(); i++) {
            // Find the ith child in the overall array of aggregated
@@ -522,119 +614,188 @@ private void processContainer(JsonObject item, String currentPath) throws IOExce
            // Aggregation is at index 0, so need to shift by 1 for aggregates
            // entries
            JsonObject child = aggregates.get(index - 1).getAsJsonObject();
+            // Dataverse does not currently use containers - this is for other variants/future use
            if (childIsContainer(child)) {
-                // create dir and process children
-                // processContainer will mark this item as used
-                processContainer(child, currentPath);
+                // Recursively collect files from this container
+                collectAllFiles(child, currentPath, allFiles, true);
            } else {
-                resourceUsed[index] = true;
-                // add item
-                // ToDo
-                String dataUrl = child.get(JsonLDTerm.schemaOrg("sameAs").getLabel()).getAsString();
-                logger.fine("File url: " + dataUrl);
-                String childTitle = child.get(JsonLDTerm.schemaOrg("name").getLabel()).getAsString();
-                if (titles.contains(childTitle)) {
-                    logger.warning("**** Multiple items with the same title in: " + currentPath);
-                    logger.warning("**** Will cause failure in hash and size validation in: " + bagID);
-                } else {
-                    titles.add(childTitle);
+
+                // Get file size
+                Long fileSize = null;
+                if (child.has(JsonLDTerm.filesize.getLabel())) {
+                    fileSize = child.get(JsonLDTerm.filesize.getLabel()).getAsLong();
                }
-                String childPath = currentPath + childTitle;
-                JsonElement directoryLabel = child.get(JsonLDTerm.DVCore("directoryLabel").getLabel());
-                if(directoryLabel!=null) {
-                    childPath=currentPath + directoryLabel.getAsString() + "/" + childTitle;
+                if (fileSize == null) {
+                    logger.severe("File size missing for child: " + childId);
+                    throw new IOException("Unable to create bag due to missing file size");
                }
+                // Store minimal info for sorting - JsonObject is just a reference
+                allFiles.add(new FileEntry(fileSize, child, currentPath, index));
+            }
+        }
+    }
+
+
+    // Process all files in sorted order
+    private void processAllFiles(List<FileEntry> sortedFiles)
+            throws IOException, ExecutionException, InterruptedException {
+
+        // Track titles to detect duplicates
+        Set<String> titles = new HashSet<>();
+
+        if ((hashtype == null) | ignorehashes) {
+            hashtype = DataFile.ChecksumType.SHA512;
+        }
+
+        for (FileEntry entry : sortedFiles) {
+            // Extract all needed information from the JsonObject reference
+            JsonObject child = entry.jsonObject;
+
+            String childTitle = entry.getChildTitle();
+
+            // Check for duplicate titles
+            if (titles.contains(childTitle)) {
+                logger.warning("**** Multiple items with the same title in: " + entry.currentPath);
+                logger.warning("**** Will cause failure in hash and size validation in: " + bagID);
+            } else {
+                titles.add(childTitle);
+            }
+
+            String childPath = entry.getChildPath(childTitle);
+
+            // Get hash if exists
+            String childHash = null;
+            if (child.has(JsonLDTerm.checksum.getLabel())) {
+                ChecksumType childHashType = ChecksumType
+                        .fromUri(child.getAsJsonObject(JsonLDTerm.checksum.getLabel()).get("@type").getAsString());
+                if (hashtype == null) {
+                    hashtype = childHashType;
                }
-                if ((hashtype == null) | ignorehashes) {
-                    // Pick sha512 when ignoring hashes or none exist
-                    hashtype = DataFile.ChecksumType.SHA512;
+                if (hashtype != null && !hashtype.equals(childHashType)) {
+                    logger.warning("Multiple hash values in use - will calculate " + hashtype.toString()
+                            + " hashes for " + childTitle);
+                } else {
+                    childHash = child.getAsJsonObject(JsonLDTerm.checksum.getLabel()).get("@value").getAsString();
                }
-                try {
-                    if ((childHash == null) | ignorehashes) {
-                        // Generate missing hash
-                        InputStream inputStream = null;
-                        try {
-                            inputStream = getInputStreamSupplier(dataUrl).get();
-
-                            if (hashtype != null) {
-                                if (hashtype.equals(DataFile.ChecksumType.SHA1)) {
-                                    childHash = DigestUtils.sha1Hex(inputStream);
-                                } else if (hashtype.equals(DataFile.ChecksumType.SHA256)) {
-                                    childHash = DigestUtils.sha256Hex(inputStream);
-                                } else if (hashtype.equals(DataFile.ChecksumType.SHA512)) {
-                                    childHash = DigestUtils.sha512Hex(inputStream);
-                                } else if (hashtype.equals(DataFile.ChecksumType.MD5)) {
-                                    childHash = DigestUtils.md5Hex(inputStream);
-                                }
+            }
+
+            resourceUsed[entry.resourceIndex] = true;
+            String dataUrl = entry.getDataUrl();
+
+            try {
+                if ((childHash == null) | ignorehashes) {
+                    // Generate missing hash
+
+                    try (InputStream inputStream = getInputStreamSupplier(dataUrl).get()){
+                        if (hashtype != null) {
+                            if (hashtype.equals(DataFile.ChecksumType.SHA1)) {
+                                childHash = DigestUtils.sha1Hex(inputStream);
+                            } else if (hashtype.equals(DataFile.ChecksumType.SHA256)) {
+                                childHash = DigestUtils.sha256Hex(inputStream);
+                            } else if (hashtype.equals(DataFile.ChecksumType.SHA512)) {
+                                childHash = DigestUtils.sha512Hex(inputStream);
+                            } else if (hashtype.equals(DataFile.ChecksumType.MD5)) {
+                                childHash = DigestUtils.md5Hex(inputStream);
                            }
-
-                        } catch (IOException e) {
-                            logger.severe("Failed to read " + childPath);
-                            throw e;
-                        } finally {
-                            IOUtils.closeQuietly(inputStream);
                        }
-                        if (childHash != null) {
-                            JsonObject childHashObject = new JsonObject();
-                            childHashObject.addProperty("@type", hashtype.toString());
-                            childHashObject.addProperty("@value", childHash);
-                            child.add(JsonLDTerm.checksum.getLabel(), (JsonElement) childHashObject);
-                            checksumMap.put(childPath, childHash);
-                        } else {
-                            logger.warning("Unable to calculate a " + hashtype + " for " + dataUrl);
-                        }
+                    } catch (IOException e) {
+                        logger.severe("Failed to read " + childPath);
+                        throw e;
                    }
-                    logger.fine("Requesting: " + childPath + " from " + dataUrl);
-                    createFileFromURL(childPath, dataUrl);
-                    dataCount++;
-                    if (dataCount % 1000 == 0) {
-                        logger.info("Retrieval in progress: " + dataCount + " files retrieved");
+                    if (childHash != null) {
+                        JsonObject childHashObject = new JsonObject();
+                        childHashObject.addProperty("@type", hashtype.toString());
+                        childHashObject.addProperty("@value", childHash);
+                        child.add(JsonLDTerm.checksum.getLabel(), (JsonElement) childHashObject);
+
+                        checksumMap.put(childPath, childHash);
+                    } else {
+                        logger.warning("Unable to calculate a " + hashtype + " for " + dataUrl);
                    }
-                    if (child.has(JsonLDTerm.filesize.getLabel())) {
-                        Long size = child.get(JsonLDTerm.filesize.getLabel()).getAsLong();
-                        totalDataSize += size;
-                        if (size > maxFileSize) {
-                            maxFileSize = size;
-                        }
+                } else {
+                    // Hash already exists, add to checksumMap
+                    if (checksumMap.containsValue(childHash)) {
+                        logger.warning("Duplicate/Collision: " + child.get("@id").getAsString()
+                                + " has hash: " + childHash + " in: " + bagID);
                    }
-                    if (child.has(JsonLDTerm.schemaOrg("fileFormat").getLabel())) {
-                        mimetypes.add(child.get(JsonLDTerm.schemaOrg("fileFormat").getLabel()).getAsString());
+                    logger.fine("Adding " + childPath + " with hash " + childHash + " to checksumMap");
+                    checksumMap.put(childPath, childHash);
+                }
+                // Add file to bag or fetch file
+                if (!addToZip(entry.size)) {
+                    if(createHoleyBag) {
+                        logger.fine("Adding to fetch file: " + childPath + " from " + dataUrl
+                                + " (size: " + entry.size + " bytes)");
+                        addToFetchFile(dataUrl, entry.size, childPath);
+                        usingFetchFile = true;
+                    } else {
+                        // Add to list for archiver to retrieve
+                        oversizedFiles.add(entry);
+                        logger.fine("Adding " + childPath + " to oversized files list for archiver");
                    }
+                } else {
+                    logger.fine("Requesting: " + childPath + " from " + dataUrl
+                            + " (size: " + entry.size + " bytes)");
+                    createFileFromURL(childPath, dataUrl);
+                    currentBagDataSize += entry.size;
+                }
+
+                dataCount++;
+                if (dataCount % 1000 == 0) {
+                    logger.info("Retrieval in progress: " + dataCount + " files retrieved");
+                }
+
+                totalDataSize += entry.size;
+                if (entry.size > maxFileSize) {
+                    maxFileSize = entry.size;
+                }
+
+                if (child.has(JsonLDTerm.schemaOrg("fileFormat").getLabel())) {
+                    mimetypes.add(child.get(JsonLDTerm.schemaOrg("fileFormat").getLabel()).getAsString());
                }
-                // Check for nulls!
- pidMap.put(child.get("@id").getAsString(), childPath); - + } catch (Exception e) { + resourceUsed[entry.resourceIndex] = false; + e.printStackTrace(); + throw new IOException("Unable to create bag"); } + + pidMap.put(child.get("@id").getAsString(), childPath); + } + } + + // Helper method to determine if file should go to fetch file + private boolean addToZip(long fileSize) { + + // Check individual file size limit + if (fileSize > maxDataFileSize) { + logger.fine("File size " + fileSize + " exceeds max data file size " + maxDataFileSize); + return false; + } + + // Check total bag size limit + if (currentBagDataSize + fileSize > maxTotalDataSize) { + logger.fine("Adding file would exceed max total data size. Current: " + currentBagDataSize + + ", File: " + fileSize + ", Max: " + maxTotalDataSize); + return false; + } + + return true; + } + + // Method to append to fetch file content + private void addToFetchFile(String url, long size, String filename) { + // Format: URL size filename + fetchFileContent.append(url).append(" ").append(Long.toString(size)).append(" ").append(filename).append(CRLF); + } + + // Method to write fetch file to bag (call this before finalizing the bag) + private void writeFetchFile() throws IOException, ExecutionException, InterruptedException { + if (usingFetchFile && fetchFileContent.length() > 0) { + logger.info("Creating fetch.txt file for holey bag"); + createFileFromString("fetch.txt", fetchFileContent.toString()); } } @@ -705,9 +866,7 @@ private void createFileFromURL(final String relPath, final String uri) private void checkFiles(HashMap shaMap, File bagFile) { ExecutorService executor = Executors.newFixedThreadPool(numConnections); - ZipFile zf = null; - try { - zf = new ZipFile(bagFile); + try (ZipFile zf = ZipFile.builder().setFile(bagFile).get()) { BagValidationJob.setZipFile(zf); BagValidationJob.setBagGenerator(this); @@ -730,12 +889,9 @@ private void checkFiles(HashMap shaMap, File bagFile) { } } catch 
(InterruptedException e) { logger.log(Level.SEVERE, "Hash Calculations interrupted", e); - } + } } catch (IOException e1) { - // TODO Auto-generated catch block e1.printStackTrace(); - } finally { - IOUtils.closeQuietly(zf); } logger.fine("Hash Validations Completed"); @@ -758,59 +914,55 @@ public void writeTo(ZipArchiveOutputStream zipArchiveOutputStream) logger.fine("Files written"); } - static final String CRLF = "\r\n"; - private String generateInfoFile() { logger.fine("Generating info file"); StringBuffer info = new StringBuffer(); - JsonArray contactsArray = new JsonArray(); - /* Contact, and it's subfields, are terms from citation.tsv whose mapping to a formal vocabulary and label in the oremap may change - * so we need to find the labels used. - */ - JsonLDTerm contactTerm = oremap.getContactTerm(); + /* + * Contact, and its subfields, are terms from citation.tsv whose mapping to a + * formal vocabulary and label in the oremap may change so we need to find the + * labels used. + */ + JsonLDTerm contactTerm = terms.get(DatasetFieldConstant.datasetContact); if ((contactTerm != null) && aggregation.has(contactTerm.getLabel())) { JsonElement contacts = aggregation.get(contactTerm.getLabel()); - JsonLDTerm contactNameTerm = oremap.getContactNameTerm(); - JsonLDTerm contactEmailTerm = oremap.getContactEmailTerm(); - + JsonLDTerm contactNameTerm = terms.get(DatasetFieldConstant.datasetContactName); + JsonLDTerm contactEmailTerm = terms.get(DatasetFieldConstant.datasetContactEmail); + if (contacts.isJsonArray()) { + JsonArray contactsArray = contacts.getAsJsonArray(); for (int i = 0; i < contactsArray.size(); i++) { - info.append("Contact-Name: "); + JsonElement person = contactsArray.get(i); if (person.isJsonPrimitive()) { - info.append(person.getAsString()); + info.append(multilineWrap(CONTACT_NAME + person.getAsString())); info.append(CRLF); } else { - if(contactNameTerm != null) { - info.append(((JsonObject)
person).get(contactNameTerm.getLabel()).getAsString()); - info.append(CRLF); + if (contactNameTerm != null) { + info.append(multilineWrap(CONTACT_NAME + ((JsonObject) person).get(contactNameTerm.getLabel()).getAsString())); + info.append(CRLF); } - if ((contactEmailTerm!=null) &&((JsonObject) person).has(contactEmailTerm.getLabel())) { - info.append("Contact-Email: "); - info.append(((JsonObject) person).get(contactEmailTerm.getLabel()).getAsString()); + if ((contactEmailTerm != null) && ((JsonObject) person).has(contactEmailTerm.getLabel())) { + info.append(multilineWrap(CONTACT_EMAIL + ((JsonObject) person).get(contactEmailTerm.getLabel()).getAsString())); info.append(CRLF); } } } } else { - info.append("Contact-Name: "); - if (contacts.isJsonPrimitive()) { - info.append((String) contacts.getAsString()); + info.append(multilineWrap(CONTACT_NAME + (String) contacts.getAsString())); info.append(CRLF); } else { JsonObject person = contacts.getAsJsonObject(); - if(contactNameTerm != null) { - info.append(person.get(contactNameTerm.getLabel()).getAsString()); - info.append(CRLF); + if (contactNameTerm != null) { + info.append(multilineWrap(CONTACT_NAME + person.get(contactNameTerm.getLabel()).getAsString())); + info.append(CRLF); } - if ((contactEmailTerm!=null) && (person.has(contactEmailTerm.getLabel()))) { - info.append("Contact-Email: "); - info.append(person.get(contactEmailTerm.getLabel()).getAsString()); + if ((contactEmailTerm != null) && (person.has(contactEmailTerm.getLabel()))) { + info.append(multilineWrap(CONTACT_EMAIL + person.get(contactEmailTerm.getLabel()).getAsString())); info.append(CRLF); } } @@ -820,88 +972,222 @@ private String generateInfoFile() { logger.warning("No contact info available for BagIt Info file"); } - String orgName = JvmSettings.BAGIT_SOURCE_ORG_NAME.lookupOptional(String.class).orElse("Dataverse Installation ()"); + String orgName = JvmSettings.BAGIT_SOURCE_ORG_NAME.lookupOptional(String.class) + .orElse("Dataverse Installation 
()"); String orgAddress = JvmSettings.BAGIT_SOURCEORG_ADDRESS.lookupOptional(String.class).orElse(""); String orgEmail = JvmSettings.BAGIT_SOURCEORG_EMAIL.lookupOptional(String.class).orElse(""); - info.append("Source-Organization: " + orgName); + info.append(multilineWrap(SOURCE_ORGANIZATION + orgName)); // ToDo - make configurable info.append(CRLF); - info.append("Organization-Address: " + WordUtils.wrap(orgAddress, 78, CRLF + " ", true)); + info.append(multilineWrap(ORGANIZATION_ADDRESS + orgAddress)); info.append(CRLF); // Not a BagIt standard name - info.append("Organization-Email: " + orgEmail); + info.append(multilineWrap(ORGANIZATION_EMAIL + orgEmail)); info.append(CRLF); - info.append("External-Description: "); - - /* Description, and it's subfields, are terms from citation.tsv whose mapping to a formal vocabulary and label in the oremap may change - * so we need to find the labels used. + /* + * Description, and its subfields, are terms from citation.tsv whose mapping to + * a formal vocabulary and label in the oremap may change so we need to find the + * labels used.
*/ - JsonLDTerm descriptionTerm = oremap.getDescriptionTerm(); - JsonLDTerm descriptionTextTerm = oremap.getDescriptionTextTerm(); + JsonLDTerm descriptionTerm = terms.get(DatasetFieldConstant.description); + JsonLDTerm descriptionTextTerm = terms.get(DatasetFieldConstant.descriptionText); if (descriptionTerm == null) { logger.warning("No description available for BagIt Info file"); } else { - info.append( - // FixMe - handle description having subfields better - WordUtils.wrap(getSingleValue(aggregation.get(descriptionTerm.getLabel()), - descriptionTextTerm.getLabel()), 78, CRLF + " ", true)); + info.append(multilineWrap(EXTERNAL_DESCRIPTION + + getSingleValue(aggregation.get(descriptionTerm.getLabel()), descriptionTextTerm.getLabel()))); info.append(CRLF); } - info.append("Bagging-Date: "); + info.append(BAGGING_DATE); info.append((new SimpleDateFormat("yyyy-MM-dd").format(Calendar.getInstance().getTime()))); info.append(CRLF); - info.append("External-Identifier: "); - info.append(aggregation.get("@id").getAsString()); + info.append(multilineWrap(EXTERNAL_IDENTIFIER + aggregation.get("@id").getAsString())); info.append(CRLF); - info.append("Bag-Size: "); + info.append(BAG_SIZE); info.append(byteCountToDisplaySize(totalDataSize)); info.append(CRLF); - info.append("Payload-Oxum: "); + info.append(PAYLOAD_OXUM); info.append(Long.toString(totalDataSize)); info.append("."); info.append(Long.toString(dataCount)); info.append(CRLF); - info.append("Internal-Sender-Identifier: "); String catalog = orgName + " Catalog"; if (aggregation.has(JsonLDTerm.schemaOrg("includedInDataCatalog").getLabel())) { catalog = aggregation.get(JsonLDTerm.schemaOrg("includedInDataCatalog").getLabel()).getAsString(); } - info.append(catalog + ":" + aggregation.get(JsonLDTerm.schemaOrg("name").getLabel()).getAsString()); + info.append(multilineWrap(INTERNAL_SENDER_IDENTIFIER + catalog + ":" + + aggregation.get(JsonLDTerm.schemaOrg("name").getLabel()).getAsString())); info.append(CRLF); + // Add 
a version number for our bag type - should be updated with any change to + // the bag content/structure + info.append(DATAVERSE_BAG_VERSION); + info.append(CRLF); return info.toString(); } + static private String multilineWrap(String value) { + // Normalize line breaks and ensure all lines after the first are indented + String[] lines = value.split("\\r?\\n"); + StringBuilder wrappedValue = new StringBuilder(); + for (int i = 0; i < lines.length; i++) { + // Skip empty lines - RFC8493 (section 7.3) doesn't allow truly empty lines; + // while trailing whitespace or whitespace-only lines appear to be allowed, it's + // not clear that handling them adds value (visually identical entries in + // Dataverse could result in entries w/ or w/o extra lines in the bag-info.txt + // file). + String line = lines[i].trim(); + if (line.length() > 0) { + // Recommended line length, including the label or indents, is 79 + String wrapped = lineWrap(line, 79, CRLF + " ", true); + wrappedValue.append(wrapped); + if (i < lines.length - 1) { + wrappedValue.append(CRLF).append(" "); + } + } + } + return wrappedValue.toString(); + } + + /** Adapted from Apache WordUtils.wrap() - make subsequent lines shorter by the length of any spaces in newLineStr */ + public static String lineWrap(final String str, int wrapLength, String newLineStr, final boolean wrapLongWords) { + if (str == null) { + return null; + } + if (newLineStr == null) { + newLineStr = System.lineSeparator(); + } + if (wrapLength < 1) { + wrapLength = 1; + } + + // Calculate the indent length (characters after CRLF in newLineStr) + int indentLength = 0; + int crlfIndex = newLineStr.lastIndexOf("\n"); + if (crlfIndex != -1) { + indentLength = newLineStr.length() - crlfIndex - 1; + } + + String wrapOn = " "; + final Pattern patternToWrapOn = Pattern.compile(wrapOn); + final int inputLineLength = str.length(); + int offset = 0; + final StringBuilder wrappedLine = new StringBuilder(inputLineLength + 32); + int matcherSize = -1; + 
boolean isFirstLine = true; + + while (offset < inputLineLength) { + // Adjust wrap length based on whether this is the first line or subsequent + // lines + int currentWrapLength = isFirstLine ? wrapLength : (wrapLength - indentLength); + + int spaceToWrapAt = -1; + Matcher matcher = patternToWrapOn.matcher(str.substring(offset, + Math.min((int) Math.min(Integer.MAX_VALUE, offset + currentWrapLength + 1L), inputLineLength))); + if (matcher.find()) { + if (matcher.start() == 0) { + matcherSize = matcher.end(); + if (matcherSize != 0) { + offset += matcher.end(); + continue; + } + offset += 1; + } + spaceToWrapAt = matcher.start() + offset; + } + + // only last line without leading spaces is left + if (inputLineLength - offset <= currentWrapLength) { + break; + } + + while (matcher.find()) { + spaceToWrapAt = matcher.start() + offset; + } + + if (spaceToWrapAt >= offset) { + // normal case + wrappedLine.append(str, offset, spaceToWrapAt); + wrappedLine.append(newLineStr); + offset = spaceToWrapAt + 1; + isFirstLine = false; + + } else // really long word or URL + if (wrapLongWords) { + if (matcherSize == 0) { + offset--; + } + // wrap really long word one line at a time + wrappedLine.append(str, offset, currentWrapLength + offset); + wrappedLine.append(newLineStr); + offset += currentWrapLength; + matcherSize = -1; + isFirstLine = false; + } else { + // do not wrap really long word, just extend beyond limit + matcher = patternToWrapOn.matcher(str.substring(offset + currentWrapLength)); + if (matcher.find()) { + matcherSize = matcher.end() - matcher.start(); + spaceToWrapAt = matcher.start() + offset + currentWrapLength; + } + + if (spaceToWrapAt >= 0) { + if (matcherSize == 0 && offset != 0) { + offset--; + } + wrappedLine.append(str, offset, spaceToWrapAt); + wrappedLine.append(newLineStr); + offset = spaceToWrapAt + 1; + isFirstLine = false; + } else { + if (matcherSize == 0 && offset != 0) { + offset--; + } + wrappedLine.append(str, offset, str.length()); + 
offset = inputLineLength; + matcherSize = -1; + } + } + } + + if (matcherSize == 0 && offset < inputLineLength) { + offset--; + } + + // Whatever is left in line is short enough to just pass through + wrappedLine.append(str, offset, str.length()); + + return wrappedLine.toString(); + } + /** - * Kludge - compound values (e.g. for descriptions) are sent as an array of + * Compound values (e.g. for descriptions) are sent as an array of * objects containing key/values whereas a single value is sent as one object. * For cases where multiple values are sent, create a concatenated string so * that information is not lost. * - * @param jsonElement - * - the root json object - * @param key - * - the key to find a value(s) for + * @param jsonElement - the root json object + * @param key - the key to find a value(s) for * @return - a single string */ String getSingleValue(JsonElement jsonElement, String key) { String val = ""; - if(jsonElement.isJsonObject()) { - JsonObject jsonObject=jsonElement.getAsJsonObject(); + if (jsonElement.isJsonObject()) { + JsonObject jsonObject = jsonElement.getAsJsonObject(); val = jsonObject.get(key).getAsString(); } else if (jsonElement.isJsonArray()) { - + Iterator iter = jsonElement.getAsJsonArray().iterator(); ArrayList stringArray = new ArrayList(); while (iter.hasNext()) { @@ -949,6 +1235,7 @@ private static JsonArray getChildren(JsonObject parent) { // Logic to decide if this is a container - // first check for children, then check for source-specific type indicators + // Dataverse does not currently use containers - this is for other variants/future use private static boolean childIsContainer(JsonObject item) { if (getChildren(item).size() != 0) { return true; @@ -994,10 +1281,8 @@ private HttpGet createNewGetRequest(URI url, String returnType) { urlString = urlString + ((urlString.indexOf('?') != -1) ? 
"&key=" : "?key=") + apiKey; request = new HttpGet(new URI(urlString)); } catch (MalformedURLException e) { - // TODO Auto-generated catch block e.printStackTrace(); } catch (URISyntaxException e) { - // TODO Auto-generated catch block e.printStackTrace(); } } else { @@ -1009,75 +1294,114 @@ private HttpGet createNewGetRequest(URI url, String returnType) { return request; } - InputStreamSupplier getInputStreamSupplier(final String uriString) { + /** Get a stream supplier for the given URI. + * + * Caller must close the stream when done. + */ + public InputStreamSupplier getInputStreamSupplier(final String uriString) { return new InputStreamSupplier() { public InputStream get() { try { URI uri = new URI(uriString); - int tries = 0; - while (tries < 5) { + while (tries < MAX_RETRIES) { - logger.fine("Get # " + tries + " for " + uriString); + logger.finest("Get # " + tries + " for " + uriString); HttpGet getFile = createNewGetRequest(uri, null); - logger.finest("Retrieving " + tries + ": " + uriString); - CloseableHttpResponse response = null; + try { - response = client.execute(getFile); - // Note - if we ever need to pass an HttpClientContext, we need a new one per - // thread. 
- int statusCode = response.getStatusLine().getStatusCode(); + // Execute the request directly and keep the response open + final CloseableHttpResponse response = (CloseableHttpResponse) client.executeOpen(null, getFile, HttpClientContext.create()); + int statusCode = response.getCode(); + if (statusCode == 200) { logger.finest("Retrieved: " + uri); - return response.getEntity().getContent(); - } - logger.warning("Attempt: " + tries + " - Unexpected Status when retrieving " + uriString - + " : " + statusCode); - if (statusCode < 500) { - logger.fine("Will not retry for 40x errors"); - tries += 5; + // Return a wrapped stream that will close the response when the stream is closed + final HttpEntity entity = response.getEntity(); + if (entity != null) { + // Create a wrapper stream that closes the response when the stream is closed + return new FilterInputStream(entity.getContent()) { + @Override + public void close() throws IOException { + try { + super.close(); + } finally { + response.close(); + } + } + }; + } else { + response.close(); + logger.warning("No content in response for: " + uriString); + return null; + } } else { + // Close the response for non-200 responses + response.close(); + + logger.warning("Attempt: " + tries + " - Unexpected Status when retrieving " + uriString + + " : " + statusCode); tries++; - } - // Error handling - if (response != null) { try { - EntityUtils.consumeQuietly(response.getEntity()); - response.close(); - } catch (IOException io) { - logger.warning( - "Exception closing response after status: " + statusCode + " on " + uri); + // Calculate exponential backoff: 2^tries * baseWaitTimeMs (1 sec) + long waitTime = (long) (Math.pow(2, tries) * baseWaitTimeMs); + + // Add jitter: random value between 0-30% of the wait time + long jitter = (long) (waitTime * 0.3 * Math.random()); + waitTime = waitTime + jitter; + + // Cap the wait time at maxWaitTimeMs (30 seconds) + waitTime = Math.min(waitTime, maxWaitTimeMs); + + 
logger.fine("Sleeping for " + waitTime + "ms before retry attempt " + tries); + Thread.sleep(waitTime); + } catch (InterruptedException ie) { + logger.log(Level.SEVERE, "InterruptedException during retry delay for file: " + uriString, ie); + Thread.currentThread().interrupt(); // Restore interrupt status + tries += MAX_RETRIES; // Skip remaining attempts } } } catch (ClientProtocolException e) { - tries += 5; - // TODO Auto-generated catch block - e.printStackTrace(); + tries += MAX_RETRIES; + logger.log(Level.SEVERE, "ClientProtocolException when retrieving file: " + uriString + " (attempt " + tries + ")", e); + } catch (SocketTimeoutException e) { + // Specific handling for timeout exceptions + tries++; + logger.log(Level.SEVERE, "SocketTimeoutException when retrieving file: " + uriString + " (attempt " + tries + " of " + MAX_RETRIES + ") - Request exceeded timeout", e); + if (tries == MAX_RETRIES) { + logger.log(Level.SEVERE, "FINAL FAILURE: File could not be retrieved after all retries due to timeouts: " + uriString, e); + } + } catch (InterruptedIOException e) { + // Catches interruptions during I/O operations + tries += MAX_RETRIES; + logger.log(Level.SEVERE, "InterruptedIOException when retrieving file: " + uriString + " - Operation was interrupted", e); + Thread.currentThread().interrupt(); // Restore interrupt status } catch (IOException e) { - // Retry if this is a potentially temporary error such - // as a timeout + // Retry if this is a potentially temporary error such as a timeout tries++; - logger.log(Level.WARNING, "Attempt# " + tries + " : Unable to retrieve file: " + uriString, - e); - if (tries == 5) { - logger.severe("Final attempt failed for " + uriString); + logger.log(Level.WARNING, "IOException when retrieving file: " + uriString + " (attempt " + tries + " of " + MAX_RETRIES+ ")", e); + if (tries == MAX_RETRIES) { + logger.log(Level.SEVERE, "FINAL FAILURE: File could not be retrieved after all retries: " + uriString, e); } - 
e.printStackTrace(); } } - } catch (URISyntaxException e) { - // TODO Auto-generated catch block - e.printStackTrace(); + logger.log(Level.SEVERE, "URISyntaxException for file: " + uriString + " - Invalid URI format", e); } - logger.severe("Could not read: " + uriString); + logger.severe("FAILED TO RETRIEVE FILE after all retries: " + uriString); return null; } }; } + + + public List getOversizedFiles() { + return oversizedFiles; + } + /** * Adapted from org/apache/commons/io/FileUtils.java change to SI - add 2 digits * of precision @@ -1101,8 +1425,7 @@ public InputStream get() { * Returns a human-readable version of the file size, where the input represents * a specific number of bytes. * - * @param size - * the number of bytes + * @param size the number of bytes * @return a human-readable display value (includes units) */ public static String byteCountToDisplaySize(long size) { @@ -1124,9 +1447,56 @@ public void setAuthenticationKey(String tokenString) { apiKey = tokenString; } - public void setNumConnections(int numConnections) { - this.numConnections = numConnections; - logger.fine("BagGenerator will use " + numConnections + " threads"); + public static void setNumConnections(int numConnections) { + BagGenerator.numConnections = numConnections; + logger.fine("All BagGenerators will now use " + numConnections + " threads"); } + + // Inner class to hold file information before processing + public static class FileEntry implements Comparable { + final long size; + final JsonObject jsonObject; // Direct reference, not a copy + final String currentPath; // Parent directory path + final int resourceIndex; // Still need this for resourceUsed tracking + + FileEntry(long size, JsonObject jsonObject, String currentPath, int resourceIndex) { + this.size = size; + this.jsonObject = jsonObject; + this.currentPath = currentPath; + this.resourceIndex = resourceIndex; + } + + public String getDataUrl() { + return 
suppressDownloadCounts(jsonObject.get(JsonLDTerm.schemaOrg("sameAs").getLabel()).getAsString()); + } + + public String getChildTitle() { + return jsonObject.get(JsonLDTerm.schemaOrg("name").getLabel()).getAsString(); + } + + public String getChildPath(String title) { + // Build full path using stored currentPath + String childPath = currentPath + title; + JsonElement directoryLabel = jsonObject.get(JsonLDTerm.DVCore("directoryLabel").getLabel()); + if (directoryLabel != null) { + childPath = currentPath + directoryLabel.getAsString() + "/" + title; + } + return childPath; + } + private String suppressDownloadCounts(String uriString) { + // Adding gbrecs to suppress counting this access as a download (archiving is + // not a download indicating scientific use) + return uriString + (uriString.contains("?") ? "&" : "?") + "gbrecs=true"; + } + + @Override + public int compareTo(FileEntry other) { + return Long.compare(this.size, other.size); + } + + public long getSize() { + return size; + } + } } \ No newline at end of file diff --git a/src/main/java/edu/harvard/iq/dataverse/util/bagit/OREMap.java b/src/main/java/edu/harvard/iq/dataverse/util/bagit/OREMap.java index 4cbc2aa7b9a..0d99a5bddd1 100644 --- a/src/main/java/edu/harvard/iq/dataverse/util/bagit/OREMap.java +++ b/src/main/java/edu/harvard/iq/dataverse/util/bagit/OREMap.java @@ -49,7 +49,7 @@ public class OREMap { public static final String NAME = "OREMap"; //NOTE: Update this value whenever the output of this class is changed - private static final String DATAVERSE_ORE_FORMAT_VERSION = "Dataverse OREMap Format v1.0.1"; + private static final String DATAVERSE_ORE_FORMAT_VERSION = "Dataverse OREMap Format v1.0.2"; //v1.0.1 - added versionNote private static final String DATAVERSE_SOFTWARE_NAME = "Dataverse"; private static final String DATAVERSE_SOFTWARE_URL = "https://github.com/iqss/dataverse"; @@ -130,7 +130,8 @@ public JsonObjectBuilder getOREMapBuilder(boolean aggregationOnly) { 
if(vs.equals(VersionState.DEACCESSIONED)) { JsonObjectBuilder deaccBuilder = Json.createObjectBuilder(); deaccBuilder.add(JsonLDTerm.schemaOrg("name").getLabel(), vs.name()); - deaccBuilder.add(JsonLDTerm.DVCore("reason").getLabel(), version.getDeaccessionNote()); + // Reason is supposed to not be null, but historically this has not been enforced (in the API) + addIfNotNull(deaccBuilder, JsonLDTerm.DVCore("reason"), version.getDeaccessionNote()); addIfNotNull(deaccBuilder, JsonLDTerm.DVCore("forwardUrl"), version.getDeaccessionLink()); aggBuilder.add(JsonLDTerm.schemaOrg("creativeWorkStatus").getLabel(), deaccBuilder); @@ -280,7 +281,7 @@ public JsonObjectBuilder getOREMapBuilder(boolean aggregationOnly) { JsonObject checksum = null; // Add checksum. RDA recommends SHA-512 if (df.getChecksumType() != null && df.getChecksumValue() != null) { - checksum = Json.createObjectBuilder().add("@type", df.getChecksumType().toString()) + checksum = Json.createObjectBuilder().add("@type", df.getChecksumType().toUri()) .add("@value", df.getChecksumValue()).build(); aggRes.add(JsonLDTerm.checksum.getLabel(), checksum); } @@ -505,11 +506,16 @@ private static void addCvocValue(String val, JsonArrayBuilder vals, JsonObject c for (String prefix : context.keySet()) { localContext.putIfAbsent(prefix, context.getString(prefix)); } - JsonObjectBuilder job = Json.createObjectBuilder(datasetFieldService.getExternalVocabularyValue(val)); - job.add("@id", val); - JsonObject extVal = job.build(); - logger.fine("Adding: " + extVal); - vals.add(extVal); + JsonObject cachedValue = datasetFieldService.getExternalVocabularyValue(val); + if (cachedValue != null) { + JsonObjectBuilder job = Json.createObjectBuilder(cachedValue); + job.add("@id", val); + JsonObject extVal = job.build(); + logger.fine("Adding: " + extVal); + vals.add(extVal); + } else { + vals.add(val); + } } else { vals.add(val); } diff --git a/src/main/java/edu/harvard/iq/dataverse/workflow/WorkflowServiceBean.java 
b/src/main/java/edu/harvard/iq/dataverse/workflow/WorkflowServiceBean.java index ae1175f0e1d..ab9f0a94baf 100644 --- a/src/main/java/edu/harvard/iq/dataverse/workflow/WorkflowServiceBean.java +++ b/src/main/java/edu/harvard/iq/dataverse/workflow/WorkflowServiceBean.java @@ -133,8 +133,8 @@ public void start(Workflow wf, WorkflowContext ctxt, boolean findDataset) throws * (e.g. if this method is not asynchronous) * */ - - if (!findDataset) { + boolean isLocked = ctxt.getLockId()!=null; + if (!findDataset && !isLocked) { /* * Sleep here briefly to make sure the database update from the callers * transaction completes which avoids any concurrency/optimistic lock issues. @@ -152,7 +152,9 @@ public void start(Workflow wf, WorkflowContext ctxt, boolean findDataset) throws } //Refresh will only em.find the dataset if findDataset is true. (otherwise the dataset is em.merged) ctxt = refresh(ctxt, retrieveRequestedSettings( wf.getRequiredSettings()), getCurrentApiToken(ctxt.getRequest().getAuthenticatedUser()), findDataset); - lockDataset(ctxt, new DatasetLock(DatasetLock.Reason.Workflow, ctxt.getRequest().getAuthenticatedUser())); + if(!isLocked) { + lockDataset(ctxt, new DatasetLock(DatasetLock.Reason.Workflow, ctxt.getRequest().getAuthenticatedUser())); + } forward(wf, ctxt); } @@ -180,12 +182,12 @@ private Map retrieveRequestedSettings(Map requir break; } case "boolean": { - retrievedSettings.put(setting, settings.isTrue(settingType, false)); + retrievedSettings.put(setting, settings.isTrue(setting, false)); break; } case "long": { retrievedSettings.put(setting, - settings.getValueForKeyAsLong(SettingsServiceBean.Key.valueOf(setting))); + settings.getValueForKeyAsLong(SettingsServiceBean.Key.parse(setting))); break; } } @@ -290,7 +292,7 @@ private void executeSteps(Workflow wf, WorkflowContext ctxt, int initialStepIdx try { if (res == WorkflowStepResult.OK) { logger.log(Level.INFO, "Workflow {0} step {1}: OK", new Object[]{ctxt.getInvocationId(), stepIdx}); - 
em.merge(ctxt.getDataset()); + // The dataset is merged in refresh(ctxt) ctxt = refresh(ctxt); } else if (res instanceof Failure) { logger.log(Level.WARNING, "Workflow {0} failed: {1}", new Object[]{ctxt.getInvocationId(), ((Failure) res).getReason()}); diff --git a/src/main/java/edu/harvard/iq/dataverse/workflow/internalspi/ArchivalSubmissionWorkflowStep.java b/src/main/java/edu/harvard/iq/dataverse/workflow/internalspi/ArchivalSubmissionWorkflowStep.java index b0567bff107..aacaa585dd7 100644 --- a/src/main/java/edu/harvard/iq/dataverse/workflow/internalspi/ArchivalSubmissionWorkflowStep.java +++ b/src/main/java/edu/harvard/iq/dataverse/workflow/internalspi/ArchivalSubmissionWorkflowStep.java @@ -1,9 +1,14 @@ package edu.harvard.iq.dataverse.workflow.internalspi; +import edu.harvard.iq.dataverse.Dataset; +import edu.harvard.iq.dataverse.DatasetLock.Reason; +import edu.harvard.iq.dataverse.DatasetVersion; import edu.harvard.iq.dataverse.engine.command.DataverseRequest; import edu.harvard.iq.dataverse.engine.command.impl.AbstractSubmitToArchiveCommand; import edu.harvard.iq.dataverse.settings.SettingsServiceBean; import edu.harvard.iq.dataverse.util.ArchiverUtil; +import edu.harvard.iq.dataverse.util.bagit.OREMap; +import edu.harvard.iq.dataverse.util.json.JsonLDTerm; import edu.harvard.iq.dataverse.workflow.WorkflowContext; import edu.harvard.iq.dataverse.workflow.step.Failure; import edu.harvard.iq.dataverse.workflow.step.WorkflowStep; @@ -14,6 +19,7 @@ import java.util.logging.Level; import java.util.logging.Logger; +import jakarta.json.JsonObject; import jakarta.servlet.http.HttpServletRequest; /** @@ -45,11 +51,53 @@ public WorkflowStepResult run(WorkflowContext context) { } } + Dataset d = context.getDataset(); + if (d.isLockedFor(Reason.FileValidationFailed)) { + logger.severe("Dataset locked for file validation failure - will not archive"); + return new Failure("File Validation Lock", "Dataset has file validation problem - will not archive"); + } 
DataverseRequest dvr = new DataverseRequest(context.getRequest().getAuthenticatedUser(), (HttpServletRequest) null); String className = requestedSettings.get(SettingsServiceBean.Key.ArchiverClassName.toString()); AbstractSubmitToArchiveCommand archiveCommand = ArchiverUtil.createSubmitToArchiveCommand(className, dvr, context.getDataset().getReleasedVersion()); if (archiveCommand != null) { - return (archiveCommand.performArchiveSubmission(context.getDataset().getReleasedVersion(), context.getApiToken(), requestedSettings)); + // Generate the required components for archiving + DatasetVersion version = context.getDataset().getReleasedVersion(); + if (!archiveCommand.preconditionsMet(version, context.getApiToken(), requestedSettings)) { + return new Failure("Earlier versions must be successfully archived first", + "Archival prerequisites not met"); + } + + // Generate DataCite XML + String dataCiteXml = archiveCommand.getDataCiteXml(version); + + // Generate OREMap + OREMap oreMap = new OREMap(version, false); + JsonObject ore = oreMap.getOREMap(); + + // Get JSON-LD terms + Map terms = AbstractSubmitToArchiveCommand.getJsonLDTerms(oreMap); + + // Call the updated method with all required parameters + /* + * Note: because this must complete before the workflow can complete and update the version status + * in the db, a long-running archive submission via workflow could hit a transaction timeout and fail. + * The commands themselves have been updated to run archive submission outside of any transaction + * and update the status in a separate transaction, so archiving a given version that way could + * succeed where this workflow failed. + * + * Another difference when running in a workflow - this step has no way to set the archiving status to + * pending as is done when running archiving from the UI/API. Instead, there is a generic workflow + * lock on the dataset.
+ */ + return archiveCommand.performArchiveSubmission( + version, + dataCiteXml, + ore, + terms, + context.getApiToken(), + requestedSettings + ); + } else { logger.severe("No Archiver instance could be created for name: " + className); return new Failure("No Archiver", "Could not create instance of class: " + className); diff --git a/src/main/java/propertyFiles/Bundle.properties b/src/main/java/propertyFiles/Bundle.properties index 7f4518e65bd..f4e75efc472 100644 --- a/src/main/java/propertyFiles/Bundle.properties +++ b/src/main/java/propertyFiles/Bundle.properties @@ -1651,7 +1651,7 @@ dataset.share.datasetShare=Share Dataset dataset.share.datasetShare.tip=Share this dataset on your favorite social media networks. dataset.share.datasetShare.shareText=View this dataset. dataset.locked.message=Dataset Locked -dataset.locked.message.details=This dataset is locked until publication. +dataset.locked.message.details=This dataset is temporarily locked while background processing related to publication completes. dataset.locked.inReview.message=Submitted for Review dataset.locked.ingest.message=The tabular data files uploaded are being processed and converted into the archival format dataset.unlocked.ingest.message=The tabular files have been ingested. @@ -1682,7 +1682,6 @@ dataset.compute.computeBatchListHeader=Compute Batch dataset.compute.computeBatchRestricted=This dataset contains restricted files you may not compute on because you have not been granted access. dataset.delete.error=Could not deaccession the dataset because the {0} update failed. dataset.publish.workflow.message=Publish in Progress -dataset.publish.workflow.inprogress=This dataset is locked until publication. dataset.pidRegister.workflow.inprogress=The dataset is locked while the persistent identifiers are being registered or updated, and/or the physical files are being validated. 
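For reference, the `fetch.txt` entries written by `addToFetchFile` in the BagGenerator changes above follow the BagIt layout (RFC 8493, section 2.2.3): one URL, length, and bag-relative path per CRLF-terminated line. A minimal standalone sketch of that format — the class and method names here are illustrative, not part of the patch, which accumulates entries in a `StringBuffer` field:

```java
// Sketch of the fetch.txt line format used for "holey" bags:
// "URL LENGTH FILEPATH", one CRLF-terminated entry per oversized file.
public class FetchFileSketch {
    static final String CRLF = "\r\n";

    // Build one fetch.txt entry; an archiving service later retrieves the
    // file from the URL and places it at the given bag-relative path
    static String fetchLine(String url, long size, String bagPath) {
        return url + " " + size + " " + bagPath + CRLF;
    }

    public static void main(String[] args) {
        // gbrecs=true mirrors the patch's suppression of download counting
        System.out.print(fetchLine(
                "https://demo.dataverse.org/api/access/datafile/42?gbrecs=true",
                123456789L,
                "data/bigfile.tab"));
    }
}
```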
dataset.versionUI.draft=Draft dataset.versionUI.inReview=In Review @@ -2144,6 +2143,7 @@ file.dataFilesTab.versions.headers.contributors.withheld=Contributor name(s) wit file.dataFilesTab.versions.headers.published=Published on file.dataFilesTab.versions.headers.archived=Archival Status file.dataFilesTab.versions.headers.archived.success=Archived +file.dataFilesTab.versions.headers.archived.obsolete=Original Version Archived file.dataFilesTab.versions.headers.archived.pending=Pending file.dataFilesTab.versions.headers.archived.failure=Failed file.dataFilesTab.versions.headers.archived.notarchived=Not Archived @@ -2702,6 +2702,7 @@ dataset.notlinked.msg=There was a problem linking this dataset to yours: dataset.linking.popop.already.linked.note=Note: This dataset is already linked to the following dataverse(s): dataset.linking.popup.not.linked.note=Note: This dataset is not linked to any of your accessible dataverses datasetversion.archive.success=Archival copy of Version successfully submitted +datasetversion.archive.inprogress= Data Project archiving has been started datasetversion.archive.failure=Error in submitting an archival copy datasetversion.update.failure=Dataset Version Update failed. Changes are still in the DRAFT version. datasetversion.update.archive.failure=Dataset Version Update succeeded, but the attempt to update the archival copy failed. 
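The retry delay in `getInputStreamSupplier` above (exponential backoff with jitter, capped) can be sketched in isolation. The constants mirror the comments in the patch (1-second base, 30-second cap); the class name and the explicit `jitterFraction` parameter are illustrative so the math is testable — the real code draws jitter from `Math.random()` in the range 0-30%:

```java
// Sketch of the bagging retry backoff: wait = 2^attempt * base, plus
// 0-30% jitter, capped at a maximum (cap applied after jitter, as in the patch).
public class BackoffSketch {
    static final long BASE_WAIT_MS = 1000;  // 1 second base
    static final long MAX_WAIT_MS = 30000;  // 30 second cap

    // jitterFraction in [0.0, 0.3]; the real code uses Math.random() * 0.3
    static long backoffMs(int attempt, double jitterFraction) {
        long wait = (long) (Math.pow(2, attempt) * BASE_WAIT_MS);
        wait += (long) (wait * jitterFraction);
        return Math.min(wait, MAX_WAIT_MS);
    }

    public static void main(String[] args) {
        for (int attempt = 1; attempt <= 5; attempt++) {
            System.out.println("attempt " + attempt + ": "
                    + backoffMs(attempt, 0.0) + "ms");
        }
    }
}
```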
diff --git a/src/main/webapp/dataset-versions.xhtml b/src/main/webapp/dataset-versions.xhtml index 9e5f0a9b24d..df5a39c09b7 100644 --- a/src/main/webapp/dataset-versions.xhtml +++ b/src/main/webapp/dataset-versions.xhtml @@ -169,13 +169,20 @@ + + + + + + - + + - + diff --git a/src/main/webapp/resources/css/structure.css b/src/main/webapp/resources/css/structure.css index cd2e7d33d10..27cb0d7e8bf 100644 --- a/src/main/webapp/resources/css/structure.css +++ b/src/main/webapp/resources/css/structure.css @@ -936,6 +936,9 @@ div.dvnDifferanceTable .versionValue { } div[id$="versionsTable"] tbody {word-break:break-word;} +.archive-submit-link { + display: block; +} /* DATATABLE + DROPDOWN BUTTON + OVERFLOW VISIBLE */ thead.ui-datatable-scrollable-theadclone {display:none} diff --git a/src/test/java/edu/harvard/iq/dataverse/api/BagIT.java b/src/test/java/edu/harvard/iq/dataverse/api/BagIT.java index 16c44003f35..b649ad6bb95 100644 --- a/src/test/java/edu/harvard/iq/dataverse/api/BagIT.java +++ b/src/test/java/edu/harvard/iq/dataverse/api/BagIT.java @@ -87,7 +87,7 @@ public void testBagItExport() throws IOException { .replace('.', '-').toLowerCase(); // spacename: doi-10-5072-fk2-fosg5q - String pathToZip = bagitExportDir + "/" + spaceName + "v1.0" + ".zip"; + String pathToZip = bagitExportDir + "/" + spaceName + ".v1.0" + ".zip"; try { // give the bag time to generate diff --git a/src/test/java/edu/harvard/iq/dataverse/api/SwordIT.java b/src/test/java/edu/harvard/iq/dataverse/api/SwordIT.java index 709908ac6eb..22dfe61da07 100644 --- a/src/test/java/edu/harvard/iq/dataverse/api/SwordIT.java +++ b/src/test/java/edu/harvard/iq/dataverse/api/SwordIT.java @@ -954,7 +954,8 @@ public void testDeleteFiles() { reindexDataset4ToFindDatabaseId.then().assertThat() .statusCode(OK.getStatusCode()); Integer datasetId4 = JsonPath.from(reindexDataset4ToFindDatabaseId.asString()).getInt("data.id"); - + UtilIT.sleepForReindex(datasetPersistentId4, apiToken, 5); + Response destroyDataset4 = 
UtilIT.destroyDataset(datasetId4, apiToken); destroyDataset4.prettyPrint(); destroyDataset4.then().assertThat() diff --git a/src/test/java/edu/harvard/iq/dataverse/util/bagit/BagGeneratorInfoFileTest.java b/src/test/java/edu/harvard/iq/dataverse/util/bagit/BagGeneratorInfoFileTest.java new file mode 100644 index 00000000000..05e83b8540d --- /dev/null +++ b/src/test/java/edu/harvard/iq/dataverse/util/bagit/BagGeneratorInfoFileTest.java @@ -0,0 +1,296 @@ + +package edu.harvard.iq.dataverse.util.bagit; + +import edu.harvard.iq.dataverse.engine.command.impl.AbstractSubmitToArchiveCommand; +import edu.harvard.iq.dataverse.util.json.JsonLDTerm; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.Test; +import org.mockito.Mock; +import org.mockito.MockitoAnnotations; + +import com.google.gson.JsonParser; + +import jakarta.json.Json; +import jakarta.json.JsonArrayBuilder; +import jakarta.json.JsonObject; +import jakarta.json.JsonObjectBuilder; + +import java.lang.reflect.Field; +import java.lang.reflect.Method; + +import static org.junit.jupiter.api.Assertions.*; +import static org.mockito.Mockito.*; + +public class BagGeneratorInfoFileTest { + + private BagGenerator bagGenerator; + private JsonObjectBuilder testAggregationBuilder; + + @Mock + private OREMap mockOreMap; + + @BeforeEach + public void setUp() throws Exception { + MockitoAnnotations.openMocks(this); + + // Create base test aggregation builder with required fields + testAggregationBuilder = Json.createObjectBuilder(); + testAggregationBuilder.add("@id", "doi:10.5072/FK2/TEST123"); + testAggregationBuilder.add(JsonLDTerm.schemaOrg("name").getLabel(), "Test Dataset"); + testAggregationBuilder.add(JsonLDTerm.schemaOrg("includedInDataCatalog").getLabel(), "Test Catalog"); + } + + /** + * Helper method to finalize the aggregation and create the BagGenerator + */ + private void initializeBagGenerator() throws Exception { + JsonObject testAggregation = testAggregationBuilder.build(); + + 
JsonObjectBuilder oremapJsonBuilder = Json.createObjectBuilder(); + oremapJsonBuilder.add(JsonLDTerm.ore("describes").getLabel(), testAggregation); + JsonObject oremapObject = oremapJsonBuilder.build(); + // Mock the OREMap.getOREMap() method to return the built JSON + when(mockOreMap.getOREMap()).thenReturn(oremapObject); + + // Initialize BagGenerator with test data + bagGenerator = new BagGenerator(oremapObject, "", AbstractSubmitToArchiveCommand.getJsonLDTerms(mockOreMap)); + setPrivateField(bagGenerator, "aggregation", (com.google.gson.JsonObject) JsonParser + .parseString(oremapObject.getJsonObject(JsonLDTerm.ore("describes").getLabel()).toString())); + setPrivateField(bagGenerator, "totalDataSize", 1024000L); + setPrivateField(bagGenerator, "dataCount", 10L); + } + + @Test + public void testGenerateInfoFileWithSingleContact() throws Exception { + // Arrange + JsonLDTerm contactTerm = JsonLDTerm.schemaOrg("creator"); + JsonLDTerm contactNameTerm = JsonLDTerm.schemaOrg("name"); + JsonLDTerm contactEmailTerm = JsonLDTerm.schemaOrg("email"); + + when(mockOreMap.getContactTerm()).thenReturn(contactTerm); + when(mockOreMap.getContactNameTerm()).thenReturn(contactNameTerm); + when(mockOreMap.getContactEmailTerm()).thenReturn(contactEmailTerm); + + JsonObjectBuilder contactBuilder = Json.createObjectBuilder(); + contactBuilder.add(contactNameTerm.getLabel(), "John Doe"); + contactBuilder.add(contactEmailTerm.getLabel(), "john.doe@example.com"); + testAggregationBuilder.add(contactTerm.getLabel(), contactBuilder); + + initializeBagGenerator(); + + // Act + String infoFile = invokeGenerateInfoFile(); + + // Assert + assertNotNull(infoFile); + assertTrue(infoFile.contains("Contact-Name: John Doe")); + assertTrue(infoFile.contains("Contact-Email: john.doe@example.com")); + } + + @Test + public void testGenerateInfoFileWithMultipleContacts() throws Exception { + // Arrange + JsonLDTerm contactTerm = JsonLDTerm.schemaOrg("creator"); + JsonLDTerm contactNameTerm = 
JsonLDTerm.schemaOrg("name"); + JsonLDTerm contactEmailTerm = JsonLDTerm.schemaOrg("email"); + + when(mockOreMap.getContactTerm()).thenReturn(contactTerm); + when(mockOreMap.getContactNameTerm()).thenReturn(contactNameTerm); + when(mockOreMap.getContactEmailTerm()).thenReturn(contactEmailTerm); + + JsonArrayBuilder contactsBuilder = Json.createArrayBuilder(); + + JsonObjectBuilder contact1 = Json.createObjectBuilder(); + contact1.add(contactNameTerm.getLabel(), "John Doe"); + contact1.add(contactEmailTerm.getLabel(), "john.doe@example.com"); + + JsonObjectBuilder contact2 = Json.createObjectBuilder(); + contact2.add(contactNameTerm.getLabel(), "Jane Smith"); + contact2.add(contactEmailTerm.getLabel(), "jane.smith@example.com"); + + JsonObjectBuilder contact3 = Json.createObjectBuilder(); + contact3.add(contactNameTerm.getLabel(), "Bob Johnson"); + contact3.add(contactEmailTerm.getLabel(), "bob.johnson@example.com"); + + contactsBuilder.add(contact1); + contactsBuilder.add(contact2); + contactsBuilder.add(contact3); + + testAggregationBuilder.add(contactTerm.getLabel(), contactsBuilder); + + initializeBagGenerator(); + + // Act + String infoFile = invokeGenerateInfoFile(); + + // Assert + assertNotNull(infoFile); + assertTrue(infoFile.contains("Contact-Name: John Doe")); + assertTrue(infoFile.contains("Contact-Email: john.doe@example.com")); + assertTrue(infoFile.contains("Contact-Name: Jane Smith")); + assertTrue(infoFile.contains("Contact-Email: jane.smith@example.com")); + assertTrue(infoFile.contains("Contact-Name: Bob Johnson")); + assertTrue(infoFile.contains("Contact-Email: bob.johnson@example.com")); + } + + @Test + public void testGenerateInfoFileWithSingleDescription() throws Exception { + // Arrange + JsonLDTerm descriptionTerm = JsonLDTerm.schemaOrg("description"); + JsonLDTerm descriptionTextTerm = JsonLDTerm.schemaOrg("value"); + + when(mockOreMap.getDescriptionTerm()).thenReturn(descriptionTerm); + 
when(mockOreMap.getDescriptionTextTerm()).thenReturn(descriptionTextTerm); + + JsonObjectBuilder descriptionBuilder = Json.createObjectBuilder(); + descriptionBuilder.add(descriptionTextTerm.getLabel(), "This is a test dataset description."); + testAggregationBuilder.add(descriptionTerm.getLabel(), descriptionBuilder); + + initializeBagGenerator(); + + // Act + String infoFile = invokeGenerateInfoFile(); + + // Assert + assertNotNull(infoFile); + assertTrue(infoFile.contains("External-Description: This is a test dataset description.")); + } + + @Test + public void testGenerateInfoFileWithMultipleDescriptions() throws Exception { + // Arrange + JsonLDTerm descriptionTerm = JsonLDTerm.schemaOrg("description"); + JsonLDTerm descriptionTextTerm = JsonLDTerm.schemaOrg("value"); + + when(mockOreMap.getDescriptionTerm()).thenReturn(descriptionTerm); + when(mockOreMap.getDescriptionTextTerm()).thenReturn(descriptionTextTerm); + + JsonArrayBuilder descriptionsBuilder = Json.createArrayBuilder(); + + JsonObjectBuilder desc1 = Json.createObjectBuilder(); + desc1.add(descriptionTextTerm.getLabel(), "First description of the dataset."); + + JsonObjectBuilder desc2 = Json.createObjectBuilder(); + desc2.add(descriptionTextTerm.getLabel(), "Second description with additional details."); + + JsonObjectBuilder desc3 = Json.createObjectBuilder(); + desc3.add(descriptionTextTerm.getLabel(), "Third description for completeness."); + + descriptionsBuilder.add(desc1); + descriptionsBuilder.add(desc2); + descriptionsBuilder.add(desc3); + + testAggregationBuilder.add(descriptionTerm.getLabel(), descriptionsBuilder); + + initializeBagGenerator(); + + // Act + String infoFile = invokeGenerateInfoFile(); + // Assert + assertNotNull(infoFile); + // Multiple descriptions should be concatenated with commas as per getSingleValue method + assertTrue(infoFile.contains("External-Description: First description of the dataset.,Second description with\r\n additional details.,Third description for 
completeness.")); + } + + @Test + public void testGenerateInfoFileWithRequiredFields() throws Exception { + // Arrange - minimal setup with required fields already in setUp() + JsonLDTerm contactTerm = JsonLDTerm.schemaOrg("creator"); + JsonLDTerm contactNameTerm = JsonLDTerm.schemaOrg("name"); + JsonLDTerm descriptionTerm = JsonLDTerm.schemaOrg("description"); + JsonLDTerm descriptionTextTerm = JsonLDTerm.schemaOrg("value"); + + when(mockOreMap.getContactTerm()).thenReturn(contactTerm); + when(mockOreMap.getContactNameTerm()).thenReturn(contactNameTerm); + when(mockOreMap.getContactEmailTerm()).thenReturn(null); + when(mockOreMap.getDescriptionTerm()).thenReturn(descriptionTerm); + when(mockOreMap.getDescriptionTextTerm()).thenReturn(descriptionTextTerm); + + JsonObjectBuilder contactBuilder = Json.createObjectBuilder(); + contactBuilder.add(contactNameTerm.getLabel(), "Test Contact"); + testAggregationBuilder.add(contactTerm.getLabel(), contactBuilder); + + JsonObjectBuilder descriptionBuilder = Json.createObjectBuilder(); + descriptionBuilder.add(descriptionTextTerm.getLabel(), "Test description"); + testAggregationBuilder.add(descriptionTerm.getLabel(), descriptionBuilder); + + initializeBagGenerator(); + + // Act + String infoFile = invokeGenerateInfoFile(); + + // Assert + assertNotNull(infoFile); + assertTrue(infoFile.contains("Contact-Name: Test Contact")); + assertTrue(infoFile.contains("External-Description: Test description")); + assertTrue(infoFile.contains("Source-Organization:")); + assertTrue(infoFile.contains("Organization-Address:")); + assertTrue(infoFile.contains("Organization-Email:")); + assertTrue(infoFile.contains("Bagging-Date:")); + assertTrue(infoFile.contains("External-Identifier: doi:10.5072/FK2/TEST123")); + assertTrue(infoFile.contains("Bag-Size:")); + assertTrue(infoFile.contains("Payload-Oxum: 1024000.10")); + assertTrue(infoFile.contains("Internal-Sender-Identifier: Test Catalog:Test Dataset")); + } + + @Test + public void 
testGenerateInfoFileWithDifferentBagSizes() throws Exception { + // Arrange + JsonLDTerm contactTerm = JsonLDTerm.schemaOrg("creator"); + when(mockOreMap.getContactTerm()).thenReturn(contactTerm); + when(mockOreMap.getContactNameTerm()).thenReturn(null); + when(mockOreMap.getContactEmailTerm()).thenReturn(null); + when(mockOreMap.getDescriptionTerm()).thenReturn(null); + + initializeBagGenerator(); + + // Test with bytes + setPrivateField(bagGenerator, "totalDataSize", 512L); + setPrivateField(bagGenerator, "dataCount", 5L); + String infoFile1 = invokeGenerateInfoFile(); + assertTrue(infoFile1.contains("Bag-Size: 512 bytes")); + assertTrue(infoFile1.contains("Payload-Oxum: 512.5")); + + // Test with KB + setPrivateField(bagGenerator, "totalDataSize", 2048L); + setPrivateField(bagGenerator, "dataCount", 3L); + String infoFile2 = invokeGenerateInfoFile(); + assertTrue(infoFile2.contains("Bag-Size: 2.05 KB")); + assertTrue(infoFile2.contains("Payload-Oxum: 2048.3")); + + // Test with MB + setPrivateField(bagGenerator, "totalDataSize", 5242880L); + setPrivateField(bagGenerator, "dataCount", 100L); + String infoFile3 = invokeGenerateInfoFile(); + assertTrue(infoFile3.contains("Bag-Size: 5.24 MB")); + assertTrue(infoFile3.contains("Payload-Oxum: 5242880.100")); + + // Test with GB + setPrivateField(bagGenerator, "totalDataSize", 2147483648L); + setPrivateField(bagGenerator, "dataCount", 1000L); + + String infoFile4 = invokeGenerateInfoFile(); + assertTrue(infoFile4.contains("Bag-Size: 2.15 GB")); + assertTrue(infoFile4.contains("Payload-Oxum: 2147483648.1000")); + } + + // Helper methods + + /** + * Invokes the private generateInfoFile method using reflection + */ + private String invokeGenerateInfoFile() throws Exception { + Method method = BagGenerator.class.getDeclaredMethod("generateInfoFile"); + method.setAccessible(true); + return (String) method.invoke(bagGenerator); + } + + /** + * Sets a private field value using reflection + */ + private void 
setPrivateField(Object target, String fieldName, Object value) throws Exception { + Field field = BagGenerator.class.getDeclaredField(fieldName); + field.setAccessible(true); + field.set(target, value); + } +} \ No newline at end of file diff --git a/src/test/java/edu/harvard/iq/dataverse/util/bagit/BagGeneratorMultilineWrapTest.java b/src/test/java/edu/harvard/iq/dataverse/util/bagit/BagGeneratorMultilineWrapTest.java new file mode 100644 index 00000000000..19d478f4b0d --- /dev/null +++ b/src/test/java/edu/harvard/iq/dataverse/util/bagit/BagGeneratorMultilineWrapTest.java @@ -0,0 +1,160 @@ + +package edu.harvard.iq.dataverse.util.bagit; + +import static org.assertj.core.api.Assertions.assertThat; + +import java.lang.reflect.InvocationTargetException; +import java.lang.reflect.Method; + +import org.junit.jupiter.api.BeforeAll; +import org.junit.jupiter.api.Test; + +/** + * Tests adapted for DD-2093: verify the behavior of BagGenerator.multilineWrap. + */ +public class BagGeneratorMultilineWrapTest { + + private static Method multilineWrap; + + @BeforeAll + static void setUp() throws NoSuchMethodException { + // Access the private static method via reflection + multilineWrap = BagGenerator.class.getDeclaredMethod("multilineWrap", String.class); + multilineWrap.setAccessible(true); + } + + private String callMultilineWrap(String input) { + try { + return (String) multilineWrap.invoke(null, input); + } catch (IllegalAccessException | InvocationTargetException e) { + throw new RuntimeException(e); + } + } + + @Test + void shortLine_noWrap() { + String input = "Hello world"; + String out = callMultilineWrap(input); + assertThat(out).isEqualTo("Hello world"); + } + + @Test + void exactBoundary_78chars_noWrap() { + String input = "a".repeat(78); + String out = callMultilineWrap(input); + assertThat(out).isEqualTo(input); + } + + @Test + void longSingleWord_wrapsAt79WithIndent() { + String input = "a".repeat(100); + String expected = "a".repeat(79) + "\r\n " + 
"a".repeat(21); + String out = callMultilineWrap(input); + assertThat(out).isEqualTo(expected); + } + + @Test + void multiline_input_indentsSecondAndSubsequentOriginalLines() { + String input = "Line1\nLine2\nLine3"; + String expected = "Line1\r\n Line2\r\n Line3"; + String out = callMultilineWrap(input); + assertThat(out).isEqualTo(expected); + } + + @Test + void multiline_withLF_normalizedAndIndented() { + String input = "a".repeat(200); + String expected = "a".repeat(79) + "\r\n " + "a".repeat(78) + "\r\n " + "a".repeat(43); + String out = callMultilineWrap(input); + assertThat(out).isEqualTo(expected); + } + + @Test + void emptyLines_trimmedAndSkipped() { + String input = "Line1\n\nLine3"; + String expected = "Line1\r\n Line3"; + String out = callMultilineWrap(input); + assertThat(out).isEqualTo(expected); + } + + @Test + void whitespaceOnlyLines_ignored() { + String input = "Line1\n \n\t\t\nLine3"; + String expected = "Line1\r\n Line3"; + String out = callMultilineWrap(input); + assertThat(out).isEqualTo(expected); + } + + @Test + void longSecondLine_preservesIndentOnWraps() { + String line1 = "Header"; + String line2 = "b".repeat(90); + String input = line1 + "\n" + line2; + String expected = "Header\r\n " + "b".repeat(79) + "\r\n " + "b".repeat(11); + String out = callMultilineWrap(input); + assertThat(out).isEqualTo(expected); + } + + @Test + void labelLength_reducesFirstLineMaxLength() { + // With a label of length 20, first line should wrap at 78-20=58 chars + String label = "l".repeat(20); + String input = label + "a".repeat(150); + // First line: 58 chars, subsequent lines: 78 + String expected = label + "a".repeat(59) + "\r\n " + "a".repeat(78) + "\r\n " + "a".repeat(13); + String out = callMultilineWrap(input); + assertThat(out).isEqualTo(expected); + } + + @Test + void labelLength_zero_behavesAsDefault() { + String input = "a".repeat(100); + String expected = "a".repeat(79) + "\r\n " + "a".repeat(21); + String out = callMultilineWrap(input); + 
assertThat(out).isEqualTo(expected); + } + + @Test + void labelLength_withMultipleLines_onlyAffectsFirstLine() { + String label = "l".repeat(15); + String input = label + "a".repeat(100) + "\nSecond line content"; + // First line wraps at 79-15=64, then continues at 78 per line + // Second line starts fresh and wraps normally + String expected = label + "a".repeat(64) + "\r\n " + "a".repeat(36) + "\r\n Second line content"; + String out = callMultilineWrap(input); + assertThat(out).isEqualTo(expected); + } + + @Test + void wrapsAtWordBoundary_notMidWord() { + // Create a string with a word boundary at position 75 + // "a" repeated 75 times, then a space, then more characters + String input = "a".repeat(75) + " " + "b".repeat(20); + // Should wrap at the space (position 75), not at position 79 + String expected = "a".repeat(75) + "\r\n " + "b".repeat(20); + String out = callMultilineWrap(input); + assertThat(out).isEqualTo(expected); + } + + @Test + void wrapsAtWordBoundary_multipleSpaces() { + // Test with word boundary closer to the limit + String input = "a".repeat(70) + " word " + "b".repeat(20); + // Should wrap after "word" (at position 76) + String expected = "a".repeat(70) + " word\r\n " + "b".repeat(20); + String out = callMultilineWrap(input); + assertThat(out).isEqualTo(expected); + } + + @Test + void wrapsAtWordBoundary_withLabelLength() { + String label = "l".repeat(20); + // With a 20-char label, the first physical line holds 79 chars total (59 content chars) + // Create string with word boundary at position 55 + String input = label + "a".repeat(55) + " " + "b".repeat(30); + // Should wrap at the space (position 55) + String expected = label + "a".repeat(55) + "\r\n " + "b".repeat(30); + String out = callMultilineWrap(input); + assertThat(out).isEqualTo(expected); + } +} \ No newline at end of file
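For reviewers, the wrapping contract that `BagGeneratorMultilineWrapTest` pins down can be summarized in a short sketch. This is a Python approximation inferred only from the expected values in the tests above (first chunk of each input line up to 79 characters, wrapped continuations up to 78 characters plus a one-space indent, wrapping at the last space that fits, blank and whitespace-only lines dropped, chunks joined with CRLF); the actual Java `multilineWrap` may differ in details these tests do not cover.

```python
def multiline_wrap(text: str) -> str:
    """Approximation of BagGenerator.multilineWrap as characterized by the tests.

    Each logical input line (split on "\n") is trimmed; empty lines are dropped.
    The first chunk of a logical line holds up to 79 chars; wrapped continuations
    hold up to 78. Wrapping prefers the last space that fits, and all chunks after
    the first are joined with CRLF plus a one-space indent.
    """
    pieces = []
    for logical in text.split("\n"):
        line = logical.strip()
        if not line:
            continue  # empty / whitespace-only input lines are skipped
        first = True
        while line:
            limit = 79 if first else 78
            first = False
            if len(line) <= limit:
                pieces.append(line)
                break
            cut = line.rfind(" ", 0, limit + 1)  # last space that still fits
            if cut <= 0:
                pieces.append(line[:limit])      # hard wrap mid-word
                line = line[limit:]
            else:
                pieces.append(line[:cut])        # wrap at the word boundary
                line = line[cut + 1:]            # the wrapping space is consumed
    return "\r\n ".join(pieces)
```

Against the expectations above, `multiline_wrap("a" * 200)` yields three physical lines of 79, 78, and 43 characters, matching `longSingleWord` wrapping, and a second logical line longer than 79 characters wraps as 79 + remainder, matching `longSecondLine_preservesIndentOnWraps`.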
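Similarly, the `Bag-Size` and `Payload-Oxum` expectations in `BagGeneratorInfoFileTest` imply decimal (SI) units with two decimal places, and a plain `<octet-count>.<file-count>` concatenation rather than a decimal number. A hedged sketch of those two bag-info.txt fields (the names `bag_size` and `payload_oxum` are illustrative, not the Java method names):

```python
def bag_size(total_bytes: int) -> str:
    """Human-readable size using decimal (SI) units, matching the test values
    (e.g. 2048 bytes -> "2.05 KB", 5242880 -> "5.24 MB", 2147483648 -> "2.15 GB")."""
    for unit, divisor in (("GB", 10**9), ("MB", 10**6), ("KB", 10**3)):
        if total_bytes >= divisor:
            return f"{total_bytes / divisor:.2f} {unit}"
    return f"{total_bytes} bytes"


def payload_oxum(total_bytes: int, file_count: int) -> str:
    """BagIt Payload-Oxum: octet count and stream (file) count joined by a dot.
    This is string concatenation, not arithmetic: 1024000 bytes and 10 files
    produce "1024000.10", as asserted in testGenerateInfoFileWithRequiredFields."""
    return f"{total_bytes}.{file_count}"
```

The dot-joined Payload-Oxum form is defined by the BagIt specification (RFC 8493), which is why the tests assert on the literal string rather than a rounded value.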