113 commits
c9f728b
add checksum URI values and methods
qqmyers Dec 6, 2025
a25e47b
update version and use checksum URIs
qqmyers Dec 6, 2025
6c0cb49
handle multiline descriptions and org names
qqmyers Dec 6, 2025
7a34db8
drop blank lines in multiline values
qqmyers Dec 9, 2025
b0daad7
remove title as a folder
qqmyers Dec 9, 2025
e5457a8
handle null deaccession reason
qqmyers Dec 9, 2025
10b0556
use static to simplify testing
qqmyers Dec 10, 2025
d6cf1e2
Merge remote-tracking branch 'IQSS/develop' into OREBag1.0.2
qqmyers Dec 10, 2025
6d24185
Sanitize/split multiline catalog entry, add Dataverse-Bag-Version
qqmyers Dec 10, 2025
c4daf28
Added unit tests for multilineWrap
janvanmansum Dec 11, 2025
e76bc91
Removed unnecessary repeat helper method
janvanmansum Dec 11, 2025
108c912
Alined test names with actual test being done
janvanmansum Dec 11, 2025
62ea9d9
Merge pull request #48 from janvanmansum/OREBag1.0.2-amend
qqmyers Dec 11, 2025
884b81b
DD-2098 - allow archivalstatus calls on deaccessioned versions
qqmyers Dec 16, 2025
5e4e90a
Merge remote-tracking branch 'IQSS/develop' into OREBag1.0.2
qqmyers Dec 16, 2025
3076d69
set array properly
qqmyers Dec 17, 2025
cbdc15f
Merge remote-tracking branch 'IQSS/develop' into OREBag1.0.2
qqmyers Dec 19, 2025
1a7dafa
DD-2212 - use configured checksum when no files are present
qqmyers Dec 19, 2025
7eea57c
Revert "DD-2098 - allow archivalstatus calls on deaccessioned versions"
qqmyers Dec 19, 2025
2477cf9
add Source-Org as a potential multiline case, remove change to Int Id
qqmyers Dec 19, 2025
3f3908f
release note
qqmyers Dec 19, 2025
aa44c08
use constants, pass labelLength to wrapping, start custom lineWrap
qqmyers Dec 19, 2025
8227edf
update to handle overall 79 char length
qqmyers Dec 19, 2025
d0749fc
wrap any other potentially long values
qqmyers Dec 19, 2025
24a625f
cleanup deprecated code, auto-gen comments
qqmyers Dec 19, 2025
bf036f3
update comment
qqmyers Dec 22, 2025
be65611
add tests
qqmyers Dec 22, 2025
2516cf4
Merge remote-tracking branch 'IQSS/develop' into OREBag1.0.2
qqmyers Dec 22, 2025
24d098a
QDR updates to apache 5, better fault tolerance for file retrieval
qqmyers Dec 22, 2025
b4a3799
release note update
qqmyers Dec 22, 2025
f8f7739
initial impl
qqmyers Jan 7, 2026
5bd6f8d
fix requestedSettings handling
qqmyers Jan 8, 2026
4aaf6ca
efficiency improvement
qqmyers Jan 8, 2026
7cdef81
QDR fixes transx timeout, ignored bag thread setting, add deletable
qqmyers Jan 8, 2026
85a5239
Merge branch 'develop' into OREBag1.0.2
qqmyers Jan 16, 2026
ce974dc
create lock in finalize
qqmyers Jan 19, 2026
900033c
add lock before workflow in publish and api
qqmyers Jan 20, 2026
d1be22e
use update context
qqmyers Jan 21, 2026
e6426c9
move post wf to onSuccess
qqmyers Jan 21, 2026
baaa1db
prepub wf in onSuccess
qqmyers Jan 21, 2026
f64a80e
Cleanup, fix generic message for pre and post pub wf
qqmyers Jan 21, 2026
6e11382
handle possible OLE
qqmyers Jan 21, 2026
709b4da
try async command for archiving
qqmyers Nov 24, 2025
6487c14
save status
qqmyers Nov 24, 2025
9d32051
refactor, use persistArchivalCopyLocation everywhere
qqmyers Jan 8, 2026
ec5046c
catch OLE when persisting archivalcopylocation
qqmyers Jan 12, 2026
c1055b8
Add obsolete state, update display, add supportsDelete
qqmyers Nov 25, 2025
f912fd0
doc that api doesn't handls supportsDelete yet
qqmyers Nov 25, 2025
00f115e
support reflective and instance calls re: delete capability
qqmyers Nov 25, 2025
bc40370
use query to update status, async everywhere
qqmyers Dec 10, 2025
df9b5ce
fixes for dataset page re: archiving
qqmyers Dec 12, 2025
a64e1f7
merge issues
qqmyers Jan 16, 2026
c55230e
merge fix of persistArchivalCopy method refactors
qqmyers Jan 21, 2026
905570a
add flag, docs
qqmyers Jan 22, 2026
521fbf6
add delete to local and S3
qqmyers Jan 22, 2026
ba04ba2
fix doc ref
qqmyers Jan 27, 2026
7a18669
remove errant : char
qqmyers Jan 27, 2026
ae91b78
no transaction time limit during bagging from command (not workflow)
qqmyers Jan 23, 2026
d2a25c3
use new transaction to start
qqmyers Jan 24, 2026
a45b76b
typo
qqmyers Jan 24, 2026
a4c583e
Use pending, use JSON
qqmyers Jan 24, 2026
305f7e3
merge fix of persistArchivalCopy method refactors
qqmyers Jan 21, 2026
d2282d9
combined release note
qqmyers Jan 28, 2026
609e2b5
Merge remote-tracking branch 'IQSS/develop' into Arch1-createWFLocksE…
qqmyers Jan 28, 2026
236fca4
missed change to static
qqmyers Jan 28, 2026
e461415
Merge remote-tracking branch 'IQSS/develop' into OREBag1.0.2
qqmyers Jan 28, 2026
1b42978
suppress counting file retrieval to bag as a download in gb table
qqmyers Jan 28, 2026
56de8cb
Merge branch 'OREBag1.0.2' of https://github.com/GlobalDataverseCommu…
qqmyers Jan 28, 2026
3083179
Merge remote-tracking branch 'IQSS/develop' into OREBag1.0.2
qqmyers Jan 29, 2026
67e01e0
archival submit fix - per version cache
qqmyers Dec 16, 2025
50e8c61
Add check to display submit button only if prior versions are archvd
qqmyers Jan 29, 2026
74a73fb
Merge remote-tracking branch 'IQSS/develop' into DANS-2097
qqmyers Jan 29, 2026
0642897
setting name tweak, add docs, release note
qqmyers Jan 29, 2026
ca0af05
simplify
qqmyers Jan 29, 2026
49f4818
basic fetch
qqmyers Jan 30, 2026
7f5179f
order by file size
qqmyers Jan 30, 2026
bc63285
only add subcollection folders (if they exist)
qqmyers Jan 30, 2026
59f3a2a
replace deprecated constructs
qqmyers Jan 30, 2026
69c9a0d
restore name collision check
qqmyers Jan 30, 2026
422435a
add null check to quiet log/avoid exception
qqmyers Jan 30, 2026
d9cfe1d
cleanup - checksum change
qqmyers Jan 30, 2026
4895f80
cleanup, suppress downloads with gbrec for fetch file
qqmyers Jan 30, 2026
62a03b2
add setting, refactor, for non-holey option
qqmyers Feb 1, 2026
637b2e3
Update to track non-zipped files, add method
qqmyers Feb 4, 2026
a6b0505
reuse stream supplier, update archivers to send oversized files
qqmyers Feb 4, 2026
5739e35
docs, release note update
qqmyers Feb 4, 2026
5c82ab8
style fix
qqmyers Feb 4, 2026
b0be6a1
Merge remote-tracking branch 'IQSS/develop' into DANS-2157_holey_bags3
qqmyers Feb 10, 2026
61f6d1b
Merge branch 'DANS-2097' into DANS-QDR-merged_bag_changes_for_QA
qqmyers Feb 16, 2026
8c85f98
Merge remote-tracking branch 'QDR/Arch6-archive_outside_transaction'
qqmyers Feb 17, 2026
949606b
merge fixes - refactor precondition check for prior versions
qqmyers Feb 17, 2026
de9ed31
test fix
qqmyers Feb 17, 2026
ee87ab5
style fix to separate submit button from status
qqmyers Feb 17, 2026
9840648
missing param
qqmyers Feb 17, 2026
6911fa7
Merge remote-tracking branch 'QDR/Arch1-createWFLocksEarly' into DANS…
qqmyers Feb 17, 2026
20008ec
add sleep
qqmyers Feb 18, 2026
6fcd84d
release note updates
qqmyers Feb 18, 2026
17588c7
tweaks, remove duplicates
qqmyers Feb 18, 2026
27b0d37
switch to jvm setting
qqmyers Feb 19, 2026
7c08907
update static string to include ver number
qqmyers Feb 19, 2026
fb7517b
missed change - use pending/obsolete
qqmyers Feb 19, 2026
79e1ddc
fix param order per review
qqmyers Feb 19, 2026
7eef9ca
update/fix release note
qqmyers Feb 19, 2026
599cb0f
443 fix per review
qqmyers Feb 19, 2026
1bb3fa7
refactor per review
qqmyers Feb 19, 2026
06e48ac
fix indent per review
qqmyers Feb 19, 2026
3b72a4c
fix javadoc per review
qqmyers Feb 19, 2026
a77e0a8
remove param in doc per review
qqmyers Feb 19, 2026
2e1d2e5
cleanup
qqmyers Feb 19, 2026
d83b7af
add spacename to datacite file
qqmyers Feb 19, 2026
98b2e97
Merge remote-tracking branch 'IQSS/develop' into
qqmyers Feb 19, 2026
7e732a0
handle Local archiver zip name change
qqmyers Feb 19, 2026
f80a1cd
use constants
qqmyers Feb 20, 2026
54 changes: 54 additions & 0 deletions doc/release-notes/12167-ore-bag-archiving-changes.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,54 @@
## Archiving, OAI-ORE, and BagIt Export

This release includes multiple updates to the OAI-ORE metadata export and the process of creating archival bags, improving performance, fixing bugs, and adding significant new functionality.

### General Archiving Improvements
- Multiple performance and scaling improvements have been made for creating archival bags for large datasets, including:
- The duration of archiving tasks triggered from the version table or API is no longer limited by the transaction time limit.
- Temporary storage space requirements have increased by `1/:BagGeneratorThreads` of the zipped bag size. (This is a consequence of changes to avoid timeout errors on larger files/datasets.)
- The size of individual data files, and the total dataset size, that will be included in an archival bag can now be limited. Admins can choose whether files above these limits are transferred along with, but outside, the zipped bag (creating a complete archival copy) or are only referenced (the BagIt concept of a "holey" bag, which lists the oversized files and the Dataverse URLs from which they can be retrieved in a `fetch.txt` file). In the holey-bag case, an active service on the archiving platform must retrieve the oversized files (using appropriate credentials as needed) to make a complete copy.
- Superusers can now see a pending status in the dataset version table while archiving is active.
- Workflows are now triggered outside the transactions related to publication, assuring that workflow locks and status updates are always recorded.
- Potential conflicts between archiving/workflows, indexing, and metadata exports after publication have been resolved, avoiding cases where the status/last update times for these actions were not recorded.
- A bug has been fixed where superusers would incorrectly see the "Submit" button to launch archiving from the dataset page version table.
- The local, S3, and Google archivers have been updated to support deleting existing archival files for a version to allow re-creating the bag for a given version.
- For archivers that support file deletion, it is now possible to recreate an archival bag after "Update Current Version" has been used (replacing the original bag). By default, Dataverse will mark the current version's archive as out-of-date, but will not automatically re-archive it.
- A new 'obsolete' status has been added to indicate when an archival bag exists for a version but it was created prior to an "Update Current Version" change.
- Improvements have been made to file retrieval for bagging, including retries on errors and when download requests are being throttled.
- A bug causing `:BagGeneratorThreads` to be ignored has been fixed, and the default has been reduced to 2.
- Retrieval of files for inclusion in an archival bag is no longer counted as a download.
- It is now possible to require that all previous versions have been successfully archived before archiving of a newly published version can succeed. (This is intended to support use cases where deduplication of files between dataset versions will be done and is a step towards supporting the Oxford Common File Layout (OCFL).)
- The pending status has changed to use the same JSON format as other statuses.

### OAI-ORE Export Updates
- The export now uses URIs for checksum algorithms, conforming with JSON-LD requirements.
- A bug causing failures with deaccessioned versions has been fixed. This occurred when the deaccession note ("Deaccession Reason" in the UI) was null, which is permissible via the API.
- The `https://schema.org/additionalType` value has been updated to "Dataverse OREMap Format v1.0.2" to reflect format changes.

### Archival Bag (BagIt) Updates
- The `bag-info.txt` file now correctly includes information for dataset contacts, fixing a bug where nothing was included when multiple contacts were defined. (Multiple contacts were always included in the OAI-ORE file in the bag; only the `bag-info.txt` file was affected.)
- Values used in the `bag-info.txt` file that may be multi-line (i.e. with embedded CR or LF characters) are now properly indented and wrapped per the BagIt specification (`Internal-Sender-Identifier`, `External-Description`, `Source-Organization`, `Organization-Address`).
- The dataset name is no longer used as a subdirectory within the `data/` directory to reduce issues with unzipping long paths on some filesystems.
- For dataset versions with no files, the empty `manifest-<alg>.txt` file will now use the algorithm from the `:FileFixityChecksumAlgorithm` setting instead of defaulting to MD5.
- A new key, `Dataverse-Bag-Version`, has been added to `bag-info.txt` with the value "1.0" to allow for tracking changes to Dataverse's archival bag generation over time.
- When using the `holey` bag option discussed above, the required `fetch.txt` file will be included.
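To illustrate the wrapping behavior described above, here is a hypothetical `bag-info.txt` fragment (the metadata values are invented for this example). A long `External-Description` value is folded so that lines stay within the 79-character limit, with continuation lines indented as the BagIt specification (RFC 8493) allows:

```text
Source-Organization: Example University Libraries
External-Description: This dataset contains survey responses collected in
  2023, together with the processing scripts used to generate the derived
  tables described in the accompanying documentation.
Dataverse-Bag-Version: 1.0
```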
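A hypothetical `fetch.txt` for a holey bag might look as follows. Per the BagIt specification, each line gives a URL, the file length in bytes (or `-` if unknown), and the path the file occupies inside the bag; the URLs and sizes here are invented for illustration:

```text
https://demo.dataverse.org/api/access/datafile/12345 10737418240 data/bigfile1.tar
https://demo.dataverse.org/api/access/datafile/12346 21474836480 data/bigfile2.tar
```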


### New Configuration Settings

This release introduces several new settings to control archival and bagging behavior.

- `:ArchiveOnlyIfEarlierVersionsAreArchived` (Default: `false`)
When set to `true`, dataset versions must be archived in order. That is, all prior versions of a dataset must be archived before the latest version can be archived.

The following JVM options (MicroProfile Config Settings) control bag size and holey bag support:
- `dataverse.bagit.zip.holey`
- `dataverse.bagit.zip.max-data-size`
- `dataverse.bagit.zip.max-file-size`
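As with other MicroProfile Config options, these can presumably also be supplied as environment variables (dots and dashes map to underscores, uppercased, following the MicroProfile naming rules). A sketch with invented size values:

```shell
# Enable holey bags and cap individual files at ~10 GiB (values are examples only).
export DATAVERSE_BAGIT_ZIP_HOLEY=true
export DATAVERSE_BAGIT_ZIP_MAX_FILE_SIZE=10737418240
echo "$DATAVERSE_BAGIT_ZIP_HOLEY $DATAVERSE_BAGIT_ZIP_MAX_FILE_SIZE"
```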

- `dataverse.bagit.archive-on-version-update` (Default: `false`)
Indicates whether archival bag creation should be triggered (if configured) when a version is updated and was already successfully archived, i.e., via the Update-Current-Version publication option. Setting the flag to `true` only works if the archiver being used supports deleting existing archival bags.

### Backward Incompatibility

The name of the archival zipped bag produced by the LocalSubmitToArchiveCommand archiver now has a '.' character before the version number, mirroring the names used by other archivers; e.g., the name will be like `doi-10-5072-fk2-fosg5q.v1.0.zip` rather than `doi-10-5072-fk2-fosg5qv1.0.zip`.
1 change: 1 addition & 0 deletions doc/sphinx-guides/source/admin/big-data-administration.rst
Expand Up @@ -302,6 +302,7 @@ There are a broad range of options (that are not turned on by default) for impro
- :ref:`:DisableSolrFacetsWithoutJsession` - disables facets for users who have disabled cookies (e.g. for bots)
- :ref:`:DisableUncheckedTypesFacet` - only disables the facet showing the number of collections, datasets, files matching the query (this facet is potentially less useful than others)
- :ref:`:StoreIngestedTabularFilesWithVarHeaders` - by default, Dataverse stores ingested files without headers and dynamically adds them back at download time. Once this setting is enabled, Dataverse will leave the headers in place (for newly ingested files), reducing the cost of downloads
- :ref:`dataverse.bagit.zip.max-file-size`, :ref:`dataverse.bagit.zip.max-data-size`, and :ref:`dataverse.bagit.zip.holey` - options to control the size and temporary storage requirements when generating archival Bags - see :ref:`BagIt Export`


Scaling Infrastructure
Expand Down
43 changes: 40 additions & 3 deletions doc/sphinx-guides/source/installation/config.rst
Expand Up @@ -2259,10 +2259,22 @@ These archival Bags include all of the files and metadata in a given dataset ver

The Dataverse Software offers an internal archive workflow which may be configured as a PostPublication workflow via an admin API call to manually submit previously published Datasets and prior versions to a configured archive such as Chronopolis. The workflow creates a `JSON-LD <http://www.openarchives.org/ore/0.9/jsonld>`_ serialized `OAI-ORE <https://www.openarchives.org/ore/>`_ map file, which is also available as a metadata export format in the Dataverse Software web interface.

The size of the zipped archival Bag can now be limited for all archivers. Files that don't fit within that limit can either be transferred separately (placed so that they are correctly positioned according to the BagIt specification when the zipped bag is unzipped in place) or just referenced for later download (using the BagIt concept of a 'holey' bag, with the list of missing files in a ``fetch.txt`` file). These settings allow for managing large datasets by excluding files over a certain size or total data size, which can be useful for archivers with size limitations or to reduce transfer times. See the :ref:`dataverse.bagit.zip.max-file-size`, :ref:`dataverse.bagit.zip.max-data-size`, and :ref:`dataverse.bagit.zip.holey` JVM options for more details.

At present, archiving classes include the DuraCloudSubmitToArchiveCommand, LocalSubmitToArchiveCommand, GoogleCloudSubmitToArchive, and S3SubmitToArchiveCommand, which all extend the AbstractSubmitToArchiveCommand and use the configurable mechanisms discussed below. (A DRSSubmitToArchiveCommand, which works with Harvard's DRS, also exists and, while specific to DRS, is a useful example of how Archivers can support single-version-only semantics and archiving only from specified collections, with collection-specific parameters.)

All current options support the :ref:`Archival Status API` calls and the same status is available in the dataset page version table (for contributors/those who could view the unpublished dataset, with more detail available to superusers).

Two settings that can be used with all current Archivers are:

- \:BagGeneratorThreads - the number of threads to use when adding data files to the zipped bag. The default is 2. Values of 4 or more may increase performance on larger machines but may cause problems if file access is throttled.
- \:ArchiveOnlyIfEarlierVersionsAreArchived - when true, requires dataset versions to be archived in order by confirming that all prior versions have been successfully archived before allowing a new version to be archived. The default is false.

These must be included in the \:ArchiverSettings for the Archiver to work.
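For reference, database settings like these are set via the admin API. A sketch, assuming a local installation with the admin API reachable (endpoints and values shown are illustrative):

```
curl -X PUT -d true http://localhost:8080/api/admin/settings/:ArchiveOnlyIfEarlierVersionsAreArchived
curl -X PUT -d 4 http://localhost:8080/api/admin/settings/:BagGeneratorThreads
```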

Archival Bags are created per dataset version. By default, if a version is republished (via the superuser-only 'Update Current Version' publication option in the UI/API), a new archival bag is not created for the version.
If the archiver used is capable of deleting existing bags (the Google, S3, and Local archivers), superusers can trigger a manual update of the archival bag and, if the :ref:`dataverse.bagit.archive-on-version-update` flag is set to true, this will be done automatically when 'Update Current Version' is used.

.. _Duracloud Configuration:

Duracloud Configuration
Expand Down Expand Up @@ -3715,6 +3727,14 @@ The email for your institution that you'd like to appear in bag-info.txt. See :r

Can also be set via *MicroProfile Config API* sources, e.g. the environment variable ``DATAVERSE_BAGIT_SOURCEORG_EMAIL``.

.. _dataverse.bagit.archive-on-version-update:

dataverse.bagit.archive-on-version-update
+++++++++++++++++++++++++++++++++++++++++

Indicates whether archival bag creation should be triggered (if configured) when a version is updated and was already successfully archived,
i.e., via the Update-Current-Version publication option. Setting the flag to true only works if the archiver being used supports deleting existing archival bags.

.. _dataverse.files.globus-monitoring-server:

dataverse.files.globus-monitoring-server
Expand Down Expand Up @@ -3868,6 +3888,21 @@ This can instead be restricted to only superusers who can publish the dataset us

Example: ``dataverse.coar-notify.relationship-announcement.notify-superusers-only=true``

.. _dataverse.bagit.zip.holey:

dataverse.bagit.zip.holey
+++++++++++++++++++++++++

A boolean that, if true, will cause the BagIt archiver to create a "holey" bag. In a holey bag, files that are not included in the bag are listed in the ``fetch.txt`` file with a URL from which they can be downloaded. This is used in conjunction with ``dataverse.bagit.zip.max-file-size`` and/or ``dataverse.bagit.zip.max-data-size``. Default: false.

.. _dataverse.bagit.zip.max-data-size:

dataverse.bagit.zip.max-data-size
+++++++++++++++++++++++++++++++++

The maximum total (uncompressed) size of data files (in bytes) to include in a BagIt zip archive. If the total size of the dataset files exceeds this limit, files will be excluded from the zipped bag (starting from the largest) until the total size is under the limit. Excluded files will be handled as defined by ``dataverse.bagit.zip.holey``: just listed in ``fetch.txt`` if that setting is true, or transferred separately and placed next to the zipped bag otherwise. When not set, there is no limit.

.. _dataverse.bagit.zip.max-file-size:

dataverse.bagit.zip.max-file-size
+++++++++++++++++++++++++++++++++

The maximum (uncompressed) size of a single file (in bytes) to include in a BagIt zip archive. Any file larger than this will be excluded. Excluded files will be handled as defined by ``dataverse.bagit.zip.holey``: just listed in ``fetch.txt`` if that setting is true, or transferred separately and placed next to the zipped bag otherwise. When not set, there is no limit.

.. _feature-flags:

Feature Flags
Expand Down Expand Up @@ -4031,9 +4066,6 @@ dataverse.feature.only-update-datacite-when-needed

Only contact DataCite to update a DOI after checking to see if DataCite has outdated information (for efficiency, lighter load on DataCite, especially when using file DOIs).




.. _:ApplicationServerSettings:

Application Server Settings
Expand Down Expand Up @@ -5342,6 +5374,11 @@ This setting specifies which storage system to use by identifying the particular

For examples, see the specific configuration above in :ref:`BagIt Export`.

:ArchiveOnlyIfEarlierVersionsAreArchived
++++++++++++++++++++++++++++++++++++++++

This setting, if true, only allows creation of an archival Bag for a dataset version if all prior versions have been successfully archived. The default is false (any version can be archived independently as long as other settings allow it).

:ArchiverSettings
+++++++++++++++++

Expand Down
33 changes: 27 additions & 6 deletions src/main/java/edu/harvard/iq/dataverse/DataFile.java
Expand Up @@ -109,18 +109,22 @@ public class DataFile extends DvObject implements Comparable {
* The list of types should be limited to the list above in the technote
* because the string gets passed into MessageDigest.getInstance() and you
* can't just pass in any old string.
*
* The URIs are used in the OAI_ORE export. They are taken from the associated XML Digital Signature standards.
*/
public enum ChecksumType {

MD5("MD5"),
SHA1("SHA-1"),
SHA256("SHA-256"),
SHA512("SHA-512");
MD5("MD5", "http://www.w3.org/2001/04/xmldsig-more#md5"),
SHA1("SHA-1", "http://www.w3.org/2000/09/xmldsig#sha1"),
SHA256("SHA-256", "http://www.w3.org/2001/04/xmlenc#sha256"),
SHA512("SHA-512", "http://www.w3.org/2001/04/xmlenc#sha512");

private final String text;
private final String uri;

private ChecksumType(final String text) {
private ChecksumType(final String text, final String uri) {
this.text = text;
this.uri = uri;
}

public static ChecksumType fromString(String text) {
Expand All @@ -131,13 +135,30 @@ public static ChecksumType fromString(String text) {
}
}
}
throw new IllegalArgumentException("ChecksumType must be one of these values: " + Arrays.asList(ChecksumType.values()) + ".");
throw new IllegalArgumentException(
"ChecksumType must be one of these values: " + Arrays.asList(ChecksumType.values()) + ".");
}

public static ChecksumType fromUri(String uri) {
if (uri != null) {
for (ChecksumType checksumType : ChecksumType.values()) {
if (uri.equals(checksumType.uri)) {
return checksumType;
}
}
}
throw new IllegalArgumentException(
"ChecksumType must be one of these values: " + Arrays.asList(ChecksumType.values()) + ".");
}

@Override
public String toString() {
return text;
}

public String toUri() {
return uri;
}
}

//@Expose
Expand Down
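The new `fromUri`/`toUri` methods round-trip between algorithm names and the XML Digital Signature URIs used in the OAI-ORE export. A standalone sketch of the same mapping (reimplemented here outside the enum, purely for illustration):

```java
import java.util.Map;

// Minimal standalone reimplementation of the ChecksumType URI mapping,
// mirroring the URI values added to the enum above.
public class ChecksumUriDemo {
    static final Map<String, String> URI_TO_ALGORITHM = Map.of(
            "http://www.w3.org/2001/04/xmldsig-more#md5", "MD5",
            "http://www.w3.org/2000/09/xmldsig#sha1", "SHA-1",
            "http://www.w3.org/2001/04/xmlenc#sha256", "SHA-256",
            "http://www.w3.org/2001/04/xmlenc#sha512", "SHA-512");

    public static void main(String[] args) {
        // Resolve the algorithm name for a checksum URI found in an OAI-ORE export.
        System.out.println(URI_TO_ALGORITHM.get("http://www.w3.org/2001/04/xmlenc#sha256"));
    }
}
```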