8720 Allow metadata re-export in smaller batches #8721
Conversation
…etadata-reExport-in-smaller-batches
# Conflicts:
# src/main/java/edu/harvard/iq/dataverse/DatasetServiceBean.java
Could not build the 'guides'; I got an error when running the build.
@PaulBoon hi. Yes, because we merged PR #8647 you now have to pip install a package (sphinx-icon). Please see https://groups.google.com/g/dataverse-dev/c/fZpTQYQKR0g/m/DQTARER7AwAJ
Thanks @pdurbin, I was naive to think it would just work. Probably no problem for the PR; this did bring back memories of this 'dot' tool that I used decades ago.
@PaulBoon yes, it's the same old tool. Here's where we explain that you need to install dot/graphviz: https://guides.dataverse.org/en/5.11/developers/documentation.html#installing-graphviz
What is holding you back from pushing it through?
@PaulBoon hi! Sorry, this is one of currently 33 open pull requests that are sitting in the "On Deck" column on our board. In total, there are 81 open pull requests. Please be patient with us as we work through the queue and thank you once again for another contribution to Dataverse!
@PaulBoon I noticed there are merge conflicts in this file for you (after merging develop in): src/main/java/edu/harvard/iq/dataverse/DatasetServiceBean.java Here's a screenshot from Netbeans showing the conflict: It looks like @landreev removed the updateLastExportTimeStamp method in 8135490 as part of PR #8782, so I made sure it's still gone in your PR. I can't push to your branch, so I made this PR against your PR: PaulBoon#4 Alternatively, you are welcome to merge "develop" in and resolve the merge conflicts yourself. Thanks!
pdurbin left a comment:
Overall, this is looking good! In addition to the new APIs, we're getting much better docs. 😄
I haven't run the code (and left a comment about the need to resolve merge conflicts), but in this review I'm giving suggestions for the docs and asking a question.
}
Dataset dataset = null;
try {
    dataset = datasetService.findByGlobalId(persistentId);
Hmm, shouldn't we support both PID and ID (database ID)? We have patterns for this in Datasets.java.
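For reference, the usual dispatch is: if the identifier is all digits, treat it as a database ID; otherwise treat it as a persistent ID. A minimal sketch of that pattern (in Python for brevity; the function name and return shape are illustrative, not the actual Datasets.java API):

```python
def resolve_dataset_identifier(id_param: str):
    """Classify an API identifier: all digits means a numeric database ID,
    anything else is treated as a persistent ID (PID).
    Illustrative sketch only, not the actual Datasets.java logic."""
    if id_param.isdigit():
        return ("database_id", int(id_param))
    return ("persistent_id", id_param)
```

With this, "42" resolves as a database ID while "doi:10.5072/FK2/AAA000" resolves as a PID.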
Documentation textual improvements Co-authored-by: Philip Durbin <philipdurbin@gmail.com>
8720 merge conflicts
@PaulBoon I just made another PR against this PR: PaulBoon#5 I added a test (heads up to @landreev that I re-enabled a test you had disabled in 3962a19 as part of #5771). While adding the test I discovered an API that seems quite similar: https://guides.dataverse.org/en/5.11.1/api/native-api.html#export-metadata-of-a-dataset-in-various-formats In that PR I'm also crosslinking the two APIs, existing and new. These are my current questions: 😄
Hi @pdurbin, I must confess that my thoughts are elsewhere and this PR was a while ago, so I am happy that you put some effort into this. The main thing was the "clearExportTimestamps", so I would definitely want that functionality in there. The naming and API endpoint details could be debatable. From a developer's (API user's) perspective it is nice to have the 'metadata re-export' API similar to the 'search re-index' API, I think.
8720 reexport
@PaulBoon from standup today it sounds like @landreev and I are quite happy with this PR, but we (and others) do think it would be nice if the API you added supported both PIDs (as it does now) and IDs. I can make a PR to add this in the coming days, or if you want to go ahead, please do. Either way, please merge the latest from develop (I don't have permission to push to your branch) because we've been on a mini merge fest. 🚀 Thanks! p.s. We know about the failing test but it's just this flaky one: #8973
@PaulBoon when you get a chance, can you please merge develop into your branch? Thanks!
PR made: PaulBoon#6 @PaulBoon I think the next steps for you are:
Thanks!
@PaulBoon we discussed this issue at standup this morning and decided to move it to "community dev" until you've had a chance to review and maybe act on the steps above. Thanks!
support database IDs too (as well as PIDs) IQSS#8720
…etadata-reExport-in-smaller-batches


What this PR does / why we need it:
This PR adds extra API endpoints:
Which issue(s) this PR closes:
Special notes for your reviewer:
Suggestions on how to test this:
The most likely 'use case' is that after a change of the metadata 'schema' a re-export of everything is needed, but you could then re-export a specific dataset and see whether it changed correspondingly.
A simpler scenario for testing would be to note the filesystem timestamp of the cached files, check these, check the database timestamps and also check the export log after re-exporting has been done.
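The example calls below pass a persistent ID as a query parameter. When scripting them, percent-encoding avoids surprises, since PIDs contain ':' and '/'. A small sketch (the endpoint path matches the curl examples below; the helper itself is illustrative, not part of Dataverse):

```python
from urllib.parse import urlencode

def reexport_url(base: str, persistent_id: str) -> str:
    """Build the reExportDataset URL with the PID percent-encoded.
    Endpoint path as in the curl examples; helper is illustrative."""
    query = urlencode({"persistentId": persistent_id})
    return f"{base}/api/admin/metadata/reExportDataset?{query}"

print(reexport_url("http://localhost:8080", "doi:10.5072/FK2/AAA000"))
```

Plain curl also accepts the unencoded form shown below; encoding just avoids surprises when the PID is interpolated by other tools.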
Example call:
curl http://localhost:8080/api/admin/metadata/reExportDataset?persistentId=doi:10.5072/FK2/AAA000
Example call for clearing the timestamps:
curl http://localhost:8080/api/admin/metadata/clearExportTimestamps
Test the re-export of a single dataset
Select a dataset and look up the cached export files on disk.
In my case I have a local filesystem with a dataset doi: 10.5072/FK2/O9LNLQ.
$ ls -al /data/dataverse/files/10.5072/FK2/O9LNLQ/*.cached
This gives the listing of the files with their (filesystem) timestamps.
Next we will do the re-export.
$ curl http://localhost:8080/api/admin/metadata/reExportDataset?persistentId=doi:10.5072/FK2/O9LNLQ
returns
{"status":"OK","data":{"message":"export started"}}
Check the timestamps; these should be updated to the current time.
Check the Payara log:
/var/lib/payara5/glassfish/domains/domain1/logs/server.log
At the end it should have a line stating that the export of that dataset succeeded, similar to the following:
[#|2022-06-14T11:34:50.181+0200|INFO|Payara 5.2021.6|edu.harvard.iq.dataverse.DatasetServiceBean|_ThreadID=249;_ThreadName=__ejb-thread-pool6;_TimeMillis=1655199290181;_LevelValue=800;| Success exporting dataset: Manual Test doi:10.5072/FK2/O9LNLQ|#]
To confirm that this dataset is the only dataset updated, you could check the 'cached' files of other datasets.
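These admin responses share a simple JSON shape ("status" plus either "data" or "message"), so the checks can be scripted. A sketch (the response strings are copied from the examples in this section; the helper itself is illustrative):

```python
import json

def check_api_response(body: str) -> str:
    """Return the data message when status is OK; raise on ERROR.
    Response shapes copied from the examples in this section;
    this helper is illustrative, not part of Dataverse."""
    doc = json.loads(body)
    if doc.get("status") == "ERROR":
        raise RuntimeError(doc.get("message", "unknown error"))
    return doc.get("data", {}).get("message", "")
```

The error shapes it handles are the ones shown in the wrong-input tests below.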
You can repeat it for this dataset or any other dataset and check the timestamp of the cached files.
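Comparing the cached files' modification times before and after the call can also be scripted. A minimal sketch (it simulates the rewrite with os.utime on a temporary file, since the real cached-file paths depend on your installation):

```python
import os
import tempfile

def mtimes(paths):
    """Map each cached-export file path to its filesystem mtime."""
    return {p: os.stat(p).st_mtime for p in paths}

# Simulate one cached export file; in practice you would list
# the *.cached files under your dataset's storage directory.
with tempfile.NamedTemporaryFile(delete=False, suffix=".cached") as f:
    path = f.name
before = mtimes([path])
later = before[path] + 60          # stand-in for a re-export rewriting the file
os.utime(path, times=(later, later))
after = mtimes([path])
refreshed = [p for p in after if after[p] > before[p]]
os.unlink(path)
```

Only the files whose mtime moved forward (here, `refreshed`) were rewritten, which is the check the manual `ls -al` comparison performs.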
Test for correct handling of wrong input:
$ curl http://localhost:8080/api/admin/metadata/reExportDataset
Should return
{"status":"ERROR","message":"No persistent id given."}
And
$ curl http://localhost:8080/api/admin/metadata/reExportDataset?persistentId=doi:10.5072/NONEXISTING/O9LNLQ
Should return:
{"status":"ERROR","message":"Could not find dataset with persistent id doi:10.5072/NONEXISTING/O9LNLQ"}
Test the clearing of the export timestamps
Note that this is easier to test on a system with only a few datasets, otherwise reexporting can take a long time.
For simplicity select the same dataset as before and have a look at the file timestamps again;
$ ls -al /data/dataverse/files/10.5072/FK2/O9LNLQ/*.cached
Run exportAll:
curl http://localhost:8080/api/admin/metadata/exportAll
Returns
{"status":"WORKFLOW_IN_PROGRESS"}
The timestamps should not have changed, because exportAll only exports when needed, for example when the export timestamps (in the database) have been cleared.
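That "only exports if needed" behavior can be modeled as a simple predicate: re-export when the export timestamp is missing (cleared) or older than the dataset's last change. A sketch (the field names are illustrative, not Dataverse's actual schema):

```python
from datetime import datetime
from typing import Optional

def needs_reexport(last_export_time: Optional[datetime],
                   last_change_time: datetime) -> bool:
    """True when the dataset was never exported, the timestamp was
    cleared (e.g. by clearExportTimestamps), or the dataset changed
    after the last export. Illustrative model only."""
    if last_export_time is None:
        return True
    return last_change_time > last_export_time
```

Clearing the timestamps sets them all to the "missing" case, which is why the next exportAll re-exports everything.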
Then clear those export timestamps:
$ curl http://localhost:8080/api/admin/metadata/clearExportTimestamps
Returns
{"status":"OK","data":{"message":"cleared: 1"}}
in case we only have one dataset in the archive; the number should be the total number of datasets. Then run exportAll again; this time all datasets should be re-exported.
Check the timestamps in the filesystem and the export log file and confirm that this is indeed the case.
The export log can be checked in your log dir; a new file should appear with the timestamp in the filename, for example with Payara in '/var/lib':
/var/lib/payara5/glassfish/domains/domain1/logs/export_2022-06-09T15-20-23.log
Running exportAll yet again should not change the filesystem timestamps of the 'cached' files.