Skip to content

8720 allow metadata re export in smaller batches#8721

Merged
pdurbin merged 15 commits intoIQSS:developfrom
PaulBoon:8720-allow-metadata-reExport-in-smaller-batches
Sep 27, 2022
Merged

8720 allow metadata re export in smaller batches#8721
pdurbin merged 15 commits intoIQSS:developfrom
PaulBoon:8720-allow-metadata-reExport-in-smaller-batches

Conversation

@PaulBoon
Copy link
Contributor

@PaulBoon PaulBoon commented May 19, 2022

What this PR does / why we need it:
This PR adds extra API endpoints:

  • to do a reExport of metadata (cached files) for a specific dataset
  • to clear all the exporting timestamps allowing to have exportAll continue after a premature ending.

Which issue(s) this PR closes:

Special notes for your reviewer:

Suggestions on how to test this:

The most likely 'use case' is that after a change of the metadata 'schema' a reExport all is needed, but you could the reExport a specific dataset and see if it changed correspondingly.
A simpler scenario for testing would be to note the filesystem timestamp of the cached files, check these, check the database timestamps and also check the export log after re-exporting has been done.

Example call:
curl http://localhost:8080/api/admin/metadata/reExportDataset?persistentId=doi:10.5072/FK2/AAA000

Example call for clearing the timestamps:
curl http://localhost:8080/api/admin/metadata/clearExportTimestamps

  • Test the reexport of a single dataset
    Select a dataset and lookup the cached export files on the disk;
    In my case I have a local filesystem with a dataset doi: 10.5072/FK2/O9LNLQ.
    $ ls -al /data/dataverse/files/10.5072/FK2/O9LNLQ/*.cached
    Gives me the listing of the files with the (filesystem) timestamp.
    Next we wil do the reindexing.
    $ curl http://localhost:8080/api/admin/metadata/reExportDataset?persistentId=doi:10.5072/FK2/O9LNLQ.
    returns {"status":"OK","data":{"message":"export started"}}.
    Check the timestamps, these should be 'updated' to the current time.
    Check the payara log /var/lib/payara5/glassfish/domains/domain1/logs/server.log. At the end it should have a line stating the export of that dataset succeeded, similar to the following;
    [#|2022-06-14T11:34:50.181+0200|INFO|Payara 5.2021.6|edu.harvard.iq.dataverse.DatasetServiceBean|_ThreadID=249;_ThreadName=__ejb-thread-pool6;_TimeMillis=1655199290181;_LevelValue=800;| Success exporting dataset: Manual Test doi:10.5072/FK2/O9LNLQ|#]
    To confirm that this dataset is the only dataset updated you could check the 'cached' files of other datasets.
    You can repeat it for this dataset or any other dataset and check the timestamp of the cached files.
    Test for correct handling of wrong input:
    $ curl http://localhost:8080/api/admin/metadata/reExportDataset
    Should return
    {"status":"ERROR","message":"No persistent id given."}.
    And
    $ curl http://localhost:8080/api/admin/metadata/reExportDataset?persistentId=doi:10.5072/NONEXISTING/O9LNLQ
    Should return:
    {"status":"ERROR","message":"Could not find dataset with persistent id doi:10.5072/NONEXISTING/O9LNLQ"}.

  • Test the clearing of the export timestamps
    Note that this is easier to test on a system with only a few datasets, otherwise reexporting can take a long time.
    For simplicity select the same dataset as before and have a look at the file timestamps again;
    $ ls -al /data/dataverse/files/10.5072/FK2/O9LNLQ/*.cached.
    Run exportAll: curl http://localhost:8080/api/admin/metadata/exportAll
    Returns {"status":"WORKFLOW_IN_PROGRESS"}
    The timestamps should not have changed, because it only exports if it is needed like when the export timestamps (in the database) are cleared.
    Then clear those export timestamps:
    $ curl http://localhost:8080/api/admin/metadata/clearExportTimestamps
    Returns {"status":"OK","data":{"message":"cleared: 1"}} in case we only have one dataset in the archive, because the number should be the total amount of datasets.
    Then run exportAll again, but this time all datasets should be re-exported.
    Check the timestamps in the filesystem and the export log file and confirm that this is indeed the case.
    The export log can be checked in your log dir; a new file should appear with the timestamp in the filename, for example with Payara in '/var/lib':
    /var/lib/payara5/glassfish/domains/domain1/logs/export_2022-06-09T15-20-23.log.
    Running exportAll yet again should not change the filesystem timestamps of the 'cached' files.

PaulBoon added 2 commits May 19, 2022 16:22
…etadata-reExport-in-smaller-batches

# Conflicts:
#	src/main/java/edu/harvard/iq/dataverse/DatasetServiceBean.java
@coveralls
Copy link

coveralls commented May 19, 2022

Coverage Status

Coverage increased (+0.4%) to 19.702% when pulling 89b0090 on PaulBoon:8720-allow-metadata-reExport-in-smaller-batches into 013fcd8 on IQSS:develop.

@PaulBoon PaulBoon marked this pull request as ready for review June 14, 2022 11:37
@PaulBoon
Copy link
Contributor Author

Could not built the 'guides' , got an error where running make html:

Running Sphinx v5.0.1

Extension error:
Could not import extension sphinxcontrib.icon (exception: No module named 'sphinxcontrib.icon')
make: *** [html] Error 2

@pdurbin
Copy link
Member

pdurbin commented Jun 14, 2022

Could not import extension sphinxcontrib.icon (exception: No module named 'sphinxcontrib.icon')

@PaulBoon hi. Yes, because we merged PR #8647 you now have to pip install a package (sphinx-icon). Please see https://groups.google.com/g/dataverse-dev/c/fZpTQYQKR0g/m/DQTARER7AwAJ

@PaulBoon
Copy link
Contributor Author

PaulBoon commented Jun 16, 2022

Could not import extension sphinxcontrib.icon (exception: No module named 'sphinxcontrib.icon')

@PaulBoon hi. Yes, because we merged PR #8647 you now have to pip install a package (sphinx-icon). Please see https://groups.google.com/g/dataverse-dev/c/fZpTQYQKR0g/m/DQTARER7AwAJ

Thanks @pdurbin, I was so naive to think pip install sphinxcontrib.icon would work, this pip install sphinx-icon did work, now make html does work also. Although now I get:

Warning, treated as error:
dot command 'dot' cannot be run (needed for graphviz output), check the graphviz_dot setting
make: *** [html] Error 2

Probably no problem for the PR, this did bring back memories about this 'dot' tool that I used decades ago.

@pdurbin
Copy link
Member

pdurbin commented Jun 16, 2022

Probably no problem for the PR, this did bring back memories about this 'dot' tool that I used decades ago.

@PaulBoon yes, it's the same old dot tool from days of yore. Still works! 😄

Here's we explain that you need to install dot/graphviz: https://guides.dataverse.org/en/5.11/developers/documentation.html#installing-graphviz

@PaulBoon
Copy link
Contributor Author

What is holding you back to push it through?
We do have a situation on one of our other systems and would like to have this functionality, so it seems it's useful to have.

@pdurbin
Copy link
Member

pdurbin commented Jul 18, 2022

@PaulBoon hi! Sorry, this is one of currently 33 open pull requests that is sitting in the "On Deck" column on our board. In total, there are 81 open pull requests. Please be patient with us as we work through the queue and thank you once again for another contribution to Dataverse!

@pdurbin pdurbin self-assigned this Sep 8, 2022
…atches IQSS#8720

Conflicts:
src/main/java/edu/harvard/iq/dataverse/DatasetServiceBean.java

It looks like @landreev removed the updateLastExportTimeStamp method
in 8135490 as part of PR IQSS#8782 so I made sure it's still gone.
@pdurbin
Copy link
Member

pdurbin commented Sep 8, 2022

@PaulBoon I noticed there are merge conflicts in this file for you (after merging develop in):

src/main/java/edu/harvard/iq/dataverse/DatasetServiceBean.java

Here's a screenshot from Netbeans showing the conflict:

Screen Shot 2022-09-08 at 4 26 17 PM

It looks like @landreev removed the updateLastExportTimeStamp method in 8135490 as part of PR #8782 so I made sure it's still gone in your PR.

I can't push to your branch so I made this PR against your PR: PaulBoon#4

Alternatively, you are welcome to merge "develop" in and resolve the merge conflicts yourself. Thanks!

Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this is looking good! In addition to new APIs we're getting much better docs. 😄

I haven't run the code (and left a comment about the need to resolve merge conflicts) but in this review I'm giving suggestions for the docs and ask a question.

}
Dataset dataset = null;
try {
dataset = datasetService.findByGlobalId(persistentId);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, shouldn't we support both PID and ID (database ID)? We have patterns for this in Datasets.java.

@pdurbin
Copy link
Member

pdurbin commented Sep 9, 2022

@PaulBoon I just made another PR against this PR: PaulBoon#5

I added a test (heads up to @landreev that I re-enabled a test you had disabled in 3962a19 as part of #5771.)

While added the test I discovered an API that seems quite similar: https://guides.dataverse.org/en/5.11.1/api/native-api.html#export-metadata-of-a-dataset-in-various-formats

Screen Shot 2022-09-09 at 3 21 44 PM

In that PR I'm also crosslinking the two APIs, existing and new.

These are my current questions: 😄

  • In light of that exiting "export dataset" API, do we still need this PR? (I'm aware that a new "clearExportTimestamps" API is being added as well.)
  • If we do still want this PR, should the new "reexport dataset" method be put under /api/dataset like the existing API?
  • If we do still want this PR, should we allow both PIDs and IDs of dataset to be supported?

@PaulBoon
Copy link
Contributor Author

PaulBoon commented Sep 12, 2022

Hi @pdurbin, I must confess that my thought are elsewhere and this PR was a while ago so I am happy that you put some effort into this.

The main thing was the "clearExportTimestamps", so I would definitely want that functionality in there. The naming and API endpoint details could be debatable.
The existing export API is different than what my PR is doing I think, because it is probably also returning the metadata for the specified format. Updating the 'cached' metadata file is a side effect, which we can use. However we would have to add support for exporting all formats, somehow.

From a developers (API user) perspective it is nice to have the 'metadata re-export' API similar to the 'search re-index' API I think.

@pdurbin
Copy link
Member

pdurbin commented Sep 14, 2022

@PaulBoon from standup today it sounds like @landreev and I are quite happy with this PR but we (and others) do think it would be nice if the API you added supported both PIDs (as it does now) and IDs. I can make a PR to add this in the coming days or if you want to go ahead, please do.

Either way, please merge the latest from develop (I don't have permission to your branch) because we've been on a mini merge fest. 🚀 Thanks!

p.s. We know about the failing test but it's just this flaky one: #8973

@pdurbin
Copy link
Member

pdurbin commented Sep 20, 2022

@PaulBoon when you get a chance, can you please merge develop into your branch? Thanks!

@pdurbin
Copy link
Member

pdurbin commented Sep 21, 2022

nice if the API you added supported both PIDs (as it does now) and IDs

PR made: PaulBoon#6

@PaulBoon I think the next steps for you are:

  • test and merge the PR above if you're happy with it
  • merge latest from develop
  • add a release note snippet at doc/release-notes/8720-reexport.md

Thanks!

@pdurbin
Copy link
Member

pdurbin commented Sep 26, 2022

@PaulBoon we discussed this issue at standup this morning and decided to move it to "community dev" until you've had a chance to review and maybe act on the steps above. Thanks!

@pdurbin pdurbin removed their assignment Sep 26, 2022
support database IDs too (as well as PIDs) IQSS#8720
@pdurbin
Copy link
Member

pdurbin commented Sep 27, 2022

@PaulBoon thanks for merging. It looks like EditDDIIT.testUpdateVariableMetadata is failing. Can you please merge the latest from develop to pick up the fix for #8992?

Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been working closely with @PaulBoon on this. Looks good. I'll go ahead and merge it.

@pdurbin pdurbin merged commit 7c1683b into IQSS:develop Sep 27, 2022
@pdurbin pdurbin added this to the 5.12 milestone Sep 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow metadata reExport in smaller batches or continue where the previous request ended

4 participants