JR-1991 (Member) commented Feb 18, 2025

According to IQSS/dataverse#11265 (comment) and #18, direct uploads of multiple tabular files trigger an OptimisticLock exception, resulting in a 500 status response when trying to register the uploaded files. This issue can be avoided by disabling tabular file ingestion.

This PR enhances the file.py model by introducing the tabIngest field. This field is set to True by default but can be disabled if the error occurs. The README.md has been updated to include information about this new field, as well as a troubleshooting section to guide users on how to resolve the issue.
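
A minimal sketch of how such a flag might be used, assuming a simplified stand-in for the model. `FileSketch`, its `to_json_data` method, and the `fileName` key are hypothetical and only illustrate the idea; the `tabIngest` name and its default of `True` come from the description above.

```python
import json
from dataclasses import dataclass


@dataclass
class FileSketch:
    """Hypothetical stand-in for the model in file.py, not the actual class."""

    filepath: str
    tabIngest: bool = True  # ingestion stays enabled unless explicitly disabled

    def to_json_data(self) -> dict:
        # "fileName" is an assumed key; "tabIngest" is the field named in this PR.
        return {
            "fileName": self.filepath.split("/")[-1],
            "tabIngest": self.tabIngest,
        }


# Disable tabular ingestion for a batch of tabular files that triggers the error.
files = [
    FileSketch("data/a.csv", tabIngest=False),
    FileSketch("data/b.csv", tabIngest=False),
]
payload = json.dumps([f.to_json_data() for f in files])
```

Keeping the default at `True` means existing callers see no change in behaviour, and only users who hit the lock error need to opt out.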

Update: This PR includes a fix to prevent empty requests to replaceFiles or addFiles as pointed out by @landreev in IQSS/dataverse#11265 (comment)

Update 2: This PR includes fixes for handling ZIP files correctly:

  • Metadata updates are not applied to ZIP files, since the archive itself disappears on the server side once it is unzipped.
  • Double-zipped files are handled as well.
  • If a ZIP contains too many files and the server throws an error, the error is propagated to the user.
  • Test cases have been added to validate the behaviour.
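
The ZIP rules above could be sketched roughly as follows. This is an illustration using only the standard library, not the code in this PR; all helper names are hypothetical, and the sketch only detects a nested archive without claiming how the PR resolves it.

```python
import io
import zipfile


def is_zip(data: bytes) -> bool:
    """Check whether the given bytes form a ZIP archive."""
    return zipfile.is_zipfile(io.BytesIO(data))


def contains_nested_zip(data: bytes) -> bool:
    """True if any member of the archive is itself a ZIP (double-zipped)."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return any(is_zip(archive.read(name)) for name in archive.namelist())


def needs_metadata_update(data: bytes) -> bool:
    # The server unpacks ZIPs, so the archive itself has no entry left to update.
    return not is_zip(data)


def make_zip(members: dict[str, bytes]) -> bytes:
    """Build an in-memory ZIP for demonstration purposes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in members.items():
            archive.writestr(name, content)
    return buffer.getvalue()


inner = make_zip({"table.csv": b"a,b\n1,2\n"})
outer = make_zip({"inner.zip": inner})
```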

@JR-1991 JR-1991 linked an issue Feb 19, 2025 that may be closed by this pull request
* `replaceFiles` is not called when there are no files to replace
* `addFiles` is not called when there are no new files to add
* Both are called when there are both new and replacement files
These checks act as control structures that avoid dataset locks.
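
A rough sketch of this guard, assuming files arrive as dictionaries with a hypothetical `to_replace` flag. The two callables stand in for the requests to `addFiles` and `replaceFiles`; their signatures are assumptions made for illustration.

```python
from typing import Callable, Iterable


def register_files(
    files: Iterable[dict],
    add_files: Callable[[list], None],
    replace_files: Callable[[list], None],
) -> None:
    """Send each request only if there is something to send."""
    files = list(files)  # allow generators, since the input is scanned twice
    replacements = [f for f in files if f.get("to_replace")]
    new_files = [f for f in files if not f.get("to_replace")]

    if replacements:  # skip the empty replaceFiles request
        replace_files(replacements)
    if new_files:  # skip the empty addFiles request
        add_files(new_files)


# Record which endpoints would be hit for a mixed batch.
calls = []
register_files(
    [{"name": "a.csv"}, {"name": "b.csv", "to_replace": True}],
    add_files=lambda batch: calls.append(("addFiles", len(batch))),
    replace_files=lambda batch: calls.append(("replaceFiles", len(batch))),
)
```
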
JR-1991 (Member, Author) commented Mar 18, 2025

The PR has now been updated to accommodate intermediate locks by using tenacity as a general retry solution. The library is applied to both the upload and registration steps to prevent timeouts and lock problems. tenacity was already in use before, but the previous retry strategy was very tight and has now been loosened.
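
To illustrate the idea of a loosened retry strategy, here is a hand-rolled sketch using only the standard library. It is a stand-in for what tenacity provides, not tenacity's API and not the code in this PR; `DatasetLockedError`, `call_with_retry`, and all parameter values are hypothetical.

```python
import time


class DatasetLockedError(Exception):
    """Hypothetical error raised while the dataset is temporarily locked."""


def call_with_retry(func, attempts=5, min_wait=0.01, max_wait=1.0,
                    multiplier=2.0, sleep=time.sleep):
    """Call func, retrying with exponential backoff while the dataset is locked."""
    wait = min_wait
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except DatasetLockedError:
            if attempt == attempts:
                raise  # give up and surface the error to the caller
            sleep(wait)
            wait = min(wait * multiplier, max_wait)  # back off, but cap the wait


attempts_seen = 0


def flaky_register():
    """Fails twice with a lock error, then succeeds."""
    global attempts_seen
    attempts_seen += 1
    if attempts_seen < 3:
        raise DatasetLockedError("dataset is locked")
    return "registered"


result = call_with_retry(flaky_register)
```

A "tight" strategy corresponds to few attempts and short waits; loosening it means raising `attempts` and `max_wait` so that a lock has time to clear.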

Configuring the retry behavior

Furthermore, this PR adds documentation and a dedicated function to control the retry behavior. Since instances can and will differ, users might run into timeouts or other issues, and this function provides a way to accommodate such cases. However, there is no one-size-fits-all solution, and suitable values may need to be determined manually. In most cases, the defaults should work, though.
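
A configuration function of this kind might look roughly like the sketch below. `RetrySettings`, `configure_retry`, `wait_schedule`, and every default value are hypothetical placeholders, not the function or defaults shipped in this PR.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RetrySettings:
    # Placeholder values for illustration only, not the library defaults.
    max_retries: int = 10
    min_wait: float = 1.0   # seconds
    max_wait: float = 60.0  # seconds
    multiplier: float = 2.0


_settings = RetrySettings()


def configure_retry(**overrides) -> RetrySettings:
    """Override selected retry settings after validating the result."""
    global _settings
    candidate = replace(_settings, **overrides)
    if candidate.max_retries < 1 or candidate.multiplier < 1:
        raise ValueError("max_retries and multiplier must be at least 1")
    if not 0 < candidate.min_wait <= candidate.max_wait:
        raise ValueError("expected 0 < min_wait <= max_wait")
    _settings = candidate
    return _settings


def wait_schedule(settings: RetrySettings) -> list:
    """Waiting time before each retry: exponential growth capped at max_wait."""
    return [
        min(settings.min_wait * settings.multiplier ** i, settings.max_wait)
        for i in range(settings.max_retries)
    ]


# A slow instance might need more retries and a longer maximum wait.
slow_instance = configure_retry(max_retries=20, max_wait=120.0)
```

Printing `wait_schedule(slow_instance)` gives a quick view of how long a given configuration would keep retrying, which can help when values have to be determined manually.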

Extensions of test cases

Finally, the PR adds multiple test cases for the native upload. These upload a set of large tabular files to check whether ingest errors occur. For local testing, the size of the tabular files can be configured via TEST_ROWS. On GitHub runners, it defaults to 100_000 for mild stress tests.
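
As an illustration of how a test fixture could honour TEST_ROWS, here is a small sketch. It assumes TEST_ROWS is read as an environment variable; the helper name and the column layout are hypothetical.

```python
import csv
import io
import os


def make_tabular_file(rows: int) -> str:
    """Build CSV content with a header line and the requested number of rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["id", "value"])
    for i in range(rows):
        writer.writerow([i, i * 2])
    return buffer.getvalue()


# Fall back to 100_000 rows when TEST_ROWS is not set.
rows = int(os.environ.get("TEST_ROWS", "100000"))
content = make_tabular_file(rows)
```

Setting a small value such as `TEST_ROWS=1000` keeps local runs fast, while the larger default exercises ingestion more heavily.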

@JR-1991 JR-1991 merged commit 04553d5 into main Apr 16, 2025
12 checks passed
@github-project-automation github-project-automation bot moved this from Ready for Review to Done in PyDataverse Working Group Apr 16, 2025
@JR-1991 JR-1991 deleted the include-tab-ingest branch April 16, 2025 11:07
@bnavigator bnavigator mentioned this pull request Aug 12, 2025
Labels

bug Something isn't working

Projects

Development

Successfully merging this pull request may close these issues.

Handle empty registration cases
500 error during "registering files" step

2 participants