Skip to content

Comments

7812 download all api speedup#7836

Merged
kcondon merged 6 commits intodevelopfrom
7812-download-all-api-speedup
May 3, 2021
Merged

7812 download all api speedup#7836
kcondon merged 6 commits intodevelopfrom
7812-download-all-api-speedup

Conversation

@landreev
Copy link
Contributor

What this PR does / why we need it:
This eliminates a performance issue in the download-all api on datasets with large numbers of versions. It was particularly bad in a configuration with the external zipper enabled. (because it results in one combined idle delay before the redirect is issued).

Which issue(s) this PR closes:

Closes #7812

Special notes for your reviewer:
Please see the issue for more info.

Note that in the download-all-by-version API I left the commands that retrieve the dataset and the version in place (as opposed to replacing them with direct custom query methods). This is because it needs to understand a more complex logic of being able to take both the version numbers and the reserved tokens like :draft, :latest-published etc... and I didn't feel like messing with it. This form of the API does not appear to be used much at all in prod. And if it does get used, this will only cost a 5-10 sec. initial delay - since these lookups only happen once, and are not repeated for every file.

Suggestions on how to test this:
The best test is to replicate the scenario that was actually causing us trouble in production - a download-all via API on one of the covid-rt datasets - on a copy of the prod. db. The easiest way to measure the delay before the redirect is by testing it on the command line with curl:

time curl -i "http://localhost:8080/api/access/dataset/:persistentId/?persistentId=doi:10.7910/DVN/BZ7FPH"

the expected redirect response should look like this:

HTTP/1.1 303 See Other
Server: Payara Server  5.2020.6 #badassfish
X-Powered-By: Servlet/4.0 JSP/2.3 (Payara Server  5.2020.6 #badassfish Java/Oracle Corporation/11)
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: PUT, GET, POST, DELETE, OPTIONS
Access-Control-Allow-Headers: Content-Type, X-Dataverse-Key
Set-Cookie: JSESSIONID=5d836d17373111030d8c3b23c06e; Path=/; HttpOnly
Location: https://rserve.dataverse.harvard.edu/cgi-bin/zipdownload?eb6-59ec74e74b65
Transfer-Encoding: chunked
X-Frame-Options: SAMEORIGIN

With stock 5.4.1 this should cost 100+ seconds on vm5; more in production (i.e., longer than the idle timeout on the LB). It should go down to under a couple of seconds with the fix in place.

Note that if you try to follow a redirect link like the one above on vm5, the actual download is NOT going to work. Because it's pointing to the prod. external zipper, and vm5 cannot authorize downloads there. But the problem we are fixing and testing is the initial delay, so we don't care about the actual zipping.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

…for /api/access/dataset; but needs to be cleaned up/refactored, and we may want to take a closer look at how we actually writing the guestbookresponses more closely. #7812
implements it for the local version of download-all (local, in-application
zipping, w/ no external zipper configured), plus fixes the unrelated
guestbook response issue.
@landreev
Copy link
Contributor Author

There are 2 Jenkins failures; one is related to something I did in the PR - the download all is now returning "404 Not Found" if called on a non-existing dataset; when it used to return "401 Bad Request". Will fix. The other appears to be kinda random... like a timing-related 500/etc. that we get occasionally... but will confirm.

Copy link
Contributor

@scolapasta scolapasta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. @landreev, I know you plan on removing some of the logging after QA, so when you do that can you take a look at the comment I pointed out as well?

DataverseRequest req = createDataverseRequest(user);
final Dataset retrieved = findDatasetOrDie(datasetIdOrPersistentId);
if (!(user instanceof GuestUser)) {
// If it is a guest user, let's not even bother looking up if a draft exists.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment feels a little weird here, since it's in the if that is not guest user and does check for draft

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm. does it feel weird? - it basically explains why that "if" statement is there - as to not bother looking up... etc. but sure, I can change it/or take it outside the "if".

…t it was (for legacy reasons), removed the .info log messages, rewrote a comment per code review. (#7812)
@landreev
Copy link
Contributor Author

Made a commit with the changes as discussed above. Will keep an aye on the jenkins test, will check on that other, random method (edu.harvard.iq.dataverse.api.FileMetadataIT.testJsonParserWithDirectoryLabels) that failed in the last run)

@landreev
Copy link
Contributor Author

landreev commented May 1, 2021

That second test, testJsonParserWithDirectoryLabels, passed after I checked in the fix for the other one. But then I corrected another typo in a comment, which triggered another Jenkins run... and it failed again. So I'm going to assume that it is indeed something intermittent, and not going to sweat it.

@kcondon kcondon self-assigned this May 3, 2021
@kcondon kcondon merged commit bd202ef into develop May 3, 2021
@kcondon kcondon deleted the 7812-download-all-api-speedup branch May 3, 2021 15:41
@djbrooke djbrooke added this to the 5.5 milestone May 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Performace problem in multi-file download API (GuestBookResonse)

4 participants