saw IndirectList failure in this section, simplifying
landreev
left a comment
I am moving the PR into "ready for QA".
I would like to reiterate that I want to help with the QA, especially with the part of measuring the actual performance on production datasets. I wanted to get a head start on that yesterday, but the perf (aka "qa") db has been out of commission.

Overall, regression testing went well. I'm not sure what the status is for the performance tests but you have my approval to merge if things look good on that end.

Could you please sync the branch w/ develop - thanks.

@landreev - I'd suggest setting dataverse.solr.min-files-to-use-proxy to ~200 files for performance testing. Somewhere around there it starts being faster to use the proxy. (By default it's MAXINT and so the 'use-proxy' option is disabled.) It's possible that memory/CPU could change the exact number where things get faster with the proxy.

FWIW: At QDR, without full text indexing, reindexing 9 collections, 685 datasets, ~29660 files took 328 seconds, datasets with 5K-7K files were taking 10-15 seconds or so.

continuous integration and maven tests failing :(

One quick data point: this infamous dataset: http://qa.dataverse.org/dataset.xhtml?persistentId=doi:10.7910/DVN/25833 has been known to consistently freeze and crash the application when indexed during a full or partial reindex (

@ofahimIQSS I am done with the extra testing of the PR.

I'm seeing an error on continuous-integration -

("terminated abnormally" usually means some fluke outside the PR - the new aws instance timing out on startup or similar; I'm seeing that a new Jenkins run is in progress, we'll see)

Happy to see that it succeeded eventually.

Merging PR! |
What this PR does / why we need it: Indexing is slow. This PR speeds up the per-file indexing via multiple changes:
Which issue(s) this PR closes:
Special notes for your reviewer:
Testing at QDR with ~330 datasets containing up to 3K files (~12K files total): indexing now takes <2 minutes, <1 minute for a second run. This includes some additional permissions checks, since QDR allows full-text indexing of restricted files, and was done on our smallest test machine (1GB DV heap). Before the updates, indexing took 6+ hours.
The one ~non-obvious change w.r.t. moving constants out of loops is replacing the datafile.isHarvested call with a dataset.isHarvested constant. If you look in the code, the datafile call just calls owner.isHarvested(), so there's no change in behavior.
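To illustrate why that substitution is behavior-preserving, here is a minimal sketch. The classes `SketchDataset`, `SketchDataFile`, and `HarvestedHoistSketch` are hypothetical stand-ins, not the actual Dataverse entities or the code in this PR; the only thing taken from the description above is that the file-level call delegates to its owner.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the dataset entity (not the real Dataverse class).
class SketchDataset {
    private final boolean harvested;
    int harvestedLookups = 0; // counts calls so the saving is visible

    SketchDataset(boolean harvested) {
        this.harvested = harvested;
    }

    boolean isHarvested() {
        harvestedLookups++;
        return harvested;
    }
}

// Hypothetical stand-in for the file entity.
class SketchDataFile {
    private final SketchDataset owner;

    SketchDataFile(SketchDataset owner) {
        this.owner = owner;
    }

    // Mirrors the delegation described above: the file just asks its owner.
    boolean isHarvested() {
        return owner.isHarvested();
    }
}

class HarvestedHoistSketch {
    // Before: one delegated call per file inside the loop.
    static List<Boolean> perFile(List<SketchDataFile> files) {
        List<Boolean> flags = new ArrayList<>();
        for (SketchDataFile f : files) {
            flags.add(f.isHarvested());
        }
        return flags;
    }

    // After: ask the dataset once and reuse the constant inside the loop.
    static List<Boolean> hoisted(SketchDataset dataset, List<SketchDataFile> files) {
        final boolean harvested = dataset.isHarvested();
        List<Boolean> flags = new ArrayList<>();
        for (SketchDataFile ignored : files) {
            flags.add(harvested);
        }
        return flags;
    }
}
```

Both methods return the same flags as long as every file in the list belongs to the dataset being indexed; the hoisted version simply makes one lookup instead of one per file.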
In general, I tested after each change to see if there was a performance improvement. In some cases the improvement was small (about 10%); in others it was very large. I dropped some commits (via force push) for trials that didn't improve things or caused problems (a parallel stream over files in the IndexServiceBean seems to cause failures in DataTable processing that look like the IndirectList failures we've seen in a couple other places).
What I did not do is go back to see if later changes, like increasing the cache size, made other changes less important. If there's anything concerning in the result, we could potentially try to pull that change out and test performance to see if everything is still useful.
W.r.t. the min-files-to-use-proxy: in the permission loop, we only need the file id, displayName (which comes from the fileMetadata.label for the latest version, regardless of which version you're indexing - possibly a bug), and whether the file is released. For large numbers of files, I created a proxy object with just those three fields that can be retrieved via a query (when the dataset has more than min-files-to-use-proxy files), so that the list of filemetadata and datafile objects for a given version doesn't have to be retrieved. That retrieval appears to happen when you call version.getFileMetadatas(); before that, the filemetadata list appears to be an IndirectList (assuming you don't use findDeep). The setting is slightly misnamed in that the proxy object is also used for small datasets; there it is just constructed from the fileMetadata directly. In testing, I thought I saw some slow-down using the query for very small datasets, but somewhere in the 200-1000 file range, performance improved by using it.
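A rough sketch of the selection logic described above, for reviewers who want the shape of it without reading the diff. Everything here is an assumption for illustration: `FileProxySketch`, `FullFileSketch`, `ProxySelectionSketch`, and the two suppliers are hypothetical names, not the classes or queries in this PR. Only the three fields, the "more than min-files-to-use-proxy" comparison, and the MAXINT default come from the discussion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical three-field proxy: only what the permission loop is said to need.
final class FileProxySketch {
    final long fileId;
    final String displayName;
    final boolean released;

    FileProxySketch(long fileId, String displayName, boolean released) {
        this.fileId = fileId;
        this.displayName = displayName;
        this.released = released;
    }
}

// Hypothetical stand-in for a fully loaded filemetadata/datafile pair.
final class FullFileSketch {
    final long id;
    final String label;
    final boolean released;

    FullFileSketch(long id, String label, boolean released) {
        this.id = id;
        this.label = label;
        this.released = released;
    }
}

final class ProxySelectionSketch {
    // Default mentioned in the thread: MAXINT, so the query path is never taken.
    static final int DEFAULT_MIN_FILES_TO_USE_PROXY = Integer.MAX_VALUE;

    static List<FileProxySketch> proxiesFor(
            int fileCount,
            int minFilesToUseProxy,
            Supplier<List<FileProxySketch>> lightweightQuery, // assumed: selects only the 3 fields
            Supplier<List<FullFileSketch>> fullEntities) {    // assumed: loads the full objects
        if (fileCount > minFilesToUseProxy) {
            // Large dataset: skip loading the full entity list entirely.
            return lightweightQuery.get();
        }
        // Small dataset: still produce proxies, but build them from loaded entities.
        List<FileProxySketch> proxies = new ArrayList<>();
        for (FullFileSketch f : fullEntities.get()) {
            proxies.add(new FileProxySketch(f.id, f.label, f.released));
        }
        return proxies;
    }
}
```

Passing suppliers keeps the expensive load lazy: whichever path is not chosen is never executed, which is the point of the threshold.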
Suggestions on how to test this: regression test, performance test. As this changes indexing and permission indexing, careful testing to ensure that files can be found by category tags, prov text, etc. would be worthwhile, as would verifying that files in draft versions can't be found unless the user has the relevant permissions.
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Is there a release notes update needed for this change?: included
Additional documentation: The only thing added is one new setting which is documented.