Skip to content

Comments

Solr Index Improvements#11374

Merged
ofahimIQSS merged 85 commits intoIQSS:developfrom
GlobalDataverseCommunityConsortium:solr-index-improvements
May 27, 2025
Merged

Solr Index Improvements#11374
ofahimIQSS merged 85 commits intoIQSS:developfrom
GlobalDataverseCommunityConsortium:solr-index-improvements

Conversation

@qqmyers
Copy link
Member

@qqmyers qqmyers commented Mar 26, 2025

What this PR does / why we need it: Indexing is slow. This PR speeds up the per-file indexing via multiple changes:

  • moving dataset-level constants out of the per-file loop.
  • small changes to the differencing algorithm to check whether restriction has changed before digging into any tabular differences and avoids creating one of the differencing details when details aren't required,
  • increase the solr hard commit time (improved performance, possible slower restart time, no impact on how fast a dataset appears (soft index time))
  • use of a small DataFileProxy class for permission indexing, used when the a new jvm option is set for datasets with more than x files - reduces memory use as a large list of filemetadata/datafiles are not kept in memory
  • Use of NamedNativeQueries
  • Use of SqlResultSetMapping to avoid post processing results in Dataverse
  • Comparing any draft and last released filemetadatas for all files in a dataset via one query rather than per-file checks in Dataverse code
  • avoid double loops in comparing variable metadata
  • avoiding instantiating datasets before obtaining an indexing semaphore
  • Using streams instead of for loops
  • avoid calls to services to re-retrieve info from the db (that is already available in the dataset object tree)
  • NamedQuery to find assignees with a permission (via some role) on a given object (versus scanning roleassignments in code to find ones where the role has the right permission)
  • moved loops over versions out of the per-file methods
  • remove the deprecated unpublishedDataRelatedToMeModeEnabled and if statements that were always true
  • Increased the eclipselink cache sizes for filemetadata and datafiles to 5K and generally to 1K
  • avoid findDeep on datasets

Which issue(s) this PR closes:

  • Closes #

Special notes for your reviewer:

Testing at QDR with ~330 datasets containing up to 3K files (~12K files total): indexing now takes <2 minutes, <1 minute for a second run. (This includes some additional permissions checks since QDR allows full-text indexing of restricted files, and was done on our smallest test machine (1GB DV heap). Before the updates indexing took 6+ hours.)

The one ~non-obvious change w.r.t. moving constants out of loops is removing the datafile.isHarvested call with a dataset.isHarvested constant. If you look in the code, the datafile call just calls owner.isHarvested() so there's not change.

In general, I tested after each change to see if there was a performance improvement. In some cases the change was small - 10% and others were very large. I rejected and force pushed to remove some commits for trials that didn't improve things or caused problems (a parallel stream over files in the IndexServiceBean seems to cause failures in DataTable processing that look like the IndirectList failures we've seen in a couple other places).

What I did not do is go back to see if later changes, like increasing the cache size, made other changes less important. If there's anything concerning in the result, we could potentially try to pull that change out and test performance to see if everything is still useful.

W.r.t. the min-files-to-use-proxy: in the permission loop, we only need the file id, displayName (which comes from the fileMetadata.label for the latest version, regardless of which version you're indexing - possibly a bug), and whether the file is released. For large numbers of files, I created a proxy object with just those three fields, that can be retrieved via a query (when the dataset has more than min-files-to-use-proxy files) so that the list of filemetadata and datafile objects for a given version don't have to be retrieved (which appears to happen when you call version.getFileMetadatas() - before that it appears the filemetadata list is an IndirectList (assuming you don't use findDeep)). The setting is slightly misnamed in that the proxy object is also used for small datasets, it is just constructed from the fileMetadata directly. In testing, I thought I saw some slow-down using the query for very small datasets but somewhere in the 200-1000 file range, performance improved by using it.

Suggestions on how to test this: regression test, performance test. As this changes indexing and permission indexing, careful testing to assure that files can be found by category tags, prov text etc. would be worthwhile, as would verifying that files in draft versions can't be found unless the user has relevant permissions.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?: included

Additional documentation: The only thing added is one new setting which is documented.

@qqmyers qqmyers marked this pull request as ready for review March 26, 2025 15:30
@qqmyers qqmyers moved this to Ready for Triage in IQSS Dataverse Project Mar 26, 2025
@qqmyers qqmyers added the Size: 3 A percentage of a sprint. 2.1 hours. label Mar 26, 2025
@qqmyers qqmyers added this to the 6.7 milestone Mar 26, 2025
@coveralls
Copy link

coveralls commented Mar 26, 2025

Coverage Status

coverage: 23.122% (+0.04%) from 23.081%
when pulling 359c153 on GlobalDataverseCommunityConsortium:solr-index-improvements
into c4379a0 on IQSS:develop.

Copy link
Contributor

@landreev landreev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am moving the PR into "ready for QA".
I would like to reiterate that I want to help with the QA, especially with the part of measuring the actual performance on production datasets. I wanted to get a head start on that yesterday, but the perf (aka "qa") db has been out of commission.

@github-project-automation github-project-automation bot moved this from In Review 🔎 to Ready for QA ⏩ in IQSS Dataverse Project Apr 22, 2025
@landreev landreev removed their assignment Apr 22, 2025
@cmbz cmbz added the FY25 Sprint 22 FY25 Sprint 22 (2025-04-23 - 2025-05-07) label Apr 23, 2025
@ofahimIQSS ofahimIQSS self-assigned this Apr 29, 2025
@ofahimIQSS ofahimIQSS moved this from Ready for QA ⏩ to QA ✅ in IQSS Dataverse Project Apr 29, 2025
@cmbz cmbz added the FY25 Sprint 23 FY25 Sprint 23 (2025-05-07 - 2025-05-21) label May 7, 2025
@ofahimIQSS
Copy link
Contributor

Overall, regression testing went well. I'm not sure what the status is for the performance tests but you have my approval to merge if things look good on that end.

@landreev
Copy link
Contributor

Could you please sync the branch w/ develop - thanks.

@qqmyers
Copy link
Member Author

qqmyers commented May 20, 2025

@landreev - I'd suggest setting dataverse.solr.min-files-to-use-proxy to ~200 files for performance testing. Somewhere around there it starts being faster to use the proxy. (By default it's MAXINT and so the 'use-proxy' option is disabled.) It's possible that memory/cpu could change the exact number where things get faster with the proxy.

@qqmyers
Copy link
Member Author

qqmyers commented May 20, 2025

FWIW: At QDR, without full text indexing, reindexing 9 collections, 685 datasets, ~29660 files took 328 seconds, datasets with 5K-7K files were taking 10-15 seconds or so.

@ofahimIQSS
Copy link
Contributor

continuous integration and maven tests failing :(

@landreev
Copy link
Contributor

One quick data point: this infamous dataset: http://qa.dataverse.org/dataset.xhtml?persistentId=doi:10.7910/DVN/25833 has been known to consistently freeze and crash the application when indexed during a full or partial reindex (/api/admin/index or /api/admin/index/continue). This does not happen in this branch, in fact the dataset gets reindexed in some milliseconds (!). In other words, this appears to prove the theory that it was the expanded database query in the findDeep() method (no longer used in this branch) that was causing the application to run out of memory.

@landreev
Copy link
Contributor

@ofahimIQSS I am done with the extra testing of the PR.
I will add more details on the results and/or post in dv-tech. But it is ready to be merged whenever.
The short version is that the performance on production data makes it possible to run a full reindex in place, without having to do so offsite. Which is the best result that could be expected.

@ofahimIQSS
Copy link
Contributor

I'm seeing an error on continuous-integration -
Ansible run terminated abnormally, failing build.

@landreev landreev removed their assignment May 27, 2025
@landreev
Copy link
Contributor

("terminated abnormally" usually means some fluke outside the PR - the new aws instance timing out on startup or similar; I'm seeing that a new Jenkins run is in progress, we'll see)

@landreev
Copy link
Contributor

Happy to see that it succeeded eventually.

@ofahimIQSS
Copy link
Contributor

Merging PR!

@ofahimIQSS ofahimIQSS merged commit f91e75d into IQSS:develop May 27, 2025
23 of 24 checks passed
@github-project-automation github-project-automation bot moved this from QA ✅ to Merged 🚀 in IQSS Dataverse Project May 27, 2025
@ofahimIQSS ofahimIQSS removed their assignment May 27, 2025
@pdurbin pdurbin moved this from Merged 🚀 to Done 🧹 in IQSS Dataverse Project May 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

FY25 Sprint 20 FY25 Sprint 20 (2025-03-26 - 2025-04-09) FY25 Sprint 21 FY25 Sprint 21 (2025-04-09 - 2025-04-23) FY25 Sprint 22 FY25 Sprint 22 (2025-04-23 - 2025-05-07) FY25 Sprint 23 FY25 Sprint 23 (2025-05-07 - 2025-05-21) FY25 Sprint 24 FY25 Sprint 24 (2025-05-21 - 2025-06-04) Size: 10 A percentage of a sprint. 7 hours.

Projects

Status: Done 🧹

Development

Successfully merging this pull request may close these issues.

5 participants