Solr Index Improvements by qqmyers · Pull Request #11374 · IQSS/dataverse

qqmyers · 2025-03-26T14:49:53Z

What this PR does / why we need it: Indexing is slow. This PR speeds up the per-file indexing via multiple changes:

Moving dataset-level constants out of the per-file loop.
Small changes to the differencing algorithm: checking whether restriction has changed before digging into any tabular differences, and not creating one of the differencing details when details aren't required.
Increasing the Solr hard commit time (improved performance, possibly slower restart time, no impact on how fast a dataset appears (soft index time)).
Using a small DataFileProxy class for permission indexing, used when a new JVM option is set, for datasets with more than x files. This reduces memory use because a large list of filemetadata/datafiles is not kept in memory.
Using NamedNativeQueries.
Using SqlResultSetMapping to avoid post-processing results in Dataverse.
Comparing any draft and last released filemetadatas for all files in a dataset via one query rather than per-file checks in Dataverse code.
Avoiding double loops when comparing variable metadata.
Avoiding instantiating datasets before obtaining an indexing semaphore.
Using streams instead of for loops.
Avoiding calls to services to re-retrieve info from the db that is already available in the dataset object tree.
Using a NamedQuery to find assignees with a permission (via some role) on a given object, versus scanning roleassignments in code to find ones where the role has the right permission.
Moving loops over versions out of the per-file methods.
Removing the deprecated unpublishedDataRelatedToMeModeEnabled and if statements that were always true.
Increasing the EclipseLink cache sizes for filemetadata and datafiles to 5K and generally to 1K.
Avoiding findDeep on datasets.
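
As an illustration of one item in the list, avoiding double loops when comparing variable metadata, here is a minimal sketch. The `VarMeta` and `VarMetaDiff` classes and the `changedVariableIds` method are hypothetical stand-ins, not the classes touched by this PR. The idea is to index one side by id in a map so that each comparison is a single lookup rather than a scan of the other list.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for a variable-metadata record; not a Dataverse class.
final class VarMeta {
    final long variableId;
    final String label;

    VarMeta(long variableId, String label) {
        this.variableId = variableId;
        this.label = label;
    }
}

final class VarMetaDiff {
    // Returns the ids of variables whose label differs between the two lists,
    // or that exist in only one of them. Runs in O(n + m) instead of O(n * m).
    static List<Long> changedVariableIds(List<VarMeta> released, List<VarMeta> draft) {
        Map<Long, VarMeta> releasedById = new HashMap<>();
        for (VarMeta vm : released) {
            releasedById.put(vm.variableId, vm);
        }
        List<Long> changed = new ArrayList<>();
        for (VarMeta d : draft) {
            // Single map lookup replaces the inner loop over the released list.
            VarMeta r = releasedById.remove(d.variableId);
            if (r == null || !Objects.equals(r.label, d.label)) {
                changed.add(d.variableId);
            }
        }
        // Anything still in the map was present when released but is gone in the draft.
        changed.addAll(releasedById.keySet());
        return changed;
    }
}
```

The same pattern applies to any pairwise comparison keyed by an id; whether the actual change in this PR uses a map or another structure is not stated in the description.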

Which issue(s) this PR closes:

Closes #

Special notes for your reviewer:

Testing at QDR with ~330 datasets containing up to 3K files (~12K files total): indexing now takes <2 minutes, and <1 minute for a second run. This includes some additional permission checks, since QDR allows full-text indexing of restricted files, and was done on our smallest test machine (1GB DV heap). Before the updates, indexing took 6+ hours.

The one ~non-obvious change w.r.t. moving constants out of loops is replacing the datafile.isHarvested call with a dataset.isHarvested constant. If you look in the code, the datafile call just calls owner.isHarvested(), so there's no change in behavior.
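
A minimal sketch of that hoisting, under stated assumptions: `SketchDataset`, `SketchDataFile`, and `HoistSketch` are simplified, hypothetical stand-ins for the real entities and indexing code, and the call counter exists only to make the effect visible.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins for the Dataverse entities.
final class SketchDataset {
    private final boolean harvested;
    int isHarvestedCalls = 0; // counts calls, only to make the effect visible

    SketchDataset(boolean harvested) {
        this.harvested = harvested;
    }

    boolean isHarvested() {
        isHarvestedCalls++;
        return harvested;
    }
}

final class SketchDataFile {
    private final SketchDataset owner;
    final String name;

    SketchDataFile(SketchDataset owner, String name) {
        this.owner = owner;
        this.name = name;
    }

    // Mirrors the delegation described above: the file just asks its owner.
    boolean isHarvested() {
        return owner.isHarvested();
    }
}

final class HoistSketch {
    // Builds one "document" string per file, reading the dataset-level flag once.
    static List<String> buildDocs(SketchDataset dataset, List<SketchDataFile> files) {
        final boolean harvested = dataset.isHarvested(); // hoisted out of the loop
        List<String> docs = new ArrayList<>();
        for (SketchDataFile f : files) {
            docs.add(f.name + ":harvested=" + harvested);
        }
        return docs;
    }
}
```

Because every file in a dataset shares the same owner, the per-file call and the hoisted constant always agree; the loop simply stops repeating the lookup.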

In general, I tested after each change to see if there was a performance improvement. In some cases the change was small (~10%) and in others it was very large. I rejected and force pushed to remove some commits for trials that didn't improve things or caused problems (a parallel stream over files in the IndexServiceBean seems to cause failures in DataTable processing that look like the IndirectList failures we've seen in a couple of other places).

What I did not do is go back to see if later changes, like increasing the cache size, made other changes less important. If there's anything concerning in the result, we could potentially try to pull that change out and test performance to see if everything is still useful.

W.r.t. min-files-to-use-proxy: in the permission loop, we only need the file id, the displayName (which comes from the fileMetadata.label for the latest version, regardless of which version you're indexing - possibly a bug), and whether the file is released. For large numbers of files, I created a proxy object with just those three fields that can be retrieved via a query (when the dataset has more than min-files-to-use-proxy files), so that the list of filemetadata and datafile objects for a given version doesn't have to be retrieved. (That retrieval appears to happen when you call version.getFileMetadatas(); before that, the filemetadata list appears to be an IndirectList, assuming you don't use findDeep.) The setting is slightly misnamed in that the proxy object is also used for small datasets; there it is just constructed from the fileMetadata directly. In testing, I thought I saw some slow-down using the query for very small datasets, but somewhere in the 200-1000 file range performance improved by using it.
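
The proxy idea can be sketched roughly as follows. This is an illustration, not the PR's DataFileProxy: `FileProxySketch`, `LoadedFileMetadata`, and `ProxySource` are hypothetical names, and the database query is represented by a function argument so the example stays self-contained.

```java
import java.util.List;
import java.util.function.LongFunction;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Hypothetical three-field proxy, modeled on the description above.
final class FileProxySketch {
    final long fileId;
    final String displayName;
    final boolean released;

    FileProxySketch(long fileId, String displayName, boolean released) {
        this.fileId = fileId;
        this.displayName = displayName;
        this.released = released;
    }
}

// Hypothetical stand-in for a fully loaded file metadata entity.
final class LoadedFileMetadata {
    final long fileId;
    final String label;
    final boolean released;

    LoadedFileMetadata(long fileId, String label, boolean released) {
        this.fileId = fileId;
        this.label = label;
        this.released = released;
    }
}

final class ProxySource {
    // Chooses how to obtain proxies for one dataset version.
    // queryProxies stands in for a query returning only the three fields;
    // loadEntities stands in for loading the full entity list.
    static List<FileProxySketch> proxiesFor(
            long versionId,
            long fileCount,
            int minFilesToUseProxy,
            Supplier<List<LoadedFileMetadata>> loadEntities,
            LongFunction<List<FileProxySketch>> queryProxies) {
        if (fileCount > minFilesToUseProxy) {
            // Large dataset: never materialize the full entity list.
            return queryProxies.apply(versionId);
        }
        // Small dataset: build the same proxy type directly from the entities.
        return loadEntities.get().stream()
                .map(fm -> new FileProxySketch(fm.fileId, fm.label, fm.released))
                .collect(Collectors.toList());
    }
}
```

Either branch hands the permission loop the same lightweight type, which is why the proxy is used for small datasets as well; only the way it is populated changes at the threshold.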

Suggestions on how to test this: regression test, performance test. As this changes indexing and permission indexing, careful testing to ensure that files can be found by category tags, prov text, etc. would be worthwhile, as would verifying that files in draft versions can't be found unless the user has relevant permissions.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?: included

Additional documentation: The only thing added is one new setting which is documented.

coveralls · 2025-03-26T15:39:04Z

coverage: 23.122% (+0.04%) from 23.081%
when pulling 359c153 on GlobalDataverseCommunityConsortium:solr-index-improvements
into c4379a0 on IQSS:develop.

landreev

I am moving the PR into "ready for QA".
I would like to reiterate that I want to help with the QA, especially with the part of measuring the actual performance on production datasets. I wanted to get a head start on that yesterday, but the perf (aka "qa") db has been out of commission.

ofahimIQSS · 2025-05-13T18:54:15Z

Overall, regression testing went well. I'm not sure what the status is for the performance tests but you have my approval to merge if things look good on that end.

landreev · 2025-05-19T20:46:36Z

Could you please sync the branch w/ develop - thanks.

qqmyers · 2025-05-20T13:58:13Z

@landreev - I'd suggest setting dataverse.solr.min-files-to-use-proxy to ~200 files for performance testing. Somewhere around there it starts being faster to use the proxy. (By default it's MAXINT and so the 'use-proxy' option is disabled.) It's possible that memory/cpu could change the exact number where things get faster with the proxy.
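
To make the default behavior concrete, here is a minimal sketch of how such a threshold could be resolved and applied. It assumes the setting is readable as a plain JVM system property; how Dataverse actually looks the value up is not shown in this thread, and `ProxyThreshold` is a hypothetical helper.

```java
// Hypothetical helper; assumes the setting is exposed as a JVM system property.
final class ProxyThreshold {
    static final String PROPERTY = "dataverse.solr.min-files-to-use-proxy";

    // Falls back to Integer.MAX_VALUE when unset, matching the default described above.
    static int resolve() {
        return Integer.getInteger(PROPERTY, Integer.MAX_VALUE);
    }

    // The query-backed proxy is used only when the dataset has more files than the threshold.
    static boolean useProxyQuery(long fileCount, int threshold) {
        return fileCount > threshold;
    }
}
```

With the property unset, no realistic file count exceeds the default, so the query path stays off; starting the JVM with `-Ddataverse.solr.min-files-to-use-proxy=200` would turn it on for datasets with more than 200 files.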

qqmyers · 2025-05-20T18:05:08Z

FWIW: At QDR, without full-text indexing, reindexing 9 collections, 685 datasets, and ~29660 files took 328 seconds; datasets with 5K-7K files were taking 10-15 seconds or so.

ofahimIQSS · 2025-05-20T18:10:36Z

continuous integration and maven tests failing :(

landreev · 2025-05-21T14:55:54Z

One quick data point: this infamous dataset: http://qa.dataverse.org/dataset.xhtml?persistentId=doi:10.7910/DVN/25833 has been known to consistently freeze and crash the application when indexed during a full or partial reindex (/api/admin/index or /api/admin/index/continue). This does not happen in this branch; in fact, the dataset gets reindexed in some milliseconds (!). In other words, this appears to prove the theory that it was the expanded database query in the findDeep() method (no longer used in this branch) that was causing the application to run out of memory.

landreev · 2025-05-27T13:33:37Z

@ofahimIQSS I am done with the extra testing of the PR.
I will add more details on the results and/or post in dv-tech. But it is ready to be merged whenever.
The short version is that the performance on production data makes it possible to run a full reindex in place, without having to do so offsite. Which is the best result that could be expected.

ofahimIQSS · 2025-05-27T14:14:44Z

I'm seeing an error on continuous-integration -
Ansible run terminated abnormally, failing build.

landreev · 2025-05-27T15:03:42Z

("terminated abnormally" usually means some fluke outside the PR - the new aws instance timing out on startup or similar; I'm seeing that a new Jenkins run is in progress, we'll see)

landreev · 2025-05-27T15:58:28Z

Happy to see that it succeeded eventually.

ofahimIQSS · 2025-05-27T16:03:37Z

Merging PR!

qqmyers added 7 commits March 26, 2025 08:47

add debug index logging

1ab8b57

use loop constants, etc.

8cf78c4

minimize work when details false, check restrict earlier/simplier

1e43490

really fix test

3b746f7

simplify - fix restrict bug

f23a274

release note

8f89906

fix compile issue, additional tweaks

17cd5b5

qqmyers marked this pull request as ready for review March 26, 2025 15:30

qqmyers added this to IQSS Dataverse Project Mar 26, 2025

qqmyers moved this to Ready for Triage in IQSS Dataverse Project Mar 26, 2025

qqmyers added the "Size: 3" label (a percentage of a sprint; 2.1 hours) Mar 26, 2025

qqmyers added this to the 6.7 milestone Mar 26, 2025

qqmyers added 17 commits March 28, 2025 12:27

try parallel file loop

fb36f3b

fix NPE and final issues

a8e5476

try finddeep

646bb83

avoid double loop

612e521

diff by query

0d6f7be

numeric params

e2d4e98

fix merge issues, change doFullText logic

85425e2

formatting

985227b

restore indexing of released files

3d2c408

delay getting dataset until semaphore is available

a649937

restore transaction, don't finddeep

1b2548a

simplify ToU logic

9deef72

avoid keeping files in List

9e5ea00

change dataset case too

b7924a3

avoid variableservice

6f6e32e

saw IndirectList failure in this section, simplifying

try EAGER

dfbf603

avoid isTabularData

7e508b6

landreev approved these changes Apr 22, 2025

github-project-automation bot moved this from In Review 🔎 to Ready for QA ⏩ in IQSS Dataverse Project Apr 22, 2025

landreev removed their assignment Apr 22, 2025

cmbz added the FY25 Sprint 22 (2025-04-23 - 2025-05-07) label Apr 23, 2025

ofahimIQSS self-assigned this Apr 29, 2025

ofahimIQSS moved this from Ready for QA ⏩ to QA ✅ in IQSS Dataverse Project Apr 29, 2025

pdurbin mentioned this pull request Apr 30, 2025

Guides: improve navigation #10942

Merged

scolapasta assigned landreev May 7, 2025

cmbz added the FY25 Sprint 23 (2025-05-07 - 2025-05-21) label May 7, 2025

qqmyers mentioned this pull request May 15, 2025

Full-text Indexing Fixes #11494

Merged

Merge remote-tracking branch 'IQSS/develop' into solr-index-improvements

359c153

ofahimIQSS mentioned this pull request May 22, 2025

Blocked Build Due to Unavailable NetCDF Library (edu.ucar:cdm-core:jar:5.5.3) #11511

Closed

cmbz added the FY25 Sprint 24 (2025-05-21 - 2025-06-04) label May 22, 2025

landreev removed their assignment May 27, 2025

ofahimIQSS merged commit f91e75d into IQSS:develop May 27, 2025
23 of 24 checks passed

github-project-automation bot moved this from QA ✅ to Merged 🚀 in IQSS Dataverse Project May 27, 2025

ofahimIQSS removed their assignment May 27, 2025

pdurbin moved this from Merged 🚀 to Done 🧹 in IQSS Dataverse Project May 28, 2025

qqmyers mentioned this pull request Aug 29, 2025

New files in later versions not being indexed #11776

Closed

5 participants
