
8097 improve index speed for many files#8152

Merged
kcondon merged 28 commits into develop from
8097-improve-index-speed-for-many-files
Nov 29, 2021

Conversation

@sekmiller
Contributor

@sekmiller sekmiller commented Oct 14, 2021

What this PR does / why we need it:
[Edit: please see the release note/last comments for what's in the PR. It contains an agreed-upon "incremental improvement" - does not really speed up the overall time significantly, but makes a from-scratch reindexing of an installation like ours less traumatic. The investigation of the root cause of the performance degradation will continue in the newly opened #8256. - L.A.]

Speeds up re-indexing of datasets with a large number of files, in particular when there is a draft version of the dataset. (Also fixes an inefficiency where the metadata comparison was run twice: once in the constructor and then again to get the boolean result.) The underlying reason is that the indexer only creates a Solr doc for the draft version of a file if there's a difference in the file metadata, and comparing the two versions of the file metadata slows down considerably when there is a large number of files.

Which issue(s) this PR closes:

Closes #8097: 8 hours to index a dataset with thousands of files.

Special notes for your reviewer: The big performance gain comes from indexing the draft version after indexing the published version. In a test against a copy of prod the dataset in question went from 8+ hours to under 45 minutes. Smaller gains were accomplished by doing a direct lookup of the comparable fmd version and short circuiting the comparison once a difference was found.

Suggestions on how to test this: re-index a dataset with a large number of files and see that the index acts as expected and runs faster.

Does this PR introduce a user interface change? If mockups are available, please link/include them here: no

Is there a release notes update needed for this change?: no

Additional documentation: none

@landreev landreev self-requested a review October 15, 2021 13:34
@landreev landreev self-assigned this Oct 15, 2021
@coveralls

coveralls commented Oct 18, 2021

Coverage Status

Coverage decreased (-0.009%) to 18.916% when pulling 2675397 on 8097-improve-index-speed-for-many-files into a70c1c4 on develop.

Contributor

@landreev landreev left a comment

Very happy about the 10X speed improvement. Some very good catches in this PR.
I feel like I should've been more proactive with my offer to help count the actual database queries before/after while you were working on this. The fact that processing the published version first has made such a big difference by itself makes me wonder why exactly; as I mentioned, I'm wondering if we were re-instantiating the entire 25K list of filemetadatas in the published version 25K times (!) otherwise. So I'd like to confirm that now.

) {
logger.fine("Checking if this file metadata is a duplicate.");
if (fileMetadata.getDataFile() != null) {
FileMetadata findReleasedFileMetadata = dataFileService.findFileMetadataByDatasetVersionIdAndDataFileId(dataset.getReleasedVersion().getId(), fileMetadata.getDataFile().getId());
Member

The big win here is no longer doing an n**2 nested loop, but the replacement does n db queries. Would it be faster still to just run once through dataset.getReleasedVersion().getFileMetadatas() and make a map of file id to fileMetadata to use to find the matching filemetadata?
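The map-based lookup being suggested could look something like the sketch below, using a simplified stand-in for the FileMetadata entity (the record fields and class names here are illustrative, not the real Dataverse classes): one pass over the released version's filemetadatas builds an in-memory index, and each draft-side lookup then becomes a constant-time map get instead of a database query.

```java
import java.util.*;
import java.util.stream.*;

public class ReleasedMetadataLookup {
    // Hypothetical simplification of a file metadata entity, keyed by its DataFile id.
    record FileMetadata(long dataFileId, String label) {}

    // One O(n) pass builds the index; subsequent lookups are O(1) and in-memory.
    static Map<Long, FileMetadata> buildReleasedIndex(List<FileMetadata> releasedFmds) {
        return releasedFmds.stream()
            .collect(Collectors.toMap(FileMetadata::dataFileId, fmd -> fmd));
    }

    public static void main(String[] args) {
        List<FileMetadata> released = List.of(
            new FileMetadata(1L, "data.tab"),
            new FileMetadata(2L, "readme.txt"));
        Map<Long, FileMetadata> index = buildReleasedIndex(released);
        // Draft-side loop would do this per file: a map get, not a db query.
        FileMetadata match = index.get(2L);
        System.out.println(match.label());
    }
}
```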

Contributor

@qqmyers That was my immediate concern too (Steve and I talked about this directly yesterday).
The 25K calls to findFileMetadataByDatasetVersionIdAndDataFileId() = 125K extra database queries, that are not necessarily cheap. The extra (*n) in the original loop would be iterating through objects already looked up and instantiated in memory... and that should not be that expensive - ? The only possible explanation I could think of how the above could be an improvement was if somehow that line dataset.getReleasedVersion().getFileMetadatas() in the original code was actually looking up and instantiating the entire list of filemetadatas on every iteration (!). What Steve said in the PR description, that changing the indexing order, and processing the released version first actually made a huge difference made that crazy guess a little more plausible. I mentioned it in my comment above.

I ran some tests on a prod. db copy last night, and got some strange results. We'll revisit this when Steve is back.

Contributor

@qqmyers

> Would it be faster still to just run once through dataset.getReleasedVersion().getFileMetadatas() and make a map of file id to fileMetadata to use to find the matching filemetadata?

If this is really necessary, this lookup of the filemetadata in the released version could be made more efficient, yes. A map was the first thing that came to my mind too. But there should be a simpler way still. Like, sorting both lists of FileMetadatas in the 2 versions by datafile id, then making a pass through them and selecting a list of file ids that need to be indexed, for example? - that's one n loop.
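The sort-and-parallel-pass idea might be sketched roughly like this (the entity names and the equality check are simplified stand-ins, not the real implementation): both lists are sorted by datafile id, then a single merge-style pass collects the ids of draft files whose metadata differs from, or is absent in, the released version.

```java
import java.util.*;

public class ParallelPassDiff {
    // Hypothetical simplification of a file metadata entity.
    record Fmd(long dataFileId, String label) {}

    // Returns datafile ids present in the draft whose metadata differs from
    // (or is absent in) the released version: two sorts plus one O(n) merge pass.
    static List<Long> changedIds(List<Fmd> draft, List<Fmd> released) {
        List<Fmd> d = new ArrayList<>(draft);
        List<Fmd> r = new ArrayList<>(released);
        Comparator<Fmd> byId = Comparator.comparingLong(Fmd::dataFileId);
        d.sort(byId);
        r.sort(byId);
        List<Long> changed = new ArrayList<>();
        int j = 0;
        for (Fmd df : d) {
            // advance the released pointer past ids smaller than the draft id
            while (j < r.size() && r.get(j).dataFileId() < df.dataFileId()) j++;
            if (j < r.size() && r.get(j).dataFileId() == df.dataFileId()) {
                // stand-in for the real field-by-field metadata comparison
                if (!df.label().equals(r.get(j).label())) changed.add(df.dataFileId());
            } else {
                changed.add(df.dataFileId()); // file is new in the draft
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        List<Fmd> released = List.of(new Fmd(1, "a.tab"), new Fmd(2, "b.txt"));
        List<Fmd> draft = List.of(new Fmd(2, "b-renamed.txt"), new Fmd(3, "c.csv"), new Fmd(1, "a.tab"));
        System.out.println(changedIds(draft, released));
    }
}
```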

Member

You'd have sorting costs though :-). In any case, if there's gain to be had I'd guess it would be from avoiding db calls, and any local O(n) solution would be good.

Contributor

(doh, with that approach we would not even need an extra pass/a saved list of ids - we could just sort both lists, then iterate through both of them in parallel as we index)

Contributor

OCD or ADHD, I had to measure it... that brute force iteration through the java List of filemetadatas costs a total of 30+ seconds when indexing the dataset. So, 10% of the time it takes to index it under normal circumstances.


if (this.originalFileMetadata == null && this.newFileMetadata.getDataFile() != null ){
//File Added
if (!details) return false;
Member

is there any way through the series of if statements where you don't end up returning false if ! details? Just wondering if you can do this check once.

public boolean compareContent(FileMetadata other){
    FileVersionDifference diffObj = new FileVersionDifference(this, other, false);
-   return diffObj.compareMetadata(this, other);
+   return diffObj.isSame();
Member

does this help? Seems like you create a new FileVersionDifference and call the compareMetadata method once either way since diffObject isn't reused.

Contributor

@qqmyers I thought the point was that the comparison was already performed once inside the constructor; so the saving was from not performing it twice.
But I'll let Steve give an authoritative answer when he's back.
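For illustration, the pattern under discussion (the comparison runs once in the constructor and callers read back the cached result) might look like this minimal sketch; the class and method names mirror the real ones, but the comparison itself is a trivial stand-in:

```java
public class FileVersionDifferenceSketch {
    private final boolean same;

    public FileVersionDifferenceSketch(String original, String updated) {
        // the expensive comparison runs exactly once, here in the constructor
        this.same = compareMetadata(original, updated);
    }

    // stand-in for the real field-by-field metadata comparison;
    // could plausibly be private now that callers use isSame()
    private boolean compareMetadata(String a, String b) {
        return a.equals(b);
    }

    public boolean isSame() {
        return same; // cached result; no second comparison
    }

    public static void main(String[] args) {
        System.out.println(new FileVersionDifferenceSketch("x", "x").isSame());
        System.out.println(new FileVersionDifferenceSketch("x", "y").isSame());
    }
}
```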

Member

Yep - got it. And other places the FileVersionDifference is kept around for a while. Could/should compareMetadata() be a private method now?

@landreev landreev removed their assignment Oct 20, 2021
@landreev
Contributor

One extra data point:
(as always, any observations reported need to be verified)
We don't fully understand what's going on. But we have confirmed that the dataset above takes a very long time to index ONLY in the context of index-all. It takes a much more manageable time when indexed on its own directly. (/api/admin/index vs. /api/admin/index/dataset?persistentId=).

However, the difference between the 2 cases above may not have anything to do with how much data in other datasets has been processed by the time index-all gets to this dataset. I.e., the difference in performance appears to be present even when "all" in "index-all" is just this one dataset. The experiment illustrating the point is to have this dataset be the only one in the database with indextime=NULL, then run /api/admin/index/continue. This goes through the same code path as index-all, but only for unindexed datasets.
And the only difference in the code that I'm seeing right away is that indexall/continue is performed asynchronously; AND that it calls indexService.indexDatasetInNewTransaction(). While /api/admin/index/dataset calls indexService.indexDataset().

My gut feeling has always been that whatever was making datasets with too many files unmanageable had something to do with transactions. But we do need something more solid than that.

There were most definitely some algorithmic inefficiencies inside the indexDataset methods (like the comparison methods that have already been looked into). It's not clear however how much that has to do with the awful performance of index-all on datasets like this one. Case in point, the dataset 10.7910/DVN/ZYKVED - it has even more files; but only one version, the draft. Indexing it should not involve any comparisons (?), but it appears to be taking a very long time too. So could it really be a function of the sheer number of child objects being forced into the transaction context? That in turn may also be a function of something about the way we instantiate these objects inside the method.

I'll point out that at the very least we have a workaround for index-all specifically; even if it's bulky. I.e., knowing that datasets like the one above can be indexed in a more manageable way outside of index-all, that's what we could be doing. But we absolutely need to figure out what's going on, since this appears to be the same issue that makes such datasets unmanageable in other ways - takes forever to edit, save, etc.

@landreev
Contributor

A followup update:

> And the only difference in the code that I'm seeing right away is that indexall/continue is performed asynchronously; AND that it calls indexService.indexDatasetInNewTransaction(). While /api/admin/index/dataset calls indexService.indexDataset().

Actually, there's a 3rd difference, which appears to be the crucial one: indexService.indexDataset() takes a Dataset object as an argument; while indexService.indexDatasetInNewTransaction() takes a Long dataset id and instantiates the dataset inside.

Here's the experiment:

We know that indexing 3CTMKP via /api/admin/index/dataset takes minutes in the develop branch. This is the API code:

@GET
@Path("dataset")
public Response indexDatasetByPersistentId(@QueryParam("persistentId") String persistentId) {
     ...
     Future<String> indexDatasetFuture = indexService.indexDataset(dataset, doNormalSolrDocCleanUp);    

It appears that all it takes to make it take hours instead is to change the last line to

     Future<String> indexDatasetFuture = indexService.indexDatasetInNewTransaction(dataset.getId());

So we are omitting the @Asynchronous code in IndexBatchServiceBean that index-all calls; as it does not appear to play a role in the performance degradation.
The fact that indexDatasetInNewTransaction() has the @TransactionAttribute(REQUIRES_NEW) attribute and indexDataset() doesn't should NOT be making any difference in this scenario. Because the API bean Index.java is not an EJB, so the method indexDatasetByPersistentId() does not have a transaction context of its own. Meaning a new transaction is created when it calls either of the two IndexServiceBean methods above, regardless of the transaction attributes there. (Does this make sense? Is it actually correct?)

So... it appears that what makes the crucial difference is whether the dataset entity is instantiated within, or outside of the working transaction. This must be some known and documented EJB behavior at play. So, stackoverflow is our friend?

(I tested the above; but, as always, do not take my word for it necessarily)

@landreev
Contributor

And another followup:
The reason we pass the id of the dataset to indexDatasetInNewTransaction(): we made that change on purpose (I did, in 2018), to address a memory leak-like behavior. Before it was made, the dataset was instantiated in the top-level method in IndexBatchServiceBean, but that apparently resulted in something being kept in memory, and growing significantly, until the end of index-all - which of course takes a very long time.
I expect that it should be possible to go back to passing a dataset entity and still be able to clear everything from memory right away. But I also really want to find out for sure why instantiating inside the transaction is so expensive time-wise.

In the "slow" mode, under index-all/continue, it appears that the first pass, through the draft version, still happens fairly quickly. And then it is the second pass, through the published version, that slows down to a crawl. Which is consistent with what Steve observed earlier when trying to instantiate the published filemetadatas from scratch inside the indexing method.

(Also, I need to remember to add the query counts here. There are way too many - even with the direct /api/admin/index/dataset mode, that doesn't take hours).

@landreev
Contributor

landreev commented Nov 3, 2021

This btw is a query count I saved earlier:
This is from indexing this "big dataset" directly (/api/admin/index/dataset), testing with the develop branch (takes 6.5 minutes); 300K+ queries:

  50631 SELECT t0.ID, t0.DTYPE, t0.AUTHORITY, t0.CREATEDATE, t0.GLOBALIDCREATETIME, t0.IDENTIFIER, t0.IDENTIFIERREGISTERED, t0.INDEXTIME, t0.MODIFICATIONTIME, t0.PERMISSIONINDEXTIME, t0.PERMISSIONMODIFICATIONTIME, t0.PREVIEWIMAGEAVAILABLE, t0.PROTOCOL, t0.PUBLICATIONDATE, t0.STORAGEIDENTIFIER, t0.CREATOR_ID, t0.OWNER_ID, t0.RELEASEUSER_ID, t1.ID, t1.AFFILIATION, t1.ALIAS, t1.DATAVERSETYPE, t1.description, t1.EXTERNALLABELSETNAME, t1.FACETROOT, t1.GUESTBOOKROOT, t1.METADATABLOCKROOT, t1.METADATALANGUAGE, t1.NAME, t1.PERMISSIONROOT, t1.STORAGEDRIVER, t1.TEMPLATEROOT, t1.THEMEROOT, t1.DEFAULTCONTRIBUTORROLE_ID, t1.DEFAULTTEMPLATE_ID FROM DVOBJECT t0, DATAVERSE t1
  50620 SELECT t0.ID, t0.NAME, t0.DATASET_ID FROM DATAFILECATEGORY t0, FILEMETADATA_DATAFILECATEGORY t1
  50620 SELECT ID, INTERVIEWINSTRUCTION, ISWEIGHTVAR, LABEL, LITERALQUESTION, NOTES, POSTQUESTION, UNIVERSE, WEIGHTED, DATAVARIABLE_ID, FILEMETADATA_ID, WEIGHTVARIABLE_ID FROM VARIABLEMETADATA
  25322 SELECT ID, CONTENT, LANG, NAME FROM SETTING
  25313 SELECT t0.ID, t0.DTYPE, t0.AUTHORITY, t0.CREATEDATE, t0.GLOBALIDCREATETIME, t0.IDENTIFIER, t0.IDENTIFIERREGISTERED, t0.INDEXTIME, t0.MODIFICATIONTIME, t0.PERMISSIONINDEXTIME, t0.PERMISSIONMODIFICATIONTIME, t0.PREVIEWIMAGEAVAILABLE, t0.PROTOCOL, t0.PUBLICATIONDATE, t0.STORAGEIDENTIFIER, t0.CREATOR_ID, t0.OWNER_ID, t0.RELEASEUSER_ID, t1.ID, t1.EXTERNALLABELSETNAME, t1.FILEACCESSREQUEST, t1.HARVESTIDENTIFIER, t1.LASTEXPORTTIME, t1.METADATALANGUAGE, t1.STORAGEDRIVER, t1.USEGENERICTHUMBNAIL, t1.citationDateDatasetFieldType_id, t1.harvestingClient_id, t1.guestbook_id, t1.thumbnailfile_id FROM DVOBJECT t0, DATASET t1
  25312 SELECT t0.ID, t0.DTYPE, t0.AUTHORITY, t0.CREATEDATE, t0.GLOBALIDCREATETIME, t0.IDENTIFIER, t0.IDENTIFIERREGISTERED, t0.INDEXTIME, t0.MODIFICATIONTIME, t0.PERMISSIONINDEXTIME, t0.PERMISSIONMODIFICATIONTIME, t0.PREVIEWIMAGEAVAILABLE, t0.PROTOCOL, t0.PUBLICATIONDATE, t0.STORAGEIDENTIFIER, t0.CREATOR_ID, t0.OWNER_ID, t0.RELEASEUSER_ID, t1.ID, t1.CHECKSUMTYPE, t1.CHECKSUMVALUE, t1.CONTENTTYPE, t1.FILESIZE, t1.INGESTSTATUS, t1.PREVIOUSDATAFILEID, t1.prov_entityname, t1.RESTRICTED, t1.ROOTDATAFILEID FROM DVOBJECT t0, DATAFILE t1
  25312 SELECT ID, DESCRIPTION, DIRECTORYLABEL, LABEL, prov_freeform, RESTRICTED, VERSION, DATAFILE_ID, DATASETVERSION_ID FROM FILEMETADATA
  25311 SELECT ID, CONTROLCARD, FORCETYPECHECK, LABELSFILE, TEXTENCODING, datafile_id FROM INGESTREQUEST
  25310 SELECT ID, CASEQUANTITY, ORIGINALFILEFORMAT, ORIGINALFILENAME, ORIGINALFILESIZE, ORIGINALFORMATVERSION, RECORDSPERCASE, UNF, VARQUANTITY, DATAFILE_ID FROM DATATABLE
     38 SELECT ID, DISPLAYORDER, value, DATASETFIELD_ID FROM DATASETFIELDVALUE
     25 SELECT ID, ADVANCEDSEARCHFIELDTYPE, ALLOWCONTROLLEDVOCABULARY, ALLOWMULTIPLES, description, DISPLAYFORMAT, DISPLAYONCREATE, DISPLAYORDER, FACETABLE, FIELDTYPE, name, REQUIRED, title, uri, VALIDATIONFORMAT, WATERMARK, METADATABLOCK_ID, PARENTDATASETFIELDTYPE_ID FROM DATASETFIELDTYPE
     22 SELECT t1.ID, t1.DISPLAYORDER, t1.IDENTIFIER, t1.STRVALUE, t1.DATASETFIELDTYPE_ID FROM DATASETFIELD_CONTROLLEDVOCABULARYVALUE t0, CONTROLLEDVOCABULARYVALUE t1
     16 SELECT ID, DATASETFIELDTYPE_ID, DATASETVERSION_ID, PARENTDATASETFIELDCOMPOUNDVALUE_ID, TEMPLATE_ID FROM DATASETFIELD
     10 SELECT ID, DISPLAYORDER, PARENTDATASETFIELD_ID FROM DATASETFIELDCOMPOUNDVALUE
      8 SELECT ID, AFFILIATION, CREATEDTIME, DEACTIVATED, DEACTIVATEDTIME, EMAIL, EMAILCONFIRMED, FIRSTNAME, LASTAPIUSETIME, LASTLOGINTIME, LASTNAME, POSITION, SUPERUSER, USERIDENTIFIER FROM AUTHENTICATEDUSER
      5 SELECT ID, DISPLAYORDER, IDENTIFIER, STRVALUE, DATASETFIELDTYPE_ID FROM CONTROLLEDVOCABULARYVALUE
      4 SELECT ID, AUTHENTICATIONPROVIDERID, PERSISTENTUSERID, AUTHENTICATEDUSER_ID FROM AUTHENTICATEDUSERLOOKUP
      4 SELECT ID, ASSIGNEEIDENTIFIER, PRIVATEURLANONYMIZEDACCESS, PRIVATEURLTOKEN, DEFINITIONPOINT_ID, ROLE_ID FROM ROLEASSIGNMENT
      4 SELECT ID, ALIAS, DESCRIPTION, NAME, PERMISSIONBITS, OWNER_ID FROM DATAVERSEROLE
      3 SELECT ID, UNF, ARCHIVALCOPYLOCATION, ARCHIVENOTE, ARCHIVETIME, CREATETIME, DEACCESSIONLINK, EXTERNALSTATUSLABEL, LASTUPDATETIME, MINORVERSIONNUMBER, RELEASETIME, VERSION, VERSIONNOTE, VERSIONNUMBER, VERSIONSTATE, DATASET_ID, termsOfUseAndAccess_id FROM DATASETVERSION
      3 SELECT DISTINCT DTYPE FROM DVOBJECT
      2 SELECT ID, LINKCREATETIME, DATASET_ID, LINKINGDATAVERSE_ID FROM DATASETLINKINGDATAVERSE
      2 SELECT ID, CREATETIME, NAME, USAGECOUNT, DATAVERSE_ID, termsOfUseAndAccess_id FROM TEMPLATE
      2 SELECT ID, BACKGROUNDCOLOR, LINKCOLOR, LINKURL, LOGO, LOGOALIGNMENT, LOGOBACKGROUNDCOLOR, LOGOFOOTER, LOGOFOOTERALIGNMENT, LOGOFOOTERBACKGROUNDCOLOR, LOGOFORMAT, TAGLINE, TEXTCOLOR, dataverse_id FROM DATAVERSETHEME
      2 SELECT ID, AVAILABILITYSTATUS, CITATIONREQUIREMENTS, CONDITIONS, CONFIDENTIALITYDECLARATION, CONTACTFORACCESS, DATAACCESSPLACE, DEPOSITORREQUIREMENTS, DISCLAIMER, FILEACCESSREQUEST, LICENSE, ORIGINALARCHIVE, RESTRICTIONS, SIZEOFCOLLECTION, SPECIALPERMISSIONS, STUDYCOMPLETION, TERMSOFACCESS, TERMSOFUSE FROM TERMSOFUSEANDACCESS
      1 UPDATE DVOBJECT SET PERMISSIONINDEXTIME = '2021-10-18 23:22:04.473'
      1 UPDATE DVOBJECT SET INDEXTIME = '2021-10-18 23:21:37.701'
      1 UPDATE DVOBJECT SET INDEXTIME = '2021-10-18 23:19:11.66'
      1 SELECT t.typlen FROM pg_catalog.pg_type t, pg_catalog.pg_namespace n
      1 SELECT t1.ID, t1.ADVANCEDSEARCHFIELDTYPE, t1.ALLOWCONTROLLEDVOCABULARY, t1.ALLOWMULTIPLES, t1.description, t1.DISPLAYFORMAT, t1.DISPLAYONCREATE, t1.DISPLAYORDER, t1.FACETABLE, t1.FIELDTYPE, t1.name, t1.REQUIRED, t1.title, t1.uri, t1.VALIDATIONFORMAT, t1.WATERMARK, t1.METADATABLOCK_ID, t1.PARENTDATASETFIELDTYPE_ID FROM dataverse_citationDatasetFieldTypes t0, DATASETFIELDTYPE t1
      1 SELECT ID, INFO, REASON, STARTTIME, DATASET_ID, USER_ID FROM DATASETLOCK
      1 SELECT ID, DISPLAYNAME, NAME, namespaceuri, owner_id FROM METADATABLOCK

(Some of this is kind of crazy - like the 50K dataverse lookups. That said, it didn't look like the number of queries was going up much when it went to 6.5 hours under index-all; in other words, it is not the number of queries that makes it slower. The queries are largely the same, but everything somehow takes longer. So, "something to do with transactions" again. My main interest in this issue would be to find out why this is happening.)

@sekmiller
Contributor Author

Running index-all in the perf environment after ordering the datasets by files owned, from smallest to largest, improved the time to about 12 hours 20 minutes. Note that the sorting query disregards the fact that files can be deleted or replaced; that is, it is purely based on files owned, not the file metadata of the latest published/draft version. It also disregards partitioning, which is no longer supported.
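The ordering idea itself is simple; sketched here in plain Java over an in-memory map (the real implementation does this on the database side with a native query, and these ids/counts are made up for illustration):

```java
import java.util.*;
import java.util.stream.*;

public class OrderByFilesOwned {
    public static void main(String[] args) {
        // Hypothetical dataset id -> raw count of owned files.
        Map<Long, Integer> filesOwned = Map.of(10L, 25000, 11L, 3, 12L, 480);
        // Index the smallest datasets first, so the bulk of the catalog
        // comes back quickly and the file-rich outliers are processed last.
        List<Long> indexingOrder = filesOwned.entrySet().stream()
            .sorted(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
        System.out.println(indexingOrder);
    }
}
```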

@sekmiller sekmiller removed their assignment Nov 9, 2021
@landreev
Contributor

@sekmiller Do you remember how much of the total time (12 hr. 20 min.) was spent indexing 3CTMKP? It is a surprisingly good result (3CTMKP we've been experimenting with is not even the most file-rich dataset in prod., I don't think). As @scolapasta and I have been saying, we didn't expect the overall time to improve on account of this change at all - as the benefit of bringing the rest of the indexed catalog back, in a from-scratch reindex scenario, was good enough on its own.

@landreev
Contributor

I'm definitely ok with making a PR with the improvements in the branch right now. Of course I still want to figure out the root cause of the catastrophic slowdown of the indexing and how it relates to where the dataset is initialized in transaction context (and how it can be mitigated here and elsewhere in the code). It looks like the easiest way to go about that is to open a new dedicated issue and copy-and-paste all the relevant research from this PR (I'll do that); and then we can prioritize that.

@sekmiller
Contributor Author

I didn't get an elapsed time for 3CTMKP because it was indexed during the "overnight" while I wasn't monitoring the process.

@landreev
Contributor

Should be easy to check the elapsed time in the server log, I'll do that. I'm very curious to see how much time the last few datasets took. Was the overnight indexing in question run on Mon. Nov. 8?

@sekmiller
Contributor Author

I think I kicked it off around noon on Monday 11/8.

@landreev
Contributor

@sekmiller I don't want you to be on the hook for this never-ending PR. May I offer to put my name on it (and drag it back into "in progress")?

@sekmiller
Contributor Author

@landreev sure. what needs to be done coding-wise?

@landreev
Contributor

A full reindex has completed on the perf. cluster without failures and took 45+ hours.

  index all took 164443729 milliseconds|#]
  7466 dataverses and 155387 datasets indexed. index all took 164443729 milliseconds. Solr index was not cleared before indexing.

Important numbers/observations:
Of the total of 155,387 datasets, 155,000 were reindexed in 18 hours. (that roughly translates into all the datasets with less than 500 files).
~140,000 datasets with 10 or less files were reindexed in ~13 hours.
This confirms that for the users of an installation that needs to erase and rebuild its index from scratch, the search catalog would appear to be repopulated much sooner. This is also a big step up for being able to search for the most recent datasets, which under the old scheme were not reindexed until the very end of the ordeal.

The last 10 datasets took 21+ hours to index; almost 1/2 of the total time spent.
The evil 25K file dataset (3CTMKP) took the same 6 hours. This is an important data point, confirming what was observed earlier: individual datasets take the same time to index in the context of indexAll/continue, whether it's just 1 dataset on the stack, or if it's being processed after most of the other datasets in the db have been indexed.

3CTMKP is no longer the champion. ZYKVED (44+K files) takes 11 hours.

The sheer number of files alone does NOT translate into time. It appears that indexing is only super expensive when there's a draft version that needs to be indexed. GZSBVP has 33+K files, has one published version - and takes minutes to process. It doesn't appear to matter whether 1 or 2 versions need to be indexed. (Unlike 3CTMKP, ZYKVED only has a draft.)

The time ratio between 3CTMKP and ZYKVED appears to be roughly linear in the number of files.

I still support merging the PR for the sake of the reordering change alone. (If nothing else, we can say to the user that we are delivering exactly what they requested.) But all of the above, plus all the other useful experiments and research from this PR, need to be copy-and-pasted into a new issue dedicated to investigating the root cause behind the performance degradation first. I also need to archive the Postgres logs and indexing times for individual datasets. So I will be doing that. Plus I want to add a few words to the release note.

@landreev
Contributor

P.S. I had snapshots of memory use/gc taken every 5 min. during the reindex run. Memory did not appear to be an issue at any point, and I'm not seeing any glaring leaks (i.e., E was reliably going back to a low % after garbage collections throughout the process). About 20 min. total was spent garbage-collecting during the 45+ hour run.

*/
public List<Long> findAllOrSubsetOrderByFilesOwned(boolean skipIndexed) {
/*
Disregards deleted or replaced files when determining 'size' of dataset.
Contributor

@landreev landreev Nov 18, 2021

BTW, I am definitely OK with using the raw number of files for the ordering. We know that it's not a 100% accurate predictor of speed (both because of potentially replaced or deleted files; and because of drafts vs. published versions). However, our goal is not to be super accurate. But to be able to index the bulk of the database faster, by delaying indexing of a few outlier cases. That strategy appears to work. Even if some of these outliers don't actually take hours to index.
... And I'm very happy with the efficient implementation of the actual sorting - by a native query, all on the database side and not requiring any instantiations on the application side.

Contributor

@landreev landreev left a comment

Moving to QA; made an edit to the PR description explaining what's in it, and linking the newly opened issue for continuing the investigation of the core problem.

@landreev landreev removed their assignment Nov 23, 2021
@kcondon kcondon self-assigned this Nov 24, 2021
@kcondon kcondon merged commit 7f3aa0b into develop Nov 29, 2021
@kcondon kcondon deleted the 8097-improve-index-speed-for-many-files branch November 29, 2021 18:28
@djbrooke djbrooke added this to the 5.9 milestone Nov 29, 2021