8097 improve index speed for many files #8152
Conversation
remove debug code
landreev
left a comment
Very happy about the 10X speed improvement. Some very good catches in this PR.
I feel like I should've been more proactive with my offer to help count the actual database queries before/after while you were working on this. The fact that processing the published version first has made such a big difference by itself makes me wonder why exactly; as I mentioned, I'm wondering if we were re-instantiating the entire 25K list of filemetadatas in the published version 25K times (!) otherwise. So I'd like to confirm that now.
    ) {
        logger.fine("Checking if this file metadata is a duplicate.");
        if (fileMetadata.getDataFile() != null) {
            FileMetadata findReleasedFileMetadata = dataFileService.findFileMetadataByDatasetVersionIdAndDataFileId(dataset.getReleasedVersion().getId(), fileMetadata.getDataFile().getId());
The big win here is no longer doing an n**2 nested loop, but the replacement does n db queries. Would it be faster still to just run once through dataset.getReleasedVersion().getFileMetadatas() and build a map of file id to fileMetadata, then use it to find the matching filemetadata?
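A minimal sketch of the map approach suggested above. The `DataFile`/`FileMetadata` records here are simplified stand-ins for the real Dataverse entities, not the actual classes:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-ins for the real Dataverse entities.
record DataFile(long id) {}
record FileMetadata(DataFile dataFile) {}

class ReleasedMetadataIndex {
    // One pass over the released version's FileMetadatas builds an
    // id -> metadata map; matching each draft-version metadata is then an
    // O(1) in-memory get() instead of one database query per file.
    static Map<Long, FileMetadata> byDataFileId(List<FileMetadata> releasedMetadatas) {
        Map<Long, FileMetadata> map = new HashMap<>();
        for (FileMetadata fmd : releasedMetadatas) {
            map.put(fmd.dataFile().id(), fmd);
        }
        return map;
    }
}
```

For 25K files this trades 25K `findFileMetadataByDatasetVersionIdAndDataFileId()` calls for a single in-memory pass.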
@qqmyers That was my immediate concern too (Steve and I talked about this directly yesterday).
The 25K calls to findFileMetadataByDatasetVersionIdAndDataFileId() = 125K extra database queries, which are not necessarily cheap. The extra (*n) in the original loop would be iterating through objects already looked up and instantiated in memory, and that should not be that expensive - ? The only possible explanation I could think of for how the above could be an improvement was if somehow that line dataset.getReleasedVersion().getFileMetadatas() in the original code was actually looking up and instantiating the entire list of filemetadatas on every iteration (!). What Steve said in the PR description, that changing the indexing order and processing the released version first actually made a huge difference, made that crazy guess a little more plausible. I mentioned it in my comment above.
I ran some tests on a prod. db copy last night, and got some strange results. We'll revisit this when Steve is back.
Would it be faster still to just run once through dataset.getReleasedVersion().getFileMetadatas() and make a map of file id to fileMetadata to use to find the matching filemetadata?
If this is really necessary, this lookup of the filemetadata in the released version could be made more efficient, yes. A map was the first thing that came to my mind too. But there should be a simpler way still. Like, sorting both lists of FileMetadatas in the 2 versions by datafile id, then making a pass through them and selecting a list of file ids that need to be indexed, for example? That's one O(n) loop.
You'd have sorting costs though :-). In any case, if there's gain to be had, I'd guess it would be from avoiding db calls, and any local O(n) solution would be good.
(doh, with that approach we would not even need an extra pass/a saved list of ids - we could just sort both lists, then iterate through both of them in parallel as we index)
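To make the sort-and-iterate-in-parallel idea concrete, here's a hedged sketch operating on bare datafile ids rather than the real FileMetadata objects: sort both versions' id lists, then walk them together in one merge-style pass, collecting the draft-only ids (the names and the exact output are illustrative, not the PR's actual code):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SortedMergeSketch {
    // Returns the ids present in the draft list but not in the released list.
    // Two sorts plus a single merge-style pass: O(n log n) total, no db calls.
    static List<Long> idsOnlyInDraft(List<Long> draftIds, List<Long> releasedIds) {
        List<Long> draft = new ArrayList<>(draftIds);
        List<Long> released = new ArrayList<>(releasedIds);
        Collections.sort(draft);
        Collections.sort(released);
        List<Long> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < draft.size()) {
            if (j >= released.size() || draft.get(i) < released.get(j)) {
                result.add(draft.get(i)); // no released counterpart
                i++;
            } else if (draft.get(i).equals(released.get(j))) {
                i++; j++;                 // present in both versions
            } else {
                j++;                      // released-only id; skip
            }
        }
        return result;
    }
}
```

The same pass could just as easily index each metadata as it goes, instead of saving the id list, as the comment above notes.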
OCD or ADHD, I had to measure it... that brute force iteration through the java List of filemetadatas costs a total of 30+ seconds when indexing the dataset. So, 10% of the time it takes to index it, under normal circumstances.
    if (this.originalFileMetadata == null && this.newFileMetadata.getDataFile() != null ){
        //File Added
        if (!details) return false;
Is there any way through the series of if statements where you don't end up returning false if !details? Just wondering if you can do this check once.
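One way the check could be hoisted, sketched with stand-in booleans in place of the individual difference tests (the real method compares many metadata fields; this only illustrates the control-flow change being suggested):

```java
class CompareSketch {
    // Before: every difference branch repeats "if (!details) return false;".
    // After (this sketch): decide whether any difference exists first, then
    // consult the 'details' flag exactly once.
    static boolean compareMetadata(boolean fileAdded, boolean fileDeleted, boolean details) {
        boolean anyDifference = fileAdded || fileDeleted; // stand-in checks
        if (anyDifference && !details) {
            return false; // single early exit replaces the per-branch checks
        }
        // ...when 'details' is on, the per-field difference reporting would
        // happen here before returning.
        return !anyDifference;
    }
}
```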
    public boolean compareContent(FileMetadata other){
        FileVersionDifference diffObj = new FileVersionDifference(this, other, false);
    -   return diffObj.compareMetadata(this, other);
    +   return diffObj.isSame();
Does this help? Seems like you create a new FileVersionDifference and call the compareMetadata method once either way, since diffObj isn't reused.
@qqmyers I thought the point was that the comparison was already performed once inside the constructor; so the saving was from not performing it twice.
But I'll let Steve give an authoritative answer when he's back.
Yep - got it. And other places the FileVersionDifference is kept around for a while. Could/should compareMetadata() be a private method now?
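The pattern being relied on, sketched with simplified stand-ins (the real FileVersionDifference compares FileMetadata objects; here plain string lists stand in for the metadata): the constructor runs the comparison once and caches the result, so isSame() is a free read rather than a second comparison.

```java
import java.util.List;

class DiffSketch {
    private final boolean same;
    private int comparisonsRun = 0;

    DiffSketch(List<String> original, List<String> updated) {
        comparisonsRun++;                     // the one and only comparison
        this.same = original.equals(updated); // stand-in for compareMetadata()
    }

    boolean isSame() {
        return same;                          // cached result; nothing recomputed
    }

    int comparisonsRun() {
        return comparisonsRun;
    }
}
```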
One extra data point: the difference between the 2 cases above may not have anything to do with how much data in other datasets has been processed by the time index-all gets to this dataset. I.e., the difference in performance appears to be present even when "all" in "index-all" is just this one dataset. The experiment illustrating the point is to have this dataset be the only one in the database with indextime=NULL, then run

My gut feeling has always been that whatever was making datasets with too many files unmanageable had something to do with transactions. But we do need something more solid than that. There were most definitely some algorithmic inefficiencies inside the indexDataset methods (like the comparison methods that have already been looked into). It's not clear, however, how much that has to do with the awful performance of index-all on datasets like this one.

Case in point: the dataset 10.7910/DVN/ZYKVED. It has even more files, but only one version, the draft. Indexing it should not involve any comparisons (?), but it appears to be taking a very long time too. So could it really be a function of the sheer number of child objects being forced into the transaction context? That, in turn, may also be a function of something about the way we instantiate these objects inside the method.

I'll point out that at the very least we have a workaround for index-all specifically, even if it's bulky. I.e., knowing that datasets like the one above can be indexed in a more manageable way outside of index-all, that's what we could be doing. But we absolutely need to figure out what's going on, since this appears to be the same issue that makes such datasets unmanageable in other ways: it takes forever to edit, save, etc.
A followup update:
Actually, there's a 3rd difference, which appears to be the crucial one. Here's the experiment: we know that indexing 3CTMKP via

It appears that all it takes to make it take hours instead is to change the last line to

So we are omitting the

So... it appears that what makes the crucial difference is whether the dataset entity is instantiated within, or outside of, the working transaction. This must be some known and documented EJB behavior at play. So, stackoverflow is our friend?

(I tested the above; but, as always, do not take my word for it necessarily.)
And another followup: In the "slow" mode, under index-all/continue, it appears that the first pass, through the draft version, still happens fairly quickly. It is the second pass, through the published version, that slows down to a crawl. Which is consistent with what Steve observed earlier when trying to instantiate the published filemetadatas from scratch inside the indexing method. (Also, I need to remember to add the query counts here. There are way too many, even with the direct
This btw is a query count I saved earlier. (Some of this is kind of crazy, like the 50K dataverse lookups... That said, it didn't look like the number of queries was going up much when it went to 6.5 hours under index-all. In other words, it was looking like it was not the number of queries that makes it slower; the queries are largely the same, but everything somehow takes longer. So... "something to do with transactions" again. My main interest in this issue would be to find out why this is happening.)
Running index-all in the perf environment after ordering the datasets by files owned, from smallest to largest, improved the time to about 12 hours 20 minutes. Note that the sorting query disregards the fact that files can be deleted or replaced; that is, it is purely based on files owned, not on file metadata related to the latest published/draft version. It also disregards partitioning, which is no longer supported.
@sekmiller Do you remember how much of the total time (12 hr. 20 min.) was spent indexing 3CTMKP? It is a surprisingly good result (the 3CTMKP we've been experimenting with is not even the most file-rich dataset in prod., I don't think). As @scolapasta and I have been saying, we didn't expect the overall time to improve on account of this change at all, as the benefit of bringing the rest of the indexed catalog back, in a from-scratch reindex scenario, was good enough on its own.
I'm definitely ok with making a PR with the improvements in the branch right now. Of course I still want to figure out the root cause of the catastrophic slowdown of the indexing and how it relates to where the dataset is initialized in transaction context (and how it can be mitigated here and elsewhere in the code). It looks like the easiest way to go about that is to open a new dedicated issue and copy-and-paste all the relevant research from this PR (I'll do that); and then we can prioritize that.
I didn't get an elapsed time for 3CTMKP because it was indexed during the "overnight" while I wasn't monitoring the process.
Should be easy to check the elapsed time in the server log; I'll do that. I'm very curious to see how much time the last few datasets took. Was the overnight indexing in question run on Mon. Nov. 8?
I think I kicked it off around noon on Monday 11/8.
@sekmiller I don't want you to be on the hook for this never-ending PR. May I offer to put my name on it (and drag it back into "in progress")?
@landreev sure. What needs to be done coding-wise?
A full re-reindex has completed on the perf. cluster without failures and took 45+ hours. Important numbers/observations:

- The last 10 datasets took 21+ hours to index; almost 1/2 of the total time spent.
- 3CTMKP is no longer the champion: ZYKVED (44+K files) takes 11 hours.
- The sheer number of files does NOT translate into time. It appears that it's only super expensive when there's a draft version that needs to be indexed: GZSBVP has 33+K files and one published version, and takes minutes to process.
- It doesn't appear to matter whether 1 or 2 versions need to be indexed; unlike 3CTMKP, ZYKVED only has a draft.
- The time ratio between 3CTMKP and ZYKVED appears to be largely linear in the number of files.

I still support merging the PR for the sake of the reordering change alone. (If nothing else, we can say to the user that we are delivering exactly what he requested.) But all of the above, plus all the other useful experiments and research from this PR, need to be copy-and-pasted into a new issue that will be dedicated to investigating the root cause behind the performance degradation first. I also need to archive the Postgres logs and the indexing times for individual datasets, so I will be doing that. Plus I want to add a few words to the release note.
P.S. I was having snapshots of memory use/gc taken every 5 min. during the reindex run. Memory did not appear to be an issue at any point, and I'm not seeing any glaring leaks (i.e., E was reliably going back to a low % after garbage collections throughout the process). About 20 min. total was spent garbage-collecting during the 45+ hour run.
     */
    public List<Long> findAllOrSubsetOrderByFilesOwned(boolean skipIndexed) {
        /*
        Disregards deleted or replaced files when determining 'size' of dataset.
BTW, I am definitely OK with using the raw number of files for the ordering. We know that it's not a 100% accurate predictor of speed (both because of potentially replaced or deleted files, and because of drafts vs. published versions). However, our goal is not to be super accurate, but to be able to index the bulk of the database faster by delaying the indexing of a few outlier cases. That strategy appears to work, even if some of these outliers don't actually take hours to index.
... And I'm very happy with the efficient implementation of the actual sorting - by a native query, all on the database side and not requiring any instantiations on the application side.
landreev
left a comment
Moving to QA; made an edit to the PR description explaining what's in it, and linking the newly opened issue for continuing the investigation of the core problem.
What this PR does / why we need it:
[Edit: please see the release note/last comments for what's in the PR. It contains an agreed-upon "incremental improvement" - does not really speed up the overall time significantly, but makes a from-scratch reindexing of an installation like ours less traumatic. The investigation of the root cause of the performance degradation will continue in the newly opened #8256. - L.A.]
Speeds up the processing of re-indexing datasets with a large number of files, in particular when there is a draft version of the dataset. (Also fixes a fault where the metadata comparison was run twice: once in the constructor and then again to get the boolean result.) The underlying reason is that the indexer only creates a solr doc for a draft version of a file if there's a difference in the file metadata. Comparing the two versions of file metadata slows down considerably when there is a large number of files.
Which issue(s) this PR closes:
Closes #8097: 8 hours to index a dataset with thousands of files.
Special notes for your reviewer: The big performance gain comes from indexing the draft version after indexing the published version. In a test against a copy of prod, the dataset in question went from 8+ hours to under 45 minutes. Smaller gains were accomplished by doing a direct lookup of the comparable fmd version and short-circuiting the comparison once a difference was found.
Suggestions on how to test this: re-index a dataset with a large number of files and see that the index acts as expected and runs faster.
Does this PR introduce a user interface change? If mockups are available, please link/include them here: no
Is there a release notes update needed for this change?: no
Additional documentation: none