8097 improve index speed for many files #8152
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| ### Indexing performance on datasets with large numbers of files | ||
|
|
||
| We discovered that whenever a full reindexing needs to be performed, datasets with large numbers of files take an exceptionally long time to index (for example, in the IQSS repository it takes several hours for a dataset that has 25,000 files). In situations where the Solr index needs to be erased and rebuilt from scratch (such as a Solr version upgrade or a corrupt index), this can significantly delay the repopulation of the search catalog. | ||
|
|
||
| We are still investigating the reasons behind this performance issue. For now, even though some improvements have been made, a dataset with thousands of files is still going to take a long time to index. But we've made a simple change to the reindexing process, to index any such datasets at the very end of the batch, after all the datasets with fewer files have been reindexed. This does not improve the overall reindexing time, but will repopulate the bulk of the search index much faster for the users of the installation. | ||
|
|
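The ordering described in the release note can be illustrated with a minimal in-memory sketch. Everything here is hypothetical (`ReindexOrderSketch`, `DatasetInfo`, `reindexOrder` are illustration names, not Dataverse classes), and the review discussion indicates that the pull request performs the sorting with a native query on the database side rather than in application code as shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from the Dataverse code base.
final class ReindexOrderSketch {

    // Minimal stand-in for a dataset: an id plus its raw file count.
    static final class DatasetInfo {
        private final long id;
        private final int fileCount;

        DatasetInfo(long id, int fileCount) {
            this.id = id;
            this.fileCount = fileCount;
        }

        long getId() { return id; }
        int getFileCount() { return fileCount; }
    }

    // Returns dataset ids ordered by ascending file count, so that the few
    // datasets with very many files are reindexed at the end of the batch.
    static List<Long> reindexOrder(List<DatasetInfo> datasets) {
        List<DatasetInfo> sorted = new ArrayList<>(datasets);
        sorted.sort((a, b) -> {
            int byCount = Integer.compare(a.getFileCount(), b.getFileCount());
            // Ties are broken by id only to make the order deterministic.
            return byCount != 0 ? byCount : Long.compare(a.getId(), b.getId());
        });
        List<Long> ids = new ArrayList<>();
        for (DatasetInfo d : sorted) {
            ids.add(d.getId());
        }
        return ids;
    }

    public static void main(String[] args) {
        List<DatasetInfo> datasets = List.of(
                new DatasetInfo(1L, 25_000),
                new DatasetInfo(2L, 12),
                new DatasetInfo(3L, 300));
        System.out.println(reindexOrder(datasets)); // prints [2, 3, 1]
    }
}
```

The total work is unchanged by the reordering; only the point at which most datasets become searchable moves earlier.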
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -532,7 +532,7 @@ public boolean contentEquals(FileMetadata other) { | |
|
|
||
| public boolean compareContent(FileMetadata other){ | ||
| FileVersionDifference diffObj = new FileVersionDifference(this, other, false); | ||
| - return diffObj.compareMetadata(this, other); | ||
| + return diffObj.isSame();
|
Member
Does this help? It seems you create a new FileVersionDifference and call compareMetadata() once either way, since diffObj isn't reused.
Contributor
@qqmyers I thought the point was that the comparison is already performed once inside the constructor, so the saving comes from not performing it twice.
Member
Yep, got it. And in other places the FileVersionDifference is kept around for a while. Could/should compareMetadata() be a private method now? |
||
| } | ||
|
|
||
| @Override | ||
|
|
||
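The point made in the review thread above (the constructor already runs the comparison, so calling it again doubles the work) can be shown with a small stand-alone sketch. `CachedComparisonSketch` and its string labels are hypothetical stand-ins, not the real `FileVersionDifference` or `FileMetadata` classes, and the static counter exists only to make the number of comparisons visible.

```java
import java.util.Objects;

// Hypothetical stand-in for the pattern discussed above: compare once in the
// constructor, keep the result, and expose it through a cheap getter.
final class CachedComparisonSketch {

    // Demo-only counter of how many times the comparison actually ran.
    static int comparisons = 0;

    private final boolean same;

    CachedComparisonSketch(String newLabel, String originalLabel) {
        // The comparison happens exactly once, here.
        this.same = compare(newLabel, originalLabel);
    }

    private static boolean compare(String newLabel, String originalLabel) {
        comparisons++;
        return Objects.equals(newLabel, originalLabel);
    }

    // Returns the stored result without comparing again.
    boolean isSame() {
        return same;
    }

    public static void main(String[] args) {
        CachedComparisonSketch diff = new CachedComparisonSketch("data.csv", "data.csv");
        System.out.println(diff.isSame());   // prints true
        System.out.println(diff.isSame());   // prints true, no second comparison
        System.out.println(comparisons);     // prints 1
    }
}
```

Making the comparison method private, as suggested in the thread, keeps callers from re-running it by accident.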
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -10,7 +10,6 @@ | |
| import java.util.ArrayList; | ||
| import java.util.List; | ||
| import java.util.Objects; | ||
| import java.util.ResourceBundle; | ||
|
|
||
| /** | ||
| * | ||
|
|
@@ -21,6 +20,9 @@ public final class FileVersionDifference { | |
| private FileMetadata newFileMetadata; | ||
| private FileMetadata originalFileMetadata; | ||
| private boolean details = false; | ||
| private boolean same = false; | ||
|
|
||
|
|
||
|
|
||
| private List<FileDifferenceSummaryGroup> differenceSummaryGroups = new ArrayList<>(); | ||
| private List<FileDifferenceDetailItem> differenceDetailItems = new ArrayList<>(); | ||
|
|
@@ -37,7 +39,7 @@ public FileVersionDifference(FileMetadata newFileMetadata, FileMetadata original | |
| this.originalFileMetadata = originalFileMetadata; | ||
| this.details = details; | ||
|
|
||
| - compareMetadata(newFileMetadata, originalFileMetadata); | ||
| + this.same = compareMetadata(newFileMetadata, originalFileMetadata); | ||
| //Compare versions - File Metadata first | ||
|
|
||
| } | ||
|
|
@@ -50,7 +52,7 @@ public boolean compareMetadata(FileMetadata newFileMetadata, FileMetadata origin | |
| and it updates the FileVersionDifference object which is used to display the differences on the dataset versions tab. | ||
| The return value is used by the index service bean to mark whether a file needs to be re-indexed in the context of a dataset update. | ||
| When there are changes (after v4.19) to the file metadata data model this method must be updated. | ||
| retVal of True means metadatas are equal. | ||
| retVal of True means metadatas are equal. | ||
| */ | ||
|
|
||
| boolean retVal = true; | ||
|
|
@@ -68,13 +70,15 @@ When there are changes (after v4.19)to the file metadata data model this method | |
|
|
||
| if (this.originalFileMetadata == null && this.newFileMetadata.getDataFile() != null ){ | ||
| //File Added | ||
| if (!details) return false; | ||
|
Member
Is there any way through the series of if statements where you don't end up returning false if !details? Just wondering if you can do this check once. |
||
| retVal = false; | ||
| updateDifferenceSummary( "", BundleUtil.getStringFromBundle("file.versionDifferences.fileGroupTitle"), 1, 0, 0, 0); | ||
| } | ||
|
|
||
| //Check to see if File replaced | ||
| if (originalFileMetadata != null && | ||
| newFileMetadata.getDataFile() != null && originalFileMetadata.getDataFile() != null &&!this.originalFileMetadata.getDataFile().equals(this.newFileMetadata.getDataFile())){ | ||
| if (!details) return false; | ||
| updateDifferenceSummary( "", BundleUtil.getStringFromBundle("file.versionDifferences.fileGroupTitle"), 0, 0, 0, 1); | ||
| retVal = false; | ||
| } | ||
|
|
@@ -83,6 +87,8 @@ When there are changes (after v4.19)to the file metadata data model this method | |
| if (!newFileMetadata.getLabel().equals(originalFileMetadata.getLabel())) { | ||
| if (details) { | ||
| differenceDetailItems.add(new FileDifferenceDetailItem(BundleUtil.getStringFromBundle("file.versionDifferences.fileNameDetailTitle"), originalFileMetadata.getLabel(), newFileMetadata.getLabel())); | ||
| } else{ | ||
| return false; | ||
| } | ||
| updateDifferenceSummary(BundleUtil.getStringFromBundle("file.versionDifferences.fileMetadataGroupTitle"), | ||
| BundleUtil.getStringFromBundle("file.versionDifferences.fileNameDetailTitle"), 0, 1, 0, 0); | ||
|
|
@@ -97,6 +103,8 @@ When there are changes (after v4.19)to the file metadata data model this method | |
| && !newFileMetadata.getDescription().equals(originalFileMetadata.getDescription())) { | ||
| if (details) { | ||
| differenceDetailItems.add(new FileDifferenceDetailItem(BundleUtil.getStringFromBundle("file.versionDifferences.descriptionDetailTitle"), originalFileMetadata.getDescription(), newFileMetadata.getDescription())); | ||
| } else { | ||
| return false; | ||
| } | ||
| updateDifferenceSummary(BundleUtil.getStringFromBundle("file.versionDifferences.fileMetadataGroupTitle"), | ||
| BundleUtil.getStringFromBundle("file.versionDifferences.descriptionDetailTitle"), 0, 1, 0, 0); | ||
|
|
@@ -107,6 +115,8 @@ When there are changes (after v4.19)to the file metadata data model this method | |
| ) { | ||
| if (details) { | ||
| differenceDetailItems.add(new FileDifferenceDetailItem(BundleUtil.getStringFromBundle("file.versionDifferences.descriptionDetailTitle"), "", newFileMetadata.getDescription())); | ||
| } else { | ||
| return false; | ||
| } | ||
| updateDifferenceSummary(BundleUtil.getStringFromBundle("file.versionDifferences.fileMetadataGroupTitle"), | ||
| BundleUtil.getStringFromBundle("file.versionDifferences.descriptionDetailTitle"), 1, 0, 0, 0); | ||
|
|
@@ -117,6 +127,8 @@ When there are changes (after v4.19)to the file metadata data model this method | |
| ) { | ||
| if (details) { | ||
| differenceDetailItems.add(new FileDifferenceDetailItem(BundleUtil.getStringFromBundle("file.versionDifferences.descriptionDetailTitle"), originalFileMetadata.getDescription(), "" )); | ||
| } else { | ||
| return false; | ||
| } | ||
| updateDifferenceSummary(BundleUtil.getStringFromBundle("file.versionDifferences.fileMetadataGroupTitle"), | ||
| BundleUtil.getStringFromBundle("file.versionDifferences.descriptionDetailTitle"), 0, 0, 1, 0); | ||
|
|
@@ -130,6 +142,8 @@ When there are changes (after v4.19)to the file metadata data model this method | |
| && !newFileMetadata.getProvFreeForm().equals(originalFileMetadata.getProvFreeForm())) { | ||
| if (details) { | ||
| differenceDetailItems.add(new FileDifferenceDetailItem(BundleUtil.getStringFromBundle("file.versionDifferences.provenanceDetailTitle"), originalFileMetadata.getProvFreeForm(), newFileMetadata.getProvFreeForm())); | ||
| } else { | ||
| return false; | ||
| } | ||
| updateDifferenceSummary(BundleUtil.getStringFromBundle("file.versionDifferences.fileMetadataGroupTitle"), | ||
| BundleUtil.getStringFromBundle("file.versionDifferences.provenanceDetailTitle"), 0, 1, 0, 0); | ||
|
|
@@ -140,6 +154,8 @@ When there are changes (after v4.19)to the file metadata data model this method | |
| ) { | ||
| if (details) { | ||
| differenceDetailItems.add(new FileDifferenceDetailItem(BundleUtil.getStringFromBundle("file.versionDifferences.provenanceDetailTitle"), "", newFileMetadata.getProvFreeForm())); | ||
| } else { | ||
| return false; | ||
| } | ||
| updateDifferenceSummary(BundleUtil.getStringFromBundle("file.versionDifferences.fileMetadataGroupTitle"), | ||
| BundleUtil.getStringFromBundle("file.versionDifferences.provenanceDetailTitle"), 1, 0, 0, 0); | ||
|
|
@@ -150,6 +166,8 @@ When there are changes (after v4.19)to the file metadata data model this method | |
| ) { | ||
| if (details) { | ||
| differenceDetailItems.add(new FileDifferenceDetailItem(BundleUtil.getStringFromBundle("file.versionDifferences.provenanceDetailTitle"), originalFileMetadata.getProvFreeForm(), "" )); | ||
| } else { | ||
| return false; | ||
| } | ||
| updateDifferenceSummary(BundleUtil.getStringFromBundle("file.versionDifferences.fileMetadataGroupTitle"), | ||
| BundleUtil.getStringFromBundle("file.versionDifferences.provenanceDetailTitle"), 0, 0, 1, 0); | ||
|
|
@@ -170,7 +188,7 @@ When there are changes (after v4.19)to the file metadata data model this method | |
| } | ||
|
|
||
| if (!value1.equals(value2)) { | ||
|
|
||
| if (!details) return false; | ||
| int added = 0; | ||
| int deleted = 0; | ||
|
|
||
|
|
@@ -254,6 +272,14 @@ public void setOriginalFileMetadata(FileMetadata originalFileMetadata) { | |
| this.originalFileMetadata = originalFileMetadata; | ||
| } | ||
|
|
||
| public boolean isSame() { | ||
| return same; | ||
| } | ||
|
|
||
| public void setSame(boolean same) { | ||
| this.same = same; | ||
| } | ||
|
|
||
|
|
||
| public List<FileDifferenceSummaryGroup> getDifferenceSummaryGroups() { | ||
| return differenceSummaryGroups; | ||
|
|
||
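The early returns added throughout this diff follow one idea: when no detailed report is requested, the first difference already answers "same or not", so the remaining checks can be skipped. Below is a minimal sketch of that idea under simplified assumptions: the metadata is reduced to two equal-length string arrays, and `EarlyExitDiffSketch` and its members are hypothetical names, not part of the pull request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical sketch of the early-exit pattern; not the real FileVersionDifference.
final class EarlyExitDiffSketch {

    private final boolean details;
    private final List<String> differences = new ArrayList<>();
    private final boolean same;

    // Assumes both arrays have the same length (one entry per metadata field).
    EarlyExitDiffSketch(String[] newValues, String[] originalValues, boolean details) {
        this.details = details;
        this.same = compare(newValues, originalValues);
    }

    private boolean compare(String[] newValues, String[] originalValues) {
        boolean retVal = true;
        for (int i = 0; i < newValues.length; i++) {
            if (!Objects.equals(newValues[i], originalValues[i])) {
                // Without details, the first difference is enough: stop here.
                if (!details) {
                    return false;
                }
                // With details, record the change and keep scanning.
                differences.add("field " + i + ": " + originalValues[i] + " -> " + newValues[i]);
                retVal = false;
            }
        }
        return retVal;
    }

    boolean isSame() {
        return same;
    }

    List<String> getDifferences() {
        return differences;
    }

    public static void main(String[] args) {
        String[] original = {"data.csv", "old description", "prov"};
        String[] updated = {"data.csv", "new description", "prov v2"};

        EarlyExitDiffSketch quick = new EarlyExitDiffSketch(updated, original, false);
        System.out.println(quick.isSame() + " " + quick.getDifferences().size()); // prints false 0

        EarlyExitDiffSketch full = new EarlyExitDiffSketch(updated, original, true);
        System.out.println(full.isSame() + " " + full.getDifferences().size());   // prints false 2
    }
}
```

In this sketch the `!details` check sits in a single place because the fields are handled in a loop; the diff above repeats it per field because each field is compared by its own block of code.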
BTW, I am definitely OK with using the raw number of files for the ordering. We know that it's not a 100% accurate predictor of speed (both because of potentially replaced or deleted files, and because of drafts vs. published versions). However, our goal is not to be super accurate, but to be able to index the bulk of the database faster by delaying the indexing of a few outlier cases. That strategy appears to work, even if some of these outliers don't actually take hours to index.
... And I'm very happy with the efficient implementation of the actual sorting: it is done by a native query, entirely on the database side, without requiring any instantiations on the application side.