Merged

28 commits
44a674a
Merge branch 'develop' into 8097-improve-index-speed-for-many-files
sekmiller Sep 28, 2021
45ccdff
#8097 speed up file indexing
sekmiller Sep 28, 2021
2c57406
Merge branch 'develop' into 8097-improve-index-speed-for-many-files
sekmiller Sep 30, 2021
21154ce
Merge branch 'develop' into 8097-improve-index-speed-for-many-files
sekmiller Oct 4, 2021
9b6827e
Merge branch 'develop' into 8097-improve-index-speed-for-many-files
sekmiller Oct 5, 2021
a8ca8d1
#8097 post file docs every 100
sekmiller Oct 7, 2021
995e2cf
Merge branch 'develop' into 8097-improve-index-speed-for-many-files
sekmiller Oct 7, 2021
9d79c95
Merge branch 'develop' into 8097-improve-index-speed-for-many-files
sekmiller Oct 8, 2021
d94b3f0
#8097 only run compare once
sekmiller Oct 8, 2021
fbcd03a
Merge branch 'develop' into 8097-improve-index-speed-for-many-files
sekmiller Oct 13, 2021
a3dbf5d
#8097 index pub before draft
sekmiller Oct 13, 2021
4776534
Merge branch 'develop' into 8097-improve-index-speed-for-many-files
sekmiller Oct 14, 2021
029c3d0
#8097 remove doc commits
sekmiller Oct 14, 2021
1ca3427
#8097 remove reminder comments
sekmiller Oct 14, 2021
a243763
Merge branch 'develop' into 8097-improve-index-speed-for-many-files
sekmiller Oct 18, 2021
d78a68e
Merge branch 'develop' into 8097-improve-index-speed-for-many-files
sekmiller Oct 21, 2021
94e04e2
Merge branch 'develop' into 8097-improve-index-speed-for-many-files
sekmiller Oct 27, 2021
eca09c7
#8097 use map lookup
sekmiller Oct 28, 2021
ed23eae
#8097 return to dataset id passed to index in new transaction
sekmiller Nov 1, 2021
6667d1a
Merge branch 'develop' into 8097-improve-index-speed-for-many-files
sekmiller Nov 1, 2021
3985cb8
Merge branch 'develop' into 8097-improve-index-speed-for-many-files
sekmiller Nov 2, 2021
61ce332
#8097 order datasets to index by number of files
sekmiller Nov 3, 2021
2675397
Merge branch 'develop' into 8097-improve-index-speed-for-many-files
sekmiller Nov 3, 2021
09e43e0
Merge branch 'develop' into 8097-improve-index-speed-for-many-files
sekmiller Nov 8, 2021
ab1d0e8
#8097 rename method for clarity
sekmiller Nov 9, 2021
089efbe
Merge branch 'develop' into 8097-improve-index-speed-for-many-files
sekmiller Nov 9, 2021
b508edf
added a release note explaining the nature of the improvement. (#8097)
landreev Nov 23, 2021
f9311dd
typos in the release note (#8097)
landreev Nov 29, 2021
6 changes: 6 additions & 0 deletions doc/release-notes/8097-indexall-performance.md
@@ -0,0 +1,6 @@
### Indexing performance on datasets with large numbers of files

We discovered that whenever a full reindexing needs to be performed, datasets with large numbers of files take an exceptionally long time to index (for example, in the IQSS repository it takes several hours to index a dataset that has 25,000 files). In situations where the Solr index needs to be erased and rebuilt from scratch (such as a Solr version upgrade or a corrupt index), this can significantly delay the repopulation of the search catalog.

We are still investigating the reasons behind this performance issue. For now, even though some improvements have been made, a dataset with thousands of files will still take a long time to index. But we've made a simple change to the reindexing process: any such datasets are indexed at the very end of the batch, after all the datasets with fewer files have been reindexed. This does not improve the overall reindexing time, but it repopulates the bulk of the search index much faster for the users of the installation.

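The reordering described in the release note can be sketched outside of Dataverse: sort the dataset ids by file count ascending, so the many small datasets are indexed first and the few outliers are deferred to the end. This is a minimal standalone illustration with hypothetical names, not the actual `DatasetServiceBean` code (which does the ordering in a native SQL query).

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of the reindexing order change: datasets with the
// fewest files come first, so the bulk of the catalog repopulates quickly
// and the rare very large datasets are deferred to the end of the batch.
public class IndexOrderSketch {

    /** Returns dataset ids ordered by file count ascending, ties broken by id. */
    public static List<Long> orderByFilesOwned(Map<Long, Integer> fileCounts) {
        return fileCounts.entrySet().stream()
                .sorted(Comparator
                        .comparingInt((Map.Entry<Long, Integer> e) -> e.getValue())
                        .thenComparingLong(Map.Entry::getKey))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<Long, Integer> counts = new HashMap<>();
        counts.put(1L, 25000); // outlier dataset with many files
        counts.put(2L, 3);
        counts.put(3L, 40);
        System.out.println(orderByFilesOwned(counts)); // [2, 3, 1]
    }
}
```

In the PR itself the equivalent ordering happens entirely on the database side (`ORDER BY count(f.id) asc, o.id`), which avoids instantiating any entities on the application side.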
45 changes: 45 additions & 0 deletions src/main/java/edu/harvard/iq/dataverse/DatasetServiceBean.java
Expand Up @@ -225,6 +225,51 @@ public List<Long> findAllOrSubset(long numPartitions, long partitionId, boolean
return typedQuery.getResultList();
}

/**
 * For docs, see the equivalent method on the DataverseServiceBean.
 * @param skipIndexed if true, omit datasets that already have an index time
 * @return a list of dataset ids, ordered by the number of files owned
 * @see DataverseServiceBean#findAllOrSubset(long, long, boolean)
 */
public List<Long> findAllOrSubsetOrderByFilesOwned(boolean skipIndexed) {
/*
Disregards deleted or replaced files when determining 'size' of dataset.
@landreev (Contributor) commented on Nov 18, 2021:
BTW, I am definitely OK with using the raw number of files for the ordering. We know that it's not a 100% accurate predictor of speed (both because of potentially replaced or deleted files; and because of drafts vs. published versions). However, our goal is not to be super accurate. But to be able to index the bulk of the database faster, by delaying indexing of a few outlier cases. That strategy appears to work. Even if some of these outliers don't actually take hours to index.
... And I'm very happy with the efficient implementation of the actual sorting - by a native query, all on the database side and not requiring any instantiations on the application side.

Could possibly make more efficient by getting file metadata counts
of latest published/draft version.
Also disregards partitioning which is no longer supported.
SEK - 11/09/2021
*/

String skipClause = skipIndexed ? "AND o.indexTime is null " : "";
Query query = em.createNativeQuery(" Select distinct(o.id), count(f.id) as numFiles FROM dvobject o " +
"left join dvobject f on f.owner_id = o.id where o.dtype = 'Dataset' "
+ skipClause
+ " group by o.id "
+ "ORDER BY count(f.id) asc, o.id");

List<Object[]> queryResults;
queryResults = query.getResultList();

List<Long> retVal = new ArrayList<>();
for (Object[] result : queryResults) {
Long dsId;
if (result[0] != null) {
try {
dsId = Long.parseLong(result[0].toString());
} catch (Exception ex) {
dsId = null;
}
if (dsId == null) {
continue;
}
retVal.add(dsId);
}
}
return retVal;
}

/**
* Merges the passed dataset to the persistence context.
* @param ds the dataset whose new state we want to persist.
Expand Down
2 changes: 1 addition & 1 deletion src/main/java/edu/harvard/iq/dataverse/FileMetadata.java
Expand Up @@ -532,7 +532,7 @@ public boolean contentEquals(FileMetadata other) {

public boolean compareContent(FileMetadata other){
FileVersionDifference diffObj = new FileVersionDifference(this, other, false);
return diffObj.compareMetadata(this, other);
return diffObj.isSame();
Member:
does this help? Seems like you create a new FileVersionDifference and call the compareMetadata method once either way since diffObject isn't reused.

Contributor:
@qqmyers I thought the point was that the comparison was already performed once inside the constructor; so the saving was from not performing it twice.
But I'll let Steve give an authoritative answer when he's back.

Member:
Yep - got it. And other places the FileVersionDifference is kept around for a while. Could/should compareMetadata() be a private method now?

}

@Override
Expand Down
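The point settled in the review thread above — the comparison already runs once inside the `FileVersionDifference` constructor, so `compareContent` should reuse that stored result via `isSame()` rather than compare again — can be sketched as a small standalone example. The class and field names here are hypothetical stand-ins, not the real Dataverse types.

```java
import java.util.Objects;

// Minimal sketch of the "compute once in the constructor, reuse via a
// getter" pattern discussed above. The expensive comparison happens exactly
// once, at construction time; callers then read the cached boolean.
public class VersionDifferenceSketch {

    static class Difference {
        private final boolean same;

        Difference(String oldLabel, String newLabel) {
            // Comparison performed once here, never repeated per caller.
            this.same = compare(oldLabel, newLabel);
        }

        // Can be private now that callers use isSame() instead of
        // invoking the comparison directly.
        private boolean compare(String a, String b) {
            return Objects.equals(a, b);
        }

        boolean isSame() {
            return same;
        }
    }

    public static void main(String[] args) {
        Difference diff = new Difference("data.csv", "data_v2.csv");
        System.out.println(diff.isSame()); // false
    }
}
```

This mirrors the PR's change to `FileMetadata.compareContent`, which now returns `diffObj.isSame()` instead of calling `compareMetadata` a second time.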
34 changes: 30 additions & 4 deletions src/main/java/edu/harvard/iq/dataverse/FileVersionDifference.java
Expand Up @@ -10,7 +10,6 @@
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.ResourceBundle;

/**
*
Expand All @@ -21,6 +20,9 @@ public final class FileVersionDifference {
private FileMetadata newFileMetadata;
private FileMetadata originalFileMetadata;
private boolean details = false;
private boolean same = false;



private List<FileDifferenceSummaryGroup> differenceSummaryGroups = new ArrayList<>();
private List<FileDifferenceDetailItem> differenceDetailItems = new ArrayList<>();
Expand All @@ -37,7 +39,7 @@ public FileVersionDifference(FileMetadata newFileMetadata, FileMetadata original
this.originalFileMetadata = originalFileMetadata;
this.details = details;

compareMetadata(newFileMetadata, originalFileMetadata);
this.same = compareMetadata(newFileMetadata, originalFileMetadata);
//Compare versions - File Metadata first

}
Expand All @@ -50,7 +52,7 @@ public boolean compareMetadata(FileMetadata newFileMetadata, FileMetadata origin
and it updates the FileVersionDifference object which is used to display the differences on the dataset versions tab.
The return value is used by the index service bean to mark whether a file needs to be re-indexed in the context of a dataset update.
When there are changes (after v4.19) to the file metadata data model this method must be updated.
A retVal of true means the metadata are equal.
*/

boolean retVal = true;
Expand All @@ -68,13 +70,15 @@ When there are changes (after v4.19)to the file metadata data model this method

if (this.originalFileMetadata == null && this.newFileMetadata.getDataFile() != null ){
//File Added
if (!details) return false;
Member:
is there any way through the series of if statements where you don't end up returning false if ! details? Just wondering if you can do this check once.

retVal = false;
updateDifferenceSummary( "", BundleUtil.getStringFromBundle("file.versionDifferences.fileGroupTitle"), 1, 0, 0, 0);
}

//Check to see if File replaced
if (originalFileMetadata != null &&
newFileMetadata.getDataFile() != null && originalFileMetadata.getDataFile() != null &&!this.originalFileMetadata.getDataFile().equals(this.newFileMetadata.getDataFile())){
if (!details) return false;
updateDifferenceSummary( "", BundleUtil.getStringFromBundle("file.versionDifferences.fileGroupTitle"), 0, 0, 0, 1);
retVal = false;
}
Expand All @@ -83,6 +87,8 @@ When there are changes (after v4.19)to the file metadata data model this method
if (!newFileMetadata.getLabel().equals(originalFileMetadata.getLabel())) {
if (details) {
differenceDetailItems.add(new FileDifferenceDetailItem(BundleUtil.getStringFromBundle("file.versionDifferences.fileNameDetailTitle"), originalFileMetadata.getLabel(), newFileMetadata.getLabel()));
} else{
return false;
}
updateDifferenceSummary(BundleUtil.getStringFromBundle("file.versionDifferences.fileMetadataGroupTitle"),
BundleUtil.getStringFromBundle("file.versionDifferences.fileNameDetailTitle"), 0, 1, 0, 0);
Expand All @@ -97,6 +103,8 @@ When there are changes (after v4.19)to the file metadata data model this method
&& !newFileMetadata.getDescription().equals(originalFileMetadata.getDescription())) {
if (details) {
differenceDetailItems.add(new FileDifferenceDetailItem(BundleUtil.getStringFromBundle("file.versionDifferences.descriptionDetailTitle"), originalFileMetadata.getDescription(), newFileMetadata.getDescription()));
} else {
return false;
}
updateDifferenceSummary(BundleUtil.getStringFromBundle("file.versionDifferences.fileMetadataGroupTitle"),
BundleUtil.getStringFromBundle("file.versionDifferences.descriptionDetailTitle"), 0, 1, 0, 0);
Expand All @@ -107,6 +115,8 @@ When there are changes (after v4.19)to the file metadata data model this method
) {
if (details) {
differenceDetailItems.add(new FileDifferenceDetailItem(BundleUtil.getStringFromBundle("file.versionDifferences.descriptionDetailTitle"), "", newFileMetadata.getDescription()));
} else {
return false;
}
updateDifferenceSummary(BundleUtil.getStringFromBundle("file.versionDifferences.fileMetadataGroupTitle"),
BundleUtil.getStringFromBundle("file.versionDifferences.descriptionDetailTitle"), 1, 0, 0, 0);
Expand All @@ -117,6 +127,8 @@ When there are changes (after v4.19)to the file metadata data model this method
) {
if (details) {
differenceDetailItems.add(new FileDifferenceDetailItem(BundleUtil.getStringFromBundle("file.versionDifferences.descriptionDetailTitle"), originalFileMetadata.getDescription(), "" ));
} else {
return false;
}
updateDifferenceSummary(BundleUtil.getStringFromBundle("file.versionDifferences.fileMetadataGroupTitle"),
BundleUtil.getStringFromBundle("file.versionDifferences.descriptionDetailTitle"), 0, 0, 1, 0);
Expand All @@ -130,6 +142,8 @@ When there are changes (after v4.19)to the file metadata data model this method
&& !newFileMetadata.getProvFreeForm().equals(originalFileMetadata.getProvFreeForm())) {
if (details) {
differenceDetailItems.add(new FileDifferenceDetailItem(BundleUtil.getStringFromBundle("file.versionDifferences.provenanceDetailTitle"), originalFileMetadata.getProvFreeForm(), newFileMetadata.getProvFreeForm()));
} else {
return false;
}
updateDifferenceSummary(BundleUtil.getStringFromBundle("file.versionDifferences.fileMetadataGroupTitle"),
BundleUtil.getStringFromBundle("file.versionDifferences.provenanceDetailTitle"), 0, 1, 0, 0);
Expand All @@ -140,6 +154,8 @@ When there are changes (after v4.19)to the file metadata data model this method
) {
if (details) {
differenceDetailItems.add(new FileDifferenceDetailItem(BundleUtil.getStringFromBundle("file.versionDifferences.provenanceDetailTitle"), "", newFileMetadata.getProvFreeForm()));
} else {
return false;
}
updateDifferenceSummary(BundleUtil.getStringFromBundle("file.versionDifferences.fileMetadataGroupTitle"),
BundleUtil.getStringFromBundle("file.versionDifferences.provenanceDetailTitle"), 1, 0, 0, 0);
Expand All @@ -150,6 +166,8 @@ When there are changes (after v4.19)to the file metadata data model this method
) {
if (details) {
differenceDetailItems.add(new FileDifferenceDetailItem(BundleUtil.getStringFromBundle("file.versionDifferences.provenanceDetailTitle"), originalFileMetadata.getProvFreeForm(), "" ));
} else {
return false;
}
updateDifferenceSummary(BundleUtil.getStringFromBundle("file.versionDifferences.fileMetadataGroupTitle"),
BundleUtil.getStringFromBundle("file.versionDifferences.provenanceDetailTitle"), 0, 0, 1, 0);
Expand All @@ -170,7 +188,7 @@ When there are changes (after v4.19)to the file metadata data model this method
}

if (!value1.equals(value2)) {

if (!details) return false;
int added = 0;
int deleted = 0;

Expand Down Expand Up @@ -254,6 +272,14 @@ public void setOriginalFileMetadata(FileMetadata originalFileMetadata) {
this.originalFileMetadata = originalFileMetadata;
}

public boolean isSame() {
return same;
}

public void setSame(boolean same) {
this.same = same;
}


public List<FileDifferenceSummaryGroup> getDifferenceSummaryGroups() {
return differenceSummaryGroups;
Expand Down
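The early-exit optimization threaded through `compareMetadata` above — when the caller only needs a boolean (`details == false`), return at the first difference instead of building the full difference summary — can be sketched generically. This is an illustrative reduction over a flat key/value map, not the real field-by-field Dataverse comparison.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the early-return pattern: the details path keeps
// scanning to record every difference, while the boolean path bails out at
// the first mismatch, which is what the indexing code needs.
public class CompareSketch {

    public static boolean compareMetadata(Map<String, String> oldMeta,
                                          Map<String, String> newMeta,
                                          boolean details) {
        boolean same = true;
        Set<String> keys = new TreeSet<>(oldMeta.keySet());
        keys.addAll(newMeta.keySet());
        for (String key : keys) {
            if (!Objects.equals(oldMeta.get(key), newMeta.get(key))) {
                if (!details) {
                    return false; // cheap path: first difference ends the scan
                }
                same = false;
                // details path would record the difference summary here
            }
        }
        return same;
    }
}
```

The reviewer's question above (whether the `!details` check could be done once) amounts to asking whether every branch that finds a difference hits this same early `return false` — in this sketch it does, because the check sits at the single point where a mismatch is detected.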
@@ -1,12 +1,15 @@
package edu.harvard.iq.dataverse.search;

import edu.harvard.iq.dataverse.Dataset;
import edu.harvard.iq.dataverse.DatasetServiceBean;
import edu.harvard.iq.dataverse.Dataverse;
import edu.harvard.iq.dataverse.DataverseServiceBean;
import edu.harvard.iq.dataverse.DvObjectServiceBean;
import edu.harvard.iq.dataverse.util.SystemConfig;
import java.io.IOException;
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.concurrent.Future;
import java.util.logging.Level;
Expand Down Expand Up @@ -200,7 +203,7 @@ public Future<String> indexAllOrSubset(long numPartitions, long partitionId, boo

int datasetIndexCount = 0;
int datasetFailureCount = 0;
List<Long> datasetIds = datasetService.findAllOrSubset(numPartitions, partitionId, skipIndexed);
List<Long> datasetIds = datasetService.findAllOrSubsetOrderByFilesOwned(skipIndexed);
for (Long id : datasetIds) {
try {
datasetIndexCount++;
Expand Down