
Investigate performance degradation in reindex of datasets with large numbers of files #8256

@landreev

Description


This issue is to finish the investigation started in #8097.
I will copy and paste the relevant experimental data and discussion from the corresponding PR, #8152.

The short version: it takes about 6 minutes to index a production dataset with 25K files directly (i.e., via /api/admin/index/dataset), but the time goes up to 6 hours for the same dataset during an async reindex (via /api/admin/index or /api/admin/index/continue). The difference between the two scenarios appears to come down to where the dataset entity is instantiated relative to the main transaction. (This is all explained in more detail in the comments copied from #8152 below.) There must be a rational explanation for this, related to how the transaction context is managed by the EJB container. There is a good chance the same issue affects performance elsewhere in the code, wherever we have to modify datasets with similarly large numbers of files.
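For reference, the two scenarios compared above can be exercised via the admin API roughly as follows. This is a sketch: the localhost:8080 base URL and the dataset identifier placeholder are assumptions for illustration, not taken from this issue.

```shell
# Scenario 1: direct, synchronous indexing of a single dataset
# (observed at ~6 minutes for the 25K-file dataset):
time curl "http://localhost:8080/api/admin/index/dataset/<dataset-id>"

# Scenario 2: asynchronous batch reindex of all content, which picks up
# the same dataset as part of the batch (observed at ~6 hours for it):
curl "http://localhost:8080/api/admin/index"
# ...or resume a previously started batch reindex:
curl "http://localhost:8080/api/admin/index/continue"
```

The first call indexes inside the request; the batch endpoints index asynchronously, which is where the dataset entity ends up instantiated in a different relation to the main transaction.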

Metadata

Assignees: no one assigned

Labels:

- D: Dataset: large number of files (https://github.com/IQSS/dataverse-pm/issues/27)
- FY26 Sprint 4 (2025-08-13 - 2025-08-27)
- Size: 30 (a percentage of a sprint; 21 hours; formerly size:33)
