
Investigate performance degradation in reindex of datasets with large numbers of files #8256

@landreev

Description


This issue is to finish the investigation started in #8097.
I will copy and paste the relevant experimental data and discussion from the corresponding PR, #8152.

The short version: it takes about 6 minutes to index a production dataset with 25K files directly (i.e., via /api/admin/index/dataset), but the time goes up to 6 hours for the same dataset during an async reindex (via /api/admin/index or /api/admin/index/continue). The difference between the two scenarios appears to come down to where the dataset entity is instantiated relative to the main transaction. (This is all explained in more detail in the comments copied from #8152 below.) There must be a rational explanation for this, related to how the transaction context is managed by the EJB container. There is a good chance the same issue affects performance elsewhere in the code, wherever we have to modify datasets with similarly large numbers of files.
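For reference, the two scenarios compared above can be exercised via the admin API roughly as follows. This is a sketch: the localhost:8080 base URL and the dataset identifier placeholder are assumptions for illustration, not taken from this issue.

```shell
# Scenario 1: direct, synchronous indexing of a single dataset
# (observed at ~6 minutes for the 25K-file dataset):
time curl "http://localhost:8080/api/admin/index/dataset/<dataset-id>"

# Scenario 2: asynchronous batch reindex of all content, which picks up
# the same dataset as part of the batch (observed at ~6 hours for it):
curl "http://localhost:8080/api/admin/index"
# ...or resume a previously started batch reindex:
curl "http://localhost:8080/api/admin/index/continue"
```

The first call indexes inside the request; the batch endpoints index asynchronously, which is where the dataset entity ends up instantiated in a different relation to the main transaction.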

Metadata

Assignees: no one assigned

Labels:

- D: Dataset: large number of files (https://github.com/IQSS/dataverse-pm/issues/27)
- FY26 Sprint 4 (2025-08-13 - 2025-08-27)
- Size: 30 (a percentage of a sprint; 21 hours; formerly size:33)
