Skip to content

Comments

Version summaries fixes, enhancements and pagination#11859

Merged
ofahimIQSS merged 55 commits intodevelopfrom
11855-version-summaries-pagination
Oct 23, 2025
Merged

Version summaries fixes, enhancements and pagination#11859
ofahimIQSS merged 55 commits intodevelopfrom
11855-version-summaries-pagination

Conversation

@GPortas
Copy link
Contributor

@GPortas GPortas commented Oct 1, 2025

What this PR does / why we need it:

This pull request introduces pagination mechanisms for the dataset and datafile version summary/difference endpoints, along with a substantial reimplementation of the existing functionality.

While the initial scope was limited to adding pagination, a preliminary investigation revealed significant underlying issues that would have made a direct implementation inefficient and unsustainable. The core problems identified were:

  • High Coupling and Poor Separation of Concerns: The existing code lacked clear architectural layering and encapsulation for these use cases, leading to tightly coupled components that are difficult to maintain, test, and extend.

  • Severe Performance Bottlenecks: The implementation relied on fetching bulk data from the database and then post-processing it in Java using multiple nested loops. This approach caused significant performance degradation, especially for datasets or files with a large number of versions.

  • Low Test Coverage: The lack of a comprehensive test suite made it risky to extend/alter the existing functionality without introducing regressions.

  • Inconsistent Behavior and Errors: The code for file version summaries didn't work in some scenarios, generating errors like the one mentioned at API: File versionDifferences has 500 error after uploading files to published dataset #11561 or not functioning as it should.

Changes Made
Given the issues discovered, the decision was made to perform a comprehensive, end-to-end refactoring and reimplementation of these features. This ensures that the new pagination functionality is built upon a robust, performant, and maintainable foundation. I had to revisit each case of the file version summaries logic to fix incorrect behaviors based on the original JSF behavior.

The key changes include:

  • Architectural Realignment: The entire workflow, from the API endpoint to the data access layer, has been refactored to align with the established Dataverse architecture using Commands and Services. This improves modularity and clarifies responsibilities within the code.

  • Performance Optimization with JPA Criteria: All data processing has been pushed down to the database layer. In-memory processing with loops has been replaced with specific, performant JPA Criteria-based queries. This dramatically improves response times for entities with extensive version histories.

  • Improved Test Coverage: New unit tests have been introduced to cover the refactored code, addressing critical logic that was previously untested. This ensures the stability of the new implementation and simplifies future development.

The 500 errors that were occurring and reported in #11561 have been resolved. These errors were caused by a null pointer exception resulting from null datafiles. Below, the error trace is shown, followed by a screenshot of the error being reproduced, and then another screenshot taken after deploying this PR branch.

Reproduced error:
Screenshot 2025-10-14 at 16 40 13
Screenshot 2025-10-14 at 16 40 47

Fixed error:
Screenshot 2025-10-17 at 11 52 43

As you can see, the response time is 543 ms, which shows a significant improvement in load time compared to JSF. In addition, pagination hasn’t been implemented in the SPA yet, which would make it even faster — especially for data files with many more versions than this example.

In the UI, we still need to apply a few tweaks — specifically, to display the message “File not included in this version” instead of “No changes associated with this version” for versions where the file hadn’t yet been added. This change has already been discussed with the frontend team for implementation. cc: @ChengShi-1

Which issue(s) this PR closes:

Sharing some thoughts...:

Let's keep our focus on making sure our codebase stays healthy and easy to work with.

Sticking to our layered architecture is key. It's what makes it way easier for everyone to jump in and make changes without breaking things.

Let's also be serious about our tests. Good unit tests prove the little pieces work, and API tests make sure the whole thing hangs together. Both are necessary. Test cases should be complete enough, especially for features like these summaries with many different scenarios.

Finally, let's all try to follow the 'campsite rule' with tech debt: leave the code a little cleaner than you found it. If you spot something you can quickly improve, do it. It's a small effort that saves us from huge headaches down the road.

Suggestions on how to test this:

Performance enhancements could be tested by running this branch on an installation with a dataset or file with a large number of versions, and calling the endpoint below without pagination, and compare the response time with develop.

You can control pagination of the results using the following optional query parameters.

  • limit: The maximum number of version differences to return.
  • offset: The number of version differences to skip from the beginning of the list. Used for retrieving subsequent pages of results.

For example, to get the second page of results, with 2 items per page, you would use limit=2 and offset=2 (skipping the first two results).

For datasets:

curl -H "X-Dataverse-key: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" -X GET "https://demo.dataverse.org/api/datasets/:persistentId/versions/compareSummary?persistentId=doi:10.5072/FK2/BCCP9Z&limit=2&offset=2"

For files:

curl -H "X-Dataverse-key: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" -X GET "https://demo.dataverse.org/api/files/1234/versionDifferences?limit=2&offset=2"

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

No

Is there a release notes update needed for this change?:

Yes, attached.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

1 similar comment
@github-actions

This comment has been minimized.

@GPortas GPortas force-pushed the 11855-version-summaries-pagination branch from 943da66 to b3c4fbd Compare October 2, 2025 18:21
@github-actions

This comment has been minimized.

3 similar comments
@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

1 similar comment
@github-actions

This comment has been minimized.

@GPortas GPortas changed the title Version summaries pagination Version summaries enhancements and pagination Oct 6, 2025
@github-actions

This comment has been minimized.

1 similar comment
@github-actions

This comment has been minimized.

@GPortas
Copy link
Contributor Author

GPortas commented Oct 22, 2025

Everything should be working now as expected, except for the pre-existing issue mentioned in #11921. Thanks, @ChengShi-1!

@github-actions

This comment has been minimized.

@GPortas GPortas removed their assignment Oct 22, 2025
@github-actions

This comment has been minimized.


//TODO: this could use some refactoring to cut down on the number of for loops!
private FileMetadata getPreviousFileMetadata(FileMetadata fileMetadata, DatasetVersion currentversion) {
public FileMetadata getPreviousFileMetadata(FileMetadata fileMetadata, DatasetVersion currentversion) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why this was made public. the only usage is inside the helper

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks — I forgot to make it private again. I had to set it to public during the multi-step refactor while extracting logic and responsibilities from this class.

FWIW - The API no longer uses this method; instead, it uses a more efficient JPA Criteria–based DataFileServiceBean method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just made it private again

@cmbz cmbz added the FY26 Sprint 9 FY26 Sprint 9 (2025-10-22 - 2025-11-05) label Oct 23, 2025
@github-actions

This comment has been minimized.

@ChengShi-1
Copy link
Contributor

Hi Guillermo, thanks a lot for these fixes! 💯

I tested it locally and found something that may need attention.

For dataset:

  1. The result is missing contributors field for dataset, the payload was like
 {
id: number
versionNumber: string
summary?: DatasetVersionSummary | DatasetVersionSummaryStringValues
contributors: string
publishedOn?: string
}

Now, it is

{

"id": 4,
"versionNumber": "3.1",
"publishedOn": "2025-10-20",
"summary": { }
}
  1. It seems like it counts "changedFileMetaData" : 2 forever. However, it happens to JSF as well, which always count "changedFileMetaData" : 1, no matter how many changes the file has.

For file summaries, fileDifferenceSummary field shows File: added, deleted.... but it is not showing File Categories and Tabular Tags, File Access and File Metadata there.

I tested dataset summaries and file summaries again, and they work well now. Thanks for resolving them!

Copy link
Contributor

@sekmiller sekmiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good stuff here. Thanks.

@github-project-automation github-project-automation bot moved this from In Review 🔎 to Ready for QA ⏩ in IQSS Dataverse Project Oct 23, 2025
@ofahimIQSS ofahimIQSS moved this from Ready for QA ⏩ to QA ✅ in IQSS Dataverse Project Oct 23, 2025
@ofahimIQSS ofahimIQSS self-assigned this Oct 23, 2025
@github-actions

This comment has been minimized.

@ofahimIQSS
Copy link
Contributor

ofahimIQSS commented Oct 23, 2025

looks good - going to perform regression in QA. Merging after CI completes

@github-actions
Copy link

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:11855-version-summaries-pagination
ghcr.io/gdcc/configbaker:11855-version-summaries-pagination

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

@ofahimIQSS
Copy link
Contributor

everythings looking good here- merging

@ofahimIQSS ofahimIQSS merged commit 40bf831 into develop Oct 23, 2025
23 checks passed
@github-project-automation github-project-automation bot moved this from QA ✅ to Merged 🚀 in IQSS Dataverse Project Oct 23, 2025
@ofahimIQSS ofahimIQSS deleted the 11855-version-summaries-pagination branch October 23, 2025 20:49
@ofahimIQSS ofahimIQSS removed their assignment Oct 23, 2025
@pdurbin pdurbin added this to the 6.9 milestone Oct 23, 2025
@scolapasta scolapasta moved this from Merged 🚀 to Done 🧹 in IQSS Dataverse Project Oct 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

FY26 Sprint 8 FY26 Sprint 8 (2025-10-08 - 2025-10-22) FY26 Sprint 9 FY26 Sprint 9 (2025-10-22 - 2025-11-05) GREI Re-arch Issues related to the GREI Dataverse rearchitecture Original size: 30 Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) SPA.Q3.2025 Not related to any specific Q3 2025 feature SPA These changes are required for the Dataverse SPA

Projects

Status: Done 🧹

7 participants