Skip to content

fix: work around breaking json-ld export by incomplete html tags in file descriptions#11600

Merged
ofahimIQSS merged 5 commits intoIQSS:developfrom
Athemis:fix-json-ld
Aug 18, 2025
Merged

fix: work around breaking json-ld export by incomplete html tags in file descriptions#11600
ofahimIQSS merged 5 commits intoIQSS:developfrom
Athemis:fix-json-ld

Conversation

@Athemis
Copy link
Contributor

@Athemis Athemis commented Jul 2, 2025

What this PR does / why we need it:

Currently incomplete HTML tags or similar strings (e.g. <img, <CSP not the missing >) will cause a truncation of the json-ld string for published datasets. Such datasets cannot be rendered anymore and lead to a server-side 500 error.

Which issue(s) this PR closes:

Special notes for your reviewer:

This does not fix the string sanitation itself. The issue probably lies within Jsoup.clean and hence upstream. What is does though is sanitizing the description field first (which still is truncated). This potentially truncated description then does not impact the stripping of tags from the whole json-ld string.

The proper/sane way to do this would probably be to make sure all user-provided fields are individually stripped of (harmful) html tags before the json-ld is constructed. The sanitation of the final json-ld string could then be safely ommited.

Suggestions on how to test this:

  1. Create a dataset with a file attachment and add a description to this file containing e.g. <CSP (note the missing >).
  2. Publish the dataset
  3. Try to view the dataset or export metadata as JSON-LD. Note: In the docker-based dev environment this failed silently for me: The dataset view worked, however the json-ld export returned an empty page.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

N/A

Is there a release notes update needed for this change?:

Probably not.

Additional documentation:

@pdurbin pdurbin moved this to Ready for Triage in IQSS Dataverse Project Jul 2, 2025
@ofahimIQSS ofahimIQSS added the Size: 3 A percentage of a sprint. 2.1 hours. label Jul 2, 2025
@ofahimIQSS ofahimIQSS moved this from Ready for Triage to Ready for Review ⏩ in IQSS Dataverse Project Jul 2, 2025
@cmbz cmbz added the FY25 Sprint 26 FY25 Sprint 26 (2025-06-18 - 2025-07-02) label Jul 2, 2025
@pdurbin
Copy link
Member

pdurbin commented Jul 2, 2025

@Athemis thanks for the PR. We discussed it briefly today during "Triage Tuesday" (delayed a day this week).

I'm thinking we should add a test. If you're interested in helping with this, let's talk more in the Zulip topic.

Also, I just ran across this issue which seems related:

@cmbz cmbz added the FY26 Sprint 1 FY26 Sprint 1 (2025-07-02 - 2025-07-16) label Jul 2, 2025
@stevenwinship stevenwinship moved this from Ready for Review ⏩ to This Sprint 🏃‍♀️ 🏃 in IQSS Dataverse Project Jul 11, 2025
@cmbz cmbz added the FY26 Sprint 2 FY26 Sprint 2 (2025-07-16 - 2025-07-30) label Jul 17, 2025
@stevenwinship stevenwinship self-assigned this Jul 23, 2025
@stevenwinship stevenwinship moved this from This Sprint 🏃‍♀️ 🏃 to In Review 🔎 in IQSS Dataverse Project Jul 23, 2025
@coveralls
Copy link

coveralls commented Jul 23, 2025

Coverage Status

coverage: 23.227%. remained the same
when pulling 01c5681 on Athemis:fix-json-ld
into 88b3f0b on IQSS:develop.

@github-project-automation github-project-automation bot moved this from In Review 🔎 to Ready for QA ⏩ in IQSS Dataverse Project Jul 23, 2025
@stevenwinship stevenwinship removed their assignment Jul 23, 2025
@pdurbin pdurbin moved this from Ready for QA ⏩ to In Review 🔎 in IQSS Dataverse Project Jul 23, 2025
@pdurbin
Copy link
Member

pdurbin commented Jul 23, 2025

I pulled this out of "ready for QA" because @Athemis said he'd like to write a test for this (thanks!).

@Athemis
Copy link
Contributor Author

Athemis commented Jul 24, 2025

@pdurbin @stevenwinship Would you like the test to be part of this PR?

Edit: The test is available here Athemis/dataverse@7cd1b41 (based on current develop branch, without the fix applied). In my tests, this correctly fails without the applied fix and succeeds once the fix is applied. If desired, I'll pull the test into this branch to make it part of the PR.

Athemis added 2 commits July 24, 2025 10:17
Tests JSON-LD export functionality when file descriptions contain
incomplete HTML tags like '<CSP' or '<img' that cause JsonParsingException
due to MarkupChecker.stripAllTags() processing JSON-LD content as HTML.

The test verifies that the export succeeds and produces valid JSON-LD
output even with problematic HTML patterns in file descriptions.
@pdurbin
Copy link
Member

pdurbin commented Jul 24, 2025

@Athemis yes, if you would put the test directly in the PR, that would be lovely.

@Athemis
Copy link
Contributor Author

Athemis commented Jul 28, 2025

@Athemis yes, if you would put the test directly in the PR, that would be lovely.

Done and merged with current state of develop branch.

@Athemis
Copy link
Contributor Author

Athemis commented Jul 28, 2025

Btw I have no idea what jenkins is complaining about (cannot access the check details). Builds locally for me.

"major",
apiToken
);
publishResponse.then().assertThat().statusCode(200);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned in Zulip, this is returning a 403:

Error
1 expectation failed.
Expected status code <200> but was <403>.
Stacktrace
java.lang.AssertionError:
1 expectation failed.
Expected status code <200> but was <403>.
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:500)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:481)
    at org.codehaus.groovy.reflection.CachedConstructor.invoke(CachedConstructor.java:73)
    at org.codehaus.groovy.runtime.callsite.ConstructorSite$ConstructorSiteNoUnwrapNoCoerce.callConstructor(ConstructorSite.java:108)
    at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallConstructor(CallSiteArray.java:57)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callConstructor(AbstractCallSite.java:263)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callConstructor(AbstractCallSite.java:277)
    at io.restassured.internal.ResponseSpecificationImpl$HamcrestAssertionClosure.validate(ResponseSpecificationImpl.groovy:512)
    at io.restassured.internal.ResponseSpecificationImpl$HamcrestAssertionClosure$validate$1.call(Unknown Source)
    at io.restassured.internal.ResponseSpecificationImpl.validateResponseIfRequired(ResponseSpecificationImpl.groovy:696)
    at io.restassured.internal.ResponseSpecificationImpl.this$2$validateResponseIfRequired(ResponseSpecificationImpl.groovy)
    at jdk.internal.reflect.GeneratedMethodAccessor144.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:569)
    at org.codehaus.groovy.runtime.callsite.PlainObjectMetaMethodSite.doInvoke(PlainObjectMetaMethodSite.java:43)
    at org.codehaus.groovy.runtime.callsite.PogoMetaMethodSite$PogoCachedMethodSiteNoUnwrapNoCoerce.invoke(PogoMetaMethodSite.java:198)
    at org.codehaus.groovy.runtime.callsite.PogoMetaMethodSite.callCurrent(PogoMetaMethodSite.java:62)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:185)
    at io.restassured.internal.ResponseSpecificationImpl.statusCode(ResponseSpecificationImpl.groovy:135)
    at io.restassured.specification.ResponseSpecification$statusCode$0.callCurrent(Unknown Source)
    at io.restassured.internal.ResponseSpecificationImpl.statusCode(ResponseSpecificationImpl.groovy:143)
    at io.restassured.internal.ValidatableResponseOptionsImpl.statusCode(ValidatableResponseOptionsImpl.java:89)
    at edu.harvard.iq.dataverse.api.JsonLDExportIT.testJsonLDExportWithIncompleteHtmlTagsInFileDescription(JsonLDExportIT.java:88)
    at java.base/java.lang.reflect.Method.invoke(Method.java:569)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
Standard Error
Jul 28, 2025 1:02:23 PM edu.harvard.iq.dataverse.api.UtilIT getRestAssuredBaseUri
INFO: Base URL for tests: http://localhost:8080
Jul 28, 2025 1:02:23 PM edu.harvard.iq.dataverse.api.UtilIT createRandomUser
INFO: Creating random test user user3614cf35
Jul 28, 2025 1:02:24 PM edu.harvard.iq.dataverse.api.UtilIT getApiTokenFromResponse
INFO: API token found in create user response: 1dc8f8a1-73c8-4dda-9e86-25e9dd6d01e0
Jul 28, 2025 1:02:24 PM edu.harvard.iq.dataverse.api.UtilIT getUsernameFromResponse
INFO: Username found in create user response: user3614cf35
Jul 28, 2025 1:02:24 PM edu.harvard.iq.dataverse.api.UtilIT getAliasFromResponse
INFO: Alias found in create dataverse response: dv1edbd9e9
Jul 28, 2025 1:02:24 PM edu.harvard.iq.dataverse.api.UtilIT getDatasetIdFromResponse
INFO: Id found in create dataset response: 317
Jul 28, 2025 1:02:24 PM edu.harvard.iq.dataverse.api.UtilIT getDatasetPersistentIdFromResponse
INFO: Persistent id found in create dataset response: doi:10.5072/FK2/YFGBI7

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error was: "This dataset is locked. Reason: Ingest. Please try publishing later."

I pushed a fix at 767a191.

Another fix could have been to switch to uploading a non-ingestable file, such as an image.

I'll wait for the tests to pass before moving this to QA.

Copy link
Member

@pdurbin pdurbin Jul 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh. SearchIT#testSearchFilesAndUrlImages is failing with a 500 error. It should be unrelated. I clicked "build now" to try again and will keep an eye on https://jenkins.dataverse.org/job/IQSS-Dataverse-Develop-PR/job/PR-11600/7/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh good, job 7 passed. I'll move this to "ready for QA".

Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@github-project-automation github-project-automation bot moved this from In Review 🔎 to Ready for QA ⏩ in IQSS Dataverse Project Jul 30, 2025
@pdurbin pdurbin removed their assignment Jul 30, 2025
@cmbz cmbz added the FY26 Sprint 3 (2025-07-30 - 2025-08-13) label Jul 31, 2025
@ofahimIQSS ofahimIQSS self-assigned this Jul 31, 2025
@cmbz cmbz added the FY26 Sprint 4 FY26 Sprint 4 (2025-08-13 - 2025-08-27) label Aug 14, 2025
@ofahimIQSS
Copy link
Contributor

Fix looks good - merging PR

@ofahimIQSS ofahimIQSS merged commit ba839a1 into IQSS:develop Aug 18, 2025
15 checks passed
@github-project-automation github-project-automation bot moved this from QA ✅ to Merged 🚀 in IQSS Dataverse Project Aug 18, 2025
@ofahimIQSS ofahimIQSS removed their assignment Aug 18, 2025
@pdurbin pdurbin added this to the 6.8 milestone Aug 18, 2025
@scolapasta scolapasta moved this from Merged 🚀 to Done 🧹 in IQSS Dataverse Project Aug 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

FY25 Sprint 26 FY25 Sprint 26 (2025-06-18 - 2025-07-02) FY26 Sprint 1 FY26 Sprint 1 (2025-07-02 - 2025-07-16) FY26 Sprint 2 FY26 Sprint 2 (2025-07-16 - 2025-07-30) FY26 Sprint 3 (2025-07-30 - 2025-08-13) FY26 Sprint 4 FY26 Sprint 4 (2025-08-13 - 2025-08-27) Size: 3 A percentage of a sprint. 2.1 hours.

Projects

Status: Done 🧹

Development

Successfully merging this pull request may close these issues.

JSON-LD creation fails due to errors in string sanitation of file descriptions leading to server error and failure to display dataset

6 participants