fix: work around breaking json-ld export by incomplete html tags in file descriptions#11600
fix: work around breaking json-ld export by incomplete html tags in file descriptions#11600ofahimIQSS merged 5 commits intoIQSS:developfrom
Conversation
|
@Athemis thanks for the PR. We discussed it briefly today during "Triage Tuesday" (delayed a day this week). I'm thinking we should add a test. If you're interested in helping with this, let's talk more in the Zulip topic. Also, I just ran across this issue which seems related: |
|
@pdurbin @stevenwinship Would you like the test to be part of this PR? Edit: The test is available here Athemis/dataverse@7cd1b41 (based on current |
Tests JSON-LD export functionality when file descriptions contain incomplete HTML tags like '<CSP' or '<img' that cause JsonParsingException due to MarkupChecker.stripAllTags() processing JSON-LD content as HTML. The test verifies that the export succeeds and produces valid JSON-LD output even with problematic HTML patterns in file descriptions.
|
@Athemis yes, if you would put the test directly in the PR, that would be lovely. |
Done and merged with current state of |
|
Btw I have no idea what jenkins is complaining about (cannot access the check details). Builds locally for me. |
| "major", | ||
| apiToken | ||
| ); | ||
| publishResponse.then().assertThat().statusCode(200); |
There was a problem hiding this comment.
As I mentioned in Zulip, this is returning a 403:
Error
1 expectation failed.
Expected status code <200> but was <403>.
Stacktrace
java.lang.AssertionError:
1 expectation failed.
Expected status code <200> but was <403>.
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:500)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:481)
at org.codehaus.groovy.reflection.CachedConstructor.invoke(CachedConstructor.java:73)
at org.codehaus.groovy.runtime.callsite.ConstructorSite$ConstructorSiteNoUnwrapNoCoerce.callConstructor(ConstructorSite.java:108)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallConstructor(CallSiteArray.java:57)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callConstructor(AbstractCallSite.java:263)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callConstructor(AbstractCallSite.java:277)
at io.restassured.internal.ResponseSpecificationImpl$HamcrestAssertionClosure.validate(ResponseSpecificationImpl.groovy:512)
at io.restassured.internal.ResponseSpecificationImpl$HamcrestAssertionClosure$validate$1.call(Unknown Source)
at io.restassured.internal.ResponseSpecificationImpl.validateResponseIfRequired(ResponseSpecificationImpl.groovy:696)
at io.restassured.internal.ResponseSpecificationImpl.this$2$validateResponseIfRequired(ResponseSpecificationImpl.groovy)
at jdk.internal.reflect.GeneratedMethodAccessor144.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:569)
at org.codehaus.groovy.runtime.callsite.PlainObjectMetaMethodSite.doInvoke(PlainObjectMetaMethodSite.java:43)
at org.codehaus.groovy.runtime.callsite.PogoMetaMethodSite$PogoCachedMethodSiteNoUnwrapNoCoerce.invoke(PogoMetaMethodSite.java:198)
at org.codehaus.groovy.runtime.callsite.PogoMetaMethodSite.callCurrent(PogoMetaMethodSite.java:62)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:185)
at io.restassured.internal.ResponseSpecificationImpl.statusCode(ResponseSpecificationImpl.groovy:135)
at io.restassured.specification.ResponseSpecification$statusCode$0.callCurrent(Unknown Source)
at io.restassured.internal.ResponseSpecificationImpl.statusCode(ResponseSpecificationImpl.groovy:143)
at io.restassured.internal.ValidatableResponseOptionsImpl.statusCode(ValidatableResponseOptionsImpl.java:89)
at edu.harvard.iq.dataverse.api.JsonLDExportIT.testJsonLDExportWithIncompleteHtmlTagsInFileDescription(JsonLDExportIT.java:88)
at java.base/java.lang.reflect.Method.invoke(Method.java:569)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
Standard Error
Jul 28, 2025 1:02:23 PM edu.harvard.iq.dataverse.api.UtilIT getRestAssuredBaseUri
INFO: Base URL for tests: http://localhost:8080
Jul 28, 2025 1:02:23 PM edu.harvard.iq.dataverse.api.UtilIT createRandomUser
INFO: Creating random test user user3614cf35
Jul 28, 2025 1:02:24 PM edu.harvard.iq.dataverse.api.UtilIT getApiTokenFromResponse
INFO: API token found in create user response: 1dc8f8a1-73c8-4dda-9e86-25e9dd6d01e0
Jul 28, 2025 1:02:24 PM edu.harvard.iq.dataverse.api.UtilIT getUsernameFromResponse
INFO: Username found in create user response: user3614cf35
Jul 28, 2025 1:02:24 PM edu.harvard.iq.dataverse.api.UtilIT getAliasFromResponse
INFO: Alias found in create dataverse response: dv1edbd9e9
Jul 28, 2025 1:02:24 PM edu.harvard.iq.dataverse.api.UtilIT getDatasetIdFromResponse
INFO: Id found in create dataset response: 317
Jul 28, 2025 1:02:24 PM edu.harvard.iq.dataverse.api.UtilIT getDatasetPersistentIdFromResponse
INFO: Persistent id found in create dataset response: doi:10.5072/FK2/YFGBI7
There was a problem hiding this comment.
The error was: "This dataset is locked. Reason: Ingest. Please try publishing later."
I pushed a fix at 767a191.
Another fix could have been to switch to uploading a non-ingestable file, such as an image.
I'll wait for the tests to pass before moving this to QA.
There was a problem hiding this comment.
Huh. SearchIT#testSearchFilesAndUrlImages is failing with a 500 error. It should be unrelated. I clicked "build now" to try again and will keep an eye on https://jenkins.dataverse.org/job/IQSS-Dataverse-Develop-PR/job/PR-11600/7/
There was a problem hiding this comment.
Oh good, job 7 passed. I'll move this to "ready for QA".
pdurbin
left a comment
There was a problem hiding this comment.
Tests are passing: https://jenkins.dataverse.org/job/IQSS-Dataverse-Develop-PR/job/PR-11600/7/testReport/
I'm approving this.
|
Fix looks good - merging PR |
What this PR does / why we need it:
Currently incomplete HTML tags or similar strings (e.g.
<img,<CSPnot the missing>) will cause a truncation of the json-ld string for published datasets. Such datasets cannot be rendered anymore and lead to a server-side 500 error.Which issue(s) this PR closes:
Special notes for your reviewer:
This does not fix the string sanitation itself. The issue probably lies within
Jsoup.cleanand hence upstream. What is does though is sanitizing the description field first (which still is truncated). This potentially truncated description then does not impact the stripping of tags from the whole json-ld string.The proper/sane way to do this would probably be to make sure all user-provided fields are individually stripped of (harmful) html tags before the json-ld is constructed. The sanitation of the final json-ld string could then be safely ommited.
Suggestions on how to test this:
<CSP(note the missing>).Does this PR introduce a user interface change? If mockups are available, please link/include them here:
N/A
Is there a release notes update needed for this change?:
Probably not.
Additional documentation: