Skip to content

give expected file extension when downloading auxiliary files #8241#8255

Closed
pdurbin wants to merge 20 commits intodevelopfrom
8241-ext
Closed

give expected file extension when downloading auxiliary files #8241#8255
pdurbin wants to merge 20 commits intodevelopfrom
8241-ext

Conversation

@pdurbin
Copy link
Member

@pdurbin pdurbin commented Nov 22, 2021

What this PR does / why we need it:

People are uploading aux files like "foo.json" and getting "foo.txt" (the wrong extension) back.

Which issue(s) this PR closes:

Closes #8241

Special notes for your reviewer:

Because pull request #8237 is in flight and is highly related, this pull request was branched from it. So, the other pull request should go through QA first. In order to see the diff of just this pull request: QualitativeDataRepository/dataverse@IQSS/8235-auxfile_enhancements...IQSS:8241-ext

Suggestions on how to test this:

A .json aux file, for example, and download it again. On develop it will be ".txt". In the pull request, it should be ".json".

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

No.

Is there a release notes update needed for this change?:

Included but may not be strictly necessary.

Additional documentation:

Included.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.01%) to 18.964% when pulling 1c95a95 on 8241-ext into c91689a on develop.

@qqmyers
Copy link
Member

qqmyers commented Nov 24, 2021

Looking quickly at the code, I didn't see any bugs/it all looks reasonable.

I'm not so sure on the design though - it adds a filename just to use its extension to ~override the mimetype and that filename is otherwise unused in the API (since uploads are by format_tag/format_version and downloads get named the same way with an extension from the mimetype) and the contentType column isn't being synced with the extension. Since the contentType is in the current design, I wonder if something like mirroring the proposed design but just sending the mimetype in the contentDispositionHeader and using that to override the detected mimetype (as the value stored in contentType) would serve the purpose (application/json files would download with a .json extension) without adding another column, and without confusing people as to why two uploads with the same format_tag/format_version but different filenames result in an error, etc.
(I'm also slightly concerned that allowing a complete override of the mimetype is potentially an issue - I guess regardless of how one let's the user supply a mimetype, one could eventually do a sanity check against what tika returns (e.g. application/json is a subtype of text, so the user supplied application/json could override that but not be used to re-type an exe file.))

@landreev landreev self-requested a review November 29, 2021 03:44
@landreev landreev self-assigned this Nov 29, 2021
Copy link
Contributor

@landreev landreev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the same question as @qqmyers above - not sure about saving the complete filename, as uploaded; but then only using the extension. Should we just use the entire filename, if available?

My main question/concern is the mime type. The PR appears to be fixing the extension, as requested. But the original issue was about the content type too. The original title says "... extension/type is lost".

I see this as part of a general problem with how we save the mime type in the database, on upload: I'm not entirely sure why we rely solely on detecting the type of the file (with Tika). If we want this aux. file upload logic to somewhat follow how we handle regular datafile uploads, I feel like we should accept and save the type as supplied by the client; and only call tika.detect() if it's not supplied.
In other words, to address the "lost type" part of the original issue, I don't think we need to do anything to improve how we detect the type; but simply give the API user an option to add the -H "Content-type:application/json" header to the upload request.
Does this make sense?

@landreev
Copy link
Contributor

landreev commented Dec 2, 2021

P.S. What I just proposed may be contradicting what @qqmyers was saying earlier; the part about trusting the client when it comes to the mime type submitted. So this should be up for debate/extra discussion if necessary.

@landreev landreev removed their assignment Dec 2, 2021
@qqmyers
Copy link
Member

qqmyers commented Dec 2, 2021

I'm not too concerned about trusting the client (is this the worst thing you could do with an API key?), but doing so would be the opposite of what happens with data files. There we trust only if detecting fails.

@djbrooke
Copy link
Contributor

djbrooke commented Dec 3, 2021

@scolapasta can you weigh in on the appropriate path forward here since there seems to be differing opinions? Thanks!

@landreev
Copy link
Contributor

landreev commented Dec 3, 2021

@qqmyers You are correct of course, about detecting, vs. accepting the client-supplied type, when regular datafiles are uploaded. I shouldn't have phrased it like that. But we seem to agree, that there should be some way for a client to at least try to supply the mime type.

Whether we blindly take their word for it, or try to detect first - I don't have a super strong opinion either. But, if we were to try to implement what you were describing earlier:

... one could eventually do a sanity check against what tika returns (e.g. application/json is a subtype of text, so the user supplied application/json could override that but not be used to re-type an exe file.)

to be able to make that check - which type is a sub-type of the other/which of the 2 types is more granular - is there some readily-available library for this? Or would that need to be implemented as a (potentially messy) set of special case logic?

With the DP case, it's definitely reasonable to expect the API caller to know exactly what their file types should be. But I don't know about your case(s).
I do agree that it's likely not "the worst thing you could do with an API key". 😄

@landreev
Copy link
Contributor

landreev commented Dec 3, 2021

(unrelated to this issue - this reminds me that I need to open an issue for the mime type detection not working with certain api upload scenarios, for regular datafile uploads...)

@qqmyers
Copy link
Member

qqmyers commented Dec 3, 2021

I don't know of a library, or of special cases of interest beyond text - application/json. Since this is an api call (versus a form), it is probably a reasonable assumption that callers would only add a mimetype header if they care, so a detect only if not supplied approach probably wouldn't have accidental overwrites (fwiw: I think for datafiles, they key thing is that text/tsv, text/tab-separated-value files get detected so we avoid letting raw files in with the mimetype used for ingested tab files - which isn't an issue with aux files).
So - I'm OK with a trust a mimetype header approach.
FWIW: Also would be OK with a use the whole filename approach but that seems like a larger rewrite as the apis all assemble filename/path from the tags and version number.

@scolapasta
Copy link
Contributor

OK, I reviewed the comments above - seems like @landreev and @qqmyers are both OK with a trust a mimetype header approach. If that's the case, so am I. :)

@scolapasta scolapasta assigned scolapasta and pdurbin and unassigned scolapasta Dec 3, 2021
@pdurbin
Copy link
Member Author

pdurbin commented Dec 6, 2021

@raprasad (since you opened the issue) I'm just checking in quick to see if you're ok with sending a mimetype header.

@raprasad
Copy link
Contributor

raprasad commented Dec 6, 2021

@pdurbin: Thanks for the curl example, e.g. adding the Content-type to the header:

That sounds fine with us. Thx.

curl -H X-Dataverse-key:$API_TOKEN -X POST -H "Content-type:application/json" $SERVER_URL/api/dataverses/$ID/roles --upload-file roles.json

@pdurbin
Copy link
Member Author

pdurbin commented Dec 6, 2021

@raprasad sure, alternatively we could change

-F "file=@foo.json"

to

-F "file=@foo.json;type=application/json"

I have this working in the code. The upload method is annotated with @Consumes(MediaType.MULTIPART_FORM_DATA) so my question to @landreev @qqmyers and @scolapasta is if we're ok with this rather than -H "Content-type:application/json".

The reason I went down this -F road is that if you add -H "Content-type:application/json" to the API call, you get {"status":"ERROR","code":415,"message":"HTTP 415 Unsupported Media Type"} and from a look at https://stackoverflow.com/questions/44807767/jersey-api-return-415-unsupported-media-type-when-using-multipart-form-data/44814996#44814996 it seems that if unless we move away from @Consumes(MediaType.MULTIPART_FORM_DATA) we should probably send the mimetype like in the -F example above. From that StackOverflow answer:

Use a different client that knows how to specifically send files as multipart. You generally don't want to manually create the request (or set the Content-Type header) when it comes to multipart as the it is a little more complicated than a normal request. For example, this a what a multipart request looks like

Content-Type: multipart/form-data; boundary=AaB03x

--AaB03x
Content-Disposition: form-data; name="submit-name"

Larry
--AaB03x
Content-Disposition: form-data; name="files"; filename="file1.txt"
Content-Type: text/plain

... contents of file1.txt ...
--AaB03x--

Incidentally, we already have a note about -F "file=@foo.json;type=application/json" in our Add a File to a Dataset docs (highlighted in blue below):

Screen Shot 2021-12-06 at 2 29 15 PM

I'll clean up what I have and push it to a branch so you can see in code what I'm talking about.

@qqmyers
Copy link
Member

qqmyers commented Dec 6, 2021

+1 - Yeah - I was just thinking that you'd keep the multipart form and just add the file type as one of the form parts, and I think I looked in the file upload code to see what Java would be needed.

@pdurbin
Copy link
Member Author

pdurbin commented Dec 6, 2021

@qqmyers thanks. I just made pull request #8282 as an alternative to this one. As discussed above, it allows the user to specify the content type.

I don't think we need two pull requests on the board at once so I'll remove this one and put the new one in review. We can just close out whichever one doesn't get merged.

@pdurbin pdurbin removed their assignment Dec 6, 2021
@pdurbin
Copy link
Member Author

pdurbin commented Dec 7, 2021

Closing in favor of pull request #8282.

@pdurbin pdurbin closed this Dec 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

auxiliary file download: JSON file extension/type is lost

7 participants