Conversation
Conflicts (from PR #8192 it seems): src/main/java/edu/harvard/iq/dataverse/api/Access.java
Conflicts: src/main/java/edu/harvard/iq/dataverse/api/Access.java
|
Looking quickly at the code, I didn't see any bugs/it all looks reasonable. I'm not so sure on the design though - it adds a filename just to use its extension to ~override the mimetype and that filename is otherwise unused in the API (since uploads are by format_tag/format_version and downloads get named the same way with an extension from the mimetype) and the contentType column isn't being synced with the extension. Since the contentType is in the current design, I wonder if something like mirroring the proposed design but just sending the mimetype in the contentDispositionHeader and using that to override the detected mimetype (as the value stored in contentType) would serve the purpose (application/json files would download with a .json extension) without adding another column, and without confusing people as to why two uploads with the same format_tag/format_version but different filenames result in an error, etc. |
landreev
left a comment
There was a problem hiding this comment.
I have the same question as @qqmyers above - not sure about saving the complete filename, as uploaded; but then only using the extension. Should we just use the entire filename, if available?
My main question/concern is the mime type. The PR appears to be fixing the extension, as requested. But the original issue was about the content type too. The original title says "... extension/type is lost".
I see this as part of a general problem with how we save the mime type in the database, on upload: I'm not entirely sure why we rely solely on detecting the type of the file (with Tika). If we want this aux. file upload logic to somewhat follow how we handle regular datafile uploads, I feel like we should accept and save the type as supplied by the client; and only call tika.detect() if it's not supplied.
In other words, to address the "lost type" part of the original issue, I don't think we need to do anything to improve how we detect the type; but simply give the API user an option to add the -H "Content-type:application/json" header to the upload request.
Does this make sense?
|
P.S. What I just proposed may be contradicting what @qqmyers was saying earlier; the part about trusting the client when it comes to the mime type submitted. So this should be up for debate/extra discussion if necessary. |
|
I'm not too concerned about trusting the client (is this the worst thing you could do with an API key?), but doing so would be the opposite of what happens with data files. There we trust only if detecting fails. |
|
@scolapasta can you weigh in on the appropriate path forward here since there seems to be differing opinions? Thanks! |
|
@qqmyers You are correct of course, about detecting, vs. accepting the client-supplied type, when regular datafiles are uploaded. I shouldn't have phrased it like that. But we seem to agree, that there should be some way for a client to at least try to supply the mime type. Whether we blindly take their word for it, or try to detect first - I don't have a super strong opinion either. But, if we were to try to implement what you were describing earlier:
to be able to make that check - which type is a sub-type of the other/which of the 2 types is more granular - is there some readily-available library for this? Or would that need to be implemented as a (potentially messy) set of special case logic? With the DP case, it's definitely reasonable to expect the API caller to know exactly what their file types should be. But I don't know about your case(s). |
|
(unrelated to this issue - this reminds me that I need to open an issue for the mime type detection not working with certain api upload scenarios, for regular datafile uploads...) |
|
I don't know of a library, or of special cases of interest beyond text - application/json. Since this is an api call (versus a form), it is probably a reasonable assumption that callers would only add a mimetype header if they care, so a detect only if not supplied approach probably wouldn't have accidental overwrites (fwiw: I think for datafiles, they key thing is that text/tsv, text/tab-separated-value files get detected so we avoid letting raw files in with the mimetype used for ingested tab files - which isn't an issue with aux files). |
|
@raprasad (since you opened the issue) I'm just checking in quick to see if you're ok with sending a mimetype header. |
|
@pdurbin: Thanks for the curl example, e.g. adding the That sounds fine with us. Thx. |
|
@raprasad sure, alternatively we could change to I have this working in the code. The upload method is annotated with The reason I went down this
Incidentally, we already have a note about I'll clean up what I have and push it to a branch so you can see in code what I'm talking about. |
|
+1 - Yeah - I was just thinking that you'd keep the multipart form and just add the file type as one of the form parts, and I think I looked in the file upload code to see what Java would be needed. |
|
@qqmyers thanks. I just made pull request #8282 as an alternative to this one. As discussed above, it allows the user to specify the content type. I don't think we need two pull requests on the board at once so I'll remove this one and put the new one in review. We can just close out whichever one doesn't get merged. |
|
Closing in favor of pull request #8282. |

What this PR does / why we need it:
People are uploading aux files like "foo.json" and getting "foo.txt" (the wrong extension) back.
Which issue(s) this PR closes:
Closes #8241
Special notes for your reviewer:
Because pull request #8237 is in flight and is highly related, this pull request was branched from it. So, the other pull request should go through QA first. In order to see the diff of just this pull request: QualitativeDataRepository/dataverse@IQSS/8235-auxfile_enhancements...IQSS:8241-ext
Suggestions on how to test this:
A .json aux file, for example, and download it again. On develop it will be ".txt". In the pull request, it should be ".json".
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
No.
Is there a release notes update needed for this change?:
Included but may not be strictly necessary.
Additional documentation:
Included.