Conversation
… API. Will post more info tomorrow about what may still need to be done there. #7275
```java
InputStreamIO auxStreamIO = new InputStreamIO(storageIO.getAuxFileAsInputStream(auxTag), auxFileSize);
auxStreamIO.setFileName(storageIO.getFileName() + "." + auxTag);
auxStreamIO.setMimeType(di.getAuxiliaryFile().getContentType());
storageIO = auxStreamIO;
```
storageIO.open called above creates the inputStream for the main file. If it's not going to be used, it needs to be closed (storageIO.getInputStream().close() - StorageIO doesn't implement Closeable yet/have a close() method - which could be useful someday).
Thanks, I missed that. Fixed.
It's not pretty, what goes on in that DownloadInstance/DownloadInstanceWriter setup, btw. I am opening an issue for cleaning up/simplifying/rewriting it.
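The close() method suggested above could look something like the following. This is a hypothetical sketch, not the Dataverse StorageIO class (which, as noted, doesn't implement Closeable today); the class and stream contents here are stand-ins:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical: if StorageIO implemented AutoCloseable, a caller that swaps
// in an InputStreamIO for an aux file could release the unused main-file
// stream with one close() call instead of reaching into getInputStream().
public class ClosableStorageSketch implements AutoCloseable {
    private InputStream inputStream;

    public void open() {
        // stand-in for opening the backing store
        inputStream = new ByteArrayInputStream(new byte[]{1, 2, 3});
    }

    public InputStream getInputStream() {
        return inputStream;
    }

    @Override
    public void close() throws IOException {
        // safe to call even if the stream was never read
        if (inputStream != null) {
            inputStream.close();
            inputStream = null;
        }
    }

    public static void main(String[] args) throws Exception {
        ClosableStorageSketch io = new ClosableStorageSketch();
        io.open();
        io.close();
        System.out.println(io.getInputStream() == null); // true
    }
}
```

A try-with-resources block at the call site would then guarantee the stream is released on every code path.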
```java
storageIO.saveInputStreamAsAux(fileInputStream, auxExtension);
auxFile.setChecksum(FileUtil.calculateChecksum(storageIO.getAuxFileAsInputStream(auxExtension), systemConfig.getFileFixityChecksumAlgorithm()));

Tika tika = new Tika();
```
So far Tika is only used for full-text indexing and content-type detection is done differently for Datafiles. Could/should these both be done the same way? (not necessarily in this PR)
for Datafiles, JHove is used, which works on a file as opposed to an inputStream. So to use JHove, I think I would have to save the aux file to a temp location, unless I'm missing something. Is this something we want to do?
FWIW: Requiring a temp file for mime detection is a problem for Datafiles too, so it may make sense to use Tika instead of JHOVE there as well. (#6937 would make it possible to get ranges of bytes from S3 to help avoid having to get a whole file).
ok, should we create a separate ticket for changing mime detection to Tika for Datafiles? (Is this something that is usually talked about at dv_tech hour?)
The method that's used for DataFiles is determineContentType in FileUtil. It would probably be good to use the same code path, instead of using Tika. Or switch it all to Tika, I guess, but that seems daunting to me. Or I guess we could use Tika here for now and work on consistency later. And sure, this is a fine topic for tech hours. 😄
Discussed this during tech hours, we agreed that it would be good to eventually switch from JHove to Tika in other parts of the code for consistency
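The distinction driving this thread is that JHove wants a java.io.File while stream-based detectors don't, so no temp file is needed. As a JDK-only illustration of the idea (not the Dataverse code, which uses Tika), `URLConnection.guessContentTypeFromStream` sniffs magic bytes from any mark-supporting stream:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;

public class StreamSniff {
    // Detect a content type from the first bytes of a stream, no temp file.
    // The stream must support mark/reset (ByteArrayInputStream and
    // BufferedInputStream both do).
    public static String sniff(byte[] head) throws IOException {
        InputStream in = new ByteArrayInputStream(head);
        return URLConnection.guessContentTypeFromStream(in);
    }

    public static void main(String[] args) throws IOException {
        // PNG magic number: 0x89 'P' 'N' 'G' \r \n 0x1a \n
        byte[] png = {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1a, '\n'};
        System.out.println(sniff(png)); // image/png
    }
}
```

Tika works the same way from the caller's point of view but recognizes far more formats, which is why switching DataFile detection from JHove to a stream-based detector avoids the save-to-temp-location step discussed above.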
I'm about to approve the PR. There's just one thing I regret not having thought about earlier, when we were having this discussion, above: WHY are we trying to detect mime types at all? As opposed to just making the uploading client supply it, as another parameter?
We try to detect mime types on "normal" uploaded files - but that's because users upload arbitrary files... This API is for something that's structured; at least in the immediate use case scenario. I.e. that preprocessed summary stats fragment will always be in JSON, the diff. private DDI will always be XML etc. etc. We don't expect any variety there...
I don't see this as a problem, having this detection code in place. It may come in handy for other cases. And, even if we decide that these aux uploads should, or can supply the mime type as a parameter - we could use this detection as an extra validation step.
Still, I can't help but feel I should've said "let's not even worry about it, let's just add a parameter instead", early on.
```java
storageIO = dataFile.getStorageIO();
AuxiliaryFile auxFile = new AuxiliaryFile();
storageIO.saveInputStreamAsAux(fileInputStream, auxExtension);
auxFile.setChecksum(FileUtil.calculateChecksum(storageIO.getAuxFileAsInputStream(auxExtension), systemConfig.getFileFixityChecksumAlgorithm()));
```
It's not too hard to wrap the fileInputStream and calculate the hash during the original write (versus having to retrieve the file and scan it a second time). (The util.Bag code has examples.)
Can you explain more what you had in mind? I looked at BagGenerator and BagValidationJob, and can't find an example of this. I could see how you could combine the logic that reads the stream to calculate the checksum with the read from the stream that happens in StorageIO, but I don't see an example of where that's happening.
Sorry - it would help if I pointed to the right place. It's actually the archiving commands that do this. E.g. the Google Cloud one. They complicate things with threads, but the basic idea is that you wrap your stream with a java.security.DigestInputStream and, once you've finished reading the bytes, you can get the digest for the algorithm you set from that stream.
ok, that makes sense, thank you
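The DigestInputStream approach described above can be sketched as follows. This is a generic sketch under stated assumptions (arbitrary stream contents and algorithm), not the Dataverse or GoogleCloud archiver code:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ChecksumWhileWriting {
    // Wrap the upload stream so the digest accumulates as bytes are read;
    // whatever consumes the wrapped stream (e.g. the write to storage)
    // drives the hashing, so no second pass over the stored file is needed.
    public static String digestWhileReading(InputStream in, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        try (DigestInputStream dis = new DigestInputStream(in, md)) {
            byte[] buf = new byte[8192];
            while (dis.read(buf) != -1) {
                // the real code would write buf to the aux file here
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        String md5 = digestWhileReading(new ByteArrayInputStream("abc".getBytes()), "MD5");
        System.out.println(md5); // 900150983cd24fb0d6963f7d28e17f72
    }
}
```

In the saveInputStreamAsAux call above, passing the wrapped stream instead of the raw fileInputStream would let setChecksum read the digest without the extra getAuxFileAsInputStream round trip.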
```java
public Response saveAuxiliaryFileWithVersion(@PathParam("fileId") Long fileId,
        @PathParam("formatTag") String formatTag,
        @PathParam("formatVersion") String formatVersion,
        @FormDataParam("origin") String origin,
```
FWIW: the normal file add uses a single jsonData param rather than multiple FormDataParams. Not sure if that's intended to be consistent across the API.
I agree with the desire for consistency, but the parameters listed in Auxiliary files are required, whereas for depositing a Datafile, the jsonData parameter contains only optional params. So I think it's probably better to keep the Auxiliary API as is so that the required parameters are more obvious.
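To make the required parameters concrete, an upload call with these FormDataParams might look like the following. The server, token, and exact URL path are placeholders inferred from the @PathParam/@FormDataParam annotations in this PR, not confirmed documentation:

```shell
export SERVER_URL=https://demo.dataverse.org   # placeholder server
export API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

# fileId=42, formatTag=dpJson, formatVersion=v1 are example values
curl -H "X-Dataverse-key:$API_TOKEN" \
  -X POST \
  -F "file=@release.json" \
  -F "origin=myApp" \
  -F "isPublic=true" \
  "$SERVER_URL/api/access/datafile/42/auxiliary/dpJson/v1"
```

Keeping formatTag and formatVersion in the path and the rest as form fields does make the required inputs visible in the URL, which supports the consistency trade-off argued above.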
is being written to storage
Overall, this looks good. Below I'll make some high-level comments and summarize feedback I left on specific lines.
The pull request template wasn't used so I'm not sure if this pull request is meant to close #7275 or not. This should be clarified.
The branch is quite far behind develop, so the latest needs to be merged in.
There are no docs. An update to the API Guide should be made with full curl examples of how to use these new endpoints. Also, a release note fragment should be added to describe the feature.
There seems to be a security/privacy issue with how isPublic is implemented: You can download an aux file from a dataset draft. I don't know if this is by design or not but this will probably be unexpected behavior. Update: I re-tested this and got ..."code": 403, "message": "Not authorized to access this object via this API endpoint...
It's clear that this pull request is a small chunk along the way to integration between OpenDP and Dataverse, which is fine. I see that there's an issue at #7400 to worry more about the UI implications of these new aux files.
Since the focus in this pull request is on APIs I'm wondering if we should do a little more. Specifically, here are some use cases I can anticipate people wanting:
- I want to be able to delete an aux file.
- I want to be able to list all the aux files for a data file.
- For one of my aux files, I want to be able to check its content type and other information.
@qqmyers noticed that filetype detection is done through a different method (Tika) than our normal method. I'm ok with QA testing it as-is but we should probably make these consistent at some point.
I'm still a little troubled that we have API endpoints for files under both "access" and "files". We need to keep all endpoints working, of course, but eventually we should come up with guidance on what should go where. I prefer "files" but I understand that "access" was there first.
Again, overall, looks good.
src/main/java/edu/harvard/iq/dataverse/AuxiliaryFileServiceBean.java
```java
@Produces({"text/xml"})
public DownloadInstance tabularDatafileMetadataPreprocessed(@PathParam("fileId") String fileId, @QueryParam("key") String apiToken, @Context UriInfo uriInfo, @Context HttpHeaders headers, @Context HttpServletResponse response) throws ServiceUnavailableException {
```
Instead of tabularDatafileMetadataPreprocessed maybe this should be called tabularDatafileMetadataAux or something (or something entirely new) to deemphasize the older preprocessed format.
```java
        @FormDataParam("file") InputStream fileInputStream,
        @FormDataParam("file") FormDataContentDisposition contentDispositionHeader,
        @FormDataParam("file") final FormDataBodyPart formDataBodyPart
) {
```
It seems like only InputStream is used. What are FormDataContentDisposition and FormDataBodyPart for?
```java
boolean saved = auxiliaryFileService.processAuxiliaryFile(fileInputStream, dataFile, formatTag, formatVersion, origin, isPublic);

if (saved) {
    return ok("Auxiliary file has been saved.");
```
What if we return the aux file id and the detected content type? If we add AuxiliaryFile to JsonPrinter.java, what would the JSON look like?
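As a straw man for the question above, a JsonPrinter entry for AuxiliaryFile might produce something like this. The field names are guesses based on the parameters discussed in this PR, not the implemented output:

```json
{
  "status": "OK",
  "data": {
    "formatTag": "dpJson",
    "formatVersion": "v1",
    "origin": "myApp",
    "isPublic": true,
    "contentType": "application/json",
    "checksum": "..."
  }
}
```

Returning the detected contentType in particular would let clients verify that Tika classified the upload the way they expected.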
…e into 7275-upload-auxliary-files
pdurbin left a comment
I made a minor comment about created vs ok.
@ekraffmiller Getting 404 endpoint does not exist error. I'm sure it is my fault but could use some help. I used the example command line but with params on the same line.
fixed typo in deposit URL
What this PR does / why we need it: This is a generic API for uploading and downloading Auxiliary files. Needed for OpenDP use case
Which issue(s) this PR closes: 7275
Closes #7275
Special notes for your reviewer:
Suggestions on how to test this: Integration test in AccessIT
Does this PR introduce a user interface change? If mockups are available, please link/include them here: no UI change
Is there a release notes update needed for this change?:
Additional documentation: documentation added to Dataverse Developer doc