
7275 upload auxiliary files #7350

Merged: kcondon merged 18 commits into develop from 7275-upload-auxliary-files on Nov 30, 2020

Conversation

ekraffmiller (Contributor) commented Oct 22, 2020:

What this PR does / why we need it: This is a generic API for uploading and downloading auxiliary files, needed for the OpenDP use case.

Which issue(s) this PR closes: 7275

Closes #7275

Special notes for your reviewer:

Suggestions on how to test this: Integration test in AccessIT

Does this PR introduce a user interface change? If mockups are available, please link/include them here: no UI change

Is there a release notes update needed for this change?:

Additional documentation: documentation added to Dataverse Developer doc

coveralls commented Oct 22, 2020:

Coverage Status

Coverage decreased (-0.03%) to 19.446% when pulling 103b591 on 7275-upload-auxliary-files into f940279 on develop.

InputStreamIO auxStreamIO = new InputStreamIO(storageIO.getAuxFileAsInputStream(auxTag), auxFileSize);
auxStreamIO.setFileName(storageIO.getFileName() + "." + auxTag);
auxStreamIO.setMimeType(di.getAuxiliaryFile().getContentType());
storageIO = auxStreamIO;
Member:

storageIO.open, called above, creates the inputStream for the main file. If it's not going to be used, it needs to be closed (storageIO.getInputStream().close()). StorageIO doesn't implement Closeable or have a close() method yet, which could be useful someday.
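A minimal sketch of the pattern being suggested, using a hypothetical openMainFile() in place of storageIO.open(), since StorageIO itself doesn't expose a close() method:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class CloseUnusedStream {
    // Hypothetical stand-in for storageIO.open(): returns a stream for the main file.
    static InputStream openMainFile() {
        return new ByteArrayInputStream("main file bytes".getBytes());
    }

    public static void main(String[] args) throws IOException {
        // If we decide to serve the aux file instead, the main-file stream
        // opened earlier must still be closed to avoid leaking a handle.
        try (InputStream mainStream = openMainFile()) {
            // ... decide we are serving the aux file; mainStream goes unused ...
        } // try-with-resources closes it regardless
        System.out.println("main stream closed");
    }
}
```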

Contributor:

Thanks, I missed that. Fixed.
It's not pretty, what goes on in that DownloadInstance/DownloadInstanceWriter setup, btw. I am opening an issue for cleaning up/simplifying/rewriting it.

storageIO.saveInputStreamAsAux(fileInputStream, auxExtension);
auxFile.setChecksum(FileUtil.calculateChecksum(storageIO.getAuxFileAsInputStream(auxExtension), systemConfig.getFileFixityChecksumAlgorithm()));

Tika tika = new Tika();
Member:

So far Tika is only used for full-text indexing and content-type detection is done differently for Datafiles. Could/should these both be done the same way? (not necessarily in this PR)

Contributor Author:

for Datafiles, JHove is used, which works on a file as opposed to an inputStream. So to use JHove, I think I would have to save the aux file to a temp location, unless I'm missing something. Is this something we want to do?

Member:

FWIW: Requiring a temp file for mime detection is a problem for Datafiles too, so it may make sense to use Tika instead of JHOVE there as well. (#6937 would make it possible to get ranges of bytes from S3 to help avoid having to get a whole file).
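Tika itself isn't shown here, but the JDK's own stream sniffer illustrates the property that makes stream-based detection attractive: it reads only the leading bytes of a mark/reset-capable stream, with no temp file on disk (unlike JHove). A sketch:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;

public class StreamTypeSniff {
    public static void main(String[] args) throws IOException {
        // The 8-byte PNG magic number is enough for detection.
        byte[] pngMagic = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
        // ByteArrayInputStream supports mark/reset, which the sniffer requires.
        InputStream in = new ByteArrayInputStream(pngMagic);
        System.out.println(URLConnection.guessContentTypeFromStream(in));
    }
}
```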

Contributor Author:

ok, should we create a separate ticket for changing mime detection to Tika for Datafiles? (Is this something that is usually talked about at dv_tech hour?)

Member:

The method that's used for DataFiles is determineContentType in FileUtil. It would probably be good to use the same code path, instead of using Tika. Or switch it all to Tika, I guess, but that seems daunting to me. Or I guess we could use Tika here for now and work on consistency later. And sure, this is a fine topic for tech hours. 😄

Contributor Author:

Discussed this during tech hours; we agreed that it would be good to eventually switch from JHove to Tika in other parts of the code for consistency.

Contributor:

I'm about to approve the PR. There's just one thing I regret not having thought about earlier, when we were having this discussion, above: WHY are we trying to detect mime types at all? As opposed to just making the uploading client supply it, as another parameter?
We try to detect mime types on "normal" uploaded files, but that's because users upload arbitrary files. This API is for something that's structured, at least in the immediate use case: the preprocessed summary stats fragment will always be JSON, the differentially private DDI will always be XML, etc. We don't expect any variety there.
I don't see this as a problem, having this detection code in place. It may come in handy for other cases. And, even if we decide that these aux uploads should, or can supply the mime type as a parameter - we could use this detection as an extra validation step.
Still, I can't help but feel I should've said "let's not even worry about it, let's just add a parameter instead", early on.
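The "detection as an extra validation step" idea could be as simple as comparing the client-supplied type against the detected one. A hedged sketch (matches() is a hypothetical helper, not code from this PR):

```java
public class MimeTypeValidation {
    // Hypothetical helper: compare a client-supplied content type with the
    // type detected from the bytes, to flag a mismatch for logging or rejection.
    static boolean matches(String supplied, String detected) {
        if (supplied == null || detected == null) {
            return false;
        }
        // Compare only type/subtype, ignoring parameters like charset.
        String a = supplied.split(";")[0].trim().toLowerCase();
        String b = detected.split(";")[0].trim().toLowerCase();
        return a.equals(b);
    }

    public static void main(String[] args) {
        System.out.println(matches("application/json; charset=utf-8", "application/json"));
        System.out.println(matches("application/json", "text/xml"));
    }
}
```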

storageIO = dataFile.getStorageIO();
AuxiliaryFile auxFile = new AuxiliaryFile();
storageIO.saveInputStreamAsAux(fileInputStream, auxExtension);
auxFile.setChecksum(FileUtil.calculateChecksum(storageIO.getAuxFileAsInputStream(auxExtension), systemConfig.getFileFixityChecksumAlgorithm()));
Member:

It's not too hard to wrap the fileInputStream and calculate the hash during the original write (versus having to retrieve the file and scan it a second time). (The util.Bag code has examples.)

ekraffmiller (Contributor Author) commented Oct 30, 2020:

Can you explain more what you had in mind? I looked at BagGenerator and BagValidationJob and can't find an example of this. I can see how you could combine the logic that reads the stream to calculate the checksum with the read from the stream that happens in StorageIO, but I don't see an example of where that's happening.

Member:

Sorry - it would help if I pointed to the right place. It's actually the archiving commands that do this. E.g. the Google Cloud one. They complicate things with threads, but the basic idea is that you wrap your stream with a java.security.DigestInputStream and, once you've finished reading the bytes, you can get the digest for the algorithm you set from that stream.
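A minimal single-scan sketch of the DigestInputStream pattern described above (the null OutputStream stands in for saveInputStreamAsAux, and MD5 stands in for whatever the configured fixity algorithm is):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SingleScanChecksum {
    public static void main(String[] args) throws IOException, NoSuchAlgorithmException {
        byte[] content = "aux file payload".getBytes();
        MessageDigest md = MessageDigest.getInstance("MD5");
        // Wrap the upload stream; every byte read through the wrapper also
        // feeds the digest, so the checksum is ready as soon as the write finishes.
        try (InputStream in = new DigestInputStream(new ByteArrayInputStream(content), md)) {
            in.transferTo(OutputStream.nullOutputStream()); // stand-in for saveInputStreamAsAux
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(hex); // 32 hex chars, computed in one pass
    }
}
```

This avoids the second read via getAuxFileAsInputStream entirely, which matters most for remote stores like S3.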

Contributor Author:

ok, that makes sense, thank you

public Response saveAuxiliaryFileWithVersion(@PathParam("fileId") Long fileId,
@PathParam("formatTag") String formatTag,
@PathParam("formatVersion") String formatVersion,
@FormDataParam("origin") String origin,
Member:

FWIW: the normal file add uses a single jsonData param rather than multiple FormDataParams. Not sure if that's intended to be consistent across the API.

Contributor Author:

I agree with the desire for consistency, but the parameters listed in Auxiliary files are required, whereas for depositing a Datafile, the jsonData parameter contains only optional params. So I think it's probably better to keep the Auxiliary API as is so that the required parameters are more obvious.

@pdurbin pdurbin changed the title 7275 upload auxliary files 7275 upload auxiliary files Nov 6, 2020
@pdurbin pdurbin self-assigned this Nov 6, 2020
pdurbin (Member) left a comment:

Overall, this looks good. Below I'll make some high-level comments and summarize feedback I left on specific lines.

The pull request template wasn't used so I'm not sure if this pull request is meant to close #7275 or not. This should be clarified.

The branch is quite far behind develop, so the latest changes need to be merged in.

There are no docs. An update to the API Guide should be made with full curl examples of how to use these new endpoints. Also, a release note fragment should be added to describe the feature.

There seems to be a security/privacy issue with how isPublic is implemented: You can download an aux file from a dataset draft. I don't know if this is by design or not but this will probably be unexpected behavior. Update: I re-tested this and got ..."code": 403, "message": "Not authorized to access this object via this API endpoint...

It's clear that this pull request is a small chunk along the way to integration between OpenDP and Dataverse, which is fine. I see that there's an issue at #7400 to worry more about the UI implications of these new aux files.

Since the focus in this pull request is on APIs I'm wondering if we should do a little more. Specifically, here are some use cases I can anticipate people wanting:

  • I want to be able to delete an aux file.
  • I want to be able to list all the aux files for a data file.
  • For one of my aux files, I want to be able to check its content type and other information.

@qqmyers noticed that filetype detection is done through a different method (Tika) than our normal methods. I'm ok with QA testing it as-is but we should probably make these consistent at some point.

I'm still a little troubled that we have API endpoints for files under both "access" and "files". We need to keep all endpoints working, of course, but eventually we should come up with guidance on what should go where. I prefer "files" but I understand that "access" was there first.

Again, overall, looks good.


@Produces({"text/xml"})

public DownloadInstance tabularDatafileMetadataPreprocessed(@PathParam("fileId") String fileId, @QueryParam("key") String apiToken, @Context UriInfo uriInfo, @Context HttpHeaders headers, @Context HttpServletResponse response) throws ServiceUnavailableException {
public DownloadInstance tabularDatafileMetadataPreprocessed(@PathParam("fileId") String fileId,
Member:

Instead of tabularDatafileMetadataPreprocessed maybe this should be called tabularDatafileMetadataAux or something (or something entirely new) to deemphasize the older preprocessed format.

Comment on lines 1137 to 1140
@FormDataParam("file") InputStream fileInputStream,
@FormDataParam("file") FormDataContentDisposition contentDispositionHeader,
@FormDataParam("file") final FormDataBodyPart formDataBodyPart
) {
Member:

It seems like only InputStream is used. What are FormDataContentDisposition and FormDataBodyPart for?

boolean saved = auxiliaryFileService.processAuxiliaryFile(fileInputStream, dataFile, formatTag, formatVersion, origin, isPublic);

if (saved) {
return ok("Auxiliary file has been saved.");
Member:

What if we return the aux file id and the detected content type? If we add AuxiliaryFile to JsonPrinter.java, what would the JSON look like?
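One possible shape for such a response, purely as a sketch; the field names here are guesses for discussion, not the implemented JsonPrinter output:

```json
{
  "status": "OK",
  "data": {
    "formatTag": "dpJson",
    "formatVersion": "v1",
    "origin": "myApp",
    "isPublic": true,
    "contentType": "application/json",
    "fileSize": 1234,
    "checksum": "..."
  }
}
```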

@pdurbin pdurbin assigned ekraffmiller and unassigned pdurbin Nov 9, 2020
@djbrooke djbrooke assigned landreev and unassigned ekraffmiller Nov 18, 2020
@landreev landreev removed their assignment Nov 23, 2020
@pdurbin pdurbin self-assigned this Nov 24, 2020
pdurbin (Member) left a comment:

I made a minor comment about created vs ok.

pdurbin (Member) left a comment:

@landreev and I decided to make the "add" response 200/OK. I also merged the latest from develop.

@pdurbin pdurbin removed their assignment Nov 24, 2020
@kcondon kcondon self-assigned this Nov 30, 2020
kcondon (Contributor) commented Nov 30, 2020:

@ekraffmiller I'm getting a 404 "endpoint does not exist" error. I'm sure it's my fault but could use some help. I used the example command line but with the params on the same line.

fixed typo in deposit URL
@kcondon kcondon merged commit 6f88c11 into develop Nov 30, 2020
@kcondon kcondon deleted the 7275-upload-auxliary-files branch November 30, 2020 21:53
@djbrooke djbrooke added this to the 5.3 milestone Nov 30, 2020

Development

Successfully merging this pull request may close these issues.

Add API to (relatively) arbitrarily upload auxiliary files

7 participants