
7275 upload auxiliary files #7350

Merged: kcondon merged 18 commits into develop from 7275-upload-auxliary-files on Nov 30, 2020

Conversation

ekraffmiller (Contributor) commented Oct 22, 2020:

What this PR does / why we need it: This is a generic API for uploading and downloading auxiliary files, needed for the OpenDP use case.

Which issue(s) this PR closes: 7275

Closes #7275

Special notes for your reviewer:

Suggestions on how to test this: Integration test in AccessIT

Does this PR introduce a user interface change? If mockups are available, please link/include them here: no UI change

Is there a release notes update needed for this change?:

Additional documentation: documentation added to Dataverse Developer doc

coveralls commented Oct 22, 2020:

Coverage Status

Coverage decreased (-0.03%) to 19.446% when pulling 103b591 on 7275-upload-auxliary-files into f940279 on develop.

InputStreamIO auxStreamIO = new InputStreamIO(storageIO.getAuxFileAsInputStream(auxTag), auxFileSize);
auxStreamIO.setFileName(storageIO.getFileName() + "." + auxTag);
auxStreamIO.setMimeType(di.getAuxiliaryFile().getContentType());
storageIO = auxStreamIO;
Member:

storageIO.open, called above, creates the inputStream for the main file. If it's not going to be used, it needs to be closed (storageIO.getInputStream().close()). StorageIO doesn't implement Closeable or have a close() method yet, which could be useful someday.
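A minimal sketch of the pattern being suggested, using a hypothetical openMainFile() in place of storageIO.open(), since StorageIO itself doesn't expose a close() method:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class CloseUnusedStream {
    // Hypothetical stand-in for storageIO.open(): returns a stream for the main file.
    static InputStream openMainFile() {
        return new ByteArrayInputStream("main file bytes".getBytes());
    }

    public static void main(String[] args) throws IOException {
        // If we decide to serve the aux file instead, the main-file stream
        // opened earlier must still be closed to avoid leaking a handle.
        try (InputStream mainStream = openMainFile()) {
            // ... decide we are serving the aux file; mainStream goes unused ...
        } // try-with-resources closes it regardless
        System.out.println("main stream closed");
    }
}
```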

Contributor:

Thanks, I missed that. Fixed.
It's not pretty, what goes on in that DownloadInstance/DownloadInstanceWriter setup, btw. I am opening an issue for cleaning up/simplifying/rewriting it.

storageIO.saveInputStreamAsAux(fileInputStream, auxExtension);
auxFile.setChecksum(FileUtil.calculateChecksum(storageIO.getAuxFileAsInputStream(auxExtension), systemConfig.getFileFixityChecksumAlgorithm()));

Tika tika = new Tika();
Member:

So far Tika is only used for full-text indexing and content-type detection is done differently for Datafiles. Could/should these both be done the same way? (not necessarily in this PR)

Contributor Author:

for Datafiles, JHove is used, which works on a file as opposed to an inputStream. So to use JHove, I think I would have to save the aux file to a temp location, unless I'm missing something. Is this something we want to do?

Member:

FWIW: Requiring a temp file for mime detection is a problem for Datafiles too, so it may make sense to use Tika instead of JHOVE there as well. (#6937 would make it possible to get ranges of bytes from S3 to help avoid having to get a whole file).
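Tika itself isn't shown here, but the JDK's own stream sniffer illustrates the property that makes stream-based detection attractive: it reads only the leading bytes of a mark/reset-capable stream, with no temp file on disk (unlike JHove). A sketch:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;

public class StreamTypeSniff {
    public static void main(String[] args) throws IOException {
        // The 8-byte PNG magic number is enough for detection.
        byte[] pngMagic = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
        // ByteArrayInputStream supports mark/reset, which the sniffer requires.
        InputStream in = new ByteArrayInputStream(pngMagic);
        System.out.println(URLConnection.guessContentTypeFromStream(in));
    }
}
```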

Contributor Author:

ok, should we create a separate ticket for changing mime detection to Tika for Datafiles? (Is this something that is usually talked about at dv_tech hour?)

Member:

The method that's used for DataFiles is determineContentType in FileUtil. It would probably be good to use the same code path, instead of using Tika. Or switch it all to Tika, I guess, but that seems daunting to me. Or I guess we could use Tika here for now and work on consistency later. And sure, this is a fine topic for tech hours. 😄

Contributor Author:

Discussed this during tech hours; we agreed that it would be good to eventually switch from JHove to Tika in other parts of the code for consistency.

Contributor:

I'm about to approve the PR. There's just one thing I regret not having thought about earlier, when we were having this discussion, above: WHY are we trying to detect mime types at all? As opposed to just making the uploading client supply it, as another parameter?
We try to detect mime types on "normal" uploaded files, but that's because users upload arbitrary files. This API is for something that's structured, at least in the immediate use case: the preprocessed summary stats fragment will always be JSON, the differentially private DDI will always be XML, etc. We don't expect any variety there.
I don't see this as a problem, having this detection code in place. It may come in handy for other cases. And, even if we decide that these aux uploads should, or can supply the mime type as a parameter - we could use this detection as an extra validation step.
Still, I can't help but feel I should've said "let's not even worry about it, let's just add a parameter instead", early on.
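The "detection as an extra validation step" idea could be as simple as comparing the client-supplied type against the detected one. A hedged sketch (matches() is a hypothetical helper, not code from this PR):

```java
public class MimeTypeValidation {
    // Hypothetical helper: compare a client-supplied content type with the
    // type detected from the bytes, to flag a mismatch for logging or rejection.
    static boolean matches(String supplied, String detected) {
        if (supplied == null || detected == null) {
            return false;
        }
        // Compare only type/subtype, ignoring parameters like charset.
        String a = supplied.split(";")[0].trim().toLowerCase();
        String b = detected.split(";")[0].trim().toLowerCase();
        return a.equals(b);
    }

    public static void main(String[] args) {
        System.out.println(matches("application/json; charset=utf-8", "application/json"));
        System.out.println(matches("application/json", "text/xml"));
    }
}
```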

storageIO = dataFile.getStorageIO();
AuxiliaryFile auxFile = new AuxiliaryFile();
storageIO.saveInputStreamAsAux(fileInputStream, auxExtension);
auxFile.setChecksum(FileUtil.calculateChecksum(storageIO.getAuxFileAsInputStream(auxExtension), systemConfig.getFileFixityChecksumAlgorithm()));
Member:

It's not too hard to wrap the fileInputStream and calculate the hash during the original write (versus having to retrieve the file and scan it a second time). (The util.Bag code has examples.)

ekraffmiller (Contributor Author) commented Oct 30, 2020:

Can you explain more what you had in mind? I looked at BagGenerator and BagValidationJob and can't find an example of this. I can see how you could combine the logic that reads the stream to calculate the checksum with the read from the stream that happens in StorageIO, but I don't see an example of where that's happening.

Member:

Sorry - it would help if I pointed to the right place. It's actually the archiving commands that do this. E.g. the Google Cloud one. They complicate things with threads, but the basic idea is that you wrap your stream with a java.security.DigestInputStream and, once you've finished reading the bytes, you can get the digest for the algorithm you set from that stream.
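A minimal single-scan sketch of the DigestInputStream pattern described above (the null OutputStream stands in for saveInputStreamAsAux, and MD5 stands in for whatever the configured fixity algorithm is):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SingleScanChecksum {
    public static void main(String[] args) throws IOException, NoSuchAlgorithmException {
        byte[] content = "aux file payload".getBytes();
        MessageDigest md = MessageDigest.getInstance("MD5");
        // Wrap the upload stream; every byte read through the wrapper also
        // feeds the digest, so the checksum is ready as soon as the write finishes.
        try (InputStream in = new DigestInputStream(new ByteArrayInputStream(content), md)) {
            in.transferTo(OutputStream.nullOutputStream()); // stand-in for saveInputStreamAsAux
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(hex); // 32 hex chars, computed in one pass
    }
}
```

This avoids the second read via getAuxFileAsInputStream entirely, which matters most for remote stores like S3.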

Contributor Author:

ok, that makes sense, thank you

public Response saveAuxiliaryFileWithVersion(@PathParam("fileId") Long fileId,
@PathParam("formatTag") String formatTag,
@PathParam("formatVersion") String formatVersion,
@FormDataParam("origin") String origin,
Member:

FWIW: the normal file add uses a single jsonData param rather than multiple FormDataParams. Not sure if that's intended to be consistent across the API.

Contributor Author:

I agree with the desire for consistency, but the parameters listed in Auxiliary files are required, whereas for depositing a Datafile, the jsonData parameter contains only optional params. So I think it's probably better to keep the Auxiliary API as is so that the required parameters are more obvious.

@pdurbin pdurbin changed the title 7275 upload auxliary files 7275 upload auxiliary files Nov 6, 2020
@pdurbin pdurbin self-assigned this Nov 6, 2020
pdurbin (Member) left a comment:

Overall, this looks good. Below I'll make some high-level comments and summarize feedback I left on specific lines.

The pull request template wasn't used so I'm not sure if this pull request is meant to close #7275 or not. This should be clarified.

The branch is quite far behind develop, so the latest changes need to be merged in.

There are no docs. An update to the API Guide should be made with full curl examples of how to use these new endpoints. Also, a release note fragment should be added to describe the feature.

There seems to be a security/privacy issue with how isPublic is implemented: You can download an aux file from a dataset draft. I don't know if this is by design or not but this will probably be unexpected behavior. Update: I re-tested this and got ..."code": 403, "message": "Not authorized to access this object via this API endpoint...

It's clear that this pull request is a small chunk along the way to integration between OpenDP and Dataverse, which is fine. I see that there's an issue at #7400 to worry more about the UI implications of these new aux files.

Since the focus in this pull request is on APIs I'm wondering if we should do a little more. Specifically, here are some use cases I can anticipate people wanting:

  • I want to be able to delete an aux file.
  • I want to be able to list all the aux files for a data file.
  • For one of my aux files, I want to be able to check its content type and other information.

@qqmyers noticed that filetype detection is done through a different method (Tika) than our normal methods. I'm ok with QA testing it as-is but we should probably make these consistent at some point.

I'm still a little troubled that we have API endpoints for files under both "access" and "files". We need to keep all endpoints working, of course, but eventually we should come up with guidance on what should go where. I prefer "files" but I understand that "access" was there first.

Again, overall, looks good.


@Produces({"text/xml"})

public DownloadInstance tabularDatafileMetadataPreprocessed(@PathParam("fileId") String fileId, @QueryParam("key") String apiToken, @Context UriInfo uriInfo, @Context HttpHeaders headers, @Context HttpServletResponse response) throws ServiceUnavailableException {
public DownloadInstance tabularDatafileMetadataPreprocessed(@PathParam("fileId") String fileId,
Member:

Instead of tabularDatafileMetadataPreprocessed maybe this should be called tabularDatafileMetadataAux or something (or something entirely new) to deemphasize the older preprocessed format.

Comment on lines 1137 to 1140
@FormDataParam("file") InputStream fileInputStream,
@FormDataParam("file") FormDataContentDisposition contentDispositionHeader,
@FormDataParam("file") final FormDataBodyPart formDataBodyPart
) {
Member:

It seems like only InputStream is used. What are FormDataContentDisposition and FormDataBodyPart for?

boolean saved = auxiliaryFileService.processAuxiliaryFile(fileInputStream, dataFile, formatTag, formatVersion, origin, isPublic);

if (saved) {
return ok("Auxiliary file has been saved.");
Member:

What if we return the aux file id and the detected content type? If we add AuxiliaryFile to JsonPrinter.java, what would the JSON look like?
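One possible shape for such a response, purely as a sketch; the field names here are guesses for discussion, not the implemented JsonPrinter output:

```json
{
  "status": "OK",
  "data": {
    "formatTag": "dpJson",
    "formatVersion": "v1",
    "origin": "myApp",
    "isPublic": true,
    "contentType": "application/json",
    "fileSize": 1234,
    "checksum": "..."
  }
}
```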

@pdurbin pdurbin assigned ekraffmiller and unassigned pdurbin Nov 9, 2020
@djbrooke djbrooke assigned landreev and unassigned ekraffmiller Nov 18, 2020
@landreev landreev removed their assignment Nov 23, 2020
@pdurbin pdurbin self-assigned this Nov 24, 2020
pdurbin (Member) left a comment:

I made a minor comment about created vs ok.

pdurbin (Member) left a comment:

@landreev and I decided to make the "add" response 200/OK. I also merged the latest from develop.

@pdurbin pdurbin removed their assignment Nov 24, 2020
@kcondon kcondon self-assigned this Nov 30, 2020
kcondon (Contributor) commented Nov 30, 2020:

@ekraffmiller I'm getting a 404 "endpoint does not exist" error. I'm sure it's my fault but could use some help. I used the example command line but with the params on the same line.

fixed typo in deposit URL
@kcondon kcondon merged commit 6f88c11 into develop Nov 30, 2020
@kcondon kcondon deleted the 7275-upload-auxliary-files branch November 30, 2020 21:53
@djbrooke djbrooke added this to the 5.3 milestone Nov 30, 2020

Development

Successfully merging this pull request may close these issues.

Add API to (relatively) arbitrarily upload auxiliary files

7 participants