7275 upload auxiliary files #7350
Merged

Changes from all commits (18 commits)
* 5c2391b  new API method for saving a generic auxiliary file to a data file (ekraffmiller)
* 91748b3  modified framework for the download/GET part of the new aux. metadata… (landreev)
* d6ce1d7  added checksum and contentType to AuxiliaryFile entity (ekraffmiller)
* 1e47316  Making sure there's no open inputstreams left. (#7275) (landreev)
* 1af5a24  Use DigestInputStream so that checksum can be calculated when stream (ekraffmiller)
* 8098f6f  Merge branch 'develop' into 7275-upload-auxliary-files (ekraffmiller)
* c1e03a0  Updated deposit API method to return JSon data for saved file (ekraffmiller)
* ee61a38  Merge branch '7275-upload-auxliary-files' of github.com:IQSS/datavers… (ekraffmiller)
* 11b9db6  removed unneeded comments (ekraffmiller)
* 867ad10  removed redundant initialization (ekraffmiller)
* 6e88b55  added documentation for Auxiliary File APIs (ekraffmiller)
* a2c3f0d  doc updates (djbrooke)
* 36b69df  removing : (djbrooke)
* 471dd25  switch from created to ok #7275 (pdurbin)
* 023c5b5  Merge branch 'develop' into 7275-upload-auxliary-files #7275 (pdurbin)
* 21977d0  make docs match OK/200 response from add #7275 (pdurbin)
* 103b591  adding release notes (djbrooke)
* 82c5f66  Update aux-file-support.rst (ekraffmiller)
Release note snippet (new file, 3 lines):

## Notes for Tool Developers and Integrators

Experimental endpoints have been added to allow auxiliary files to be added to datafiles. These auxiliary files can be deposited and accessed via API. Later releases will include options for accessing these files through the UI. For more information, see the Auxiliary File Support section of the [Developer Guide](https://guides.dataverse.org/en/5.3/developers/).
doc/sphinx-guides/source/developers/aux-file-support.rst (new file):

Auxiliary File Support
======================

Auxiliary file support is experimental. Auxiliary files in Dataverse are being added to support depositing and downloading differentially private metadata, as part of the OpenDP project (OpenDP.io). In future versions, this approach may become more broadly used and supported.

Adding an Auxiliary File to a Datafile
--------------------------------------

To add an auxiliary file, specify the primary key of the datafile (FILE_ID) and the formatTag and formatVersion (if applicable) associated with the auxiliary file. There are two form parameters: "origin" specifies the application or entity that created the auxiliary file, and "isPublic" controls access to downloading the file. If "isPublic" is true, any user can download the file; otherwise, access is authorized based on the access rules defined for the DataFile itself.

.. code-block:: bash

  export API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
  export FILENAME='auxfile.txt'
  export FILE_ID='12345'
  export FORMAT_TAG='dpJson'
  export FORMAT_VERSION='v1'
  export SERVER_URL=https://demo.dataverse.org

  curl -H "X-Dataverse-key:$API_TOKEN" -X POST -F "file=@$FILENAME" -F 'origin=myApp' -F 'isPublic=true' "$SERVER_URL/api/access/datafile/$FILE_ID/metadata/$FORMAT_TAG/$FORMAT_VERSION"

You should expect a 200 ("OK") response and JSON with information about your newly uploaded auxiliary file.

Downloading an Auxiliary File that belongs to a Datafile
--------------------------------------------------------

To download an auxiliary file, use the primary key of the datafile and the formatTag and formatVersion (if applicable) associated with the auxiliary file:

.. code-block:: bash

  export API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
  export SERVER_URL=https://demo.dataverse.org
  export FILE_ID='12345'
  export FORMAT_TAG='dpJson'
  export FORMAT_VERSION='v1'

  curl "$SERVER_URL/api/access/datafile/$FILE_ID/$FORMAT_TAG/$FORMAT_VERSION"
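Note that the deposit and download endpoints differ only by the ``/metadata/`` path segment. A minimal sketch, assuming the path layout documented above (the ``AuxUrls`` class and its method names are illustrative, not part of Dataverse):

```java
public class AuxUrls {

    // Builds the documented deposit (POST) URL for an auxiliary file.
    static String depositUrl(String server, long fileId, String tag, String version) {
        return String.format("%s/api/access/datafile/%d/metadata/%s/%s", server, fileId, tag, version);
    }

    // Builds the documented download (GET) URL; note there is no /metadata/ segment.
    static String downloadUrl(String server, long fileId, String tag, String version) {
        return String.format("%s/api/access/datafile/%d/%s/%s", server, fileId, tag, version);
    }

    public static void main(String[] args) {
        String server = "https://demo.dataverse.org";
        System.out.println(depositUrl(server, 12345, "dpJson", "v1"));
        // → https://demo.dataverse.org/api/access/datafile/12345/metadata/dpJson/v1
        System.out.println(downloadUrl(server, 12345, "dpJson", "v1"));
        // → https://demo.dataverse.org/api/access/datafile/12345/dpJson/v1
    }
}
```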
Developer Guide index, with ``aux-file-support`` added to the toctree:

   geospatial
   selinux
   big-data-support
   aux-file-support
   workflows
src/main/java/edu/harvard/iq/dataverse/AuxiliaryFile.java (120 additions, 0 deletions):
package edu.harvard.iq.dataverse;

import java.io.Serializable;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.JoinColumn;
import javax.persistence.ManyToOne;

/**
 * Represents a generic file that is associated with a dataFile.
 * This is a data representation of a physical file in StorageIO.
 *
 * @author ekraffmiller
 */
@Entity
public class AuxiliaryFile implements Serializable {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    /**
     * The data file that this AuxiliaryFile belongs to;
     * a data file may have many auxiliaryFiles.
     */
    @ManyToOne
    @JoinColumn(nullable = false)
    private DataFile dataFile;

    private String formatTag;

    private String formatVersion;

    private String origin;

    private boolean isPublic;

    private String contentType;

    private Long fileSize;

    private String checksum;

    public Long getId() {
        return id;
    }

    public void setId(Long id) {
        this.id = id;
    }

    public DataFile getDataFile() {
        return dataFile;
    }

    public void setDataFile(DataFile dataFile) {
        this.dataFile = dataFile;
    }

    public String getFormatTag() {
        return formatTag;
    }

    public void setFormatTag(String formatTag) {
        this.formatTag = formatTag;
    }

    public String getFormatVersion() {
        return formatVersion;
    }

    public void setFormatVersion(String formatVersion) {
        this.formatVersion = formatVersion;
    }

    public String getOrigin() {
        return origin;
    }

    public void setOrigin(String origin) {
        this.origin = origin;
    }

    public boolean getIsPublic() {
        return isPublic;
    }

    public void setIsPublic(boolean isPublic) {
        this.isPublic = isPublic;
    }

    public String getContentType() {
        return this.contentType;
    }

    public void setContentType(String contentType) {
        this.contentType = contentType;
    }

    public Long getFileSize() {
        return fileSize;
    }

    public void setFileSize(long fileSize) {
        this.fileSize = fileSize;
    }

    public String getChecksum() {
        return checksum;
    }

    public void setChecksum(String checksum) {
        this.checksum = checksum;
    }

}
src/main/java/edu/harvard/iq/dataverse/AuxiliaryFileServiceBean.java (117 additions, 0 deletions):
package edu.harvard.iq.dataverse;

import edu.harvard.iq.dataverse.dataaccess.StorageIO;
import edu.harvard.iq.dataverse.util.FileUtil;
import edu.harvard.iq.dataverse.util.SystemConfig;
import java.io.IOException;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.util.logging.Logger;
import javax.ejb.EJB;
import javax.ejb.Stateless;
import javax.inject.Named;
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;
import javax.persistence.Query;
import org.apache.tika.Tika;

/**
 * Methods related to the AuxiliaryFile entity.
 *
 * @author ekraffmiller
 */
@Stateless
@Named
public class AuxiliaryFileServiceBean implements java.io.Serializable {

    private static final Logger logger = Logger.getLogger(AuxiliaryFileServiceBean.class.getCanonicalName());

    @PersistenceContext(unitName = "VDCNet-ejbPU")
    private EntityManager em;

    @EJB
    private SystemConfig systemConfig;

    public AuxiliaryFile find(Object pk) {
        return em.find(AuxiliaryFile.class, pk);
    }

    public AuxiliaryFile save(AuxiliaryFile auxiliaryFile) {
        return em.merge(auxiliaryFile);
    }

    /**
     * Save the physical file to storageIO, and save the AuxiliaryFile entity
     * to the database. This should be an all-or-nothing transaction: if either
     * step fails, then nothing will be saved.
     *
     * @param fileInputStream auxiliary file data to be saved
     * @param dataFile the DataFile entity this will be added to
     * @param formatTag type of file being saved
     * @param formatVersion to distinguish between multiple versions of a file
     * @param origin name of the tool/system that created the file
     * @param isPublic is this file available to any user?
     * @return the saved AuxiliaryFile, or null if the save failed
     */
    public AuxiliaryFile processAuxiliaryFile(InputStream fileInputStream, DataFile dataFile, String formatTag, String formatVersion, String origin, boolean isPublic) {

        StorageIO<DataFile> storageIO = null;
        AuxiliaryFile auxFile = new AuxiliaryFile();
        String auxExtension = formatTag + "_" + formatVersion;
        try {
            // Save to storage first.
            // If that is successful (does not throw an exception),
            // then save to the db.
            // If the db save fails for any reason, roll back
            // by removing the aux file from storage.
            storageIO = dataFile.getStorageIO();
            MessageDigest md = MessageDigest.getInstance(systemConfig.getFileFixityChecksumAlgorithm().toString());
            DigestInputStream di = new DigestInputStream(fileInputStream, md);

            // Save the wrapping DigestInputStream, not the raw stream, so the
            // digest is updated as the bytes are read and written to storage.
            storageIO.saveInputStreamAsAux(di, auxExtension);
            auxFile.setChecksum(FileUtil.checksumDigestToString(di.getMessageDigest().digest()));

            Tika tika = new Tika();
            auxFile.setContentType(tika.detect(storageIO.getAuxFileAsInputStream(auxExtension)));
            auxFile.setFormatTag(formatTag);
            auxFile.setFormatVersion(formatVersion);
            auxFile.setOrigin(origin);
            auxFile.setIsPublic(isPublic);
            auxFile.setDataFile(dataFile);
            auxFile.setFileSize(storageIO.getAuxObjectSize(auxExtension));
            auxFile = save(auxFile);
        } catch (IOException ioex) {
            logger.info("IO Exception trying to save auxiliary file: " + ioex.getMessage());
            return null;
        } catch (Exception e) {
            // If anything fails during the database insert, remove the file from storage
            try {
                storageIO.deleteAuxObject(auxExtension);
            } catch (IOException ioex) {
                logger.info("IO Exception trying to remove auxiliary file in exception handler: " + ioex.getMessage());
            }
            // The db insert failed, so do not return a partially saved entity
            return null;
        }
        return auxFile;
    }

    public AuxiliaryFile lookupAuxiliaryFile(DataFile dataFile, String formatTag, String formatVersion) {

        Query query = em.createQuery("select object(o) from AuxiliaryFile as o where o.dataFile.id = :dataFileId and o.formatTag = :formatTag and o.formatVersion = :formatVersion");

        query.setParameter("dataFileId", dataFile.getId());
        query.setParameter("formatTag", formatTag);
        query.setParameter("formatVersion", formatVersion);
        try {
            return (AuxiliaryFile) query.getSingleResult();
        } catch (Exception ex) {
            return null;
        }
    }

}
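The checksum logic above hinges on a subtlety of DigestInputStream: the digest is updated only by bytes actually read through the wrapper, so the wrapped stream must be the one that is consumed. A self-contained sketch (MD5 is chosen here for brevity; Dataverse uses its configured fixity algorithm):

```java
import java.io.ByteArrayInputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;

public class DigestDemo {

    // Consumes the data through a DigestInputStream and returns the hex digest.
    static String md5Hex(byte[] data) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        DigestInputStream di = new DigestInputStream(new ByteArrayInputStream(data), md);
        byte[] buf = new byte[1024];
        while (di.read(buf) != -1) {
            // each read feeds the digest; skipping this loop would yield
            // the digest of empty input
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(md5Hex("hello".getBytes("UTF-8")));
        // → 5d41402abc4bb2a04e980e428b21d678
    }
}
```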
Review discussion:

So far, Tika is only used for full-text indexing, and content-type detection is done differently for DataFiles. Could/should these both be done the same way? (Not necessarily in this PR.)

For DataFiles, JHove is used, which works on a file as opposed to an InputStream. So to use JHove, I think I would have to save the aux file to a temp location, unless I'm missing something. Is this something we want to do?

FWIW: Requiring a temp file for mime detection is a problem for DataFiles too, so it may make sense to use Tika instead of JHOVE there as well. (#6937 would make it possible to get ranges of bytes from S3, to help avoid having to fetch a whole file.)

OK, should we create a separate ticket for changing mime detection to Tika for DataFiles? (Is this something that is usually talked about at dv_tech hour?)

The method that's used for DataFiles is determineContentType in FileUtil. It would probably be good to use the same code path, instead of using Tika. Or switch it all to Tika, I guess, but that seems daunting to me. Or I guess we could use Tika here for now and work on consistency later. And sure, this is a fine topic for tech hours. 😄

Discussed this during tech hours; we agreed that it would be good to eventually switch from JHove to Tika in other parts of the code for consistency.

I'm about to approve the PR. There's just one thing I regret not having thought about earlier, when we were having this discussion above: WHY are we trying to detect mime types at all, as opposed to just making the uploading client supply the type as another parameter?

We try to detect mime types on "normal" uploaded files, but that's because users upload arbitrary files. This API is for something that's structured, at least in the immediate use case scenario: the preprocessed summary stats fragment will always be JSON, the differentially private DDI will always be XML, etc. We don't expect any variety there.

I don't see this as a problem; having this detection code in place may come in handy for other cases. And even if we decide that these aux uploads should, or can, supply the mime type as a parameter, we could use this detection as an extra validation step.

Still, I can't help but feel I should have said "let's not even worry about it, let's just add a parameter instead" early on.
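As an aside to the JHove-vs-Tika discussion above, the JDK itself ships a small stream-based sniffer, java.net.URLConnection.guessContentTypeFromStream, which only needs a mark/reset-capable InputStream and no temp file. It recognizes far fewer types than Tika and is not what Dataverse uses; this is just a sketch of stream-based detection for a case like the XML DDI mentioned above:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.net.URLConnection;

public class SniffDemo {

    // Guesses a content type from the stream's leading bytes.
    // The stream must support mark/reset, hence the BufferedInputStream wrapper.
    static String sniff(byte[] bytes) throws Exception {
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(bytes));
        return URLConnection.guessContentTypeFromStream(in);
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<?xml version=\"1.0\"?><codeBook/>".getBytes("UTF-8");
        System.out.println(sniff(xml)); // an "<?xml " prefix is reported as application/xml
    }
}
```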