diff --git a/doc/release-notes/6505-zipdownload-service.md b/doc/release-notes/6505-zipdownload-service.md
new file mode 100644
index 00000000000..a3a83b9e31a
--- /dev/null
+++ b/doc/release-notes/6505-zipdownload-service.md
@@ -0,0 +1,30 @@
+### A multi-file, zipped download optimization
+
+In this release we are offering an experimental optimization for the
+multi-file, download-as-zip functionality. If this option is enabled,
+instead of enforcing size limits we attempt to serve all the files
+that the user requested (and is authorized to download), but the
+request is redirected to a standalone zipper service running as a
+CGI executable. This moves these potentially long-running jobs
+completely outside the Application Server (Payara) and prevents
+service threads from becoming locked serving them. Since zipping is
+also a CPU-intensive task, it is possible to run this service on a
+different host system, freeing up cycles on the main Application
+Server. (The system running the service needs access to the database,
+as well as to the storage filesystem and/or S3 bucket.)
+
+Please consult scripts/zipdownload/README.md in the Dataverse 5
+source tree.
+
+The components of the standalone "zipper tool" can also be downloaded
+here:
+(my plan is to build the executable and to add it to the v5
+release files on GitHub: - L.A.)
+https://github.com/IQSS/dataverse/releases/download/v5.0/zipper.zip.
+
+## New JVM Options and DB Options
+
+### New DB Option CustomZipDownloadServiceUrl
+
+If defined, this is the URL of the zipping service outside the main
+Application Server to which zip downloads are directed, instead of
+/api/access/datafiles/.
diff --git a/doc/sphinx-guides/source/installation/advanced.rst b/doc/sphinx-guides/source/installation/advanced.rst
index 9bcba1a9fac..97d353431c2 100644
--- a/doc/sphinx-guides/source/installation/advanced.rst
+++ b/doc/sphinx-guides/source/installation/advanced.rst
@@ -35,3 +35,51 @@ If you have successfully installed multiple app servers behind a load balancer y
 You would repeat the steps above for all of your app servers. If users seem to be having a problem with a particular server, you can ask them to visit https://dataverse.example.edu/host.txt and let you know what they see there (e.g. "server1.example.edu") to help you know which server to troubleshoot.
 
 Please note that :ref:`network-ports` under the Configuration section has more information on fronting your app server with Apache. The :doc:`shibboleth` section talks about the use of ``ProxyPassMatch``.
+
+Optional Components
+-------------------
+
+Standalone "Zipper" Service Tool
+++++++++++++++++++++++++++++++++
+
+As of Dataverse v5.0 we offer an experimental optimization for the
+multi-file, download-as-zip functionality. If this option
+(``:CustomZipDownloadServiceUrl``) is enabled, instead of enforcing
+the size limit on multi-file zipped downloads (as normally specified
+by the option ``:ZipDownloadLimit``), we attempt to serve all the
+files that the user requested (and is authorized to download), but
+the request is redirected to a standalone zipper service running as
+a cgi-bin executable under Apache. This moves these potentially
+long-running jobs completely outside the Application Server (Payara)
+and prevents worker threads from becoming locked serving them. Since
+zipping is also a CPU-intensive task, it is possible to run this
+service on a different host system, freeing up cycles on the main
+Application Server.
+(The system running the service needs access to the database, as
+well as to the storage filesystem and/or S3 bucket.)
+
+Please consult ``scripts/zipdownload/README.md`` in the Dataverse 5
+source tree for more information.
+
+To install: follow the instructions in the README above to build
+``ZipDownloadService-v1.0.0.jar``. It will also be available,
+pre-built, as part of the Dataverse release on GitHub. Copy it,
+together with the shell script ``scripts/zipdownload/cgi-bin/zipdownload``,
+to the cgi-bin directory of the chosen Apache server
+(``/var/www/cgi-bin`` is the standard location).
+
+Make sure the shell script (zipdownload) is executable, and edit it
+to configure the database access credentials. Do note that the
+executable does not need access to the entire Dataverse database. A
+security-conscious admin can create a dedicated database user with
+access to just one table: ``CUSTOMZIPSERVICEREQUEST``.
+
+You may need to make extra Apache configuration changes to make sure /cgi-bin/zipdownload is accessible from the outside.
+For example, if this is the same Apache that's in front of your Dataverse Payara instance, you will need to add another pass-through statement to your configuration:
+
+``ProxyPassMatch ^/cgi-bin/zipdownload !``
+
+Test this by accessing it directly at ``/cgi-bin/zipdownload``. You should get a ``404 No such download job!``. If instead you are getting an "internal server error", this may be an SELinux issue; try ``setenforce Permissive``. If you are getting a generic Dataverse "not found" page, review the ``ProxyPassMatch`` rule you have added.
+
+To activate in Dataverse::
+
+   curl -X PUT -d '/cgi-bin/zipdownload' http://localhost:8080/api/admin/settings/:CustomZipDownloadServiceUrl
diff --git a/doc/sphinx-guides/source/installation/config.rst b/doc/sphinx-guides/source/installation/config.rst
index 25bcea41532..b02e0ea6224 100644
--- a/doc/sphinx-guides/source/installation/config.rst
+++ b/doc/sphinx-guides/source/installation/config.rst
@@ -2134,3 +2134,16 @@ Unlike other facets, those indexed by Date/Year are sorted chronologically by de
 If you don’t want date facets to be sorted chronologically, set:
 
 ``curl -X PUT -d 'false' http://localhost:8080/api/admin/settings/:ChronologicalDateFacets``
+
+:CustomZipDownloadServiceUrl
+++++++++++++++++++++++++++++
+
+The location of the "Standalone Zipper" service. If this option is specified, Dataverse will be redirecting bulk/multi-file zip download requests to that location, instead of serving them internally. See the "Advanced" section of the Installation Guide for information on how to install the external zipper. (This is still an experimental feature, as of v5.0.)
+
+To enable redirects to the zipper installed on the same server as the main Dataverse application:
+
+``curl -X PUT -d '/cgi-bin/zipdownload' http://localhost:8080/api/admin/settings/:CustomZipDownloadServiceUrl``
+
+To enable redirects to the zipper on a different server:
+
+``curl -X PUT -d 'https://zipper.example.edu/cgi-bin/zipdownload' http://localhost:8080/api/admin/settings/:CustomZipDownloadServiceUrl``
diff --git a/scripts/zipdownload/README.md b/scripts/zipdownload/README.md
new file mode 100644
index 00000000000..74431094782
--- /dev/null
+++ b/scripts/zipdownload/README.md
@@ -0,0 +1,104 @@
+Work in progress!
+
+To build:
+
+cd scripts/zipdownload
+mvn clean compile assembly:single
+
+To install:
+
+Install cgi-bin/zipdownload and ZipDownloadService-v1.0.0.jar in your
+cgi-bin directory (/var/www/cgi-bin is the standard location).
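+
+For example, once built (a sketch; adjust the paths to your Apache layout):
+
+cp target/ZipDownloadService-v1.0.0.jar /var/www/cgi-bin/
+cp cgi-bin/zipdownload /var/www/cgi-bin/
+chmod +x /var/www/cgi-bin/zipdownload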
+
+Edit the config lines in the shell script (zipdownload) as needed.
+
+You may need to make extra Apache configuration changes to make sure /cgi-bin/zipdownload is accessible from the outside.
+For example, if this is the same Apache that's in front of your Dataverse Payara instance, you'll need to add another pass-through statement to your configuration:
+
+``ProxyPassMatch ^/cgi-bin/zipdownload !``
+
+(See the "Advanced" section of the Installation Guide for some extra troubleshooting tips.)
+
+To activate in Dataverse:
+
+curl -X PUT -d '/cgi-bin/zipdownload' http://localhost:8080/api/admin/settings/:CustomZipDownloadServiceUrl
+
+How it works:
+=============
+
+(This is an ongoing design discussion - other developers are welcome to contribute)
+
+The goal: to move this potentially long-running task out of the
+Application Server. This is the sole focus of this implementation. It
+does not attempt to make it faster.
+
+The rationale here is that a zipped download of a large enough number
+of large enough files will always be slow. Zipping (compressing)
+itself is a fairly CPU-intensive task, and will most frequently be
+the bottleneck of the service - although with a slow storage location
+(S3 or Swift, with a slow link to the share) it may instead be the
+speed at which the application accesses the raw bytes. The exact
+location of the bottleneck is in a sense irrelevant. On a very fast
+system, with the files stored on a very fast local RAID, the
+bottleneck for most users will likely shift to the speed of their
+internet connection to the server. The bottom line is, downloading
+this multi-file compressed stream will take a long time no matter how
+you slice it. So this hack addresses it by moving the task outside
+Payara, where it's not going to hog any threads.
+
+A quick, somewhat unrelated note: attempting to download a multi-GB
+stream over HTTP will always have its own inherent risks. If the
+download has to take hours or days to complete, it is very likely that
+it'll break down somewhere in the middle. Do note that for a zipped
+download our users will not be able to utilize `wget --continue`, or
+any similar "resume" functionality - because it's impossible to resume
+generating a zipped stream from a certain offset.
+
+The implementation is a hack. It relies on direct access to everything - storage locations (filesystem or S3) and the database.
+
+There are no network calls between the application (Dataverse) and the
+zipper (an implementation relying on such a call was discussed early
+on). Dataverse issues a "job key" and sends the user's browser to the
+zipper (to, for example, /cgi-bin/zipdownload?<jobkey> instead of
+/api/access/datafiles/<ids>). To authorize the zip download for the
+"job key", and to inform the zipper on which files to zip and where
+to find them, the application relies on a database table that the
+zipper also has access to. In other words, there is saved state
+information associated with each zipped download request. The zipper
+may be given limited database access - for example, via a user
+authorized to access that one table only. After serving the files, the
+zipper removes the database entries. Job records in the database have
+time stamps, so, as an added level of cleanup for any records that got
+stuck in the db because the corresponding zipper jobs never completed,
+the application automatically deletes any records older than 5 minutes
+(this can be further reduced) every time the service adds new records.
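+
+For illustration, the zipper's entire database interaction boils down
+to something like the following (a sketch, with a made-up job key; the
+real queries live in DatabaseAccessUtil.java):
+
+-- look up the files for the job key the browser was redirected with:
+SELECT * FROM CustomZipServiceRequest WHERE key='f3e-0123456789ab';
+-- ...serve the files... then remove the job records:
+DELETE FROM CustomZipServiceRequest WHERE key='f3e-0123456789ab';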
+A paranoid admin may choose to give the zipper read-only access to
+the database, and rely on the cleanup happening solely on the
+application side.
+
+I have explored ways to avoid maintaining this state information. A
+potential implementation we discussed early on, where the application
+would make a network call to the zipper before redirecting the user
+there, would NOT solve that problem - the state would need to somehow
+be maintained on the zipper side. The only truly stateless
+implementation would rely on including all the file information WITH
+the redirect itself, with some pre-signed URL mechanism to make it
+secure. Mechanisms for pre-signing requests are readily available and
+simple to implement. We could go with something similar to how S3
+pre-signs its access URLs. Jim Myers has already specced out how this
+could be done for Dataverse access urls in a design document
+(https://docs.google.com/document/d/1J8GW6zi-vSRKZdtFjLpmYJ2SUIcIkAEwHkP4q1fxL-s/edit#). (Basically,
+you hash the combination of your request parameters, the issue timestamp
+AND some "secret" - like the user's API key - and send the resulting
+hash along with the request. Tampering with any of the parameters, or
+trying to extend the life span of the request, becomes impossible,
+because it would invalidate the hash.) What stopped me from trying
+something like that was the sheer amount of information that would need
+to be included with a request, for a potentially long list of files
+that need to be zipped. When serving a zipped download from a page,
+that would be doable - we could JavaScript together a POST call that
+the browser could make to send all that info to the zipper. But if we
+want to implement something similar in the API, I felt like I really
+wanted to be able to simply issue a quick redirect to a manageable
+url - which with the implementation above is simply
+/cgi-bin/zipdownload?<jobkey>, with the <jobkey> being just a 16
+character hex string in the current implementation.
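+
+A minimal sketch of what such request signing could look like (all of
+the names and parameters below are hypothetical - this is NOT part of
+the current implementation):
+
+```java
+import javax.crypto.Mac;
+import javax.crypto.spec.SecretKeySpec;
+import java.nio.charset.StandardCharsets;
+
+public class PresignSketch {
+    // Signs "fileIds|timestamp" with a per-user secret; the zipper would
+    // recompute the HMAC and reject any tampered or expired request.
+    static String sign(String fileIds, long issued, String secret) throws Exception {
+        Mac mac = Mac.getInstance("HmacSHA256");
+        mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
+        byte[] digest = mac.doFinal((fileIds + "|" + issued).getBytes(StandardCharsets.UTF_8));
+        StringBuilder hex = new StringBuilder();
+        for (byte b : digest) {
+            hex.append(String.format("%02x", b));
+        }
+        return hex.toString();
+    }
+
+    public static void main(String[] args) throws Exception {
+        long now = System.currentTimeMillis();
+        String sig = sign("101,102,103", now, "users-api-token");
+        System.out.println("/cgi-bin/zipdownload?files=101,102,103&ts=" + now + "&sig=" + sig);
+    }
+}
+```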
diff --git a/scripts/zipdownload/cgi-bin/zipdownload b/scripts/zipdownload/cgi-bin/zipdownload
new file mode 100644
index 00000000000..c2a9db8ca9b
--- /dev/null
+++ b/scripts/zipdownload/cgi-bin/zipdownload
@@ -0,0 +1,11 @@
+#!/bin/sh
+
+CLASSPATH=/var/www/cgi-bin; export CLASSPATH
+
+PGHOST="localhost"; export PGHOST
+PGPORT=5432; export PGPORT
+PGUSER="dvnapp"; export PGUSER
+PGDB="dvndb"; export PGDB
+PGPW="xxxxx"; export PGPW
+
+java -Ddb.serverName=$PGHOST -Ddb.portNumber=$PGPORT -Ddb.user=$PGUSER -Ddb.databaseName=$PGDB -Ddb.password=$PGPW -jar ZipDownloadService-v1.0.0.jar
\ No newline at end of file
diff --git a/scripts/zipdownload/pom.xml b/scripts/zipdownload/pom.xml
new file mode 100644
index 00000000000..256062a46e2
--- /dev/null
+++ b/scripts/zipdownload/pom.xml
@@ -0,0 +1,86 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+    <modelVersion>4.0.0</modelVersion>
+    <groupId>ZipDownloadService</groupId>
+    <artifactId>ZipDownloadService</artifactId>
+    <version>1.0.0</version>
+    <properties>
+        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
+    </properties>
+    <repositories>
+        <repository>
+            <id>central</id>
+            <name>Central Repository</name>
+            <url>https://repo.maven.apache.org/maven2</url>
+            <layout>default</layout>
+            <snapshots>
+                <enabled>false</enabled>
+            </snapshots>
+            <releases>
+                <updatePolicy>never</updatePolicy>
+            </releases>
+        </repository>
+    </repositories>
+    <pluginRepositories>
+        <pluginRepository>
+            <id>central-repo</id>
+            <name>Central Repository</name>
+            <url>https://repo1.maven.org/maven2</url>
+            <layout>default</layout>
+        </pluginRepository>
+    </pluginRepositories>
+    <dependencyManagement>
+        <dependencies>
+            <dependency>
+                <groupId>com.amazonaws</groupId>
+                <artifactId>aws-java-sdk-bom</artifactId>
+                <version>1.11.790</version>
+                <type>pom</type>
+                <scope>import</scope>
+            </dependency>
+        </dependencies>
+    </dependencyManagement>
+    <dependencies>
+        <dependency>
+            <groupId>org.postgresql</groupId>
+            <artifactId>postgresql</artifactId>
+            <version>42.2.2</version>
+        </dependency>
+        <dependency>
+            <groupId>com.amazonaws</groupId>
+            <artifactId>aws-java-sdk-s3</artifactId>
+        </dependency>
+    </dependencies>
+    <build>
+        <sourceDirectory>src/main/java</sourceDirectory>
+        <plugins>
+            <plugin>
+                <artifactId>maven-compiler-plugin</artifactId>
+                <version>3.1</version>
+                <configuration>
+                    <source>1.8</source>
+                    <target>1.8</target>
+                </configuration>
+            </plugin>
+            <plugin>
+                <groupId>org.apache.maven.plugins</groupId>
+                <artifactId>maven-assembly-plugin</artifactId>
+                <version>2.4</version>
+                <configuration>
+                    <archive>
+                        <manifest>
+                            <mainClass>edu.harvard.iq.dataverse.custom.service.download.ZipDownloadService</mainClass>
+                        </manifest>
+                    </archive>
+                    <descriptorRefs>
+                        <descriptorRef>jar-with-dependencies</descriptorRef>
+                    </descriptorRefs>
+                    <finalName>${project.artifactId}-v${project.version}</finalName>
+                    <appendAssemblyId>false</appendAssemblyId>
+                </configuration>
+            </plugin>
+        </plugins>
+    </build>
+</project>
diff --git a/scripts/zipdownload/src/main/java/edu/harvard/iq/dataverse/custom/service/download/ChunkingOutputStream.java b/scripts/zipdownload/src/main/java/edu/harvard/iq/dataverse/custom/service/download/ChunkingOutputStream.java
new file mode 100644
index 00000000000..9ef0279b52d
--- /dev/null
+++ b/scripts/zipdownload/src/main/java/edu/harvard/iq/dataverse/custom/service/download/ChunkingOutputStream.java
@@ -0,0 +1,108 @@
+/*
+   Copyright (C) 2005-2012, by the President and Fellows of Harvard College.
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+
+   Dataverse Network - A web application to share, preserve and analyze research data.
+   Developed at the Institute for Quantitative Social Science, Harvard University.
+   Version 3.0.
+*/
+package edu.harvard.iq.dataverse.custom.service.download;
+
+import java.io.FilterOutputStream;
+import java.io.IOException;
+import java.io.OutputStream;
+
+/**
+ * Simple implementation of HTTP chunked transfer encoding
+ *
+ * @author Leonid Andreev
+ */
+public class ChunkingOutputStream extends FilterOutputStream {
+    private static final int BUFFER_SIZE = 8192;
+    private static final byte[] CHUNK_CLOSE = "\r\n".getBytes();
+    private static final String CHUNK_SIZE_FORMAT = "%x\r\n";
+
+    private byte[] buffer = new byte[BUFFER_SIZE];
+    private int position = 0;
+
+    public ChunkingOutputStream(OutputStream out) {
+        super(out);
+    }
+
+    @Override
+    public void write(byte[] data) throws IOException {
+        this.write(data, 0, data.length);
+    }
+
+    @Override
+    public void write(byte[] data, int offset, int length) throws IOException {
+
+        // is this going to fill the buffer?
+        if (position + length > BUFFER_SIZE) {
+            System.arraycopy(data, offset, buffer, position, BUFFER_SIZE - position);
+            offset += (BUFFER_SIZE - position);
+            length -= (BUFFER_SIZE - position);
+            dumpChunk(buffer, 0, BUFFER_SIZE);
+            position = 0;
+        }
+
+        // are there still multiple buffer-worths of bytes?
+        while (length > BUFFER_SIZE) {
+            dumpChunk(data, offset, BUFFER_SIZE);
+            offset += BUFFER_SIZE;
+            length -= BUFFER_SIZE;
+        }
+
+        // finally, buffer the leftover bytes:
+        System.arraycopy(data, offset, buffer, position, length);
+        position += length;
+
+    }
+
+    @Override
+    public void write(int i) throws IOException {
+        // Note: ZipOutputStream does write single bytes into the stream -
+        // *a lot* - at the beginning of the archive, and when it closes it;
+        // so these need to be buffered and chunked properly as well.
+
+        if (position == BUFFER_SIZE) {
+            dumpChunk(buffer, 0, position);
+            position = 0;
+        }
+        buffer[position++] = (byte)i;
+    }
+
+    @Override
+    public void close() throws IOException {
+        if (position > 0) {
+            dumpChunk(buffer, 0, position);
+        }
+
+        // ... and the final, "zero chunk":
+        super.out.write('0');
+        super.out.write(CHUNK_CLOSE);
+        super.out.write(CHUNK_CLOSE);
+
+        super.out.close();
+    }
+
+
+    private void dumpChunk(byte[] data, int offset, int length) throws IOException {
+        String chunkSizeLine = String.format(CHUNK_SIZE_FORMAT, length);
+        super.out.write(chunkSizeLine.getBytes());
+        super.out.write(data, offset, length);
+        // don't forget to close the chunk(!):
+        super.out.write(CHUNK_CLOSE);
+    }
+}
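+// Example of the wire format this produces (a sketch): writing the 5 bytes
+// "hello" as a single chunk puts "5\r\nhello\r\n" on the stream, and close()
+// appends the terminating zero chunk: "0\r\n\r\n".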
diff --git a/scripts/zipdownload/src/main/java/edu/harvard/iq/dataverse/custom/service/download/ZipDownloadService.java b/scripts/zipdownload/src/main/java/edu/harvard/iq/dataverse/custom/service/download/ZipDownloadService.java
new file mode 100644
index 00000000000..cda7cbb9505
--- /dev/null
+++ b/scripts/zipdownload/src/main/java/edu/harvard/iq/dataverse/custom/service/download/ZipDownloadService.java
@@ -0,0 +1,240 @@
+/*
+   Copyright (C) 2005-2012, by the President and Fellows of Harvard College.
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+
+   Dataverse Network - A web application to share, preserve and analyze research data.
+   Developed at the Institute for Quantitative Social Science, Harvard University.
+   Version 3.0.
+*/
+package edu.harvard.iq.dataverse.custom.service.download;
+
+import edu.harvard.iq.dataverse.custom.service.util.DirectAccessUtil;
+import static edu.harvard.iq.dataverse.custom.service.util.DatabaseAccessUtil.lookupZipJob;
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+import java.util.zip.ZipEntry;
+import java.util.zip.ZipOutputStream;
+
+/**
+ * Custom (standalone) download service for Dataverse
+ *
+ * @author Leonid Andreev
+ */
+public class ZipDownloadService {
+
+    private static String jobKey = null;
+    private List<String[]> jobFiles = null;
+    private boolean zipOnly = false;
+
+    private DirectAccessUtil directAccessUtil = null;
+    private ZipOutputStream zipOutputStream = null;
+
+    public static void main(String args[]) throws Exception {
+
+        ZipDownloadService zipperService = new ZipDownloadService();
+
+        if (!zipperService.parseArgs(args)) {
+            zipperService.usage();
+            return;
+        }
+
+        zipperService.parseCgiQueryParameters();
+
+        zipperService.execute(jobKey);
+    }
+
+    private static void usage() {
+        System.out.println("\nUsage:");
+        System.out.println("   java -jar ZipDownloadService-v1.0.0.jar [-ziponly]\n");
+
+        System.out.println("   supported options:");
+        System.out.println("   -ziponly = output zip only, no http header/no chunking");
+        System.out.println("");
+
+    }
+
+    // The only option supported at the moment is "zip only" - output just the
+    // compressed stream, skip the HTTP header and chunking.
+    public boolean parseArgs(String[] args) {
+
+        if (args == null || args.length == 0) {
+            return true;
+        } else if (args.length == 1) {
+            if (args[0].equals("-ziponly")) {
+                this.zipOnly = true;
+                return true;
+            }
+        }
+
+        return false;
+    }
+
+    // Does not support any parameters, except the job-identifying token key,
+    // supplied as the entire query string.
+    public void parseCgiQueryParameters() {
+        String queryString = System.getenv().get("QUERY_STRING");
+        if (queryString != null) {
+            jobKey = queryString;
+        }
+    }
+
+    public void print404() {
+        System.out.println("Status: 404 Not Found\r");
+        System.out.println("Content-Type: text/html\r");
+        System.out.println("\r");
+        System.out.println("<h1>404 No such download job!</h1>");
+    }
+
+    public void printZipHeader() {
+        System.out.println("Content-disposition: attachment; filename=\"dataverse_files.zip\"\r");
+        System.out.println("Content-Type: application/zip; name=\"dataverse_files.zip\"\r");
+        System.out.println("Transfer-Encoding: chunked\r");
+        System.out.println("\r");
+        System.out.flush();
+    }
+
+    public void execute(String key) {
+
+        jobFiles = lookupZipJob(key);
+
+        if (jobFiles == null || jobFiles.size() == 0) {
+            this.print404();
+            System.exit(0);
+        }
+
+        this.processFiles();
+    }
+
+    public void processFiles() {
+
+        if (!this.zipOnly) {
+            this.printZipHeader();
+        }
+
+        Set<String> zippedFolders = new HashSet<>();
+
+        for (String[] fileEntry : jobFiles) {
+            String storageLocation = fileEntry[0];
+            String fileName = fileEntry[1];
+
+            //System.out.println(storageLocation + ":" + fileName);
+
+            if (this.zipOutputStream == null) {
+                openZipStream();
+            }
+
+            if (this.directAccessUtil == null) {
+                this.directAccessUtil = new DirectAccessUtil();
+            }
+
+            InputStream inputStream = this.directAccessUtil.openDirectAccess(storageLocation);
+
+            // (potential?) TODO: String zipEntryName = checkZipEntryName(fileName);
+            // this may not be needed anymore - some extra sanitizing of the file
+            // name we used to have to do - since all the values in a current Dataverse
+            // database may already be sanitized enough.
+            if (inputStream != null && this.zipOutputStream != null) {
+
+                ZipEntry entry = new ZipEntry(fileName);
+
+                byte[] bytes = new byte[2 * 8192];
+                int read = 0;
+                long readSize = 0L;
+
+                try {
+                    // Does this file have a folder name?
+                    if (hasFolder(fileName)) {
+                        addFolderToZipStream(getFolderName(fileName), zippedFolders);
+                    }
+
+                    this.zipOutputStream.putNextEntry(entry);
+
+                    while ((read = inputStream.read(bytes)) != -1) {
+                        this.zipOutputStream.write(bytes, 0, read);
+                        readSize += read;
+                    }
+                    inputStream.close();
+                    this.zipOutputStream.closeEntry();
+
+                    /*if (fileSize == readSize) {
+                        //System.out.println("Read "+readSize+" bytes;");
+                    } else {
+                        throw new IOException("Byte size mismatch: expected " + fileSize + ", read: " + readSize);
+                    }*/
+                } catch (IOException ioex) {
+                    System.err.println("Failed to compress " + storageLocation);
+                }
+            } else {
+                System.err.println("Failed to access " + storageLocation);
+            }
+
+        }
+        try {
+            this.zipOutputStream.flush();
+            this.zipOutputStream.close();
+
+            System.out.flush();
+            System.out.close();
+        } catch (Exception e) {
+        }
+    }
+
+    public void openZipStream() {
+        if (this.zipOutputStream == null) {
+            if (this.zipOnly) {
+                this.zipOutputStream = new ZipOutputStream(System.out);
+            } else {
+                this.zipOutputStream = new ZipOutputStream(new ChunkingOutputStream(System.out));
+            }
+        }
+    }
+
+    private boolean hasFolder(String fileName) {
+        if (fileName == null) {
+            return false;
+        }
+        return fileName.indexOf('/') >= 0;
+    }
+
+    private String getFolderName(String fileName) {
+        if (fileName == null) {
+            return "";
+        }
+        String folderName = fileName.substring(0, fileName.lastIndexOf('/'));
+        // If any of the saved folder names start with slashes,
+        // we want to remove them:
+        // (i.e., ///foo/bar will become foo/bar)
+        while (folderName.startsWith("/")) {
+            folderName = folderName.substring(1);
+        }
+        return folderName;
+    }
+
+    private void addFolderToZipStream(String folderName, Set<String> zippedFolders) throws IOException {
+        // We don't want to create folders in the output Zip file that have
+        // already been added:
+        if (!"".equals(folderName)) {
+            if (!zippedFolders.contains(folderName)) {
+                ZipEntry d = new ZipEntry(folderName + "/");
+                zipOutputStream.putNextEntry(d);
+                zipOutputStream.closeEntry();
+                zippedFolders.add(folderName);
+            }
+        }
+    }
+}
diff --git a/scripts/zipdownload/src/main/java/edu/harvard/iq/dataverse/custom/service/util/DatabaseAccessUtil.java b/scripts/zipdownload/src/main/java/edu/harvard/iq/dataverse/custom/service/util/DatabaseAccessUtil.java
new file mode 100644
index 00000000000..8f9c34fe0a1
--- /dev/null
+++ b/scripts/zipdownload/src/main/java/edu/harvard/iq/dataverse/custom/service/util/DatabaseAccessUtil.java
@@ -0,0 +1,164 @@
+/*
+   Copyright (C) 2005-2012, by the President and Fellows of Harvard College.
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+
+   Dataverse Network - A web application to share, preserve and analyze research data.
+   Developed at the Institute for Quantitative Social Science, Harvard University.
+   Version 3.0.
+*/
+
+package edu.harvard.iq.dataverse.custom.service.util;
+
+import java.sql.Connection;
+import java.sql.DriverManager;
+import java.sql.PreparedStatement;
+import java.sql.ResultSet;
+import java.sql.Statement;
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * Utility methods for directly accessing the Dataverse database to extract
+ * the file locations and parameters for the zipping jobs.
+ *
+ * @author Leonid Andreev
+ */
+public class DatabaseAccessUtil implements java.io.Serializable {
+
+    // The zipper needs to make one database call to initiate each job.
+    // So the database connection can be closed immediately.
+
+    private static final int JOB_TOKEN_LENGTH = 16;
+    // A legitimate token is 16 characters long, and is made up of
+    // hex digits and one dash. THERE ARE prettier ways to spell out
+    // this regular expression - I just wanted it to be clear what it does:
+    private static final String JOB_TOKEN_REGEX = "^[0-9a-f][0-9a-f]*\\-[0-9a-f][0-9a-f]*$";
+    private static final String JOB_LOOKUP_QUERY = "SELECT * FROM CustomZipServiceRequest WHERE key=?";
+    private static final String JOB_DELETE_QUERY = "DELETE FROM CustomZipServiceRequest WHERE key=?";
+
+    public static List<String[]> lookupZipJob(String jobKey) {
+        // Before we do anything, it is super important to sanitize the
+        // supplied token - we don't want to insert anything sketchy into
+        // the db query below (an "injection attack").
+        // The java.sql PreparedStatement.setString() that we are using below
+        // should also be checking against an attempt to insert a sub-query.
+        // But better safe than sorry.
+        if (!validateTokenFormat(jobKey)) {
+            return null; // This will result in a "no such job" response.
+ } + + Connection c = connectToDatabase(); + + if (c == null) { + // no connection - no data, return null queitly + return null; + } + + PreparedStatement stmt; + ResultSet rs; + + List ret = new ArrayList<>(); + + try { + c.setAutoCommit(false); + + stmt = c.prepareStatement(JOB_LOOKUP_QUERY); + stmt.setString(1, jobKey); + rs = stmt.executeQuery(); + + while ( rs.next() ) { + String storageLocation = rs.getString("storageLocation"); + String fileName = rs.getString("fileName"); + + //System.out.println( "storageLocation = " + storageLocation ); + //System.out.println( "fileName = " + fileName ); + + String[] entry = new String[2]; + entry[0] = storageLocation; + entry[1] = fileName; + + ret.add(entry); + } + rs.close(); + stmt.close(); + } catch (Exception e) { + System.err.println( "Database error: " + e.getClass().getName()+" "+ e.getMessage() ); + // return null (but close the connection first): + try { + c.close(); + } catch (Exception ex) {} + return null; + } + + // Delete all the entries associated with the job, now that we are done + // with it. + + try { + stmt = c.prepareStatement(JOB_DELETE_QUERY); + stmt.setString(1, jobKey); + stmt.executeUpdate(); + c.commit(); + } catch (Exception e) { + // Not much we can or want to do, but complain in the Apache logs: + // (not even sure about printing any log messages either; the reason + // this delete failed may be because the admin chose to only give + // the zipper read-only access to the db - in which case this will + // be happening every time a job is processed. which in turn is + // ok - there is a backup cleanup mechanism for deleting older jobs + // on the application side as well). + //System.err.println("Failed to delete the job from the db"); + } + + try { + c.close(); + } catch (Exception e) {} + + return ret; + } + + // Opens the connection to the database. + // Uses the credentials supplied via JVM options + private static Connection connectToDatabase() { + Connection c = null; + + String host = System.getProperty("db.serverName") != null ? System.getProperty("db.serverName") : "localhost"; + String port = System.getProperty("db.portNumber") != null ? System.getProperty("db.portNumber") : "5432"; + String database = System.getProperty("db.databaseName") != null ? System.getProperty("db.databaseName") : "dvndb"; + String pguser = System.getProperty("db.user") != null ? System.getProperty("db.user") : "dvnapp"; + String pgpasswd = System.getProperty("db.password") != null ? System.getProperty("db.password") : "secret"; + + try { + Class.forName("org.postgresql.Driver"); + c = DriverManager + .getConnection("jdbc:postgresql://" + host + ":" + port + "/" + database, + pguser, + pgpasswd); + } catch (Exception e) { + return null; + } + return c; + } + + private static boolean validateTokenFormat(String jobKey) { + // A legitimate token is 16 characters long, and is made up of + // hex digits and one dash. 
+        if (jobKey == null
+                || jobKey.length() != JOB_TOKEN_LENGTH
+                || !jobKey.matches(JOB_TOKEN_REGEX)) {
+            return false;
+        }
+
+        return true;
+    }
+}
\ No newline at end of file
diff --git a/scripts/zipdownload/src/main/java/edu/harvard/iq/dataverse/custom/service/util/DirectAccessUtil.java b/scripts/zipdownload/src/main/java/edu/harvard/iq/dataverse/custom/service/util/DirectAccessUtil.java
new file mode 100644
index 00000000000..699c431dcf1
--- /dev/null
+++ b/scripts/zipdownload/src/main/java/edu/harvard/iq/dataverse/custom/service/util/DirectAccessUtil.java
@@ -0,0 +1,111 @@
+/*
+   Copyright (C) 2005-2012, by the President and Fellows of Harvard College.
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+
+   Dataverse Network - A web application to share, preserve and analyze research data.
+   Developed at the Institute for Quantitative Social Science, Harvard University.
+   Version 3.0.
+*/
+
+package edu.harvard.iq.dataverse.custom.service.util;
+
+import com.amazonaws.SdkClientException;
+import com.amazonaws.auth.profile.ProfileCredentialsProvider;
+import com.amazonaws.services.s3.AmazonS3;
+import com.amazonaws.services.s3.AmazonS3ClientBuilder;
+import com.amazonaws.services.s3.model.GetObjectRequest;
+import com.amazonaws.services.s3.model.ObjectMetadata;
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+
+/**
+ * Utility methods for directly accessing storage locations.
+ * Supports the file system and S3.
+ * (S3 has only been tested with AWS; non-standard auth may not be supported yet)
+ *
+ * @author Leonid Andreev
+ */
+public class DirectAccessUtil implements java.io.Serializable {
+
+    private AmazonS3 s3 = null;
+
+    public InputStream openDirectAccess(String storageLocation) {
+        InputStream inputStream = null;
+
+        if (storageLocation.startsWith("s3://")) {
+            createOrReuseAwsClient();
+
+            if (this.s3 == null) {
+                return null;
+            }
+
+            storageLocation = storageLocation.substring(5);
+
+            String bucket = storageLocation.substring(0, storageLocation.indexOf('/'));
+            String key = storageLocation.substring(storageLocation.indexOf('/') + 1);
+
+            //System.out.println("bucket: "+bucket);
+            //System.out.println("key: "+key);
+
+            /* The commented-out code below is for looking up S3 metadata
+               properties, such as size, etc., prior to making an access call:
+            ObjectMetadata objectMetadata = null;
+            long fileSize = 0L;
+            try {
+                objectMetadata = s3.getObjectMetadata(bucket, key);
+                fileSize = objectMetadata.getContentLength();
+                //System.out.println("byte size: "+objectMetadata.getContentLength());
+            } catch (SdkClientException sce) {
+                System.err.println("Cannot get S3 object metadata " + key + " from bucket " + bucket);
+            }*/
+
+            try {
+                inputStream = s3.getObject(new GetObjectRequest(bucket, key)).getObjectContent();
+            } catch (SdkClientException sce) {
+                System.err.println("Cannot get S3 object " + key + " from bucket " + bucket);
+            }
+
+        } else if (storageLocation.startsWith("file://")) {
+            // This could be a static method, since no reusable client/maintained
+            // state is required
+
+            storageLocation = storageLocation.substring(7);
+
+            try {
+                inputStream = new FileInputStream(new File(storageLocation));
+            } catch (IOException ioex) {
+                System.err.println("Cannot open file " + storageLocation);
+            }
+        }
+
+        // If the storage location was not recognized as either of the
+        // supported types, inputStream is still null at this point:
+        return inputStream;
+    }
+
+    private void createOrReuseAwsClient() {
+        if (this.s3 == null) {
+            try {
+                AmazonS3ClientBuilder s3CB = AmazonS3ClientBuilder.standard();
+                s3CB.setCredentials(new ProfileCredentialsProvider("default"));
+                this.s3 = s3CB.build();
+
+            } catch (Exception e) {
+                System.err.println("cannot instantiate an S3 client");
+            }
+        }
+    }
+
+}
\ No newline at end of file
diff --git a/src/main/java/edu/harvard/iq/dataverse/FileDownloadServiceBean.java b/src/main/java/edu/harvard/iq/dataverse/FileDownloadServiceBean.java
index 0832ec1db79..683142fc5c4 100644
--- a/src/main/java/edu/harvard/iq/dataverse/FileDownloadServiceBean.java
+++ b/src/main/java/edu/harvard/iq/dataverse/FileDownloadServiceBean.java
@@ -6,6 +6,8 @@
 import edu.harvard.iq.dataverse.authorization.users.AuthenticatedUser;
 import edu.harvard.iq.dataverse.authorization.users.PrivateUrlUser;
 import edu.harvard.iq.dataverse.authorization.users.User;
+import edu.harvard.iq.dataverse.dataaccess.DataAccess;
+import edu.harvard.iq.dataverse.dataaccess.StorageIO;
 import edu.harvard.iq.dataverse.datasetutility.WorldMapPermissionHelper;
 import edu.harvard.iq.dataverse.engine.command.exception.CommandException;
 import edu.harvard.iq.dataverse.engine.command.impl.CreateGuestbookResponseCommand;
@@ -16,6 +18,7 @@
 import edu.harvard.iq.dataverse.makedatacount.MakeDataCountLoggingServiceBean.MakeDataCountEntry;
 import edu.harvard.iq.dataverse.privateurl.PrivateUrl;
 import edu.harvard.iq.dataverse.privateurl.PrivateUrlServiceBean;
+import edu.harvard.iq.dataverse.settings.SettingsServiceBean;
 import edu.harvard.iq.dataverse.util.FileUtil;
 import java.io.IOException;
 import java.sql.Timestamp;
@@ -23,6 +26,7 @@
 import java.util.Arrays;
 import java.util.Date;
 import java.util.List;
+import java.util.UUID;
 import java.util.logging.Logger;
 import javax.ejb.EJB;
 import javax.ejb.Stateless;
@@ -31,6 +35,7 @@
 import javax.inject.Named;
 import javax.persistence.EntityManager;
 import javax.persistence.PersistenceContext;
+import javax.persistence.Query;
 import javax.servlet.ServletOutputStream;
 import javax.servlet.http.HttpServletResponse;
 
@@ -68,6 +73,8 @@ public class FileDownloadServiceBean implements java.io.Serializable {
     AuthenticationServiceBean authService;
     @EJB
     PrivateUrlServiceBean privateUrlService;
+    @EJB
+    SettingsServiceBean settingsService;
 
     @Inject
     DataverseSession session;
@@ -90,9 +97,12 @@ public void writeGuestbookAndStartBatchDownload(GuestbookResponse guestbookRespo
     }
 
     public void writeGuestbookAndStartBatchDownload(GuestbookResponse guestbookResponse, Boolean doNotSaveGuestbookRecord){
+
         if (guestbookResponse == null || guestbookResponse.getSelectedFileIds() == null) {
             return;
         }
+
+        boolean original = "original".equals(guestbookResponse.getFileFormat());
 
         // Let's intercept the case where a multiple download method was called,
         // with only 1 file on the list. We'll treat it like a single file download
@@ -119,21 +129,46 @@ public void writeGuestbookAndStartBatchDownload(GuestbookRespo
         }
 
         // OK, this is a real batch (multi-file) download.
+
+        String customZipDownloadUrl = settingsService.getValueForKey(SettingsServiceBean.Key.CustomZipDownloadServiceUrl);
+        boolean useCustomZipService = customZipDownloadUrl != null;
+        String zipServiceKey = null;
+
         // Do we need to write GuestbookRecord entries for the files?
-        if (!doNotSaveGuestbookRecord) {
+        if (!doNotSaveGuestbookRecord || useCustomZipService) {
             List<String> list = new ArrayList<>(Arrays.asList(guestbookResponse.getSelectedFileIds().split(",")));
-            
+            Timestamp timestamp = null;
+
             for (String idAsString : list) {
-                DataFile df = datafileService.findCheapAndEasy(new Long(idAsString));
+                //DataFile df = datafileService.findCheapAndEasy(new Long(idAsString));
+                DataFile df = datafileService.find(new Long(idAsString));
                 if (df != null) {
-                    guestbookResponse.setDataFile(df);
-                    writeGuestbookResponseRecord(guestbookResponse);
+                    if (!doNotSaveGuestbookRecord) {
+                        guestbookResponse.setDataFile(df);
+                        writeGuestbookResponseRecord(guestbookResponse);
+                    }
+
+                    if (useCustomZipService) {
+                        if (zipServiceKey == null) {
+                            zipServiceKey = generateServiceKey();
+                        }
+                        if (timestamp == null) {
+                            timestamp = new Timestamp(new Date().getTime());
+                        }
+
+                        addFileToCustomZipJob(zipServiceKey, df, timestamp, original);
+                    }
                 }
             }
         }
-        
-        redirectToBatchDownloadAPI(guestbookResponse.getSelectedFileIds(), "original".equals(guestbookResponse.getFileFormat()));
+
+        if (useCustomZipService) {
+            redirectToCustomZipDownloadService(customZipDownloadUrl, zipServiceKey);
+        } else {
+            // Use the "normal" /api/access/datafiles/ API:
+            redirectToBatchDownloadAPI(guestbookResponse.getSelectedFileIds(), original);
+        }
     }
 
     public void writeGuestbookAndStartFileDownload(GuestbookResponse guestbookResponse, FileMetadata fileMetadata, String format) {
@@ -216,6 +251,17 @@ private void redirectToBatchDownloadAPI(String multiFileString, Boolean guestboo
 
         PrimeFaces.current().executeScript("downloadFiles('"+fileDownloadUrl + "','"+ multiFileString+"');");
 
     }
+
+    private void redirectToCustomZipDownloadService(String customZipServiceUrl, String jobKey) {
+
+        customZipServiceUrl += "?" + jobKey;
+
+        try {
+            FacesContext.getCurrentInstance().getExternalContext().redirect(customZipServiceUrl);
+        } catch (IOException ex) {
+            logger.info("Failed to issue a redirect to the custom Zip download service.");
+        }
+    }
 
     private void redirectToDownloadAPI(String downloadType, Long fileId, boolean guestBookRecordAlreadyWritten, Long fileMetadataId) {
         String fileDownloadUrl = FileUtil.getFileDownloadUrlPath(downloadType, fileId, guestBookRecordAlreadyWritten, fileMetadataId);
@@ -487,6 +533,66 @@ public void sendRequestFileAccessNotification(Dataset dataset, Long fileId, Auth
             userNotificationService.sendNotification(au, new Timestamp(new Date().getTime()), UserNotification.Type.REQUESTFILEACCESS, fileId, null, requestor, false);
         });
-    }
+    }
+
+    public String generateServiceKey() {
+        UUID uid = UUID.randomUUID();
+        // the last 16 characters of the random UUID string - 15 hex digits
+        // and one dash:
+        return uid.toString().substring(20);
+    }
+
+    public void addFileToCustomZipJob(String key, DataFile dataFile, Timestamp timestamp, boolean orig) {
+        String location = null;
+        String fileName = null;
+
+        try {
+            StorageIO<DataFile> storageIO = DataAccess.getStorageIO(dataFile);
+            location = getDirectStorageLocation(storageIO.getStorageLocation());
+
+            if (orig && dataFile.isTabularData()) {
+                location = location.concat(".orig");
+            }
+        } catch (IOException ioex) {
+            logger.info("Failed to open StorageIO for datafile " + dataFile.getId());
+        }
+
+        if (dataFile.getFileMetadata() != null) {
+            if (orig && dataFile.isTabularData()) {
+                fileName = dataFile.getOriginalFileName();
+            } else {
+                fileName = dataFile.getFileMetadata().getLabel();
+            }
+        }
+
+        if (location != null && fileName != null) {
+            // Positional parameters, rather than string concatenation, are
+            // used here to guard against any SQL-unfriendly characters in
+            // file names:
+            em.createNativeQuery("INSERT INTO CUSTOMZIPSERVICEREQUEST (KEY, STORAGELOCATION, FILENAME, ISSUETIME) VALUES (?1, ?2, ?3, ?4)")
+                    .setParameter(1, key)
+                    .setParameter(2, location)
+                    .setParameter(3, fileName)
+                    .setParameter(4, timestamp)
+                    .executeUpdate();
+        }
+
+        // While we are here, issue another query, deleting all the entries
+        // older than 5 minutes (this could be further reduced) - a backup
+        // cleanup mechanism for any records whose zipper jobs never completed:
+        Timestamp deleteTime = new Timestamp(new Date().getTime() - 300000L);
+        em.createNativeQuery("DELETE FROM CUSTOMZIPSERVICEREQUEST WHERE ISSUETIME < ?1")
+                .setParameter(1, deleteTime)
+                .executeUpdate();
+    }
+
+    public String getDirectStorageLocation(String storageLocation) {
+        String storageDriverId;
+        int separatorIndex = storageLocation.indexOf("://");
+        if ( separatorIndex > 0 ) {
+            storageDriverId = storageLocation.substring(0, separatorIndex);
+
+            String storageType = DataAccess.getDriverType(storageDriverId);
+            if ("file".equals(storageType) || "s3".equals(storageType)) {
+                return storageType.concat(storageLocation.substring(separatorIndex));
+            }
+        }
+
+        return null;
+    }
 }
diff --git a/src/main/java/edu/harvard/iq/dataverse/api/Access.java b/src/main/java/edu/harvard/iq/dataverse/api/Access.java
index ea2d512f98f..97a0c4a6948 100644
--- a/src/main/java/edu/harvard/iq/dataverse/api/Access.java
+++ b/src/main/java/edu/harvard/iq/dataverse/api/Access.java
@@ -20,6 +20,7 @@
 import edu.harvard.iq.dataverse.DataverseServiceBean;
 import edu.harvard.iq.dataverse.DataverseSession;
 import edu.harvard.iq.dataverse.DataverseTheme;
+import edu.harvard.iq.dataverse.FileDownloadServiceBean;
 import edu.harvard.iq.dataverse.GuestbookResponse;
 import edu.harvard.iq.dataverse.GuestbookResponseServiceBean;
 import edu.harvard.iq.dataverse.PermissionServiceBean;
@@ -82,6 +83,7 @@
 import javax.json.Json;
 import javax.json.JsonObjectBuilder;
 import java.math.BigDecimal;
+import java.net.URI;
 import java.util.HashSet;
 import java.util.Set;
 import java.util.function.Consumer;
@@ -118,6 +120,8 @@
 import static edu.harvard.iq.dataverse.util.json.JsonPrinter.json;
 import static edu.harvard.iq.dataverse.util.json.JsonPrinter.json;
 import static edu.harvard.iq.dataverse.util.json.JsonPrinter.json;
+import java.net.URISyntaxException;
+import javax.ws.rs.RedirectionException;
 
 /*
     Custom API exceptions [NOT YET IMPLEMENTED]
@@ -171,6 +175,8 @@ public class Access extends AbstractApiBean {
     DataverseRoleServiceBean roleService;
     @EJB
     UserNotificationServiceBean userNotificationService;
+    @EJB
+    FileDownloadServiceBean fileDownloadService;
     @Inject
     PermissionsWrapper permissionsWrapper;
     @Inject
@@ -567,6 +573,7 @@ private Response downloadDatafiles(String rawFileIds, boolean gbrecs, String api
         if (rawFileIds == null || rawFileIds.equals("")) {
             throw new BadRequestException();
         }
+
         final String fileIds;
         if(rawFileIds.startsWith("fileIds=")) {
             fileIds = rawFileIds.substring(8); // Strip "fileIds=" from the front
@@ -577,6 +584,9 @@ private Response downloadDatafiles(String rawFileIds, boolean gbrecs, String api
          * final '\n' - this stops the last item from being parsed in the fileIds.split(","); line below.
          */
 
+        String customZipServiceUrl = settingsService.getValueForKey(SettingsServiceBean.Key.CustomZipDownloadServiceUrl);
+        boolean useCustomZipService = customZipServiceUrl != null;
+
         String apiToken = (apiTokenParam == null || apiTokenParam.equals(""))
                 ? headers.getHeaderString(API_KEY_HEADER)
                 : apiTokenParam;
 
@@ -590,6 +600,24 @@ private Response downloadDatafiles(String rawFileIds, boolean gbrecs, String api
                 getOrig = true;
             }
         }
+
+        if (useCustomZipService) {
+            URI redirectUri = handleCustomZipDownload(customZipServiceUrl, fileIds, apiToken, apiTokenUser, uriInfo, headers, gbrecs, true);
+
+            Response redirect = Response.seeOther(redirectUri).build();
+            logger.fine("Issuing redirect to the custom zipper service.");
+            throw new RedirectionException(redirect);
+        }
+
+        // Not using the "custom service" - the API will zip the files,
+        // and stream the output, in the "normal" manner:
+
         final boolean getOriginal = getOrig; //to use via anon inner class
 
         StreamingOutput stream = new StreamingOutput() {
@@ -1670,9 +1698,74 @@ private User findAPITokenUser(String apiToken) {
         return apiTokenUser;
     }
 
+    private URI handleCustomZipDownload(String customZipServiceUrl, String fileIds, String apiToken, User apiTokenUser, UriInfo uriInfo, HttpHeaders headers, boolean gbrecs, boolean orig) throws WebApplicationException {
+        String zipServiceKey = null;
+        Timestamp timestamp = null;
+
+        String[] fileIdParams = fileIds.split(",");
+        int validIdCount = 0;
+        int validFileCount = 0;
+        int downloadAuthCount = 0;
+        if (fileIdParams == null || fileIdParams.length == 0) {
+            throw new BadRequestException();
+        }
+
+        for (int i = 0; i < fileIdParams.length; i++) {
+            Long fileId = null;
+            try {
+                fileId = new Long(fileIdParams[i]);
+                validIdCount++;
+            } catch (NumberFormatException nfe) {
+                fileId = null;
+            }
+            if (fileId != null) {
+                DataFile file = dataFileService.find(fileId);
+                if (file != null) {
+                    validFileCount++;
+                    if (isAccessAuthorized(file, apiToken)) {
+                        logger.fine("adding datafile (id=" + file.getId() + ") to the download list of the ZippedDownloadInstance.");
+                        if (!gbrecs && file.isReleased()) {
+                            GuestbookResponse gbr = guestbookResponseService.initAPIGuestbookResponse(file.getOwner(), file, session, apiTokenUser);
+                            guestbookResponseService.save(gbr);
+                            MakeDataCountEntry entry = new MakeDataCountEntry(uriInfo, headers, dvRequestService, file);
+                            mdcLogService.logEntry(entry);
+                        }
+
+                        if (zipServiceKey == null) {
+                            zipServiceKey = fileDownloadService.generateServiceKey();
+                        }
+                        if (timestamp == null) {
+                            timestamp = new Timestamp(new Date().getTime());
+                        }
+
+                        fileDownloadService.addFileToCustomZipJob(zipServiceKey, file, timestamp, orig);
+                        downloadAuthCount++;
+                    }
+                }
+            }
+        }
+
+        if (validIdCount == 0) {
+            throw new BadRequestException();
+        }
+
+        if (validFileCount == 0) {
+            // no supplied id translated into an existing DataFile
+            throw new NotFoundException();
+        }
+
+        if (downloadAuthCount == 0) {
+            // none of the DataFiles were authorized for download
+            throw new ForbiddenException();
+        }
+
+        URI redirectUri = null;
+        try {
+            redirectUri = new URI(customZipServiceUrl + "?" + zipServiceKey);
+        } catch (URISyntaxException use) {
+            throw new BadRequestException();
+        }
+        return redirectUri;
+    }
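+
+    // Example of the resulting API behavior (hypothetical host and ids): with
+    // :CustomZipDownloadServiceUrl set, a multi-file request such as
+    //   curl -H "X-Dataverse-key: $API_TOKEN" "https://dataverse.example.edu/api/access/datafiles/101,102,103"
+    // is answered with a "303 See Other" redirect to <zipper url>?<job key>,
+    // instead of a zip stream served by the application itself.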
 }
diff --git a/src/main/java/edu/harvard/iq/dataverse/engine/command/DataverseRequest.java b/src/main/java/edu/harvard/iq/dataverse/engine/command/DataverseRequest.java
index 1384b7aef2e..1b75b040d48 100644
--- a/src/main/java/edu/harvard/iq/dataverse/engine/command/DataverseRequest.java
+++ b/src/main/java/edu/harvard/iq/dataverse/engine/command/DataverseRequest.java
@@ -91,6 +91,7 @@ public DataverseRequest(User aUser, HttpServletRequest aHttpServletRequest) {
             if (index >= 0) {
                 ip = ip.substring(index + 1);
             }
+            ip = ip.trim();
             /*
              * We should have a valid, single IP address string here. The IpAddress.valueOf
              * call will throw an exception if it can't be parsed into a valid address (e.g.
diff --git a/src/main/java/edu/harvard/iq/dataverse/settings/SettingsServiceBean.java b/src/main/java/edu/harvard/iq/dataverse/settings/SettingsServiceBean.java
index 4b69f016a46..b2e82d92dc3 100644
--- a/src/main/java/edu/harvard/iq/dataverse/settings/SettingsServiceBean.java
+++ b/src/main/java/edu/harvard/iq/dataverse/settings/SettingsServiceBean.java
@@ -431,11 +431,16 @@ Whether Harvesting (OAI) service is enabled
          * Validate physical files for all the datafiles in the dataset when publishing
          */
         FileValidationOnPublishEnabled,
+        /**
+         * If defined, this is the URL of the zipping service outside
+         * the main Application Server to which zip downloads are directed,
+         * instead of /api/access/datafiles/
+         */
+        CustomZipDownloadServiceUrl,
         /**
          * Sort Date Facets Chronologically instead of presenting them in order of # of hits, as other facets are. Default is true.
          */
         ChronologicalDateFacets
-
         ;
 
     @Override
diff --git a/src/main/resources/db/migration/V4.20.0.5__6505-zipdownload-jobs.sql b/src/main/resources/db/migration/V4.20.0.5__6505-zipdownload-jobs.sql
new file mode 100644
index 00000000000..484d5dd0784
--- /dev/null
+++ b/src/main/resources/db/migration/V4.20.0.5__6505-zipdownload-jobs.sql
@@ -0,0 +1,2 @@
+-- maybe temporary? - work in progress
+CREATE TABLE IF NOT EXISTS CUSTOMZIPSERVICEREQUEST (KEY VARCHAR(63), STORAGELOCATION VARCHAR(255), FILENAME VARCHAR(255), ISSUETIME TIMESTAMP);
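+-- A security-conscious admin can give the zipper its own database user with
+-- access to just this one table (a sketch - the role name and password below
+-- are placeholders; granting SELECT only, i.e. read-only access, also works,
+-- with the cleanup of old job records left entirely to the application side):
+-- CREATE ROLE zipper LOGIN PASSWORD 'xxxxx';
+-- GRANT SELECT, DELETE ON CUSTOMZIPSERVICEREQUEST TO zipper;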