# 6505 optimize zip downloads #6986
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,30 @@ | ||
| ### A multi-file, zipped download optimization | ||
|
|
||
| In this release we are offering an experimental optimization for the multi-file, download-as-zip functionality. If this option is enabled, instead of enforcing size limits, we attempt to serve all the files that the user has requested (and is authorized to download), but the request is redirected to a standalone zipper service running as a CGI executable. This moves these potentially long-running jobs completely outside the Application Server (Payara), preventing service threads from being tied up serving them. Since zipping is also a CPU-intensive task, it is possible to run this service on a different host system, freeing up cycles on the main Application Server. (The system running the service needs access to the database as well as to the storage filesystem and/or the S3 bucket.) | ||
|
|
||
| Please consult the scripts/zipdownload/README.md in the Dataverse 5 | ||
| source tree. | ||
|
|
||
| The components of the standalone "zipper tool" can also be downloaded | ||
| here: | ||
| (my plan is to build the executable and add it to the v5 release files on GitHub - L.A.) | ||
| https://github.com/IQSS/dataverse/releases/download/v5.0/zipper.zip. | ||
|
|
||
| ## New JVM Options and DB Options | ||
|
|
||
| ### New DB Option CustomZipDownloadServiceUrl | ||
|
|
||
| If defined, this is the URL of the zipping service outside the main Application Server to which zip download requests should be redirected (instead of /api/access/datafiles/). |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,104 @@ | ||
| Work in progress! | ||
|
|
||
| To build: | ||
|
|
||
| cd scripts/zipdownload | ||
| mvn clean compile assembly:single | ||
|
|
||
| To install: | ||
|
|
||
| Install cgi-bin/zipdownload and ZipDownloadService-v1.0.0.jar in your cgi-bin directory (/var/www/cgi-bin is the standard location). | ||
|
|
||
| Edit the config lines in the shell script (zipdownload) as needed. | ||
|
|
||
landreev marked this conversation as resolved.
Member
I ran into problems with the CGI script setup:

[Wed Jun 24 19:29:43.616634 2020] [cgid:error] [pid 57182:tid 140001334073088] [client 75.69.182.6:50927] AH01241: error spawning CGI child: exec of '/var/www/cgi-bin/zipdownload' failed (Permission denied): /var/www/cgi-bin/zipdownload, referer: http://ec2-34-224-6-242.compute-1.amazonaws.com/dataset.xhtml?persistentId=doi:10.5072/FK2/J5MISN&version=1.0

To overcome these problems I had to do two things, and both should be documented, I believe.
Ideally, we would figure out how to run the zipper with SELinux enabled. I once wrote some tips at http://guides.dataverse.org/en/4.20/developers/selinux.html
Contributor (Author)
I don't really want to specifically describe every possible problem with the Apache configuration - but I mentioned in the instructions that "you'll need to do whatever is necessary to make sure /cgi-bin/zipdownload is accessible from the outside". An exercise for the reader; an extra layer of weeding out those who have no business attempting to install this thing, if you wish.
Contributor (Author)
(After talking to Kevin I did incorporate some of the above in the doc, just to be kind :) |
||
| You may need to make extra Apache configuration changes to make sure /cgi-bin/zipdownload is accessible from the outside. | ||
| For example, if this is the same Apache that sits in front of your Dataverse Payara instance, you'll need to add another pass-through statement to your configuration: | ||
|
|
||
| ``ProxyPassMatch ^/cgi-bin/zipdownload !`` | ||
|
|
||
| (see the "Advanced" section of the Installation Guide for some extra troubleshooting tips) | ||
|
|
||
| To activate in Dataverse: | ||
|
|
||
| curl -X PUT -d '/cgi-bin/zipdownload' http://localhost:8080/api/admin/settings/:CustomZipDownloadServiceUrl | ||
|
|
||
| How it works: | ||
| ============= | ||
|
|
||
| (This is an ongoing design discussion - other developers are welcome to contribute) | ||
|
|
||
| The goal: to move this potentially long-running task out of the | ||
| Application Server. This is the sole focus of this implementation. It | ||
| does not attempt to make it faster. | ||
|
|
||
| The rationale here is that a zipped download of a large enough number of large enough files will always be slow. Zipping (compressing) itself is a fairly CPU-intensive task, and will most frequently be the bottleneck of the service; although with a slow storage location (S3 or Swift, with a slow link to the share), the bottleneck may instead be the speed at which the application accesses the raw bytes. The exact location of the bottleneck is in a sense irrelevant: on a very fast system, with the files stored on a very fast local RAID, the bottleneck for most users will likely shift to the speed of their internet connection to the server. The bottom line is, downloading this multi-file compressed stream will take a long time no matter how you slice it. So this hack addresses that by moving the task outside Payara, where it's not going to hog any threads. | ||
|
|
||
| A quick, somewhat unrelated note: attempting to download a multi-GB stream over HTTP will always have its own inherent risks. If the download takes hours or days to complete, it is very likely to break down somewhere in the middle. Do note that for a zipped download our users will not be able to use `wget --continue`, or any similar "resume" functionality, because it is impossible to resume generating a zipped stream from a given offset. | ||
|
|
||
| The implementation is a hack. It relies on direct access to everything - storage locations (filesystem or S3) and the database. | ||
|
|
||
| There are no network calls between the application (Dataverse) and the zipper (an implementation relying on such a call was discussed early on). Dataverse issues a "job key" and sends the user's browser to the zipper (to, for example, /cgi-bin/zipdownload?<job key> instead of /api/access/datafiles/<file ids>). To authorize the zip download for the "job key", and to inform the zipper which files to zip and where to find them, the application relies on a database table that the zipper also has access to. In other words, there is saved state information associated with each zipped download request. The zipper may be given limited database access - for example, via a user authorized to access that one table only. After serving the files, the zipper removes the database entries. Job records in the database have time stamps, so, as an added level of cleanup for any records that got stuck in the db because the corresponding zipper jobs never completed, the application automatically deletes any records older than 5 minutes (this can be further reduced) every time it adds new records. A paranoid admin may choose to give the zipper read-only access to the database, and rely on cleanup solely on the application side. | ||
|
|
||
| I have explored ways to avoid maintaining this state information. A | ||
|
Member
It's a little weird to use the first person here. The reader was directed to this README from the Installation Guide. If "I" is used, this letter should probably be signed. (There's some "me" below too.)
Contributor (Author)
I would prefer to leave this as is. It's not part of the real Dataverse guide. It says "work in progress" at the top; and that particular section of the file is a design discussion where I (as a developer) am addressing other developers. (I added a line saying just that, and encouraging others to offer feedback.) |
||
| potential implementation we discussed early on, where the application would make a network call to the zipper before redirecting the user there, would NOT solve that problem - the state would need to somehow be maintained on the zipper side. The only truly stateless implementation would rely on including all the file information WITH the redirect itself, with some pre-signed URL mechanism to make it secure. Mechanisms for pre-signing requests are readily available and simple to implement; we could go with something similar to how S3 pre-signs its access URLs. Jim Myers has already spec'd out how this could be done for Dataverse access URLs in a design document (https://docs.google.com/document/d/1J8GW6zi-vSRKZdtFjLpmYJ2SUIcIkAEwHkP4q1fxL-s/edit#). (Basically, you hash the combination of your request parameters, the issue timestamp AND some "secret" - like the user's API key - and send the resulting hash along with the request. Tampering with any of the parameters, or trying to extend the life span of the request, becomes impossible, because it would invalidate the hash.) What stopped me from trying something like that was the sheer size of the information that would need to be included with a request, for a potentially long list of files that need to be zipped. When serving a zipped download from a page, that would be doable - we could javascript together a POST call that the browser could make to send all that info to the zipper. But if we want to implement something similar in the API, I felt like I really wanted to be able to simply issue a quick redirect to a manageable URL - which with the implementation above is simply /cgi-bin/zipdownload?<job key>, with the <job key> being just a 16-character hex string in the current implementation. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| #!/bin/sh | ||
|
|
||
| # Java classpath: the directory where the service components are installed: | ||
| CLASSPATH=/var/www/cgi-bin; export CLASSPATH | ||
|
|
||
| # PostgreSQL connection settings; these must point to the Dataverse database: | ||
| PGHOST="localhost"; export PGHOST | ||
| PGPORT=5432; export PGPORT | ||
| PGUSER="dvnapp"; export PGUSER | ||
| PGDB="dvndb"; export PGDB | ||
| PGPW="xxxxx"; export PGPW | ||
|
|
||
| # Pass the settings to the zipper as -D system properties and run it: | ||
| java -Ddb.serverName=$PGHOST -Ddb.portNumber=$PGPORT -Ddb.user=$PGUSER -Ddb.databaseName=$PGDB -Ddb.password=$PGPW -jar ZipDownloadService-v1.0.0.jar |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,86 @@ | ||
| <?xml version="1.0" encoding="UTF-8"?> | ||
| <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" | ||
| xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> | ||
| <modelVersion>4.0.0</modelVersion> | ||
| <groupId>ZipDownloadService</groupId> | ||
| <artifactId>ZipDownloadService</artifactId> | ||
| <version>1.0.0</version> | ||
| <properties> | ||
| <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding> | ||
| </properties> | ||
| <pluginRepositories> | ||
| <pluginRepository> | ||
| <id>central</id> | ||
| <name>Central Repository</name> | ||
| <url>https://repo.maven.apache.org/maven2</url> | ||
| <layout>default</layout> | ||
| <snapshots> | ||
| <enabled>false</enabled> | ||
| </snapshots> | ||
| <releases> | ||
| <updatePolicy>never</updatePolicy> | ||
| </releases> | ||
| </pluginRepository> | ||
| </pluginRepositories> | ||
| <repositories> | ||
| <repository> | ||
| <id>central-repo</id> | ||
| <name>Central Repository</name> | ||
| <url>https://repo1.maven.org/maven2</url> | ||
| <layout>default</layout> | ||
| </repository> | ||
| </repositories> | ||
| <dependencyManagement> | ||
| <dependencies> | ||
| <dependency> | ||
| <groupId>com.amazonaws</groupId> | ||
| <artifactId>aws-java-sdk-bom</artifactId> | ||
| <version>1.11.790</version> | ||
| <type>pom</type> | ||
| <scope>import</scope> | ||
| </dependency> | ||
| </dependencies> | ||
| </dependencyManagement> | ||
| <dependencies> | ||
| <!-- https://mvnrepository.com/artifact/org.postgresql/postgresql --> | ||
| <dependency> | ||
| <groupId>org.postgresql</groupId> | ||
| <artifactId>postgresql</artifactId> | ||
| <version>42.2.2</version> | ||
| </dependency> | ||
| <dependency> | ||
| <groupId>com.amazonaws</groupId> | ||
| <artifactId>aws-java-sdk-s3</artifactId> | ||
| </dependency> | ||
| </dependencies> | ||
| <build> | ||
| <sourceDirectory>src/main/java</sourceDirectory> | ||
| <plugins> | ||
| <plugin> | ||
| <artifactId>maven-compiler-plugin</artifactId> | ||
| <version>3.1</version> | ||
| <configuration> | ||
| <source>1.8</source> | ||
| <target>1.8</target> | ||
| </configuration> | ||
| </plugin> | ||
| <plugin> | ||
| <groupId>org.apache.maven.plugins</groupId> | ||
| <artifactId>maven-assembly-plugin</artifactId> | ||
| <version>2.4</version> | ||
| <configuration> | ||
| <archive> | ||
| <manifest> | ||
| <mainClass>edu.harvard.iq.dataverse.custom.service.download.ZipDownloadService</mainClass> | ||
| </manifest> | ||
| </archive> | ||
| <descriptorRefs> | ||
| <descriptorRef>jar-with-dependencies</descriptorRef> | ||
| </descriptorRefs> | ||
| <finalName>${project.artifactId}-v${project.version}</finalName> | ||
| <appendAssemblyId>false</appendAssemblyId> | ||
| </configuration> | ||
| </plugin> | ||
| </plugins> | ||
| </build> | ||
| </project> |

Uh oh!
There was an error while loading. Please reload this page.