support HTTP "Range" header #6937 by pdurbin · Pull Request #8087 · IQSS/dataverse

pdurbin · 2021-09-03T20:57:38Z

What this PR does / why we need it:

We'd like users to be able to download parts of files: the beginning (like head), the end (like tail), and a range in the middle.

Which issue(s) this PR closes:

Closes #6937

Special notes for your reviewer:

Support is not full. As noted in the docs, only a single range is supported and the "If-Range" header is not supported.

Suggestions on how to test this:

Try the examples in the docs.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

No.

Is there a release notes update needed for this change?:

Yes.

Additional documentation:

Included.

Support is not full. As noted in the docs, only a single range is supported and the "If-Range" header is not supported.

coveralls · 2021-09-03T21:04:33Z

Coverage increased (+0.03%) to 19.044% when pulling d197b96 on 6937-range into 389373d on develop.

This was null for aux files, for example.

src/main/java/edu/harvard/iq/dataverse/api/DownloadInstanceWriter.java

landreev · 2021-09-09T22:53:31Z

One other thing I'm debating/wondering is if we should throw a "not supported" exception not just on requests like subsetting, that are generated dynamically - but on ANY byte range requests except for the main physical file, or the saved original.
I'm not sure I can think of a practical real world scenario where someone would want to read byte ranges on say, thumbnails. Or any other auxiliary files we store, except for the original of an ingested tab. file.

src/main/java/edu/harvard/iq/dataverse/api/DownloadInstanceWriter.java

landreev · 2021-09-10T19:41:56Z

Let me know if you have any questions/need any assistance with this PR.
It should really just be a matter of rearranging the code and logic you've already added a bit to make it fully ready.

Also start writing real integration tests. A FIXME now notes that ranges don't currently work well with tabular files, due to how we store the header.

pdurbin · 2021-09-30T21:03:32Z

Thanks @landreev for discussing this today. I went ahead and pushed what I have so that you can look at the backend. I'll keep working on the tests and will think more about the problem we discovered with ranges for tabular files. (In short, the header is always written, even if you request the last bytes of a file.)

To summarize the decisions made:

We're only using the leftToRead logic for range requests. Non-range requests get the same logic as in previous versions of Dataverse.
We'll support range requests for original and tabular files and also (for now at least) the other types of auxiliary files, such as thumbnails. There may not be a use case, but it seems to work fine.

Note to self that I plan to add a release note, work more on the tests, and see what I can do about tabular files.

If a range is not requested, write the whole variable header line. Otherwise, make a reasonable effort to write what we can.

pdurbin · 2021-10-04T20:11:19Z

@landreev in adab554 and added some logic to try to deal with variable headers. Basically:

write the headers if no range
if the range starts with 0, write as much as you can within the range

I'll keep working on tests and the release note.

landreev · 2021-10-05T16:14:43Z

src/main/java/edu/harvard/iq/dataverse/dataaccess/FileAccessIO.java


        try {
            in = new FileInputStream(getFileSystemPath().toFile());
+            in.skip(this.getOffset());


I believe this is what we want to rearrange: this in.skip() should not be happening here, in the open method, but in the setOffset() method itself. Because we want to be able to change that offset after the initial open.

The setOffset() method will need to throw an IOException, if it's called while the InputStream is still null; or if the skip() call itself results in an IOException()

Yes, excellent idea. Implemented in 4724e11 and now offset seems to be working again, at least for non-tabular files. Thanks!

Offset stopped working in 29ea256. Before that commit we were relying on setOffset() being called after open(). Now we can (and do) call setOffset() after open().

github-actions · 2021-10-05T20:36:41Z

src/test/java/edu/harvard/iq/dataverse/api/FilesIT.java

+        Response downloadFileNoArgs = UtilIT.downloadFile(fileIdCsv, null, null, null, authorApiToken);
+        downloadFileNoArgs.then().assertThat()
+                .statusCode(OK.getStatusCode())
+                .body(equalTo("name	pounds	species\n"


[reviewdog] _{reported by reviewdog 🐶}
File contains tab characters (this is the first instance).

Uh oh. Note to self: Ask @poikilotherm how to tell reviewdog that this is just a test and I actually want the tabs. Worst case I guess I could replace them with \t.

I couldn't find a fix in the reviewdog docs so in a57e629 I replaced the tabs with \t. It would be good to figure this out someday though.

pdurbin · 2021-10-05T20:56:02Z

I've been talking with @landreev and we got offset working again in 4724e11 except for tabular files, which I need to look into.

I also added some API tests, which I'll keep adding to.

I also added a release note.

Just above `ranges` is initialized to `new ArrayList<>()`.

landreev

I believe those are the only comments I had, otherwise I'm happy with it.
But, since I did contribute a little bit to it in the end, maybe somebody else could also take a look.

Most importantly, this commit introduces using status code 416 (Range Not Satisfiable) when the range is invalid. https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Range says, "If the ranges are invalid, the server returns the 416 Range Not Satisfiable error." I also rearranged the logic around testing for a single range. Now it fails fast and reports that only one range is supported rather than potentially showing an error about the range being beyond the file size, for example. I also changed the error message "Start is larger than end" to "Start is larger than end or size of file" to more accurately reflect what the check is doing. Finally, I did some cleanup and added additional tests.

pdurbin · 2021-10-15T15:28:22Z

After 3024d67 I'm ready for more review. Here's the detailed comment I left in that commit

clean up range header error handling #6937

Most importantly, this commit introduces using status code 416
(Range Not Satisfiable) when the range is invalid.
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Range
says, "If the ranges are invalid, the server returns the 416 Range Not
Satisfiable error."

I also rearranged the logic around testing for a single range. Now it
fails fast and reports that only one range is supported rather than
potentially showing an error about the range being beyond the file size,
for example.

I also changed the error message
"Start is larger than end" to
"Start is larger than end or size of file" to more accurately reflect
what the check is doing.

Finally, I did some cleanup and added additional tests.

scolapasta · 2021-10-21T20:55:17Z

src/main/java/edu/harvard/iq/dataverse/dataaccess/Range.java

+    public Range(long start, long end) {
+        this.start = start;
+        this.end = end;
+        this.length = end - start + 1;


is there any reason to store the length here and not just calcualte in in the get? (future proofing for if we some day add setters for start and end)

Sure, fixed in 22d5ffb.

scolapasta

Just added a couple of minor thoughts - neither is critical, but would make (imo) for cleaner code.

scolapasta · 2021-10-21T20:58:23Z

src/test/java/edu/harvard/iq/dataverse/api/FilesIT.java

+        Response downloadTxtNoArgs = UtilIT.downloadFile(fileIdTxt, null, null, null, authorApiToken);
+        downloadTxtNoArgs.then().assertThat()
+                .statusCode(OK.getStatusCode())
+                .body(equalTo("first is the worst\n"


could we simplify the body(equals to, by using String functions? i.e instead of typing the specific text again and again, just do the proper substring on contentOfTxt? Feels like it would make the tests. more readable.

I had a play with this and I like my way better. I'll show some examples below but for the ranges in particular, it's weird and confusing to have to add one to the "end" parameter in substring. You're asking for a range of 0-9 but you have to do substring(0, 10) to get the test to pass.

My idea with these tests is that you establish a small bit of text up front and then visually reason about "did I get the first 10 characters or not?"

From chatting with @scolapasta it sounded like he didn't feel super strongly about these tests anyway. Just a suggestion.

// Download the whole file. Response downloadTxtNoArgs = UtilIT.downloadFile(fileIdTxt, null, null, null, authorApiToken); downloadTxtNoArgs.then().assertThat() .statusCode(OK.getStatusCode()) .body(equalTo(contentOfTxt.substring(0, contentOfTxt.length()))); // .body(equalTo("first is the worst\n" // + "second is the best\n" // + "third is the one with the hairy chest\n")); // Download the first 10 bytes. Response downloadTxtFirst10 = UtilIT.downloadFile(fileIdTxt, "0-9", null, null, authorApiToken); downloadTxtFirst10.then().assertThat() .statusCode(OK.getStatusCode()) .body(equalTo(contentOfTxt.substring(0, 10))); // .body(equalTo("first is t")); // Download the last 6 bytes. Response downloadTxtLast6 = UtilIT.downloadFile(fileIdTxt, "-6", null, null, authorApiToken); downloadTxtLast6.then().assertThat() .statusCode(OK.getStatusCode()) .body(equalTo(contentOfTxt.substring(contentOfTxt.length() - 6, contentOfTxt.length()))); // .body(equalTo("chest\n")); // Download some bytes from the middle. Response downloadTxtMiddle = UtilIT.downloadFile(fileIdTxt, "09-19", null, null, authorApiToken); downloadTxtMiddle.then().assertThat() .statusCode(OK.getStatusCode()) .body(equalTo(contentOfTxt.substring(9, 20))); // .body(equalTo("the worst\ns")); // Skip the first 10 bytes and download the rest. Response downloadTxtSkipFirst10 = UtilIT.downloadFile(fileIdTxt, "9-", null, null, authorApiToken); downloadTxtSkipFirst10.then().assertThat() .statusCode(OK.getStatusCode()) .body(equalTo(contentOfTxt.substring(9, contentOfTxt.length()))); // .body(equalTo("the worst\n" // + "second is the best\n" // + "third is the one with the hairy chest\n"));

pdurbin · 2021-10-22T20:55:14Z

Ok, based on code review, I fixed up the code in 22d5ffb and played around with the tests. I also merged the latest from develop. I'm ready for more review.

pdurbin · 2021-11-08T14:48:13Z

On Friday I merged the latest from develop into this branch to pick up the 5.8 pom.xml change but the API tests failed because certain things hadn't happened yet:

The 5.8 dvinstall.zip hadn't been uploaded to the release yet. This has since been done.
dataverse-ansible hadn't been bumped to 5.8 yet. This has since been done at gdcc/dataverse-ansible@85e9dc6

This morning I made a small change to the release note to force the API tests to run and they passed.

support HTTP "Range" header #6937

e82771a

Support is not full. As noted in the docs, only a single range is supported and the "If-Range" header is not supported.

landreev self-requested a review September 7, 2021 13:26

landreev self-assigned this Sep 7, 2021

pdurbin added 3 commits September 8, 2021 14:03

avoid NPE when HttpHeaders is null #6937

875d08d

This was null for aux files, for example.

set file size properly for original tabular files #6937

fb8d559

send content size from range #6937

809d20b