Skip to content

support HTTP "Range" header #6937#8087

Merged
kcondon merged 31 commits intodevelopfrom
6937-range
Nov 8, 2021
Merged

support HTTP "Range" header #6937#8087
kcondon merged 31 commits intodevelopfrom
6937-range

Conversation

@pdurbin
Copy link
Member

@pdurbin pdurbin commented Sep 3, 2021

What this PR does / why we need it:

We'd like users to be able to download parts of files: the beginning (like head), the end (like tail), and a range in the middle.

Which issue(s) this PR closes:

Closes #6937

Special notes for your reviewer:

Support is not full. As noted in the docs, only a single range is supported and the "If-Range" header is not supported.

Suggestions on how to test this:

Try the examples in the docs.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

No.

Is there a release notes update needed for this change?:

Yes.

Additional documentation:

Included.

Support is not full. As noted in the docs, only a single range is
supported and the "If-Range" header is not supported.
@coveralls
Copy link

coveralls commented Sep 3, 2021

Coverage Status

Coverage increased (+0.03%) to 19.044% when pulling d197b96 on 6937-range into 389373d on develop.

@landreev landreev self-requested a review September 7, 2021 13:26
@landreev landreev self-assigned this Sep 7, 2021
@landreev
Copy link
Contributor

landreev commented Sep 9, 2021

One other thing I'm debating/wondering is if we should throw a "not supported" exception not just on requests like subsetting, that are generated dynamically - but on ANY byte range requests except for the main physical file, or the saved original.
I'm not sure I can think of a practical real world scenario where someone would want to read byte ranges on say, thumbnails. Or any other auxiliary files we store, except for the original of an ingested tab. file.

@landreev
Copy link
Contributor

Let me know if you have any questions/need any assistance with this PR.
It should really just be a matter of rearranging the code and logic you've already added a bit to make it fully ready.

Also start writing real integration tests.

A FIXME now notes that ranges don't currently work well with tabular
files, due to how we store the header.
@pdurbin
Copy link
Member Author

pdurbin commented Sep 30, 2021

Thanks @landreev for discussing this today. I went ahead and pushed what I have so that you can look at the backend. I'll keep working on the tests and will think more about the problem we discovered with ranges for tabular files. (In short, the header is always written, even if you request the last bytes of a file.)

To summarize the decisions made:

  • We're only using the leftToRead logic for range requests. Non-range requests get the same logic as in previous versions of Dataverse.
  • We'll support range requests for original and tabular files and also (for now at least) the other types of auxiliary files, such as thumbnails. There may not be a use case, but it seems to work fine.

Note to self that I plan to add a release note, work more on the tests, and see what I can do about tabular files.

If a range is not requested, write the whole variable header line.
Otherwise, make a reasonable effort to write what we can.
@pdurbin
Copy link
Member Author

pdurbin commented Oct 4, 2021

@landreev in adab554 and added some logic to try to deal with variable headers. Basically:

  • write the headers if no range
  • if the range starts with 0, write as much as you can within the range

I'll keep working on tests and the release note.


try {
in = new FileInputStream(getFileSystemPath().toFile());
in.skip(this.getOffset());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is what we want to rearrange: this in.skip() should not be happening here, in the open method, but in the setOffset() method itself. Because we want to be able to change that offset after the initial open.

The setOffset() method will need to throw an IOException, if it's called while the InputStream is still null; or if the skip() call itself results in an IOException()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, excellent idea. Implemented in 4724e11 and now offset seems to be working again, at least for non-tabular files. Thanks!

Offset stopped working in 29ea256. Before that commit we were relying on
setOffset() being called after open().

Now we can (and do) call setOffset() after open().
Response downloadFileNoArgs = UtilIT.downloadFile(fileIdCsv, null, null, null, authorApiToken);
downloadFileNoArgs.then().assertThat()
.statusCode(OK.getStatusCode())
.body(equalTo("name pounds species\n"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[reviewdog] reported by reviewdog 🐶
File contains tab characters (this is the first instance).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh oh. Note to self: Ask @poikilotherm how to tell reviewdog that this is just a test and I actually want the tabs. Worst case I guess I could replace them with \t.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find a fix in the reviewdog docs so in a57e629 I replaced the tabs with \t. It would be good to figure this out someday though.

@pdurbin
Copy link
Member Author

pdurbin commented Oct 5, 2021

I've been talking with @landreev and we got offset working again in 4724e11 except for tabular files, which I need to look into.

I also added some API tests, which I'll keep adding to.

I also added a release note.

Copy link
Contributor

@landreev landreev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe those are the only comments I had, otherwise I'm happy with it.
But, since I did contribute a little bit to it in the end, maybe somebody else could also take a look.

@landreev landreev assigned landreev and unassigned landreev Oct 14, 2021
@pdurbin pdurbin self-assigned this Oct 14, 2021
Most importantly, this commit introduces using status code 416
(Range Not Satisfiable) when the range is invalid.
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Range
says, "If the ranges are invalid, the server returns the 416 Range Not
Satisfiable error."

I also rearranged the logic around testing for a single range. Now it
fails fast and reports that only one range is supported rather than
potentially showing an error about the range being beyond the file size,
for example.

I also changed the error message
"Start is larger than end" to
"Start is larger than end or size of file" to more accurately reflect
what the check is doing.

Finally, I did some cleanup and added additional tests.
@pdurbin
Copy link
Member Author

pdurbin commented Oct 15, 2021

After 3024d67 I'm ready for more review. Here's the detailed comment I left in that commit

clean up range header error handling #6937

Most importantly, this commit introduces using status code 416
(Range Not Satisfiable) when the range is invalid.
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Range
says, "If the ranges are invalid, the server returns the 416 Range Not
Satisfiable error."

I also rearranged the logic around testing for a single range. Now it
fails fast and reports that only one range is supported rather than
potentially showing an error about the range being beyond the file size,
for example.

I also changed the error message
"Start is larger than end" to
"Start is larger than end or size of file" to more accurately reflect
what the check is doing.

Finally, I did some cleanup and added additional tests.

@pdurbin pdurbin removed their assignment Oct 15, 2021
@djbrooke djbrooke assigned scolapasta and unassigned landreev Oct 18, 2021
public Range(long start, long end) {
this.start = start;
this.end = end;
this.length = end - start + 1;
Copy link
Contributor

@scolapasta scolapasta Oct 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any reason to store the length here and not just calcualte in in the get? (future proofing for if we some day add setters for start and end)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, fixed in 22d5ffb.

Copy link
Contributor

@scolapasta scolapasta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just added a couple of minor thoughts - neither is critical, but would make (imo) for cleaner code.

Response downloadTxtNoArgs = UtilIT.downloadFile(fileIdTxt, null, null, null, authorApiToken);
downloadTxtNoArgs.then().assertThat()
.statusCode(OK.getStatusCode())
.body(equalTo("first is the worst\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we simplify the body(equals to, by using String functions? i.e instead of typing the specific text again and again, just do the proper substring on contentOfTxt? Feels like it would make the tests. more readable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a play with this and I like my way better. I'll show some examples below but for the ranges in particular, it's weird and confusing to have to add one to the "end" parameter in substring. You're asking for a range of 0-9 but you have to do substring(0, 10) to get the test to pass.

My idea with these tests is that you establish a small bit of text up front and then visually reason about "did I get the first 10 characters or not?"

From chatting with @scolapasta it sounded like he didn't feel super strongly about these tests anyway. Just a suggestion.

// Download the whole file.
Response downloadTxtNoArgs = UtilIT.downloadFile(fileIdTxt, null, null, null, authorApiToken);
downloadTxtNoArgs.then().assertThat()
        .statusCode(OK.getStatusCode())
        .body(equalTo(contentOfTxt.substring(0, contentOfTxt.length())));
//        .body(equalTo("first is the worst\n"
//                + "second is the best\n"
//                + "third is the one with the hairy chest\n"));

// Download the first 10 bytes.
Response downloadTxtFirst10 = UtilIT.downloadFile(fileIdTxt, "0-9", null, null, authorApiToken);
downloadTxtFirst10.then().assertThat()
        .statusCode(OK.getStatusCode())
        .body(equalTo(contentOfTxt.substring(0, 10)));
//        .body(equalTo("first is t"));

// Download the last 6 bytes.
Response downloadTxtLast6 = UtilIT.downloadFile(fileIdTxt, "-6", null, null, authorApiToken);
downloadTxtLast6.then().assertThat()
        .statusCode(OK.getStatusCode())
        .body(equalTo(contentOfTxt.substring(contentOfTxt.length() - 6, contentOfTxt.length())));
//        .body(equalTo("chest\n"));

// Download some bytes from the middle.
Response downloadTxtMiddle = UtilIT.downloadFile(fileIdTxt, "09-19", null, null, authorApiToken);
downloadTxtMiddle.then().assertThat()
        .statusCode(OK.getStatusCode())
        .body(equalTo(contentOfTxt.substring(9, 20)));
//        .body(equalTo("the worst\ns"));

// Skip the first 10 bytes and download the rest.
Response downloadTxtSkipFirst10 = UtilIT.downloadFile(fileIdTxt, "9-", null, null, authorApiToken);
downloadTxtSkipFirst10.then().assertThat()
        .statusCode(OK.getStatusCode())
        .body(equalTo(contentOfTxt.substring(9, contentOfTxt.length())));
//        .body(equalTo("the worst\n"
//                + "second is the best\n"
//                + "third is the one with the hairy chest\n"));

@scolapasta scolapasta assigned pdurbin and unassigned scolapasta Oct 21, 2021
@pdurbin
Copy link
Member Author

pdurbin commented Oct 22, 2021

Ok, based on code review, I fixed up the code in 22d5ffb and played around with the tests. I also merged the latest from develop. I'm ready for more review.

@pdurbin pdurbin removed their assignment Oct 22, 2021
@pdurbin
Copy link
Member Author

pdurbin commented Nov 8, 2021

On Friday I merged the latest from develop into this branch to pick up the 5.8 pom.xml change but the API tests failed because certain things hadn't happened yet:

  • The 5.8 dvinstall.zip hadn't been uploaded to the release yet. This has since been done.
  • dataverse-ansible hadn't been bumped to 5.8 yet. This has since been done at gdcc/dataverse-ansible@85e9dc6

This morning I made a small change to the release note to force the API tests to run and they passed.

@kcondon kcondon self-assigned this Nov 8, 2021
@kcondon kcondon merged commit b117a31 into develop Nov 8, 2021
@kcondon kcondon deleted the 6937-range branch November 8, 2021 22:21
@djbrooke djbrooke added this to the 5.9 milestone Nov 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add byte offset functionality to StorageIO and Access API

6 participants