ARROW-7681: [Rust] Explicitly seeking kills BufRead performance. #6280
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? Then could you also rename pull request title in the following format? See also: |
|
Hi @maxburke, would you be able to show some benchmark results before and after your PR? Your change also introduces an |
|
@nevi-me I don't have any official benchmarks. For reference, I am not reading parquet files from disk but rather streaming them over a network connection. In my network-fetching code I was noticing a lot of 8 KB reads that were offset from each other by only one byte. It seemed to be doing a lot of 1-byte reads when it was reading page headers, with most of the fetched data being discarded and fetched again. I never completed any sort of benchmark run because it was just too slow. The documentation for std::io::BufReader mentions this behavior:
(emphasis mine) https://doc.rust-lang.org/std/io/struct.BufReader.html#impl-Seek I'll take a look at the test failures. |
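For readers who want to see the effect described above, here is a minimal sketch that counts how often the underlying reader is hit. `CountingReader` and both helper functions are hypothetical names made up for this illustration and are not part of the Parquet code. An in-memory `Cursor` stands in for the network stream, and the sketch assumes a toolchain where `BufReader::seek_relative` is available on stable (Rust 1.53 or later, as far as I know).

```rust
use std::io::{self, BufReader, Cursor, Read, Seek, SeekFrom};

/// Hypothetical wrapper that counts calls to the inner reader's `read`.
struct CountingReader<R> {
    inner: R,
    reads: usize,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.reads += 1;
        self.inner.read(buf)
    }
}

impl<R: Seek> Seek for CountingReader<R> {
    fn seek(&mut self, pos: SeekFrom) -> io::Result<u64> {
        self.inner.seek(pos)
    }
}

fn new_reader() -> BufReader<CountingReader<Cursor<Vec<u8>>>> {
    // 64 KiB of dummy data, larger than BufReader's default buffer.
    let source = CountingReader { inner: Cursor::new(vec![0u8; 64 * 1024]), reads: 0 };
    BufReader::new(source)
}

/// Reads every other byte, repositioning with an absolute seek each time.
/// `BufReader::seek` always discards the internal buffer, so every
/// 1-byte read has to go back to the inner reader.
fn inner_reads_with_absolute_seek(n: u64) -> io::Result<usize> {
    let mut reader = new_reader();
    let mut byte = [0u8; 1];
    for i in 0..n {
        reader.seek(SeekFrom::Start(2 * i))?;
        reader.read_exact(&mut byte)?;
    }
    Ok(reader.get_ref().reads)
}

/// Same access pattern, but skipping forward with `seek_relative`, which
/// keeps the buffer when the target still lies inside it.
fn inner_reads_with_relative_seek(n: u64) -> io::Result<usize> {
    let mut reader = new_reader();
    let mut byte = [0u8; 1];
    for _ in 0..n {
        reader.read_exact(&mut byte)?;
        reader.seek_relative(1)?;
    }
    Ok(reader.get_ref().reads)
}
```

With a small `n`, the absolute-seek variant should touch the inner reader once per iteration, while the relative variant should be served almost entirely from the buffer. Over a network connection, each of those inner reads would be a separate fetch.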
|
Please rebase after #6281 is merged, as it contains the fix for the failing tests |
rust/parquet/src/lib.rs
It's our intention to move to stable Rust at some point, we can revisit these in the future
nevi-me
left a comment
This adds 2 unstable features, whose timeline I'm unsure of. One feature entered its final comment period in 2018, but it seems it was never stabilised. Given that there's also uncertainty about specialization, I'm fine with these unstable features for now.
@maxburke do you think the behaviour that you're fixing here could also be applicable to the CSV and IPC readers?
I'll have to take a look. I grepped for
|
|
@nevi-me One problem I found is that the I've put a request in to see if this behavior can be avoided: rust-lang/rust#68559 Anyways, at worst, the performance profile will be the same as it was before. At best, if BufReader can be improved, this will be a lot better. (The issue I mentioned, and struck out, above was in my code, not this change). |
|
edit: disregard, I don't think this works; it doesn't provide methods for getting at the underlying reader which are required by a few areas in the Parquet code. |
Hey @maxburke, I'll wait for a second review from another Arrow committer before merging this, because we introduce the 2 unstable flags. |
liurenjie1024
left a comment
I'm generally fine with this change, but I share @nevi-me's concern, since we are moving towards stable Rust.
rust/parquet/src/util/io.rs
Given that casting a u64 to an i64 may overflow, can we change the code to
if self.start > pos {
...
} else {
...
}
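One way to read the suggestion above is to compare the two unsigned positions first and only then convert the distance to a signed value. The sketch below assumes that reading. `relative_offset` and its parameter names are hypothetical and are not taken from the PR; the bodies elided with `...` in the review comment are not reconstructed here.

```rust
use std::convert::TryFrom;

/// Hypothetical helper: the signed distance from `current` to `target`,
/// or `None` if that distance does not fit in an `i64`.
fn relative_offset(current: u64, target: u64) -> Option<i64> {
    if current > target {
        // Seeking backwards: take the unsigned distance first, then negate.
        // (This conservatively rejects a distance of exactly 2^63.)
        i64::try_from(current - target).ok().map(|d| -d)
    } else {
        // Seeking forwards, or staying in place.
        i64::try_from(target - current).ok()
    }
}
```

A caller could pass the resulting offset to a relative seek, and fall back to an absolute seek when the helper returns `None`.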
Thanks @maxburke. Left some comments. If this is a serious bug fix or is heavily impacting users, I would suggest merging this when this feature gets merged into stable Rust. |
BufRead will discard its internal buffer when seek() is called. This kills much of the performance of a buffered reader, especially when many tiny (1-byte) reads are performed.
…y change if the underlying file descriptor is manipulated. Use relative seeking to prevent BufReader from discarding the internal buffer.
020a1a4 to
7b5b905
|
I share the concerns about further reliance on nightly Rust. Let's revisit this for the release after 0.17.0
…ternal buffer (2) A fix to 7681 that does not use nightly (as opposed to #6280). Closes #6949 from rdettai/ARROW-7681 Authored-by: rdettai <rdettai@gmail.com> Signed-off-by: Chao Sun <sunchao@apache.org>