Skip to content

Conversation

@maxburke
Copy link
Contributor

BufRead will discard its internal buffer when seek() is called. This
kills much of the performance of a buffered reader, especially when many
tiny (1-byte) reads are performed.

@github-actions
Copy link

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@nevi-me
Copy link
Contributor

nevi-me commented Jan 25, 2020

Hi @maxburke, would you be able to show some benchmark results before and after your PR? Your change also introduces an end of file failure. Would you be able to look into that? Thanks

@maxburke
Copy link
Contributor Author

maxburke commented Jan 26, 2020

@nevi-me I don't have any official benchmarks. For reference, I am not reading parquet files from disk but rather streamed over a network connection. In the code I have that was performing the network fetching I was noticing a lot of reads of 8kb but offset by only one byte, ie:

Seek to offset 0, read 8192 bytes.
Seek to offset 1, read 8192 bytes.
Seek to offset 2, read 8192 bytes.
...

It seemed to be doing a lot of 1-byte reads when it reading page headers, of which most data was being discarded and fetched again. I never completed any sort of benchmark run because it was just too slow.

The documentation for std::io::BufReader mentions this behavior:

Seeking always discards the internal buffer, even if the seek position would otherwise fall within it. This guarantees that calling .into_inner() immediately after a seek yields the underlying reader at the same position.

(emphasis mine)

https://doc.rust-lang.org/std/io/struct.BufReader.html#impl-Seek

I'll take a look at the test failures.

@maxburke maxburke changed the title [rust] Explicitly seeking kills BufRead performance. ARROW-7681: [rust] Explicitly seeking kills BufRead performance. Jan 26, 2020
@github-actions
Copy link

@nevi-me
Copy link
Contributor

nevi-me commented Jan 26, 2020

Please rebase after #6281 is merged, as it contains the fix for the failing tests

@nevi-me nevi-me changed the title ARROW-7681: [rust] Explicitly seeking kills BufRead performance. ARROW-7681: [Rust] Explicitly seeking kills BufRead performance. Jan 26, 2020
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's our intention to move to stable Rust at some point, we can revisit these in the future

Copy link
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This adds 2 unstable features, whose timeline I'm unsure of. One feature entered its final comment period in 2018, but seems was never stabilised. Given that there's also uncertainty about specialization, I'm fine with these unstable features for now.

@maxburke do you think the behaviour that you're fixing here could also be applicable to the CSV and IPC readers?

@maxburke maxburke changed the title ARROW-7681: [Rust] Explicitly seeking kills BufRead performance. WIP: ARROW-7681: [Rust] Explicitly seeking kills BufRead performance. Jan 26, 2020
@maxburke
Copy link
Contributor Author

maxburke commented Jan 26, 2020

This adds 2 unstable features, whose timeline I'm unsure of. One feature entered its final comment period in 2018, but seems was never stabilised. Given that there's also uncertainty about specialization, I'm fine with these unstable features for now.

@maxburke do you think the behaviour that you're fixing here could also be applicable to the CSV and IPC readers?

I'll have to take a look. I grepped for seek in the Parquet module and didn't see anything else that looked like it followed a pattern.

I've changed the title to WIP; there's a bug I'm chasing down and I don't know if it's part of this change or not.

@maxburke maxburke changed the title WIP: ARROW-7681: [Rust] Explicitly seeking kills BufRead performance. ARROW-7681: [Rust] Explicitly seeking kills BufRead performance. Jan 26, 2020
@maxburke
Copy link
Contributor Author

@nevi-me One problem I found is that the stream_position convenience function just calls seek(SeekFrom::Current(0)), which causes the BufReader to discard its internal buffer anyways.

I've put a request in to see if this behavior can be avoided: rust-lang/rust#68559

Anyways, at worst, the performance profile will be the same as it was before. At best, if BufReader can be improved, this will be a lot better.

(The issue I mentioned, and struck out, above was in my code, not this change).

@maxburke
Copy link
Contributor Author

maxburke commented Jan 26, 2020

Actually... @nevi-me what would you think of using the seek_bufread crate instead of the std::io BufReader https://docs.rs/seek_bufread/1.2.2/seek_bufread/ ?

edit: disregard, I don't think this works; it doesn't provide methods for getting at the underlying reader which are required by a few areas in the Parquet code.

@nevi-me
Copy link
Contributor

nevi-me commented Feb 4, 2020

edit: disregard, I don't think this works; it doesn't provide methods for getting at the underlying reader which are required by a few areas in the Parquet code.

Hey @maxburke, I'll wait for a second review from another Arrow commiter before merging this, because we introduce the 2 unstable flags.
CC @liurenjie1024 @paddyhoran @andygrove

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm generally fine with this change. But I have same concern with @nevi-me since we are moving towards to stable rust.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that u64 cast i64 may overflow, can we change the code to

if self.start > pos {
 ...
} else {
 ...
}

@liurenjie1024
Copy link
Contributor

Thanks @maxburke Left some comments. If this is a serious bug fix or impacting users heavily, I would suggest merging this when this feature got merged into stable rust.

BufRead will discard its internal buffer when seek() is called. This
kills much of the performance of a buffered reader, especially when many
tiny (1-byte) reads are performed.
…y change if the underlying file descriptor is manipulated. Use relative seeking to prevent BufReader from discarding the internal buffer.
@kszucs kszucs force-pushed the rust_fix_parquet_bufread_seek branch from 020a1a4 to 7b5b905 Compare February 7, 2020 10:11
@andygrove
Copy link
Member

I share the concerns about further reliance on nighly Rust. Let's revisit this for the release after 0.17.0

sunchao pushed a commit that referenced this pull request Apr 27, 2020
…ternal buffer (2)

A fix to 7681 that does not use nightly (as oposed to #6280).

Closes #6949 from rdettai/ARROW-7681

Authored-by: rdettai <rdettai@gmail.com>
Signed-off-by: Chao Sun <sunchao@apache.org>
@nevi-me
Copy link
Contributor

nevi-me commented Apr 28, 2020

@maxburke can you please test master, an alternative solution has been provided by #6949. If it doesn't resolve the problem, we can reopen this PR.

@nevi-me nevi-me closed this Apr 28, 2020
alamb pushed a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021
…ternal buffer (2)

A fix to 7681 that does not use nightly (as oposed to apache/arrow#6280).

Closes #6949 from rdettai/ARROW-7681

Authored-by: rdettai <rdettai@gmail.com>
Signed-off-by: Chao Sun <sunchao@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants