ARROW-14192: [C++][Dataset] Backpressure broken on ordered scans #11294
Conversation
lidavidm left a comment
LGTM overall, but left some questions (more for my own understanding).
python/pyarrow/tests/test_dataset.py (Outdated)
I suppose we have a good number of these now, but timing-based tests always make me feel somewhat icky. Is there value to testing this in Python as well as C++?
If we do something like https://issues.apache.org/jira/browse/ARROW-14367, we can remove the timing dependency. That being said, I understand the concern, and there was a similar concern in the dataset write backpressure PR: #11286 (comment)
I will remove this test from this PR and rely on the existing Python test. When ARROW-14367 is implemented, the timing dependency can be removed from the other test (and we can maybe reintroduce this test at that point).
The test has been removed.
…readahead but won't allow backpressure to explode if scanning in a sequenced fashion.
… some grammar in async_generator.h comments. Removed timing-dependent backpressure test to avoid having too many timing-dependent tests.
Co-authored-by: David Li <li.davidm96@gmail.com>
Rebased and addressed PR comments.
lidavidm left a comment
Thanks, LGTM.
CI failures are unrelated (could not find LLVM) and these tests passed on earlier versions. I will proceed with merging.
Benchmark runs are scheduled for baseline = f2f663b and contender = 9abd2b1. 9abd2b1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
@westonpace It seems that "C++ / AMD64 Ubuntu 20.04 C++ ASAN UBSAN" is failing because of this.
While scanning, we do our best to read ahead across multiple files, so we will read files 1, 2, 3, and 4 all at the same time. This helps maintain bandwidth when some files hit a snag (as sometimes happens on AWS). However, when doing an ordered scan, this can cause backpressure to explode when there is a slow consumer.
The sequencer (placed at the end of the pipeline) can get into a situation where it pulls aggressively from files 2, 3, and 4 while waiting for the next chunk from file 1. Since the sequencer is consuming the batches, the backpressure mechanism thinks they are being consumed. However, the actual consumer is leaving the batches piling up at the sequencer.
This PR introduces one possible solution (and it may be the only possible solution), which is to sequence the batches at merge time (early in the pipeline). The sequencer won't need to pull aggressively and backpressure will be maintained. This significantly reduces (but does not eliminate) the amount of file readahead we do in ordered scans. We can worry about that if it ends up being a bottleneck at some point, but for now I think it is better that we do not explode RAM.
This builds on ARROW-13611 and will remain in draft until that PR has merged.
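To make the sequencing idea concrete, here is a minimal, self-contained C++ sketch of a reorder buffer. It is only an illustration of the concept, not Arrow's actual merge/sequencing code; the `Sequencer` and `Batch` names are hypothetical. It shows how batches that arrive out of order sit in a buffer until the missing earlier batch arrives: if that buffering happens after the backpressure check, the held-back batches look "consumed" even though they have only been queued, whereas reordering at merge time keeps them visible to backpressure.

```cpp
// Toy illustration of sequencing: a reorder buffer that only releases
// batches in sequence order. If this buffer sits downstream of the
// backpressure check, everything it holds looks "consumed" to the scanner.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Batch {
  int64_t sequence_index;  // position of the batch in the ordered scan
  std::string payload;     // stand-in for record batch data
};

class Sequencer {
 public:
  // Accept a possibly out-of-order batch and return every batch that can
  // now be emitted in order.
  std::vector<Batch> Push(Batch batch) {
    buffered_.emplace(batch.sequence_index, std::move(batch));
    std::vector<Batch> ready;
    // Emit the longest contiguous run starting at next_index_.
    auto it = buffered_.find(next_index_);
    while (it != buffered_.end()) {
      ready.push_back(std::move(it->second));
      buffered_.erase(it);
      ++next_index_;
      it = buffered_.find(next_index_);
    }
    return ready;
  }

  // Batches held back while waiting for an earlier one. These are the
  // batches that "pile up at the sequencer" in the description above.
  size_t buffered_count() const { return buffered_.size(); }

 private:
  int64_t next_index_ = 0;
  std::map<int64_t, Batch> buffered_;
};

int main() {
  Sequencer sequencer;
  // Files 2, 3, and 4 deliver their first batches before file 1 does.
  for (int64_t index : {1, 2, 3, 0}) {
    std::vector<Batch> ready = sequencer.Push(Batch{index, "data"});
    std::cout << "pushed " << index << ": emitted " << ready.size()
              << ", buffered " << sequencer.buffered_count() << "\n";
  }
  return 0;
}
```

With the sequencer at the very end of the pipeline, the equivalent of `buffered_count()` can grow without bound while backpressure stays released; performing the reordering at merge time means the scanner itself sees the held-back batches and throttles its readahead accordingly.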