ARROW-12254: [Rust][DataFusion] Stop polling limit input once limit is reached #9926
The actual code change for this PR is very small -- the rest of the changes are related to writing a proper test for it
I have a vague memory that FusedStream may have something to do with this property (although /noideadog)
This is a good point.
One benefit of this PR over fuse() is that this PR will actually drop the input stream (freeing resources) in addition to not calling the input stream again: https://docs.rs/futures-util/0.3.13/src/futures_util/stream/stream/fuse.rs.html#10
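As a rough illustration of that difference, here is a sketch that uses plain iterators instead of real streams. Every type name in it (TrackedInput, FlagLimit, DroppingLimit) is hypothetical and comes from neither DataFusion nor futures-util. FlagLimit stands in for any approach that stops calling the input but keeps owning it, which is similar in spirit to a fused wrapper. DroppingLimit stands in for the Option approach.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical input that records when it is dropped.
struct TrackedInput {
    remaining: u32,
    dropped: Rc<Cell<bool>>,
}

impl Iterator for TrackedInput {
    type Item = u32;
    fn next(&mut self) -> Option<u32> {
        if self.remaining == 0 {
            return None;
        }
        self.remaining -= 1;
        Some(self.remaining)
    }
}

impl Drop for TrackedInput {
    fn drop(&mut self) {
        self.dropped.set(true);
    }
}

/// Stops calling the input once the limit is hit, but keeps owning it.
struct FlagLimit {
    input: TrackedInput,
    limit: usize,
    produced: usize,
}

impl Iterator for FlagLimit {
    type Item = u32;
    fn next(&mut self) -> Option<u32> {
        if self.produced == self.limit {
            return None; // input is not called, but it is still alive
        }
        let item = self.input.next()?;
        self.produced += 1;
        Some(item)
    }
}

/// Drops the input as soon as the limit is hit.
struct DroppingLimit {
    input: Option<TrackedInput>,
    limit: usize,
    produced: usize,
}

impl Iterator for DroppingLimit {
    type Item = u32;
    fn next(&mut self) -> Option<u32> {
        if self.produced >= self.limit {
            self.input = None; // covers a limit of zero
            return None;
        }
        let item = self.input.as_mut()?.next();
        match item {
            Some(v) => {
                self.produced += 1;
                if self.produced == self.limit {
                    self.input = None; // runs TrackedInput::drop right here
                }
                Some(v)
            }
            None => {
                self.input = None;
                None
            }
        }
    }
}

/// Drains both wrappers and reports, while the wrappers are still alive,
/// whether each one's input has been dropped: (flag approach, Option approach).
fn input_dropped_after_limit(limit: usize) -> (bool, bool) {
    let flag_dropped = Rc::new(Cell::new(false));
    let opt_dropped = Rc::new(Cell::new(false));
    let mut flag = FlagLimit {
        input: TrackedInput { remaining: 100, dropped: Rc::clone(&flag_dropped) },
        limit,
        produced: 0,
    };
    let mut opt = DroppingLimit {
        input: Some(TrackedInput { remaining: 100, dropped: Rc::clone(&opt_dropped) }),
        limit,
        produced: 0,
    };
    for _ in flag.by_ref() {}
    for _ in opt.by_ref() {}
    (flag_dropped.get(), opt_dropped.get())
}
```

With the flag approach the input only goes away when the wrapper itself is dropped, which for a query plan may be much later.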
f2632af to
99505b4
Compare
andygrove
left a comment
LGTM. This should have quite an impact on some of our benchmarks I imagine.
Why not just compare self.current_len == self.limit and short-circuit before polling the wrapped stream, instead of the Option plumbing?
My thinking was that the option plumbing actually drops the input, freeing its resources when the limit has been hit, rather than waiting for the execution to be complete.
Fair enough
Do we do this in other places too? Isn't a SendableRecordBatchStream a small struct?
"Isn't a SendableRecordBatchStream a small struct?"
It is a trait, so there are various things that implement it. Some, like the ParquetStream
Ok(Box::pin(ParquetStream {
As far as I understand it now, it can consist of the whole tree of dependent streams. Probably still not a big resource hog but more than a few bytes.
Imagine it is actually a subquery with a group by hash or join with a large hash table :) It may actually be hanging on to a substantial amount of memory I suspect
Ah, yeah that's right!
So I can understand from the code that it polls the inner stream once before discarding the result and returning None, but according to the Stream trait it should not be polled again at this point.
If this is occurring, there may be a more fundamental issue at play here that also needs fixing.
@tustvold my initial reading of […] I had a test (99505b4#diff-34dec6459ccea51c881a6ea392be9ad35f112395e6b8742df32a1742ac651e31L1799) that ran an entire query and I interpreted some […] I am confident that this is an improvement to […]
Is this necessary? It can be based on self.current_len == self.limit or otherwise a boolean like limit_exhausted?
Snap - #9926 (comment)
haha 👍 😆
Codecov Report
@@ Coverage Diff @@
## master #9926 +/- ##
==========================================
+ Coverage 82.70% 82.75% +0.04%
==========================================
Files 257 258 +1
Lines 60486 60620 +134
==========================================
+ Hits 50027 50167 +140
+ Misses 10459 10453 -6
99505b4 to
ad7c712
Compare
Any last thoughts on this PR?
Dandandan
left a comment
Looks good! (Small linting error)
d7cc9ca to
b923fdd
Compare
Thanks @Dandandan -- I fixed that up so hopefully this will be good to go tomorrow as well
Rationale
Once the number of rows needed for a limit query has been produced, any further work done to read values from its input is wasted.
The current implementation of LimitStream will keep polling its input for the next value, and returning Poll::Ready(None), even once the limit has been reached.
For queries like select * from foo limit 10 used for initial data exploration this is very wasteful.
Changes
This PR changes LimitStream so that it drops its input once the limit has been reached -- this both potentially frees resources (memory, file handles, etc.) and avoids unnecessary computation.
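The behaviour described above can be sketched with the standard library alone. This is not the code from the PR. MiniStream is a hypothetical, simplified stand-in for the futures Stream trait (no Pin), batches are represented only by their row counts, and LimitSketch, CountingInput and run_limit are made-up names for illustration.

```rust
use std::cell::Cell;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical, simplified stand-in for a stream trait (no Pin).
trait MiniStream {
    type Item;
    fn poll_next(&mut self, cx: &mut Context<'_>) -> Poll<Option<Self::Item>>;
}

/// Hypothetical input: always ready with a batch of `batch_rows` rows,
/// and counts how many times it has been polled.
struct CountingInput {
    batch_rows: usize,
    polls: Rc<Cell<usize>>,
}

impl MiniStream for CountingInput {
    type Item = usize;
    fn poll_next(&mut self, _cx: &mut Context<'_>) -> Poll<Option<usize>> {
        self.polls.set(self.polls.get() + 1);
        Poll::Ready(Some(self.batch_rows))
    }
}

/// Limit wrapper that holds its input in an Option and drops it at the limit.
struct LimitSketch<S> {
    input: Option<S>,
    limit: usize,
    current_len: usize,
}

impl<S> LimitSketch<S> {
    fn new(input: S, limit: usize) -> Self {
        // A limit of zero never needs the input at all.
        let input = if limit == 0 { None } else { Some(input) };
        LimitSketch { input, limit, current_len: 0 }
    }
}

impl<S: MiniStream<Item = usize>> MiniStream for LimitSketch<S> {
    type Item = usize;
    fn poll_next(&mut self, cx: &mut Context<'_>) -> Poll<Option<usize>> {
        let input = match self.input.as_mut() {
            Some(input) => input,
            None => return Poll::Ready(None), // input already dropped
        };
        match input.poll_next(cx) {
            Poll::Ready(Some(rows)) => {
                // Truncate the final batch so the total never exceeds the limit.
                let take = rows.min(self.limit - self.current_len);
                self.current_len += take;
                if self.current_len == self.limit {
                    self.input = None; // drop the input, freeing its resources
                }
                Poll::Ready(Some(take))
            }
            Poll::Ready(None) => {
                self.input = None;
                Poll::Ready(None)
            }
            Poll::Pending => Poll::Pending,
        }
    }
}

/// Waker that does nothing; enough to build a Context for manual polling.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls the limit wrapper `total_polls` times over 4-row batches and
/// returns (rows produced, number of times the input was polled).
fn run_limit(limit: usize, total_polls: usize) -> (usize, usize) {
    let polls = Rc::new(Cell::new(0));
    let input = CountingInput { batch_rows: 4, polls: Rc::clone(&polls) };
    let mut stream = LimitSketch::new(input, limit);
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut rows = 0;
    for _ in 0..total_polls {
        if let Poll::Ready(Some(n)) = stream.poll_next(&mut cx) {
            rows += n;
        }
    }
    (rows, polls.get())
}
```

In this sketch, a limit of 10 over 4-row batches polls the input three times; every later poll of the wrapper returns Poll::Ready(None) without touching the input.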