-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Refine the statistics estimation for the limit and aggregate operator #4716
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
714c5df to
d229c08
Compare
|
Hi @alamb, @Dandandan, @mingmwang, could you help review this PR? By the way, this PR should be merged after #4714, since it depends on the global sort algorithm selection. |
|
I believe this PR actually builds on #4714 (the idea of improving the statistics for limits and aggregates is a good one, 👍 ) |
d229c08 to
ecdc0e2
Compare
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks reasonable to me 👍
I think this PR needs some tests to verify the behavior (and make sure we don't break it by accident in a follow on PR)
|
Marking as draft so it is more clear the PR is awaiting some tests prior to merge |
ecdc0e2 to
e9b819f
Compare
|
Hi @alamb, is this PR ready for review and merge now? |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM -- thank you @yahoNanJing
|
Benchmark runs are scheduled for baseline = 03ef500 and contender = 8ab3a91. 8ab3a91 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #4715.
Rationale for this change
With these introduced row count info, for a SQL similar to the following pattern, the
JoinSelectionoptimizer rule will successfully be able to choose theCollectLeftpartition mode rather than thePartitioned, which reduces the query duration running on Ballista from 7.5s to 4.5s for a data set of 1.3 billion rows.What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?