-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-39678][SQL] Improve stats estimation for v2 tables #37083
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-39678][SQL] Improve stats estimation for v2 tables #37083
Conversation
wangyum
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you enable spark.sql.cbo.enabled to estimate row count?
Lines 33 to 40 in b1d719e
| def stats: Statistics = statsCache.getOrElse { | |
| if (conf.cboEnabled) { | |
| statsCache = Option(BasicStatsPlanVisitor.visit(self)) | |
| } else { | |
| statsCache = Option(SizeInBytesOnlyStatsPlanVisitor.visit(self)) | |
| } | |
| statsCache.get | |
| } |
Thanks @wangyum, I am aware of the alternate visitor we use with cbo. I raised this pr considering :
Are you recommending it's an expected behavior / by design ? |
6f683ef to
dcaebec
Compare
|
rebased and regenerated the golden files via :
|
dcaebec to
c5526c2
Compare
c5526c2 to
2175a1a
Compare
5e5e72c to
7d88612
Compare
|
I think it's by design. So enabling |
|
Thanks @wangyum !
I believe then setting
for my knowledge, can you please point me to some jira's ,happy to learn more. Love to know your thoughts on the same, Happy to close this as well if we consider this is not a problem at all. |
|
Can one of the admins verify this patch? |
|
I'm a bit confused. After this PR, what's the difference between |
BasicStatsPlanVisitor additionally takes has columnStats such as (NDV / NullCount / min / max etc) on estimation, which generally is not passed from DSv1 / Dsv2 relation itself. As per my understanding, prior to this PR, SizeInBytesOnlyStatsPlanVisitor was estimating stats on the subset of info i.e only sizeInBytes and BasicStatsPlanVisitor on all 3 info (sizeInBytes, rowcount,ColumStats (min /max /NDV etc), now via this PR SizeInBytesOnlyStatsPlanVisitor is estimating stats on the subset of info but this subset is now (sizeInBytes / rowCount) and BasicStatsPlanVisitor on all 3 info (sizeInBytes, rowcount,ColumStats (min /max /NDV etc). |
|
Maybe we should name them BTW, with CBO off, where do we use row count? |
we use it in places like : Lines 93 to 100 in 161c596
where we just multiply row-count with row size. We also use it for BF to create bloomFilterAgg. In v1 scenario in case of logical relation row-count can seep in from catalog stats but as you correctly pointed out it has a has a chance of Lines 54 to 58 in 161c596
|
|
OK I think the idea makes sense. With CBO off, the optimizer/planner only needs size in bytes, but row count is also an important statistics to estimate size in bytes, and should be propagated in the stats plan visitor. |
c21
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks singhpk234 for the work! Having some comments/questions.
...a/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/AdvancedStatsPlanVisitor.scala
Outdated
Show resolved
Hide resolved
|
|
||
| override def visitIntersect(p: Intersect): Statistics = fallback(p) | ||
|
|
||
| override def visitJoin(p: Join): Statistics = fallback(p) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we fallback here, but not use JoinEstimation.estimate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fallback here would endup calling BasicStatsPlanVisitor.visit(p) which, will in turn call BasicStatsPlanVisitor#visitJoin which will be JoinEstimation(p).estimate.getOrElse(default(p)). Hence added the same.
...a/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/AdvancedStatsPlanVisitor.scala
Outdated
Show resolved
Hide resolved
...cala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/BasicStatsPlanVisitor.scala
Outdated
Show resolved
Hide resolved
|
|
||
| // v2 sources can bubble-up rowCount, so always propagate. | ||
| // Don't propagate attributeStats, since they are not estimated here. | ||
| Statistics(sizeInBytes = sizeInBytes, rowCount = p.child.stats.rowCount) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am confused here. In the top-level comment - computes a single dimension for plan stats: size in bytes. But why we populate rowCount as well here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this estimator i.e visitUnaryNode, we adjust the size by scaling it by (input row size / output row size) but since we don't have much info (in terms of min / max / ndv etc) to estimate the row count we just say the node output's child output row count which is mostly true for operators like project etc.
Since we were just computing sizeInBytes and just propagating rowCounts as it is.
Appologies I forgot to update the comment as per proposed behaviour.
Should I rephrase it to:
estimates size in bytes, row count for plan stats
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for dsv2 sources rowCount can be passed from the relation itself without running analyze, hence BasicStatsPlanVisitor which will be our default now, post this change will take rowcount into consideration.
| +- ReusedExchange (28) | ||
| TakeOrderedAndProject (37) | ||
| +- * Project (36) | ||
| +- * SortMergeJoin Inner (35) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this expected ? looks like a plan regression to me
thought these stats are available in AQE and more accurate though |
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
We should propagate the row count stats in SizeInBytesOnlyStatsPlanVisitor if available. Row counts are propagated from connectors to spark in case of v2 tables.
Why are the changes needed?
This can improve stats estimation for v2 tables, since row count is used at places to estimate sizeInBytes.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Modified existing UT's to match the proposed behavior.