feat: do not fallback to Spark for COUNT(distinct) #2429

comphead merged 11 commits into apache:main

Conversation
Might be fixed by #2407. Running

I think it was fixed by several changes. #2407 is one, but I think the native shuffle rewrite and bumping the Arrow Java version contributed as well. It's encouraging to see!

I'm adding more tests to make sure it is working now.
```scala
    binding: Boolean,
    conf: SQLConf): Option[AggExpr] = {

    if (aggExpr.isDistinct) {
```
Don't we need to pass the aggExpr.isDistinct value into the protobuf plan?
This is a good point; I was thinking the same, but IMO Spark doesn't call count distinct in the partial phase.
```text
                 +------------------------------+
                 |            Driver            |
                 |     COUNT(DISTINCT name)     |
                 +--------------+---------------+
                                |
                                v
+--------------------------+        +--------------------------+
|        Executor 1        |        |        Executor 2        |
|     Partitions P0,P1     |        |     Partitions P2,P3     |
|     Local distinct:      |        |     Local distinct:      |
|     {Alice,Bob,Eve}      |        |  {Mallory,Eve,Bob,Trent} |
+------------+-------------+        +-------------+------------+
             |                                    |
             |              Shuffle               |
             v                                    v
+--------------------------+        +--------------------------+
|        Reducer R0        |        |        Reducer R1        |
|     {Alice,Bob,Eve}      |        |     {Mallory,Trent}      |
+------------+-------------+        +-------------+------------+
              \                                  /
               \                                /
                +--------------+---------------+
                               v
                      Driver Final Merge
                         DISTINCT = 5
```
The local distinct is performed by HashAggregate, so when count distinct is called as an aggExpr it may not need the flag, since the data has already been deduplicated by the time it reaches the reducers. Checking the Final stage though.
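The two-phase flow in the diagram can be sketched outside Spark (a minimal Scala sketch; the sample names and the two-reducer hash shuffle are assumptions mirroring the diagram, not Comet code):

```scala
object DistinctCountSketch extends App {
  // Partial phase: each executor dedupes its own partitions (HashAggregate).
  val executor1 = Seq("Alice", "Bob", "Eve", "Bob").toSet     // partitions P0,P1
  val executor2 = Seq("Mallory", "Eve", "Bob", "Trent").toSet // partitions P2,P3

  // Shuffle phase: hash-partition by name, so every occurrence of a given
  // name lands on the same reducer and cross-executor duplicates collapse.
  val reducers = (executor1.toSeq ++ executor2.toSeq)
    .groupBy(name => math.abs(name.hashCode) % 2)
    .map { case (r, names) => r -> names.toSet }

  // Final phase: COUNT just sums the per-reducer set sizes; no further
  // dedup is needed because the shuffle already co-located duplicates.
  val distinctCount = reducers.values.map(_.size).sum
  println(distinctCount) // 5
}
```

Because the shuffle routes each key to exactly one reducer, the final count never needs to re-deduplicate, which is why the partial-phase flag may be unnecessary.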
Thanks for explaining that. We should add tests for other distinct aggregates as well, such as sum and avg. I'm not sure if there are others?
Spark has tests for the following aggregates with DISTINCT:
- count
- sum
- avg
- first
- last
- corr
- var_pop
- var_samp
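A parameterized test over that list might look like the following (a hedged sketch: the query shape, the table name `t1`, and the commented-out `checkSparkAnswerAndOperator` helper are assumptions, not actual suite code):

```scala
object DistinctAggQueries extends App {
  // The aggregate list comes from Spark's own DISTINCT tests listed above.
  val distinctAggregates =
    Seq("count", "sum", "avg", "first", "last", "corr", "var_pop", "var_samp")

  // Note: corr(DISTINCT ...) actually takes two columns, so a real suite
  // would special-case it rather than use this single-column template.
  val queries = distinctAggregates.map(agg => s"SELECT $agg(DISTINCT c1) FROM t1")
  // queries.foreach(checkSparkAnswerAndOperator) // hypothetical comparison helper
  queries.foreach(println)
}
```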
For this PR, we could just remove the fallback for COUNT?
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```text
@@             Coverage Diff              @@
##              main    #2429      +/-   ##
============================================
+ Coverage    56.12%   58.48%    +2.35%
- Complexity     976     1440     +464
============================================
  Files          119      146      +27
  Lines        11743    13519    +1776
  Branches      2251     2352     +101
============================================
+ Hits          6591     7906    +1315
- Misses        4012     4379     +367
- Partials      1140     1234      +94
```

View full report in Codecov by Sentry.
Depends on #2258
```diff
@@ -26,8 +26,18 @@ class CometFuzzAggregateSuite extends CometFuzzTestBase {
   df.createOrReplaceTempView("t1")
   for (col <- df.columns) {
     val sql = s"SELECT count(distinct $col) FROM t1"
```
Could you also add tests for count distinct with multiple columns, e.g. COUNT(DISTINCT col1, col2, col3)?
Thanks @andygrove, I'll add them separately as well.
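For reference, SQL semantics for multi-column COUNT(DISTINCT ...) count distinct tuples and skip rows where any argument is NULL. A minimal sketch (the sample data is made up for illustration):

```scala
object MultiColumnDistinctSketch extends App {
  // COUNT(DISTINCT col1, col2) counts distinct (col1, col2) tuples and,
  // per SQL semantics, skips rows where any argument is NULL.
  val rows: Seq[(Option[String], Option[Int])] = Seq(
    (Some("a"), Some(1)),
    (Some("a"), Some(1)), // duplicate tuple, counted once
    (Some("a"), Some(2)),
    (None,      Some(3)), // NULL in col1: excluded
    (Some("b"), None)     // NULL in col2: excluded
  )
  val count = rows.collect { case (Some(c1), Some(c2)) => (c1, c2) }.distinct.size
  println(count) // 2
}
```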
Fuzz tests fall back to Spark when checking COUNT(distinct)
Can we add a case to CometAggregateBenchmark to tout our gains? :)
Added
mbutrovich left a comment
LGTM. I guess Comet already knew how to serde everything and it was just a matter of removing the fallback and beefing up tests. Thanks @comphead!
I also ran the benchmark locally: ~1.4x on most cases! 🚀
I found that we fall back to Spark when there are multiple count distinct expressions, e.g. This isn't an issue for this PR, but I thought I should make a note of it.
Filed #2456 |
* feat: do not fallback to Spark for distincts
Which issue does this PR close?
Just an experiment, related to #2292
Closes #.
Rationale for this change
What changes are included in this PR?
How are these changes tested?