Allow Spark partial / Comet final for compatible aggregates#2994
Allow Spark partial / Comet final for compatible aggregates#2994Shekharrajak wants to merge 14 commits intoapache:mainfrom
Conversation
f2e6748 to
51869b1
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #2994 +/- ##
============================================
- Coverage 56.12% 54.58% -1.54%
- Complexity 976 1256 +280
============================================
Files 119 167 +48
Lines 11743 15505 +3762
Branches 2251 2571 +320
============================================
+ Hits 6591 8464 +1873
- Misses 4012 5822 +1810
- Partials 1140 1219 +79 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
91e0de7 to
274f38b
Compare
|
Sorry for the late review @Shekharrajak. This LGTM except for the missing end-to-end tests for bitwise aggregates that @parthchandra already stated. I will go ahead and add those tests and push to this branch if permissions allow, or create a new branch from this one. |
Something is wrong with the git history on this branch so I cannot rebase or upmerge. @Shekharrajak let me know if you are still interested in working on this. If not, I will create a new PR based on your changes. |
|
Thanks for checking, Let me work on this. |
863ba03 to
141ba57
Compare
andygrove
left a comment
There was a problem hiding this comment.
LGTM pending CI. Thanks @Shekharrajak
|
@Shekharrajak golden files now need updating because more aggregates can run natively 🎉 |
4eaef6b to
a8d6c0f
Compare
|
Regenerated the golden files using |
|
This seems to have exposed a bug: |
|
Looks like this open ticket is connected to the failure #2837 - we have array of map in test case. Let me try fixing to accept the array of map groupingExpressions |
|
The previous check only rejected top-level MapType grouping expressions. Fix by recursively checking for MapType within ArrayType and StructType. |
ff7bcf9 to
b5e0c71
Compare
|
locally CometSqlFileTestSuite and AdaptiveQueryExecSuite tests all pass now. |
| })) { | ||
| def containsMapType(dt: DataType): Boolean = dt match { | ||
| case _: MapType => true | ||
| case a: ArrayType => containsMapType(a.elementType) |
There was a problem hiding this comment.
some queries have array of maps - recursively checking for MapType within ArrayType and StructType.
This ensures we correctly fall back to Spark for any grouping expression
that contains a MapType at any nesting level.
|
Looks like some Spark diffs now need updating: |
4eec927 to
34bee9b
Compare
34bee9b to
ba19751
Compare
|
Struggling a bit to make sure all checks are passing in github CI. Somehow I miss one or another tests locally when I run them to verify if everything is fine. |
|
Below tests assert on Spark-internal AQE optimization behavior (empty relation propagation, partition coalescing) that legitimately doesn't work when Comet's native operators are in the plan. IgnoreCometNativeDataFusion annotations skip these tests only in native_datafusion mode. SPARK-35442: Support propagate empty relation through aggregate |
8d4b8ae to
3089f35
Compare
3089f35 to
91684b2
Compare
…, STRUCT containing MAP)
…for all Spark versions
91684b2 to
74ddc04
Compare
|
CI check failure - debugging : df1: count(DISTINCT 2), count(DISTINCT 2, 3) | [1, 1] -- PASS Seems related to #3835 |
|
Found issue in CI checks : #3881 |
…egate-fallback # Conflicts: # dev/diffs/3.4.3.diff # dev/diffs/4.0.1.diff # spark/src/test/scala/org/apache/comet/exec/CometExecSuite.scala
85872c9 to
a660456
Compare
|
Looks fine, now. |
|
There is a newer PR #4015 which was partly inspired by this one, so will close this. Thanks again @Shekharrajak for working on this. |
Which issue does this PR close?
Closes #2894
Rationale for this change
Comet currently falls back to Spark for ALL final hash aggregates when there's no Comet partial aggregate in the child plan. This is overly conservative because some aggregates have compatible intermediate buffer formats between Spark and Comet.
For example, MIN, MAX, COUNT, and bitwise aggregates (BIT_AND, BIT_OR, BIT_XOR) have simple intermediate buffers (single value) that are compatible between Spark and Comet. These can safely run with "Spark partial / Comet final" execution.
Other aggregates like SUM, AVG, VARIANCE, etc. have known incompatibilities (e.g., decimal overflow handling differences, complex intermediate buffers) and should continue to fall back when there's no Comet partial aggregate.
What changes are included in this PR?
Added supportsSparkPartialCometFinal method to CometAggregateExpressionSerde trait - Default is false
Added helper function - aggSupportsMixedExecution() in QueryPlanSerde
How are these changes tested?
"CometExecRule should not allow Spark partial and Comet final for unsafe aggregates" - Verifies SUM still falls back to Spark
"CometExecRule should allow Spark partial and Comet final for safe aggregates" - Verifies MIN/MAX/COUNT can use Comet final with Spark partial