perf: Improve count aggregate performance #784
Merged
andygrove merged 10 commits into apache:main on Aug 6, 2024
Average of 3 runs, main branch versus this PR. This shows a 15.5% speedup. Command used for both runs:
$SPARK_HOME/bin/spark-submit \
--master $SPARK_MASTER \
--conf spark.driver.memory=8G \
--conf spark.executor.instances=1 \
--conf spark.executor.memory=32G \
--conf spark.executor.cores=8 \
--conf spark.cores.max=8 \
--conf spark.eventLog.enabled=true \
--jars $COMET_JAR \
--conf spark.driver.extraClassPath=$COMET_JAR \
--conf spark.executor.extraClassPath=$COMET_JAR \
--conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
--conf spark.comet.enabled=true \
--conf spark.comet.exec.enabled=true \
--conf spark.comet.exec.all.enabled=true \
--conf spark.comet.cast.allowIncompatible=true \
--conf spark.comet.shuffle.enforceMode.enabled=true \
--conf spark.comet.exec.shuffle.enabled=true \
--conf spark.comet.exec.shuffle.mode=auto \
--conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
tpcbench.py \
--benchmark tpcds \
--data /mnt/bigdata/tpcds/sf100/ \
--queries ../../tpcds/queries-spark \
--iterations 3
huaxingao approved these changes on Aug 6, 2024
huaxingao (Contributor) left a comment:
LGTM. Thanks for the PR @andygrove
viirya reviewed on Aug 6, 2024
    .iter()
    .map(|child| self.create_expr(child, schema.clone()))
    .collect::<Result<Vec<_>, _>>()?;
if expr.children.iter().len() == 1 {
viirya (Member) commented:
Hmm, I think we can also do this for multiple child expressions?
andygrove (Member, Author) replied:
Thanks. I have extended this approach for the multiple argument case.
viirya approved these changes on Aug 6, 2024
viirya (Member) left a comment:
Looks okay. Actually, this is how Spark's count works internally:
/* count = */ If(nullableChildren.map(IsNull).reduce(Or), count, count + 1L)
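The multiple-argument rule in that Spark expression (a row is counted only when none of the arguments is null) can be sketched in standalone Rust. This is an illustration only, not the Comet or DataFusion code from this PR: the function name `count_all_non_null` and the column representation as `Vec<Option<i64>>` are assumptions made for the example.

```rust
/// Counts rows in which every argument is non-null, mirroring the shape of
/// `If(children.map(IsNull).reduce(Or), count, count + 1)`.
///
/// `columns` holds one vector per COUNT argument (hypothetical layout);
/// all vectors are assumed to have the same length.
fn count_all_non_null(columns: &[Vec<Option<i64>>]) -> i64 {
    // With no arguments there are no rows to inspect in this sketch.
    let num_rows = columns.first().map_or(0, |col| col.len());
    (0..num_rows)
        .map(|row| {
            // OR together the IS NULL checks across all arguments.
            let any_null = columns.iter().any(|col| col[row].is_none());
            // Add 0 when any argument is null, otherwise add 1.
            if any_null { 0_i64 } else { 1_i64 }
        })
        .sum()
}
```

With a single column this reduces to the single-argument case, which is why one code path can serve both.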
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request on Sep 7, 2024
* Workaround for COUNT performance
* add comments
* remove benchmark results
* fix regression
* revert change to datafusion version
* Revert change to Cargo.lock
* fix
* unify code for single and multiple arguments
* clippy


Which issue does this PR close?
Closes #744
Rationale for this change
For some reason, COUNT is really slow when used from Comet, but SUM is fast, so let's translate COUNT(expr) to SUM(IF(expr IS NULL, 0, 1)) until we can get to the bottom of the real issue.
edit: It turns out that Spark also implements COUNT this way, so I think this closes the issue.
What changes are included in this PR?
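As an illustration of the rewrite described in the rationale, the equivalence between COUNT(expr) and SUM(IF(expr IS NULL, 0, 1)) can be checked with a small standalone Rust sketch. It is not the code changed in this PR; `count_direct` and `count_via_sum` are hypothetical names, and a nullable column is modelled as a slice of `Option<i64>`.

```rust
/// COUNT(expr): the number of non-null values.
fn count_direct(values: &[Option<i64>]) -> i64 {
    values.iter().filter(|v| v.is_some()).count() as i64
}

/// SUM(IF(expr IS NULL, 0, 1)): each row contributes 0 or 1 to a sum.
fn count_via_sum(values: &[Option<i64>]) -> i64 {
    values
        .iter()
        .map(|v| if v.is_none() { 0_i64 } else { 1_i64 })
        .sum()
}
```

Both functions return the same result for any input, including an empty column, where each yields 0.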
How are these changes tested?