Skip to content

perf: Add COMET_RESPECT_PARQUET_FILTER_PUSHDOWN config#1936

Merged
andygrove merged 14 commits intoapache:mainfrom
andygrove:parquet-pushdown-config
Jun 27, 2025
Merged

perf: Add COMET_RESPECT_PARQUET_FILTER_PUSHDOWN config#1936
andygrove merged 14 commits intoapache:mainfrom
andygrove:parquet-pushdown-config

Conversation

@andygrove
Copy link
Copy Markdown
Member

@andygrove andygrove commented Jun 26, 2025

Which issue does this PR close?

N/A

Rationale for this change

The new native scans perform poorly when Parquet filter pushdown is enabled, which is the default in Spark.

See apache/datafusion#3463 for reasons why filter pushdown is not enabled in DataFusion by default yet.

What changes are included in this PR?

Add a new config that tells Comet whether to respect Spark's filter pushdown config. We need to respect the config when running Spark SQL tests, but want to ignore the config by default for best performance.

How are these changes tested?

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Jun 26, 2025

Codecov Report

Attention: Patch coverage is 69.23077% with 4 lines in your changes missing coverage. Please review.

Project coverage is 33.43%. Comparing base (f09f8af) to head (e7b2a58).
Report is 291 commits behind head on main.

Files with missing lines Patch % Lines
.../scala/org/apache/comet/serde/QueryPlanSerde.scala 57.14% 1 Missing and 2 partials ⚠️
.../apache/comet/parquet/CometParquetFileFormat.scala 50.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##               main    #1936       +/-   ##
=============================================
- Coverage     56.12%   33.43%   -22.70%     
+ Complexity      976      804      -172     
=============================================
  Files           119      131       +12     
  Lines         11743    12917     +1174     
  Branches       2251     2402      +151     
=============================================
- Hits           6591     4319     -2272     
- Misses         4012     7660     +3648     
+ Partials       1140      938      -202     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@andygrove andygrove changed the title perf: Add COMET_RESPECT_PARQUET_FILTER_PUSHDOWN_ENABLED config perf: Add COMET_RESPECT_PARQUET_FILTER_PUSHDOWN config Jun 26, 2025
@andygrove andygrove marked this pull request as ready for review June 26, 2025 23:07
Comment thread common/src/main/scala/org/apache/comet/CometConf.scala Outdated
Comment thread docs/source/user-guide/configs.md Outdated
Copy link
Copy Markdown
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andygrove I think its lgtm, although it might be confusing having a parameter that enables another parameter 🤔

@andygrove
Copy link
Copy Markdown
Member Author

Thanks @andygrove I think its lgtm, although it might be confusing having a parameter that enables another parameter 🤔

Yeah, I know. The alternative is to ask users to disable the Spark config, but I'm assuming that most users won't read the documentation to discover that this is needed for good performance.

Copy link
Copy Markdown
Contributor

@parthchandra parthchandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

private val datetimeRebaseModeInRead = options.datetimeRebaseModeInRead
private val parquetFilterPushDown = sqlConf.parquetFilterPushDown
private val parquetFilterPushDown = sqlConf.parquetFilterPushDown &&
CometConf.COMET_RESPECT_PARQUET_FILTER_PUSHDOWN.get(sqlConf)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not necessary right now because this is part of DSV2 support and the new native scan impls do not support DSV2.
Though we might add it for native_iceberg_compat

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I reverted this change.

Comment thread dev/diffs/3.4.3.diff
+ conf
+ .set("spark.sql.extensions", "org.apache.comet.CometSparkSessionExtensions")
+ .set("spark.comet.enabled", "true")
+ .set("spark.comet.parquet.respectFilterPushdown", "true")
Copy link
Copy Markdown
Contributor

@kazuyukitanimura kazuyukitanimura Jun 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add .set("spark.comet.parquet.respectFilterPushdown", "true") at a few more locations?
E.g. TestHive.scala
Could be other locations as well

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the Spark SQL tests are passing.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked, and there are no hive tests that reference PARQUET_FILTER_PUSHDOWN_ENABLED.

@andygrove andygrove merged commit 469ee6e into apache:main Jun 27, 2025
96 checks passed
@andygrove andygrove deleted the parquet-pushdown-config branch June 27, 2025 20:59
@andygrove
Copy link
Copy Markdown
Member Author

Thanks for the reviews @kazuyukitanimura @parthchandra @comphead

coderfender pushed a commit to coderfender/datafusion-comet that referenced this pull request Dec 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants