perf: Add COMET_RESPECT_PARQUET_FILTER_PUSHDOWN config #1936
andygrove merged 14 commits into apache:main

Conversation
Codecov Report — Attention: Patch coverage is

Additional details and impacted files:

```
@@              Coverage Diff              @@
##               main    #1936       +/-   ##
=============================================
- Coverage     56.12%   33.43%    -22.70%
+ Complexity      976      804       -172
=============================================
  Files           119      131        +12
  Lines         11743    12917      +1174
  Branches       2251     2402       +151
=============================================
- Hits           6591     4319      -2272
- Misses         4012     7660      +3648
+ Partials       1140      938       -202
```
Title changed: COMET_RESPECT_PARQUET_FILTER_PUSHDOWN_ENABLED config → COMET_RESPECT_PARQUET_FILTER_PUSHDOWN config
comphead left a comment:
Thanks @andygrove, I think it's LGTM, although it might be confusing having a parameter that enables another parameter 🤔

Yeah, I know. The alternative is to ask users to disable the Spark config, but I'm assuming that most users won't read the documentation to discover that this is needed for good performance.
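To make the tradeoff concrete, here is a rough sketch of the two approaches as session-level settings. The Comet key is the one added in this PR; treating its default as disabled is an assumption based on the rationale above.

```scala
// Alternative rejected in the PR: ask users to turn off Spark's own pushdown.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")

// Approach taken in the PR: Comet ignores Spark's pushdown setting unless this
// new config explicitly tells it to respect it (e.g. when running Spark SQL tests).
spark.conf.set("spark.comet.parquet.respectFilterPushdown", "true")
```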
```diff
  private val datetimeRebaseModeInRead = options.datetimeRebaseModeInRead
- private val parquetFilterPushDown = sqlConf.parquetFilterPushDown
+ private val parquetFilterPushDown = sqlConf.parquetFilterPushDown &&
+   CometConf.COMET_RESPECT_PARQUET_FILTER_PUSHDOWN.get(sqlConf)
```
This is not necessary right now because this file is part of DSV2 support, and the new native scan impls do not support DSV2. Though we might add it for native_iceberg_compat.
Thanks. I reverted this change.
Co-authored-by: Oleks V <comphead@users.noreply.github.com>
```diff
+ conf
+   .set("spark.sql.extensions", "org.apache.comet.CometSparkSessionExtensions")
+   .set("spark.comet.enabled", "true")
+   .set("spark.comet.parquet.respectFilterPushdown", "true")
```
Do we need to add `.set("spark.comet.parquet.respectFilterPushdown", "true")` at a few more locations, e.g. TestHive.scala? There could be other locations as well.
All of the Spark SQL tests are passing.
I checked, and there are no hive tests that reference PARQUET_FILTER_PUSHDOWN_ENABLED.

Thanks for the reviews @kazuyukitanimura @parthchandra @comphead
Which issue does this PR close?
N/A
Rationale for this change
The new native scans perform poorly when Parquet filter pushdown is enabled, which is the default in Spark.
See apache/datafusion#3463 for reasons why filter pushdown is not enabled in DataFusion by default yet.
What changes are included in this PR?
Add a new config that tells Comet whether to respect Spark's filter pushdown config. We need to respect the config when running Spark SQL tests, but want to ignore the config by default for best performance.
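As a rough illustration, a boolean config entry of this kind is typically declared along the following lines. This is a sketch only: the builder method names and doc text are assumed (modeled on Spark's ConfigBuilder style), not taken from the actual diff.

```scala
// Hypothetical sketch of the new config declaration; only the key name
// "spark.comet.parquet.respectFilterPushdown" comes from this PR.
val COMET_RESPECT_PARQUET_FILTER_PUSHDOWN: ConfigEntry[Boolean] =
  conf("spark.comet.parquet.respectFilterPushdown")
    .doc("Whether Comet should respect Spark's spark.sql.parquet.filterPushdown " +
      "setting when using the native scans. Disabled by default for best performance.")
    .booleanConf
    .createWithDefault(false)
```

The default of `false` matches the stated goal: ignore Spark's pushdown config unless a user (or the Spark SQL test harness) opts in.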
How are these changes tested?