feat: Change default value of COMET_NATIVE_SCAN_IMPL to auto #1933

andygrove merged 22 commits into apache:main

Conversation
This reverts commit 38d6643.
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #1933      +/-   ##
============================================
+ Coverage     56.12%   58.47%    +2.35%
- Complexity      976     1144      +168
============================================
  Files           119      131       +12
  Lines         11743    12909     +1166
  Branches       2251     2399      +148
============================================
+ Hits           6591     7549      +958
- Misses         4012     4136      +124
- Partials       1140     1224       +84
```

☔ View full report in Codecov by Sentry.
```scala
// native_iceberg_compat only supports local filesystem and S3
if (!scanExec.relation.inputFiles
    .forall(path => path.startsWith("file://") || path.startsWith("s3a://"))) {
```
S3AFileSystem, used by the HadoopFileIO class in Iceberg, recognizes the s3a scheme.
However, there is an S3FileIO Iceberg class that recognizes s3, s3a, and s3n, so we might have to handle more schemes in the future.
It also supports HDFS if the feature is enabled
I wouldn't bother with s3:// and s3n:// urls. Those are defunct afaik.
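The scheme discussion above suggests the check could be generalized to a configurable set rather than two hard-coded prefixes. A minimal sketch, assuming a hypothetical `SchemeCheck` helper (this object and its names are not part of the PR; the real check lives inline in the scan rule):

```scala
// Hypothetical sketch: generalize the inputFiles check to a set of schemes,
// so additional schemes (e.g. hdfs, or s3/s3n if ever needed) can be added
// in one place. Not the actual Comet implementation.
object SchemeCheck {
  // Schemes the PR's check accepts today for native_iceberg_compat.
  val supportedSchemes: Set[String] = Set("file", "s3a")

  // True only if every input file uses a supported scheme.
  def allInputsSupported(inputFiles: Seq[String]): Boolean =
    inputFiles.forall { path =>
      supportedSchemes.exists(scheme => path.startsWith(scheme + "://"))
    }
}
```

With this shape, adding HDFS support would be a one-line change to `supportedSchemes` rather than an edit to the predicate itself.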
```scala
if (CometSparkSessionExtensions.isSpark40Plus) {
  fallbackReasons += s"$SCAN_NATIVE_ICEBERG_COMPAT is not implemented for Spark 4.0.0"
}
```
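Checks like the two snippets above (supported filesystem schemes, Spark version) are what let an `auto` setting decide between scans. A purely illustrative sketch of that decision, with hypothetical names (`AutoScanSelection`, `selectScan`) that do not appear in the PR:

```scala
// Illustrative only: a simplified model of how "auto" might choose a scan.
// The real selection logic in Comet considers more conditions than this.
object AutoScanSelection {
  sealed trait ScanImpl
  case object NativeComet extends ScanImpl
  case object NativeIcebergCompat extends ScanImpl

  // Prefer native_iceberg_compat when the inputs qualify; otherwise fall
  // back to the stable native_comet scan, mirroring the fallback checks above.
  def selectScan(inputFiles: Seq[String], isSpark40Plus: Boolean): ScanImpl = {
    val schemesOk =
      inputFiles.forall(p => p.startsWith("file://") || p.startsWith("s3a://"))
    if (schemesOk && !isSpark40Plus) NativeIcebergCompat else NativeComet
  }
}
```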
Wondering if it is a good idea to change the default this close to a release. It might be safer to change it at the beginning of a release cycle, perhaps?
If anyone runs into issues, they can still specify a particular scan implementation instead.
```yaml
run: |
  cd apache-spark
  rm -rf /root/.m2/repository/org/apache/parquet # somehow parquet cache requires cleanups
  ENABLE_COMET=true ENABLE_COMET_SHUFFLE=true COMET_PARQUET_SCAN_IMPL=auto build/sbt -Dsbt.log.noformat=true ${{ matrix.module.args1 }} "${{ matrix.module.args2 }}"
```
What about repurposing this test for COMET_PARQUET_SCAN_IMPL=native_comet?
Yes, that's a good idea.
I enabled the native_comet tests in spark_sql_test.yaml, alongside the auto and native_iceberg_compat tests.
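One way to cover all three implementations in one workflow is a matrix dimension over the scan setting. The following is a hypothetical sketch only; the actual spark_sql_test.yaml may be structured differently, and the `scan-impl` key is an assumed name:

```yaml
# Hypothetical CI matrix sketch (not the actual workflow file): run the same
# Spark SQL test job once per scan implementation.
strategy:
  matrix:
    scan-impl: [native_comet, auto, native_iceberg_compat]
steps:
  - name: Run Spark SQL tests
    run: |
      cd apache-spark
      ENABLE_COMET=true ENABLE_COMET_SHUFFLE=true \
        COMET_PARQUET_SCAN_IMPL=${{ matrix.scan-impl }} \
        build/sbt -Dsbt.log.noformat=true ${{ matrix.module.args1 }} "${{ matrix.module.args2 }}"
```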
kazuyukitanimura left a comment:

Thanks @andygrove, pending with CI
Okay, tried this out with a few test queries and real-world data and everything worked okay, so I feel more confident that this change is safe.
Which issue does this PR close?
Closes #1881
Rationale for this change
With this change, most end users no longer need to be aware of native_comet, native_datafusion, or native_iceberg_compat scans and what each of them supports. Comet will just pick the best scan for the job. If we hit any issues with this approach then we can still ask users to specify a specific scan to use.

What changes are included in this PR?
How are these changes tested?