
feat: Change default value of COMET_NATIVE_SCAN_IMPL to auto#1933

Merged
andygrove merged 22 commits into apache:main from andygrove:auto-scan
Jun 28, 2025

Conversation

@andygrove (Member) commented Jun 25, 2025

Which issue does this PR close?

Closes #1881

Rationale for this change

With this change, most end users no longer need to be aware of native_comet, native_datafusion, or native_iceberg_compat scans and what each of them supports. Comet will just pick the best scan for the job. If we hit any issues with this approach then we can still ask users to specify a specific scan to use.

What changes are included in this PR?

How are these changes tested?

@codecov-commenter commented Jun 25, 2025

Codecov Report

❌ Patch coverage is 66.66667% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 58.47%. Comparing base (f09f8af) to head (cca388d).
⚠️ Report is 1150 commits behind head on main.

Files with missing lines | Patch % | Lines
...n/scala/org/apache/comet/rules/CometScanRule.scala | 50.00% | 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1933      +/-   ##
============================================
+ Coverage     56.12%   58.47%   +2.35%     
- Complexity      976     1144     +168     
============================================
  Files           119      131      +12     
  Lines         11743    12909    +1166     
  Branches       2251     2399     +148     
============================================
+ Hits           6591     7549     +958     
- Misses         4012     4136     +124     
- Partials       1140     1224      +84     

☔ View full report in Codecov by Sentry.


// native_iceberg_compat only supports local filesystem and S3
if (!scanExec.relation.inputFiles
    .forall(path => path.startsWith("file://") || path.startsWith("s3a://"))) {
@hsiang-c (Contributor) commented Jun 26, 2025


S3AFileSystem, used by the HadoopFileIO class in Iceberg, recognizes the s3a scheme.

However, there is also an S3FileIO Iceberg class that recognizes s3, s3a, and s3n. We might have to handle more schemes in the future.

Contributor

It also supports HDFS if the feature is enabled

Contributor

I wouldn't bother with s3:// and s3n:// urls. Those are defunct afaik.
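The check under review is plain string-prefix matching on the scan's input paths. As a language-neutral illustration (Python here, not Comet's actual Scala code; the function and constant names are invented for this sketch), the logic and where the extra schemes raised in this thread would slot in look like:

```python
# Illustrative sketch only -- not Comet's code. Mirrors the prefix check in
# CometScanRule; extra schemes (s3, s3n, hdfs) discussed in this thread would
# be added to the tuple if they were ever supported.
SUPPORTED_SCHEMES = ("file://", "s3a://")  # what this PR accepts today

def all_files_supported(input_files):
    """True only if every input path uses a scheme the scan can handle."""
    # str.startswith accepts a tuple, so adding a scheme is a one-line change.
    return all(path.startswith(SUPPORTED_SCHEMES) for path in input_files)
```

With this shape, adding hdfs:// support later (possible when the feature is enabled, per the comment above) is a data change rather than a logic change.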

Comment on lines +264 to +266
if (CometSparkSessionExtensions.isSpark40Plus) {
  fallbackReasons += s"$SCAN_NATIVE_ICEBERG_COMPAT is not implemented for Spark 4.0.0"
}
Member Author

We can revisit this after #1830 is merged

@parthchandra (Contributor)

Wondering if it is a good idea to change the default this close to a release. It might be safer to change it at the beginning of a release cycle, perhaps?

@andygrove andygrove marked this pull request as ready for review June 27, 2025 18:18
@andygrove (Member Author)

> Wondering if it is a good idea to change the default this close to a release. It might be safer to change it at the beginning of a release cycle, perhaps?

If anyone runs into issues, they can specify spark.comet.scan.impl=native_comet to revert to the previous behavior.

The benefit of enabling auto as the default is that we now get complex type support by default when reading Parquet, as well as much improved performance.
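For anyone who does hit a regression, a hedged example of pinning the previous scan at submit time (the plugin class is Comet's documented entry point; my-app.jar is a placeholder for your application):

```shell
# Revert to the pre-auto behavior by setting the scan implementation
# explicitly; "my-app.jar" stands in for your own application jar.
spark-submit \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.comet.scan.impl=native_comet \
  my-app.jar
```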

run: |
  cd apache-spark
  rm -rf /root/.m2/repository/org/apache/parquet # somehow parquet cache requires cleanups
  ENABLE_COMET=true ENABLE_COMET_SHUFFLE=true COMET_PARQUET_SCAN_IMPL=auto build/sbt -Dsbt.log.noformat=true ${{ matrix.module.args1 }} "${{ matrix.module.args2 }}"
Contributor

What about repurposing this test for COMET_PARQUET_SCAN_IMPL=native_comet?

Member Author

Yes, that's a good idea.

Member Author

I enabled the native_comet tests in spark_sql_test.yaml, alongside the auto and native_iceberg_compat tests.

@kazuyukitanimura (Contributor) left a comment

Thanks @andygrove
pending with CI

@parthchandra (Contributor)

Okay, tried this out with a few test queries and real world data and everything worked okay, so I feel more confident that this change is safe.

@andygrove andygrove merged commit 06ed88b into apache:main Jun 28, 2025
122 of 124 checks passed
@andygrove andygrove deleted the auto-scan branch July 10, 2025 22:07
coderfender pushed a commit to coderfender/datafusion-comet that referenced this pull request Dec 13, 2025

Successfully merging this pull request may close these issues.

Comet should choose the most suitable Parquet scan implementation automatically

6 participants