fix: Support auto scan mode with Spark 4.0.0 #1975

Merged: andygrove merged 31 commits into apache:main from andygrove:auto-scan-4.0.0 on Aug 26, 2025
Conversation

@andygrove (Member) commented Jul 1, 2025

Which issue does this PR close?

Closes #1967

Rationale for this change

Provide better performance with Spark 4.0.0 by supporting auto scan mode. This also increases test coverage of native_iceberg_compat, which is intended to eventually replace native_comet completely.

What changes are included in this PR?

How are these changes tested?

@codecov-commenter commented Jul 1, 2025

Codecov Report

❌ Patch coverage is 16.66667% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.53%. Comparing base (f09f8af) to head (df9d42f).
⚠️ Report is 410 commits behind head on main.

Files with missing lines Patch % Lines
...n/scala/org/apache/comet/rules/CometScanRule.scala 15.38% 6 Missing and 5 partials ⚠️
.../scala/org/apache/spark/sql/comet/util/Utils.scala 20.00% 4 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1975      +/-   ##
============================================
+ Coverage     56.12%   58.53%   +2.40%     
- Complexity      976     1282     +306     
============================================
  Files           119      143      +24     
  Lines         11743    13254    +1511     
  Branches       2251     2364     +113     
============================================
+ Hits           6591     7758    +1167     
- Misses         4012     4264     +252     
- Partials       1140     1232      +92     

☔ View full report in Codecov by Sentry.

@andygrove changed the title from "fix: [ignore] Remove auto scan fallback for Spark 4.0.0" to "fix: Support auto scan mode with Spark 4.0.0" on Jul 2, 2025
@andygrove (Member, Author):

hive-1 failure to investigate:

2025-07-09T20:18:09.4999985Z [info]   Cause: org.apache.comet.CometNativeException: External error: Arrow: Parquet argument error: Parquet error: encountered non UTF-8 data
2025-07-09T20:18:09.5000678Z [info]   at org.apache.comet.parquet.Native.readNextRecordBatch(Native Method)
2025-07-09T20:18:09.5001239Z [info]   at org.apache.comet.parquet.NativeBatchReader.loadNextBatch(NativeBatchReader.java:812)

@andygrove (Member, Author):

Current failures:

core-1:

$ grep "* FAIL" core1.txt 
2025-08-21T02:32:47.7183506Z [info] - SPARK-26677: negated null-safe equality comparison should not filter matched row groups *** FAILED *** (303 milliseconds)
2025-08-21T02:52:05.6178387Z [info] - scalar types rebuild *** FAILED *** (637 milliseconds)
2025-08-21T02:52:05.9761617Z [info] - object rebuild *** FAILED *** (357 milliseconds)
2025-08-21T02:52:06.3954010Z [info] - array rebuild *** FAILED *** (417 milliseconds)
2025-08-21T02:52:07.4434260Z [info] - malformed input *** FAILED *** (1 second, 40 milliseconds)
2025-08-21T02:52:07.6507720Z [info] - extract from shredded object *** FAILED *** (206 milliseconds)
2025-08-21T02:52:07.8370529Z [info] - extract from shredded array *** FAILED *** (182 milliseconds)
2025-08-21T02:52:08.0028967Z [info] - missing fields *** FAILED *** (166 milliseconds)
2025-08-21T02:52:08.1324041Z [info] - custom casts *** FAILED *** (129 milliseconds)

hive-1:

$ grep "* FAIL" hive1.txt 
2025-08-21T02:28:26.0691331Z [info] - SPARK-30201 HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT *** FAILED *** (375 milliseconds)

- case StringType => ArrowType.Utf8.INSTANCE
+ case _: StringType => ArrowType.Utf8.INSTANCE
+ case dt if isStringCollationType(dt) =>
+   // TODO collation information is lost with this transformation
Contributor:

should we have a ticket?

@andygrove (Member, Author):

I removed this comment because Spark also does not track this information, as pointed out in #2190 (comment) by @parthchandra
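
The review thread above concerns the change from an object pattern to a type pattern on StringType. A minimal, self-contained sketch of why that change was needed, using hypothetical stand-in classes (Spark's real collation-aware StringType lives in org.apache.spark.sql.types and is shaped differently): in Spark 4.0.0, StringType carries a collation, so matching on the StringType companion object only catches the default instance, while a type pattern catches every collation.

```scala
// Hypothetical stand-ins for Spark 4.0's collation-aware StringType;
// illustration only, not Spark's actual class definitions.
class StringType(val collationId: Int)
object StringType extends StringType(0) // default UTF8_BINARY collation

def toArrowTypeName(dt: Any): String = dt match {
  // Stable-identifier pattern: compares with == against the companion
  // object, so it matches only the default singleton.
  case StringType    => "Utf8 (default collation)"
  // Type pattern: matches any StringType instance, whatever its collation.
  // Note the collation id is not inspected here, so that information is lost.
  case _: StringType => "Utf8 (collated)"
  case _             => "unsupported"
}

println(toArrowTypeName(StringType))        // hits the object pattern
println(toArrowTypeName(new StringType(1))) // hits the type pattern
```

With the pre-4.0 style `case StringType =>`, a non-default collated string would fall through to the catch-all and be reported unsupported, which is the failure mode the PR fixes.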

typeChecker.isSchemaSupported(partitionSchema, fallbackReasons)

def hasMapsContainingStructs(dataType: DataType): Boolean = {
  def isComplexType(dt: DataType): Boolean = dt match {
Contributor:

The isComplexType method now exists in the DataTypeSupport object.

@andygrove (Member, Author):

Thanks. Updated.
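
A self-contained sketch of the kind of recursive schema walk discussed above, using a hypothetical miniature DataType ADT (the real code operates on org.apache.spark.sql.types, and the real isComplexType lives in the DataTypeSupport object): the check looks for a map anywhere in the schema whose key or value is, or contains, a struct.

```scala
// Hypothetical miniature stand-in for Spark's DataType hierarchy;
// illustration only.
sealed trait DataType
case object IntType extends DataType
case object StrType extends DataType
case class StructType(fields: Seq[DataType]) extends DataType
case class ArrayType(element: DataType) extends DataType
case class MapType(key: DataType, value: DataType) extends DataType

// True if the type is, or transitively contains, a struct.
def containsStruct(dt: DataType): Boolean = dt match {
  case _: StructType    => true
  case ArrayType(e)     => containsStruct(e)
  case MapType(k, v)    => containsStruct(k) || containsStruct(v)
  case _                => false
}

// Walk the schema; flag any map whose key or value involves a struct.
def hasMapsContainingStructs(dt: DataType): Boolean = dt match {
  case StructType(fields) => fields.exists(hasMapsContainingStructs)
  case ArrayType(e)       => hasMapsContainingStructs(e)
  case MapType(k, v)      => containsStruct(k) || containsStruct(v)
  case _                  => false
}
```

For example, `MapType(IntType, StructType(Seq(IntType)))` would be flagged, while `ArrayType(IntType)` would not; a scan rule can use such a predicate to fall back to Spark for schemas a native reader cannot handle.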

@comphead (Contributor) left a comment:

lgtm thanks @andygrove

@andygrove andygrove merged commit 1b344de into apache:main Aug 26, 2025
174 of 177 checks passed
@andygrove andygrove deleted the auto-scan-4.0.0 branch August 26, 2025 00:12
coderfender pushed a commit to coderfender/datafusion-comet that referenced this pull request Dec 13, 2025
Development

Successfully merging this pull request may close these issues.

Enable auto scan mode for Spark 4.0.0

4 participants