fix: Support auto scan mode with Spark 4.0.0 #1975

Merged: andygrove merged 31 commits into apache:main from andygrove:auto-scan-4.0.0 on Aug 26, 2025
Conversation

@andygrove (Member) commented Jul 1, 2025

Which issue does this PR close?

Closes #1967

Rationale for this change

Provide better performance with Spark 4.0.0 by supporting auto scan mode. This also increases test coverage of native_iceberg_compat, which is intended to eventually replace native_comet completely.

What changes are included in this PR?

How are these changes tested?

@codecov-commenter commented Jul 1, 2025

Codecov Report

❌ Patch coverage is 16.66667% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.53%. Comparing base (f09f8af) to head (df9d42f).
⚠️ Report is 410 commits behind head on main.

Files with missing lines Patch % Lines
...n/scala/org/apache/comet/rules/CometScanRule.scala 15.38% 6 Missing and 5 partials ⚠️
.../scala/org/apache/spark/sql/comet/util/Utils.scala 20.00% 4 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1975      +/-   ##
============================================
+ Coverage     56.12%   58.53%   +2.40%     
- Complexity      976     1282     +306     
============================================
  Files           119      143      +24     
  Lines         11743    13254    +1511     
  Branches       2251     2364     +113     
============================================
+ Hits           6591     7758    +1167     
- Misses         4012     4264     +252     
- Partials       1140     1232      +92     

☔ View full report in Codecov by Sentry.

@andygrove changed the title from "fix: [ignore] Remove auto scan fallback for Spark 4.0.0" to "fix: Support auto scan mode with Spark 4.0.0" on Jul 2, 2025
@andygrove (Member, Author):

hive-1 failure to investigate:

2025-07-09T20:18:09.4999985Z [info]   Cause: org.apache.comet.CometNativeException: External error: Arrow: Parquet argument error: Parquet error: encountered non UTF-8 data
2025-07-09T20:18:09.5000678Z [info]   at org.apache.comet.parquet.Native.readNextRecordBatch(Native Method)
2025-07-09T20:18:09.5001239Z [info]   at org.apache.comet.parquet.NativeBatchReader.loadNextBatch(NativeBatchReader.java:812)

@andygrove (Member, Author):

Current failures:

core-1:

$ grep "* FAIL" core1.txt 
2025-08-21T02:32:47.7183506Z [info] - SPARK-26677: negated null-safe equality comparison should not filter matched row groups *** FAILED *** (303 milliseconds)
2025-08-21T02:52:05.6178387Z [info] - scalar types rebuild *** FAILED *** (637 milliseconds)
2025-08-21T02:52:05.9761617Z [info] - object rebuild *** FAILED *** (357 milliseconds)
2025-08-21T02:52:06.3954010Z [info] - array rebuild *** FAILED *** (417 milliseconds)
2025-08-21T02:52:07.4434260Z [info] - malformed input *** FAILED *** (1 second, 40 milliseconds)
2025-08-21T02:52:07.6507720Z [info] - extract from shredded object *** FAILED *** (206 milliseconds)
2025-08-21T02:52:07.8370529Z [info] - extract from shredded array *** FAILED *** (182 milliseconds)
2025-08-21T02:52:08.0028967Z [info] - missing fields *** FAILED *** (166 milliseconds)
2025-08-21T02:52:08.1324041Z [info] - custom casts *** FAILED *** (129 milliseconds)

hive-1:

$ grep "* FAIL" hive1.txt 
2025-08-21T02:28:26.0691331Z [info] - SPARK-30201 HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT *** FAILED *** (375 milliseconds)

- case StringType => ArrowType.Utf8.INSTANCE
+ case _: StringType => ArrowType.Utf8.INSTANCE
+ case dt if isStringCollationType(dt) =>
+   // TODO collation information is lost with this transformation
Contributor:

should we have a ticket?

@andygrove (Member, Author):

I removed this comment because Spark also does not track this information, as pointed out in #2190 (comment) by @parthchandra
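
The review thread above concerns the change from an object pattern to a type pattern on StringType. A minimal, self-contained sketch of why that change was needed, using hypothetical stand-in classes (Spark's real collation-aware StringType lives in org.apache.spark.sql.types and is shaped differently): in Spark 4.0.0, StringType carries a collation, so matching on the StringType companion object only catches the default instance, while a type pattern catches every collation.

```scala
// Hypothetical stand-ins for Spark 4.0's collation-aware StringType;
// illustration only, not Spark's actual class definitions.
class StringType(val collationId: Int)
object StringType extends StringType(0) // default UTF8_BINARY collation

def toArrowTypeName(dt: Any): String = dt match {
  // Stable-identifier pattern: compares with == against the companion
  // object, so it matches only the default singleton.
  case StringType    => "Utf8 (default collation)"
  // Type pattern: matches any StringType instance, whatever its collation.
  // Note the collation id is not inspected here, so that information is lost.
  case _: StringType => "Utf8 (collated)"
  case _             => "unsupported"
}

println(toArrowTypeName(StringType))        // hits the object pattern
println(toArrowTypeName(new StringType(1))) // hits the type pattern
```

With the pre-4.0 style `case StringType =>`, a non-default collated string would fall through to the catch-all and be reported unsupported, which is the failure mode the PR fixes.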

typeChecker.isSchemaSupported(partitionSchema, fallbackReasons)

def hasMapsContainingStructs(dataType: DataType): Boolean = {
  def isComplexType(dt: DataType): Boolean = dt match {
Contributor:

The isComplexType method now exists in the DataTypeSupport object.

@andygrove (Member, Author):

Thanks. Updated.
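
A self-contained sketch of the kind of recursive schema walk discussed above, using a hypothetical miniature DataType ADT (the real code operates on org.apache.spark.sql.types, and the real isComplexType lives in the DataTypeSupport object): the check looks for a map anywhere in the schema whose key or value is, or contains, a struct.

```scala
// Hypothetical miniature stand-in for Spark's DataType hierarchy;
// illustration only.
sealed trait DataType
case object IntType extends DataType
case object StrType extends DataType
case class StructType(fields: Seq[DataType]) extends DataType
case class ArrayType(element: DataType) extends DataType
case class MapType(key: DataType, value: DataType) extends DataType

// True if the type is, or transitively contains, a struct.
def containsStruct(dt: DataType): Boolean = dt match {
  case _: StructType    => true
  case ArrayType(e)     => containsStruct(e)
  case MapType(k, v)    => containsStruct(k) || containsStruct(v)
  case _                => false
}

// Walk the schema; flag any map whose key or value involves a struct.
def hasMapsContainingStructs(dt: DataType): Boolean = dt match {
  case StructType(fields) => fields.exists(hasMapsContainingStructs)
  case ArrayType(e)       => hasMapsContainingStructs(e)
  case MapType(k, v)      => containsStruct(k) || containsStruct(v)
  case _                  => false
}
```

For example, `MapType(IntType, StructType(Seq(IntType)))` would be flagged, while `ArrayType(IntType)` would not; a scan rule can use such a predicate to fall back to Spark for schemas a native reader cannot handle.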

@comphead (Contributor) left a comment:

lgtm thanks @andygrove

@andygrove andygrove merged commit 1b344de into apache:main Aug 26, 2025
174 of 177 checks passed
@andygrove andygrove deleted the auto-scan-4.0.0 branch August 26, 2025 00:12
coderfender pushed a commit to coderfender/datafusion-comet that referenced this pull request Dec 13, 2025
Development

Successfully merging this pull request may close these issues.

Enable auto scan mode for Spark 4.0.0

4 participants