fix: add ParquetSchemaMismatchSuite and add a new config/fallback for reading INT96 as TimestampNTZ#4087

Open
andygrove wants to merge 19 commits intoapache:mainfrom
andygrove:tests/issue-3720-schema-mismatch

Conversation

@andygrove
Member

@andygrove andygrove commented Apr 25, 2026

Which issue does this PR close?

Closes #3720

Rationale for this change

Some Spark SQL tests across all supported Spark versions are skipped, referencing #3720.

Comet's native_datafusion scan behaves differently from Spark in some cases and can be more permissive regarding type widening.

There were also some correctness issues. Some were resolved in previous commits, and this PR resolves the final one: reading INT96 timestamps into a requested TimestampNTZ type.

Although some Spark SQL tests will remain ignored with links to the issue, the behavior is now documented and fallbacks are in place for any potential correctness issues, so no further work is needed.
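To illustrate the correctness issue being guarded against: INT96 encodes an instant (TimestampLTZ semantics), so silently reinterpreting it as TimestampNTZ keeps the UTC wall clock instead of the session-local one. A minimal Python sketch of that divergence, using an assumed session time zone of America/Los_Angeles:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# An INT96 value encodes an instant (epoch-based, i.e. TimestampLTZ semantics).
instant = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

# Correct wall-clock rendering under a session time zone (hypothetical choice):
session_tz = ZoneInfo("America/Los_Angeles")
local_wall_clock = instant.astimezone(session_tz).replace(tzinfo=None)

# A naive reinterpretation of the raw epoch value as TimestampNTZ
# keeps the UTC wall clock instead:
naive_wall_clock = instant.replace(tzinfo=None)

print(local_wall_clock)  # 2024-01-01 04:00:00
print(naive_wall_clock)  # 2024-01-01 12:00:00
```

The two values differ by the session's UTC offset, which is why a silent read can return incorrect wall-clock values rather than failing.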

What changes are included in this PR?

A new test suite ParquetSchemaMismatchSuite.

New config and fallback rule.

How are these changes tested?

New test.

Both native_datafusion and native_iceberg_compat throw SparkException
(matching Spark's reference behavior). The withMismatchedSchema helper
was redesigned to accept a separate check lambda so collect() executes
while the temp directory is still present.
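The helper redesign described above can be sketched as follows (hypothetical Python model, not the actual Scala helper): the key point is that the `check` callback runs inside the temp-directory scope, so eager reads still see the files.

```python
import tempfile

# Hypothetical sketch of the redesigned helper: `write` produces files in a
# temp directory and `check` runs its assertions while that directory still
# exists, so eager reads (like Spark's collect()) can still see the files.
def with_mismatched_schema(write, check):
    with tempfile.TemporaryDirectory() as path:
        write(path)
        check(path)  # runs before the directory is cleaned up

# Usage example: record the order of callbacks.
results = []
with_mismatched_schema(
    write=lambda path: results.append(("wrote", path)),
    check=lambda path: results.append(("checked", path)),
)
```

Passing the check as a separate lambda, rather than returning a value to assert on afterwards, is what keeps execution inside the directory's lifetime.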
On Spark 4.0, COMET_SCHEMA_EVOLUTION_ENABLED defaults to true and
TypeUtil.checkParquetType has an isSpark40Plus guard, so four
native_iceberg_compat tests that previously expected SparkException now
succeed with widened values. Make each assertion version-conditional
using CometSparkSessionExtensions.isSpark40Plus and update the behavior
matrix accordingly.
…4090, apache#4091

Cases 4 (Decimal(10,2)->Decimal(5,0)) and 6 (STRING->INT) now throw
SparkException on native_datafusion after the schema adapter rejection
fixes landed on main. Update assertions and the behavior matrix.
@andygrove andygrove changed the title from "test: add ParquetSchemaMismatchSuite documenting Comet vs Spark schema-mismatch behavior" to "[WIP] test: add ParquetSchemaMismatchSuite documenting Comet vs Spark schema-mismatch behavior" Apr 29, 2026
andygrove and others added 4 commits April 29, 2026 10:05
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace verbose class-level scaladoc and behavior matrix with a concise
description. Per-test comments documenting Comet vs Spark divergence are
retained.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…match (apache#3720)

The native_datafusion scan silently read INT96 TimestampLTZ columns as
TimestampNTZ, potentially returning incorrect wall-clock values. Add a
check in CometNativeScan.isSupported that detects TimestampType <->
TimestampNTZType mismatches between the file and read schemas and falls
back to Spark, which throws the appropriate error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
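The field-by-field detection described in this commit can be modeled as follows (illustrative Python sketch, not Comet's actual code): a scan falls back when a field is TimestampType in one schema and TimestampNTZType in the other. As the next commit explains, this check is defeated when an explicit read schema makes both sides identical.

```python
# Illustrative sketch of the JVM-side check described above. Schemas are
# modeled as lists of (name, type-name) pairs; the real code operates on
# Spark StructType fields.
LTZ, NTZ = "TimestampType", "TimestampNTZType"

def has_timestamp_mismatch(file_schema, read_schema):
    # Fall back when any field is LTZ on one side and NTZ on the other.
    read_types = dict(read_schema)
    for name, file_type in file_schema:
        read_type = read_types.get(name)
        if {file_type, read_type} == {LTZ, NTZ}:
            return True
    return False

print(has_timestamp_mismatch([("ts", LTZ)], [("ts", NTZ)]))  # True
print(has_timestamp_mismatch([("ts", LTZ)], [("ts", LTZ)]))  # False
```

The second call returning False is exactly the blind spot noted below: with an explicit read schema both sides report the same type, so the mismatch is invisible on the JVM side.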
…TZ scans (apache#3720)

The previous approach tried to detect TimestampType/TimestampNTZType
mismatches between dataSchema and requiredSchema on the JVM side, but
when users provide an explicit read schema, both schemas are identical
(the Parquet file's actual physical type is not reflected). Additionally,
INT96 timestamps are coerced to Timestamp(Microsecond, None) by
DataFusion, making them indistinguishable from TimestampNTZ at the Rust
schema adapter level.

Instead, add a safety check (spark.comet.scan.timestampNTZSafetyCheck,
default true) that falls back to Spark for any native_datafusion scan
with TimestampNTZ columns. This follows the same pattern as the existing
unsignedSmallIntSafetyCheck for ShortType. Users whose data does not
contain INT96 timestamps can set this to false.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
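The safety-check pattern described above can be sketched like this (assumed names and schema representation, not Comet's real API): when the config is enabled, any TimestampNTZ column in the read schema triggers a fallback to Spark's scan.

```python
# Minimal sketch of the safety-check fallback pattern. `ntz_safety_check`
# stands in for spark.comet.scan.timestampNTZSafetyCheck (default true);
# schemas are modeled as (name, type-name) pairs.
def should_fall_back(read_schema, ntz_safety_check=True):
    if not ntz_safety_check:
        return False  # user asserts the data contains no INT96 timestamps
    return any(dtype == "TimestampNTZType" for _, dtype in read_schema)

schema = [("id", "LongType"), ("ts", "TimestampNTZType")]
print(should_fall_back(schema))                          # True
print(should_fall_back(schema, ntz_safety_check=False))  # False
```

This conservative shape mirrors the commit's rationale: since INT96 is indistinguishable from TimestampNTZ at the Rust schema-adapter level, the only safe default is to fall back whenever TimestampNTZ appears, and let users opt out.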
@andygrove andygrove changed the title from "test: add ParquetSchemaMismatchSuite documenting Comet vs Spark schema-mismatch behavior" to "test: add ParquetSchemaMismatchSuite and add a new config/fallback for reading INT96 as TimestampNTZ" Apr 29, 2026
@andygrove andygrove marked this pull request as ready for review April 29, 2026 16:47
@andygrove andygrove changed the title from "test: add ParquetSchemaMismatchSuite and add a new config/fallback for reading INT96 as TimestampNTZ" to "fix: add ParquetSchemaMismatchSuite and add a new config/fallback for reading INT96 as TimestampNTZ" Apr 29, 2026
@mbutrovich
Contributor

I'm reviewing this but will likely need until tomorrow to confirm the behavior.


Development

Successfully merging this pull request may close these issues.

native_datafusion: no error thrown for schema mismatch when reading Parquet with incompatible types

2 participants