fix: add ParquetSchemaMismatchSuite and add a new config/fallback for reading INT96 as TimestampNTZ#4087

Open
andygrove wants to merge 19 commits intoapache:mainfrom
andygrove:tests/issue-3720-schema-mismatch

Conversation

@andygrove
Member

@andygrove andygrove commented Apr 25, 2026

Which issue does this PR close?

Closes #3720

Rationale for this change

Some Spark SQL tests across all supported Spark versions are skipped, referencing #3720.

Comet's native_datafusion scan behaves differently from Spark in some cases and can be more permissive regarding type widening.

There were also some correctness issues. Some were resolved in previous commits, and this PR resolves the final one: reading INT96 timestamps into a requested TimestampNTZ type.

Although some Spark SQL tests will remain ignored with links to the issue, the behavior is now documented and fallbacks are in place for any potential correctness issues, so no further work is needed.
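To illustrate the correctness issue being guarded against: INT96 encodes an instant (TimestampLTZ semantics), so silently reinterpreting it as TimestampNTZ keeps the UTC wall clock instead of the session-local one. A minimal Python sketch of that divergence, using an assumed session time zone of America/Los_Angeles:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# An INT96 value encodes an instant (epoch-based, i.e. TimestampLTZ semantics).
instant = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

# Correct wall-clock rendering under a session time zone (hypothetical choice):
session_tz = ZoneInfo("America/Los_Angeles")
local_wall_clock = instant.astimezone(session_tz).replace(tzinfo=None)

# A naive reinterpretation of the raw epoch value as TimestampNTZ
# keeps the UTC wall clock instead:
naive_wall_clock = instant.replace(tzinfo=None)

print(local_wall_clock)  # 2024-01-01 04:00:00
print(naive_wall_clock)  # 2024-01-01 12:00:00
```

The two values differ by the session's UTC offset, which is why a silent read can return incorrect wall-clock values rather than failing.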

What changes are included in this PR?

A new test suite ParquetSchemaMismatchSuite.

New config and fallback rule.

How are these changes tested?

New test.

Both native_datafusion and native_iceberg_compat throw SparkException
(matching Spark's reference behavior). The withMismatchedSchema helper
was redesigned to accept a separate check lambda so collect() executes
while the temp directory is still present.
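The helper redesign described above can be sketched as follows (hypothetical Python model, not the actual Scala helper): the key point is that the `check` callback runs inside the temp-directory scope, so eager reads still see the files.

```python
import tempfile

# Hypothetical sketch of the redesigned helper: `write` produces files in a
# temp directory and `check` runs its assertions while that directory still
# exists, so eager reads (like Spark's collect()) can still see the files.
def with_mismatched_schema(write, check):
    with tempfile.TemporaryDirectory() as path:
        write(path)
        check(path)  # runs before the directory is cleaned up

# Usage example: record the order of callbacks.
results = []
with_mismatched_schema(
    write=lambda path: results.append(("wrote", path)),
    check=lambda path: results.append(("checked", path)),
)
```

Passing the check as a separate lambda, rather than returning a value to assert on afterwards, is what keeps execution inside the directory's lifetime.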
On Spark 4.0, COMET_SCHEMA_EVOLUTION_ENABLED defaults to true and
TypeUtil.checkParquetType has an isSpark40Plus guard, so four
native_iceberg_compat tests that previously expected SparkException now
succeed with widened values. Make each assertion version-conditional
using CometSparkSessionExtensions.isSpark40Plus and update the behavior
matrix accordingly.
…4090, apache#4091

Cases 4 (Decimal(10,2)->Decimal(5,0)) and 6 (STRING->INT) now throw
SparkException on native_datafusion after the schema adapter rejection
fixes landed on main. Update assertions and the behavior matrix.
@andygrove andygrove changed the title from "test: add ParquetSchemaMismatchSuite documenting Comet vs Spark schema-mismatch behavior" to "[WIP] test: add ParquetSchemaMismatchSuite documenting Comet vs Spark schema-mismatch behavior" Apr 29, 2026
andygrove and others added 4 commits April 29, 2026 10:05
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace verbose class-level scaladoc and behavior matrix with a concise
description. Per-test comments documenting Comet vs Spark divergence are
retained.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…match (apache#3720)

The native_datafusion scan silently read INT96 TimestampLTZ columns as
TimestampNTZ, potentially returning incorrect wall-clock values. Add a
check in CometNativeScan.isSupported that detects TimestampType <->
TimestampNTZType mismatches between the file and read schemas and falls
back to Spark, which throws the appropriate error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
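The field-by-field detection described in this commit can be modeled as follows (illustrative Python sketch, not Comet's actual code): a scan falls back when a field is TimestampType in one schema and TimestampNTZType in the other. As the next commit explains, this check is defeated when an explicit read schema makes both sides identical.

```python
# Illustrative sketch of the JVM-side check described above. Schemas are
# modeled as lists of (name, type-name) pairs; the real code operates on
# Spark StructType fields.
LTZ, NTZ = "TimestampType", "TimestampNTZType"

def has_timestamp_mismatch(file_schema, read_schema):
    # Fall back when any field is LTZ on one side and NTZ on the other.
    read_types = dict(read_schema)
    for name, file_type in file_schema:
        read_type = read_types.get(name)
        if {file_type, read_type} == {LTZ, NTZ}:
            return True
    return False

print(has_timestamp_mismatch([("ts", LTZ)], [("ts", NTZ)]))  # True
print(has_timestamp_mismatch([("ts", LTZ)], [("ts", LTZ)]))  # False
```

The second call returning False is exactly the blind spot noted below: with an explicit read schema both sides report the same type, so the mismatch is invisible on the JVM side.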
…TZ scans (apache#3720)

The previous approach tried to detect TimestampType/TimestampNTZType
mismatches between dataSchema and requiredSchema on the JVM side, but
when users provide an explicit read schema, both schemas are identical
(the Parquet file's actual physical type is not reflected). Additionally,
INT96 timestamps are coerced to Timestamp(Microsecond, None) by
DataFusion, making them indistinguishable from TimestampNTZ at the Rust
schema adapter level.

Instead, add a safety check (spark.comet.scan.timestampNTZSafetyCheck,
default true) that falls back to Spark for any native_datafusion scan
with TimestampNTZ columns. This follows the same pattern as the existing
unsignedSmallIntSafetyCheck for ShortType. Users whose data does not
contain INT96 timestamps can set this to false.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
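The safety-check pattern described above can be sketched like this (assumed names and schema representation, not Comet's real API): when the config is enabled, any TimestampNTZ column in the read schema triggers a fallback to Spark's scan.

```python
# Minimal sketch of the safety-check fallback pattern. `ntz_safety_check`
# stands in for spark.comet.scan.timestampNTZSafetyCheck (default true);
# schemas are modeled as (name, type-name) pairs.
def should_fall_back(read_schema, ntz_safety_check=True):
    if not ntz_safety_check:
        return False  # user asserts the data contains no INT96 timestamps
    return any(dtype == "TimestampNTZType" for _, dtype in read_schema)

schema = [("id", "LongType"), ("ts", "TimestampNTZType")]
print(should_fall_back(schema))                          # True
print(should_fall_back(schema, ntz_safety_check=False))  # False
```

This conservative shape mirrors the commit's rationale: since INT96 is indistinguishable from TimestampNTZ at the Rust schema-adapter level, the only safe default is to fall back whenever TimestampNTZ appears, and let users opt out.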
@andygrove andygrove changed the title from "test: add ParquetSchemaMismatchSuite documenting Comet vs Spark schema-mismatch behavior" to "test: add ParquetSchemaMismatchSuite and add a new config/fallback for reading INT96 as TimestampNTZ" Apr 29, 2026
@andygrove andygrove marked this pull request as ready for review April 29, 2026 16:47
@andygrove andygrove changed the title from "test: add ParquetSchemaMismatchSuite and add a new config/fallback for reading INT96 as TimestampNTZ" to "fix: add ParquetSchemaMismatchSuite and add a new config/fallback for reading INT96 as TimestampNTZ" Apr 29, 2026
@mbutrovich
Contributor

I'm reviewing this but will likely need until tomorrow to confirm the behavior.


Development

Successfully merging this pull request may close these issues.

native_datafusion: no error thrown for schema mismatch when reading Parquet with incompatible types

2 participants