fix: add ParquetSchemaMismatchSuite and add a new config/fallback for reading INT96 as TimestampNTZ #4087
Open
andygrove wants to merge 19 commits into apache:main from
Conversation
Both native_datafusion and native_iceberg_compat throw SparkException (matching Spark's reference behavior). The withMismatchedSchema helper was redesigned to accept a separate check lambda so collect() executes while the temp directory is still present.
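The redesigned helper might look roughly like the following. This is a hedged Scala sketch, not the actual code from the PR: the helper name and the check-lambda design come from the description above, but the signature and body are assumptions for illustration.

```scala
// Hypothetical sketch of the redesigned test helper: the check lambda runs
// inside withTempDir, so collect() executes while the Parquet files still
// exist on disk instead of after the temp directory has been cleaned up.
def withMismatchedSchema(writeDf: DataFrame, readSchema: StructType)(
    check: DataFrame => Unit): Unit = {
  withTempDir { dir =>
    val path = dir.getCanonicalPath
    writeDf.write.parquet(path)                         // write with original schema
    val read = spark.read.schema(readSchema).parquet(path) // read with mismatched schema
    check(read)                                         // e.g. assert an exception
  }
}
```

A test asserting the Spark-matching behavior would then pass a lambda such as `df => intercept[SparkException](df.collect())`.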
On Spark 4.0, COMET_SCHEMA_EVOLUTION_ENABLED defaults to true and TypeUtil.checkParquetType has an isSpark40Plus guard, so four native_iceberg_compat tests that previously expected SparkException now succeed with widened values. Make each assertion version-conditional using CometSparkSessionExtensions.isSpark40Plus and update the behavior matrix accordingly.
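A version-conditional assertion of this kind could take roughly the following shape (a sketch only; `isSpark40Plus` is named in the description above, while the surrounding test code is assumed):

```scala
// Hypothetical shape of a version-conditional assertion: on Spark 4.0+,
// schema evolution accepts the widened read, so the query succeeds;
// on earlier versions the same read throws SparkException.
if (CometSparkSessionExtensions.isSpark40Plus) {
  checkAnswer(df, expectedWidenedRows) // widened values are returned
} else {
  intercept[SparkException] { df.collect() }
}
```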
This was referenced Apr 25, 2026
…4090, apache#4091 Cases 4 (Decimal(10,2)->Decimal(5,0)) and 6 (STRING->INT) now throw SparkException on native_datafusion after the schema adapter rejection fixes landed on main. Update assertions and the behavior matrix.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace verbose class-level scaladoc and behavior matrix with a concise description. Per-test comments documenting Comet vs Spark divergence are retained. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…match (apache#3720) The native_datafusion scan silently read INT96 TimestampLTZ columns as TimestampNTZ, potentially returning incorrect wall-clock values. Add a check in CometNativeScan.isSupported that detects TimestampType <-> TimestampNTZType mismatches between the file and read schemas and falls back to Spark, which throws the appropriate error. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…TZ scans (apache#3720) The previous approach tried to detect TimestampType/TimestampNTZType mismatches between dataSchema and requiredSchema on the JVM side, but when users provide an explicit read schema, both schemas are identical (the Parquet file's actual physical type is not reflected). Additionally, INT96 timestamps are coerced to Timestamp(Microsecond, None) by DataFusion, making them indistinguishable from TimestampNTZ at the Rust schema adapter level. Instead, add a safety check (spark.comet.scan.timestampNTZSafetyCheck, default true) that falls back to Spark for any native_datafusion scan with TimestampNTZ columns. This follows the same pattern as the existing unsignedSmallIntSafetyCheck for ShortType. Users whose data does not contain INT96 timestamps can set this to false. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
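The fallback described above could be sketched as follows. This is an illustrative assumption of its shape, not the merged code: the config key `spark.comet.scan.timestampNTZSafetyCheck` is from the commit message, while `fallbackToSpark` and the surrounding structure are hypothetical.

```scala
// Sketch (assumed shape): when the safety check is enabled and the read
// schema contains any TimestampNTZ column, fall back to Spark's scan,
// since INT96 data could otherwise be silently read as TimestampNTZ.
val safetyCheckEnabled =
  conf.getConfString("spark.comet.scan.timestampNTZSafetyCheck", "true").toBoolean

def containsTimestampNTZ(schema: StructType): Boolean =
  schema.exists {
    case StructField(_, TimestampNTZType, _, _) => true
    case StructField(_, nested: StructType, _, _) => containsTimestampNTZ(nested)
    case _ => false
  }

if (safetyCheckEnabled && containsTimestampNTZ(requiredSchema)) {
  // fallbackToSpark is a placeholder for Comet's fallback mechanism
  fallbackToSpark("TimestampNTZ columns present and timestampNTZSafetyCheck enabled")
}
```

This mirrors the coarse-grained approach the commit message describes: the check cannot distinguish INT96-backed columns at this level, so it conservatively falls back for all TimestampNTZ reads unless the user disables the config.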
Contributor
I'm reviewing this but will likely need until tomorrow to confirm the behavior.
Which issue does this PR close?
Closes #3720
Rationale for this change
Some Spark SQL tests across all supported Spark versions are skipped, referencing #3720.
Comet's native_datafusion scan behaves differently from Spark in some cases and can be more permissive regarding type widening. There were also some correctness issues. Some were resolved in previous commits, and this PR resolves the final one: reading INT96 timestamps into a requested TimestampNTZ type.
Although some Spark SQL tests will remain ignored with links to the issue, the behavior is now documented and fallbacks are in place for any potential correctness issues, so no further work is needed.
What changes are included in this PR?
A new test suite, ParquetSchemaMismatchSuite.
A new config and fallback rule.
How are these changes tested?
New test.