Changes from all commits (19 commits)
- d474ba3 test: add ParquetSchemaMismatchSuite skeleton for issue #3720 (andygrove, Apr 25, 2026)
- bee01e8 test: case 1 binary read as timestamp (andygrove, Apr 25, 2026)
- e19a5d2 test: case 2 int32 read as int64 (andygrove, Apr 25, 2026)
- 96390ac test: case 3 timestamp_ltz read as timestamp_ntz (andygrove, Apr 25, 2026)
- 1aa00df test: case 4 incompatible decimal precision/scale (andygrove, Apr 25, 2026)
- 1d139ee test: case 5 int32 as int64 with row group filter (andygrove, Apr 25, 2026)
- 69b1457 test: case 6 string read as int (andygrove, Apr 25, 2026)
- b012e99 test: case 7 timestamp_ntz read as array<timestamp_ntz> (andygrove, Apr 25, 2026)
- 318cac7 test: update matrix row 7 with confirmed throw outcomes (andygrove, Apr 25, 2026)
- 60a0ffa test: control case int8 read as int32 (andygrove, Apr 25, 2026)
- eabfdef test: control case float read as double (andygrove, Apr 25, 2026)
- cd8e68b test: handle Spark 4.0 type widening in iceberg-compat assertions (andygrove, Apr 25, 2026)
- 6bbec69 Merge remote-tracking branch 'apache/main' into tests/issue-3720-sche… (andygrove, Apr 28, 2026)
- dbdcca6 test: align native_datafusion schema-mismatch assertions with #4090, … (andygrove, Apr 28, 2026)
- 577e0b1 Merge branch 'main' into tests/issue-3720-schema-mismatch (andygrove, Apr 29, 2026)
- 6c5cd1b ci: add ParquetSchemaMismatchSuite to Linux and macOS workflows (andygrove, Apr 29, 2026)
- 972158d docs: simplify ParquetSchemaMismatchSuite class-level documentation (andygrove, Apr 29, 2026)
- 751af1e fix: fall back to Spark for TimestampType/TimestampNTZType schema mis… (andygrove, Apr 29, 2026)
- 2c6bd00 fix: add timestampNTZSafetyCheck to fall back to Spark for TimestampN… (andygrove, Apr 29, 2026)
1 change: 1 addition & 0 deletions .github/workflows/pr_build_linux.yml
@@ -323,6 +323,7 @@ jobs:
org.apache.comet.parquet.ParquetReadV1Suite
org.apache.comet.parquet.ParquetReadV2Suite
org.apache.comet.parquet.ParquetReadFromFakeHadoopFsSuite
org.apache.comet.parquet.ParquetSchemaMismatchSuite
org.apache.spark.sql.comet.ParquetDatetimeRebaseV1Suite
org.apache.spark.sql.comet.ParquetDatetimeRebaseV2Suite
org.apache.spark.sql.comet.ParquetEncryptionITCase
1 change: 1 addition & 0 deletions .github/workflows/pr_build_macos.yml
@@ -170,6 +170,7 @@ jobs:
org.apache.comet.parquet.ParquetReadV1Suite
org.apache.comet.parquet.ParquetReadV2Suite
org.apache.comet.parquet.ParquetReadFromFakeHadoopFsSuite
org.apache.comet.parquet.ParquetSchemaMismatchSuite
org.apache.spark.sql.comet.ParquetDatetimeRebaseV1Suite
org.apache.spark.sql.comet.ParquetDatetimeRebaseV2Suite
org.apache.spark.sql.comet.ParquetEncryptionITCase
14 changes: 14 additions & 0 deletions common/src/main/scala/org/apache/comet/CometConf.scala
@@ -785,6 +785,20 @@ object CometConf extends ShimCometConf {
.booleanConf
.createWithDefault(true)

val COMET_PARQUET_TIMESTAMP_NTZ_CHECK: ConfigEntry[Boolean] =
conf("spark.comet.scan.timestampNTZSafetyCheck")
.category(CATEGORY_SCAN)
.doc(
"Parquet files may contain INT96 timestamps (TimestampType/LTZ) which the " +
"native_datafusion scan cannot distinguish from TimestampNTZType after Parquet " +
"schema coercion. When this config is true (default), the native_datafusion scan " +
"falls back to Spark for TimestampNTZ columns to avoid silently returning incorrect " +
"timestamp values. Set to false to allow native execution if you know your Parquet " +
"files do not contain INT96 timestamps being read as TimestampNTZ. See " +
s"https://github.com/apache/datafusion-comet/issues/3720 for details. $COMPAT_GUIDE.")
.booleanConf
.createWithDefault(true)

val COMET_EXEC_STRICT_FLOATING_POINT: ConfigEntry[Boolean] =
conf("spark.comet.exec.strictFloatingPoint")
.category(CATEGORY_EXEC)
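As a usage note for the config above: a user who has verified their Parquet files contain no INT96 timestamps could opt out of the safety check at session level. This is an illustrative configuration sketch, not code from this PR; only the config key comes from the diff.

```scala
// Illustrative only: disable the TimestampNTZ safety check when you are
// certain no INT96 timestamps will be read as TimestampNTZ columns.
// SparkSession setup is a hypothetical example, not part of this PR.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("comet-ntz-opt-out")
  .getOrCreate()

// Allows the native_datafusion scan to read TimestampNTZ columns natively.
spark.conf.set("spark.comet.scan.timestampNTZSafetyCheck", "false")
```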
6 changes: 6 additions & 0 deletions docs/source/user-guide/latest/compatibility/scans.md
@@ -71,6 +71,12 @@ requires `spark.comet.exec.enabled=true` because the scan node must be wrapped b
- Duplicate field names in case-insensitive mode (e.g., a Parquet file with both `B` and `b` columns)
are detected at read time and raise a `SparkRuntimeException` with error class `_LEGACY_ERROR_TEMP_2093`,
matching Spark's behavior.
- `TimestampNTZType` columns, by default. Parquet files may contain INT96 timestamps (`TimestampType`/LTZ)
which the `native_datafusion` scan cannot distinguish from `TimestampNTZType` after Parquet schema coercion,
potentially returning incorrect timestamp values. When `spark.comet.scan.timestampNTZSafetyCheck=true`
(default), the scan falls back to Spark for `TimestampNTZ` columns. Set to `false` if your Parquet files
do not contain INT96 timestamps being read as `TimestampNTZ`. See
[issue #3720](https://github.com/apache/datafusion-comet/issues/3720) for more details.

## `native_iceberg_compat` Limitations

@@ -709,6 +709,15 @@ case class CometScanTypeChecker(scanImpl: String) extends DataTypeSupport with C
"native execution if your data does not contain unsigned small integers. " +
CometConf.COMPAT_GUIDE
false
case _: TimestampNTZType
if scanImpl == CometConf.SCAN_NATIVE_DATAFUSION &&
CometConf.COMET_PARQUET_TIMESTAMP_NTZ_CHECK.get() =>
fallbackReasons +=
s"$scanImpl scan may read INT96 timestamps as TimestampNTZ incorrectly. " +
s"Set ${CometConf.COMET_PARQUET_TIMESTAMP_NTZ_CHECK.key}=false to allow " +
"native execution if your Parquet files do not contain INT96 timestamps " +
s"being read as TimestampNTZ. ${CometConf.COMPAT_GUIDE}"
false
case dt if isStringCollationType(dt) =>
// we don't need specific support for collation in scans, but this
// is a convenient place to force the whole query to fall back to Spark for now
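The pattern match above can be sketched in isolation. The following is a minimal standalone sketch of the fallback decision, not the real `CometScanTypeChecker`: the `DataType` hierarchy, `ScanNativeDatafusion` constant, and `nativeScanSupports` function are simplified stand-ins for the Spark and Comet types used in the diff.

```scala
// Simplified stand-ins for Spark's DataType and Comet's scan-impl names.
sealed trait DataType
case object TimestampNTZType extends DataType
case object LongType extends DataType

val ScanNativeDatafusion = "native_datafusion"

// Returns true when the type can be scanned natively. TimestampNTZ falls
// back to Spark while the safety check is enabled, mirroring the new case
// in the diff: the check guards against INT96 timestamps being misread
// as TimestampNTZ after Parquet schema coercion.
def nativeScanSupports(
    dt: DataType,
    scanImpl: String,
    ntzSafetyCheck: Boolean): Boolean = dt match {
  case TimestampNTZType
      if scanImpl == ScanNativeDatafusion && ntzSafetyCheck =>
    false // fall back to Spark
  case _ =>
    true
}
```

The decision depends only on the declared type, the scan implementation, and the flag, which is why a plain session-level config toggle is enough to opt out.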