fix: native_datafusion: case-insensitive mode doesn't detect duplicate/ambiguous Parquet fields #3808

Merged
andygrove merged 8 commits into apache:main from vaibhawvipul:issue-3760
Mar 31, 2026

Conversation

@vaibhawvipul
Contributor

Which issue does this PR close?

Closes #3760.

Rationale for this change

When running Spark SQL tests with the native_datafusion scan, tests that expect errors for duplicate or ambiguous fields in case-insensitive mode fail because DataFusion's Parquet reader does not enforce Spark's case-sensitivity validation. Instead of detecting the duplicates and raising the proper Spark error, the native reader silently returns wrong results or falls back to Spark.

What changes are included in this PR?

Native duplicate field detection (Rust):

  • Added per-column duplicate detection in schema_adapter.rs via check_column_duplicate() - checks each Column expression in the physical plan for ambiguous case-insensitive matches against the original physical schema
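
As a rough illustration of the idea (not the actual code in schema_adapter.rs), a case-insensitive ambiguity check can be sketched as below. The function name check_duplicate_sketch, the AmbiguousFieldError type, the plain string slice standing in for the physical schema, and the error message wording are all assumptions made for this example.

```rust
use std::fmt;

/// Hypothetical error type for this sketch; Comet's real error type may differ.
#[derive(Debug, PartialEq)]
struct AmbiguousFieldError {
    required: String,
    matched: Vec<String>,
}

impl fmt::Display for AmbiguousFieldError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Illustrative wording only; not guaranteed to match Spark's message.
        write!(
            f,
            "Found duplicate field(s) \"{}\": [{}] in case-insensitive mode",
            self.required,
            self.matched.join(", ")
        )
    }
}

/// Checks one requested column name against the physical (file) field names.
/// In case-insensitive mode, more than one matching field is ambiguous and
/// produces an error that lists every match. In case-sensitive mode the
/// check is skipped, since differently cased names are distinct fields.
fn check_duplicate_sketch(
    column: &str,
    physical_fields: &[&str],
    case_sensitive: bool,
) -> Result<(), AmbiguousFieldError> {
    if case_sensitive {
        return Ok(());
    }
    let wanted = column.to_lowercase();
    let matched: Vec<String> = physical_fields
        .iter()
        .copied()
        .filter(|field| field.to_lowercase() == wanted)
        .map(String::from)
        .collect();
    if matched.len() > 1 {
        Err(AmbiguousFieldError {
            required: column.to_string(),
            matched,
        })
    } else {
        Ok(())
    }
}
```

Running the check per referenced column, rather than once over the whole file schema, means a file that contains both "A" and "a" only fails when a query actually reads that column.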

Removed plan-time fallback (Scala):

  • Removed the fallback block in CometScanRule.scala that detected duplicate field names at plan time and fell back to Spark - duplicates are now detected at read time in the native reader

Spark SQL test diffs (3.4.3, 3.5.8, 4.0.1):

  • Removed IgnoreCometNativeDataFusion annotations for issue-3760 from FileBasedDataSourceSuite and ParquetFilterSuite
  • Adapted error interception in tests to handle both Spark's SparkException(FAILED_READ_FILE) wrapper and Comet's direct SparkRuntimeException

How are these changes tested?

  • Rust and Scala tests
  • Spark SQL tests verified:
    • Spark native readers should respect spark.sql.caseSensitive - parquet
    • SPARK-25207: exception when duplicate fields in case-insensitive mode

…usion scan

Instead of falling back to Spark when duplicate field names are found in
case-insensitive mode, the native DataFusion reader now detects ambiguous
columns per-expression and raises SparkRuntimeException with error class
_LEGACY_ERROR_TEMP_2093, matching Spark's behavior.

This enables the previously ignored Spark SQL tests:
- FileBasedDataSourceSuite: caseSensitive test
- ParquetFilterSuite V1/V2: SPARK-25207 duplicate fields test

Closes apache#3760

…sensitive mode

Remove IgnoreCometNativeDataFusion annotations for issue apache#3760 from
FileBasedDataSourceSuite and ParquetFilterSuite. Adapt tests to handle
both Spark's SparkException wrapper and Comet's direct SparkRuntimeException.

… in case-insensitive mode

Remove IgnoreCometNativeDataFusion annotations for issue apache#3760 and adapt
tests to handle both Spark's SparkException wrapper and Comet's direct
RuntimeException/SparkRuntimeException.
@vaibhawvipul
Contributor Author

Many of the CI failures look like this:

/usr/bin/docker pull amd64/rust
  Using default tag: latest
  Error response from daemon: Head "https://registry-1.docker.io/v2/amd64/rust/manifests/latest": toomanyrequests: too many failed login attempts for username or IP address
  Warning: Docker pull failed with exit code 1, back off 1.615 seconds before retry.
  /usr/bin/docker pull amd64/rust
  Using default tag: latest
  Error response from daemon: Head "https://registry-1.docker.io/v2/amd64/rust/manifests/latest": toomanyrequests: too many failed login attempts for username or IP address
  Warning: Docker pull failed with exit code 1, back off 8.993 seconds before retry.
  /usr/bin/docker pull amd64/rust
  Using default tag: latest
  Error response from daemon: Head "https://registry-1.docker.io/v2/amd64/rust/manifests/latest": toomanyrequests: too many failed login attempts for username or IP address
  Error: Docker pull failed with exit code 1

Any ideas on how to fix them?

@vaibhawvipul
Contributor Author

The only remaining CI issue was from clippy, and it is now fixed.

@andygrove
Member

Sorry @vaibhawvipul, could you fix the conflict?

@andygrove
Member

LGTM. Will review again once conflict is fixed.

# Conflicts:
#	dev/diffs/3.5.8.diff
@vaibhawvipul
Contributor Author

@andygrove thank you. I have fixed the merge conflict.

Member

@andygrove andygrove left a comment

LGTM. Thanks @vaibhawvipul

@andygrove andygrove merged commit c97f033 into apache:main Mar 31, 2026
159 checks passed
@vaibhawvipul vaibhawvipul deleted the issue-3760 branch March 31, 2026 15:37
vaibhawvipul added a commit to vaibhawvipul/datafusion-comet that referenced this pull request Apr 4, 2026

Development

Successfully merging this pull request may close these issues.

native_datafusion: case-insensitive mode doesn't detect duplicate/ambiguous Parquet fields