Comet native scan returns wrong schema for missing struct fields in Parquet #4136

@andygrove

Description

When all struct fields are missing from a Parquet file, Spark's vectorized reader returns struct<> (empty struct) as the schema, but Comet's native scan returns the full schema with null values (e.g., struct<_1:struct<_3:int,_4:bigint>>).

This causes five tests in ParquetIOSuite to fail when running the Spark 4.1.1 SQL tests with Comet enabled:

  • vectorized reader: missing all struct fields
  • SPARK-53535: vectorized reader: missing all struct fields, struct with complex fields
  • SPARK-53535: vectorized reader: missing all struct fields, struct with map field only
  • SPARK-53535: vectorized reader: missing all struct fields, struct with cheap map and more expensive array field
  • SPARK-54220: vectorized reader: missing all struct fields, struct with NullType only

These tests are new in Spark 4.1 (SPARK-53535, SPARK-54220).
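
To make the affected case concrete, here is a small self-contained Python sketch that models schemas as nested dicts and flags struct columns where none of the requested fields exist in the file. It is only an illustration of the condition described above, not Spark's or Comet's code. The helper names (`format_schema`, `all_fields_missing`) and the file schema are assumptions made for the example. Only the read schema string `struct<_1:struct<_3:int,_4:bigint>>` comes from this issue.

```python
# Illustrative model only: not Spark's or Comet's actual implementation.
# A schema is a nested dict: field name -> type string, or a dict for a struct.

def format_schema(schema):
    """Render a nested dict as a Spark-style struct<...> string."""
    parts = []
    for name, typ in schema.items():
        rendered = format_schema(typ) if isinstance(typ, dict) else typ
        parts.append(f"{name}:{rendered}")
    return "struct<" + ",".join(parts) + ">"

def all_fields_missing(requested_struct, file_struct):
    """True when none of the requested struct's fields exist in the file's struct."""
    return not any(name in file_struct for name in requested_struct)

# Hypothetical file schema (assumption): column _1 holds fields _1 and _2.
file_schema = {"_1": {"_1": "int", "_2": "string"}}
# Read schema from the issue: fields _3 and _4, neither present in the file.
read_schema = {"_1": {"_3": "int", "_4": "bigint"}}

# Struct columns for which every requested field is absent from the file.
affected = [
    col for col, typ in read_schema.items()
    if isinstance(typ, dict)
    and isinstance(file_schema.get(col), dict)
    and all_fields_missing(typ, file_schema[col])
]

print(affected)
print(format_schema(read_schema))  # the full schema that Comet reports
print(format_schema({}))           # the empty struct that Spark reports
```

A check along these lines could help identify which columns hit the mismatch, but where such a check would belong in Comet's native scan is not stated in this issue.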
