[SPARK-53535][SQL] Fix missing structs always being assumed as nulls #52557
Conversation
Wow, this has been a problem for us for so long. Especially when you read a non-nullable struct, this actually throws an NPE instead of just giving you the wrong data. Thanks for the fix!
juliuszsompolski left a comment
@gengliangwang could you take a look?
@ZiyaZa I am not a big fan of such behavior changes. Why pick just one arbitrary field instead of setting all the fields to null? Could you provide more details on this one?
For breaking changes, we should usually introduce a SQL configuration to control the new/legacy behaviors. We should also update https://spark.apache.org/docs/latest/sql-migration-guide.html
This is a behavior change that is required to fix a correctness issue. The issue is described in more detail in the linked JIRA ticket. According to the comment #52557 (comment) above, we could also get a NullPointerException if the struct is marked as non-nullable, because previously we would wrongly assume all such struct values to be null.
We need to understand, for each row, whether the struct value is null or it is a struct with all of its fields null (in JSON notation, the difference between `{"s": null}` and `{"s": {"a": null}}`).
Updated the description.
Added a flag to control this behavior.
Can you please explain how we update that? It looks like that is built from the …
Yes, please create a new section.
```scala
    .createWithDefault(true)

  val PARQUET_READ_ANY_FIELD_FOR_MISSING_STRUCT =
    buildConf("spark.sql.parquet.readAnyFieldForMissingStruct")
```
This should be a legacy config that is off by default. How about `spark.sql.legacy.returnNullStructIfAllFieldsMissing`?
I named it `spark.sql.legacy.parquet.returnNullStructIfAllFieldsMissing` to show that it is for Parquet files.
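For completeness, a minimal usage sketch of the opt-out, assuming the flag ships under the name given in the reply above; `readSchema` and the path are placeholders, not values from this PR:

```scala
// Sketch only: opt back into the legacy behavior in which a struct whose requested
// fields are all missing from the Parquet file is read back as NULL.
// The flag name is the one discussed above; readSchema and the path are placeholders.
spark.conf.set("spark.sql.legacy.parquet.returnNullStructIfAllFieldsMissing", "true")
val legacyDf = spark.read.schema(readSchema).parquet("/tmp/missing_col_test")
legacyDf.show()
```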
```scala
)
val df = spark.read.schema(readSchema).parquet(file)
val scanNode = df.queryExecution.executedPlan.collectLeaves().head
VerifyNoAdditionalScanOutputExec(scanNode).execute().first()
```
Shall we use the existing `ColumnarToRowExec` and then verify the rows? But I'm also fine with this custom physical plan.
I tried to use that first, but couldn't get it to work because `ColumnarToRowExec` adds its own `UnsafeProjection` with `child.output`. Let's keep the custom plan node.
```scala
val childOutputTypes = child.output.map(_.dataType)
child.executeColumnar().mapPartitionsInternal { batches =>
  batches.flatMap { input =>
    input.rowIterator().asScala
```
If we decide to keep this custom physical plan, we can simplify it by checking the columnar batches directly:
```scala
0.until(input.numCols).foreach { index =>
  assert(childOutputTypes(index) == input.column(index).dataType)
}
```
Thanks! Simplified it.
Thanks, merging to master!
### What changes were proposed in this pull request?
Currently, if all fields of a struct mentioned in the read schema are missing in a Parquet file, the reader populates the struct with nulls.
This PR modifies the scan behavior so that, if the struct exists in the Parquet schema but none of the fields from the read schema are present, we instead pick an arbitrary field from the Parquet file to read and use its definition levels to populate the struct's NULLs (as well as outer NULLs and array sizes if the struct is nested inside another nested type).
This is done by changing the schema requested by the readers. We add an additional field to the requested schema when clipping the Parquet file schema according to the Spark schema. This means that the readers actually read and return more data than requested, which can cause problems. This is only a problem for the `VectorizedParquetRecordReader`, since for the other read code path via parquet-mr, we already have an `UnsafeProjection` for outputting only requested schema fields in `ParquetFileFormat`.
To ensure `VectorizedParquetRecordReader` only returns Spark requested fields, we create the `ColumnarBatch` with vectors that match the requested schema (we get rid of the additional fields by recursively matching `sparkSchema` with `sparkRequestedSchema` and ensuring structs have the same length in both). Then `ParquetColumnVector`s are responsible for allocating dummy vectors to hold the data temporarily while reading, but these are not exposed to the outside.
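To make the vector-matching idea concrete, here is a deliberately simplified, flat (top-level only) sketch. The Spark class and method names used are real, but the helper itself and its signature are invented for illustration; the actual code matches `sparkSchema` against `sparkRequestedSchema` recursively.

```scala
import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.vectorized.{ColumnVector, ColumnarBatch}

// Hypothetical helper, illustration only: allocate vectors for every column we
// physically read (readSchema may contain an extra helper field), but expose only
// the columns that Spark actually requested, so the extra field never leaks out.
def exposeRequestedColumns(
    readSchema: StructType,       // schema actually read from the Parquet file
    requestedSchema: StructType,  // schema Spark asked for
    capacity: Int): ColumnarBatch = {
  val allVectors = OnHeapColumnVector.allocateColumns(capacity, readSchema)
  val exposed: Array[ColumnVector] = requestedSchema.fieldNames.map { name =>
    allVectors(readSchema.fieldIndex(name)): ColumnVector
  }
  new ColumnarBatch(exposed, 0)
}
```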
The heuristic to pick the arbitrary leaf field is as follows (a short cost sketch follows the list below): We try to minimize the number of arrays or maps (repeated fields) in the path to a leaf column, because the more repeated fields we have, the more likely we are to read a larger amount of data. At the same repetition level, we consider the type of each column to pick the cheapest column to read (struct nesting does not affect the decision here). We look at the byte size of the column type to pick the cheapest one as follows:
- BOOLEAN: 1 byte
- INT32, FLOAT: 4 bytes
- INT64, DOUBLE: 8 bytes
- INT96: 12 bytes
- BINARY, FIXED_LEN_BYTE_ARRAY, default case for future types: 32 bytes (high cost due to variable/large size)
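A minimal sketch of that byte-size comparison, written purely for illustration (the helper name is invented; the real selection logic lives in Spark's Parquet schema clipping code and is not reproduced here):

```scala
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName._

// Illustrative only: approximate per-value byte cost used to compare candidate
// leaf columns at the same repetition level, as described in the list above.
def approximateByteCost(typeName: PrimitiveTypeName): Int = typeName match {
  case BOOLEAN        => 1
  case INT32 | FLOAT  => 4
  case INT64 | DOUBLE => 8
  case INT96          => 12
  case _              => 32 // BINARY, FIXED_LEN_BYTE_ARRAY, and any future types
}
```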
### Why are the changes needed?
This is a bug fix: depending on the requested fields, we were incorrectly assuming non-null struct values to be missing from the file and returning nulls for them.
### Does this PR introduce _any_ user-facing change?
Yes. We previously assumed structs to be null if all the fields we were trying to read from a Parquet file were missing from that file, even if the file contained other fields whose definition levels could have been used. See the example from the Jira ticket below:
```python
df_a = sql('SELECT 1 as id, named_struct("a", 1) AS s')
path = "/tmp/missing_col_test"
df_a.write.format("parquet").save(path)
df_b = sql('SELECT 2 as id, named_struct("b", 3) AS s')
spark.read.format("parquet").schema(df_b.schema).load(path).show()
```
This used to return:
```
+---+----+
| id| s|
+---+----+
| 1|NULL|
+---+----+
```
It now returns:
```
+---+------+
| id| s|
+---+------+
| 1|{NULL}|
+---+------+
```
### How was this patch tested?
Added new unit tests; also fixed an old test to expect the new behavior.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#52557 from ZiyaZa/missing_struct.
Authored-by: Ziya Mukhtarov <ziya5muxtarov@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…alid Map

### What changes were proposed in this pull request?
This PR fixes a bug from #52557, where we read an additional field if all the requested fields of a struct are missing from the Parquet file. We used to always pick the cheapest leaf column of the struct. However, if this leaf was inside a Map column, then we'd generate an invalid Map type like the following:
```
optional group _1 (MAP) {
  repeated group key_value {
    required boolean key;
  }
}
```
Since there is no `value` field in this group, we'd fail later when trying to convert this Parquet type to a Spark type. This PR changes the additional-field selection logic to enforce selecting a field from both the key and the value of the map, which can now give us a type like the following:
```
optional group _1 (MAP) {
  repeated group key_value {
    required boolean key;
    optional group value {
      optional int32 _2;
    }
  }
}
```

### Why are the changes needed?
To fix a critical bug where we would throw an exception when reading a Parquet file.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #52758 from ZiyaZa/fix-missing-struct-with-map.

Authored-by: Ziya Mukhtarov <ziya5muxtarov@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…nd `WritableColumnVectorShim`

In Java, an overriding method's access modifier cannot be more restrictive than the overridden method. Changing from protected to public is safe and ensures compatibility before the Spark version upgrade. See apache/spark#52557
…ark 4.1 (#11313)

* [Fix] Remove `@NotNull` annotations to resolve dependency issues caused by the ORC upgrade. See apache/spark#51676
* [Fix] Make `reserveNewColumn` public in `ArrowWritableColumnVector` and `WritableColumnVectorShim`. In Java, an overriding method's access modifier cannot be more restrictive than the overridden method. Changing from protected to public is safe and ensures compatibility before the Spark version upgrade. See apache/spark#52557
* [Fix] Add GeographyVal and GeometryVal support in ArrowColumnarRow, BatchCarrierRow and ColumnarToCarrierRowExecBase. See [SPIP: Add geospatial types in Spark](https://issues.apache.org/jira/browse/SPARK-51658)
* [Fix] Update commons-collections to version 4.5.0. See apache/spark#52743
* [Fix] Enable the SPARK_TESTING environment variable for Spark test jobs. See apache/spark#53344