
Conversation

Contributor

@aihuaxu aihuaxu commented Oct 11, 2025

This change adds support for writing shredded variants in the iceberg-spark module, enabling Spark to write shredded variant data into Iceberg tables.

Ideally, this should follow the approach described in the reader/writer API proposal for Iceberg V4, where the execution engine provides the shredded writer schema before creating the Iceberg writer. This design is cleaner, as it delegates schema generation responsibility to the engine.

As an interim solution, this PR implements a writer with lazy initialization for the actual Parquet writer. It buffers a portion of the data first, derives the shredded schema from the buffered records, then initializes the Parquet writer and flushes the buffered data to the file.

The current shredding algorithm shreds each field to its most common observed type.
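A minimal sketch of that interim flow, with hypothetical names rather than the actual classes in this PR: buffer records, derive the shredded schema from the sample, create the real writer, replay the buffer, then write through.

// Hypothetical sketch only; Writer, deriveShreddedSchema and newShreddedWriter are
// placeholders standing in for the real Parquet writer plumbing.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

abstract class LazyShreddedWriter<T, S> {
  interface Writer<T> {
    void write(T record) throws IOException;

    void close() throws IOException;
  }

  private final int bufferLimit;
  private final List<T> buffer = new ArrayList<>();
  private Writer<T> actual; // the real Parquet writer, created lazily

  LazyShreddedWriter(int bufferLimit) {
    this.bufferLimit = bufferLimit;
  }

  void write(T record) throws IOException {
    if (actual == null) {
      buffer.add(record);
      if (buffer.size() >= bufferLimit) {
        initialize();
      }
    } else {
      actual.write(record);
    }
  }

  void close() throws IOException {
    if (actual == null) {
      initialize(); // fewer records than the buffer limit were written
    }
    actual.close();
  }

  private void initialize() throws IOException {
    // derive the shredded schema from the buffered sample, e.g. most common type per field
    S shreddedSchema = deriveShreddedSchema(buffer);
    actual = newShreddedWriter(shreddedSchema);
    for (T buffered : buffer) {
      actual.write(buffered);
    }
    buffer.clear();
  }

  abstract S deriveShreddedSchema(List<T> sample);

  abstract Writer<T> newShreddedWriter(S shreddedSchema) throws IOException;
}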

@aihuaxu aihuaxu force-pushed the spark-write-iceberg-variant branch from 16b7a09 to dc4f72e on October 11, 2025 21:03
@aihuaxu aihuaxu marked this pull request as ready for review on October 11, 2025 21:15
@aihuaxu aihuaxu force-pushed the spark-write-iceberg-variant branch 3 times, most recently from 97851f0 to b87e999 on October 13, 2025 16:47
Contributor Author

aihuaxu commented Oct 15, 2025

@amogh-jahagirdar @Fokko @huaxingao Can you help take a look at this PR and let me know if there is a better approach?

Contributor Author

aihuaxu commented Oct 21, 2025

cc @RussellSpitzer, @pvary and @rdblue. It seems better to build this on the new File Format proposal, but I want to check whether this is an acceptable interim approach or whether you see a better alternative.

lazy.initialize(props, compressor, rowGroupOrdinal);
this.parquetSchema = result.getSchema();
this.pageStore = result.getPageStore();
this.writeStore = result.getWriteStore();
Contributor

Seems the initial writeStore/pageStore from startRowGroup() aren’t closed before being replaced here. Could this cause a memory leak?

Contributor Author

@huaxingao writeStore/pageStore are reinitialized before they are actually used, so this shouldn't cause a leak. But I added close() anyway.

Contributor

pvary commented Oct 21, 2025

@aihuaxu: Don't we want to do the same, but wrap the DataWriter instead of the ParquetWriter? The schema would be created near SparkWrite.WriterFactory, and it would be easier to move to the new API when it is ready. The added benefit is that when other formats implement Variant, we could reuse the code.

Would this be prohibitively complex?

@huaxingao
Contributor

In Spark DSv2, planning/validation happens on the driver. BatchWrite#createBatchWriterFactory runs on the driver and returns a DataWriterFactory that is serialized to executors. That factory must already carry the write schema the executors will use when they create DataWriters.

For shredded variant, we don’t know the shredded schema at planning time. We have to inspect some records to derive it. Doing a read on the driver during createBatchWriterFactory would mean starting a second job inside planning, which is not how DSv2 is intended to work.

Because of that, the currently proposed Spark approach is: put the logical variant type in the writer factory; on the executor, buffer the first N rows, infer the shredded schema from the data, then initialize the concrete writer and flush the buffer. I believe this PR follows the same approach, which seems like a practical solution to me given DSv2's constraints.
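To make that concrete, here is a rough sketch against the DSv2 interfaces; ShreddedWriterProvider and the placeholder inference step are hypothetical, not code from this PR.

// Sketch only: the factory is created on the driver without a shredded schema; each task
// buffers rows on the executor, infers the schema, then opens the real file writer.
import java.io.IOException;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.write.DataWriter;
import org.apache.spark.sql.connector.write.DataWriterFactory;
import org.apache.spark.sql.connector.write.WriterCommitMessage;
import org.apache.spark.sql.types.StructType;

class VariantShreddingWriterFactory implements DataWriterFactory {
  interface ShreddedWriterProvider extends Serializable {
    DataWriter<InternalRow> create(StructType shreddedSchema) throws IOException;
  }

  private final ShreddedWriterProvider provider; // opens the real file writer
  private final int sampleRowCount; // rows to buffer before inferring the shredded schema

  VariantShreddingWriterFactory(ShreddedWriterProvider provider, int sampleRowCount) {
    this.provider = provider;
    this.sampleRowCount = sampleRowCount;
  }

  @Override
  public DataWriter<InternalRow> createWriter(int partitionId, long taskId) {
    // runs on the executor; the shredded schema is still unknown at this point
    return new BufferingWriter(provider, sampleRowCount);
  }

  private static class BufferingWriter implements DataWriter<InternalRow> {
    private final ShreddedWriterProvider provider;
    private final int sampleRowCount;
    private final List<InternalRow> buffer = new ArrayList<>();
    private DataWriter<InternalRow> delegate;

    BufferingWriter(ShreddedWriterProvider provider, int sampleRowCount) {
      this.provider = provider;
      this.sampleRowCount = sampleRowCount;
    }

    @Override
    public void write(InternalRow row) throws IOException {
      if (delegate == null) {
        buffer.add(row.copy()); // InternalRow instances are reused, so copy before buffering
        if (buffer.size() >= sampleRowCount) {
          openDelegate();
        }
      } else {
        delegate.write(row);
      }
    }

    @Override
    public WriterCommitMessage commit() throws IOException {
      if (delegate == null) {
        openDelegate(); // short task: fewer than sampleRowCount rows were written
      }
      return delegate.commit();
    }

    @Override
    public void abort() throws IOException {
      if (delegate != null) {
        delegate.abort();
      }
    }

    @Override
    public void close() throws IOException {
      if (delegate != null) {
        delegate.close();
      }
    }

    private void openDelegate() throws IOException {
      // placeholder inference: real logic would inspect the buffered variant values
      StructType shreddedSchema = new StructType();
      delegate = provider.create(shreddedSchema);
      for (InternalRow buffered : buffer) {
        delegate.write(buffered);
      }
      buffer.clear();
    }
  }
}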

Contributor

pvary commented Oct 22, 2025

Thanks for the explanation, @huaxingao! I see several possible workarounds for the DataWriterFactory serialization issue, but I have some more fundamental concerns about the overall approach.

I believe shredding should be driven by future reader requirements rather than by the actual data being written. Ideally, it should remain relatively stable across data files within the same table and originate from a writer job configuration, or even better from a table-level configuration.

Even if we accept that the written data should dictate the shredding logic, Spark's implementation, while dependent on input order, is at least somewhat stable. It drops rarely used fields, handles inconsistent types, and limits the number of columns.

I understand this is only a PoC implementation for shredding, but I'm concerned that the current simplifications make it very unstable. If I'm interpreting correctly, the logic infers the type from the first occurrence of each field and creates a column for every field. This could lead to highly inconsistent column layouts within a table, especially in IoT scenarios where multiple sensors produce vastly different data.

Did I miss anything?

Contributor Author

aihuaxu commented Oct 24, 2025

Thanks @huaxingao and @pvary for reviewing, and thanks to Huaxin for explaining how the writer works in Spark.

Regarding the concern about unstable schemas, Spark's approach makes sense:

  • If a field appears consistently with a consistent type, create both value and typed_value
  • If a field appears with inconsistent types, create only value
  • Drop fields that occur in less than 10% of sampled rows
  • Cap the total at 300 fields (counting value and typed_value separately)

We could implement similar heuristics. Additionally, making the shredded schema configurable would allow users to choose which fields to shred at write time based on their read patterns.
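For illustration, a rough sketch of heuristics along those lines; FieldStats, the decision enum, and the way the cap is enforced are assumptions for this example, mirroring the Spark behavior listed above.

// Illustrative sketch only: decides which sampled fields get shredded and whether they
// get a typed_value, using the 10% threshold and 300-column cap described above.
import java.util.LinkedHashMap;
import java.util.Map;

class ShreddingHeuristics {
  enum ShreddingDecision { VALUE_AND_TYPED_VALUE, VALUE_ONLY, DROPPED }

  static class FieldStats {
    int occurrences;        // rows in the sample where the field appeared
    boolean consistentType; // true if every occurrence had the same physical type
  }

  static Map<String, ShreddingDecision> decide(
      Map<String, FieldStats> statsByField, int sampledRows) {
    Map<String, ShreddingDecision> decisions = new LinkedHashMap<>();
    int shreddedColumns = 0;
    for (Map.Entry<String, FieldStats> entry : statsByField.entrySet()) {
      FieldStats stats = entry.getValue();
      // drop rare fields: present in fewer than 10% of the sampled rows
      if (stats.occurrences < 0.10 * sampledRows) {
        decisions.put(entry.getKey(), ShreddingDecision.DROPPED);
        continue;
      }
      // each shredded field costs value (+ typed_value when the type is consistent)
      int cost = stats.consistentType ? 2 : 1;
      if (shreddedColumns + cost > 300) {
        decisions.put(entry.getKey(), ShreddingDecision.DROPPED); // cap reached
        continue;
      }
      shreddedColumns += cost;
      decisions.put(
          entry.getKey(),
          stats.consistentType
              ? ShreddingDecision.VALUE_AND_TYPED_VALUE
              : ShreddingDecision.VALUE_ONLY);
    }
    return decisions;
  }
}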

For this POC, I'd like feedback first on whether there are significant high-level design options to consider and whether this approach is acceptable. It seems hacky; I may be missing the big picture of how the writers work across Spark + Iceberg + Parquet, and there may be a better way.

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Nov 24, 2025

Tishj commented Nov 30, 2025

This PR caught my eye, as I've implemented the equivalent in DuckDB: duckdb/duckdb#19336

The PR description doesn't give much away, but I think the approach is similar to the proposed (interim) solution here: buffer the first rowgroup, infer the shredded schema from this, then finalize the file schema and start writing data.

We've opted to create a typed_value even though the type isn't 100% consistent within the buffered data, as long as it's the most common. I think you're losing potential compression by not doing that.

We've also added a copy option to force the shredded schema, for debugging purposes and for power users.

As for DECIMAL, it's kind of a special case in the shredding inference. We only shred to a DECIMAL type if all the decimal values we've seen for a column/field have the same width+scale; if any decimal value differs, DECIMAL is no longer considered when determining the shredded type of the column/field.
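Roughly, the bookkeeping is (illustrative sketch only, not the actual DuckDB code):

// Decimals stay eligible for shredding only while every observed value shares one
// precision/scale; any mismatch removes DECIMAL from consideration for that field.
class DecimalShreddingTracker {
  private Integer precision; // width of the first decimal seen, null until then
  private Integer scale;
  private boolean decimalEligible = true;

  void observeDecimal(int observedPrecision, int observedScale) {
    if (!decimalEligible) {
      return;
    }
    if (precision == null) {
      precision = observedPrecision;
      scale = observedScale;
    } else if (precision != observedPrecision || scale != observedScale) {
      decimalEligible = false;
    }
  }

  boolean canShredAsDecimal() {
    return decimalEligible && precision != null;
  }
}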

@github-actions github-actions bot removed the stale label Dec 1, 2025
@yguy-ryft
Contributor

This PR is super exciting!
Does this rely on variant shredding support in Spark? Is it supported in Spark 4.1 already, or planned for future releases?

Regarding the heuristics - I'd like to propose adding table properties as hints for variant shredding.
Similarly to properties used for bloom filters, it could be good to introduce something like write.parquet.variant-shredding-enabled.column.col1, which will hint to the writer that this column is important for shredding.
Many variants have important fields for which shredding should be enforced, and other fields which are less central and can be managed with simpler heuristics.
Would love to hear your thoughts!
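To make the proposal concrete, a writer could collect the hinted columns with something like the sketch below; the property name is just the proposal above, and this helper is hypothetical, not an existing Iceberg API.

// Illustrative sketch: scan table properties for the proposed per-column hint, e.g.
// write.parquet.variant-shredding-enabled.column.col1 = true
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class VariantShreddingHints {
  private static final String PREFIX = "write.parquet.variant-shredding-enabled.column.";

  static Set<String> hintedColumns(Map<String, String> tableProperties) {
    Set<String> columns = new HashSet<>();
    for (Map.Entry<String, String> entry : tableProperties.entrySet()) {
      if (entry.getKey().startsWith(PREFIX) && Boolean.parseBoolean(entry.getValue())) {
        columns.add(entry.getKey().substring(PREFIX.length()));
      }
    }
    return columns;
  }
}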

Contributor Author

aihuaxu commented Jan 9, 2026

This PR caught my eye, as I've implemented the equivalent in DuckDB: duckdb/duckdb#19336

The PR description doesn't give much away, but I think the approach is similar to the proposed (interim) solution here: buffer the first rowgroup, infer the shredded schema from this, then finalize the file schema and start writing data.

That is correct.

We've opted to create a typed_value even though the type isn't 100% consistent within the buffered data, as long as it's the most common. I think you're losing potential compression by not doing that.

I'm still improving the heuristics to use the most common type as the shredding type rather than the first one, and will probably cap the number of shredded fields, etc., but a field doesn't need a 100% consistent type to be shredded.

We've also added a copy option to force the shredded schema, for debugging purposes and for power users.

Yeah, I think it makes sense for advanced users to determine the shredded schema, since they may know the read pattern.

As for DECIMAL, it's kind of a special case in the shredding inference. We only shred to a DECIMAL type if all the decimal values we've seen for a column/field have the same width+scale; if any decimal value differs, DECIMAL is no longer considered when determining the shredded type of the column/field.

Why is DECIMAL special here? If we determine DECIMAL4 to be the shredded type, then we either shred values as DECIMAL4 or leave them unshredded if they don't fit in DECIMAL4, right?

Contributor Author

aihuaxu commented Jan 9, 2026

This PR is super exciting! Does this rely on variant shredding support in Spark? Is it supported in Spark 4.1 already, or planned for future releases?

Regarding the heuristics - I'd like to propose adding table properties as hints for variant shredding. Similarly to properties used for bloom filters, it could be good to introduce something like write.parquet.variant-shredding-enabled.column.col1, which will hint to the writer that this column is important for shredding. Many variants have important fields for which shredding should be enforced, and other fields which are less central and can be managed with simpler heuristics. Would love to hear your thoughts!

Yeah, I'm thinking of that too and will address it separately. Basically, based on the read pattern, the user could specify the shredding schema.

@gkpanda4 gkpanda4 left a comment

When processing JSON objects containing null field values (e.g., {"field": null}), the variant shredding creates schema columns for these null fields instead of omitting them entirely. This would cause schema bloat.

Adding a null check in ParquetVariantUtil.java:386 in the object() method should fix it.

@aihuaxu aihuaxu force-pushed the spark-write-iceberg-variant branch from b87e999 to 2e81d79 on January 14, 2026 05:39
@aihuaxu aihuaxu force-pushed the spark-write-iceberg-variant branch from 2e81d79 to 7e1b608 on January 15, 2026 19:35
Contributor Author

aihuaxu commented Jan 15, 2026

When processing JSON objects containing null field values (e.g., {"field": null}), the variant shredding creates schema columns for these null fields instead of omitting them entirely. This would cause schema bloat.

Adding a null check in ParquetVariantUtil.java:386 in the object() method should fix it.

I addressed the null value check in VariantShreddingAnalyzer.java instead: if the value is NULL, we do not add the shredded field.

@aihuaxu aihuaxu force-pushed the spark-write-iceberg-variant branch from 7e1b608 to b74addb on January 15, 2026 19:53
@aihuaxu aihuaxu force-pushed the spark-write-iceberg-variant branch 3 times, most recently from 7c805f6 to 67dbe97 on January 15, 2026 22:50
@aihuaxu aihuaxu requested review from gkpanda4 and huaxingao on January 15, 2026 22:51
@aihuaxu aihuaxu force-pushed the spark-write-iceberg-variant branch from 67dbe97 to 5c0533e on January 16, 2026 06:25
}
}

PhysicalType getMostCommonType() {


How does this logic work when a field has the same count for several different data types, e.g., 2 each of INT8, STRING, and DECIMAL8?

Will this logic choose one at random?

Contributor

I have the same question. I think we should add an explicit deterministic tie-break and also add a regression test that creates a perfect tie to ensure inference is stable.
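For example, a deterministic tie-break could compare counts first and fall back to a fixed ordering of the type tags (an illustrative sketch, not the code in this PR):

import java.util.Comparator;
import java.util.Map;

class MostCommonTypeTieBreak {
  // returns the type tag with the highest count; ties resolve by the tag's natural order
  static String mostCommonType(Map<String, Integer> countsByTypeTag) {
    return countsByTypeTag.entrySet().stream()
        .max(Comparator.comparingInt((Map.Entry<String, Integer> e) -> e.getValue())
            .thenComparing(Map.Entry::getKey))
        .map(Map.Entry::getKey)
        .orElse(null);
  }
}

With a perfect tie such as 2 each of INT8, STRING, and DECIMAL8, the result no longer depends on map iteration order.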

* <li>shred to the most common type
* </ul>
*/
public class VariantShreddingAnalyzer {


Can you add a test scenario for a field where, for example, ZIP codes like 98101, 97201, and 10001 get parsed as different integer types (INT32 + INT16)?

Would a type-family check make more sense? For example, grouping them as:

  • Integer Family: INT8, INT16, INT32, INT64 → promote to most capable type
  • Decimal Family: DECIMAL4, DECIMAL8, DECIMAL16 → promote to most capable type
  • Boolean Family: TRUE, FALSE → treat as single boolean type

A bit along the lines of the Spark-side implementation https://github.com/apache/spark/pull/52406/files#diff-fb3268e5296f089d5f57c168f3e9cd74a401b184db3f30982588a134d8abfa53R322-R326, where all integer types are converted to Long.
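As an illustrative sketch (not this PR's code), promotion within the integer family could just track the widest width observed:

// Treat INT8/INT16/INT32/INT64 as one family and promote to the widest width seen, so
// 98101 (INT32) and 10001 (INT16) still share a single shredded integer column.
class IntegerFamilyPromotion {
  enum IntType { INT8, INT16, INT32, INT64 } // stand-in for the variant physical types

  private IntType widest; // null until the first integer is seen

  void observe(IntType observed) {
    if (widest == null || observed.ordinal() > widest.ordinal()) {
      widest = observed; // declaration order matches increasing width
    }
  }

  IntType shreddedIntegerType() {
    return widest;
  }
}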

if (parquetCompressionLevel != null) {
writeProperties.put(PARQUET_COMPRESSION_LEVEL, parquetCompressionLevel);
}
writeProperties.put(SparkSQLProperties.SHRED_VARIANTS, String.valueOf(shredVariants()));
Contributor

Nit: shredVariants() is evaluated twice. Could we store it in a local boolean shredVariants = shredVariants()?


@Override
public void setColumnStore(ColumnWriteStore columnStore) {
// Ignored for lazy initialization - will be set on actualWriter after initialization
Contributor

setColumnStore is currently a no-op. That’s fine during the buffering phase, but after actualWriter is initialized, Parquet will call setColumnStore again for new row groups. Should we forward the store to actualWriter when it’s non-null (e.g., if (actualWriter != null) actualWriter.setColumnStore(columnStore);) to avoid writing to a stale store?

Also, can we add a regression test that forces multiple row groups (e.g., tiny row-group size) to ensure the writer remains correct across row-group rollover?

64,
ParquetProperties.DEFAULT_PAGE_WRITE_CHECKSUM_ENABLED,
null,
rowGroupOrdinal);
Contributor

We’re constructing a new ColumnChunkPageWriteStore with a hardcoded column index truncate length (64) and fileEncryptor = null. Should we instead reuse the ParquetWriter’s configured values (truncate length / encryption) to avoid behavior differences when variant shredding is enabled? Also, shall we add a small regression test that enables Parquet encryption (or sets a non-default truncate length)?
