HIVE-29287: Iceberg: [V3] Variant Shredding support #6152

deniskuzZ · 2025-10-23T14:45:56Z

What changes were proposed in this pull request?

Support for variant shredding, enabling Hive to write shredded variant data into Iceberg tables.

Ideally, this should follow the approach described in the reader/writer API proposal for Iceberg V4, where an execution engine provides the shredded writer schema.

As an interim solution, this PR introduces a writer that infers the shredded schema from the sample record captured before the Parquet writer is initialized.

Why are the changes needed?

Enables data skipping (predicate pushdown)

Does this PR introduce any user-facing change?

No

How was this patch tested?

TestHiveIcebergSelects#testVariantSelectProjection
variant_type_shredding.q

deniskuzZ · 2025-10-31T13:58:02Z

same thing as apache/iceberg#14297

deniskuzZ · 2025-10-31T15:25:36Z

iceberg/iceberg-handler/src/test/results/positive/variant_type_shredding.q.out

+                TableScan
+                  alias: tbl_shredded_variant
+                  filterExpr: (UDFToDouble(variant_get(data, '$.age')) > 25.0D) (type: boolean)
+                  Statistics: Num rows: 3 Data size: 1020 Basic stats: COMPLETE Column stats: NONE


PPD is not supported here, would be addressed in a separate JIRA

sonarqubecloud · 2025-11-01T12:30:00Z

Quality Gate passed

Issues
27 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

kokila-19 · 2025-11-18T11:51:09Z

I tested variant_type_shredding.q by removing 'variant.shredding.enabled'='true' from the table properties, and the qtest still passes without any failures.
This test verifies that basic INSERT/SELECT operations succeed with VARIANT columns but not actual shredding in the file.
This testing is not possible in qtest.

so maybe we can add a JUnit test (e.g., TestVariantShredding) that:
Writes VARIANT data with variant.shredding.enabled=true and false
Opens the resulting Parquet files via ParquetFileReader
Asserts that the typed_value field is present/absent accordingly

deniskuzZ · 2025-11-26T09:34:21Z

I tested variant_type_shredding.q by removing 'variant.shredding.enabled'='true' from the table properties, and the qtest still passes without any failures. This test verifies that basic INSERT/SELECT operations succeed with VARIANT columns but not actual shredding in the file. This testing is not possible in qtest.

so maybe we can add a JUnit test (e.g., TestVariantShredding) that: Writes VARIANT data with variant.shredding.enabled=true and false Opens the resulting Parquet files via ParquetFileReader Asserts that the typed_value field is present/absent accordingly

that test was added in iceberg: Expose variantShreddingFunc() in Parquet.DataWriteBuilder
added here as well: TestHiveIcebergSelects#testVariantSelectProjection

plan was to fully cover the functionality with explain plan once PPD support is added.

aturoczy

As all of the comment is address, can we merge?

aturoczy · 2025-12-11T23:21:28Z

...g/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/writer/HiveFileWriterFactory.java

+   */
+  public void initialize(Supplier<Record> record) {
+    if (sampleRecord == null) {
+      sampleRecord = record;


It only needs to initialize when the sampleRecord is null? Wouldn't be easier just to always initialize? Maybe there is a special place for caller to handle this.

It captures the first record being written and stores it in sampleRecord. the same strategy is applied in Spark to perform variant shredding.

deniskuzZ · 2025-12-12T05:28:31Z

As all of the comment is address, can we merge?

need to get a green build, it's flaky atm

sonarqubecloud · 2025-12-12T15:19:39Z

Quality Gate passed

Issues
27 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

ayushtkn · 2025-12-13T07:59:25Z

iceberg/iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergSelects.java

+    shell.executeStatement(
+        String.format(
+            "INSERT INTO %s VALUES " +
+                "(1, parse_json('{\"name\":\"Alice\",\"age\":30}'))," +


u change this to "(1, parse_json('null'))," +, the whole feature gets disabled. If u remove the assertion values, this fails

assertThat(variantType.containsField("typed_value")).isTrue();

We could have used DEFAULT values; however, Hive doesn’t support them for STRUCT or VARIANT types.

Default values for payload of type variant are not supported

In Iceberg V4, the execution engine may be able to pass the shredded writer schema, which would make this easier.

fixed in #6234

ayushtkn · 2025-12-13T08:16:33Z

iceberg/iceberg-handler/src/main/java/org/apache/iceberg/parquet/Parquet.java

+import static org.apache.iceberg.TableProperties.PARQUET_ROW_GROUP_SIZE_BYTES;
+import static org.apache.iceberg.TableProperties.PARQUET_ROW_GROUP_SIZE_BYTES_DEFAULT;
+
+// TODO: remove class once upgraded to Iceberg v1.11.0 (https://github.com/apache/iceberg/pull/14153)


@deniskuzZ We copied a 1KLOC file just to fastrack this? we didn't had a hive release coming either

We have done patching for stuff, for stuff which Iceberg community didn't agree or for some bug fixes or to maintain compat with our current version.

Hardly Iceberg people accept stuff, for accepted stuff we should have waited for an official release and this isn't a small file nor something urgent either

Please read the TODO comment carefully.
The proposed three-line change has already been merged into Iceberg but is not yet released.
https://github.com/apache/hive/blame/master/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/parquet/Parquet.java#L209-L212

I requested it to be included in version 1.10.1, but it wasn’t accepted since that release is treated as a patch release and only includes bug fixes. It has already been nearly three months since the 1.10.1 release discussion began.
https://lists.apache.org/thread/9tjs060vzs6nnghk2brcw0hv89h4drp0

they would release it at one point, usually if we need something from a thirdparty lib, we wait for the release, or use the snapshot version till then. We already have copied a lot of iceberg code, one more doesn't make it worse though, but we have to maybe set some precedent one day, what all we can copy & when. Maybe not today or here, but some day.

Lets park a ticket for sure so we don't forget to drop it when we upgarde or if someone else does it, they doesn't skip it

ayushtkn · 2025-12-13T08:22:34Z

iceberg/iceberg-handler/src/test/queries/positive/variant_type_shredding.q

+-- Disable vectorized execution until Variant type is supported
+set hive.vectorized.execution.enabled=false;


removing this isn't throwing any exception. In some cases it is actually giving wrong results. We should fallback ideally if we have a shredded variant column & vectorization enabled.
In your test if you don't have this it gives

Caused by: java.lang.RuntimeException: MALFORMED_VARIANT at org.apache.hadoop.hive.serde2.variant.VariantUtil.malformedVariant(VariantUtil.java:180) at org.apache.hadoop.hive.serde2.variant.Variant.convertToByteArray(Variant.java:81) at org.apache.hadoop.hive.serde2.variant.Variant.from(Variant.java:59) at org.apache.hadoop.hive.ql.udf.generic.GenericUDFVariantGet.evaluate(GenericUDFVariantGet.java:102)

I tried another case, where it silently gives wrong result

CREATE TABLE t ( id INT, v VARIANT ) STORED BY ICEBERG TBLPROPERTIES ( 'format-version'='3', 'variant.shredding.enabled'='true' ); INSERT INTO t VALUES (1, parse_json('{"a": 1}')), (2, parse_json('{"b": 2}')); SELECT try_variant_get(v, '$.a'), try_variant_get(v, '$.b') FROM t ORDER BY id;

With vectorization off

1 NULL NULL 2

With vectorization on

NULL 2 NULL NULL

Without Shredding but vecotrization on

1 NULL NULL 2

Please create a ticket for that.

https://issues.apache.org/jira/browse/HIVE-29372

ayushtkn · 2025-12-13T08:59:10Z

iceberg/iceberg-handler/src/main/java/org/apache/iceberg/parquet/VariantUtil.java

+   */
+  private static boolean hasVariantColumns(Schema schema) {
+    return schema.columns().stream()
+        .anyMatch(field -> field.type() instanceof Types.VariantType);


what if field is of type struct and within that there is a Variant data type?. A table like

CREATE TABLE t_struct_variant ( id INT, s STRUCT< user_id: INT, payload: VARIANT > ) STORED BY ICEBERG TBLPROPERTIES ( 'format-version'='3', 'variant.shredding.enabled'='true' ); -- Insert JSON structures INSERT INTO t_struct_variant VALUES ( 1, named_struct( 'user_id', 101, 'payload', parse_json('{"name":"Alice","age":30}') ) ), ( 2, named_struct( 'user_id', 102, 'payload', parse_json('{"name":"Bob"}') ) ), ( 3, named_struct( 'user_id', 103, 'payload', parse_json('{"active":true,"score":9.5}') ) );

It did had a variant column, but it got skipped

It is not supported, and it generally doesn’t make sense from a modeling perspective. VARIANT is already a container for arbitrary JSON / semi-structured data.

It is supported afaik. Maps & List aren't supported.

Atleast what this databrics doc says
https://docs.databricks.com/aws/en/delta/variant-shredding#limitations

I was searching a bit more when I was trying this:
This is very common:

STRUCT< metadata: STRUCT<ts, source, version>, payload: VARIANT >

Strongly typed envelope

Flexible payload
From a data modeling perspective, this makes perfect sense. Maybe double check once

This is very common

haven't found this on any resources, documentation of variant shredding is pretty limited.

Storing the schema as:

STRUCT< metadata: STRUCT<ts, source, version>, payload: VARIANT >

has some downsides:

Query complexity: Every time you want to filter or join by metadata.ts or metadata.source, you have to dig into the nested struct (PPD for structs is not supported by Hive-Iceberg).

Schema evolution: If you want to add new metadata fields later, nested structs can make evolution tricky without rewriting the table.
not sure we even support this ALTER TABLE my_table ADD COLUMN metadata.new_field STRING;

Analytical usefulness: Often, metadata like ts, source, version is used more frequently for filtering, auditing, or incremental ingestion than the payload itself. Keeping it separate makes this more convenient.

i'll check if that could be easily implemented, though I'm not sure how useful it would be

thnx. I parked one ticket for it: https://issues.apache.org/jira/browse/HIVE-29373

addressed in #6234

ayushtkn · 2025-12-13T09:06:25Z

iceberg/iceberg-handler/src/main/java/org/apache/iceberg/parquet/VariantUtil.java

+      Map<String, String> properties) {
+
+    // Preconditions: must have variant columns + property enabled
+    if (!hasVariantColumns(schema) || !isVariantShreddingEnabled(properties)) {


not a big thing, but i feel in general the check should be flipped

if (!isVariantShreddingEnabled(properties) || !hasVariantColumns(schema))

if isVariantShreddingEnabled isn't enabled, we need not to parse the schema at all, here if it isn't enabled, still the first condition in or will be executed first for no reason, and that is iterating over the entire DS

true, I'll change it in a follow up PR

fixed in #6234

ayushtkn · 2025-12-13T09:19:01Z

iceberg/iceberg-handler/src/main/java/org/apache/iceberg/parquet/VariantUtil.java

+      if (sampleRecord != null) {
+        try {
+          Object variantValue = sampleRecord.getField(name);
+          if (variantValue instanceof Variant variant) {


I think we should infer schema for the first non null variantValue

see #6152 (comment).

To infer the schema from the first non-null value, we must buffer rows until that value is encountered. This adds performance overhead, including increased memory usage for buffering, potential pipeline stalls, and more complex processing.

I think an explicit schema would be a better approach.

just curious does some engine follow this way, like inferring shredding schema from first record. I was looking at a spark commit:
apache/spark@3c3d1a6

it seems explicit only, didn't dig deep though...

yes, i was checking this: apache/iceberg#14297

btw, it doesn't seem that databricks specifies the shredded variant write schema, but does the inference
https://docs.databricks.com/aws/en/delta/variant-shredding

ayushtkn · 2025-12-13T09:40:41Z

...g/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/writer/HiveFileWriterFactory.java

+    if (sampleRecord == null) {
+      sampleRecord = record;


thinking about this, we are taking the first record. Can it lead to Task-level non-determinism, like one insert lead to multiple inserts & each task captures its own first record & schema.

but i think there wasn't a better way, maybe in some later world we allow the user itself to define the columns to be shredded

yes, something like this, an explicit shredded schema would be a better solution

deniskuzZ marked this pull request as draft October 23, 2025 14:46

asf-ci-hive added the tests pending label Oct 23, 2025

deniskuzZ changed the title ~~[DRAFT] HIVE-29287: Variant Shredding~~ [DRAFT] HIVE-29287: Iceberg: Variant Shredding Oct 23, 2025

asf-ci-hive added tests unstable and removed tests pending labels Oct 23, 2025

deniskuzZ force-pushed the HIVE-29287 branch from 3fdd44c to 1d0c300 Compare October 31, 2025 14:13

deniskuzZ changed the title ~~[DRAFT] HIVE-29287: Iceberg: Variant Shredding~~ HIVE-29287: Iceberg: Variant Shredding support Oct 31, 2025

asf-ci-hive added tests pending and removed tests unstable labels Oct 31, 2025

deniskuzZ marked this pull request as ready for review October 31, 2025 14:14

deniskuzZ force-pushed the HIVE-29287 branch from 1d0c300 to 213a62f Compare October 31, 2025 15:24

deniskuzZ commented Oct 31, 2025

View reviewed changes

deniskuzZ requested a review from ayushtkn October 31, 2025 15:37

asf-ci-hive added tests passed tests pending tests unstable and removed tests pending tests passed tests unstable labels Oct 31, 2025

asf-ci-hive added tests passed and removed tests pending labels Nov 1, 2025

deniskuzZ requested a review from difin November 6, 2025 16:19

deniskuzZ mentioned this pull request Dec 2, 2025

HIVE-29354: Iceberg: [V3] Projection and Filter Pushdown for Shredded VARIANT Columns #6224

Open

deniskuzZ changed the title ~~HIVE-29287: Iceberg: Variant Shredding support~~ HIVE-29287: Iceberg: [V3] Variant Shredding support Dec 5, 2025

asf-ci-hive added tests pending and removed tests unstable labels Dec 11, 2025

review comments

d061a53

deniskuzZ force-pushed the HIVE-29287 branch from 1b9fa43 to d061a53 Compare December 11, 2025 16:29

asf-ci-hive added tests failed tests pending tests unstable and removed tests pending tests failed labels Dec 11, 2025

aturoczy reviewed Dec 11, 2025

View reviewed changes

asf-ci-hive added tests pending and removed tests unstable labels Dec 12, 2025

asf-ci-hive added tests unstable tests pending and removed tests pending tests unstable labels Dec 12, 2025

asf-ci-hive added tests passed and removed tests pending labels Dec 12, 2025

deniskuzZ merged commit 78a620e into apache:master Dec 12, 2025
2 checks passed

deniskuzZ deleted the HIVE-29287 branch December 12, 2025 18:06

ayushtkn reviewed Dec 13, 2025

View reviewed changes

ayushtkn mentioned this pull request Dec 18, 2025

HIVE-29372: Iceberg: [V3] Fix inconsistencies when vectorizartion is enabled in case of variant shredding. #6245

Merged

DanielZhu58 pushed a commit to DanielZhu58/hive that referenced this pull request Jan 12, 2026

HIVE-29287: Iceberg: [V3] Variant Shredding support (apache#6152)

1463e76

		-- Disable vectorized execution until Variant type is supported
		set hive.vectorized.execution.enabled=false;

HIVE-29287: Iceberg: [V3] Variant Shredding support #6152

HIVE-29287: Iceberg: [V3] Variant Shredding support #6152

Uh oh!

Conversation

deniskuzZ commented Oct 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

deniskuzZ commented Oct 31, 2025

Uh oh!

deniskuzZ Oct 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sonarqubecloud bot commented Nov 1, 2025

Quality Gate passed

Uh oh!

kokila-19 commented Nov 18, 2025

Uh oh!

deniskuzZ commented Nov 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

aturoczy left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

deniskuzZ Dec 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

deniskuzZ commented Dec 12, 2025

Uh oh!

sonarqubecloud bot commented Dec 12, 2025

Quality Gate passed

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

deniskuzZ Dec 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ayushtkn Dec 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

deniskuzZ Dec 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

deniskuzZ Dec 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

deniskuzZ commented Oct 23, 2025 •

edited

Loading

deniskuzZ Oct 31, 2025 •

edited

Loading

deniskuzZ commented Nov 26, 2025 •

edited

Loading

deniskuzZ Dec 12, 2025 •

edited

Loading

deniskuzZ Dec 15, 2025 •

edited

Loading

ayushtkn Dec 13, 2025 •

edited

Loading

deniskuzZ Dec 15, 2025 •

edited

Loading

deniskuzZ Dec 15, 2025 •

edited

Loading

deniskuzZ Dec 14, 2025 •

edited

Loading

deniskuzZ Dec 15, 2025 •

edited

Loading

deniskuzZ Dec 14, 2025 •

edited

Loading