From f1cd15ac6a51c0a8ec9718321947b76ee63f3b75 Mon Sep 17 00:00:00 2001
From: Matt Topol
Date: Thu, 28 Aug 2025 15:32:09 -0400
Subject: [PATCH 01/25] docs: add variant extension type docs

---
 docs/source/format/CanonicalExtensions.rst | 580 ++++++++++++++++++++-
 docs/source/status.rst | 2 +
 2 files changed, 581 insertions(+), 1 deletion(-)

diff --git a/docs/source/format/CanonicalExtensions.rst b/docs/source/format/CanonicalExtensions.rst
index 22f2efb16f2b..b92e8139d43b 100644
--- a/docs/source/format/CanonicalExtensions.rst
+++ b/docs/source/format/CanonicalExtensions.rst
@@ -417,7 +417,585 @@ better zero-copy compatibility with various systems that also store booleans usi
 
 Metadata is an empty string.
 
-=========================
+.. _variant_extension:
+
+Variant
+=======
+
+Variant represents a value that may be one of:
+
+* Primitive: a type and corresponding value (e.g. INT, STRING)
+
+* Array: An ordered list of Variant values
+
+* Object: An unordered collection of string/Variant pairs (i.e. key/value pairs). An object may not contain duplicate keys
+
+Particularly, this provides a way to represent semi-structured data which is stored as a
+`Parquet Variant `__ value within Arrow columns in
+a lossless fashion. This also provides the ability to represent `shredded `__
+variant values. This will make it easy for systems to pass Variant data around without having to upgrade their Arrow version
+or otherwise require special handling unless they want to directly interact with the encoded variant data. See the previous links
+to the Parquet format specification for details on what the actual binary values should look like.
+
+* Extension name: ``parquet.variant``.
+
+* The storage type of this extension is a ``StructArray`` that obeys the following rules:
+
+  * A *non-nullable* field named ``metadata`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.
+
+  * At least one (or both) of the following:
+
+    * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.
+      *(unshredded variants consist of just the ``metadata`` and ``value`` fields only)*
+
+    * A field named ``typed_value`` which can be any *primitive type* or a ``List``, ``LargeList``, ``ListView`` or ``Struct``
+
+      * If the ``typed_value`` field is a *nested* type, its elements **must** be *non-nullable* and **must** be a ``struct`` consisting of
+        at least one (or both) of the following:
+
+        * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.
+
+        * A field named ``typed_value`` which follows these same rules for the ``typed_value`` field.
+
+.. note::
+
+   It is also *permissible* for the ``metadata`` field to be dictionary-encoded with an index type of ``int8``.
+
+.. note::
+
+   The fields may be in any order, and thus must be accessed by **name** not by *position*.
+
+Examples
+--------
+
+Unshredded
+''''''''''
+
+The simplest case, an unshredded variant always consists of **exactly** two fields: ``metadata`` and ``value``. Any of
+the following storage types are valid (not an exhaustive list):
+
+* ``struct``
+* ``struct``
+* ``struct required, value: binary_view required>``
+
+Simple Shredding
+''''''''''''''''
+
+Suppose a Variant field named *measurement* and we want to shred the ``int64`` values into a separate column for efficiency.
+In Parquet, this could be represented as::
+
+  required group measurement (VARIANT) {
+    required binary metadata;
+    optional binary value;
+    optional int64 typed_value;
+  }
+
+Thus the corresponding storage type for the ``parquet.variant`` Arrow extension type would be: ::
+
+  struct<
+    metadata: binary required,
+    value: binary optional,
+    typed_value: int64 optional
+  >
+
+If we suppose a series of measurements consisting of: ::
+
+  34, null, "n/a", 100
+
+The data should be stored/represented in Arrow as: ::
+
+  * Length: 4, Null count: 1
+  * Validity Bitmap buffer:
+
+    | Byte 0 (validity bitmap) | Bytes 1-63 |
+    |--------------------------|---------------|
+    | 00001011 | 0 (padding) |
+
+  * Children arrays:
+    * field-0 array (`VarBinary`)
+      * Length: 4, Null count: 0
+      * Offsets buffer:
+
+        | Bytes 0-19 | Bytes 20-63 |
+        |------------------|--------------------------|
+        | 0, 2, 4, 6, 8 | unspecified (padding) |
+
+      * Value buffer: (01 00 -> indicates version 1 empty metadata)
+
+        | Bytes 0-7 | Bytes 8-63 |
+        |--------------------|--------------------------|
+        | 01 00 01 00 01 00 | unspecified (padding) |
+
+    * field-1 array (`VarBinary`)
+      * Length: 4, Null count: 2
+      * Validity Bitmap buffer:
+
+        | Byte 0 (validity bitmap) | Bytes 1-63 |
+        |--------------------------|---------------|
+        | 00000110 | 0 (padding) |
+
+      * Offsets buffer:
+
+        | Bytes 0-19 | Bytes 20-63 |
+        |------------------|--------------------------|
+        | 0, 0, 1, 5, 5 | unspecified (padding) |
+
+      * Value buffer: (`00` -> literal null, `0x13 0x6E 0x2F 0x61` -> variant encoding literal string "n/a")
+
+        | Bytes 0-4 | Bytes 5-63 |
+        |------------------------|--------------------------|
+        | 00 0x13 0x6E 0x2F 0x61 | unspecified (padding) |
+
+    * field-2 array (int64 array)
+      * Length: 4, Null count: 2
+      * Validity Bitmap buffer:
+
+        | Byte 0 (validity bitmap) | Bytes 1-63 |
+        |--------------------------|---------------|
+        | 00001001 | 0 (padding) |
+
+      * Value buffer:
+
+        | Bytes 0-31 | Bytes 32-63 |
+        |---------------------|--------------------------|
+        | 34, 00, 00, 100 | unspecified (padding) |
+
+.. note::
+
+   Notice that there is a variant ``literal null`` in the ``value`` array, this is due to the
+   `shredding specification `__
+   so that a consumer can tell the difference between a *missing* field and a **null** field.
+
+Shredding an Array
+''''''''''''''''''
+
+For our next example, we will represent a shredded array of strings. Let's consider a column that looks like: ::
+
+  ["comedy", "drama"], ["horror", null], ["comedy", "drama", "romance"], null
+
+Representing this shredded variant in Parquet could look like: ::
+
+  optional group tags (VARIANT) {
+    required binary metadata;
+    optional binary value;
+    optional group typed_value (LIST) { # optional to allow null lists
+      repeated group list {
+        required group element { # shredded element
+          optional binary value;
+          optional binary typed_value (STRING);
+        }
+      }
+    }
+  }
+
+The array structure for Variant encoding does not allow missing elements, so all elements of the array must
+be *non-nullable*. As such, either **typed_value** or **value** (*but not both!*) must be *non-null*. A null
+element must be encoded as a Variant null: *basic type* ``0`` (primitive) and *physical type* ``0`` (null).
+
+The storage type to represent this in Arrow as a Variant extension type would be: ::
+
+  struct<
+    metadata: binary required,
+    value: binary optional,
+    typed_value: list required> optional
+  >
+
+.. note::
+
+   As usual, **Binary** could also be **LargeBinary** or **BinaryView**, **String** could also be **LargeString** or **StringView**,
+   and **List** could also be **LargeList** or **ListView**.
+
+The data would then be stored in Arrow as follows: ::
+
+  * Length: 4, Null count: 1
+  * Validity Bitmap buffer:
+
+    | Byte 0 (validity bitmap) | Bytes 1-63 |
+    |--------------------------|---------------|
+    | 00000111 | 0 (padding) |
+
+  * Children arrays:
+    * field-0 array (`VarBinary` Metadata)
+      * Length: 4, Null count: 0
+      * Offsets buffer:
+
+        | Bytes 0-19 | Bytes 20-63 |
+        |------------------|--------------------------|
+        | 0, 2, 4, 6, 8 | unspecified (padding) |
+
+      * Value buffer: (01 00 -> indicates version 1 empty metadata)
+
+        | Bytes 0-7 | Bytes 8-63 |
+        |--------------------|--------------------------|
+        | 01 00 01 00 01 00 | unspecified (padding) |
+
+    * field-1 array (`VarBinary` Value)
+      * Length: 4, Null count: 1
+      * Validity bitmap buffer:
+
+        | Byte 0 (validity bitmap) | Bytes 1-63 |
+        |--------------------------|---------------|
+        | 00001000 | 0 (padding) |
+
+      * Offsets buffer:
+
+        | Bytes 0-19 | Bytes 20-63 |
+        |------------------|--------------------------|
+        | 0, 0, 0, 0, 1 | unspecified (padding) |
+
+      * Value buffer: (00 -> variant null)
+
+        | Bytes 0 | Bytes 1-63 |
+        |--------------------|--------------------------|
+        | 00 | unspecified (padding) |
+
+    * field-2 array (`List>` typed_value)
+      * Length: 4, Null count: 1
+      * Validity bitmap buffer:
+
+        | Byte 0 (validity bitmap) | Bytes 1-63 |
+        |--------------------------|-------------|
+        | 00000111 | 0 (padding) |
+
+      * Offsets buffer (int32)
+
+        | Bytes 0-19 | Bytes 20-63 |
+        |-------------------|-----------------------|
+        | 0, 2, 4, 7, 7 | unspecified (padding) |
+
+      * Values array (`Struct` element):
+        * Length: 7, Null count: 0
+        * Validity bitmap buffer: Not required
+
+        * Children arrays:
+          * field-0 array (`VarBinary` value)
+            * Length: 7, Null count: 6
+            * Validity bitmap buffer:
+
+              | Byte 0 (validity bitmap) | Bytes 1-63 |
+              |--------------------------|-------------|
+              | 00001000 | 0 (padding) |
+
+            * Offsets buffer (int32):
+
+              | Bytes 0-31 | Bytes 32-63 |
+              |---------------------------|--------------------------|
+              | 0, 0, 0, 0, 1, 1, 1, 1 | unspecified (padding) |
+
+            * Values buffer (`00` -> variant null):
+
+              | Bytes 0 | Bytes 1-63 |
+              |--------------------|--------------------------|
+              | 00 | unspecified (padding) |
+
+          * field-1 array (`String` typed_value)
+            * Length: 7, Null count: 1
+            * Validity bitmap buffer:
+
+              | Byte 0 (validity bitmap) | Bytes 1-63 |
+              |--------------------------|-------------|
+              | 01110111 | 0 (padding) |
+
+            * Offsets buffer (int32):
+
+              | Bytes 0-31 | Bytes 32-63 |
+              |---------------------------------|--------------------------|
+              | 0, 6, 11, 17, 17, 23, 28, 35 | unspecified (padding) |
+
+            * Values buffer:
+
+              | Bytes 0-35 | Bytes 36-63 |
+              |--------------------------------------|--------------------------|
+              | comedydramahorrorcomedydramaromance | unspecified (padding) |
+
+Shredding an Object
+'''''''''''''''''''
+
+Let's consider a simple JSON column of "events" which contain a field named ``event_type`` (a string)
+and a field named ``event_ts`` (a timestamp) that we wish to shred into separate columns, In Parquet,
+it could look something like this: ::
+
+  optional group event (VARIANT) {
+    required binary metadata;
+    optional binary value; # variant, remaining fields/values
+    optional group typed_value { # shredded fields for variant object
+      required group event_type { # event_type shredded field
+        optional binary value;
+        optional binary typed_value (STRING);
+      }
+      required group event_ts { # event_ts shredded field
+        optional binary value;
+        optional int64 typed_value (TIMESTAMP(true, MICROS))
+      }
+    }
+  }
+
+We can then, fairly easily, translate this into the expected extension storage type: ::
+
+  struct<
+    metadata: binary required,
+    value: binary optional,
+    typed_value: struct<
+      event_type: struct<
+        value: binary optional,
+        typed_value: string optional
+      > required,
+      event_ts: struct<
+        value: binary optional,
+        typed_value: timestamp(us, UTC) optional
+      > required
+    > optional
+  >
+
+If a field *does not exist* in the variant object value, then both the **value** and **typed_value** columns for that row
+will be null. If a field is *present*, but the value is null, then **value** must contain a Variant null: *basic type*
+``0`` (primitive) and *physical type* ``0`` (null).
+
+It is *invalid* for both **value** and **typed_value** to be non-null for a given index. A reader can choose not to error
+in this scenario, but if so it **must** use the value in the **typed_value** column for that index.
+
+Let's consider the following series of objects: ::
+
+  {"event_type": "noop", "event_ts": 1729794114937}
+
+  {"event_type": "login", "event_ts": 1729794146402, "email": "user@example.com"}
+
+  {"error_msg": "malformed..."}
+
+  "malformed: not an object"
+
+  {"event_ts": 1729794240241, "click": "_button"}
+
+  {"event_ts": null, "event_ts": 1729794954163}
+
+  {"event_type": "noop", "event_ts": "2024-10-24"}
+
+  {}
+
+  null
+
+  *Entirely missing*
+
+To represent those values as a column of Variant values using the Variant extension type we get the following: ::
+
+  * Length: 10, Null count: 1
+  * Validity bitmap buffer:
+
+    | Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 |
+    |--------------------------|-----------|-----------------------|
+    | 11111111 | 00000001 | 0 (padding) |
+
+  * Children arrays
+    * field-0 array (`VarBinary` Metadata)
+      * Length: 10, Null count: 0
+      * Offsets buffer:
+
+        | Bytes 0-43 (int32) | Bytes 44-63 |
+        |------------------------------------------|-------------------------|
+        | 0, 2, 11, 24, 26, 35, 37, 39, 41, 43, 45 | unspecified (padding) |
+
+      * Value buffer: (01 00 -> version 1 empty metadata,
+        01 01 00 XX ... -> Version 1, metadata with 1 elem, offset 0, offset XX == len(string), ... is dict string bytes)
+
+        | Bytes 0-1 | Bytes 2-10 | Bytes 11-23 | Bytes 24-25 | Bytes 26-34 |
+        |-------------------------------|-----------------------|-------------|-------------------|
+        | 01 00 | 01 01 00 05 email | 01 01 00 09 error_msg | 01 00 | 01 01 00 05 click |
+
+        | Bytes 35-36 | Bytes 37-38 | Bytes 39-40 | Bytes 41-42 | Bytes 43-44 | Bytes 45-63 |
+        |-------------|-------------|-------------|-------------|-------------|-----------------------|
+        | 01 00 | 01 00 | 01 00 | 01 00 | 01 00 | unspecified (padding) |
+
+    * field-1 array (`VarBinary` Value)
+      * Length: 10, Null count: 5
+      * Validity bitmap buffer:
+
+        | Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 |
+        |---------------------------|-----------|-----------------------|
+        | 00011110 | 00000001 | 0 (padding) |
+
+      * Offsets buffer (filled in based on lengths of encoded variants):
+
+        | ... |
+
+      * Value buffer:
+
+        | VariantEncode({"email": "user@email.com"}) | VariantEncode({"error_msg": "malformed..."}) |
+        | VariantEncode("malformed: not an object") | VariantEncode({"click": "_button"}) | 00 (null) |
+
+    * field-2 array (`Struct<...>` typed_value)
+      * Length: 10, Null count: 3
+      * Validity bitmap buffer:
+
+        | Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 |
+        |--------------------------|-----------|-----------------------|
+        | 11110111 | 00000000 | 0 (padding) |
+
+      * Children arrays:
+        * field-0 array (`Struct` event_type)
+          * Length: 10, Null count: 0
+          * Validity bitmap buffer: not required
+
+          * Children arrays
+            * field-0 array (`VarBinary` value)
+              * Length: 10, Null count: 9
+              * Validity bitmap buffer:
+
+                | Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 |
+                |--------------------------|-----------|-----------------------|
+                | 01000000 | 00000000 | 0 (padding) |
+
+              * Offsets buffer (int32)
+
+                | Bytes 0-43 (int32) | Bytes 44-63 |
+                |---------------------------------|-------------------------|
+                | 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1 | unspecified (padding) |
+
+              * Value buffer:
+
+                | Byte 0 | Bytes 1-63 |
+                |--------|------------------------|
+                | 00 | unspecified (padding) |
+
+            * field-1 array (`String` typed_value)
+              * Length: 10, Null count: 7
+              * Validity bitmap buffer:
+
+                | Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 |
+                |--------------------------|-----------|-----------------------|
+                | 01000011 | 00000000 | 0 (padding) |
+
+              * Offsets buffer (int32)
+
+                | Byte 0-43 | Bytes 44-63 |
+                |-------------------------------------|------------------------|
+                | 0, 4, 9, 9, 9, 9, 9, 13, 13, 13, 13 | unspecified (padding) |
+
+              * Value buffer:
+
+                | Bytes 0-3 | Bytes 4-8 | Bytes 9-12 | Bytes 13-63 |
+                |-----------|-----------|------------|------------------------|
+                | noop | login | noop | unspecified (padding) |
+
+
+        * field-1 array (`Struct` event_ts)
+          * Length: 10, Null count: 0
+          * Validity bitmap buffer: not required
+
+          * Children arrays
+            * field-0 array (`VarBinary` value)
+              * Length: 10, Null count: 9
+              * Validity bitmap buffer:
+
+                | Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 |
+                |--------------------------|-----------|-----------------------|
+                | 01000000 | 00000000 | 0 (padding) |
+
+              * Offsets buffer (int32)
+
+                | Bytes 0-43 (int32) | Bytes 44-63 |
+                |---------------------------------|-------------------------|
+                | ... | unspecified (padding) |
+
+              * Value buffer:
+
+                | VariantEncode("2024-10-24") |
+
+            * field-1 array (`Timestamp(us, UTC)` typed_value)
+              * Length: 10, Null count: 6
+              * Validity bitmap buffer:
+
+                | Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 |
+                |--------------------------|-----------|-----------------------|
+                | 00110011 | 00000000 | 0 (padding) |
+
+              * Value buffer:
+
+                | Bytes 0-7 | Bytes 8-15 | Bytes 16-31 | Bytes 32-39 | Bytes 40-47 | Bytes 48-63 |
+                |---------------|---------------|--------------|---------------|---------------|------------------------|
+                | 1729794114937 | 1729794146402 | unspecified | 1729794240241 | 1729794954163 | unspecified (padding) |
+
+
+Putting it all together
+'''''''''''''''''''''''
+
+As mentioned, the **typed_value** field associated with a Variant **value** can be of any shredded type. As a result,
+as long as we follow the original rules you can have an arbitrary number of nested levels based on how you want to
+shred the object. For example, we might have a few more fields alongside **event_type** to shred out. Possibly an object
+that looks like this: ::
+
+  {
+    "event_type": "login",
+    “event_ts”: 1729794114937,
+    “location”: { “longitude”: 1.5, “latitude”: 5.5 },
+    “tags”: [“foo”, “bar”, “baz”]
+  }
+
+If we shred the extra fields out and represent it as Parquet it looks like: ::
+
+  optional group event (VARIANT) {
+    required binary metadata;
+    optional binary value; # variant, remaining fields/values
+    optional group typed_value { # shredded fields for variant object
+      required group event_type { # event_type shredded field
+        optional binary value;
+        optional binary typed_value (STRING);
+      }
+      required group event_ts { # event_ts shredded field
+        optional binary value;
+        optional int64 typed_value (TIMESTAMP(true, MICROS))
+      }
+      required group location { # location shredded field
+        optional binary value;
+        optional group typed_value {
+          required group longitude {
+            optional binary value;
+            optional float64 typed_value;
+          }
+          required group latitude {
+            optional binary value;
+            optional float64 typed_value;
+          }
+        }
+      }
+      required group tags { # tags shredded field
+        optional binary value;
+        optional group typed_value (LIST) {
+          repeated group list {
+            required group element {
+              optional binary value;
+              optional binary typed_value (STRING);
+            }
+          }
+        }
+      }
+    }
+  }
+
+Finally, following the rules we set forth on constructing the Variant Extension Type storage type, we end up with: ::
+
+  struct<
+    metadata: binary required,
+    value: binary optional,
+    typed_value: struct<
+      event_type: struct required,
+      event_ts: struct required,
+      location: struct<
+        value: binary optional,
+        typed_value: struct<
+          longitude: struct required,
+          latitude: struct required
+      > optional> required,
+      tags: struct<
+        value: binary optional,
+        typed_value: list required> optional
+      > required
+    > optional
+  >
+
+
 Community Extension Types
 =========================
 
diff --git a/docs/source/status.rst b/docs/source/status.rst
index 0b124bbbebab..ec28d734f26c 100644
--- a/docs/source/status.rst
+++ b/docs/source/status.rst
@@ -129,6 +129,8 @@ Data Types
 +-----------------------+-------+-------+-------+------------+-------+-------+-------+-------+
 | 8-bit Boolean | ✓ | | ✓ | | | | | |
 +-----------------------+-------+-------+-------+------------+-------+-------+-------+-------+
+| Variant | | | ✓ | | | | | |
++-----------------------+-------+-------+-------+------------+-------+-------+-------+-------+
 
 Notes:
 
From 712ad764f1bc984046a78e00f7cb37ef425d76c3 Mon Sep 17 00:00:00 2001
From: Matt Topol
Date: Thu, 28 Aug 2025 15:41:52 -0400
Subject: [PATCH 02/25] whitespace

---
 docs/source/format/CanonicalExtensions.rst | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/source/format/CanonicalExtensions.rst b/docs/source/format/CanonicalExtensions.rst
index b92e8139d43b..290456bcd6b7 100644
--- a/docs/source/format/CanonicalExtensions.rst
+++ b/docs/source/format/CanonicalExtensions.rst
@@ -430,7 +430,7 @@ Variant represents a value that may be one of:
 
 * Object: An unordered collection of string/Variant pairs (i.e. key/value pairs). An object may not contain duplicate keys
 
-Particularly, this provides a way to represent semi-structured data which is stored as a
+Particularly, this provides a way to represent semi-structured data which is stored as a
 `Parquet Variant `__ value within Arrow columns in
 a lossless fashion. This also provides the ability to represent `shredded `__
 variant values. This will make it easy for systems to pass Variant data around without having to upgrade their Arrow version
@@ -714,7 +714,7 @@ The data would then be stored in Arrow as follows: ::
 Shredding an Object
 '''''''''''''''''''
 
-Let's consider a simple JSON column of "events" which contain a field named ``event_type`` (a string)
+Let's consider a simple JSON column of "events" which contain a field named ``event_type`` (a string)
 and a field named ``event_ts`` (a timestamp) that we wish to shred into separate columns, In Parquet,
 it could look something like this: ::
 
@@ -751,7 +751,7 @@ We can then, fairly easily, translate this into the expected extension storage t
   >
 
 If a field *does not exist* in the variant object value, then both the **value** and **typed_value** columns for that row
-will be null. If a field is *present*, but the value is null, then **value** must contain a Variant null: *basic type*
+will be null. If a field is *present*, but the value is null, then **value** must contain a Variant null: *basic type*
 ``0`` (primitive) and *physical type* ``0`` (null).
 
 It is *invalid* for both **value** and **typed_value** to be non-null for a given index. A reader can choose not to error
@@ -797,7 +797,7 @@ To represent those values as a column of Variant values using the Variant extens
         |------------------------------------------|-------------------------|
         | 0, 2, 11, 24, 26, 35, 37, 39, 41, 43, 45 | unspecified (padding) |
 
-      * Value buffer: (01 00 -> version 1 empty metadata,
+      * Value buffer: (01 00 -> version 1 empty metadata,
         01 01 00 XX ... -> Version 1, metadata with 1 elem, offset 0, offset XX == len(string), ... is dict string bytes)
 
         | Bytes 0-1 | Bytes 2-10 | Bytes 11-23 | Bytes 24-25 | Bytes 26-34 |
@@ -874,7 +874,7 @@ To represent those values as a column of Variant values using the Variant extens
                 | 0, 4, 9, 9, 9, 9, 9, 13, 13, 13, 13 | unspecified (padding) |
 
               * Value buffer:
-
+
                 | Bytes 0-3 | Bytes 4-8 | Bytes 9-12 | Bytes 13-63 |
                 |-----------|-----------|------------|------------------------|
                 | noop | login | noop | unspecified (padding) |
From aa0611c6745743654a491f88b26df7afbd65b8f0 Mon Sep 17 00:00:00 2001
From: Matt Topol
Date: Thu, 28 Aug 2025 16:07:31 -0400
Subject: [PATCH 03/25] Update docs/source/format/CanonicalExtensions.rst

Co-authored-by: Ian Cook
---
 docs/source/format/CanonicalExtensions.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/format/CanonicalExtensions.rst b/docs/source/format/CanonicalExtensions.rst
index 290456bcd6b7..c692f2fe5e58 100644
--- a/docs/source/format/CanonicalExtensions.rst
+++ b/docs/source/format/CanonicalExtensions.rst
@@ -433,7 +433,7 @@ Variant represents a value that may be one of:
 Particularly, this provides a way to represent semi-structured data which is stored as a
 `Parquet Variant `__ value within Arrow columns in
 a lossless fashion. This also provides the ability to represent `shredded `__
-variant values. This will make it easy for systems to pass Variant data around without having to upgrade their Arrow version
+variant values. This will make it possible for systems to pass Variant data around without having to upgrade their Arrow version
 or otherwise require special handling unless they want to directly interact with the encoded variant data. See the previous links
 to the Parquet format specification for details on what the actual binary values should look like.
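
The per-row rule that patch 01 states for a shredded object field can be sketched in Python: both columns null means the field is missing, a Variant null in **value** means the field is present but null, and **typed_value** wins if a writer wrongly fills both. This is a minimal illustration and not part of the patch series. The function name ``resolve_shredded_field`` and its return labels are hypothetical, and a Python ``None`` stands in for a null slot in an Arrow column.

```python
from typing import Any, Optional, Tuple

# Variant null as described in the patch: basic type 0 (primitive),
# physical type 0 (null), which is the single byte 0x00.
VARIANT_NULL = b"\x00"


def resolve_shredded_field(value: Optional[bytes], typed_value: Any) -> Tuple[str, Any]:
    """Classify one row of a shredded object field (hypothetical helper).

    `value` is the variant-encoded bytes for the row, or None if that slot is null.
    `typed_value` is the shredded value for the row, or None if that slot is null.
    """
    if value is None and typed_value is None:
        # The field does not exist in this row's object.
        return ("missing", None)
    if typed_value is not None:
        # Covers the invalid "both non-null" case too: a reader that chooses
        # not to error must use the typed_value column.
        return ("typed", typed_value)
    if value == VARIANT_NULL:
        # The field is present, but its value is null.
        return ("null", None)
    # Anything else is still variant-encoded and needs a Variant decoder.
    return ("encoded", value)


if __name__ == "__main__":
    rows = [(None, "noop"), (None, None), (VARIANT_NULL, None), (b"\x01\x02", None)]
    for value, typed_value in rows:
        print(resolve_shredded_field(value, typed_value))
```

The check uses ``is not None`` rather than truthiness so that shredded values such as ``0`` or an empty string are still treated as present.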
From fab0ea4fe318634f71901b0121eadd39cb9520e3 Mon Sep 17 00:00:00 2001
From: Matt Topol
Date: Thu, 28 Aug 2025 18:15:47 -0400
Subject: [PATCH 04/25] update from comment

---
 docs/source/format/CanonicalExtensions.rst | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/docs/source/format/CanonicalExtensions.rst b/docs/source/format/CanonicalExtensions.rst
index c692f2fe5e58..bfdd0d0da33e 100644
--- a/docs/source/format/CanonicalExtensions.rst
+++ b/docs/source/format/CanonicalExtensions.rst
@@ -45,7 +45,11 @@ types:
 
 * The specification text to be added *must* follow these requirements:
 
-  1) It *must* define a well-defined extension name starting with "``arrow.``".
+  1) It *must* define a well-defined extension name starting with an allowed prefix.
+     The currently allowed prefixes are:
+     * "``arrow.``" - For general-purpose canonical extension types.
+     * "``parquet.``" - For canonical extension types that are intended primarily for
+       interoperability with `Apache Parquet `__ format.
 
   2) Its parameters, if any, *must* be described in the proposal.
 
From 8e4a24ce1d3bfefb4b8c98e9dab29a707bc8c2be Mon Sep 17 00:00:00 2001
From: Matt Topol
Date: Fri, 29 Aug 2025 10:24:19 -0400
Subject: [PATCH 05/25] Apply suggestions from code review

Co-authored-by: David Li
Co-authored-by: Dewey Dunnington
---
 docs/source/format/CanonicalExtensions.rst | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/docs/source/format/CanonicalExtensions.rst b/docs/source/format/CanonicalExtensions.rst
index bfdd0d0da33e..ded52da094b1 100644
--- a/docs/source/format/CanonicalExtensions.rst
+++ b/docs/source/format/CanonicalExtensions.rst
@@ -443,7 +443,7 @@ to the Parquet format specification for details on what the actual binary values
 
 * Extension name: ``parquet.variant``.
-* The storage type of this extension is a ``StructArray`` that obeys the following rules:
+* The storage type of this extension is a ``Struct`` that obeys the following rules:
 
   * A *non-nullable* field named ``metadata`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.
 
@@ -452,9 +452,9 @@ to the Parquet format specification for details on what the actual binary values
     * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.
       *(unshredded variants consist of just the ``metadata`` and ``value`` fields only)*
 
-    * A field named ``typed_value`` which can be any *primitive type* or a ``List``, ``LargeList``, ``ListView`` or ``Struct``
+    * A field named ``typed_value`` which can be any :term:`primitive type` or a ``List``, ``LargeList``, ``ListView`` or ``Struct``
 
-      * If the ``typed_value`` field is a *nested* type, its elements **must** be *non-nullable* and **must** be a ``struct`` consisting of
+      * If the ``typed_value`` field is a *nested* type, its elements **must** be *non-nullable* and **must** be a ``Struct`` consisting of
         at least one (or both) of the following:
 
         * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.
@@ -463,11 +463,11 @@ to the Parquet format specification for details on what the actual binary values
 
 .. note::
 
-   It is also *permissible* for the ``metadata`` field to be dictionary-encoded with an index type of ``int8``.
+   It is also *permissible* for the ``metadata`` field to be dictionary-encoded with an index type of ``int8``.
 
 .. note::
 
-   The fields may be in any order, and thus must be accessed by **name** not by *position*.
+   The fields may be in any order, and thus must be accessed by **name** not by *position*.
 Examples
 --------
@@ -485,7 +485,7 @@ the following storage types are valid (not an exhaustive list):
 Simple Shredding
 ''''''''''''''''
 
-Suppose a Variant field named *measurement* and we want to shred the ``int64`` values into a separate column for efficiency.
+Suppose we have a Variant field named *measurement* and we want to shred the ``int64`` values into a separate column for efficiency.
 In Parquet, this could be represented as::
 
   required group measurement (VARIANT) {
@@ -718,9 +718,9 @@ The data would then be stored in Arrow as follows: ::
 Shredding an Object
 '''''''''''''''''''
 
-Let's consider a simple JSON column of "events" which contain a field named ``event_type`` (a string)
+Let's consider a JSON column of "events" which contain a field named ``event_type`` (a string)
 and a field named ``event_ts`` (a timestamp) that we wish to shred into separate columns, In Parquet,
-it could look something like this: ::
+it could look something like this::
 
   optional group event (VARIANT) {
     required binary metadata;
@@ -737,7 +737,7 @@ it could look something like this: ::
     }
   }
 
-We can then, fairly easily, translate this into the expected extension storage type: ::
+We can then translate this into the expected extension storage type: ::
 
   struct<
     metadata: binary required,
@@ -926,7 +926,7 @@ Putting it all together
 '''''''''''''''''''''''
 
 As mentioned, the **typed_value** field associated with a Variant **value** can be of any shredded type. As a result,
-as long as we follow the original rules you can have an arbitrary number of nested levels based on how you want to
+as long as we follow the original rules we can have an arbitrary number of nested levels based on how you want to
 shred the object. For example, we might have a few more fields alongside **event_type** to shred out. Possibly an object
 that looks like this: ::
 
From d0c218b98a635389d2f6fdae371de26a7dda1cbc Mon Sep 17 00:00:00 2001
From: Matt Topol
Date: Fri, 29 Aug 2025 10:36:38 -0400
Subject: [PATCH 06/25] updates from feedback

---
 docs/source/format/CanonicalExtensions.rst | 99 +++++++++++-----------
 1 file changed, 49 insertions(+), 50 deletions(-)

diff --git a/docs/source/format/CanonicalExtensions.rst b/docs/source/format/CanonicalExtensions.rst
index ded52da094b1..20a4c3b6c08f 100644
--- a/docs/source/format/CanonicalExtensions.rst
+++ b/docs/source/format/CanonicalExtensions.rst
@@ -423,8 +423,8 @@ better zero-copy compatibility with various systems that also store booleans usi
 
 .. _variant_extension:
 
-Variant
-=======
+Parquet Variant
+===============
 
 Variant represents a value that may be one of:
 
@@ -478,9 +478,9 @@ Unshredded
 The simplest case, an unshredded variant always consists of **exactly** two fields: ``metadata`` and ``value``. Any of
 the following storage types are valid (not an exhaustive list):
 
-* ``struct``
-* ``struct``
-* ``struct required, value: binary_view required>``
+* ``struct``
+* ``struct``
+* ``struct non-nullable, value: binary_view nullable>``
 
 Simple Shredding
 ''''''''''''''''
@@ -494,7 +494,7 @@ In Parquet, this could be represented as::
     optional int64 typed_value;
   }
 
-Thus the corresponding storage type for the ``parquet.variant`` Arrow extension type would be: ::
+Thus the corresponding storage type for the ``parquet.variant`` Arrow extension type would be::
 
   struct<
     metadata: binary required,
@@ -502,11 +502,11 @@ Thus the corresponding storage type for the ``parquet.variant`` Arrow extension
     typed_value: int64 optional
   >
 
-If we suppose a series of measurements consisting of: ::
+If we suppose a series of measurements consisting of::
 
   34, null, "n/a", 100
 
-The data should be stored/represented in Arrow as: ::
+The data should be stored/represented in Arrow as::
 
   * Length: 4, Null count: 1
   * Validity Bitmap buffer:
@@ -566,9 +566,10 @@ The data should be stored/represented in Arrow as: ::
 
 .. note::
 
-   Notice that there is a variant ``literal null`` in the ``value`` array, this is due to the
-   `shredding specification `__
-   so that a consumer can tell the difference between a *missing* field and a **null** field.
+   Notice that there is a variant ``literal null`` in the ``value`` array, this is due to the
+   `shredding specification `__
+   so that a consumer can tell the difference between a *missing* field and a **null** field. A null
+   element must be encoded as a Variant null: *basic type* ``0`` (primitive) and *physical type* ``0`` (null).
 
 Shredding an Array
 ''''''''''''''''''
@@ -577,7 +578,7 @@ For our next example, we will represent a shredded array of strings. Let's consi
 
   ["comedy", "drama"], ["horror", null], ["comedy", "drama", "romance"], null
 
-Representing this shredded variant in Parquet could look like: ::
+Representing this shredded variant in Parquet could look like::
 
   optional group tags (VARIANT) {
     required binary metadata;
@@ -593,18 +594,17 @@ Representing this shredded variant in Parquet could look like: ::
   }
 
 The array structure for Variant encoding does not allow missing elements, so all elements of the array must
-be *non-nullable*. As such, either **typed_value** or **value** (*but not both!*) must be *non-null*. A null
-element must be encoded as a Variant null: *basic type* ``0`` (primitive) and *physical type* ``0`` (null).
+be *non-nullable*. As such, either **typed_value** or **value** (*but not both!*) must be *non-null*.
 
-The storage type to represent this in Arrow as a Variant extension type would be: ::
+The storage type to represent this in Arrow as a Variant extension type would be::
 
   struct<
-    metadata: binary required,
-    value: binary optional,
+    metadata: binary non-nullable,
+    value: binary nullable,
     typed_value: list required> optional
+      value: binary nullable,
+      typed_value: string nullable
+    > required> nullable
   >
 
 .. note::
@@ -612,7 +612,7 @@ The storage type to represent this in Arrow as a Variant extension type would be
    As usual, **Binary** could also be **LargeBinary** or **BinaryView**, **String** could also be **LargeString** or **StringView**,
    and **List** could also be **LargeList** or **ListView**.
 
-The data would then be stored in Arrow as follows: ::
+The data would then be stored in Arrow as follows::
 
   * Length: 4, Null count: 1
   * Validity Bitmap buffer:
@@ -737,31 +737,30 @@ it could look something like this::
     }
   }
 
-We can then translate this into the expected extension storage type: ::
+We can then translate this into the expected extension storage type::
 
   struct<
-    metadata: binary required,
-    value: binary optional,
+    metadata: binary non-nullable,
+    value: binary nullable,
     typed_value: struct<
       event_type: struct<
-        value: binary optional,
-        typed_value: string optional
-      > required,
+        value: binary nullable,
+        typed_value: string nullable
+      > non-nullable,
       event_ts: struct<
-        value: binary optional,
-        typed_value: timestamp(us, UTC) optional
-      > required
-    > optional
+        value: binary nullable,
+        typed_value: timestamp(us, UTC) nullable
+      > non-nullable
+    > nullable
   >
 
 If a field *does not exist* in the variant object value, then both the **value** and **typed_value** columns for that row
-will be null. If a field is *present*, but the value is null, then **value** must contain a Variant null: *basic type*
-``0`` (primitive) and *physical type* ``0`` (null).
+will be null. If a field is *present*, but the value is null, then **value** must contain a Variant null.
 
 It is *invalid* for both **value** and **typed_value** to be non-null for a given index. A reader can choose not to error
 in this scenario, but if so it **must** use the value in the **typed_value** column for that index.
-Let's consider the following series of objects: :: +Let's consider the following series of objects:: {"event_type": "noop", "event_ts": 1729794114937} @@ -783,7 +782,7 @@ Let's consider the following series of objects: :: *Entirely missing* -To represent those values as a column of Variant values using the Variant extension type we get the following: :: +To represent those values as a column of Variant values using the Variant extension type we get the following:: * Length: 10, Null count: 1 * Validity bitmap buffer: @@ -928,7 +927,7 @@ Putting it all together As mentioned, the **typed_value** field associated with a Variant **value** can be of any shredded type. As a result, as long as we follow the original rules we can have an arbitrary number of nested levels based on how you want to shred the object. For example, we might have a few more fields alongside **event_type** to shred out. Possibly an object -that looks like this: :: +that looks like this:: { "event_type": "login", @@ -937,7 +936,7 @@ that looks like this: :: “tags”: [“foo”, “bar”, “baz”] } -If we shred the extra fields out and represent it as Parquet it looks like: :: +If we shred the extra fields out and represent it as Parquet it looks like:: optional group event (VARIANT) { required binary metadata; @@ -978,25 +977,25 @@ If we shred the extra fields out and represent it as Parquet it looks like: :: } } -Finally, following the rules we set forth on constructing the Variant Extension Type storage type, we end up with: :: +Finally, following the rules we set forth on constructing the Variant Extension Type storage type, we end up with:: struct< - metadata: binary required, - value: binary optional, + metadata: binary non-nullable, + value: binary nullable, typed_value: struct< - event_type: struct required, - event_ts: struct required, + event_type: struct non-nullable, + event_ts: struct non-nullable, location: struct< - value: binary optional, + value: binary nullable, typed_value: struct< - 
longitude: struct required, - latitude: struct required - > optional> required, + longitude: struct non-nullable, + latitude: struct non-nullable + > nullable> non-nullable, tags: struct< - value: binary optional, - typed_value: list required> optional - > required - > optional + value: binary nullable, + typed_value: list non-nullable> nullable + > non-nullable + > nullable > From 60e21dd9939d765bddf977d8a0229bcc98f94995 Mon Sep 17 00:00:00 2001 From: Matt Topol Date: Mon, 1 Sep 2025 13:42:07 -0400 Subject: [PATCH 07/25] Update docs/source/format/CanonicalExtensions.rst Co-authored-by: Gang Wu --- docs/source/format/CanonicalExtensions.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/format/CanonicalExtensions.rst b/docs/source/format/CanonicalExtensions.rst index 20a4c3b6c08f..0d9fb23e30f3 100644 --- a/docs/source/format/CanonicalExtensions.rst +++ b/docs/source/format/CanonicalExtensions.rst @@ -636,7 +636,7 @@ The data would then be stored in Arrow as follows:: |--------------------|--------------------------| | 01 00 01 00 01 00 | unspecified (padding) | - * field-1 array (`VarBinary` Value) + * field-1 array (`VarBinary` value) * Length: 4, Null count: 1 * Validity bitmap buffer: From 9c8d05f5a91f4b3bcb8bfd7617bcffbb6ddd5661 Mon Sep 17 00:00:00 2001 From: Matt Topol Date: Mon, 1 Sep 2025 13:42:17 -0400 Subject: [PATCH 08/25] Update docs/source/format/CanonicalExtensions.rst Co-authored-by: Gang Wu --- docs/source/format/CanonicalExtensions.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/format/CanonicalExtensions.rst b/docs/source/format/CanonicalExtensions.rst index 0d9fb23e30f3..cb31c37facc7 100644 --- a/docs/source/format/CanonicalExtensions.rst +++ b/docs/source/format/CanonicalExtensions.rst @@ -622,7 +622,7 @@ The data would then be stored in Arrow as follows:: | 00000111 | 0 (padding) | * Children arrays: - * field-0 array (`VarBinary` Metadata) + * field-0 array (`VarBinary` metadata) 
         * Length: 4, Null count: 0
         * Offsets buffer:

From e91dbd69919d9a553477ab188f7261b28e55708d Mon Sep 17 00:00:00 2001
From: Matt Topol
Date: Mon, 1 Sep 2025 13:42:58 -0400
Subject: [PATCH 09/25] Update docs/source/format/CanonicalExtensions.rst

Co-authored-by: Gang Wu
---
 docs/source/format/CanonicalExtensions.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/source/format/CanonicalExtensions.rst b/docs/source/format/CanonicalExtensions.rst
index cb31c37facc7..75fcf3d0002e 100644
--- a/docs/source/format/CanonicalExtensions.rst
+++ b/docs/source/format/CanonicalExtensions.rst
@@ -526,9 +526,9 @@ The data should be stored/represented in Arrow as::

         * Value buffer: (01 00 -> indicates version 1 empty metadata)

-          | Bytes 0-7          | Bytes 8-63               |
-          |--------------------|--------------------------|
-          | 01 00 01 00 01 00  | unspecified (padding)    |
+          | Bytes 0-7               | Bytes 8-63               |
+          |-------------------------|--------------------------|
+          | 01 00 01 00 01 00 01 00 | unspecified (padding)    |

       * field-1 array (`VarBinary`)
         * Length: 4, Null count: 2

From 9ca31a58fdb41bf4ad92ef0e112e2fe1cf099da0 Mon Sep 17 00:00:00 2001
From: Matt Topol
Date: Mon, 1 Sep 2025 13:44:41 -0400
Subject: [PATCH 10/25] fix from feedback

---
 docs/source/format/CanonicalExtensions.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/source/format/CanonicalExtensions.rst b/docs/source/format/CanonicalExtensions.rst
index 75fcf3d0002e..5f3be85b3469 100644
--- a/docs/source/format/CanonicalExtensions.rst
+++ b/docs/source/format/CanonicalExtensions.rst
@@ -632,9 +632,9 @@ The data would then be stored in Arrow as follows::

         * Value buffer: (01 00 -> indicates version 1 empty metadata)

-          | Bytes 0-7          | Bytes 8-63               |
-          |--------------------|--------------------------|
-          | 01 00 01 00 01 00  | unspecified (padding)    |
+          | Bytes 0-7               | Bytes 8-63               |
+          |-------------------------|--------------------------|
+          | 01 00 01 00 01 00 01 00 | unspecified (padding)    |

       * field-1 array (`VarBinary` value)
         * Length: 4, Null count: 1

From 53524961cc4fe6c162d391d4d59b955c1e96ecf6 Mon Sep 17 00:00:00 2001
From: Matt Topol
Date: Mon, 1 Sep 2025 13:45:48 -0400
Subject: [PATCH 11/25] clarify case sensitive

---
 docs/source/format/CanonicalExtensions.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/format/CanonicalExtensions.rst b/docs/source/format/CanonicalExtensions.rst
index 5f3be85b3469..ee7509847d43 100644
--- a/docs/source/format/CanonicalExtensions.rst
+++ b/docs/source/format/CanonicalExtensions.rst
@@ -467,7 +467,7 @@ to the Parquet format specification for details on what the actual binary values

 .. note::

-   The fields may be in any order, and thus must be accessed by **name** not by *position*.
+   The fields may be in any order, and thus must be accessed by **name** not by *position*. The field names are case sensitive.

 Examples
 --------

From 35c89a4306d62b44af3e3054d86dcd6e0f710721 Mon Sep 17 00:00:00 2001
From: Matt Topol
Date: Tue, 2 Sep 2025 11:13:41 -0400
Subject: [PATCH 12/25] Apply suggestions from code review

Co-authored-by: Antoine Pitrou
---
 docs/source/format/CanonicalExtensions.rst | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/docs/source/format/CanonicalExtensions.rst b/docs/source/format/CanonicalExtensions.rst
index ee7509847d43..64f14358547d 100644
--- a/docs/source/format/CanonicalExtensions.rst
+++ b/docs/source/format/CanonicalExtensions.rst
@@ -459,7 +459,14 @@ to the Parquet format specification for details on what the actual binary values

         * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.

-        * A field named ``typed_value`` which follows these same rules for the ``typed_value`` field.
+        * A field named ``typed_value`` which follows the rules outlined above (this allows for arbitrarily nested data).
+* Extension type parameters:
+
+  This type does not have any parameters.
+
+* Description of the serialization:
+
+  Extension metadata is an empty string.

 .. note::

From 4619dc7b791591a7d43ce65ccff2fd70586a94dc Mon Sep 17 00:00:00 2001
From: Matt Topol
Date: Tue, 2 Sep 2025 11:20:44 -0400
Subject: [PATCH 13/25] updates from feedback

---
 docs/source/format/CanonicalExtensions.rst | 2 +-
 docs/source/status.rst | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/format/CanonicalExtensions.rst b/docs/source/format/CanonicalExtensions.rst
index 64f14358547d..cf9d7d75074a 100644
--- a/docs/source/format/CanonicalExtensions.rst
+++ b/docs/source/format/CanonicalExtensions.rst
@@ -470,7 +470,7 @@ to the Parquet format specification for details on what the actual binary values

 .. note::

-   It is also *permissible* for the ``metadata`` field to be dictionary-encoded with an index type of ``int8``.
+   It is also *permissible* for the ``metadata`` field to be dictionary-encoded with a preferred (*but not required*) index type of ``int8``.

 .. note::

diff --git a/docs/source/status.rst b/docs/source/status.rst
index ec28d734f26c..6aff1a8c12bb 100644
--- a/docs/source/status.rst
+++ b/docs/source/status.rst
@@ -129,7 +129,7 @@ Data Types
 +-----------------------+-------+-------+-------+------------+-------+-------+-------+-------+
 | 8-bit Boolean         | ✓     |       | ✓     |            |       |       |       |       |
 +-----------------------+-------+-------+-------+------------+-------+-------+-------+-------+
-| Variant               |       |       | ✓     |            |       |       |       |       |
+| Parquet Variant       |       |       | ✓     |            |       |       |       |       |
 +-----------------------+-------+-------+-------+------------+-------+-------+-------+-------+

 Notes:

From d8e1cfcf37c0612e5363c4360dd1a87d17e1a4ff Mon Sep 17 00:00:00 2001
From: Matt Topol
Date: Wed, 3 Sep 2025 12:57:40 -0400
Subject: [PATCH 14/25] Update docs/source/format/CanonicalExtensions.rst

Co-authored-by: Sutou Kouhei
---
 docs/source/format/CanonicalExtensions.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/format/CanonicalExtensions.rst b/docs/source/format/CanonicalExtensions.rst
index cf9d7d75074a..604a5df61c6d 100644
--- a/docs/source/format/CanonicalExtensions.rst
+++ b/docs/source/format/CanonicalExtensions.rst
@@ -421,7 +421,7 @@ better zero-copy compatibility with various systems that also store booleans usi

   Metadata is an empty string.

-.. _variant_extension:
+.. _parquet_variant_extension:

 Parquet Variant
 ===============

From 23f6fb3486bfd1aca71f5fd38558d698cddcd4db Mon Sep 17 00:00:00 2001
From: Matt Topol
Date: Wed, 3 Sep 2025 12:57:59 -0400
Subject: [PATCH 15/25] Update docs/source/format/CanonicalExtensions.rst

Co-authored-by: Sutou Kouhei
---
 docs/source/format/CanonicalExtensions.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/format/CanonicalExtensions.rst b/docs/source/format/CanonicalExtensions.rst
index 604a5df61c6d..68c74fa86c7f 100644
--- a/docs/source/format/CanonicalExtensions.rst
+++ b/docs/source/format/CanonicalExtensions.rst
@@ -428,7 +428,7 @@ Parquet Variant

 Variant represents a value that may be one of:

-* Primitive: a type and corresponding value (e.g. INT, STRING)
+* Primitive: a type and corresponding value (e.g. ``INT``, ``STRING``)

 * Array: An ordered list of Variant values

From b3b4063dd0dc5f743f5c6cc58b5c1ef7f1c14e68 Mon Sep 17 00:00:00 2001
From: Matt Topol
Date: Wed, 3 Sep 2025 12:59:36 -0400
Subject: [PATCH 16/25] Apply suggestions from code review

Co-authored-by: Sutou Kouhei
---
 docs/source/format/CanonicalExtensions.rst | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/docs/source/format/CanonicalExtensions.rst b/docs/source/format/CanonicalExtensions.rst
index 68c74fa86c7f..196119e44850 100644
--- a/docs/source/format/CanonicalExtensions.rst
+++ b/docs/source/format/CanonicalExtensions.rst
@@ -460,6 +460,7 @@ to the Parquet format specification for details on what the actual binary values
         * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.
         * A field named ``typed_value`` which follows the rules outlined above (this allows for arbitrarily nested data).
+
 * Extension type parameters:

   This type does not have any parameters.
@@ -504,9 +505,9 @@ In Parquet, this could be represented as::
 Thus the corresponding storage type for the ``parquet.variant`` Arrow extension type would be::

     struct<
-      metadata: binary required,
-      value: binary optional,
-      typed_value: int64 optional
+      metadata: binary non-nullable,
+      value: binary nullable,
+      typed_value: int64 nullable
     >

 If we suppose a series of measurements consisting of::
@@ -611,13 +612,13 @@ The storage type to represent this in Arrow as a Variant extension type would be
       typed_value: list required> nullable
+      > non-nullable> nullable
     >

 .. note::

-   As usual, **Binary** could also be **LargeBinary** or **BinaryView**, **String** could also be **LargeString** or **StringView**,
-   and **List** could also be **LargeList** or **ListView**.
+   As usual, **Binary** could also be **LargeBinary** or **BinaryView**, **String** could also be **LargeString** or **StringView**,
+   and **List** could also be **LargeList** or **ListView**.
 The data would then be stored in Arrow as follows::

@@ -938,9 +939,9 @@ that looks like this::

     {
       "event_type": "login",
-      “event_ts”: 1729794114937,
-      “location”: { “longitude”: 1.5, “latitude”: 5.5 },
-      “tags”: [“foo”, “bar”, “baz”]
+      "event_ts": 1729794114937,
+      "location”: {"longitude": 1.5, "latitude": 5.5},
+      "tags": ["foo", "bar", "baz"]
     }

 If we shred the extra fields out and represent it as Parquet it looks like::

From 5055efe488e3b1cca53071d679f90a592cb1f765 Mon Sep 17 00:00:00 2001
From: Matt Topol
Date: Wed, 3 Sep 2025 13:35:05 -0400
Subject: [PATCH 17/25] Update docs/source/format/CanonicalExtensions.rst

Co-authored-by: Bryce Mecum
---
 docs/source/format/CanonicalExtensions.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/format/CanonicalExtensions.rst b/docs/source/format/CanonicalExtensions.rst
index 196119e44850..d7be80a376e7 100644
--- a/docs/source/format/CanonicalExtensions.rst
+++ b/docs/source/format/CanonicalExtensions.rst
@@ -49,7 +49,7 @@ types:
    The currently allowed prefixes are:
    * "``arrow.``" - For general-purpose canonical extension types.
    * "``parquet.``" - For canonical extension types that are intended primarily for
-     interoperability with `Apache Parquet `__ format.
+     interoperability with the `Apache Parquet `__ format.

 2) Its parameters, if any, *must* be described in the proposal.
From 2490f56a76048507efc879d0b6618b0da0f82c58 Mon Sep 17 00:00:00 2001
From: Matt Topol
Date: Wed, 3 Sep 2025 14:18:08 -0400
Subject: [PATCH 18/25] clarify the meaning of the struct fields

---
 docs/source/format/CanonicalExtensions.rst | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/docs/source/format/CanonicalExtensions.rst b/docs/source/format/CanonicalExtensions.rst
index d7be80a376e7..750a37f90ae2 100644
--- a/docs/source/format/CanonicalExtensions.rst
+++ b/docs/source/format/CanonicalExtensions.rst
@@ -454,8 +454,15 @@ to the Parquet format specification for details on what the actual binary values

     * A field named ``typed_value`` which can be any :term:`primitive type` or a ``List``, ``LargeList``, ``ListView`` or ``Struct``

-      * If the ``typed_value`` field is a *nested* type, its elements **must** be *non-nullable* and **must** be a ``Struct`` consisting of
-        at least one (or both) of the following:
+      * If the ``typed_value`` field is a ``List``, ``LargeList`` or ``ListView`` its elements **must** be *non-nullable* and **must**
+        be a ``Struct`` consisting of at least one (or both) of the following:
+
+        * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.
+
+        * A field named ``typed_value`` which follows the rules outlined above (this allows for arbitrarily nested data).
+
+      * If the ``typed_value`` field is a ``Struct``, then its fields **must** be *non-nullable*, representing the fields being shredded
+        from the objects, and **must** be a ``Struct`` consisting of at least one (or both) of the following:

         * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.
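
The storage-type rules spelled out in the hunk above can be sketched as a small
recursive check. This is only an illustrative model under stated assumptions:
the `Type` and `Field` classes and `check_variant_storage` are hypothetical
stand-ins written for this sketch, not Arrow APIs. It does not restrict which
primitive types `typed_value` may use, and it ignores the dictionary-encoded
`metadata` case.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical type names for this sketch; they are not Arrow identifiers.
BINARY_KINDS = {"binary", "large_binary", "binary_view"}
LIST_KINDS = {"list", "large_list", "list_view"}


@dataclass
class Type:
    kind: str                                   # e.g. "binary", "int64", "list", "struct"
    children: List["Field"] = field(default_factory=list)


@dataclass
class Field:
    name: str
    type: Type
    nullable: bool = True


def check_variant_storage(t: Type, top_level: bool = True) -> List[str]:
    """Return a list of rule violations (an empty list means the type passes)."""
    if t.kind != "struct":
        return ["expected a struct, got " + t.kind]
    # Fields are looked up by (case-sensitive) name, never by position.
    by_name = {f.name: f for f in t.children}
    errors = []
    if top_level:
        md = by_name.get("metadata")
        if md is None or md.nullable or md.type.kind not in BINARY_KINDS:
            errors.append("metadata must be a non-nullable binary field")
    value = by_name.get("value")
    typed = by_name.get("typed_value")
    if value is None and typed is None:
        errors.append("need at least one of value / typed_value")
    if value is not None and value.type.kind not in BINARY_KINDS:
        errors.append("value must be a binary field")
    if typed is not None:
        if typed.type.kind in LIST_KINDS:
            nested = typed.type.children[:1]    # the list element field
        elif typed.type.kind == "struct":
            nested = typed.type.children        # one child per shredded object field
        else:
            nested = []                         # primitive: not checked in this sketch
        for f in nested:
            if f.nullable:
                errors.append(f"{f.name} must be non-nullable")
            errors.extend(check_variant_storage(f.type, top_level=False))
    return errors


binary = Type("binary")
simple = Type("struct", [
    Field("metadata", binary, nullable=False),
    Field("value", binary),
    Field("typed_value", Type("int64")),
])
bad_list = Type("struct", [
    Field("metadata", binary, nullable=False),
    Field("typed_value", Type("list", [
        Field("element", Type("struct", [Field("value", binary)]), nullable=True),
    ])),
])
print(check_variant_storage(simple))
print(check_variant_storage(bad_list))
```

The first call models the "simple shredding" storage type and reports no
violations; the second has a nullable list element, which the rule for
list-typed `typed_value` fields forbids.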
From d9c16cdd2737eeaf6d4c624d0789da2d1debfb11 Mon Sep 17 00:00:00 2001
From: Matt Topol
Date: Wed, 3 Sep 2025 14:37:42 -0400
Subject: [PATCH 19/25] Add type mapping

---
 docs/source/format/CanonicalExtensions.rst | 68 +++++++++++++++++++++-
 1 file changed, 66 insertions(+), 2 deletions(-)

diff --git a/docs/source/format/CanonicalExtensions.rst b/docs/source/format/CanonicalExtensions.rst
index 750a37f90ae2..9566b915261a 100644
--- a/docs/source/format/CanonicalExtensions.rst
+++ b/docs/source/format/CanonicalExtensions.rst
@@ -452,7 +452,7 @@ to the Parquet format specification for details on what the actual binary values
     * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.
       *(unshredded variants consist of just the ``metadata`` and ``value`` fields only)*

-    * A field named ``typed_value`` which can be any :term:`primitive type` or a ``List``, ``LargeList``, ``ListView`` or ``Struct``
+    * A field named ``typed_value`` which can be a :ref:`variant_primitive_type_mapping` or a ``List``, ``LargeList``, ``ListView`` or ``Struct``

       * If the ``typed_value`` field is a ``List``, ``LargeList`` or ``ListView`` its elements **must** be *non-nullable* and **must**
         be a ``Struct`` consisting of at least one (or both) of the following:
@@ -478,12 +478,76 @@ to the Parquet format specification for details on what the actual binary values

 .. note::

-   It is also *permissible* for the ``metadata`` field to be dictionary-encoded with a preferred (*but not required*) index type of ``int8``.
+   It is also *permissible* for the ``metadata`` field to be dictionary-encoded with a preferred (*but not required*) index type of ``int8``,
+   or run-end-encoded with a preferred (*but not required*) runs type of ``int8``.

 .. note::

    The fields may be in any order, and thus must be accessed by **name** not by *position*. The field names are case sensitive.

+.. _variant_primitive_type_mapping:
+
+Primitive Type Mappings
+-----------------------
+
++----------------------+------------------------+
+| Arrow Primitive Type | Variant Primitive Type |
++======================+========================+
+| Null                 | Null                   |
++----------------------+------------------------+
+| Boolean (true/false) | Boolean                |
++----------------------+------------------------+
+| Int8                 | Int8                   |
++----------------------+------------------------+
+| Uint8                | Int16                  |
++----------------------+------------------------+
+| Int16                | Int16                  |
++----------------------+------------------------+
+| Uint16               | Int32                  |
++----------------------+------------------------+
+| Int32                | Int32                  |
++----------------------+------------------------+
+| Uint32               | Int64                  |
++----------------------+------------------------+
+| Int64                | Int64                  |
++----------------------+------------------------+
+| Float                | Float                  |
++----------------------+------------------------+
+| Double               | Double                 |
++----------------------+------------------------+
+| Decimal32            | decimal4               |
++----------------------+------------------------+
+| Decimal64            | decimal8               |
++----------------------+------------------------+
+| Decimal128           | decimal16              |
++----------------------+------------------------+
+| Date32               | Date                   |
++----------------------+------------------------+
+| Time64               | TimeNTZ                |
++----------------------+------------------------+
+| Timestamp(us, UTC)   | Timestamp (micro)      |
++----------------------+------------------------+
+| Timestamp(us)        | TimestampNTZ (micro)   |
++----------------------+------------------------+
+| Timestamp(ns, UTC)   | Timestamp (nano)       |
++----------------------+------------------------+
+| Timestamp(ns)        | TimestampNTZ (nano)    |
++----------------------+------------------------+
+| Binary               | Binary                 |
++----------------------+------------------------+
+| LargeBinary          | Binary                 |
++----------------------+------------------------+
+| BinaryView           | Binary                 |
++----------------------+------------------------+
+| String               | String                 |
++----------------------+------------------------+
+| LargeString          | String                 |
++----------------------+------------------------+
+| StringView           | String                 |
++----------------------+------------------------+
+| UUID extension type  | UUID                   |
++----------------------+------------------------+
+
 Examples
 --------

From b415e373b6bcc6fcfe751e877721a4016aec07bf Mon Sep 17 00:00:00 2001
From: Matt Topol
Date: Wed, 3 Sep 2025 14:43:25 -0400
Subject: [PATCH 20/25] move examples to new document

---
 docs/source/format/CanonicalExtensions.rst | 532 +----------------
 .../format/CanonicalExtensions/Examples.rst | 555 ++++++++++++++++++
 2 files changed, 556 insertions(+), 531 deletions(-)
 create mode 100644 docs/source/format/CanonicalExtensions/Examples.rst

diff --git a/docs/source/format/CanonicalExtensions.rst b/docs/source/format/CanonicalExtensions.rst
index 9566b915261a..0c6870a38542 100644
--- a/docs/source/format/CanonicalExtensions.rst
+++ b/docs/source/format/CanonicalExtensions.rst
@@ -441,7 +441,7 @@ variant values. This will make it possible for systems to pass Variant data arou
 or otherwise require special handling unless they want to directly interact with the encoded variant data. See the previous links
 to the Parquet format specification for details on what the actual binary values should look like.

-* Extension name: ``parquet.variant``.
+* Extension name: ``arrow.parquet.variant``.

 * The storage type of this extension is a ``Struct`` that obeys the following rules:

@@ -548,536 +548,6 @@ Primitive Type Mappings
 | UUID extension type  | UUID                   |
 +----------------------+------------------------+

-Examples
---------
-
-Unshredded
-''''''''''
-
-The simplest case, an unshredded variant always consists of **exactly** two fields: ``metadata`` and ``value``.
Any of -the following storage types are valid (not an exhaustive list): - -* ``struct`` -* ``struct`` -* ``struct non-nullable, value: binary_view nullable>`` - -Simple Shredding -'''''''''''''''' - -Suppose we have a Variant field named *measurement* and we want to shred the ``int64`` values into a separate column for efficiency. -In Parquet, this could be represented as:: - - required group measurement (VARIANT) { - required binary metadata; - optional binary value; - optional int64 typed_value; - } - -Thus the corresponding storage type for the ``parquet.variant`` Arrow extension type would be:: - - struct< - metadata: binary non-nullable, - value: binary nullable, - typed_value: int64 nullable - > - -If we suppose a series of measurements consisting of:: - - 34, null, "n/a", 100 - -The data should be stored/represented in Arrow as:: - - * Length: 4, Null count: 1 - * Validity Bitmap buffer: - - | Byte 0 (validity bitmap) | Bytes 1-63 | - |--------------------------|---------------| - | 00001011 | 0 (padding) | - - * Children arrays: - * field-0 array (`VarBinary`) - * Length: 4, Null count: 0 - * Offsets buffer: - - | Bytes 0-19 | Bytes 20-63 | - |------------------|--------------------------| - | 0, 2, 4, 6, 8 | unspecified (padding) | - - * Value buffer: (01 00 -> indicates version 1 empty metadata) - - | Bytes 0-7 | Bytes 8-63 | - |-------------------------|--------------------------| - | 01 00 01 00 01 00 01 00 | unspecified (padding) | - - * field-1 array (`VarBinary`) - * Length: 4, Null count: 2 - * Validity Bitmap buffer: - - | Byte 0 (validity bitmap) | Bytes 1-63 | - |--------------------------|---------------| - | 00000110 | 0 (padding) | - - * Offsets buffer: - - | Bytes 0-19 | Bytes 20-63 | - |------------------|--------------------------| - | 0, 0, 1, 5, 5 | unspecified (padding) | - - * Value buffer: (`00` -> literal null, `0x13 0x6E 0x2F 0x61` -> variant encoding literal string "n/a") - - | Bytes 0-4 | Bytes 5-63 | - 
|------------------------|--------------------------| - | 00 0x13 0x6E 0x2F 0x61 | unspecified (padding) | - - * field-2 array (int64 array) - * Length: 4, Null count: 2 - * Validity Bitmap buffer: - - | Byte 0 (validity bitmap) | Bytes 1-63 | - |--------------------------|---------------| - | 00001001 | 0 (padding) | - - * Value buffer: - - | Bytes 0-31 | Bytes 32-63 | - |---------------------|--------------------------| - | 34, 00, 00, 100 | unspecified (padding) | - -.. note:: - - Notice that there is a variant ``literal null`` in the ``value`` array, this is due to the - `shredding specification `__ - so that a consumer can tell the difference between a *missing* field and a **null** field. A null - element must be encoded as a Variant null: *basic type* ``0`` (primitive) and *physical type* ``0`` (null). - -Shredding an Array -'''''''''''''''''' - -For our next example, we will represent a shredded array of strings. Let's consider a column that looks like: :: - - ["comedy", "drama"], ["horror", null], ["comedy", "drama", "romance"], null - -Representing this shredded variant in Parquet could look like:: - - optional group tags (VARIANT) { - required binary metadata; - optional binary value; - optional group typed_value (LIST) { # optional to allow null lists - repeated group list { - required group element { # shredded element - optional binary value; - optional binary typed_value (STRING); - } - } - } - } - -The array structure for Variant encoding does not allow missing elements, so all elements of the array must -be *non-nullable*. As such, either **typed_value** or **value** (*but not both!*) must be *non-null*. - -The storage type to represent this in Arrow as a Variant extension type would be:: - - struct< - metadata: binary non-nullable, - value: binary nullable, - typed_value: list non-nullable> nullable - > - -.. 
note:: - - As usual, **Binary** could also be **LargeBinary** or **BinaryView**, **String** could also be **LargeString** or **StringView**, - and **List** could also be **LargeList** or **ListView**. - -The data would then be stored in Arrow as follows:: - - * Length: 4, Null count: 1 - * Validity Bitmap buffer: - - | Byte 0 (validity bitmap) | Bytes 1-63 | - |--------------------------|---------------| - | 00000111 | 0 (padding) | - - * Children arrays: - * field-0 array (`VarBinary` metadata) - * Length: 4, Null count: 0 - * Offsets buffer: - - | Bytes 0-19 | Bytes 20-63 | - |------------------|--------------------------| - | 0, 2, 4, 6, 8 | unspecified (padding) | - - * Value buffer: (01 00 -> indicates version 1 empty metadata) - - | Bytes 0-7 | Bytes 8-63 | - |-------------------------|--------------------------| - | 01 00 01 00 01 00 01 00 | unspecified (padding) | - - * field-1 array (`VarBinary` value) - * Length: 4, Null count: 1 - * Validity bitmap buffer: - - | Byte 0 (validity bitmap) | Bytes 1-63 | - |--------------------------|---------------| - | 00001000 | 0 (padding) | - - * Offsets buffer: - - | Bytes 0-19 | Bytes 20-63 | - |------------------|--------------------------| - | 0, 0, 0, 0, 1 | unspecified (padding) | - - * Value buffer: (00 -> variant null) - - | Bytes 0 | Bytes 1-63 | - |--------------------|--------------------------| - | 00 | unspecified (padding) | - - * field-2 array (`List>` typed_value) - * Length: 4, Null count: 1 - * Validity bitmap buffer: - - | Byte 0 (validity bitmap) | Bytes 1-63 | - |--------------------------|-------------| - | 00000111 | 0 (padding) | - - * Offsets buffer (int32) - - | Bytes 0-19 | Bytes 20-63 | - |-------------------|-----------------------| - | 0, 2, 4, 7, 7 | unspecified (padding) | - - * Values array (`Struct` element): - * Length: 7, Null count: 0 - * Validity bitmap buffer: Not required - - * Children arrays: - * field-0 array (`VarBinary` value) - * Length: 7, Null count: 6 - * Validity bitmap 
buffer: - - | Byte 0 (validity bitmap) | Bytes 1-63 | - |--------------------------|-------------| - | 00001000 | 0 (padding) | - - * Offsets buffer (int32): - - | Bytes 0-31 | Bytes 32-63 | - |---------------------------|--------------------------| - | 0, 0, 0, 0, 1, 1, 1, 1 | unspecified (padding) | - - * Values buffer (`00` -> variant null): - - | Bytes 0 | Bytes 1-63 | - |--------------------|--------------------------| - | 00 | unspecified (padding) | - - * field-1 array (`String` typed_value) - * Length: 7, Null count: 1 - * Validity bitmap buffer: - - | Byte 0 (validity bitmap) | Bytes 1-63 | - |--------------------------|-------------| - | 01110111 | 0 (padding) | - - * Offsets buffer (int32): - - | Bytes 0-31 | Bytes 32-63 | - |---------------------------------|--------------------------| - | 0, 6, 11, 17, 17, 23, 28, 35 | unspecified (padding) | - - * Values buffer: - - | Bytes 0-35 | Bytes 36-63 | - |--------------------------------------|--------------------------| - | comedydramahorrorcomedydramaromance | unspecified (padding) | - -Shredding an Object -''''''''''''''''''' - -Let's consider a JSON column of "events" which contain a field named ``event_type`` (a string) -and a field named ``event_ts`` (a timestamp) that we wish to shred into separate columns, In Parquet, -it could look something like this:: - - optional group event (VARIANT) { - required binary metadata; - optional binary value; # variant, remaining fields/values - optional group typed_value { # shredded fields for variant object - required group event_type { # event_type shredded field - optional binary value; - optional binary typed_value (STRING); - } - required group event_ts { # event_ts shredded field - optional binary value; - optional int64 typed_value (TIMESTAMP(true, MICROS)) - } - } - } - -We can then translate this into the expected extension storage type:: - - struct< - metadata: binary non-nullable, - value: binary nullable, - typed_value: struct< - event_type: struct< - 
value: binary nullable, - typed_value: string nullable - > non-nullable, - event_ts: struct< - value: binary nullable, - typed_value: timestamp(us, UTC) nullable - > non-nullable - > nullable - > - -If a field *does not exist* in the variant object value, then both the **value** and **typed_value** columns for that row -will be null. If a field is *present*, but the value is null, then **value** must contain a Variant null. - -It is *invalid* for both **value** and **typed_value** to be non-null for a given index. A reader can choose not to error -in this scenario, but if so it **must** use the value in the **typed_value** column for that index. - -Let's consider the following series of objects:: - - {"event_type": "noop", "event_ts": 1729794114937} - - {"event_type": "login", "event_ts": 1729794146402, "email": "user@example.com"} - - {"error_msg": "malformed..."} - - "malformed: not an object" - - {"event_ts": 1729794240241, "click": "_button"} - - {"event_ts": null, "event_ts": 1729794954163} - - {"event_type": "noop", "event_ts": "2024-10-24"} - - {} - - null - - *Entirely missing* - -To represent those values as a column of Variant values using the Variant extension type we get the following:: - - * Length: 10, Null count: 1 - * Validity bitmap buffer: - - | Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 | - |--------------------------|-----------|-----------------------| - | 11111111 | 00000001 | 0 (padding) | - - * Children arrays - * field-0 array (`VarBinary` Metadata) - * Length: 10, Null count: 0 - * Offsets buffer: - - | Bytes 0-43 (int32) | Bytes 44-63 | - |------------------------------------------|-------------------------| - | 0, 2, 11, 24, 26, 35, 37, 39, 41, 43, 45 | unspecified (padding) | - - * Value buffer: (01 00 -> version 1 empty metadata, - 01 01 00 XX ... -> Version 1, metadata with 1 elem, offset 0, offset XX == len(string), ... 
is dict string bytes) - - | Bytes 0-1 | Bytes 2-10 | Bytes 11-23 | Bytes 24-25 | Bytes 26-34 | - |-------------------------------|-----------------------|-------------|-------------------| - | 01 00 | 01 01 00 05 email | 01 01 00 09 error_msg | 01 00 | 01 01 00 05 click | - - | Bytes 35-36 | Bytes 37-38 | Bytes 39-40 | Bytes 41-42 | Bytes 43-44 | Bytes 45-63 | - |-------------|-------------|-------------|-------------|-------------|-----------------------| - | 01 00 | 01 00 | 01 00 | 01 00 | 01 00 | unspecified (padding) | - - * field-1 array (`VarBinary` Value) - * Length: 10, Null count: 5 - * Validity bitmap buffer: - - | Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 | - |---------------------------|-----------|-----------------------| - | 00011110 | 00000001 | 0 (padding) | - - * Offsets buffer (filled in based on lengths of encoded variants): - - | ... | - - * Value buffer: - - | VariantEncode({"email": "user@email.com"}) | VariantEncode({"error_msg": "malformed..."}) | - | VariantEncode("malformed: not an object") | VariantEncode({"click": "_button"}) | 00 (null) | - - * field-2 array (`Struct<...>` typed_value) - * Length: 10, Null count: 3 - * Validity bitmap buffer: - - | Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 | - |--------------------------|-----------|-----------------------| - | 11110111 | 00000000 | 0 (padding) | - - * Children arrays: - * field-0 array (`Struct` event_type) - * Length: 10, Null count: 0 - * Validity bitmap buffer: not required - - * Children arrays - * field-0 array (`VarBinary` value) - * Length: 10, Null count: 9 - * Validity bitmap buffer: - - | Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 | - |--------------------------|-----------|-----------------------| - | 01000000 | 00000000 | 0 (padding) | - - * Offsets buffer (int32) - - | Bytes 0-43 (int32) | Bytes 44-63 | - |---------------------------------|-------------------------| - | 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1 | unspecified (padding) | - - * Value buffer: - - | Byte 0 
| Bytes 1-63 | - |--------|------------------------| - | 00 | unspecified (padding) | - - * field-1 array (`String` typed_value) - * Length: 10, Null count: 7 - * Validity bitmap buffer: - - | Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 | - |--------------------------|-----------|-----------------------| - | 01000011 | 00000000 | 0 (padding) | - - * Offsets buffer (int32) - - | Byte 0-43 | Bytes 44-63 | - |-------------------------------------|------------------------| - | 0, 4, 9, 9, 9, 9, 9, 13, 13, 13, 13 | unspecified (padding) | - - * Value buffer: - - | Bytes 0-3 | Bytes 4-8 | Bytes 9-12 | Bytes 13-63 | - |-----------|-----------|------------|------------------------| - | noop | login | noop | unspecified (padding) | - - - * field-1 array (`Struct` event_ts) - * Length: 10, Null count: 0 - * Validity bitmap buffer: not required - - * Children arrays - * field-0 array (`VarBinary` value) - * Length: 10, Null count: 9 - * Validity bitmap buffer: - - | Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 | - |--------------------------|-----------|-----------------------| - | 01000000 | 00000000 | 0 (padding) | - - * Offsets buffer (int32) - - | Bytes 0-43 (int32) | Bytes 44-63 | - |---------------------------------|-------------------------| - | ... 
| unspecified (padding) | - - * Value buffer: - - | VariantEncode("2024-10-24") | - - * field-1 array (`Timestamp(us, UTC)` typed_value) - * Length: 10, Null count: 6 - * Validity bitmap buffer: - - | Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 | - |--------------------------|-----------|-----------------------| - | 00110011 | 00000000 | 0 (padding) | - - * Value buffer: - - | Bytes 0-7 | Bytes 8-15 | Bytes 16-31 | Bytes 32-39 | Bytes 40-47 | Bytes 48-63 | - |---------------|---------------|--------------|---------------|---------------|------------------------| - | 1729794114937 | 1729794146402 | unspecified | 1729794240241 | 1729794954163 | unspecified (padding) | - - -Putting it all together -''''''''''''''''''''''' - -As mentioned, the **typed_value** field associated with a Variant **value** can be of any shredded type. As a result, -as long as we follow the original rules we can have an arbitrary number of nested levels based on how you want to -shred the object. For example, we might have a few more fields alongside **event_type** to shred out. 
Possibly an object -that looks like this:: - - { - "event_type": "login", - "event_ts": 1729794114937, - "location”: {"longitude": 1.5, "latitude": 5.5}, - "tags": ["foo", "bar", "baz"] - } - -If we shred the extra fields out and represent it as Parquet it looks like:: - - optional group event (VARIANT) { - required binary metadata; - optional binary value; # variant, remaining fields/values - optional group typed_value { # shredded fields for variant object - required group event_type { # event_type shredded field - optional binary value; - optional binary typed_value (STRING); - } - required group event_ts { # event_ts shredded field - optional binary value; - optional int64 typed_value (TIMESTAMP(true, MICROS)) - } - required group location { # location shredded field - optional binary value; - optional group typed_value { - required group longitude { - optional binary value; - optional float64 typed_value; - } - required group latitude { - optional binary value; - optional float64 typed_value; - } - } - } - required group tags { # tags shredded field - optional binary value; - optional group typed_value (LIST) { - repeated group list { - required group element { - optional binary value; - optional binary typed_value (STRING); - } - } - } - } - } - } - -Finally, following the rules we set forth on constructing the Variant Extension Type storage type, we end up with:: - - struct< - metadata: binary non-nullable, - value: binary nullable, - typed_value: struct< - event_type: struct non-nullable, - event_ts: struct non-nullable, - location: struct< - value: binary nullable, - typed_value: struct< - longitude: struct non-nullable, - latitude: struct non-nullable - > nullable> non-nullable, - tags: struct< - value: binary nullable, - typed_value: list non-nullable> nullable - > non-nullable - > nullable - > - - Community Extension Types ========================= diff --git a/docs/source/format/CanonicalExtensions/Examples.rst 
b/docs/source/format/CanonicalExtensions/Examples.rst new file mode 100644 index 000000000000..7819f68d3cfa --- /dev/null +++ b/docs/source/format/CanonicalExtensions/Examples.rst @@ -0,0 +1,555 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +.. _format_canonical_extension_examples: + +**************************** +Canonical Extension Examples +**************************** + +========================= +Parquet Variant Extension +========================= + + +Unshredded +'''''''''' + +In the simplest case, an unshredded variant consists of **exactly** two fields: ``metadata`` and ``value``. Any of +the following storage types are valid (not an exhaustive list): + +* ``struct<metadata: binary non-nullable, value: binary nullable>`` +* ``struct<value: binary nullable, metadata: binary non-nullable>`` +* ``struct<metadata: dictionary<int8, binary> non-nullable, value: binary_view nullable>`` + +Simple Shredding +'''''''''''''''' + +Suppose we have a Variant field named *measurement* and we want to shred the ``int64`` values into a separate column for efficiency. 
+In Parquet, this could be represented as:: + + required group measurement (VARIANT) { + required binary metadata; + optional binary value; + optional int64 typed_value; + } + +Thus the corresponding storage type for the ``arrow.parquet.variant`` Arrow extension type would be:: + + struct< + metadata: binary non-nullable, + value: binary nullable, + typed_value: int64 nullable + > + +If we suppose a series of measurements consisting of:: + + 34, null, "n/a", 100 + +The data should be stored/represented in Arrow as:: + + * Length: 4, Null count: 1 + * Validity Bitmap buffer: + + | Byte 0 (validity bitmap) | Bytes 1-63 | + |--------------------------|---------------| + | 00001011 | 0 (padding) | + + * Children arrays: + * field-0 array (`VarBinary`) + * Length: 4, Null count: 0 + * Offsets buffer: + + | Bytes 0-19 | Bytes 20-63 | + |------------------|--------------------------| + | 0, 2, 4, 6, 8 | unspecified (padding) | + + * Value buffer: (01 00 -> indicates version 1 empty metadata) + + | Bytes 0-7 | Bytes 8-63 | + |-------------------------|--------------------------| + | 01 00 01 00 01 00 01 00 | unspecified (padding) | + + * field-1 array (`VarBinary`) + * Length: 4, Null count: 2 + * Validity Bitmap buffer: + + | Byte 0 (validity bitmap) | Bytes 1-63 | + |--------------------------|---------------| + | 00000110 | 0 (padding) | + + * Offsets buffer: + + | Bytes 0-19 | Bytes 20-63 | + |------------------|--------------------------| + | 0, 0, 1, 5, 5 | unspecified (padding) | + + * Value buffer: (`00` -> null, + `0x13 0x6E 0x2F 0x61` -> variant encoding literal string "n/a") + + | Bytes 0-4 | Bytes 5-63 | + |------------------------|--------------------------| + | 00 0x13 0x6E 0x2F 0x61 | unspecified (padding) | + + * field-2 array (int64 array) + * Length: 4, Null count: 2 + * Validity Bitmap buffer: + + | Byte 0 (validity bitmap) | Bytes 1-63 | + |--------------------------|---------------| + | 00001001 | 0 (padding) | + + * Value buffer: + + | Bytes 0-31 | 
Bytes 32-63 | + |---------------------|--------------------------| + | 34, 00, 00, 100 | unspecified (padding) | + +.. note:: + + Notice that there is a variant ``literal null`` in the ``value`` array. This is required by the + `shredding specification `__ + so that a consumer can tell the difference between a *missing* field and a *null* field. A null + element must be encoded as a Variant null: *basic type* ``0`` (primitive) and *physical type* ``0`` (null). + +Shredding an Array +'''''''''''''''''' + +For our next example, we will represent a shredded array of strings. Let's consider a column that looks like:: + + ["comedy", "drama"], ["horror", null], ["comedy", "drama", "romance"], null + +Representing this shredded variant in Parquet could look like:: + + optional group tags (VARIANT) { + required binary metadata; + optional binary value; + optional group typed_value (LIST) { # optional to allow null lists + repeated group list { + required group element { # shredded element + optional binary value; + optional binary typed_value (STRING); + } + } + } + } + +The array structure for Variant encoding does not allow missing elements, so all elements of the array must +be *non-nullable*. As such, either **typed_value** or **value** (*but not both!*) must be *non-null*. + +The storage type to represent this in Arrow as a Variant extension type would be:: + + struct< + metadata: binary non-nullable, + value: binary nullable, + typed_value: list<element: struct<value: binary nullable, typed_value: string nullable> non-nullable> nullable + > + +.. note:: + + As usual, **Binary** could also be **LargeBinary** or **BinaryView**, **String** could also be **LargeString** or **StringView**, + and **List** could also be **LargeList** or **ListView**. 
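As a rough illustration of the rule that each array element carries exactly one of **value** or **typed_value**, the following Python sketch shreds the example column above into per-element entries and computes the list and string offsets. The helper names (``shred_string_lists``, ``string_offsets``) are hypothetical and not part of any Arrow library; the sketch only handles string and null elements and ignores the top-level ``value`` column.

```python
# Hypothetical helpers for illustration only; not part of any Arrow library.

# Variant null: basic type 0 (primitive) and physical type 0 (null).
VARIANT_NULL = b"\x00"


def shred_string_lists(column):
    """Return list offsets plus per-element value/typed_value entries."""
    list_offsets = [0]
    values = []        # residual Variant-encoded bytes, or None
    typed_values = []  # shredded strings, or None
    for row in column:
        if row is not None:
            for element in row:
                if isinstance(element, str):
                    # Matches the shredded type: only typed_value is set.
                    values.append(None)
                    typed_values.append(element)
                elif element is None:
                    # Elements cannot be missing, so a null element is
                    # stored as a Variant null in value.
                    values.append(VARIANT_NULL)
                    typed_values.append(None)
                else:
                    raise TypeError("this sketch only handles strings and nulls")
        # A null row adds no elements, so its end offset repeats the previous one.
        list_offsets.append(len(values))
    return list_offsets, values, typed_values


def string_offsets(strings):
    """Compute offsets for a string array; null entries have zero length."""
    offsets = [0]
    for s in strings:
        length = len(s.encode("utf-8")) if s is not None else 0
        offsets.append(offsets[-1] + length)
    return offsets


column = [["comedy", "drama"], ["horror", None], ["comedy", "drama", "romance"], None]
list_offsets, values, typed_values = shred_string_lists(column)
```

For the example column, ``list_offsets`` comes out as ``[0, 2, 4, 7, 7]`` and ``string_offsets(typed_values)`` as ``[0, 6, 11, 17, 17, 23, 28, 35]``, with a single Variant null at element index 3 of ``values``.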
+ +The data would then be stored in Arrow as follows:: + + * Length: 4, Null count: 1 + * Validity Bitmap buffer: + + | Byte 0 (validity bitmap) | Bytes 1-63 | + |--------------------------|---------------| + | 00000111 | 0 (padding) | + + * Children arrays: + * field-0 array (`VarBinary` metadata) + * Length: 4, Null count: 0 + * Offsets buffer: + + | Bytes 0-19 | Bytes 20-63 | + |------------------|--------------------------| + | 0, 2, 4, 6, 8 | unspecified (padding) | + + * Value buffer: (01 00 -> indicates version 1 empty metadata) + + | Bytes 0-7 | Bytes 8-63 | + |-------------------------|--------------------------| + | 01 00 01 00 01 00 01 00 | unspecified (padding) | + + * field-1 array (`VarBinary` value) + * Length: 4, Null count: 3 + * Validity bitmap buffer: + + | Byte 0 (validity bitmap) | Bytes 1-63 | + |--------------------------|---------------| + | 00001000 | 0 (padding) | + + * Offsets buffer: + + | Bytes 0-19 | Bytes 20-63 | + |------------------|--------------------------| + | 0, 0, 0, 0, 1 | unspecified (padding) | + + * Value buffer: (00 -> variant null) + + | Byte 0 | Bytes 1-63 | + |--------------------|--------------------------| + | 00 | unspecified (padding) | + + * field-2 array (`List<Struct<...>>` typed_value) + * Length: 4, Null count: 1 + * Validity bitmap buffer: + + | Byte 0 (validity bitmap) | Bytes 1-63 | + |--------------------------|-------------| + | 00000111 | 0 (padding) | + + * Offsets buffer (int32) + + | Bytes 0-19 | Bytes 20-63 | + |-------------------|-----------------------| + | 0, 2, 4, 7, 7 | unspecified (padding) | + + * Values array (`Struct` element): + * Length: 7, Null count: 0 + * Validity bitmap buffer: Not required + + * Children arrays: + * field-0 array (`VarBinary` value) + * Length: 7, Null count: 6 + * Validity bitmap buffer: + + | Byte 0 (validity bitmap) | Bytes 1-63 | + |--------------------------|-------------| + | 00001000 | 0 (padding) | + + * Offsets buffer (int32): + + | Bytes 0-31 | Bytes 32-63 | + 
|---------------------------|--------------------------| + | 0, 0, 0, 0, 1, 1, 1, 1 | unspecified (padding) | + + * Values buffer (`00` -> variant null): + + | Byte 0 | Bytes 1-63 | + |--------------------|--------------------------| + | 00 | unspecified (padding) | + + * field-1 array (`String` typed_value) + * Length: 7, Null count: 1 + * Validity bitmap buffer: + + | Byte 0 (validity bitmap) | Bytes 1-63 | + |--------------------------|-------------| + | 01110111 | 0 (padding) | + + * Offsets buffer (int32): + + | Bytes 0-31 | Bytes 32-63 | + |---------------------------------|--------------------------| + | 0, 6, 11, 17, 17, 23, 28, 35 | unspecified (padding) | + + * Values buffer: + + | Bytes 0-34 | Bytes 35-63 | + |--------------------------------------|--------------------------| + | comedydramahorrorcomedydramaromance | unspecified (padding) | + +Shredding an Object +''''''''''''''''''' + +Let's consider a JSON column of "events" which contain a field named ``event_type`` (a string) +and a field named ``event_ts`` (a timestamp) that we wish to shred into separate columns. In Parquet, +it could look something like this:: + + optional group event (VARIANT) { + required binary metadata; + optional binary value; # variant, remaining fields/values + optional group typed_value { # shredded fields for variant object + required group event_type { # event_type shredded field + optional binary value; + optional binary typed_value (STRING); + } + required group event_ts { # event_ts shredded field + optional binary value; + optional int64 typed_value (TIMESTAMP(true, MICROS)); + } + } + } + +We can then translate this into the expected extension storage type:: + + struct< + metadata: binary non-nullable, + value: binary nullable, + typed_value: struct< + event_type: struct< + value: binary nullable, + typed_value: string nullable + > non-nullable, + event_ts: struct< + value: binary nullable, + typed_value: timestamp(us, UTC) nullable + > non-nullable + > nullable + > 
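To make the translation concrete, here is a minimal Python sketch that splits a single event object into the shredded ``event_type`` and ``event_ts`` fields of the storage type above plus a residual for the remaining keys. It assumes the usual shredding conventions: a missing field leaves both **value** and **typed_value** null, a present-but-null field stores a Variant null in **value**, and a field whose type does not match stays Variant-encoded in **value**. The names ``shred_event`` and ``encode_stub`` are hypothetical, and ``encode_stub`` only tags its input rather than producing real Variant bytes; rows that are not objects are out of scope.

```python
# Hypothetical illustration only; real implementations work on Variant-encoded bytes.

# Variant null: basic type 0 (primitive) and physical type 0 (null).
VARIANT_NULL = b"\x00"

# Shredded field name -> Python type standing in for the Arrow typed_value type.
SHREDDED_FIELDS = {"event_type": str, "event_ts": int}


def encode_stub(obj):
    """Stand-in for a real Variant encoder (assumption): tags the value only."""
    return ("VariantEncode", obj)


def shred_event(event):
    """Split one JSON-like object into shredded fields and a residual value."""
    shredded = {}
    for name, py_type in SHREDDED_FIELDS.items():
        if name not in event:
            # Missing field: both value and typed_value are null.
            shredded[name] = {"value": None, "typed_value": None}
        elif event[name] is None:
            # Present but null: value holds a Variant null.
            shredded[name] = {"value": VARIANT_NULL, "typed_value": None}
        elif type(event[name]) is py_type:
            # Type matches the shredded column: only typed_value is set.
            shredded[name] = {"value": None, "typed_value": event[name]}
        else:
            # Type mismatch: the field stays Variant-encoded in value.
            shredded[name] = {"value": encode_stub(event[name]), "typed_value": None}
    # Keys that are not shredded remain in the top-level value column.
    residual = {k: v for k, v in event.items() if k not in SHREDDED_FIELDS}
    return shredded, (encode_stub(residual) if residual else None)


shredded, residual = shred_event(
    {"event_type": "login", "event_ts": 1729794146402, "email": "user@example.com"}
)
```

Here ``shredded["event_type"]`` holds only a ``typed_value`` of ``"login"``, while the ``email`` key, which is not shredded, ends up in the residual that would be stored in the top-level **value** column.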
+ +If a field *does not exist* in the variant object value, then both the **value** and **typed_value** columns for that row +will be null. If a field is *present*, but the value is null, then **value** must contain a Variant null. + +It is *invalid* for both **value** and **typed_value** to be non-null for a given index. A reader can choose not to error +in this scenario, but if so it **must** use the value in the **typed_value** column for that index. + +Let's consider the following series of objects:: + + {"event_type": "noop", "event_ts": 1729794114937} + + {"event_type": "login", "event_ts": 1729794146402, "email": "user@example.com"} + + {"error_msg": "malformed..."} + + "malformed: not an object" + + {"event_ts": 1729794240241, "click": "_button"} + + {"event_type": null, "event_ts": 1729794954163} + + {"event_type": "noop", "event_ts": "2024-10-24"} + + {} + + null + + *Entirely missing* + +To represent those values as a column of Variant values using the Variant extension type we get the following:: + + * Length: 10, Null count: 1 + * Validity bitmap buffer: + + | Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 | + |--------------------------|-----------|-----------------------| + | 11111111 | 00000001 | 0 (padding) | + + * Children arrays + * field-0 array (`VarBinary` Metadata) + * Length: 10, Null count: 0 + * Offsets buffer: + + | Bytes 0-43 (int32) | Bytes 44-63 | + |------------------------------------------|-------------------------| + | 0, 2, 11, 24, 26, 35, 37, 39, 41, 43, 45 | unspecified (padding) | + + * Value buffer: (01 00 -> version 1 empty metadata, + 01 01 00 XX ... -> Version 1, metadata with 1 elem, offset 0, offset XX == len(string), ... 
is dict string bytes) + + | Bytes 0-1 | Bytes 2-10 | Bytes 11-23 | Bytes 24-25 | Bytes 26-34 | + |-------------------------------|-----------------------|-------------|-------------------| + | 01 00 | 01 01 00 05 email | 01 01 00 09 error_msg | 01 00 | 01 01 00 05 click | + + | Bytes 35-36 | Bytes 37-38 | Bytes 39-40 | Bytes 41-42 | Bytes 43-44 | Bytes 45-63 | + |-------------|-------------|-------------|-------------|-------------|-----------------------| + | 01 00 | 01 00 | 01 00 | 01 00 | 01 00 | unspecified (padding) | + + * field-1 array (`VarBinary` Value) + * Length: 10, Null count: 5 + * Validity bitmap buffer: + + | Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 | + |---------------------------|-----------|-----------------------| + | 00011110 | 00000001 | 0 (padding) | + + * Offsets buffer (filled in based on lengths of encoded variants): + + | ... | + + * Value buffer: + + | VariantEncode({"email": "user@example.com"}) | VariantEncode({"error_msg": "malformed..."}) | + | VariantEncode("malformed: not an object") | VariantEncode({"click": "_button"}) | 00 (null) | + + * field-2 array (`Struct<...>` typed_value) + * Length: 10, Null count: 3 + * Validity bitmap buffer: + + | Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 | + |--------------------------|-----------|-----------------------| + | 11110111 | 00000000 | 0 (padding) | + + * Children arrays: + * field-0 array (`Struct` event_type) + * Length: 10, Null count: 0 + * Validity bitmap buffer: not required + + * Children arrays + * field-0 array (`VarBinary` value) + * Length: 10, Null count: 9 + * Validity bitmap buffer: + + | Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 | + |--------------------------|-----------|-----------------------| + | 00100000 | 00000000 | 0 (padding) | + + * Offsets buffer (int32) + + | Bytes 0-43 (int32) | Bytes 44-63 | + |---------------------------------|-------------------------| + | 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1 | unspecified (padding) | + + * Value buffer: + + | Byte 0 
| Bytes 1-63 | + |--------|------------------------| + | 00 | unspecified (padding) | + + * field-1 array (`String` typed_value) + * Length: 10, Null count: 7 + * Validity bitmap buffer: + + | Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 | + |--------------------------|-----------|-----------------------| + | 01000011 | 00000000 | 0 (padding) | + + * Offsets buffer (int32) + + | Byte 0-43 | Bytes 44-63 | + |-------------------------------------|------------------------| + | 0, 4, 9, 9, 9, 9, 9, 13, 13, 13, 13 | unspecified (padding) | + + * Value buffer: + + | Bytes 0-3 | Bytes 4-8 | Bytes 9-12 | Bytes 13-63 | + |-----------|-----------|------------|------------------------| + | noop | login | noop | unspecified (padding) | + + + * field-1 array (`Struct` event_ts) + * Length: 10, Null count: 0 + * Validity bitmap buffer: not required + + * Children arrays + * field-0 array (`VarBinary` value) + * Length: 10, Null count: 9 + * Validity bitmap buffer: + + | Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 | + |--------------------------|-----------|-----------------------| + | 01000000 | 00000000 | 0 (padding) | + + * Offsets buffer (int32) + + | Bytes 0-43 (int32) | Bytes 44-63 | + |---------------------------------|-------------------------| + | ... 
| unspecified (padding) | + + * Value buffer: + + | VariantEncode("2024-10-24") | + + * field-1 array (`Timestamp(us, UTC)` typed_value) + * Length: 10, Null count: 6 + * Validity bitmap buffer: + + | Byte 0 (validity bitmap) | Byte 1 | Bytes 2-63 | + |--------------------------|-----------|-----------------------| + | 00110011 | 00000000 | 0 (padding) | + + * Value buffer: + + | Bytes 0-7 | Bytes 8-15 | Bytes 16-31 | Bytes 32-39 | Bytes 40-47 | Bytes 48-63 | + |---------------|---------------|--------------|---------------|---------------|------------------------| + | 1729794114937 | 1729794146402 | unspecified | 1729794240241 | 1729794954163 | unspecified (padding) | + + +Putting it all together +''''''''''''''''''''''' + +As mentioned, the **typed_value** field associated with a Variant **value** can be of any shredded type. As a result, +as long as we follow the original rules we can have an arbitrary number of nested levels based on how you want to +shred the object. For example, we might have a few more fields alongside **event_type** to shred out. 
Possibly an object +that looks like this:: + + { + "event_type": "login", + "event_ts": 1729794114937, + "location": {"longitude": 1.5, "latitude": 5.5}, + "tags": ["foo", "bar", "baz"] + } + +If we shred the extra fields out and represent it as Parquet it looks like:: + + optional group event (VARIANT) { + required binary metadata; + optional binary value; # variant, remaining fields/values + optional group typed_value { # shredded fields for variant object + required group event_type { # event_type shredded field + optional binary value; + optional binary typed_value (STRING); + } + required group event_ts { # event_ts shredded field + optional binary value; + optional int64 typed_value (TIMESTAMP(true, MICROS)); + } + required group location { # location shredded field + optional binary value; + optional group typed_value { + required group longitude { + optional binary value; + optional float64 typed_value; + } + required group latitude { + optional binary value; + optional float64 typed_value; + } + } + } + required group tags { # tags shredded field + optional binary value; + optional group typed_value (LIST) { + repeated group list { + required group element { + optional binary value; + optional binary typed_value (STRING); + } + } + } + } + } + } + +Finally, following the rules we set forth on constructing the Variant Extension Type storage type, we end up with:: + + struct< + metadata: binary non-nullable, + value: binary nullable, + typed_value: struct< + event_type: struct<value: binary nullable, typed_value: string nullable> non-nullable, + event_ts: struct<value: binary nullable, typed_value: timestamp(us, UTC) nullable> non-nullable, + location: struct< + value: binary nullable, + typed_value: struct< + longitude: struct<value: binary nullable, typed_value: float64 nullable> non-nullable, + latitude: struct<value: binary nullable, typed_value: float64 nullable> non-nullable + > nullable> non-nullable, + tags: struct< + value: binary nullable, + typed_value: list<element: struct<value: binary nullable, typed_value: string nullable> non-nullable> nullable + > non-nullable + > nullable + > + From 728b19aef07e09a5aaf3032f1f9972b9af6ab49d Mon Sep 17 00:00:00 2001 From: Matt Topol Date: Wed, 3 Sep 2025 14:49:27 -0400 Subject: [PATCH 21/25] trim 
whitespace --- docs/source/format/CanonicalExtensions.rst | 4 ++-- docs/source/format/CanonicalExtensions/Examples.rst | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/source/format/CanonicalExtensions.rst b/docs/source/format/CanonicalExtensions.rst index 0c6870a38542..19e14f6aa7a4 100644 --- a/docs/source/format/CanonicalExtensions.rst +++ b/docs/source/format/CanonicalExtensions.rst @@ -454,13 +454,13 @@ to the Parquet format specification for details on what the actual binary values * A field named ``typed_value`` which can be a :ref:`variant_primitive_type_mapping` or a ``List``, ``LargeList``, ``ListView`` or ``Struct`` - * If the ``typed_value`` field is a ``List``, ``LargeList`` or ``ListView`` its elements **must** be *non-nullable* and **must** + * If the ``typed_value`` field is a ``List``, ``LargeList`` or ``ListView`` its elements **must** be *non-nullable* and **must** be a ``Struct`` consisting of at least one (or both) of the following: * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``. * A field named ``typed_value`` which follows the rules outlined above (this allows for arbitrarily nested data). 
- + * If the ``typed_value`` field is a ``Struct``, then its fields **must** be *non-nullable*, representing the fields being shredded from the objects, and **must** be a ``Struct`` consisting of at least one (or both) of the following: diff --git a/docs/source/format/CanonicalExtensions/Examples.rst b/docs/source/format/CanonicalExtensions/Examples.rst index 7819f68d3cfa..2d4303d2e4e4 100644 --- a/docs/source/format/CanonicalExtensions/Examples.rst +++ b/docs/source/format/CanonicalExtensions/Examples.rst @@ -98,7 +98,7 @@ The data should be stored/represented in Arrow as:: |------------------|--------------------------| | 0, 0, 1, 5, 5 | unspecified (padding) | - * Value buffer: (`00` -> null, + * Value buffer: (`00` -> null, `0x13 0x6E 0x2F 0x61` -> variant encoding literal string "n/a") | Bytes 0-4 | Bytes 5-63 | From 6df102c2132a0a7375f1232a787807035e9eb88f Mon Sep 17 00:00:00 2001 From: Matt Topol Date: Wed, 3 Sep 2025 14:52:31 -0400 Subject: [PATCH 22/25] Update docs/source/format/CanonicalExtensions.rst Co-authored-by: Bryce Mecum --- docs/source/format/CanonicalExtensions.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/format/CanonicalExtensions.rst b/docs/source/format/CanonicalExtensions.rst index 19e14f6aa7a4..1f63b780360c 100644 --- a/docs/source/format/CanonicalExtensions.rst +++ b/docs/source/format/CanonicalExtensions.rst @@ -450,7 +450,7 @@ to the Parquet format specification for details on what the actual binary values * At least one (or both) of the following: * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``. 
- *(unshredded variants consist of just the ``metadata`` and ``value`` fields only)* + (unshredded variants consist of just the ``metadata`` and ``value`` fields only) * A field named ``typed_value`` which can be a :ref:`variant_primitive_type_mapping` or a ``List``, ``LargeList``, ``ListView`` or ``Struct`` From 25c11e0c05b287eb456d074694fbeb8217a1cf0a Mon Sep 17 00:00:00 2001 From: Ian Cook Date: Mon, 8 Sep 2025 14:29:32 -0400 Subject: [PATCH 23/25] Revert change to allowed prefixes --- docs/source/format/CanonicalExtensions.rst | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/docs/source/format/CanonicalExtensions.rst b/docs/source/format/CanonicalExtensions.rst index 1f63b780360c..7c304564a023 100644 --- a/docs/source/format/CanonicalExtensions.rst +++ b/docs/source/format/CanonicalExtensions.rst @@ -45,11 +45,7 @@ types: * The specification text to be added *must* follow these requirements: - 1) It *must* define a well-defined extension name starting with an allowed prefix. - The currently allowed prefixes are: - * "``arrow.``" - For general-purpose canonical extension types. - * "``parquet.``" - For canonical extension types that are intended primarily for - interoperability with the `Apache Parquet `__ format. + 1) It *must* define a well-defined extension name starting with "``arrow.``". 2) Its parameters, if any, *must* be described in the proposal. 
From 44da918127c4e737785e9fcc9be99512b50726cf Mon Sep 17 00:00:00 2001 From: Matt Topol Date: Mon, 8 Sep 2025 16:14:16 -0400 Subject: [PATCH 24/25] updates from comments --- docs/source/format/CanonicalExtensions.rst | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/source/format/CanonicalExtensions.rst b/docs/source/format/CanonicalExtensions.rst index 7c304564a023..8608a6388e0c 100644 --- a/docs/source/format/CanonicalExtensions.rst +++ b/docs/source/format/CanonicalExtensions.rst @@ -433,9 +433,9 @@ Variant represents a value that may be one of: Particularly, this provides a way to represent semi-structured data which is stored as a `Parquet Variant `__ value within Arrow columns in a lossless fashion. This also provides the ability to represent `shredded `__ -variant values. This will make it possible for systems to pass Variant data around without having to upgrade their Arrow version -or otherwise require special handling unless they want to directly interact with the encoded variant data. See the previous links -to the Parquet format specification for details on what the actual binary values should look like. +variant values. The canonical extension type allows systems to pass Variant encoded data around without special handling unless +they want to directly interact with the encoded variant data. See the Parquet format specification for details on what the actual +binary values look like. * Extension name: ``arrow.parquet.variant``. 
@@ -491,7 +491,7 @@ Primitive Type Mappings +======================+========================+ | Null | Null | +----------------------+------------------------+ -| Boolean (true/false) | Boolean | +| Boolean | Boolean (true/false) | +----------------------+------------------------+ | Int8 | Int8 | +----------------------+------------------------+ From 5acb7241c8752b9976fa30abaa47272295b27453 Mon Sep 17 00:00:00 2001 From: Matt Topol Date: Tue, 9 Sep 2025 12:59:33 -0400 Subject: [PATCH 25/25] Update docs/source/format/CanonicalExtensions/Examples.rst Co-authored-by: Yan Tingwang --- docs/source/format/CanonicalExtensions/Examples.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/format/CanonicalExtensions/Examples.rst b/docs/source/format/CanonicalExtensions/Examples.rst index 2d4303d2e4e4..1d3e0d79d8b8 100644 --- a/docs/source/format/CanonicalExtensions/Examples.rst +++ b/docs/source/format/CanonicalExtensions/Examples.rst @@ -115,7 +115,7 @@ The data should be stored/represented in Arrow as:: * Value buffer: - | Bytes 0-31 | Bytes 32-63 | + | Bytes 0-31 | Bytes 32-63 | |---------------------|--------------------------| | 34, 00, 00, 100 | unspecified (padding) |