Merged
25 commits
f1cd15a
docs: add variant extension type docs
zeroshade Aug 28, 2025
712ad76
whitespace
zeroshade Aug 28, 2025
aa0611c
Update docs/source/format/CanonicalExtensions.rst
zeroshade Aug 28, 2025
fab0ea4
update from comment
zeroshade Aug 28, 2025
8e4a24c
Apply suggestions from code review
zeroshade Aug 29, 2025
d0c218b
updates from feedback
zeroshade Aug 29, 2025
60e21dd
Update docs/source/format/CanonicalExtensions.rst
zeroshade Sep 1, 2025
9c8d05f
Update docs/source/format/CanonicalExtensions.rst
zeroshade Sep 1, 2025
e91dbd6
Update docs/source/format/CanonicalExtensions.rst
zeroshade Sep 1, 2025
9ca31a5
fix from feedback
zeroshade Sep 1, 2025
5352496
clarify case sensitive
zeroshade Sep 1, 2025
35c89a4
Apply suggestions from code review
zeroshade Sep 2, 2025
4619dc7
updates from feedback
zeroshade Sep 2, 2025
d8e1cfc
Update docs/source/format/CanonicalExtensions.rst
zeroshade Sep 3, 2025
23f6fb3
Update docs/source/format/CanonicalExtensions.rst
zeroshade Sep 3, 2025
b3b4063
Apply suggestions from code review
zeroshade Sep 3, 2025
5055efe
Update docs/source/format/CanonicalExtensions.rst
zeroshade Sep 3, 2025
2490f56
clarify the meaning of the struct fields
zeroshade Sep 3, 2025
d9c16cd
Add type mapping
zeroshade Sep 3, 2025
b415e37
move examples to new document
zeroshade Sep 3, 2025
728b19a
trim whitespace
zeroshade Sep 3, 2025
6df102c
Update docs/source/format/CanonicalExtensions.rst
zeroshade Sep 3, 2025
25c11e0
Revert change to allowed prefixes
ianmcook Sep 8, 2025
44da918
updates from comments
zeroshade Sep 8, 2025
5acb724
Update docs/source/format/CanonicalExtensions/Examples.rst
zeroshade Sep 9, 2025
128 changes: 127 additions & 1 deletion docs/source/format/CanonicalExtensions.rst
@@ -417,7 +417,133 @@ better zero-copy compatibility with various systems that also store booleans usi

Metadata is an empty string.

.. _parquet_variant_extension:

Parquet Variant
===============

Variant represents a value that may be one of:

* Primitive: a type and corresponding value (e.g. ``INT``, ``STRING``)

* Array: an ordered list of Variant values

* Object: an unordered collection of string/Variant pairs (i.e. key/value pairs). An object may not contain duplicate keys.

In particular, this type provides a lossless way to represent semi-structured data that is stored as a
`Parquet Variant <https://github.com/apache/parquet-format/blob/master/VariantEncoding.md>`__ value within Arrow columns.
It can also represent `shredded <https://github.com/apache/parquet-format/blob/master/VariantShredding.md>`__
variant values. The canonical extension type allows systems to pass Variant-encoded data around without special handling unless
they need to interact directly with the encoded variant data. See the Parquet format specification for details of the
binary encoding.
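
As a rough illustration of the three kinds of Variant value listed above, the following Python sketch parses JSON text into
plain Python values, rejects objects with duplicate keys, and classifies each value. The helper names (``parse_variant_like``,
``variant_kind``) are hypothetical, and JSON is only a stand-in here: the actual Variant binary encoding is defined by the
Parquet specification and covers more primitive types than JSON does.

```python
import json


def _reject_duplicate_keys(pairs):
    """Build a dict from key/value pairs, refusing duplicate keys."""
    obj = {}
    for key, value in pairs:
        if key in obj:
            raise ValueError(f"duplicate object key: {key!r}")
        obj[key] = value
    return obj


def parse_variant_like(text):
    """Parse JSON text, enforcing the 'no duplicate keys' rule for objects."""
    return json.loads(text, object_pairs_hook=_reject_duplicate_keys)


def variant_kind(value):
    """Classify a parsed value as 'primitive', 'array' or 'object'."""
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "array"
    return "primitive"


doc = parse_variant_like('{"id": 7, "tags": ["a", "b"], "meta": {"ok": true}}')
kinds = {key: variant_kind(val) for key, val in doc.items()}
print(kinds)
```

Arrays and objects nest freely, so ``variant_kind`` would be applied recursively to walk a whole value.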

* Extension name: ``arrow.parquet.variant``.

* The storage type of this extension is a ``Struct`` that obeys the following rules:

  * A *non-nullable* field named ``metadata`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.

  * At least one (or both) of the following:

    * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``
      (unshredded variants consist of only the ``metadata`` and ``value`` fields).

    * A field named ``typed_value`` which can be any primitive type listed in :ref:`variant_primitive_type_mapping`, or a
      ``List``, ``LargeList``, ``ListView`` or ``Struct``.

      * If the ``typed_value`` field is a ``List``, ``LargeList`` or ``ListView``, its elements **must** be *non-nullable* and **must**
        be a ``Struct`` consisting of at least one (or both) of the following:

        * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.

        * A field named ``typed_value`` which follows the rules outlined above (this allows for arbitrarily nested data).

      * If the ``typed_value`` field is a ``Struct``, then its fields, which represent the fields being shredded from the objects,
        **must** be *non-nullable* and **must** each be a ``Struct`` consisting of at least one (or both) of the following:

        * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.

        * A field named ``typed_value`` which follows the rules outlined above (this allows for arbitrarily nested data).

* Extension type parameters:

  This type does not have any parameters.
Review comment (Contributor):

The Parquet spec is versioned for forward compatibility; are we going to ignore that for now?

Reply (Member, @pitrou, Sep 3, 2025):

We can add a later optional parameter to denote future versions of the Variant spec, if any?
(Other Parquet spec additions would not matter.)

Reply (Member Author):

The Arrow spec is also versioned, and the data in the metadata array is the raw Parquet Variant bytes, which include the version number.


* Description of the serialization:

  Extension metadata is an empty string.
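
The storage type rules can be read as a small recursive check. The sketch below is one possible reading of them, written
against a deliberately tiny, hypothetical type model (``DataType``, ``Field``, ``check_variant_storage``) rather than any real
Arrow library. It assumes a simplified set of primitive kinds, ignores type parameters, and does not model a
dictionary-encoded or run-end-encoded ``metadata`` field.

```python
from dataclasses import dataclass

BINARY_KINDS = {"Binary", "LargeBinary", "BinaryView"}
LIST_KINDS = {"List", "LargeList", "ListView"}
# Simplified subset of the primitive mapping table (type parameters ignored).
PRIMITIVE_KINDS = BINARY_KINDS | {
    "Null", "Boolean", "Int8", "Int16", "Int32", "Int64", "Float", "Double",
    "Date32", "Time64", "Timestamp", "String", "LargeString", "StringView",
}


@dataclass(frozen=True)
class DataType:
    kind: str              # e.g. "Binary", "Struct", "List"
    children: tuple = ()   # child Field objects for nested kinds


@dataclass(frozen=True)
class Field:
    name: str
    type: DataType
    nullable: bool = True


def _fields_by_name(struct):
    # Field order is not significant, so fields are always looked up by name.
    return {f.name: f for f in struct.children}


def _check_value_group(struct, where):
    """Check a Struct that must carry 'value' and/or 'typed_value'."""
    if struct.kind != "Struct":
        raise ValueError(f"{where}: expected Struct, got {struct.kind}")
    fields = _fields_by_name(struct)
    value, typed = fields.get("value"), fields.get("typed_value")
    if value is None and typed is None:
        raise ValueError(f"{where}: needs 'value' and/or 'typed_value'")
    if value is not None and value.type.kind not in BINARY_KINDS:
        raise ValueError(f"{where}.value: must be a binary type")
    if typed is not None:
        _check_typed_value(typed.type, f"{where}.typed_value")


def _check_typed_value(dtype, where):
    """Check a 'typed_value' type: primitive, list-like, or Struct."""
    if dtype.kind in LIST_KINDS or dtype.kind == "Struct":
        # List elements and shredded object fields follow the same rule.
        for child in dtype.children:
            if child.nullable:
                raise ValueError(f"{where}.{child.name}: must be non-nullable")
            _check_value_group(child.type, f"{where}.{child.name}")
    elif dtype.kind not in PRIMITIVE_KINDS:
        raise ValueError(f"{where}: unsupported type {dtype.kind}")


def check_variant_storage(storage):
    """Raise ValueError if 'storage' breaks the rules sketched above."""
    if storage.kind != "Struct":
        raise ValueError("storage: must be a Struct")
    metadata = _fields_by_name(storage).get("metadata")
    if metadata is None or metadata.nullable:
        raise ValueError("storage.metadata: required and non-nullable")
    if metadata.type.kind not in BINARY_KINDS:
        raise ValueError("storage.metadata: must be a binary type")
    _check_value_group(storage, "storage")


unshredded = DataType("Struct", (
    Field("metadata", DataType("Binary"), nullable=False),
    Field("value", DataType("Binary")),
))

# A shredded layout: the object key "price" is shredded into a Double.
shredded = DataType("Struct", (
    Field("value", DataType("BinaryView")),
    Field("metadata", DataType("BinaryView"), nullable=False),
    Field("typed_value", DataType("Struct", (
        Field("price", DataType("Struct", (
            Field("value", DataType("BinaryView")),
            Field("typed_value", DataType("Double")),
        )), nullable=False),
    ))),
))

check_variant_storage(unshredded)
check_variant_storage(shredded)
print("both layouts pass")
```

The sketch does not reject unexpected extra fields; whether a stricter check is appropriate is left open here.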

.. note::

   It is also *permissible* for the ``metadata`` field to be dictionary-encoded with a preferred (*but not required*) index type of ``int8``,
   or run-end-encoded with a preferred (*but not required*) runs type of ``int8``.

.. note::

   The fields may be in any order, and thus must be accessed by **name** not by *position*. The field names are case sensitive.
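
To make the name-based access rule concrete, here is a small sketch in which two producers emit the same struct fields in
different orders. The ``column_by_name`` helper and the list-of-pairs representation are hypothetical stand-ins for whatever
struct accessor an Arrow implementation offers, and the byte strings are opaque placeholders rather than real Variant data.

```python
def column_by_name(struct_fields, name):
    """Return the column whose field name matches exactly (case sensitive)."""
    matches = [column for field_name, column in struct_fields if field_name == name]
    if len(matches) != 1:
        raise KeyError(name)
    return matches[0]


# Same logical data, different field order (placeholder bytes only).
producer_a = [("metadata", [b"<meta>"]), ("value", [b"<value>"])]
producer_b = [("value", [b"<value>"]), ("metadata", [b"<meta>"])]

# Lookup by name gives the same answer regardless of position.
print(column_by_name(producer_a, "metadata") == column_by_name(producer_b, "metadata"))

# Positional access would silently pick different columns.
print(producer_a[0][1] == producer_b[0][1])
```

Because names are case sensitive, a lookup for ``Metadata`` raises ``KeyError`` in this sketch instead of falling back to ``metadata``.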

.. _variant_primitive_type_mapping:

Primitive Type Mappings
-----------------------

+----------------------+------------------------+
| Arrow Primitive Type | Variant Primitive Type |
+======================+========================+
| Null                 | Null                   |
+----------------------+------------------------+
| Boolean              | Boolean (true/false)   |
+----------------------+------------------------+
| Int8                 | Int8                   |
+----------------------+------------------------+
| Uint8                | Int16                  |
+----------------------+------------------------+
| Int16                | Int16                  |
+----------------------+------------------------+
| Uint16               | Int32                  |
+----------------------+------------------------+
| Int32                | Int32                  |
+----------------------+------------------------+
| Uint32               | Int64                  |
+----------------------+------------------------+
| Int64                | Int64                  |
+----------------------+------------------------+
| Float                | Float                  |
+----------------------+------------------------+
| Double               | Double                 |
+----------------------+------------------------+
| Decimal32            | decimal4               |
+----------------------+------------------------+
| Decimal64            | decimal8               |
+----------------------+------------------------+
| Decimal128           | decimal16              |
+----------------------+------------------------+
| Date32               | Date                   |
+----------------------+------------------------+
| Time64               | TimeNTZ                |
+----------------------+------------------------+
| Timestamp(us, UTC)   | Timestamp (micro)      |
+----------------------+------------------------+
| Timestamp(us)        | TimestampNTZ (micro)   |
+----------------------+------------------------+
| Timestamp(ns, UTC)   | Timestamp (nano)       |
+----------------------+------------------------+
| Timestamp(ns)        | TimestampNTZ (nano)    |
+----------------------+------------------------+
| Binary               | Binary                 |
+----------------------+------------------------+
| LargeBinary          | Binary                 |
+----------------------+------------------------+
| BinaryView           | Binary                 |
+----------------------+------------------------+
| String               | String                 |
+----------------------+------------------------+
| LargeString          | String                 |
+----------------------+------------------------+
| StringView           | String                 |
+----------------------+------------------------+
| UUID extension type  | UUID                   |
+----------------------+------------------------+
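
The mapping above can be expressed as a lookup table. The following sketch transcribes it into a Python dictionary keyed by
the type names exactly as written in the table; the names ``ARROW_TO_VARIANT``, ``variant_primitive_for`` and
``unsigned_fits_losslessly`` are hypothetical. It also checks one property the table appears to rely on: each unsigned Arrow
integer maps to a signed Variant integer wide enough to hold its full range.

```python
ARROW_TO_VARIANT = {
    "Null": "Null",
    "Boolean": "Boolean (true/false)",
    "Int8": "Int8",
    "Uint8": "Int16",
    "Int16": "Int16",
    "Uint16": "Int32",
    "Int32": "Int32",
    "Uint32": "Int64",
    "Int64": "Int64",
    "Float": "Float",
    "Double": "Double",
    "Decimal32": "decimal4",
    "Decimal64": "decimal8",
    "Decimal128": "decimal16",
    "Date32": "Date",
    "Time64": "TimeNTZ",
    "Timestamp(us, UTC)": "Timestamp (micro)",
    "Timestamp(us)": "TimestampNTZ (micro)",
    "Timestamp(ns, UTC)": "Timestamp (nano)",
    "Timestamp(ns)": "TimestampNTZ (nano)",
    "Binary": "Binary",
    "LargeBinary": "Binary",
    "BinaryView": "Binary",
    "String": "String",
    "LargeString": "String",
    "StringView": "String",
    "UUID extension type": "UUID",
}

# Bit widths used only for the range check below.
UNSIGNED_BITS = {"Uint8": 8, "Uint16": 16, "Uint32": 32}
SIGNED_BITS = {"Int8": 8, "Int16": 16, "Int32": 32, "Int64": 64}


def variant_primitive_for(arrow_type):
    """Look up the Variant primitive type for an Arrow type name."""
    try:
        return ARROW_TO_VARIANT[arrow_type]
    except KeyError:
        raise ValueError(f"no Variant mapping listed for {arrow_type!r}") from None


def unsigned_fits_losslessly(arrow_type):
    """True if every value of the unsigned type fits in its mapped signed type."""
    target = variant_primitive_for(arrow_type)
    max_unsigned = 2 ** UNSIGNED_BITS[arrow_type] - 1
    max_signed = 2 ** (SIGNED_BITS[target] - 1) - 1
    return max_unsigned <= max_signed


print(variant_primitive_for("Uint8"))
print(all(unsigned_fits_losslessly(t) for t in UNSIGNED_BITS))
```

The table has no row for a 64-bit unsigned integer, so ``variant_primitive_for("Uint64")`` raises ``ValueError`` in this sketch.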

Community Extension Types
=========================
