GH-46908: [Docs][Format] Add variant extension type docs#47456

Merged
zeroshade merged 25 commits into apache:main from zeroshade:variant-docs
Sep 16, 2025

Conversation

@zeroshade
Member

@zeroshade zeroshade commented Aug 28, 2025

Rationale for this change

To support the addition of the Parquet Variant data type and the Iceberg adoption of the variant type, we need a defined way to pass this data through Arrow-compatible systems. As such, we need a specification for a canonical Arrow extension type to represent Variant data.

What changes are included in this PR?

Updates to the docs which define the Arrow Variant Extension type

@github-actions

⚠️ GitHub issue #46908 has been automatically assigned in GitHub to PR creator.

@zeroshade
Member Author

I'll wait for a few reviews on this before I send something on the mailing list to start a vote.

@ianmcook
Member

Higher up in this doc, can you please make this change:

The specification text to be added must follow these requirements:

  1. It must define a well-defined extension name which must start with an allowed prefix. The currently allowed prefixes are:
    • "arrow." - For general-purpose canonical extension types.
    • "parquet." - For canonical extension types that are intended primarily for interoperability with Apache Parquet format.
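
The prefix rule in the suggested text above can be sketched as a small check. This is a minimal Python illustration only: the function name `has_allowed_prefix` and the example extension names are hypothetical and are not part of any Arrow API or of the specification text.

```python
# Prefixes copied from the suggested specification text above.
ALLOWED_PREFIXES = ("arrow.", "parquet.")


def has_allowed_prefix(extension_name: str) -> bool:
    """Return True if the extension name starts with one of the allowed prefixes."""
    # str.startswith accepts a tuple and matches any of its entries.
    return extension_name.startswith(ALLOWED_PREFIXES)


# Hypothetical names, used only to exercise the check.
print(has_allowed_prefix("arrow.example"))
print(has_allowed_prefix("parquet.example"))
print(has_allowed_prefix("custom.example"))
```

A real validator would likely enforce more than the prefix; this sketch covers only the rule quoted in the comment.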

@ianmcook
Member

For uses of the word "variant" outside of code, please standardize the case on "Variant" or "variant" consistently. Right now it's a mix.

Comment thread docs/source/format/CanonicalExtensions.rst Outdated
@github-actions github-actions Bot added awaiting changes and removed awaiting committer review labels Aug 28, 2025
Co-authored-by: Ian Cook <ianmcook@gmail.com>
@github-actions github-actions Bot added awaiting change review and removed awaiting changes labels Aug 28, 2025
Comment thread docs/source/format/CanonicalExtensions.rst Outdated
Comment thread docs/source/format/CanonicalExtensions.rst Outdated
Comment thread docs/source/format/CanonicalExtensions.rst Outdated
Comment thread docs/source/format/CanonicalExtensions.rst Outdated
Comment thread docs/source/format/CanonicalExtensions.rst Outdated
Comment thread docs/source/format/CanonicalExtensions.rst Outdated
Comment thread docs/source/format/CanonicalExtensions.rst Outdated
Comment thread docs/source/format/CanonicalExtensions.rst Outdated
Comment thread docs/source/format/CanonicalExtensions.rst Outdated
Comment on lines +567 to +571
.. note::

Notice that there is a variant ``literal null`` in the ``value`` array, this is due to the
`shredding specification <https://github.com/apache/parquet-format/blob/master/VariantShredding.md#value-shredding>`__
so that a consumer can tell the difference between a *missing* field and a **null** field.
Member

We implicitly describe nulls here and then explicitly describe it below; would it make sense to explicitly introduce nulls first (and link to the docs) and then omit both of the other explanations?

Member Author

Is the updated version better?

@github-actions github-actions Bot added awaiting changes and removed awaiting change review labels Aug 29, 2025
Comment thread docs/source/format/CanonicalExtensions.rst Outdated
Comment thread docs/source/format/CanonicalExtensions.rst Outdated
Comment thread docs/source/format/CanonicalExtensions.rst Outdated
@github-actions github-actions Bot added the awaiting change review label Sep 3, 2025
Member

@amoeba amoeba left a comment

Couple more comments.

Comment thread docs/source/format/CanonicalExtensions.rst Outdated
The simplest case, an unshredded variant always consists of **exactly** two fields: ``metadata`` and ``value``. Any of
the following storage types are valid (not an exhaustive list):

* ``struct<metadata: binary non-nullable, value: binary nullable>``
Member

It sounds like applications don't have to reorder but may choose to and would need to be aware that they must access fields by name and not position. 1:1 compatibility with what Parquet is doing sounds like the better tradeoff (Option 2 above).

Could we extend the existing Note about field order or add another Note succinctly explaining this? i.e., answer @emkornfield's question in the final document.
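
The point about accessing fields by name rather than position can be illustrated with a short sketch. It is plain Python with no Arrow dependency: the storage struct is modeled as a list of (name, type description) pairs, and `layout_a`, `layout_b`, and `field_by_name` are hypothetical names chosen for this example only.

```python
# Two hypothetical layouts of the same unshredded variant storage struct,
# modeled as ordered (field_name, type_description) pairs.
layout_a = [("metadata", "binary non-nullable"), ("value", "binary nullable")]
layout_b = [("value", "binary nullable"), ("metadata", "binary non-nullable")]


def field_by_name(fields, name):
    """Return (position, type_description) for the named field, wherever it sits."""
    for position, (field_name, type_description) in enumerate(fields):
        if field_name == name:
            return position, type_description
    raise KeyError(name)


# Lookup by name finds the same field in both layouts, at different positions.
print(field_by_name(layout_a, "metadata"))
print(field_by_name(layout_b, "metadata"))
```

A consumer that assumed `metadata` is always at position 0 would read the wrong field from the second layout, which is why lookup by name is the safer choice if producers are allowed to reorder.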

Co-authored-by: Bryce Mecum <petridish@gmail.com>
@github-actions github-actions Bot added awaiting changes and removed awaiting change review labels Sep 3, 2025
@alamb
Contributor

alamb commented Sep 4, 2025

Sorry I have been out the last week -- I will review this shortly

@alamb
Contributor

alamb commented Sep 8, 2025

For anyone who missed it, this proposal was voted on and approved: https://lists.apache.org/thread/44o0d3nvxx0y3fzoschny96k5f3mzvlb

(thank you @zeroshade for driving this)

@zeroshade
Member Author

Once we have consensus and approval here, we can merge this. Can everyone please take a look and try to comment on this (or give approval) by the end of the week?

@github-actions github-actions Bot added awaiting change review, awaiting changes and removed awaiting changes, awaiting change review labels Sep 8, 2025
Contributor

@alamb alamb left a comment

looks good to me -- thank you @zeroshade for driving this

we have made good use of the extension type draft documentation in arrow-rs

FYI @scovich, @klion26 @codephage2020 @carpecodeum and @liamzwbao as you have contributed to the arrow-rs implementation and might be interested

Comment thread docs/source/format/CanonicalExtensions.rst Outdated
Comment thread docs/source/format/CanonicalExtensions.rst Outdated
@github-actions github-actions Bot added awaiting merge, awaiting changes and removed awaiting changes, awaiting merge labels Sep 8, 2025
@kou
Member

kou commented Sep 9, 2025

@github-actions crossbow submit preview-docs

@github-actions

github-actions Bot commented Sep 9, 2025

Revision: 44da918

Submitted crossbow builds: ursacomputing/crossbow @ actions-fdb42d9677

Task Status
preview-docs GitHub Actions

Contributor

@codephage2020 codephage2020 left a comment

Looks Good To Me! Thank you!

Comment thread docs/source/format/CanonicalExtensions/Examples.rst Outdated
@kou
Member

kou commented Sep 9, 2025

Comment thread docs/source/format/CanonicalExtensions/Examples.rst
Co-authored-by: Yan Tingwang <tingwangyan2020@163.com>
@zeroshade
Member Author

Thanks everyone for the reviews and assistance making this happen!

@conbench-apache-arrow

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 6c0c3cd.

There weren't enough matching historic benchmark results to make a call on whether there were regressions.

The full Conbench report has more details.
