GH-46908: [Docs][Format] Add variant extension type docs#47456
GH-46908: [Docs][Format] Add variant extension type docs#47456zeroshade merged 25 commits intoapache:mainfrom
Conversation
|
|
|
I'll wait for a few reviews on this before I send something on the mailing list to start a vote. |
|
Higher up in this doc, can you please make this change:
|
|
For uses of the word "variant" outside of code, please standardize the case on "Variant" or "variant" consistently. Right now it's a mix. |
Co-authored-by: Ian Cook <ianmcook@gmail.com>
| .. note:: | ||
|
|
||
| Notice that there is a variant ``literal null`` in the ``value`` array, this is due to the | ||
| `shredding specification <https://github.com/apache/parquet-format/blob/master/VariantShredding.md#value-shredding>`__ | ||
| so that a consumer can tell the difference between a *missing* field and a **null** field. |
There was a problem hiding this comment.
We implicitly describe nulls here and then explicitly describe it below; would it make sense to explicitly introduce nulls first (and link to the docs) and then omit both of the other explanations?
There was a problem hiding this comment.
Is the updated version better?
| The simplest case, an unshredded variant always consists of **exactly** two fields: ``metadata`` and ``value``. Any of | ||
| the following storage types are valid (not an exhaustive list): | ||
|
|
||
| * ``struct<metadata: binary non-nullable, value: binary nullable>`` |
There was a problem hiding this comment.
It sounds like applications don't have to reorder but may choose to and would need to be aware that they must access fields by name and not position. 1:1 compatibility with what Parquet is doing sounds like the better tradeoff (Option 2 above).
Could we extend the existing Note about field order or add another Note succinctly explaining this? i.e., answer @emkornfield question in the final document.
Co-authored-by: Bryce Mecum <petridish@gmail.com>
|
Sorry I have been out the last week -- I will review this shortly |
|
For anyone who missed it this proposal was voted and approved: https://lists.apache.org/thread/44o0d3nvxx0y3fzoschny96k5f3mzvlb (thank you @zeroshade for driving this) |
|
Once we have consensus and approval here, we can merge this. Can everyone please take a look and try to comment on this (or give approval) by the end of the week? |
alamb
left a comment
There was a problem hiding this comment.
looks good to me -- thank you @zeroshade for driving this
we have made good use of the extension type draft documentation in arrow-rs
FYI @scovich, @klion26 @codephage2020 @carpecodeum and @liamzwbao as you have contributed to the arrow-rs implementation and might be interested
|
@github-actions crossbow submit preview-docs |
|
Revision: 44da918 Submitted crossbow builds: ursacomputing/crossbow @ actions-fdb42d9677
|
codephage2020
left a comment
There was a problem hiding this comment.
Looks Good To Me! Thank you!
Co-authored-by: Yan Tingwang <tingwangyan2020@163.com>
|
Thanks everyone for the reviews and assistance making this happen! |
|
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 6c0c3cd. There weren't enough matching historic benchmark results to make a call on whether there were regressions. The full Conbench report has more details. |
Rationale for this change
To support the addition of the Parquet Variant data type and the Iceberg adoption of the variant type, we need a defined way to pass this data through Arrow-compatible systems. As such, we need a specification for a canonical Arrow extension type to represent Variant data.
What changes are included in this PR?
Updates to the docs which define the Arrow Variant Extension type