Skip to content

Add support for FixedSizeList to variant_to_arrow#9663

Open
rishvin wants to merge 11 commits intoapache:mainfrom
rishvin:fixed_size_list_variant_to_arrow
Open

Add support for FixedSizeList to variant_to_arrow#9663
rishvin wants to merge 11 commits intoapache:mainfrom
rishvin:fixed_size_list_variant_to_arrow

Conversation

@rishvin
Copy link
Copy Markdown

@rishvin rishvin commented Apr 6, 2026

Rationale for this change

Add support for FixedSizeList when invoking variant_to_arrow.

What changes are included in this PR?

  • Introduces a new builder VariantToFixedSizeListArrowRowBuilder.
  • Adds test cases for shredding and getting variant by FixedSizeList.

Are these changes tested?

By adding few test cases.

Are there any user-facing changes?

N/A.

@github-actions github-actions Bot added the parquet-variant parquet-variant* crates label Apr 6, 2026
@alamb
Copy link
Copy Markdown
Contributor

alamb commented Apr 6, 2026

@sdf-jkl could you help review this PR?

Copy link
Copy Markdown
Contributor

@sdf-jkl sdf-jkl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rishvin, I left some suggestions.

Comment thread parquet-variant-compute/src/variant_to_arrow.rs Outdated
Comment thread parquet-variant-compute/src/variant_to_arrow.rs Outdated
Comment thread parquet-variant-compute/src/variant_to_arrow.rs
Comment thread parquet-variant-compute/src/variant_get.rs Outdated
Comment thread parquet-variant-compute/src/variant_get.rs Outdated
Comment thread parquet-variant-compute/src/shred_variant.rs
@rishvin rishvin requested a review from sdf-jkl April 11, 2026 21:24
@rishvin
Copy link
Copy Markdown
Author

rishvin commented Apr 11, 2026

Thanks @sdf-jkl for reviewing changes. I have addressed the comments, could you re-review ?

Copy link
Copy Markdown
Contributor

@sdf-jkl sdf-jkl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work on this @rishvin! Everything looks great, I left some nits.

}

impl<'a> ListElementBuilder<'a> {
fn append_null(&mut self) -> Result<()> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a comment mentioning this is only used for FixedSizeList builder?

Comment thread parquet-variant-compute/src/variant_get.rs Outdated
Comment thread parquet-variant-compute/src/variant_get.rs Outdated
Comment thread parquet-variant-compute/src/shred_variant.rs Outdated
Comment thread parquet-variant-compute/src/shred_variant.rs Outdated
Comment thread parquet-variant-compute/src/shred_variant.rs Outdated
Comment thread parquet-variant-compute/src/shred_variant.rs
rishvin and others added 2 commits April 13, 2026 11:53
Co-authored-by: Konstantin Tarasov <33369833+sdf-jkl@users.noreply.github.com>
@rishvin rishvin requested a review from sdf-jkl April 13, 2026 19:24
@rishvin
Copy link
Copy Markdown
Author

rishvin commented Apr 13, 2026

Thanks @sdf-jkl addressed new comments.

Copy link
Copy Markdown
Contributor

@sdf-jkl sdf-jkl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One tiny change.

Comment thread parquet-variant-compute/src/shred_variant.rs Outdated
Co-authored-by: Konstantin Tarasov <33369833+sdf-jkl@users.noreply.github.com>
@rishvin rishvin requested a review from sdf-jkl April 14, 2026 04:25
@rishvin
Copy link
Copy Markdown
Author

rishvin commented Apr 14, 2026

Thanks again @sdf-jkl ! Addressed it.

Copy link
Copy Markdown
Contributor

@sdf-jkl sdf-jkl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rishvin! LGTM.

@scovich could you please check this when available? Thanks!

Comment on lines +334 to +336
// With `safe` cast option set to false, appending list of wrong size to
// `typed_value_builder` of type `FixedSizeList` will result in an error. In such a
// case, the provided list should be appended to the `value_builder.
Copy link
Copy Markdown
Contributor

@scovich scovich Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure the variant shredding spec allows for shredding as a fixed size list, if the resulting layout differs physically from a normal list?

Arrays can be shredded by using a 3-level Parquet list for typed_value.

If the value is not an array, typed_value must be null. If the value is an array, value must be null.

It looks to me like any attempt to shred as fixed-sized list must either succeed (if the size is correct) or hard-fail (because value as fallback is not allowed).

Copy link
Copy Markdown
Contributor

@sdf-jkl sdf-jkl Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It differs physically on the Arrow side, but once we write it to Parquet it'd be same as other ListLikeArrays. But this leads to further discussion on adding FixedSizeList support for VariantArray as well as implementing other types, currently not supported in spec.

We're keeping value because we consider this a cast from Variant to FixedSizeList. The extra len check is there because there is no Variant::FixedSizeList enum to match to. If len is incorrect we consider the cast failed and proceed following the safe cast option as if typed_value is Null.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When casting from variant to arrow, we can do whatever we want.

But this code here is about going from binary variant to shredded variant. And the variant shredding spec directly forbids value to contain a variant array, when shredding as array.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. I think the core issue is that Parquet currently has only one logical LIST type. If Parquet had a dedicated logical type for FixedSizeList, the spec wording could be more explicit.

Btw, there’s ongoing work on this too: apache/parquet-format#241 (recently revived).

Given the current spec text:

Arrays can be shredded by using a 3-level Parquet list for typed_value.

If the value is not an array, typed_value must be null. If the value is an array, value must be null.

I read “array” as "a value matching the specific list shape we’re shredding into". For List/LargeList/ListView it's List values, for FixedSizeList array it's a FixedSizeList value.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I understand, the variant spec neither knows nor cares about the intricacies of arrow array types (it also doesn't care about spark or SQL). If we're shredding to a 3-level parquet list, and we encounter a variant array value, the resulting value column entry must be null.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know...

My gut reaction is to forbid it completely because it's preferable to fail early and consistently rather than blow up unpredictably partway through shredding. The spec forbids us to fallback to value when the length is wrong, and we can't emit NULL either (regardless of safe casting options) because shredding is supposed to be lossless (it isn't a converting cast).

But I also don't have any intuition of actual use cases for this feature, to weigh its benefits against the downsides.

NOTE: The same dilemma applies to fixed length binary and fixed length string, except the spec technically doesn't forbid us to fallback to the value column. But how would we actually use it?

Thought experiment

Suppose we could use the value column as fallback in all three cases? Then the writer produces a shredded variant column where all wrong-length values are in value column (along with the wrong-type values) while all right-type-right-length values are in typed_value. But what good does it do? A reader who comes along later has no way to know about any of this, because there's nothing in the parquet that reliably indicates this was a fixed length array/binary/string column. Sure, we might embed arrow metadata, but non-arrow readers would ignore it and even arrow readers are on shaky ground trusting it because this is something entirely outside the variant spec.

Overall, enforcing list/binary/string length feels like a logical validation issue while shredding is a physical operation. And I suspect it's best to leave client code to enforce logical validity of all forms, not try to encode it in the type system. Additionally, It feels like fixed-len-xxx types fall in the same category as unsigned or non-zero types. True, arrow has unsigned int types but it lacks non-zero types. And parquet has neither.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, the use case should be the first thing we think about before implementing.

FSL could be useful for embeddings and tensors but the question is why would anyone put them inside a Variant for such a heavy compute work? Maybe coordinates inside a log struct?

The fact that we can't store FSL in Parquet is a big reason to not do it.

FixedSizeBinary actually does have a Parquet representation - FIXED_LEN_BYTE_ARRAY and is used for shredding into a UUID, but other than UUID, what could be a use case for storing inside a Variant?

Overall, I think we'd not gain much by including FSL because it's so frail and the use cause is not defined. Isn't the whole point of Variant is to host semi-structured data, if we have fixed-len-something it'd be better to keep a separate column for it. If fixed-length is a logical constraint rather than a physical one, enforcement belongs downstream.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I think we'd not gain much by including FSL because it's so frail and the use cause is not defined. Isn't the whole point of Variant is to host semi-structured data, if we have fixed-len-something it'd be better to keep a separate column for it. If fixed-length is a logical constraint rather than a physical one, enforcement belongs downstream.

Agree. I just realized that somebody could create a FSL of [shredded] variant, if the use case just needs flexible typing of N list members. But that also seems somewhat contrived, to have such strongly-typed structure but sufficient uncertainty about the list element type to merit using variant.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we then keep it hard fail and leave it to the user?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Catching up late on this. I'm aligned with keeping it a hard fail and leaving it to the user.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet-variant parquet-variant* crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Variant] Add variant_to_arrow FixedSizeList type support

5 participants