Add support for FixedSizeList to variant_to_arrow#9663
Add support for FixedSizeList to variant_to_arrow#9663rishvin wants to merge 11 commits intoapache:mainfrom
Conversation
|
@sdf-jkl could you help review this PR? |
|
Thanks @sdf-jkl for reviewing changes. I have addressed the comments, could you re-review ? |
| } | ||
|
|
||
| impl<'a> ListElementBuilder<'a> { | ||
| fn append_null(&mut self) -> Result<()> { |
There was a problem hiding this comment.
Maybe add a comment mentioning this is only used for FixedSizeList builder?
Co-authored-by: Konstantin Tarasov <33369833+sdf-jkl@users.noreply.github.com>
|
Thanks @sdf-jkl addressed new comments. |
Co-authored-by: Konstantin Tarasov <33369833+sdf-jkl@users.noreply.github.com>
|
Thanks again @sdf-jkl ! Addressed it. |
| // With `safe` cast option set to false, appending list of wrong size to | ||
| // `typed_value_builder` of type `FixedSizeList` will result in an error. In such a | ||
| // case, the provided list should be appended to the `value_builder. |
There was a problem hiding this comment.
I'm not sure the variant shredding spec allows for shredding as a fixed size list, if the resulting layout differs physically from a normal list?
Arrays can be shredded by using a 3-level Parquet list for
typed_value.If the value is not an array,
typed_valuemust be null. If the value is an array,valuemust be null.
It looks to me like any attempt to shred as fixed-sized list must either succeed (if the size is correct) or hard-fail (because value as fallback is not allowed).
There was a problem hiding this comment.
It differs physically on the Arrow side, but once we write it to Parquet it'd be same as other ListLikeArrays. But this leads to further discussion on adding FixedSizeList support for VariantArray as well as implementing other types, currently not supported in spec.
We're keeping value because we consider this a cast from Variant to FixedSizeList. The extra len check is there because there is no Variant::FixedSizeList enum to match to. If len is incorrect we consider the cast failed and proceed following the safe cast option as if typed_value is Null.
There was a problem hiding this comment.
When casting from variant to arrow, we can do whatever we want.
But this code here is about going from binary variant to shredded variant. And the variant shredding spec directly forbids value to contain a variant array, when shredding as array.
There was a problem hiding this comment.
True. I think the core issue is that Parquet currently has only one logical LIST type. If Parquet had a dedicated logical type for FixedSizeList, the spec wording could be more explicit.
Btw, there’s ongoing work on this too: apache/parquet-format#241 (recently revived).
Given the current spec text:
Arrays can be shredded by using a 3-level Parquet list for typed_value.
If the value is not an array, typed_value must be null. If the value is an array, value must be null.
I read “array” as "a value matching the specific list shape we’re shredding into". For List/LargeList/ListView it's List values, for FixedSizeList array it's a FixedSizeList value.
There was a problem hiding this comment.
From what I understand, the variant spec neither knows nor cares about the intricacies of arrow array types (it also doesn't care about spark or SQL). If we're shredding to a 3-level parquet list, and we encounter a variant array value, the resulting value column entry must be null.
There was a problem hiding this comment.
I don't know...
My gut reaction is to forbid it completely because it's preferable to fail early and consistently rather than blow up unpredictably partway through shredding. The spec forbids us to fallback to value when the length is wrong, and we can't emit NULL either (regardless of safe casting options) because shredding is supposed to be lossless (it isn't a converting cast).
But I also don't have any intuition of actual use cases for this feature, to weigh its benefits against the downsides.
NOTE: The same dilemma applies to fixed length binary and fixed length string, except the spec technically doesn't forbid us to fallback to the value column. But how would we actually use it?
Thought experiment
Suppose we could use the value column as fallback in all three cases? Then the writer produces a shredded variant column where all wrong-length values are in value column (along with the wrong-type values) while all right-type-right-length values are in typed_value. But what good does it do? A reader who comes along later has no way to know about any of this, because there's nothing in the parquet that reliably indicates this was a fixed length array/binary/string column. Sure, we might embed arrow metadata, but non-arrow readers would ignore it and even arrow readers are on shaky ground trusting it because this is something entirely outside the variant spec.
Overall, enforcing list/binary/string length feels like a logical validation issue while shredding is a physical operation. And I suspect it's best to leave client code to enforce logical validity of all forms, not try to encode it in the type system. Additionally, It feels like fixed-len-xxx types fall in the same category as unsigned or non-zero types. True, arrow has unsigned int types but it lacks non-zero types. And parquet has neither.
There was a problem hiding this comment.
True, the use case should be the first thing we think about before implementing.
FSL could be useful for embeddings and tensors but the question is why would anyone put them inside a Variant for such a heavy compute work? Maybe coordinates inside a log struct?
The fact that we can't store FSL in Parquet is a big reason to not do it.
FixedSizeBinary actually does have a Parquet representation - FIXED_LEN_BYTE_ARRAY and is used for shredding into a UUID, but other than UUID, what could be a use case for storing inside a Variant?
Overall, I think we'd not gain much by including FSL because it's so frail and the use cause is not defined. Isn't the whole point of Variant is to host semi-structured data, if we have fixed-len-something it'd be better to keep a separate column for it. If fixed-length is a logical constraint rather than a physical one, enforcement belongs downstream.
There was a problem hiding this comment.
Overall, I think we'd not gain much by including FSL because it's so frail and the use cause is not defined. Isn't the whole point of Variant is to host semi-structured data, if we have fixed-len-something it'd be better to keep a separate column for it. If fixed-length is a logical constraint rather than a physical one, enforcement belongs downstream.
Agree. I just realized that somebody could create a FSL of [shredded] variant, if the use case just needs flexible typing of N list members. But that also seems somewhat contrived, to have such strongly-typed structure but sufficient uncertainty about the list element type to merit using variant.
There was a problem hiding this comment.
Should we then keep it hard fail and leave it to the user?
There was a problem hiding this comment.
Catching up late on this. I'm aligned with keeping it a hard fail and leaving it to the user.
variant_to_arrowFixedSizeListtype support #9531Rationale for this change
Add support for
FixedSizeListwhen invokingvariant_to_arrow.What changes are included in this PR?
VariantToFixedSizeListArrowRowBuilder.FixedSizeList.Are these changes tested?
By adding few test cases.
Are there any user-facing changes?
N/A.