[Variant] WIP Tests for variant_get of shredded variants#7965
[Variant] WIP Tests for variant_get of shredded variants#7965alamb wants to merge 1 commit intoapache:mainfrom
Conversation
|
FYI @Samyak2 @scovich @friendlymatthew and @klion26 and @carpecodeum as I think you are interested in this feature |
| let struct_array = StructArrayBuilder::new() | ||
| .with_field("metadata", Arc::new(metadata)) | ||
| .with_field("typed_value", Arc::new(typed_value)) | ||
| .build(); | ||
|
|
||
| Arc::new( | ||
| VariantArray::try_new(Arc::new(struct_array)).expect("should create variant array"), | ||
| ) |
There was a problem hiding this comment.
From what I understood of the spec, the value field needs to be present but it can have nulls. Did I understand it incorrectly? If not, this struct would be an invalid variant
There was a problem hiding this comment.
My reading of https://docs.google.com/document/d/1pw0AWoMQY3SjD7R4LgbPvMjG_xSCtXp3rZHkVp9jpZ4/ was that value was optional in the case the variant shredded "perfectly"
I may also be misreading it.
Maybe @zeroshade has some time to coment
There was a problem hiding this comment.
The shredding spec definitely says both columns are optional:
Both
valueandtyped_valueare optional fields used together to encode a single value.
Additionally:
Values in the two fields must be interpreted according to the following table:
valuetyped_valueMeaning null null The value is missing; only valid for shredded object fields non-null null The value is present and may be any type, including null null non-null The value is present and is the shredded type non-null non-null The value is present and is a partially shredded object
As always in parquet, a physically missing column is interpreted as all-null.
There was a problem hiding this comment.
CC @cashmand in case I totally misunderstood something
There was a problem hiding this comment.
Agreed, the intent is that a missing value column should be treated as equivalent to a present-but-all-null value column.
There was a problem hiding this comment.
Correct, both are optional fields, but it's required to have at least one of them.
| ); | ||
| } | ||
|
|
||
| /// Shredding: extract a value as a VariantArray |
There was a problem hiding this comment.
The PR just adds tests here -- to show what I think the output of variant_get should be for different variant cases.
| /// Append the [`Variant`] to the builder as the next row | ||
| pub fn append_variant(&mut self, variant: Variant) { | ||
| pub fn append_variant<'m, 'v>(&mut self, variant: impl Into<Variant<'m, 'v>>) { | ||
| // TODO make this more efficient by avoiding the intermediate buffers |
There was a problem hiding this comment.
Could we implement the approach @scovich suggested about invoking primitive append_xxx methods directly instead of using intermediate VariantBuilder?
There was a problem hiding this comment.
something like
pub fn append_variant<'m, 'v>(&mut self, variant: impl Into<Variant<'m, 'v>>) {
let variant = variant.into();
match variant {
Variant::Int32(v) => self.append_int32(v),
Variant::String(v) => self.append_string(v),
Variant::Object(obj) => {
// Build object directly into our buffers without intermediate VariantBuilder
self.append_object_start();
for field in obj.iter() {
self.append_variant(field.value()); // recursive
}
self.append_object_end();
}
// ... other variants
}
}
There was a problem hiding this comment.
If you are concerned about the intermediate buffers, I think after the following two PRs are merged:
- Convert JSON to VariantArray without copying (8 - 32% faster) #7911
- [Variant] Avoid extra allocation in object builder #7935
There will be a way to append variant values to the final output buffer directly without copying to intermediate buffers
Is there another potential part of your question?
There was a problem hiding this comment.
Got it, will follow these PRs then
| assert_eq!(result.len(), 3); | ||
|
|
||
| // Expect the values are the same as the original values | ||
| assert_eq!(result.value(0), Variant::Int32(1)); |
There was a problem hiding this comment.
If the GetOptions points to the leaf node, will the result here be equal to the result in test get_variant_perfectly_shredded_int32_as_int32
# Which issue does this PR close? - Part of apache#6736 - Closes apache#7941 - Closes apache#7965 # Rationale for this change This is has a proposal for how to structure shredded `VariantArray`s and the `variant_get` kernel If people like the basic idea I will file some more tickets to track additional follow on work It is based on ideas ideas from @carpecodeum in apache#7946 and @scovich in apache#7915 I basically took the tests from apache#7965 and the conversation with @scovich recorded from apache#7941 (comment) and I bashed out how this might look # What changes are included in this PR? 1. Update `VariantArray` to represent shredding 2. Add code to `variant_get` to support extracting paths as both variants and typed fields 3. A pattern that I think can represent shredding and extraction 4. Tests for same Note there are many things that are NOT in this PR that I envision doing as follow on PRs: 1. Support and implementing `Path`s 2. Support for shredded objects 3. Support shredded lists 4. Support nested objects / lists 5. Full casting support 6. Support for other output types: `StringArray`, `StringViewArray`, etc 8. Many performance improvements # Are these changes tested? Yes # Are there any user-facing changes? New feature --------- Co-authored-by: Samyak Sarnayak <samyak201@gmail.com> Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>
Which issue does this PR close?
variant_getkernel for shredded variants #7941Rationale for this change
In order to make progress on accessing shredded variants, I think we need some tests showing how they are used and constructed. I am hoping some tests will help us along in figuring out how to implement this efficiently
I am hoping someone can use these tests to start implement efficient
variant_getfor shredded valuesWhat changes are included in this PR?
Implement
varaint_gettests based on the examples in the arrow variant doc from @zeroshade (see apache/arrow#46908)Are these changes tested?
They are only tests
Are there any user-facing changes?
No