-
Notifications
You must be signed in to change notification settings - Fork 108
feat: Add push_validity_into_children methods to StructArray #5826
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
Add methods to push struct-level validity into child fields: - push_validity_into_children(preserve_struct_validity: bool) - push_validity_into_children_default() - convenience method with preserve=false The functionality propagates null information from struct level down to individual fields, with options to preserve or remove the struct-level validity. Includes comprehensive tests covering all scenarios: - preserve_struct_validity = true - preserve_struct_validity = false (default) - no nulls edge case Signed-off-by: amorynan <amorywang111@gmail.com>
Codecov Report✅ All modified and coverable lines are covered by tests. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
connortsui20
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good first step, though I'm trying to figure out how we want to implement this behavior for the new operator world we are migrating to.
@gatesn mentioned this in the original issue:
Since this issue was filed, we now have a "Mask" expression that essentially performs intersection with validity.
So you could take a look at how you might take the validity from the struct array itself, and wrap up each child field in a mask expression (see builtins.rs), before constructing a new StructArray with or without the previous validity. These two are different behaviors.
This PR implements this feature correctly for the old (current) world, but we will likely want to implement the behavior described above very very soon, so soon that it might not even be worth merging this right now? @gatesn do you have any thoughts?
|
|
||
|
|
||
| pub fn push_validity_into_children(&self, preserve_struct_validity: bool) -> VortexResult<Self> { | ||
| use crate::compute::mask; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move this import to the top?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also prefer to see compute::mask as the function call instead of just mask, but that is just my personal preference
| /// // Create struct with top-level nulls | ||
| /// let struct_array = StructArray::try_new( | ||
| /// ["a", "b"].into(), | ||
| /// vec![ | ||
| /// buffer![1i32, 2i32, 3i32].into_array(), | ||
| /// buffer![10i32, 20i32, 30i32].into_array(), | ||
| /// ], | ||
| /// 3, | ||
| /// Validity::from_iter([true, false, true]), // row 1 is null | ||
| /// ).unwrap(); | ||
| /// | ||
| /// // Push validity into children, preserving struct validity | ||
| /// let pushed = struct_array.push_validity_into_children(true).unwrap(); | ||
| /// // pushed.fields()[0] now has nulls at position 1 | ||
| /// // pushed.fields()[1] now has nulls at position 1 | ||
| /// // pushed.validity still shows row 1 as null | ||
| /// | ||
| /// // Push validity into children, removing struct validity | ||
| /// let pushed_no_struct = struct_array.push_validity_into_children(false).unwrap(); | ||
| /// // pushed_no_struct.fields()[0] now has nulls at position 1 | ||
| /// // pushed_no_struct.fields()[1] now has nulls at position 1 | ||
| /// // pushed_no_struct.validity is AllValid | ||
| /// ``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be more helpful to comment what the struct array looks like instead of requiring the reader to parse what the created struct array becomes after try_new. And you might as well have assertions stating that specific positions are null rather than just comments.
| /// buffer![1i32, 2i32, 3i32].into_array(), | ||
| /// buffer![10i32, 20i32, 30i32].into_array(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may not be immediately obvious to someone reading this that the child buffers are non-nullable
| self.names().clone(), | ||
| self.fields().clone(), | ||
| self.len(), | ||
| Validity::AllValid, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm I wonder if we should make the validity of the top-level struct array Validity::NonNullable instead of AllValid if the user requests to not preserve the struct validity. Do you have any thoughts?
| let null_mask = struct_validity_mask.iter_bools(|iter| { | ||
| Mask::from_iter(iter.map(|valid| !valid)) // invert: valid->invalid, invalid->valid | ||
| }); | ||
|
|
||
| let masked_fields: Vec<ArrayRef> = self | ||
| .fields() | ||
| .iter() | ||
| .map(|field| { | ||
| // Use the mask function to apply null positions to each field. | ||
| mask(field.as_ref(), &null_mask) | ||
| }) | ||
| .collect::<VortexResult<Vec<_>>>()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This inverted logic makes me think more that we should just move this to the new world now instead of merging this and then getting rid of it immediately. @gatesn Any thoughts about this?
For solving : #3859
Add methods to push struct-level validity into child fields:
The functionality propagates null information from struct level down to individual fields, with options to preserve or remove the struct-level validity.
Includes comprehensive tests covering all scenarios: