-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-10684: [Rust] Inherit struct nulls in child null equality #8739
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This change aligns with the spec, in that a struct array's nulls should take precedence over a child array's nulls. We carry across the parent's null buffer, and merge it with the child's.
|
@jorgecarleitao @alamb I've only implemented this for The change mainly keeps track of the null buffer of Where we need null counts for comparisons, we have to recalculate the counts from the separate null buffer, as they might have been combined with the parent's null buffer (buffer-ception? let me know if my sentence is unclear). I'll work on the below so we have sufficient test coverage:
|
| equal_values(lhs, rhs, lhs_nulls, rhs_nulls, lhs_start, rhs_start, len) | ||
| } else { | ||
| // with nulls, we need to compare item by item whenever it is not null | ||
| (0..len).all(|i| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be easier if we took into account the nulls on this statement here:
let lhs_is_null = lhs.is_null(lhs_pos);
let rhs_is_null = rhs.is_null(rhs_pos);by replacing it with
let lhs_is_null = lhs.is_null(lhs_pos) && lhs.nullbitmap.isnull(lhs_pos);
let rhs_is_null = rhs.is_null(rhs_pos) && rhs.nullbitmap.isnull(rhs_pos);(still keeping the re-count in place, to avoid wrongfully optimizing like this PR fixes)
The advantage is that for all types except struct, this PR merges two equal bitmaps twice (all types except struct now pass null_buffer() to equal_range. E.g. in list:
lhs_values,
rhs_values,
lhs_values.null_buffer(),
rhs_values.null_buffer(),
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An alternate thing to do is to pass in None for all types except struct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb if we pass None where we don't have a struct, wouldn't we be describing a different position, which might be incorrect?
A None for the Option<&Buffer> indicates that there are no nulls, as arrays are permitted to omit the null buffer/bitmap altogether if there are no nulls. So if we pass None for lhs_nulls, we shouldn't have to look at lhs, as the presumption is that we've already extracted the null buffer, and there wasn't any.
Perhaps I don't understand you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nevi-me I think I was confused -- I was trying to offer some way to avoid @jorgecarleitao 's concern of double merge concerns but now that I re-read my comments I am not sure that they make any sense. Sorry for the confusion
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @nevi-me -- I think this approach looks good. I had some potential optimization suggestion and I think @jorgecarleitao 's is worth considering
| equal_values(lhs, rhs, lhs_nulls, rhs_nulls, lhs_start, rhs_start, len) | ||
| } else { | ||
| // with nulls, we need to compare item by item whenever it is not null | ||
| (0..len).all(|i| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An alternate thing to do is to pass in None for all types except struct
| if lhs_null_count == 0 && rhs_null_count == 0 { | ||
| equal_values(lhs, rhs, lhs_nulls, rhs_nulls, lhs_start, rhs_start, len) | ||
| } else { | ||
| // get a ref of the null buffer bytes, to use in testing for nullness |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorgecarleitao does this address your comment? I was incorrectly using the array's null buffer, instead of checking nullness from the merged one. If it doesn't, then it means I didn't understand what you were suggesting.
|
@alamb @jorgecarleitao this is blocking some work that I'm doing on parquet, so I'd like to limit the scope of the PR to just nested structs, as I want to submit a PR that correctly reads nested arrow structs in parquet. Are you fine with reviewing the PR as is just for nested structs? I believe I've responded to or addressed comments raised so far, but I'll address more where they come, or where i haven't covered everything as yet. |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this code looks good, well tested and commented, and we can always improve upon it in subsequent PRs. I think it is good to merge. FYI @jorgecarleitao
| equal_values(lhs, rhs, lhs_nulls, rhs_nulls, lhs_start, rhs_start, len) | ||
| } else { | ||
| // with nulls, we need to compare item by item whenever it is not null | ||
| (0..len).all(|i| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nevi-me I think I was confused -- I was trying to offer some way to avoid @jorgecarleitao 's concern of double merge concerns but now that I re-read my comments I am not sure that they make any sense. Sorry for the confusion
|
The CI failures seem unrelated For example: https://github.com/apache/arrow/pull/8739/checks?check_run_id=1465465859 I am going to retrigger the tests on github and then plan to merge this PR if they pass |
@alamb something's filling up our space on CI, so the tests might still fail. I tried looking at it in the morning, but ended up parking it for later so I could do other work. |
|
@jorgecarleitao may you please take a look at this again when you have a chance. I opted not to address lists for now, as their semantics might be fine. I'll look into other types as part of ARROW-10766, which I'll be working on in the evenings in the coming week. |
jorgecarleitao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went through the changes and they make sense.
I am still a bit unsure about why we need to perform this in Struct but not in List: the null bitmaps serve the same purpose in both cases and thus I would not expect the implementations to differ in this respect, but the tests demonstrate an improvement, so ![]()
Thanks a lot for taking this, @nevi-me
This is the change to enable writing (and reading) nested structs correctly, ala `<struct<struct<primitive>>>`. The change introduces a busy `LevelInfo` struct, which is mostly useful for tracking changes across nested lists. I opted to get structs right first, then refocus on lists, so our list IO support is still broken for now. I also picked up an issue with our dictionary write support, in that some of the indexing didn't look right. i'm going to work on that in a separate commit, as I squashed all my work on this PR into one commit. There's a nested struct test that's failing in this PR, but that's because I need #8739 in order for that test to pass. I haven't focused on making this performant as yet, so there might be plenty of room to improve performance. I mainly focused on getting the nesting arithmetic correct. I think we should be able to paralleli(s|z)e the logic per array, but alas. In future PR's, I'll work on list support. Closes #8792 from nevi-me/ARROW-9728 Authored-by: Neville Dipale <nevilledips@gmail.com> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
This change aligns with the spec, in that a struct array's nulls should take precedence over a child array's nulls.
We carry across the parent's null buffer, and merge it with the child's.