-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-11125: [Rust] Logical equality for list arrays #9093
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@jorgecarleitao @alamb this is in substance ready for review,, I'd like some feedback on the approach if you get the time. There's a few TODOs that I left for myself, which I'll address in the coming hours. |
|
I saw the clippy warning, I'll fix it |
jorgecarleitao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a first pass a this. Great work, and I can see that this is non-trivial due to having to & the bitmaps in nested types.
I generally agree with the approach, left some ideas.
| .zip(rhs.child_data()) | ||
| .all(|(lhs_values, rhs_values)| { | ||
| // merge the null data | ||
| let lhs_merged_nulls = match (lhs_nulls, lhs_values.null_buffer()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorgecarleitao I thought you were referring to this part of the code. I removed the one in mod.rs because I found that it was computing a duplicate when dealing with just primitive arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the use of temp_lhs and temp_rhs here is to avoid the lhs_nulls.cloned() and rhs_nulls.cloned() calls below.
rust/arrow/src/array/equal/utils.rs
Outdated
| /// one on the `ArrayData`. | ||
| pub(super) fn child_logical_null_buffer( | ||
| parent_data: &ArrayData, | ||
| logical_null_buffer: Option<Buffer>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb @jorgecarleitao I wanted to make this Option<&Buffer> to avoid cloning, but because I create a Bitmap for parent_bitmap and self_null_bitmap , I have to end up cloning the &Buffer. So it's extra work to change the signature, and probably doesn't yield any benefit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if you changed
let parent_bitmap = logical_null_buffer.map(Bitmap::from).unwrap_or_else(|| {
to
let parent_bitmap = logical_null_buffer.cloned().map(Bitmap::from).unwrap_or_else(|| {
Then the signature could take an Option<&Buffer> and the code is cleaner (fewer calls to .cloned() outside this function).
But I don't think it has any runtime effect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like your approach, better to clone in one place, than various.
|
Thanks for the review @jorgecarleitao. I've addressed your queries and comments, and cleaned up the TODOs |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I don't understand this entire PR overall I think the code looks good enough to merge to me.
My only concern I have is the disabling of the test in rust/datafusion/tests/sql.rs as that seems like a regression in functionality.
| } | ||
|
|
||
| #[tokio::test] | ||
| #[ignore = "Test ignored, will be enabled as part of the nested Parquet reader"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happened to this test? It looks like it used to pass and now it doesn't?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I ran this test locally, it fails with a seemingly non-sensical error
failures:
---- parquet_list_columns stdout ----
thread 'parquet_list_columns' panicked at 'assertion failed: `(left == right)`
left: `PrimitiveArray<Int64>
[
null,
1,
]`,
right: `PrimitiveArray<Int64>
[
null,
1,
]`', datafusion/tests/sql.rs:204:5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same test failed at some point on the parquet list-writer branch. I'm not confident that we were reconstructing list arrays correctly from parquet data.
I'm opting to disable it temporarily, then re-enable it in #8927, as I'd otherwise be duplicating my effort here.
| }) | ||
| } else { | ||
| // get a ref of the null buffer bytes, to use in testing for nullness | ||
| let lhs_null_bytes = lhs_nulls.as_ref().unwrap().as_slice(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible for lhs_nulls == Some(..) but rhs_nulls == None (and visa versa?) Given they are optional arguments I wasn't sure if they would always both be either None or Some
| let rhs_offsets = rhs.buffer::<T>(0); | ||
|
|
||
| // There is an edge-case where a n-length list that has 0 children, results in panics. | ||
| // For example; an array with offsets [0, 0, 0, 0, 0] has 4 slots, but will have |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I probably am mis understanding but [0, 0, 0, 0, 0] has 5 entries but this comment says "4 slots"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, that's the offsets for the 4 slots. A list's offsets are always list_length + 1, as they point to the range of values. [0, 2, 3] has 2 slots, with slot 1 being [1, 2], and slot 2 being [3].
| // however, one is more likely to slice into a list array and get a region that has 0 | ||
| // child values. | ||
| // The test that triggered this behaviour had [4, 4] as a slice of 1 value slot. | ||
| let lhs_child_length = lhs_offsets.get(len).unwrap().to_usize().unwrap() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully understand the need for this check given that count_nulls seems to handle a buffer of None by returning zero.
When this code is commented out, however, I see the panic of
---- array::equal::tests::test_list_offsets stdout ----
thread 'array::equal::tests::test_list_offsets' panicked at 'assertion failed: ceil(offset + len, 8) <= buffer.len() * 8', arrow/src/util/bit_chunk_iterator.rs:33:9
So given that it is covered by tests, 👍
| .zip(rhs.child_data()) | ||
| .all(|(lhs_values, rhs_values)| { | ||
| // merge the null data | ||
| let lhs_merged_nulls = match (lhs_nulls, lhs_values.null_buffer()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the use of temp_lhs and temp_rhs here is to avoid the lhs_nulls.cloned() and rhs_nulls.cloned() calls below.
rust/arrow/src/array/equal/utils.rs
Outdated
| /// one on the `ArrayData`. | ||
| pub(super) fn child_logical_null_buffer( | ||
| parent_data: &ArrayData, | ||
| logical_null_buffer: Option<Buffer>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if you changed
let parent_bitmap = logical_null_buffer.map(Bitmap::from).unwrap_or_else(|| {
to
let parent_bitmap = logical_null_buffer.cloned().map(Bitmap::from).unwrap_or_else(|| {
Then the signature could take an Option<&Buffer> and the code is cleaner (fewer calls to .cloned() outside this function).
But I don't think it has any runtime effect
| // This might be a valid case during integration testing, where we read Arrow arrays | ||
| // from IPC data, which has padding. | ||
| // | ||
| // We first perform a bitwise comparison, and if there is an error, we revert to a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 for comments.
| } | ||
| } | ||
|
|
||
| // Calculate a list child's logical bitmap/buffer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully understand how lists work in arrow, but I will take your word for it that it does the right thing and that the tests are accurate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm comfortable that I've captured the correct semantics of logical equality for lists; that said, lists have been a thorn on my side for some time now :(
jorgecarleitao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM also. And a good speed up also!
This is blocking my work on the nested parquet list writer. I had left out list logical equality due to the M:N nature of lists, which requires iterating over the parent list to create the child null buffer/bitmap.
Impact on benchmarks: