ARROW-11125: [Rust] Logical equality for list arrays #9093

nevi-me · 2021-01-04T18:10:45Z

This is blocking my work on the nested parquet list writer. I had left out list logical equality due to the M:N nature of lists, which requires iterating over the parent list to create the child null buffer/bitmap.

Impact on benchmarks:

equal_512               time:   [78.060 ns 78.376 ns 78.748 ns]                      
                        change: [+3.0483% +4.2774% +5.5657%] (p = 0.00 < 0.05)
                        Performance has regressed.

equal_nulls_512         time:   [1.0744 us 1.0753 us 1.0763 us]                             
                        change: [-30.741% -29.902% -29.104%] (p = 0.00 < 0.05)
                        Performance has improved.

equal_string_512        time:   [187.99 ns 188.86 ns 189.80 ns]                             
                        change: [-60.069% -59.421% -58.741%] (p = 0.00 < 0.05)
                        Performance has improved.

equal_string_nulls_512  time:   [3.5054 us 3.5101 us 3.5149 us]                                    
                        change: [-6.8842% -6.5890% -6.2796%] (p = 0.00 < 0.05)
                        Performance has improved.

github-actions · 2021-01-04T18:11:15Z

https://issues.apache.org/jira/browse/ARROW-11125

nevi-me · 2021-01-04T18:14:20Z

@jorgecarleitao @alamb this is in substance ready for review. I'd like some feedback on the approach if you get the time.

There are a few TODOs that I left for myself, which I'll address in the coming hours.

nevi-me · 2021-01-04T20:01:42Z

I saw the clippy warning, I'll fix it

jorgecarleitao

I did a first pass at this. Great work, and I can see that this is non-trivial due to having to AND (&) the bitmaps in nested types.

I generally agree with the approach, left some ideas.
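To illustrate what ANDing the bitmaps means for a nested type, here is a minimal sketch using plain byte slices rather than the arrow `Buffer`/`Bitmap` types. The function name `merge_validity` is hypothetical, and the sketch assumes both bitmaps cover the same slots with no offset.

```rust
/// Combine a parent's validity bitmap with a child's (bit set = valid, LSB first).
/// `None` stands for "no bitmap", i.e. every slot is valid.
/// Hypothetical helper for illustration; assumes both bitmaps have equal length.
fn merge_validity(parent: Option<&[u8]>, child: Option<&[u8]>) -> Option<Vec<u8>> {
    match (parent, child) {
        // Neither side has nulls, so the merged result has none either.
        (None, None) => None,
        // Only one side has nulls: its bitmap is the merged bitmap.
        (Some(p), None) => Some(p.to_vec()),
        (None, Some(c)) => Some(c.to_vec()),
        // A child slot is logically valid only if both bits are set.
        (Some(p), Some(c)) => Some(p.iter().zip(c.iter()).map(|(a, b)| a & b).collect()),
    }
}
```

A child value under a null parent slot is treated as null here even if the child's own bitmap marks it valid, which is the behaviour logical equality needs.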

rust/arrow/src/array/data.rs

rust/arrow/src/array/equal/structure.rs

rust/arrow/src/array/data.rs

rust/arrow/src/array/equal/mod.rs

rust/arrow/src/array/equal/primitive.rs

rust/arrow/src/array/equal/structure.rs

rust/arrow/src/array/data.rs

nevi-me · 2021-01-05T07:39:14Z

rust/arrow/src/array/equal/structure.rs

        .zip(rhs.child_data())
        .all(|(lhs_values, rhs_values)| {
            // merge the null data
-            let lhs_merged_nulls = match (lhs_nulls, lhs_values.null_buffer()) {


@jorgecarleitao I thought you were referring to this part of the code. I removed the one in mod.rs because I found that it was computing a duplicate when dealing with just primitive arrays.

I think the use of temp_lhs and temp_rhs here is to avoid the lhs_nulls.cloned() and rhs_nulls.cloned() calls below.

nevi-me · 2021-01-05T07:41:17Z

rust/arrow/src/array/equal/utils.rs

+/// one on the `ArrayData`.
+pub(super) fn child_logical_null_buffer(
+    parent_data: &ArrayData,
+    logical_null_buffer: Option<Buffer>,


@alamb @jorgecarleitao I wanted to make this Option<&Buffer> to avoid cloning, but because I create a Bitmap for parent_bitmap and self_null_bitmap, I end up having to clone the &Buffer anyway. So changing the signature is extra work and probably doesn't yield any benefit.

I think if you changed

let parent_bitmap = logical_null_buffer.map(Bitmap::from).unwrap_or_else(|| {

to

let parent_bitmap = logical_null_buffer.cloned().map(Bitmap::from).unwrap_or_else(|| {

Then the signature could take an Option<&Buffer> and the code is cleaner (fewer calls to .cloned() outside this function).

But I don't think it has any runtime effect

I like your approach; better to clone in one place than in several.
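The signature change discussed above can be sketched as follows. This is only an illustration: `Buffer` here is a stand-in reference-counted type, not the real arrow `Buffer`, and the all-valid fallback (plus the `num_bytes` parameter) is an assumption, since the body of `unwrap_or_else` is not shown in the thread.

```rust
use std::sync::Arc;

/// Stand-in for a reference-counted byte buffer; NOT the real arrow `Buffer`.
#[derive(Clone, Debug, PartialEq)]
struct Buffer(Arc<Vec<u8>>);

/// Borrow the caller's buffer and clone it once, inside the function.
/// The all-valid fallback below is a hypothetical default for illustration.
fn parent_bitmap(logical_null_buffer: Option<&Buffer>, num_bytes: usize) -> Buffer {
    logical_null_buffer
        .cloned() // Option<&Buffer> -> Option<Buffer>; only bumps the ref count here
        .unwrap_or_else(|| Buffer(Arc::new(vec![0xFF; num_bytes])))
}
```

With a reference-counted buffer the clone shares the underlying allocation, so moving the `.cloned()` call inside the function tidies up call sites without copying any bytes.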

nevi-me · 2021-01-05T07:42:06Z

Thanks for the review @jorgecarleitao. I've addressed your queries and comments, and cleaned up the TODOs

alamb

While I don't understand this entire PR, overall I think the code looks good enough to merge.

My only concern is the disabling of the test in rust/datafusion/tests/sql.rs, as that seems like a regression in functionality.

alamb · 2021-01-05T11:51:04Z

rust/datafusion/tests/sql.rs

 }

 #[tokio::test]
+#[ignore = "Test ignored, will be enabled as part of the nested Parquet reader"]


What happened to this test? It looks like it used to pass and now it doesn't?

When I ran this test locally, it failed with a seemingly nonsensical error:

failures:

---- parquet_list_columns stdout ----
thread 'parquet_list_columns' panicked at 'assertion failed: `(left == right)`
  left: `PrimitiveArray<Int64> [ null, 1, ]`,
 right: `PrimitiveArray<Int64> [ null, 1, ]`', datafusion/tests/sql.rs:204:5

The same test failed at some point on the parquet list-writer branch. I'm not confident that we were reconstructing list arrays correctly from parquet data.
I'm opting to disable it temporarily, then re-enable it in #8927, as I'd otherwise be duplicating my effort here.

alamb · 2021-01-05T11:54:48Z

rust/arrow/src/array/equal/boolean.rs

+        })
+    } else {
+        // get a ref of the null buffer bytes, to use in testing for nullness
+        let lhs_null_bytes = lhs_nulls.as_ref().unwrap().as_slice();


Is it possible for lhs_nulls == Some(..) but rhs_nulls == None (and vice versa)? Given that they are optional arguments, I wasn't sure whether they would always both be None or both be Some.
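The thread does not record an answer to this question. As a sketch of how a mixed Some/None pair could be handled safely, one can rely on the Arrow convention that an absent validity bitmap means every slot is valid. The helper names below are hypothetical and the code uses plain slices rather than the arrow types.

```rust
/// Returns true if slot `i` is valid (bit set = valid, LSB first).
/// A missing bitmap (`None`) means all slots are valid.
fn is_valid(nulls: Option<&[u8]>, i: usize) -> bool {
    match nulls {
        None => true,
        Some(bytes) => (bytes[i / 8] & (1 << (i % 8))) != 0,
    }
}

/// Compare two i64 slices under possibly-absent validity bitmaps.
/// Hypothetical helper; assumes each bitmap covers at least `len` bits.
fn equal_with_nulls(
    lhs: &[i64],
    lhs_nulls: Option<&[u8]>,
    rhs: &[i64],
    rhs_nulls: Option<&[u8]>,
) -> bool {
    lhs.len() == rhs.len()
        && (0..lhs.len()).all(|i| {
            let (l, r) = (is_valid(lhs_nulls, i), is_valid(rhs_nulls, i));
            // Validity must match; if both are null the stored values are ignored.
            l == r && (!l || lhs[i] == rhs[i])
        })
}
```

Under this convention an array with no bitmap compares equal to one whose bitmap has every bit set, so neither side needs to be unwrapped unconditionally.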

alamb · 2021-01-05T11:57:45Z

rust/arrow/src/array/equal/list.rs

    let rhs_offsets = rhs.buffer::<T>(0);

+    // There is an edge-case where a n-length list that has 0 children, results in panics.
+    // For example; an array with offsets [0, 0, 0, 0, 0] has 4 slots, but will have


I am probably misunderstanding, but [0, 0, 0, 0, 0] has 5 entries, whereas this comment says "4 slots".

Ah yes, those are the offsets for the 4 slots. A list always has list_length + 1 offsets, because each adjacent pair of offsets bounds one slot's range of values. [0, 2, 3] has 2 slots, with slot 1 being [1, 2], and slot 2 being [3].
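The relationship between offsets and slots can be sketched in a few lines. This is a simplified illustration over plain slices, assuming the child values are [1, 2, 3] for the [0, 2, 3] example; `list_slots` is a hypothetical name, not an arrow function.

```rust
/// Split `values` into list slots using Arrow-style offsets.
/// There is one more offset than there are slots: slot `i` spans
/// `values[offsets[i]..offsets[i + 1]]`.
fn list_slots(offsets: &[usize], values: &[i32]) -> Vec<Vec<i32>> {
    offsets
        .windows(2) // each adjacent pair of offsets bounds one slot
        .map(|w| values[w[0]..w[1]].to_vec())
        .collect()
}
```

With offsets [0, 0, 0, 0, 0] and no child values, this yields 4 empty slots, which is the edge case the code comment describes.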

alamb · 2021-01-05T12:04:11Z

rust/arrow/src/array/equal/list.rs

+    // however, one is more likely to slice into a list array and get a region that has 0
+    // child values.
+    // The test that triggered this behaviour had [4, 4] as a slice of 1 value slot.
+    let lhs_child_length = lhs_offsets.get(len).unwrap().to_usize().unwrap()


I don't fully understand the need for this check given that count_nulls seems to handle a buffer of None by returning zero.

When this code is commented out, however, I see this panic:

---- array::equal::tests::test_list_offsets stdout ----
thread 'array::equal::tests::test_list_offsets' panicked at 'assertion failed: ceil(offset + len, 8) <= buffer.len() * 8', arrow/src/util/bit_chunk_iterator.rs:33:9

So given that it is covered by tests, 👍
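For readers unfamiliar with the edge case, the idea of checking the child length before touching the child's null buffer can be sketched like this. Both helpers are hypothetical and simplified; the actual expression in list.rs is truncated in the diff above, so this is an assumption about its intent, not a copy of it.

```rust
/// Number of child values covered by `len` list slots starting at slot `start`.
fn child_len(offsets: &[i32], start: usize, len: usize) -> usize {
    (offsets[start + len] - offsets[start]) as usize
}

/// Hypothetical guard: when neither side covers any child values, there is
/// nothing to compare, so the child null buffers need not be inspected at all.
fn both_children_empty(
    lhs_offsets: &[i32],
    lhs_start: usize,
    rhs_offsets: &[i32],
    rhs_start: usize,
    len: usize,
) -> bool {
    child_len(lhs_offsets, lhs_start, len) == 0 && child_len(rhs_offsets, rhs_start, len) == 0
}
```

For the [4, 4] slice mentioned in the code comment, `child_len(&[4, 4], 0, 1)` is 0 even though the slice has one list slot.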

alamb · 2021-01-05T12:07:40Z

rust/arrow/src/array/equal/structure.rs

        .zip(rhs.child_data())
        .all(|(lhs_values, rhs_values)| {
            // merge the null data
-            let lhs_merged_nulls = match (lhs_nulls, lhs_values.null_buffer()) {


I think the use of temp_lhs and temp_rhs here is to avoid the lhs_nulls.cloned() and rhs_nulls.cloned() calls below.

alamb · 2021-01-05T12:13:44Z

rust/arrow/src/array/equal/utils.rs

+/// one on the `ArrayData`.
+pub(super) fn child_logical_null_buffer(
+    parent_data: &ArrayData,
+    logical_null_buffer: Option<Buffer>,


I think if you changed

let parent_bitmap = logical_null_buffer.map(Bitmap::from).unwrap_or_else(|| {

to

let parent_bitmap = logical_null_buffer.cloned().map(Bitmap::from).unwrap_or_else(|| {

Then the signature could take an Option<&Buffer> and the code is cleaner (fewer calls to .cloned() outside this function).

But I don't think it has any runtime effect

alamb · 2021-01-05T12:16:07Z

rust/arrow/src/array/equal/utils.rs

+            // This might be a valid case during integration testing, where we read Arrow arrays
+            // from IPC data, which has padding.
+            //
+            // We first perform a bitwise comparison, and if there is an error, we revert to a


👍 for comments.
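The quoted code comment is cut off, so the exact fallback is not visible here. One plausible reading, offered only as a sketch, is a fast whole-buffer comparison followed by a bit-by-bit comparison over the logical length, so that differences confined to padding are ignored. The helper names are hypothetical.

```rust
/// Read bit `i` of a bitmap (LSB first within each byte).
fn get_bit(bytes: &[u8], i: usize) -> bool {
    (bytes[i / 8] & (1 << (i % 8))) != 0
}

/// Compare the first `len` bits of two bitmaps, ignoring trailing padding.
/// Assumes both buffers hold at least `len` bits.
fn bits_equal(lhs: &[u8], rhs: &[u8], len: usize) -> bool {
    // Fast path: identical bytes imply identical bits.
    if lhs == rhs {
        return true;
    }
    // Slow path: the buffers differ in length or in some bits, so compare
    // only the `len` bits that are logically part of the array.
    (0..len).all(|i| get_bit(lhs, i) == get_bit(rhs, i))
}
```

Two buffers that differ only in padding bytes or unused high bits then compare equal, which is the situation the comment attributes to IPC data.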

alamb · 2021-01-05T12:19:37Z

rust/arrow/src/array/equal/utils.rs

+    }
+}
+
+// Calculate a list child's logical bitmap/buffer


I don't fully understand how lists work in arrow, but I will take your word for it that it does the right thing and that the tests are accurate.

I'm comfortable that I've captured the correct semantics of logical equality for lists; that said, lists have been a thorn in my side for some time now :(
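As a rough illustration of what calculating a list child's logical bitmap involves, the sketch below walks the parent's slots and marks each child value valid only if both the child value and its parent slot are valid. It uses `bool` slices instead of packed bitmaps for readability, and `child_logical_validity` is a hypothetical name rather than the function in utils.rs.

```rust
/// Derive a per-child-value validity mask for a list array.
/// `parent_valid[i]` is the validity of list slot `i`;
/// `child_valid[j]` is the validity of child value `j`.
/// Child values not covered by any slot are left as `false` in this sketch.
fn child_logical_validity(
    offsets: &[usize],
    parent_valid: &[bool],
    child_valid: &[bool],
) -> Vec<bool> {
    let mut out = vec![false; child_valid.len()];
    // One parent slot maps to a variable-length range of child values,
    // so the parent's validity has to be expanded range by range.
    for (slot, w) in offsets.windows(2).enumerate() {
        for j in w[0]..w[1] {
            out[j] = parent_valid[slot] && child_valid[j];
        }
    }
    out
}
```

Because each slot covers a different number of child values, the parent bitmap cannot simply be ANDed with the child bitmap as in the struct case; it has to be expanded through the offsets first.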

jorgecarleitao

LGTM as well. And a good speed-up too!

ARROW-11125: [Rust] Logical equality for list arrays

10be8bb

github-actions bot added Component: Rust - DataFusion Component: Rust labels Jan 4, 2021

update renamed functions

4c4ac46

nevi-me mentioned this pull request Jan 4, 2021

ARROW-10766: [Rust] [Parquet] Nested List IO [WIP] #8927

Closed

ignore clippy warning

a2941dd

jorgecarleitao reviewed Jan 5, 2021

nevi-me added 2 commits January 5, 2021 08:43

fix integration test failures

66bb98e

address review comments

93017a2

nevi-me commented Jan 5, 2021

alamb reviewed Jan 5, 2021

take borrowed reference of buffer

01b28b5

alamb approved these changes Jan 5, 2021

jorgecarleitao approved these changes Jan 5, 2021

nevi-me closed this in 0eae886 Jan 5, 2021

asfimport mentioned this pull request Jan 5, 2021

[Rust] Implement logical equality for list arrays #27036

Closed
