Skip to content

Conversation

@nevi-me
Copy link
Contributor

@nevi-me nevi-me commented Jan 4, 2021

This is blocking my work on the nested parquet list writer. I had left out list logical equality due to the M:N nature of lists, which requires iterating over the parent list to create the child null buffer/bitmap.

Impact on benchmarks:

equal_512               time:   [78.060 ns 78.376 ns 78.748 ns]                      
                        change: [+3.0483% +4.2774% +5.5657%] (p = 0.00 < 0.05)
                        Performance has regressed.

equal_nulls_512         time:   [1.0744 us 1.0753 us 1.0763 us]                             
                        change: [-30.741% -29.902% -29.104%] (p = 0.00 < 0.05)
                        Performance has improved.

equal_string_512        time:   [187.99 ns 188.86 ns 189.80 ns]                             
                        change: [-60.069% -59.421% -58.741%] (p = 0.00 < 0.05)
                        Performance has improved.

equal_string_nulls_512  time:   [3.5054 us 3.5101 us 3.5149 us]                                    
                        change: [-6.8842% -6.5890% -6.2796%] (p = 0.00 < 0.05)
                        Performance has improved.

@github-actions
Copy link

github-actions bot commented Jan 4, 2021

@nevi-me
Copy link
Contributor Author

nevi-me commented Jan 4, 2021

@jorgecarleitao @alamb this is in substance ready for review,, I'd like some feedback on the approach if you get the time.

There's a few TODOs that I left for myself, which I'll address in the coming hours.

@nevi-me
Copy link
Contributor Author

nevi-me commented Jan 4, 2021

I saw the clippy warning, I'll fix it

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a first pass a this. Great work, and I can see that this is non-trivial due to having to & the bitmaps in nested types.

I generally agree with the approach, left some ideas.

.zip(rhs.child_data())
.all(|(lhs_values, rhs_values)| {
// merge the null data
let lhs_merged_nulls = match (lhs_nulls, lhs_values.null_buffer()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorgecarleitao I thought you were referring to this part of the code. I removed the one in mod.rs because I found that it was computing a duplicate when dealing with just primitive arrays.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the use of temp_lhs and temp_rhs here is to avoid the lhs_nulls.cloned() and rhs_nulls.cloned() calls below.

/// one on the `ArrayData`.
pub(super) fn child_logical_null_buffer(
parent_data: &ArrayData,
logical_null_buffer: Option<Buffer>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb @jorgecarleitao I wanted to make this Option<&Buffer> to avoid cloning, but because I create a Bitmap for parent_bitmap and self_null_bitmap , I have to end up cloning the &Buffer. So it's extra work to change the signature, and probably doesn't yield any benefit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if you changed

    let parent_bitmap = logical_null_buffer.map(Bitmap::from).unwrap_or_else(|| {

to

    let parent_bitmap = logical_null_buffer.cloned().map(Bitmap::from).unwrap_or_else(|| {

Then the signature could take an Option<&Buffer> and the code is cleaner (fewer calls to .cloned() outside this function).

But I don't think it has any runtime effect

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like your approach, better to clone in one place, than various.

@nevi-me
Copy link
Contributor Author

nevi-me commented Jan 5, 2021

Thanks for the review @jorgecarleitao. I've addressed your queries and comments, and cleaned up the TODOs

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I don't understand this entire PR overall I think the code looks good enough to merge to me.

My only concern I have is the disabling of the test in rust/datafusion/tests/sql.rs as that seems like a regression in functionality.

}

#[tokio::test]
#[ignore = "Test ignored, will be enabled as part of the nested Parquet reader"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happened to this test? It looks like it used to pass and now it doesn't?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I ran this test locally, it fails with a seemingly non-sensical error

failures:

---- parquet_list_columns stdout ----
thread 'parquet_list_columns' panicked at 'assertion failed: `(left == right)`
  left: `PrimitiveArray<Int64>
[
  null,
  1,
]`,
 right: `PrimitiveArray<Int64>
[
  null,
  1,
]`', datafusion/tests/sql.rs:204:5

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same test failed at some point on the parquet list-writer branch. I'm not confident that we were reconstructing list arrays correctly from parquet data.
I'm opting to disable it temporarily, then re-enable it in #8927, as I'd otherwise be duplicating my effort here.

})
} else {
// get a ref of the null buffer bytes, to use in testing for nullness
let lhs_null_bytes = lhs_nulls.as_ref().unwrap().as_slice();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for lhs_nulls == Some(..) but rhs_nulls == None (and visa versa?) Given they are optional arguments I wasn't sure if they would always both be either None or Some

let rhs_offsets = rhs.buffer::<T>(0);

// There is an edge-case where a n-length list that has 0 children, results in panics.
// For example; an array with offsets [0, 0, 0, 0, 0] has 4 slots, but will have
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably am mis understanding but [0, 0, 0, 0, 0] has 5 entries but this comment says "4 slots"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, that's the offsets for the 4 slots. A list's offsets are always list_length + 1, as they point to the range of values. [0, 2, 3] has 2 slots, with slot 1 being [1, 2], and slot 2 being [3].

// however, one is more likely to slice into a list array and get a region that has 0
// child values.
// The test that triggered this behaviour had [4, 4] as a slice of 1 value slot.
let lhs_child_length = lhs_offsets.get(len).unwrap().to_usize().unwrap()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand the need for this check given that count_nulls seems to handle a buffer of None by returning zero.

When this code is commented out, however, I see the panic of

---- array::equal::tests::test_list_offsets stdout ----
thread 'array::equal::tests::test_list_offsets' panicked at 'assertion failed: ceil(offset + len, 8) <= buffer.len() * 8', arrow/src/util/bit_chunk_iterator.rs:33:9

So given that it is covered by tests, 👍

.zip(rhs.child_data())
.all(|(lhs_values, rhs_values)| {
// merge the null data
let lhs_merged_nulls = match (lhs_nulls, lhs_values.null_buffer()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the use of temp_lhs and temp_rhs here is to avoid the lhs_nulls.cloned() and rhs_nulls.cloned() calls below.

/// one on the `ArrayData`.
pub(super) fn child_logical_null_buffer(
parent_data: &ArrayData,
logical_null_buffer: Option<Buffer>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if you changed

    let parent_bitmap = logical_null_buffer.map(Bitmap::from).unwrap_or_else(|| {

to

    let parent_bitmap = logical_null_buffer.cloned().map(Bitmap::from).unwrap_or_else(|| {

Then the signature could take an Option<&Buffer> and the code is cleaner (fewer calls to .cloned() outside this function).

But I don't think it has any runtime effect

// This might be a valid case during integration testing, where we read Arrow arrays
// from IPC data, which has padding.
//
// We first perform a bitwise comparison, and if there is an error, we revert to a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for comments.

}
}

// Calculate a list child's logical bitmap/buffer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand how lists work in arrow, but I will take your word for it that it does the right thing and that the tests are accurate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm comfortable that I've captured the correct semantics of logical equality for lists; that said, lists have been a thorn on my side for some time now :(

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM also. And a good speed up also!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants