ARROW-9523 [Rust] Improve filter kernel performance #7798
Conversation
update from upstream repo
merge latest changes from upstream
Merge latest changes from apache/arrow
…ing dictionary arrays
I refer you all to the work that we've recently done in C++ where we use popcount on 64 bits at a time to efficiently compute the size of the filter output as well as to quickly compute the filtered output. It might be worth replicating the "BitBlockCounter" concept in Rust.
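As a rough illustration of that idea (not the actual C++ BitBlockCounter), here is a minimal Rust sketch, assuming the filter bitmap has already been packed into u64 words:

// Popcount on 64 bits at a time: the number of selected values (and hence
// the exact length of the filter output) can be computed cheaply from the
// filter bitmap before any data is copied.
fn filtered_len(filter_words: &[u64]) -> usize {
    filter_words
        .iter()
        .map(|word| word.count_ones() as usize)
        .sum()
}

fn main() {
    // Two 64-bit words: the first selects 3 values, the second selects none.
    let filter_words = [0b1011u64, 0u64];
    assert_eq!(filtered_len(&filter_words), 3);
}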
@yordan-pavlov This is exciting! I will start reviewing this later today.
Just for reference, I think the relevant C++ filter implementation (which wesm is referring to) is in the PrimitiveFilterImpl class here: https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/vector_selection.cc#L558
jorgecarleitao left a comment
I went through the code. This is beautiful, great work! I haven't read this kind of optimization in a long time!
I have two minor comments:
- I think that there is an .unwrap() that hides some "not implemented" Err. I left a comment around where we do it. The rationale here is that we could try to use ? instead, so that the error is passed along in the stack trace.
- Do you know what the speedup of introducing the unsafe bits is vs the speedup of not using unsafe? I am just curious as to whether those unsafe blocks are a fundamental part of the speedup (and are thus necessary). The rationale is that in Rust unsafe often justifies some care, and one hypothesis is that most of the speedup comes from the 3 ideas you outlined.
let filtered_arrays = record_batch
    .columns()
    .iter()
    .map(|a| filter_context.filter(a.as_ref()).unwrap())
Isn't this .unwrap() "hiding" errors related to unsupported data types?
Good spot, I have changed the filter_record_batch method to remove the unwrap() and use .collect::<Result<Vec<ArrayRef>>>()?; instead.
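For reference, a minimal standalone sketch of that collect-into-Result pattern, using an illustrative fallible function rather than the actual arrow types:

// Replacing `.unwrap()` inside a `map` with collecting into `Result`,
// so the first error is propagated via `?` instead of panicking.
fn double_if_even(x: i32) -> Result<i32, String> {
    if x % 2 == 0 {
        Ok(x * 2)
    } else {
        Err(format!("{} is not even", x))
    }
}

fn double_all(values: &[i32]) -> Result<Vec<i32>, String> {
    // collect::<Result<Vec<_>, _>>() stops at the first Err,
    // which ? then returns to the caller.
    let doubled = values
        .iter()
        .map(|v| double_if_even(*v))
        .collect::<Result<Vec<i32>, String>>()?;
    Ok(doubled)
}

fn main() {
    assert_eq!(double_all(&[2, 4]), Ok(vec![4, 8]));
    assert!(double_all(&[2, 3]).is_err());
}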
@jorgecarleitao thanks for the feedback. I did some more profiling specifically around the unsafe parts of the code, comparing the unsafe and safe versions of copy_null_bit, and also benchmarked replacing the unsafe section in the filter u8 low selectivity, filter context u8 low selectivity, filter context u8 w NULLs low selectivity and filter context f32 low selectivity benchmarks. Finally, looking at the C++ implementation inspired me to change the implementation again; the latest results are:
filter u8 low selectivity time: [118.21 us 118.82 us 119.61 us]
filter context u8 low selectivity time: [115.22 us 115.77 us 116.36 us]
filter context u8 w NULLs low selectivity time: [132.49 us 132.78 us 133.10 us]
filter context f32 low selectivity time: [158.81 us 161.58 us 164.61 us]
@andygrove any thoughts?
Sorry @yordan-pavlov but I didn't get a chance to review yet. I hope to get to it soon.
let data_index = i * 64;
let data_len = value_size * 64;
null_bit_setter.copy_null_bits(data_index, 64);
unsafe {
The use of unsafe makes me nervous. If the performance improvement justifies using this, how can we be sure we have adequate testing for this?
@andygrove I have replaced the use of unsafe memcpy with copy_from_slice - this does reduce performance (by about 12%) but it is still much faster than the old implementation; here are the latest benchmark results:
filter u8 low selectivity time: [130.74 us 131.37 us 132.19 us]
filter u8 high selectivity time: [5.2031 us 5.2366 us 5.2764 us]
filter u8 very low selectivity time: [12.353 us 12.542 us 12.759 us]
filter context u8 low selectivity time: [129.54 us 129.88 us 130.30 us]
filter context u8 high selectivity time: [1.7926 us 1.7974 us 1.8046 us]
filter context u8 very low selectivity time: [8.7700 us 8.7987 us 8.8342 us]
filter context u8 w NULLs low selectivity time: [150.36 us 151.09 us 152.01 us]
filter context u8 w NULLs high selectivity time: [2.4173 us 2.4882 us 2.5703 us]
filter context u8 w NULLs very low selectivity time: [158.86 us 160.32 us 162.26 us]
filter context f32 low selectivity time: [123.38 us 124.34 us 126.18 us]
filter context f32 high selectivity time: [1.4836 us 1.4994 us 1.5297 us]
filter context f32 very low selectivity time: [19.422 us 19.653 us 19.932 us]
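For readers unfamiliar with the trade-off, a small self-contained sketch of the two approaches in general terms (illustrative only, not the exact code from this PR): copy_from_slice performs a bounds-checked copy, while the unsafe variant calls ptr::copy_nonoverlapping without those checks.

// Safe, bounds-checked copy: panics if the two slices differ in length.
fn copy_safe(dst: &mut [u8], src: &[u8]) {
    dst.copy_from_slice(src);
}

// Unsafe equivalent: no bounds checks; the caller must guarantee that both
// regions are valid for src.len() bytes and do not overlap.
fn copy_unsafe(dst: &mut [u8], src: &[u8]) {
    assert!(dst.len() >= src.len());
    unsafe {
        std::ptr::copy_nonoverlapping(src.as_ptr(), dst.as_mut_ptr(), src.len());
    }
}

fn main() {
    let src = [1u8, 2, 3, 4];
    let mut a = [0u8; 4];
    let mut b = [0u8; 4];
    copy_safe(&mut a, &src);
    copy_unsafe(&mut b, &src);
    assert_eq!(a, b);
}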
Thanks @yordan-pavlov. We can always consider the unsafe changes in a future PR, but removing them for now makes it much easier to get these changes merged.
let data_index = (i * 64) + j;
null_bit_setter.copy_null_bit(data_index);
// if filter bit == 1: copy data value to temp array
unsafe {
Same comment as above re unsafe
LGTM overall.
LGTM, thank you @yordan-pavlov.
One thing that we keep running into with the kernel implementations is that in many cases they do not take array offsets into account, so we run into problems when we slice arrays and pass them to kernels.
I think we need a test or two around this, but it does not need to be in this PR.
@paddyhoran yes, you are right. I added a couple more tests for sliced arrays and they didn't pass, so seeing that the PR was not yet merged I added a few small changes to make them pass. I looked briefly into implementing support for offset filter arrays but I couldn't figure out how to do it quickly - I might have to take another look at the C++ implementation and submit a separate PR for that.
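For context on the offset issue, a minimal self-contained sketch with a hypothetical stand-in type (not the arrow Array API) showing why a kernel that indexes the underlying buffer directly must account for the slice offset:

// Hypothetical stand-in for an Arrow-style array slice: a view over a shared
// buffer described by an offset and a length.
struct SlicedArray<'a> {
    buffer: &'a [u8],
    offset: usize,
    len: usize,
}

impl<'a> SlicedArray<'a> {
    fn value(&self, i: usize) -> u8 {
        assert!(i < self.len);
        // Kernels that index the underlying buffer directly must add the
        // offset; forgetting it is exactly the bug sliced-array tests catch.
        self.buffer[self.offset + i]
    }
}

fn main() {
    let buffer = [10u8, 20, 30, 40, 50];
    // A slice of length 3 starting at offset 2, i.e. the values 30, 40, 50.
    let sliced = SlicedArray { buffer: &buffer, offset: 2, len: 3 };
    assert_eq!(sliced.value(0), 30);
    assert_eq!(sliced.value(2), 50);
}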
andygrove left a comment
LGTM
This PR makes me think that at some point (when someone gets really motivated), it would be interesting to implement a non-language-dependent benchmark harness for certain operations (such as can be represented using e.g. a standard protobuf-serialized operation/expression) so that we can get apples-to-apples numbers across implementations. It would be interesting to know e.g. how the Rust implementation of filtering compares with the C++ implementation.
I completely agree. I just happen to have a protobuf definition for plans and expressions, as well as serde code for Java and Rust, that I would be happy to contribute. We would need to implement it for C++.
The filter kernel located here https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/filter.rs
currently has the following performance:
filter old u8 low selectivity time: [1.7782 ms 1.7790 ms 1.7801 ms]
filter old u8 high selectivity time: [815.58 us 816.58 us 817.57 us]
filter old u8 w NULLs low selectivity time: [1.8131 ms 1.8231 ms 1.8336 ms]
filter old u8 w NULLs high selectivity time: [817.41 us 820.01 us 823.05 us]
I have been working on a new implementation which is between approximately 17 and 550 times faster, depending mostly on filter selectivity. Here are the benchmark results:
filter u8 low selectivity time: [107.26 us 108.24 us 109.58 us]
filter u8 high selectivity time: [4.7854 us 4.8050 us 4.8276 us]
filter context u8 low selectivity time: [102.59 us 102.93 us 103.38 us]
filter context u8 high selectivity time: [1.4709 us 1.4760 us 1.4823 us]
filter context u8 w NULLs low selectivity time: [130.48 us 131.00 us 131.65 us]
filter context u8 w NULLs high selectivity time: [2.0520 us 2.0818 us 2.1137 us]
filter context f32 low selectivity time: [117.26 us 118.58 us 120.13 us]
filter context f32 high selectivity time: [1.7895 us 1.7919 us 1.7942 us]
This new implementation is based on a few key ideas (a rough sketch follows the list):
(1) if the data array being filtered doesn't have a null bitmap, no time should be wasted copying or creating a null bitmap in the resulting filtered data array - this is implemented using the CopyNullBit trait, which has both a no-op implementation and an actual implementation
(2) when the filter is highly selective, e.g. only a small number of values from the data array are selected, the filter implementation should efficiently skip entire batches of 0s in the filter array - this is implemented by transmuting the filter array to u64, which allows entire batches of 64 bits to be checked and skipped quickly
(3) when an entire record batch is filtered, any computation which depends only on the filter array is done once and then shared for filtering all the data arrays in the record batch - this is implemented using the FilterContext struct
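As the rough sketch promised above, here are ideas (1) and (2) in a self-contained form, using plain Vecs and illustrative names rather than the actual arrow buffers or the real CopyNullBit/FilterContext types:

// Idea (1): a trait for copying null bits with a no-op implementation,
// so arrays without a null bitmap pay no per-value cost.
trait CopyNullBit {
    fn copy_null_bit(&mut self, source_index: usize);
}

struct NullBitNoop;
impl CopyNullBit for NullBitNoop {
    #[inline]
    fn copy_null_bit(&mut self, _source_index: usize) {} // no-op
}

// Idea (2): view the filter as u64 words; a word equal to 0 means 64
// consecutive filter bits are 0, so a whole batch of 64 values is skipped.
fn filter_values(values: &[u8], filter_words: &[u64], null_bits: &mut impl CopyNullBit) -> Vec<u8> {
    let mut output = Vec::new();
    for (word_index, &word) in filter_words.iter().enumerate() {
        if word == 0 {
            continue; // skip 64 values at once
        }
        for bit in 0..64 {
            if (word & (1u64 << bit)) != 0 {
                let index = word_index * 64 + bit;
                null_bits.copy_null_bit(index);
                output.push(values[index]);
            }
        }
    }
    output
}

fn main() {
    let values: Vec<u8> = (0..128).map(|i| i as u8).collect();
    // Select values 0 and 3 from the first word; the second word selects nothing.
    let filter_words = [0b1001u64, 0u64];
    let filtered = filter_values(&values, &filter_words, &mut NullBitNoop);
    assert_eq!(filtered, vec![0, 3]);
}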
This pull request also implements support for filtering dictionary arrays.
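To illustrate why dictionary arrays can be filtered cheaply, here is a minimal sketch with hypothetical stand-in types (not the arrow DictionaryArray API, and not necessarily how this PR implements it): only the keys are filtered, while the dictionary values are left untouched.

// Illustrative stand-in for a dictionary-encoded array: values are stored
// once, and each element is a key indexing into them.
struct DictArray {
    keys: Vec<u32>,
    values: Vec<String>,
}

// One way to filter a dictionary array: filter only the keys and keep the
// dictionary values as-is, so no string data is copied. Assumes the filter
// has the same length as the keys.
fn filter_dict(array: &DictArray, filter: &[bool]) -> DictArray {
    let keys = array
        .keys
        .iter()
        .zip(filter)
        .filter(|(_, &keep)| keep)
        .map(|(&key, _)| key)
        .collect();
    DictArray { keys, values: array.values.clone() }
}

fn main() {
    let array = DictArray {
        keys: vec![0, 1, 0, 2],
        values: vec!["a".to_string(), "b".to_string(), "c".to_string()],
    };
    let filtered = filter_dict(&array, &[true, false, false, true]);
    assert_eq!(filtered.keys, vec![0, 2]);
}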
@paddyhoran @andygrove