ARROW-9523 [Rust] Improve filter kernel performance #7798
Conversation
update from upstream repo
merge latest changes from upstream
Merge latest changes from apache/arrow
…ing dictionary arrays
I refer you all to the work that we've recently done in C++ where we use popcount on 64 bits at a time to efficiently compute the size of the filter output as well as to quickly compute the filtered output. It might be worth replicating the "BitBlockCounter" concept in Rust.
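As a rough illustration of that idea (not the actual C++ BitBlockCounter), here is a minimal Rust sketch, assuming the filter bitmap has already been packed into u64 words:

// Popcount on 64 bits at a time: the number of selected values (and hence
// the exact length of the filter output) can be computed cheaply from the
// filter bitmap before any data is copied.
fn filtered_len(filter_words: &[u64]) -> usize {
    filter_words
        .iter()
        .map(|word| word.count_ones() as usize)
        .sum()
}

fn main() {
    // Two 64-bit words: the first selects 3 values, the second selects none.
    let filter_words = [0b1011u64, 0u64];
    assert_eq!(filtered_len(&filter_words), 3);
}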
@yordan-pavlov This is exciting! I will start reviewing this later today.
Just for reference, I think the relevant C++ filter implementation (which wesm is referring to) is in the PrimitiveFilterImpl class here: https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/vector_selection.cc#L558
jorgecarleitao left a comment
I went through the code. This is beautiful, great work! I haven't read this kind of optimization in a long time!
I have two minor comments:
- I think that there is an .unwrap() that hides some "not implemented" Err. I left a comment around where we do it. The rationale here is that we could try to use ? instead, so that the error is passed along in the stack trace.
- Do you know what the speedup of introducing the unsafe bits is vs the speedup of not using unsafe? I am just curious as to whether those unsafe blocks are a fundamental part of the speedup (and are thus necessary). The rationale is that in Rust unsafe often justifies some care, and one hypothesis is that most of the speedup comes from the 3 ideas you outlined.
let filtered_arrays = record_batch
    .columns()
    .iter()
    .map(|a| filter_context.filter(a.as_ref()).unwrap())
Isn't this .unwrap() "hiding" errors related to unsupported data types?
Good spot, I have changed the filter_record_batch method to remove the unwrap() and use .collect::<Result<Vec<ArrayRef>>>()?; instead.
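For reference, a minimal standalone sketch of that collect-into-Result pattern, using an illustrative fallible function rather than the actual arrow types:

// Replacing `.unwrap()` inside a `map` with collecting into `Result`,
// so the first error is propagated via `?` instead of panicking.
fn double_if_even(x: i32) -> Result<i32, String> {
    if x % 2 == 0 {
        Ok(x * 2)
    } else {
        Err(format!("{} is not even", x))
    }
}

fn double_all(values: &[i32]) -> Result<Vec<i32>, String> {
    // collect::<Result<Vec<_>, _>>() stops at the first Err,
    // which ? then returns to the caller.
    let doubled = values
        .iter()
        .map(|v| double_if_even(*v))
        .collect::<Result<Vec<i32>, String>>()?;
    Ok(doubled)
}

fn main() {
    assert_eq!(double_all(&[2, 4]), Ok(vec![4, 8]));
    assert!(double_all(&[2, 3]).is_err());
}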
@jorgecarleitao thanks for the feedback. I did some more profiling specifically around the unsafe parts of the code, comparing the unsafe and safe versions of copy_null_bit, and also benchmarked replacing the unsafe section in the filter u8 low selectivity, filter context u8 low selectivity, filter context u8 w NULLs low selectivity and filter context f32 low selectivity benchmarks. Finally, looking at the C++ implementation inspired me to change the implementation again; the latest results are:
filter u8 low selectivity time: [118.21 us 118.82 us 119.61 us]
filter context u8 low selectivity time: [115.22 us 115.77 us 116.36 us]
filter context u8 w NULLs low selectivity time: [132.49 us 132.78 us 133.10 us]
filter context f32 low selectivity time: [158.81 us 161.58 us 164.61 us]
@andygrove any thoughts?
Sorry @yordan-pavlov but I didn't get a chance to review yet. I hope to get to it soon.
let data_index = i * 64;
let data_len = value_size * 64;
null_bit_setter.copy_null_bits(data_index, 64);
unsafe {
The use of unsafe makes me nervous. If the performance improvement justifies using this, how can we be sure we have adequate testing for this?
@andygrove I have replaced the use of unsafe memcpy with copy_from_slice - this does reduce performance (by about 12%) but it is still much faster than the old implementation; here are the latest benchmark results:
filter u8 low selectivity time: [130.74 us 131.37 us 132.19 us]
filter u8 high selectivity time: [5.2031 us 5.2366 us 5.2764 us]
filter u8 very low selectivity time: [12.353 us 12.542 us 12.759 us]
filter context u8 low selectivity time: [129.54 us 129.88 us 130.30 us]
filter context u8 high selectivity time: [1.7926 us 1.7974 us 1.8046 us]
filter context u8 very low selectivity time: [8.7700 us 8.7987 us 8.8342 us]
filter context u8 w NULLs low selectivity time: [150.36 us 151.09 us 152.01 us]
filter context u8 w NULLs high selectivity time: [2.4173 us 2.4882 us 2.5703 us]
filter context u8 w NULLs very low selectivity time: [158.86 us 160.32 us 162.26 us]
filter context f32 low selectivity time: [123.38 us 124.34 us 126.18 us]
filter context f32 high selectivity time: [1.4836 us 1.4994 us 1.5297 us]
filter context f32 very low selectivity time: [19.422 us 19.653 us 19.932 us]
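For readers unfamiliar with the trade-off, a small self-contained sketch of the two approaches in general terms (illustrative only, not the exact code from this PR): copy_from_slice performs a bounds-checked copy, while the unsafe variant calls ptr::copy_nonoverlapping without those checks.

// Safe, bounds-checked copy: panics if the two slices differ in length.
fn copy_safe(dst: &mut [u8], src: &[u8]) {
    dst.copy_from_slice(src);
}

// Unsafe equivalent: no bounds checks; the caller must guarantee that both
// regions are valid for src.len() bytes and do not overlap.
fn copy_unsafe(dst: &mut [u8], src: &[u8]) {
    assert!(dst.len() >= src.len());
    unsafe {
        std::ptr::copy_nonoverlapping(src.as_ptr(), dst.as_mut_ptr(), src.len());
    }
}

fn main() {
    let src = [1u8, 2, 3, 4];
    let mut a = [0u8; 4];
    let mut b = [0u8; 4];
    copy_safe(&mut a, &src);
    copy_unsafe(&mut b, &src);
    assert_eq!(a, b);
}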
Thanks @yordan-pavlov. We can always consider the unsafe changes in a future PR, but removing them for now makes it much easier to get these changes merged.
let data_index = (i * 64) + j;
null_bit_setter.copy_null_bit(data_index);
// if filter bit == 1: copy data value to temp array
unsafe {
Same comment as above re unsafe
LGTM overall.
LGTM, thank you @yordan-pavlov.
One thing that we keep running into with the kernel implementations is that in many cases they do not take array offsets into account, so we run into problems when we slice arrays and pass them to kernels.
I think we need a test or two around this, but it does not need to be in this PR.
@paddyhoran yes, you are right. I added a couple more tests for sliced arrays and they didn't pass, so seeing that the PR was not yet merged I added a few small changes to make them pass. I looked briefly into implementing support for offset filter arrays but I couldn't figure out how to do it quickly - I might have to take another look at the C++ implementation and submit a separate PR for that.
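For context on the offset issue, a minimal self-contained sketch with a hypothetical stand-in type (not the arrow Array API) showing why a kernel that indexes the underlying buffer directly must account for the slice offset:

// Hypothetical stand-in for an Arrow-style array slice: a view over a shared
// buffer described by an offset and a length.
struct SlicedArray<'a> {
    buffer: &'a [u8],
    offset: usize,
    len: usize,
}

impl<'a> SlicedArray<'a> {
    fn value(&self, i: usize) -> u8 {
        assert!(i < self.len);
        // Kernels that index the underlying buffer directly must add the
        // offset; forgetting it is exactly the bug sliced-array tests catch.
        self.buffer[self.offset + i]
    }
}

fn main() {
    let buffer = [10u8, 20, 30, 40, 50];
    // A slice of length 3 starting at offset 2, i.e. the values 30, 40, 50.
    let sliced = SlicedArray { buffer: &buffer, offset: 2, len: 3 };
    assert_eq!(sliced.value(0), 30);
    assert_eq!(sliced.value(2), 50);
}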
andygrove left a comment
LGTM
This PR makes me think that at some point (when someone gets really motivated), it would be interesting to implement a non-language-dependent benchmark harness for certain operations (such as can be represented using e.g. a standard protobuf-serialized operation/expression) so that we can get apples-to-apples numbers across implementations. It would be interesting to know e.g. how the Rust implementation of filtering compares with the C++ implementation.
I completely agree. I just happen to have a protobuf definition for plans and expressions, as well as serde code for Java and Rust, that I would be happy to contribute. We would need to implement it for C++.
The filter kernel located here https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/filter.rs
currently has the following performance:
filter old u8 low selectivity time: [1.7782 ms 1.7790 ms 1.7801 ms]
filter old u8 high selectivity time: [815.58 us 816.58 us 817.57 us]
filter old u8 w NULLs low selectivity time: [1.8131 ms 1.8231 ms 1.8336 ms]
filter old u8 w NULLs high selectivity time: [817.41 us 820.01 us 823.05 us]
I have been working on a new implementation which is between approximately 17 and 550 times faster, depending mostly on filter selectivity. Here are the benchmark results:
filter u8 low selectivity time: [107.26 us 108.24 us 109.58 us]
filter u8 high selectivity time: [4.7854 us 4.8050 us 4.8276 us]
filter context u8 low selectivity time: [102.59 us 102.93 us 103.38 us]
filter context u8 high selectivity time: [1.4709 us 1.4760 us 1.4823 us]
filter context u8 w NULLs low selectivity time: [130.48 us 131.00 us 131.65 us]
filter context u8 w NULLs high selectivity time: [2.0520 us 2.0818 us 2.1137 us]
filter context f32 low selectivity time: [117.26 us 118.58 us 120.13 us]
filter context f32 high selectivity time: [1.7895 us 1.7919 us 1.7942 us]
This new implementation is based on a few key ideas (a rough sketch follows the list):
(1) if the data array being filtered doesn't have a null bitmap, no time should be wasted copying or creating a null bitmap in the resulting filtered data array - this is implemented using the CopyNullBit trait, which has both a no-op implementation and an actual implementation
(2) when the filter is highly selective, e.g. only a small number of values from the data array are selected, the filter implementation should efficiently skip entire batches of 0s in the filter array - this is implemented by transmuting the filter array to u64, which allows entire batches of 64 bits to be checked and skipped quickly
(3) when an entire record batch is filtered, any computation which depends only on the filter array is done once and then shared for filtering all the data arrays in the record batch - this is implemented using the FilterContext struct
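As the rough sketch promised above, here are ideas (1) and (2) in a self-contained form, using plain Vecs and illustrative names rather than the actual arrow buffers or the real CopyNullBit/FilterContext types:

// Idea (1): a trait for copying null bits with a no-op implementation,
// so arrays without a null bitmap pay no per-value cost.
trait CopyNullBit {
    fn copy_null_bit(&mut self, source_index: usize);
}

struct NullBitNoop;
impl CopyNullBit for NullBitNoop {
    #[inline]
    fn copy_null_bit(&mut self, _source_index: usize) {} // no-op
}

// Idea (2): view the filter as u64 words; a word equal to 0 means 64
// consecutive filter bits are 0, so a whole batch of 64 values is skipped.
fn filter_values(values: &[u8], filter_words: &[u64], null_bits: &mut impl CopyNullBit) -> Vec<u8> {
    let mut output = Vec::new();
    for (word_index, &word) in filter_words.iter().enumerate() {
        if word == 0 {
            continue; // skip 64 values at once
        }
        for bit in 0..64 {
            if (word & (1u64 << bit)) != 0 {
                let index = word_index * 64 + bit;
                null_bits.copy_null_bit(index);
                output.push(values[index]);
            }
        }
    }
    output
}

fn main() {
    let values: Vec<u8> = (0..128).map(|i| i as u8).collect();
    // Select values 0 and 3 from the first word; the second word selects nothing.
    let filter_words = [0b1001u64, 0u64];
    let filtered = filter_values(&values, &filter_words, &mut NullBitNoop);
    assert_eq!(filtered, vec![0, 3]);
}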
This pull request also implements support for filtering dictionary arrays.
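To illustrate why dictionary arrays can be filtered cheaply, here is a minimal sketch with hypothetical stand-in types (not the arrow DictionaryArray API, and not necessarily how this PR implements it): only the keys are filtered, while the dictionary values are left untouched.

// Illustrative stand-in for a dictionary-encoded array: values are stored
// once, and each element is a key indexing into them.
struct DictArray {
    keys: Vec<u32>,
    values: Vec<String>,
}

// One way to filter a dictionary array: filter only the keys and keep the
// dictionary values as-is, so no string data is copied. Assumes the filter
// has the same length as the keys.
fn filter_dict(array: &DictArray, filter: &[bool]) -> DictArray {
    let keys = array
        .keys
        .iter()
        .zip(filter)
        .filter(|(_, &keep)| keep)
        .map(|(&key, _)| key)
        .collect();
    DictArray { keys, values: array.values.clone() }
}

fn main() {
    let array = DictArray {
        keys: vec![0, 1, 0, 2],
        values: vec!["a".to_string(), "b".to_string(), "c".to_string()],
    };
    let filtered = filter_dict(&array, &[true, false, false, true]);
    assert_eq!(filtered.keys, vec![0, 2]);
}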
@paddyhoran @andygrove