
Conversation

@yordan-pavlov
Contributor

@yordan-pavlov yordan-pavlov commented Jul 19, 2020

The filter kernel located here https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/filter.rs

currently has the following performance:

filter old u8 low selectivity time: [1.7782 ms 1.7790 ms 1.7801 ms]
filter old u8 high selectivity time: [815.58 us 816.58 us 817.57 us]
filter old u8 w NULLs low selectivity time: [1.8131 ms 1.8231 ms 1.8336 ms]
filter old u8 w NULLs high selectivity time: [817.41 us 820.01 us 823.05 us]

I have been working on a new implementation which performs approximately 17 to 550 times faster, depending mostly on filter selectivity. Here are the benchmark results:

filter u8 low selectivity time: [107.26 us 108.24 us 109.58 us]
filter u8 high selectivity time: [4.7854 us 4.8050 us 4.8276 us]
filter context u8 low selectivity time: [102.59 us 102.93 us 103.38 us]
filter context u8 high selectivity time: [1.4709 us 1.4760 us 1.4823 us]
filter context u8 w NULLs low selectivity time: [130.48 us 131.00 us 131.65 us]
filter context u8 w NULLs high selectivity time: [2.0520 us 2.0818 us 2.1137 us]
filter context f32 low selectivity time: [117.26 us 118.58 us 120.13 us]
filter context f32 high selectivity time: [1.7895 us 1.7919 us 1.7942 us]

This new implementation is based on a few key ideas:

(1) if the data array being filtered doesn't have a null bitmap, no time should be wasted copying or creating a null bitmap in the resulting filtered data array - this is implemented using the CopyNullBit trait, which has a no-op implementation and an actual implementation

(2) when the filter is highly selective, e.g. only a small number of values from the data array are selected, the filter implementation should efficiently skip entire batches of 0s in the filter array - this is implemented by transmuting the filter array to u64, which makes it possible to quickly check and skip entire batches of 64 bits (see the sketch after this list)

(3) when an entire record batch is filtered, any computation which only depends on the filter array is done once and then shared for filtering all the data arrays in the record batch - this is implemented using the FilterContext struct
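
For illustration, here is a minimal sketch of idea (2) - not the PR's actual code, and the function and variable names are made up - showing how viewing the filter bitmap as u64 words lets the kernel skip 64 filter bits with a single comparison:

// Minimal sketch of idea (2); names are illustrative, not the PR's actual code.
// Viewing the filter bitmap as u64 words lets the kernel test and skip
// 64 filter bits with a single comparison when the word is all zeros.
fn filter_values(filter_words: &[u64], values: &[u8]) -> Vec<u8> {
    let mut filtered = Vec::new();
    for (word_index, &word) in filter_words.iter().enumerate() {
        if word == 0 {
            // highly selective filter: nothing selected in this batch of 64
            continue;
        }
        let base = word_index * 64;
        for bit in 0..64 {
            if word & (1u64 << bit) != 0 {
                if let Some(&value) = values.get(base + bit) {
                    filtered.push(value);
                }
            }
        }
    }
    filtered
}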

This pull request also implements support for filtering dictionary arrays. 

@paddyhoran @andygrove

@wesm
Member

wesm commented Jul 19, 2020

I refer you all to the work that we've recently done in C++, where we use popcount on 64 bits at a time to efficiently compute the size of the filter output as well as to quickly compute the filtered output. It might be worth replicating the "BitBlockCounter" concept in Rust.
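
For context, the popcount idea can be sketched in Rust roughly as follows (illustrative only): summing count_ones over each 64-bit word of the filter bitmap gives the exact output length up front, so the output buffer can be sized in one allocation.

// Sketch only: summing the popcount of each 64-bit filter word gives the
// number of selected values, i.e. the length of the filter output.
fn filtered_len(filter_words: &[u64]) -> usize {
    filter_words.iter().map(|word| word.count_ones() as usize).sum()
}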

@andygrove
Member

@yordan-pavlov This is exciting! I will start reviewing this later today.

@yordan-pavlov
Contributor Author

just for reference, I think the relevant C++ filter implementation (which wesm is referring to) is in the PrimitiveFilterImpl class here https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/vector_selection.cc#L558

Member

@jorgecarleitao jorgecarleitao left a comment

I went through the code. This is beautiful, great work! I haven't seen this kind of optimization in a long time!

I have two minor comments:

  • I think that there is an .unwrap() that hides a "not implemented" Err. I left a comment where we do it. The rationale here is that we could try to use ? instead so that the error is passed along up the stack.
  • Do you know what the speedup of the unsafe bits is vs not using unsafe? I am just curious whether those unsafe blocks are a fundamental part of the speedup (and thus necessary). The rationale is that in Rust unsafe warrants some care, and one hypothesis is that most of the speedup comes from the 3 ideas you outlined.

let filtered_arrays = record_batch
    .columns()
    .iter()
    .map(|a| filter_context.filter(a.as_ref()).unwrap())
Member

Isn't this .unwrap() "hiding" errors related to unsupported data types?

Contributor Author

good spot, I have changed the filter_record_batch method to remove the unwrap() and use .collect::<Result<Vec<ArrayRef>>>()?; instead
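
For reference, the described change looks roughly like this in the context of filter_record_batch (a sketch, not necessarily the exact PR code):

// Propagate per-column errors instead of unwrapping them.
let filtered_arrays = record_batch
    .columns()
    .iter()
    .map(|a| filter_context.filter(a.as_ref()))
    .collect::<Result<Vec<ArrayRef>>>()?;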

@yordan-pavlov
Contributor Author

yordan-pavlov commented Jul 25, 2020

@jorgecarleitao thanks for the feedback

I did some more profiling specifically around the unsafe parts of the code and found that the safe version of copy_null_bit is just as fast, so I have removed that unsafe section; here are some benchmark results:

copy_null_bit unsafe:
filter context u8 w NULLs low selectivity time: [142.05 us 142.35 us 142.68 us]
filter context u8 w NULLs high selectivity time: [2.0915 us 2.1015 us 2.1127 us]

copy_null_bit safe:
filter context u8 w NULLs low selectivity time: [134.74 us 134.86 us 134.98 us]
filter context u8 w NULLs high selectivity time: [2.0536 us 2.0613 us 2.0707 us]

I also benchmarked replacing the unsafe section in the filter_array_impl method with value_buffer.write(), but this results in an approximately 18% drop in performance with sparse filter arrays, as can be seen from the benchmark results below:

filter u8 low selectivity
time: [131.08 us 132.46 us 134.27 us]
change: [+13.141% +17.189% +22.115%] (p = 0.00 < 0.05)

filter context u8 low selectivity
time: [127.47 us 129.27 us 131.56 us]
change: [+12.008% +19.674% +27.939%] (p = 0.00 < 0.05)

filter context u8 w NULLs low selectivity
time: [154.32 us 155.27 us 156.79 us]
change: [+15.444% +17.846% +22.268%] (p = 0.00 < 0.05)

filter context f32 low selectivity
time: [137.62 us 138.01 us 138.52 us]
change: [+12.495% +18.180% +23.088%] (p = 0.00 < 0.05)

finally, looking at the C++ implementation inspired me to change the filter_array_impl method to add a special case for when the 64-bit filter batch is all 1s. This doesn't appear to reduce performance in other cases, but improves performance of filtering with very dense filter arrays (almost all 1s) by about 20 times; here are the latest benchmark results (a sketch of this fast path follows the results):

filter u8 low selectivity time: [118.21 us 118.82 us 119.61 us]
filter u8 high selectivity time: [4.9893 us 4.9941 us 4.9998 us]
filter u8 very low selectivity time: [11.861 us 11.919 us 11.987 us]

filter context u8 low selectivity time: [115.22 us 115.77 us 116.36 us]
filter context u8 high selectivity time: [1.6571 us 1.6784 us 1.7033 us]
filter context u8 very low selectivity time: [8.4205 us 8.4370 us 8.4590 us]

filter context u8 w NULLs low selectivity time: [132.49 us 132.78 us 133.10 us]
filter context u8 w NULLs high selectivity time: [2.1935 us 2.1979 us 2.2030 us]
filter context u8 w NULLs very low selectivity time: [161.64 us 162.12 us 162.55 us]

filter context f32 low selectivity time: [158.81 us 161.58 us 164.61 us]
filter context f32 high selectivity time: [1.8318 us 1.8371 us 1.8436 us]
filter context f32 very low selectivity time: [18.658 us 18.785 us 18.935 us]
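
For illustration, the all-1s fast path can be sketched as follows (names are made up; this is not the PR's actual code):

// Sketch of the dense-filter fast path: when a 64-bit filter word is all ones,
// the corresponding 64 values are copied as one slice instead of bit by bit.
fn filter_batch(word: u64, values: &[u8; 64], out: &mut Vec<u8>) {
    if word == u64::MAX {
        // all 64 filter bits set: bulk copy
        out.extend_from_slice(values);
    } else if word != 0 {
        for bit in 0..64 {
            if word & (1u64 << bit) != 0 {
                out.push(values[bit]);
            }
        }
    }
    // word == 0: nothing selected in this batch, skip entirely
}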

@andygrove any thoughts?

@andygrove
Member

Sorry @yordan-pavlov but I didn't get a chance to review yet. I hope to get to it soon.

@yordan-pavlov yordan-pavlov marked this pull request as draft July 26, 2020 08:07
@yordan-pavlov yordan-pavlov marked this pull request as ready for review July 26, 2020 11:10
@andygrove andygrove requested review from nevi-me and paddyhoran July 27, 2020 13:22
let data_index = i * 64;
let data_len = value_size * 64;
null_bit_setter.copy_null_bits(data_index, 64);
unsafe {
Member

The use of unsafe makes me nervous. If the performance improvement justifies using this, how can we be sure we have adequate testing for this?

Contributor Author

@andygrove I have replaced the use of unsafe memcpy with copy_from_slice - this does reduce performance (by about 12%) but it is still much faster than the old implementation; here are the latest benchmark results (a rough sketch of the change follows the results):

filter u8 low selectivity time: [130.74 us 131.37 us 132.19 us]
filter u8 high selectivity time: [5.2031 us 5.2366 us 5.2764 us]
filter u8 very low selectivity time: [12.353 us 12.542 us 12.759 us]

filter context u8 low selectivity time: [129.54 us 129.88 us 130.30 us]
filter context u8 high selectivity time: [1.7926 us 1.7974 us 1.8046 us]
filter context u8 very low selectivity time: [8.7700 us 8.7987 us 8.8342 us]

filter context u8 w NULLs low selectivity time: [150.36 us 151.09 us 152.01 us]
filter context u8 w NULLs high selectivity time: [2.4173 us 2.4882 us 2.5703 us]
filter context u8 w NULLs very low selectivity time: [158.86 us 160.32 us 162.26 us]

filter context f32 low selectivity time: [123.38 us 124.34 us 126.18 us]
filter context f32 high selectivity time: [1.4836 us 1.4994 us 1.5297 us]
filter context f32 very low selectivity time: [19.422 us 19.653 us 19.932 us]
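
As a rough sketch, the safe bulk copy mentioned above amounts to something like this (indices and names are illustrative, not the PR's exact code):

// copy_from_slice replaces the raw memcpy while still copying a whole
// batch of selected values in one call; bounds are checked by Rust.
fn copy_batch(input: &[u8], output: &mut [u8], input_index: usize, output_index: usize, len: usize) {
    output[output_index..output_index + len]
        .copy_from_slice(&input[input_index..input_index + len]);
}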

Member

Thanks @yordan-pavlov We can always consider the unsafe changes in a future PR, but removing them for now makes it much easier to get these changes merged.

let data_index = (i * 64) + j;
null_bit_setter.copy_null_bit(data_index);
// if filter bit == 1: copy data value to temp array
unsafe {
Member

Same comment as above re unsafe

@andygrove
Member

LGTM overall.

Contributor

@paddyhoran paddyhoran left a comment

LGTM, thank you @yordan-pavlov.

One thing that we are running into with the kernel implementations is that we are not considering the offsets in a lot of cases, so we run into problems when we slice arrays and pass them to kernels.

I think we need a test or two around this, but it does not need to be in this PR.

@yordan-pavlov
Contributor Author

yordan-pavlov commented Jul 31, 2020

@paddyhoran yes, you are right. I added a couple more tests for sliced arrays and they didn't pass, so seeing that the PR was not yet merged I added a few small changes to:
(1) implement support for filtering of sliced / offset data arrays
(2) return an error if the filter array is offset - I thought it better to make this obvious rather than silently process the offset filter array incorrectly

I looked briefly into implementing support for offset filter arrays but I couldn't figure out how to do it quickly - I might have to take another look at the C++ implementation and submit a separate PR for that.
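
Conceptually, the sliced / offset handling described above boils down to translating indices of the sliced view into indices of the underlying buffer - roughly as follows (illustrative names, not the PR's actual code):

// A sliced array exposes indices 0..len, but its values (and null bits) live
// at offset + i in the underlying buffers, so the kernel applies the offset
// when reading from the data array.
fn filter_sliced(values: &[u8], offset: usize, len: usize, keep: impl Fn(usize) -> bool) -> Vec<u8> {
    (0..len)
        .filter(|&i| keep(i))
        .map(|i| values[offset + i])
        .collect()
}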

Member

@andygrove andygrove left a comment

LGTM

@andygrove andygrove closed this in 62dfa11 Aug 6, 2020
@wesm
Member

wesm commented Aug 7, 2020

This PR makes me think that at some point (when someone gets really motivated), it would be interesting to implement a non-language-dependent benchmark harness for certain operations (ones that can be represented using e.g. a standard protobuf-serialized operation/expression) so that we can get apples-to-apples numbers for certain operations across implementations. It would be interesting to know, for example, how the Rust implementation of filtering compares with the C++ one.

@andygrove
Member

I completely agree. I just happen to have a protobuf definition for plans and expressions, as well as serde code for Java and Rust, that I would be happy to contribute. We would need to implement it for C++.
