Skip to content

Conversation

@jorgecarleitao
Copy link
Member

TL;DR:

  • Add support to filter StructArray
  • fixes 2 bugs in take of StructArray and ListArray
  • speedup take by 1.2-1.9

Motivation

Same motivation as #8630 plus solving the following issues:

They are all inter-connected, which made it difficult to solve one without solving the other (as the tests either pass or don't).

This PR

This PR extends MutableDataArray to support extending the MutableDataArray with nulls, thereby allowing to correctly implement take and filter for StructArray. Specifically, currently,

This PR fixes both issues.

This PR also converts the implementation of take to use MutableDataArray, thereby reducing the chance of bugs and maintenance burden, as well as performance. While doing that, I found 1 bug and 1 improvement, that this PR addresses:

  • currently, take has a bug on which nulls are not correctly taken into account (ARROW-10593)
  • currently take takes all values from a struct array, even when we are just taking a null (ARROW-10594)

This PR fixes both issues.

Finally, this improves performance of take by 1.2-1.9x (which also impacts the performance of the sort).

Benchmarks
set -e
git checkout 010d260173a9e110901ca67372a4ac379a615a13
cargo bench --bench filter_kernels
git checkout mutable_filter
cargo bench --bench filter_kernels
Previous HEAD position was e54bd6272 Migrated filter.
Switched to branch 'clean_take'
Your branch is up to date with 'origin/clean_take'.
   Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 45.00s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/take_kernels-cd6b83d872e2c3bf
Gnuplot not found, using plotters backend
take i32 512            time:   [6.1191 us 6.1260 us 6.1333 us]                          
                        change: [-20.128% -19.695% -19.203%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe

take i32 1024           time:   [10.691 us 10.713 us 10.737 us]                           
                        change: [-23.008% -22.493% -21.981%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe

take bool 512           time:   [5.1620 us 5.1880 us 5.2168 us]                           
                        change: [-38.963% -38.374% -37.725%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

take bool 1024          time:   [8.5777 us 8.5896 us 8.6024 us]                            
                        change: [-45.607% -45.364% -45.092%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

take str 512            time:   [13.796 us 13.811 us 13.826 us]                          
                        change: [-24.722% -24.254% -23.764%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

take str 1024           time:   [25.149 us 25.167 us 25.186 us]                           
                        change: [-26.483% -26.051% -25.609%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

@github-actions
Copy link

@jorgecarleitao
Copy link
Member Author

I am still not comfortable with this, so I am closing it for now.

@drusso
Copy link
Contributor

drusso commented Nov 15, 2020

This looks like a neat take on take (pardon the pun!). Is there anything in particular preventing adoption of this approach?

@jorgecarleitao
Copy link
Member Author

Performance: with the fix on the benchmarks for take, this is 2x slower than master.

@drusso
Copy link
Contributor

drusso commented Nov 17, 2020

Got it, thanks for clarifying.

@jorgecarleitao jorgecarleitao deleted the clean_take branch December 4, 2020 07:40
@jorgecarleitao jorgecarleitao restored the clean_take branch December 4, 2020 07:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants