ARROW-11270: [Rust] Array slice accessors #9215

tyrelr · 2021-01-16T19:16:54Z

Using an approach similar to ARROW-10989, migrate typed array API's to use slices where they can.

This impacts the API of GenericBinaryArray<>, GenericListArray<>, GenericStringArray<>

This also enables bounds checking in every value() function on each of the above arrays, with an unsafe value_unchecked() provided for callers willing to do their own bounds-checking.

github-actions · 2021-01-16T19:17:15Z

https://issues.apache.org/jira/browse/ARROW-11270

codecov-io · 2021-01-16T19:39:00Z

Codecov Report

Merging #9215 (dda0545) into master (ab5fc97) will increase coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #9215      +/-   ##
==========================================
+ Coverage   81.89%   81.92%   +0.03%     
==========================================
  Files         215      215              
  Lines       52988    53070      +82     
==========================================
+ Hits        43392    43478      +86     
+ Misses       9596     9592       -4

Impacted Files	Coverage Δ
rust/arrow/src/array/array.rs	`88.61% <100.00%> (ø)`
rust/arrow/src/array/array_binary.rs	`91.75% <100.00%> (+0.76%)`	⬆️
rust/arrow/src/array/array_list.rs	`93.84% <100.00%> (+0.78%)`	⬆️
rust/arrow/src/array/array_primitive.rs	`94.72% <100.00%> (+0.24%)`	⬆️
rust/arrow/src/array/array_string.rs	`95.37% <100.00%> (+1.25%)`	⬆️
rust/arrow/src/array/builder.rs	`85.10% <100.00%> (ø)`
rust/arrow/src/compute/kernels/cast.rs	`96.96% <100.00%> (-0.03%)`	⬇️
rust/arrow/src/compute/kernels/limit.rs	`100.00% <100.00%> (ø)`
rust/arrow/src/compute/util.rs	`98.92% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ab5fc97...dda0545. Read the comment docs.

jorgecarleitao

LGTM. Great simplification and really called for :)

With this change, I think that we may be able to remove an unsafe in compute::kernels::length where we re-interpret the offset buffer as type Offset.

alamb · 2021-01-17T11:13:26Z

I apologize for the delay in merging Rust PRs -- the 3.0 release is being finalized now and are planning to minimize entropy by postponing merging changes not critical for the release until the process was complete. I hope the process is complete in the next few days. There is more discussion in the mailing list

alamb · 2021-01-20T11:57:09Z

@jorgecarleitao do you think this PR is ready to merge? Does it need any additional review or rebasing?

jorgecarleitao · 2021-01-20T11:58:48Z

Ready to merge. Note that it has some backward incompatible changes, so PRs on top need to be merged carefully.

alamb · 2021-01-20T12:08:56Z

Roger -- I'll plan to merge it later today (after the current round of merges have gotten through CI)

alamb

I merged this branch to apache/master locally and I got a compilation error. I think this is a logical conflict introduced with a bunch of the other PRs that have been merged over the last few days.

@tyrelr can you possibly rebase and fix the conflicts?

Thanks again for your help,
Andrew

(arrow_dev) alamb@MacBook-Pro:~/Software/arrow2/rust$ cargo test --all
   Compiling parquet v3.0.0-SNAPSHOT (/Users/alamb/Software/arrow2/rust/parquet)
   Compiling arrow v3.0.0-SNAPSHOT (/Users/alamb/Software/arrow2/rust/arrow)
error[E0599]: no method named `to_byte_slice` found for array `[{integer}; 10]` in the current scope
   --> arrow/src/array/array_list.rs:710:49
    |
710 |                 &[0, 1, 2, 3, 4, 5, 6, 7, 8, 9].to_byte_slice(),
    |                                                 ^^^^^^^^^^^^^ method not found in `[{integer}; 10]`
    |
    = help: items from traits can only be used if the trait is in scope
    = note: the following trait is implemented but not in scope; perhaps add a `use` for it:
            `use crate::datatypes::ToByteSlice;`

error[E0599]: no method named `to_byte_slice` found for array `[i64; 10]` in the current scope
   --> arrow/src/array/array_list.rs:717:62
    |
717 |             Buffer::from(&[0i64, 2, 2, 2, 4, 6, 6, 9, 9, 10].to_byte_slice());
    |                                                              ^^^^^^^^^^^^^ method not found in `[i64; 10]`
    |
    = help: items from traits can only be used if the trait is in scope
    = note: the following trait is implemented but not in scope; perhaps add a `use` for it:
            `use crate::datatypes::ToByteSlice;`

error[E0599]: no method named `to_byte_slice` found for array `[{integer}; 10]` in the current scope
   --> arrow/src/array/array_list.rs:807:49
    |
807 |                 &[0, 1, 2, 3, 4, 5, 6, 7, 8, 9].to_byte_slice(),
    |                                                 ^^^^^^^^^^^^^ method not found in `[{integer}; 10]`
    |
    = help: items from traits can only be used if the trait is in scope
    = note: the following trait is implemented but not in scope; perhaps add a `use` for it:
            `use crate::datatypes::ToByteSlice;`

   Compiling arrow-flight v3.0.0-SNAPSHOT (/Users/alamb/Software/arrow2/rust/arrow-flight)
error: aborting due to 3 previous errors

For more information about this error, try `rustc --explain E0599`.
error: could not compile `arrow`

To learn more, run the command again with --verbose.
warning: build failed, waiting for other jobs to finish...
error: build failed

tyrelr · 2021-01-21T01:34:44Z

Rebased. But I didn't hit the merge conflicts during... I can see that a number of tests were swapped to use Buffer::from_slice_ref()... so I'll add another commit to switch to that function just to be safe.

jorgecarleitao

Left a performance concern.

jorgecarleitao · 2021-01-21T05:49:34Z

rust/arrow/src/array/array_primitive.rs

Thinking about it, I am not sure we want this: this will have severe implications to almost all uses of value(). Could you run the benches before and after this PR?

My suggestion is that we mark this method as unsafe, so that it is clear that people need to be conscious about it. Since this will impact a large part of our code base (and likely others), I suggest that we do this change on a separate PR.

I think that we would like to instruct users of this API to switch to values(). This change will not trigger that change (as the code continues to run), but will have an amazing performance implication.

Think

for i in 0..array.len() { if array.value(i) > 2 { // do x }; }

this loop will suffer a lot from this change.

we would like users to change it to

array.values().for_each(|value| { if value > 2 { // do x }; });

For that, we need to mark array.value(i) as unsafe to indicate that yes, you can use that method, yes, it is fast, but you need to be careful about i.

I agree, that we want to guide API users to switch to values(). My hesitancy with making/leaving it unsafe is that I think it's the only Array value(x) method when combined with the other changes in this PR (which is why it got added as a followon commit). That leads to awkwardness in macros, such as the comparison kernel, where it effectively calls "$left.value(x)". Making macro style invocation sometimes-need-unsafe is a bit unfortunate.

(There may be an argument that a trait could abstract away the need for macros altogether... but my experiments to rework the comparison kernel to that effect are incomplete - and partly sidelined with various other experiments I'm toying with... most of which somehow relate to removing direct calls to __Array.value(i) )

I would expect that in the exact for loop you mention, the compiler could recognize the satisfied loop bound since both len() and .value(i) are inlinable and use the same underlying field, but I expect it is also easy to complicate on that loop to defeat that by having a less direct relationship between i & array.len(). The pretty printing logic probably would do additional bounds checks now, as it iterates based on len() from the record batch, which is semantically equal, but nothing in the code enforces that in a way I would expect the compiler to understand.

I'll launch some benchmarks of head, master, and the commit just before the PrimitiveArray safety change.

I think that we may need to park this change until we migrate our code base to use values() whenever possible.

If we merge this one, we get a major performance degradation. If we add unsafe, we need to add unsafe in a lot of places. Neither are great options :)

That makes sense. I'll rebase dropping that commit.

I think that I covered all of those in #9291 . Good that there is no impact on sort, which was the only case where I could not optimize (i.e. we really needed the bound check).

The larger task atm is to convert all unit-tests that use value(i) to values()[i].

Fyi, the reason I propose keeping value(i) as unsafe instead of un-performant is that there is no way to efficiently access the value i without knowing the internals of the primitiveArray (i.e. pointer and offset). imo this method does serve the purpose of abstracting out those details. It just happens to be unsafe and we currently not marking it as such. However, please say if you disagree here.

wrt to the comparison kernel, we can create a generic for primitives alone for that use-case, no?

Let me know what do you think

#9291 is good progress towards eliminating use of the function.

And certainly, we could 'split' the macro for primitives as a quick fix to get rid of the call to the function. I've been experimenting with an alternative approach that might be a bit more flexible to multiple use cases, described at the bottom of this comment.

I am quite torn about whether I think value should or should not be in the interface.

Reasons to drop value(i) -> T::Native

I think that even if value(i) was dropped from the PrimitiveArray impl's, efficient random access to items without a bounds check can still be achieved through unsafe{*primitive_array.values().get_unchecked(i)} (the extra * because get_unchecked() returns a ref to the value).

I'm not sure I have any example code or measurements to demonstrate it on hand, but I am certain I saw the silently-unsafe implementationx.values().iter().zip(y.values().iter()) did (slightly) outperform (0..x.len()).map(|i|{x.value(i),y.value(i)}. I believe it was when I was playing with non-simd arithmetic kernels.... So that is the root of my hesitancy, is I'm worried it doesn't actually escape any overhead, and unintentionally lead people away from a more reliable/performant way. IF there is a context where unsafe{x.value(i)} beats the performance of unsafe{*x.values().get_unchecked(i)}

Reasons to keep value(i) -> T::Native

All other array implementations have value functions as far as I recall, so it is a nice 'consistency'.

In the back of my mind, the biggest argument to keep value(i) is for api consistency... so long term, a 'trait' may be the place where it might fit best? Very roughly, I'm thinking:

trait TypedArrowArray : ArrowArray { type RefType; fn is_valid(i:usize) -> bool; //bounds check unsafe fn is_valid(i:usize) -> bool; //no bounds check fn value(i:usize) -> RefType; //bounds check unsafe fn value_unchecked(i:usize) -> RefType; //no bounds checked fn iter() -> impl Iterator<Option<RefType>>; fn iter_values() -> impl Iterator<RefType>; } impl <T: ArrowPrimitiveType> TypedArrowArray<&T::Native> for PrimitiveArray<T> { ... } impl TypedArrowArray<ArrayRef> for GenericListArray<T> { ... } //and similar for string/binary. ... I am not sure whether struct arrays could fit... Dictionary would not give access to 'keys', only to the values referenced by each key? Union would require some kind of RefType that can downcast into the actual value?

Of course, I am uncertain how much overhead the 'standarization' such a trait impl implies would bring... would any kernels actually benefit from using generic implementations against such an api, or will they always go down to the concrete type to squeeze little short-cuts out that don't fit in the generic interface? I'm unsure, so very (very, very) slowly experimenting...

Summary

So in short, my thoughts are:

I think that leaving value(i) safety consideration out of this PR makes sense. I've rebased to drop that out - although I did leave the additional values() test code.

Marking it unsafe in the near future is absolutely better than being silently-unsafe. The argument that adding bounds-checks could silently impact external users is reasonable, taking unsafe has the larger 'warning' so that the change isn't missed.

Longer term, the options of deprecating it, or explicitly moving it into an trait impl are both contenders in my mind... but neither option is directly relevant to this PR.

Let me know if that seems reasonable.

Just looking over the benchmarks again, one of the biggest changes is actually a performance improvement while doing safe access... which makes very little sense... so I think I'll run the benchmarks again overnight.

jorgecarleitao

Great work, @tyrelr , I think that this is a major improvement to the API, reduces unsoundness, and makes it clear.

I left an optional comment, since we are touching the API.

@sunchao @nevi-me @alamb would you like to voice your opinion here?

jorgecarleitao · 2021-01-23T13:34:30Z

rust/arrow/src/array/array_string.rs

what do you think of naming it simply offsets? I can't see the value of value_ on the name (same for the other types). Since we are touching the API in backward incompatible ways... xD

I had considered that, but was concerned that it may be confused with the buffer.offset... my hope was that it'd make it clear that it's an "offset into values()" (not into the buffer, which has the additional array-slicing offset).

I could go either way with it though?

I'm fine with removing the value part - we should do it in another PR though.

rust/arrow/src/array/array_string.rs

tyrelr · 2021-01-23T18:06:00Z

The 10% performance impact looks like mostly improvements:

critcmp master-499b6d0c array-slice-78f360b7c -t 10
group                      array-slice-78f360b7c                  master-499b6d0c
-----                      ---------------------                  ---------------
add 512                    1.00    282.9±0.41ns        ? B/sec    1.37    387.7±0.71ns        ? B/sec
add_nulls_512              1.00    344.7±0.54ns        ? B/sec    1.30    449.3±1.08ns        ? B/sec
bench_primitive            1.42      3.1±0.00ms  1279.3 MB/sec    1.00      2.2±0.01ms  1815.1 MB/sec
buffer_bit_ops or          1.00    590.9±0.95ns        ? B/sec    1.14    673.4±0.93ns        ? B/sec
cast int64 to int32 512    1.15      2.5±0.02µs        ? B/sec    1.00      2.2±0.01µs        ? B/sec
equal_bool_512             1.11     43.1±0.08ns        ? B/sec    1.00     38.8±0.06ns        ? B/sec
like_utf8 scalar equals    1.00     72.5±0.88µs        ? B/sec    1.22     88.4±0.18µs        ? B/sec
max nulls 512              1.00   1758.9±6.23ns        ? B/sec    1.31      2.3±0.01µs        ? B/sec
min string 512             1.00      3.4±0.00µs        ? B/sec    1.32      4.5±0.07µs        ? B/sec
multiply 512               1.00    335.6±0.67ns        ? B/sec    1.19    398.6±5.74ns        ? B/sec
mutable str 1024           1.00      2.7±0.00ms        ? B/sec    1.14      3.0±0.01ms        ? B/sec
subtract 512               1.00    345.3±1.26ns        ? B/sec    1.12    385.0±0.77ns        ? B/sec
take bool nulls 1024       1.15      4.2±0.03µs        ? B/sec    1.00      3.7±0.02µs        ? B/sec

I don't believe that the list/binary/string array would be involved in most of the benchmarks listed here? (multiply, subtract, take bool, add, add_nulls, bench_primitive should be operations on boolean/primitive arrays, which have no functional changes)...

jorgecarleitao · 2021-01-23T18:37:15Z

@tyrelr , I believe that you may have to run the benches against the latest master vs your branch rebased, for consistency: 499b6d0 is before a significant change to the mutable buffer's performance. E.g. this code does not touch add, but there is a 40% difference in the benches (consistent with the PR that optimizes that kernel.

tyrelr · 2021-01-24T04:11:17Z

I've rebased locally and will be running benchmarks overnight.

tyrelr · 2021-01-24T14:57:06Z

I'm still seeing a mix of inconsistent performance hits/bumps after the rebase.

critcmp master-67d0c2e38 array-slice-83b8938af -t 10
group                        array-slice-83b8938af                  master-67d0c2e38
-----                        ---------------------                  ----------------
array_slice 512              1.11    127.8±0.36ns        ? B/sec    1.00    115.4±0.15ns        ? B/sec
cast int64 to int32 512      1.00      2.3±0.00µs        ? B/sec    1.11      2.5±0.01µs        ? B/sec
concat i32 1024              1.00      2.5±0.00µs        ? B/sec    1.19      3.0±0.01µs        ? B/sec
equal_512                    1.14     46.7±0.18ns        ? B/sec    1.00     41.1±0.05ns        ? B/sec
like_utf8 scalar complex     1.00   1075.9±2.54µs        ? B/sec    1.16   1249.6±2.29µs        ? B/sec
like_utf8 scalar equals      1.00     70.8±0.08µs        ? B/sec    1.25     88.2±0.12µs        ? B/sec
min nulls string 512         1.14      6.5±0.04µs        ? B/sec    1.00      5.7±0.03µs        ? B/sec
min string 512               1.00      3.4±0.00µs        ? B/sec    1.28      4.3±0.01µs        ? B/sec
multiply 512                 1.38    346.6±0.37ns        ? B/sec    1.00    251.2±0.49ns        ? B/sec
nlike_utf8 scalar complex    1.00   1162.1±1.30µs        ? B/sec    1.12   1300.7±1.45µs        ? B/sec
take bool nulls 1024         1.00      3.7±0.03µs        ? B/sec    1.38      5.1±0.03µs        ? B/sec
take bool nulls 512          1.00   1720.3±8.69ns        ? B/sec    1.42      2.5±0.02µs        ? B/sec
take i32 512                 1.11   1023.5±1.61ns        ? B/sec    1.00    918.6±1.07ns        ? B/sec
take i32 nulls 512           1.00    989.6±1.31ns        ? B/sec    1.10   1089.6±2.06ns        ? B/sec

I won't push my rebase up unless we decide on some further tweaks to make, as it doesn't seem worth forcing a re-review since there were no conflicts.

jorgecarleitao · 2021-01-26T17:46:25Z

@tyrelr , could you rebase against master?

@kszucs , something may have happened to the force push: the commits seem funny.

kszucs · 2021-01-26T18:52:31Z

Could you please rebase on top of master?

…nctions providing a slice of the offsets. Make value() functions check bounds to make them safe, add unsafe variants for callers willing to check bounds ahead of time.

…en proven in bounds

github-actions bot added the Component: Rust label Jan 16, 2021

jorgecarleitao approved these changes Jan 17, 2021

View reviewed changes

alamb reviewed Jan 20, 2021

View reviewed changes

alamb added the needs-rebase A PR that needs to be rebased by the author label Jan 20, 2021

tyrelr force-pushed the array_slice_accessors branch from 47086b8 to 5cd8075 Compare January 21, 2021 01:27

jorgecarleitao requested changes Jan 21, 2021

View reviewed changes

tyrelr force-pushed the array_slice_accessors branch from 8181d2b to 41c29bc Compare January 22, 2021 06:36

jorgecarleitao self-requested a review January 23, 2021 13:36

jorgecarleitao approved these changes Jan 23, 2021

View reviewed changes

This comment has been minimized.

Sign in to view

tyrelr commented Jan 23, 2021

View reviewed changes

rust/arrow/src/array/array_string.rs Outdated Show resolved Hide resolved

jorgecarleitao removed the needs-rebase A PR that needs to be rebased by the author label Jan 24, 2021

kszucs force-pushed the master branch from d238279 to 437c8c9 Compare January 25, 2021 21:53

tyrelr added 2 commits January 27, 2021 07:27

Replace unsafe value_offset(i) functions with safe value_offsets() fu…

0ef4bb4

…nctions providing a slice of the offsets. Make value() functions check bounds to make them safe, add unsafe variants for callers willing to check bounds ahead of time.

use unsafe access to the offset array where the offset has already be…

dda0545

…en proven in bounds

tyrelr force-pushed the array_slice_accessors branch from e18e319 to dda0545 Compare January 27, 2021 13:31

jorgecarleitao closed this in 2fd5857 Jan 28, 2021

jorgecarleitao mentioned this pull request Jan 28, 2021

ARROW-11420: [Rust] Added support to length of Binary and List. #9353

Closed

asfimport mentioned this pull request Jan 28, 2021

[Rust] Use slices for simple array data buffer access #27171

Closed

ARROW-11270: [Rust] Array slice accessors #9215

ARROW-11270: [Rust] Array slice accessors #9215

Uh oh!

Conversation

tyrelr commented Jan 16, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Jan 16, 2021

Uh oh!

codecov-io commented Jan 16, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jorgecarleitao left a comment

Choose a reason for hiding this comment

Uh oh!

alamb commented Jan 17, 2021

Uh oh!

alamb commented Jan 20, 2021

Uh oh!

jorgecarleitao commented Jan 20, 2021

Uh oh!

alamb commented Jan 20, 2021

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

tyrelr commented Jan 21, 2021

Uh oh!

jorgecarleitao left a comment

Choose a reason for hiding this comment

Uh oh!

jorgecarleitao Jan 21, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tyrelr Jan 21, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

This comment was marked as outdated.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Reasons to drop value(i) -> T::Native

Reasons to keep value(i) -> T::Native

Summary

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jorgecarleitao left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

This comment has been minimized.

Uh oh!

tyrelr commented Jan 23, 2021

Uh oh!

jorgecarleitao commented Jan 23, 2021

Uh oh!

tyrelr commented Jan 24, 2021

Uh oh!

tyrelr commented Jan 24, 2021

Uh oh!

jorgecarleitao commented Jan 26, 2021

Uh oh!

kszucs commented Jan 26, 2021

Uh oh!

Reviewers

tyrelr commented Jan 16, 2021 •

edited

Loading

codecov-io commented Jan 16, 2021 •

edited

Loading

jorgecarleitao Jan 21, 2021 •

edited

Loading

tyrelr Jan 21, 2021 •

edited

Loading

jorgecarleitao left a comment •

edited

Loading