-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-11270: [Rust] Array slice accessors #9215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #9215 +/- ##
==========================================
+ Coverage 81.89% 81.92% +0.03%
==========================================
Files 215 215
Lines 52988 53070 +82
==========================================
+ Hits 43392 43478 +86
+ Misses 9596 9592 -4
Continue to review full report at Codecov.
|
jorgecarleitao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Great simplification and really called for :)
With this change, I think that we may be able to remove an unsafe in compute::kernels::length where we re-interpret the offset buffer as type Offset.
|
I apologize for the delay in merging Rust PRs -- the 3.0 release is being finalized now and are planning to minimize entropy by postponing merging changes not critical for the release until the process was complete. I hope the process is complete in the next few days. There is more discussion in the mailing list |
|
@jorgecarleitao do you think this PR is ready to merge? Does it need any additional review or rebasing? |
|
Ready to merge. Note that it has some backward incompatible changes, so PRs on top need to be merged carefully. |
|
Roger -- I'll plan to merge it later today (after the current round of merges have gotten through CI) |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I merged this branch to apache/master locally and I got a compilation error. I think this is a logical conflict introduced with a bunch of the other PRs that have been merged over the last few days.
@tyrelr can you possibly rebase and fix the conflicts?
Thanks again for your help,
Andrew
(arrow_dev) alamb@MacBook-Pro:~/Software/arrow2/rust$ cargo test --all
Compiling parquet v3.0.0-SNAPSHOT (/Users/alamb/Software/arrow2/rust/parquet)
Compiling arrow v3.0.0-SNAPSHOT (/Users/alamb/Software/arrow2/rust/arrow)
error[E0599]: no method named `to_byte_slice` found for array `[{integer}; 10]` in the current scope
--> arrow/src/array/array_list.rs:710:49
|
710 | &[0, 1, 2, 3, 4, 5, 6, 7, 8, 9].to_byte_slice(),
| ^^^^^^^^^^^^^ method not found in `[{integer}; 10]`
|
= help: items from traits can only be used if the trait is in scope
= note: the following trait is implemented but not in scope; perhaps add a `use` for it:
`use crate::datatypes::ToByteSlice;`
error[E0599]: no method named `to_byte_slice` found for array `[i64; 10]` in the current scope
--> arrow/src/array/array_list.rs:717:62
|
717 | Buffer::from(&[0i64, 2, 2, 2, 4, 6, 6, 9, 9, 10].to_byte_slice());
| ^^^^^^^^^^^^^ method not found in `[i64; 10]`
|
= help: items from traits can only be used if the trait is in scope
= note: the following trait is implemented but not in scope; perhaps add a `use` for it:
`use crate::datatypes::ToByteSlice;`
error[E0599]: no method named `to_byte_slice` found for array `[{integer}; 10]` in the current scope
--> arrow/src/array/array_list.rs:807:49
|
807 | &[0, 1, 2, 3, 4, 5, 6, 7, 8, 9].to_byte_slice(),
| ^^^^^^^^^^^^^ method not found in `[{integer}; 10]`
|
= help: items from traits can only be used if the trait is in scope
= note: the following trait is implemented but not in scope; perhaps add a `use` for it:
`use crate::datatypes::ToByteSlice;`
Compiling arrow-flight v3.0.0-SNAPSHOT (/Users/alamb/Software/arrow2/rust/arrow-flight)
error: aborting due to 3 previous errors
For more information about this error, try `rustc --explain E0599`.
error: could not compile `arrow`
To learn more, run the command again with --verbose.
warning: build failed, waiting for other jobs to finish...
error: build failed
47086b8 to
5cd8075
Compare
|
Rebased. But I didn't hit the merge conflicts during... I can see that a number of tests were swapped to use Buffer::from_slice_ref()... so I'll add another commit to switch to that function just to be safe. |
jorgecarleitao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a performance concern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about it, I am not sure we want this: this will have severe implications to almost all uses of value(). Could you run the benches before and after this PR?
My suggestion is that we mark this method as unsafe, so that it is clear that people need to be conscious about it. Since this will impact a large part of our code base (and likely others), I suggest that we do this change on a separate PR.
I think that we would like to instruct users of this API to switch to values(). This change will not trigger that change (as the code continues to run), but will have an amazing performance implication.
Think
for i in 0..array.len() {
if array.value(i) > 2 {
// do x
};
}this loop will suffer a lot from this change.
we would like users to change it to
array.values().for_each(|value| {
if value > 2 {
// do x
};
});For that, we need to mark array.value(i) as unsafe to indicate that yes, you can use that method, yes, it is fast, but you need to be careful about i.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, that we want to guide API users to switch to values(). My hesitancy with making/leaving it unsafe is that I think it's the only Array value(x) method when combined with the other changes in this PR (which is why it got added as a followon commit). That leads to awkwardness in macros, such as the comparison kernel, where it effectively calls "$left.value(x)". Making macro style invocation sometimes-need-unsafe is a bit unfortunate.
(There may be an argument that a trait could abstract away the need for macros altogether... but my experiments to rework the comparison kernel to that effect are incomplete - and partly sidelined with various other experiments I'm toying with... most of which somehow relate to removing direct calls to __Array.value(i) )
I would expect that in the exact for loop you mention, the compiler could recognize the satisfied loop bound since both len() and .value(i) are inlinable and use the same underlying field, but I expect it is also easy to complicate on that loop to defeat that by having a less direct relationship between i & array.len(). The pretty printing logic probably would do additional bounds checks now, as it iterates based on len() from the record batch, which is semantically equal, but nothing in the code enforces that in a way I would expect the compiler to understand.
I'll launch some benchmarks of head, master, and the commit just before the PrimitiveArray safety change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we may need to park this change until we migrate our code base to use values() whenever possible.
If we merge this one, we get a major performance degradation. If we add unsafe, we need to add unsafe in a lot of places. Neither are great options :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I'll rebase dropping that commit.
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that I covered all of those in #9291 . Good that there is no impact on sort, which was the only case where I could not optimize (i.e. we really needed the bound check).
The larger task atm is to convert all unit-tests that use value(i) to values()[i].
Fyi, the reason I propose keeping value(i) as unsafe instead of un-performant is that there is no way to efficiently access the value i without knowing the internals of the primitiveArray (i.e. pointer and offset). imo this method does serve the purpose of abstracting out those details. It just happens to be unsafe and we currently not marking it as such. However, please say if you disagree here.
wrt to the comparison kernel, we can create a generic for primitives alone for that use-case, no?
Let me know what do you think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#9291 is good progress towards eliminating use of the function.
And certainly, we could 'split' the macro for primitives as a quick fix to get rid of the call to the function. I've been experimenting with an alternative approach that might be a bit more flexible to multiple use cases, described at the bottom of this comment.
I am quite torn about whether I think value should or should not be in the interface.
Reasons to drop value(i) -> T::Native
I think that even if value(i) was dropped from the PrimitiveArray impl's, efficient random access to items without a bounds check can still be achieved through unsafe{*primitive_array.values().get_unchecked(i)} (the extra * because get_unchecked() returns a ref to the value).
I'm not sure I have any example code or measurements to demonstrate it on hand, but I am certain I saw the silently-unsafe implementationx.values().iter().zip(y.values().iter()) did (slightly) outperform (0..x.len()).map(|i|{x.value(i),y.value(i)}. I believe it was when I was playing with non-simd arithmetic kernels.... So that is the root of my hesitancy, is I'm worried it doesn't actually escape any overhead, and unintentionally lead people away from a more reliable/performant way. IF there is a context where unsafe{x.value(i)} beats the performance of unsafe{*x.values().get_unchecked(i)}
Reasons to keep value(i) -> T::Native
All other array implementations have value functions as far as I recall, so it is a nice 'consistency'.
In the back of my mind, the biggest argument to keep value(i) is for api consistency... so long term, a 'trait' may be the place where it might fit best? Very roughly, I'm thinking:
trait TypedArrowArray : ArrowArray {
type RefType;
fn is_valid(i:usize) -> bool; //bounds check
unsafe fn is_valid(i:usize) -> bool; //no bounds check
fn value(i:usize) -> RefType; //bounds check
unsafe fn value_unchecked(i:usize) -> RefType; //no bounds checked
fn iter() -> impl Iterator<Option<RefType>>;
fn iter_values() -> impl Iterator<RefType>;
}
impl <T: ArrowPrimitiveType> TypedArrowArray<&T::Native> for PrimitiveArray<T> { ... }
impl TypedArrowArray<ArrayRef> for GenericListArray<T> { ... }
//and similar for string/binary. ... I am not sure whether struct arrays could fit... Dictionary would not give access to 'keys', only to the values referenced by each key? Union would require some kind of RefType that can downcast into the actual value?
Of course, I am uncertain how much overhead the 'standarization' such a trait impl implies would bring... would any kernels actually benefit from using generic implementations against such an api, or will they always go down to the concrete type to squeeze little short-cuts out that don't fit in the generic interface? I'm unsure, so very (very, very) slowly experimenting...
Summary
So in short, my thoughts are:
- I think that leaving value(i) safety consideration out of this PR makes sense. I've rebased to drop that out - although I did leave the additional values() test code.
- Marking it unsafe in the near future is absolutely better than being silently-unsafe. The argument that adding bounds-checks could silently impact external users is reasonable, taking unsafe has the larger 'warning' so that the change isn't missed.
- Longer term, the options of deprecating it, or explicitly moving it into an trait impl are both contenders in my mind... but neither option is directly relevant to this PR.
Let me know if that seems reasonable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just looking over the benchmarks again, one of the biggest changes is actually a performance improvement while doing safe access... which makes very little sense... so I think I'll run the benchmarks again overnight.
8181d2b to
41c29bc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rust/arrow/src/array/array_string.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you think of naming it simply offsets? I can't see the value of value_ on the name (same for the other types). Since we are touching the API in backward incompatible ways... xD
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had considered that, but was concerned that it may be confused with the buffer.offset... my hope was that it'd make it clear that it's an "offset into values()" (not into the buffer, which has the additional array-slicing offset).
I could go either way with it though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with removing the value part - we should do it in another PR though.
This comment has been minimized.
This comment has been minimized.
|
The 10% performance impact looks like mostly improvements: I don't believe that the list/binary/string array would be involved in most of the benchmarks listed here? (multiply, subtract, take bool, add, add_nulls, bench_primitive should be operations on boolean/primitive arrays, which have no functional changes)... |
|
@tyrelr , I believe that you may have to run the benches against the latest master vs your branch rebased, for consistency: 499b6d0 is before a significant change to the mutable buffer's performance. E.g. this code does not touch |
|
I've rebased locally and will be running benchmarks overnight. |
|
I'm still seeing a mix of inconsistent performance hits/bumps after the rebase. I won't push my rebase up unless we decide on some further tweaks to make, as it doesn't seem worth forcing a re-review since there were no conflicts. |
|
Could you please rebase on top of master? |
…nctions providing a slice of the offsets. Make value() functions check bounds to make them safe, add unsafe variants for callers willing to check bounds ahead of time.
…en proven in bounds
e18e319 to
dda0545
Compare
Using an approach similar to ARROW-10989, migrate typed array API's to use slices where they can.
This impacts the API of GenericBinaryArray<>, GenericListArray<>, GenericStringArray<>
This also enables bounds checking in every value() function on each of the above arrays, with an unsafe value_unchecked() provided for callers willing to do their own bounds-checking.