-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-10903 [Rust] Implement FromIter<Option<Vec<u8>>> constructor for FixedSizeBinaryArray #9647
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
jorgecarleitao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @ivanvankov . Thank you for your interest in this project and for your first PR!
I went through this PR and I think that you have the right ideas. I like the API for the try_from_iter, I arrived to a similar API independently :-)
There are some comments that I would like to add here:
-
in Arrow, the null bitmap is stored as bytes, where each byte represents 8 "items" of the array. In this context,
null_buf.push(0)won't work because even though the type is correct (u8), we are adding 1u8per item, not 1u8per 8 items. Some bitmap logic is needed here. I think that thePrimitiveArray::from_iteralready has that implemented, so you can borrow that. -
the number of items of a
FixedSizeBinaryArrayincludes itssize, i.e. the invariants here are:
array.len() % size == 0;
array.len() == array.buffer()[0].len()
array.len() <= array.nulls().len() * 8 (we may have an extra byte for len not divisible by 8)Note that when these are not fulfilled, we unfortunately reach to undefined behavior.
-
I did not understand the reason for the
unsafechunk about copying. Can't we useMutableBuffer::extend_from_slice? -
I think that it would be beneficial to document the difference between
try_from_sparse_iterandtry_from_itera bit better. I can't tell the difference from the docs, and the tests seem to be the same, no?
Regardless, overall this is heading in the right direction. We may also do a quick search on the tests to see if we can replace some custom array creations by this API.
|
BTW the test workspace check is also failing on master, so it may not be related to this PR. See more details on https://issues.apache.org/jira/browse/ARROW-11896 The integration test failure looks strange (we had fixed something like that previously and it looks like your branch has the fix). I retriggered the test jobs to see if I could get a passing run |
|
Hi @jorgecarleitao, and thank you for the extended review and explanations.
Feel free to add more remarks if you have any. Thanks. |
jorgecarleitao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does actually push just one byte per each 8 items. But here I've noticed a mistake :-)
null_buf.push(0) actually pushes 4 bytes, so I'll change it to null_buf.push(0u8) to fix this. The invariants are still held though.
You are obviously right. Sorry about that.
Regarding invariants I don't see anything that breaks them in my implementation. If you see anything please point to it.
I am sorry, I should have been more explicit here; I've added an inline comment where I think there may be an issue.
I think, for more performance it would be better to prepend immediately after we determined size, not after iterator finished.
I also think so, and also because no unsafe would be used. Our guidelines about unsafe is that there must be good justification for it, and I am trying to understand what is the justification exactly.
E.g. since we have to validate that every entries' size are equal, couldn't we do it directly? e.g. something like this:
Some(v) => { // item is not null
let bytes = *v;
if let Some(size) = size {
if size != bytes.len() {
return Err(ArrowError::...);
}
} else {
// first time we see a len
size = Some(bytes.len());
self.values
.extend_constant(&vec![0; bytes.len() * current_validity]);
};
self.values.extend_from_slice(bytes);
self.validity.push(true);where current_validity tracks the number of nones up to when we get a size, and size: Option<usize> represents the current size (if known).
rust/arrow/src/array/array_binary.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider an iterator of size 4 whose every element is Some(&[0, 0, 0]).
I think that len is being increased by 1 on every group of size size. In the example above, won't len equal to 4 instead of 12?
I think that this should be len * size
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it will be equal to 4. But in the previous implementation, for example impl From<Vec<Vec<u8>>> for FixedSizeBinaryArray, when we pass for example a vec with 4 items to from then it will put 4 as len in the array builder. Isn't it the same there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right (always learning something new here). In FixedSizeBinary[3] , each item is [a, b, c], pretty much like in an ListArray, and thus the len should be 4, not 12. So, the invariant is actually the other way around:
array.buffer()[0].len() % size == 0
array.len() * size == array.buffer()[0].len()
array.validity().len() == array.len()Sorry about the noise and thank you for your patience 🥇
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work you do 👍
You are actually absolutely right, no |
|
FYI I merged #9653 / ARROW-11896 for the Rust CI checks which may affect this PR. If you see "Rust / AMD64 Debian 10 Rust stable test workspace" failing with a linker error or no logs, rebasing against master will hopefully fix the problem |
Codecov Report
@@ Coverage Diff @@
## master #9647 +/- ##
==========================================
- Coverage 82.49% 82.49% -0.01%
==========================================
Files 245 245
Lines 57347 57380 +33
==========================================
+ Hits 47311 47335 +24
- Misses 10036 10045 +9
Continue to review full report at Codecov.
|
…or FixedSizeBinaryArray
|
Sorry, closed this one by mistake. Let me know if anything else is to be done here. Thank you! |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was just trolling through past PRs and I noticed this one seems to have gotten stuck. @ivanvankov this PR looks good from my perspective -- can you please rebase it against apache/master?
@jorgecarleitao any other comments?
jorgecarleitao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, this is ready to merge, yes 👍 Thanks a lot @ivanvankov for this!
|
I merged this code with apache/master locally and ran |
This Pr is follow up for #8853 (comment) .
I was not able to utilize
TryFrombecause of conflicting implementations, so instead I created two new functionstry_from_sparse_iterandtry_from_iterin place ofimpl From<Vec<Vec<u8>>> for FixedSizeBinaryArrayandimpl From<Vec<Option<Vec<u8>>>> for FixedSizeBinaryArray