ARROW-10903 [Rust] Implement FromIter<Option<Vec<u8>>> constructor for FixedSizeBinaryArray #9647

ivanvankov · 2021-03-06T12:24:43Z

This Pr is follow up for #8853 (comment) .
I was not able to utilize TryFrom because of conflicting implementations, so instead I created two new functions try_from_sparse_iter and try_from_iter in place of impl From<Vec<Vec<u8>>> for FixedSizeBinaryArray and impl From<Vec<Option<Vec<u8>>>> for FixedSizeBinaryArray

github-actions · 2021-03-06T12:25:03Z

https://issues.apache.org/jira/browse/ARROW-10903

jorgecarleitao

Hi @ivanvankov . Thank you for your interest in this project and for your first PR!

I went through this PR and I think that you have the right ideas. I like the API for the try_from_iter, I arrived to a similar API independently :-)

There are some comments that I would like to add here:

in Arrow, the null bitmap is stored as bytes, where each byte represents 8 "items" of the array. In this context, null_buf.push(0) won't work because even though the type is correct (u8), we are adding 1 u8 per item, not 1 u8 per 8 items. Some bitmap logic is needed here. I think that the PrimitiveArray::from_iter already has that implemented, so you can borrow that.
the number of items of a FixedSizeBinaryArray includes its size, i.e. the invariants here are:

array.len() % size == 0;
array.len() == array.buffer()[0].len()
array.len() <= array.nulls().len() * 8 (we may have an extra byte for len not divisible by 8)

Note that when these are not fulfilled, we unfortunately reach to undefined behavior.

I did not understand the reason for the unsafe chunk about copying. Can't we use MutableBuffer::extend_from_slice?
I think that it would be beneficial to document the difference between try_from_sparse_iter and try_from_iter a bit better. I can't tell the difference from the docs, and the tests seem to be the same, no?

Regardless, overall this is heading in the right direction. We may also do a quick search on the tests to see if we can replace some custom array creations by this API.

alamb · 2021-03-07T11:50:54Z

BTW the test workspace check is also failing on master, so it may not be related to this PR. See more details on https://issues.apache.org/jira/browse/ARROW-11896

The integration test failure looks strange (we had fixed something like that previously and it looks like your branch has the fix). I retriggered the test jobs to see if I could get a passing run

ivanvankov · 2021-03-07T14:11:37Z

Hi @jorgecarleitao, and thank you for the extended review and explanations.
Now, regarding comments you have:

It does actually push just one byte per each 8 items. But here I've noticed a mistake :-)
null_buf.push(0) actually pushes 4 bytes, so I'll change it to null_buf.push(0u8) to fix this. The invariants are still held though.
Regarding invariants I don't see anything that breaks them in my implementation. If you see anything please point to it.
The part with unsafe is for the case when iterator returns None as first item. Since we need to add to buffer a sequence of zeros of length size we need to know size to do that. But we still don't know size value, we need to get to the first Some to determine it. So, I only save count of None elements until iterator returns first Some, and don't add to buffer anything until then. When iterator stops we need to add size * number of None zeros in the beginning of buffer, and for that the section with usafe is added, just to copy from one position to other and put zeros. So, I need to prepend data, not to append, that's why MutableBuffer::extend_from_slice can't be used.
I think, for more performance it would be better to prepend immediately after we determined size, not after iterator finished.
I'll add examples to documentation section for the functions.

Feel free to add more remarks if you have any. Thanks.

jorgecarleitao

It does actually push just one byte per each 8 items. But here I've noticed a mistake :-)
null_buf.push(0) actually pushes 4 bytes, so I'll change it to null_buf.push(0u8) to fix this. The invariants are still held though.

You are obviously right. Sorry about that.

Regarding invariants I don't see anything that breaks them in my implementation. If you see anything please point to it.

I am sorry, I should have been more explicit here; I've added an inline comment where I think there may be an issue.

I think, for more performance it would be better to prepend immediately after we determined size, not after iterator finished.

I also think so, and also because no unsafe would be used. Our guidelines about unsafe is that there must be good justification for it, and I am trying to understand what is the justification exactly.

E.g. since we have to validate that every entries' size are equal, couldn't we do it directly? e.g. something like this:

         Some(v) => {  // item is not null
                let bytes = *v;
                if let Some(size) = size {
                    if size != bytes.len() {
                        return Err(ArrowError::...);
                    }
                } else {
                    // first time we see a len
                    size = Some(bytes.len());
                    self.values
                        .extend_constant(&vec![0; bytes.len() * current_validity]);
                };
                self.values.extend_from_slice(bytes);
                self.validity.push(true);

where current_validity tracks the number of nones up to when we get a size, and size: Option<usize> represents the current size (if known).

jorgecarleitao · 2021-03-07T14:30:58Z

rust/arrow/src/array/array_binary.rs

Consider an iterator of size 4 whose every element is Some(&[0, 0, 0]).

I think that len is being increased by 1 on every group of size size. In the example above, won't len equal to 4 instead of 12?

I think that this should be len * size

Yes, it will be equal to 4. But in the previous implementation, for example impl From<Vec<Vec<u8>>> for FixedSizeBinaryArray, when we pass for example a vec with 4 items to from then it will put 4 as len in the array builder. Isn't it the same there?

You are right (always learning something new here). In FixedSizeBinary[3] , each item is [a, b, c], pretty much like in an ListArray, and thus the len should be 4, not 12. So, the invariant is actually the other way around:

array.buffer()[0].len() % size == 0 array.len() * size == array.buffer()[0].len() array.validity().len() == array.len()

Sorry about the noise and thank you for your patience 🥇

Great work you do 👍

ivanvankov · 2021-03-07T15:31:50Z

I also think so, and also because no unsafe would be used. Our guidelines about unsafe is that there must be good justification for it, and I am trying to understand what is the justification exactly.

E.g. since we have to validate that every entries' size are equal, couldn't we do it directly? e.g. something like this:

         Some(v) => {  // item is not null
                let bytes = *v;
                if let Some(size) = size {
                    if size != bytes.len() {
                        return Err(ArrowError::...);
                    }
                } else {
                    // first time we see a len
                    size = Some(bytes.len());
                    self.values
                        .extend_constant(&vec![0; bytes.len() * current_validity]);
                };
                self.values.extend_from_slice(bytes);
                self.validity.push(true);

where current_validity tracks the number of nones up to when we get a size, and size: Option<usize> represents the current size (if known).

You are actually absolutely right, no unsafe needed. Thanks!

alamb · 2021-03-07T17:57:43Z

FYI I merged #9653 / ARROW-11896 for the Rust CI checks which may affect this PR. If you see "Rust / AMD64 Debian 10 Rust stable test workspace" failing with a linker error or no logs, rebasing against master will hopefully fix the problem

codecov-io · 2021-03-07T18:12:19Z

Codecov Report

Merging #9647 (cb69ca1) into master (bfa99d9) will decrease coverage by 0.00%.
The diff coverage is 82.85%.

@@            Coverage Diff             @@
##           master    #9647      +/-   ##
==========================================
- Coverage   82.49%   82.49%   -0.01%     
==========================================
  Files         245      245              
  Lines       57347    57380      +33     
==========================================
+ Hits        47311    47335      +24     
- Misses      10036    10045       +9

Impacted Files	Coverage Δ
rust/arrow/src/array/array_binary.rs	`90.35% <80.32%> (-1.47%)`	⬇️
rust/arrow/src/array/transform/mod.rs	`93.21% <100.00%> (+0.02%)`	⬆️
rust/arrow/src/array/equal/utils.rs	`76.47% <0.00%> (+0.98%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bfa99d9...cb69ca1. Read the comment docs.

…or FixedSizeBinaryArray

ivanvankov · 2021-03-08T20:37:57Z

Sorry, closed this one by mistake. Let me know if anything else is to be done here. Thank you!

alamb

I was just trolling through past PRs and I noticed this one seems to have gotten stuck. @ivanvankov this PR looks good from my perspective -- can you please rebase it against apache/master?

@jorgecarleitao any other comments?

jorgecarleitao

Sorry, this is ready to merge, yes 👍 Thanks a lot @ivanvankov for this!

alamb · 2021-03-18T13:41:58Z

I merged this code with apache/master locally and ran cargo test --all and it all looks good to me. Thus I am going to merge this one in. Thanks again @ivanvankov and sorry for the delay!

github-actions bot added the Component: Rust label Mar 6, 2021

jorgecarleitao reviewed Mar 7, 2021

View reviewed changes

ivanvankov closed this Mar 7, 2021

ivanvankov force-pushed the ARROW-10903 branch from cb69ca1 to 69d436d Compare March 7, 2021 19:44

ARROW-10903: [Rust] Implement FromIter<Option<Vec<u8>>> constructor f…

d0e0a46

…or FixedSizeBinaryArray

ivanvankov reopened this Mar 7, 2021

alamb approved these changes Mar 18, 2021

View reviewed changes

jorgecarleitao approved these changes Mar 18, 2021

View reviewed changes

alamb closed this in 775a714 Mar 18, 2021

alamb mentioned this pull request Mar 19, 2021

ARROW-12015: [Rust] Fix Clippy #9754

Closed

asfimport mentioned this pull request Mar 18, 2021

[Rust] Implement FromIter<Option<Vec<u8>>> constructor for FixedSizeBinaryArray #26835

Closed

ARROW-10903 [Rust] Implement FromIter<Option<Vec<u8>>> constructor for FixedSizeBinaryArray #9647

ARROW-10903 [Rust] Implement FromIter<Option<Vec<u8>>> constructor for FixedSizeBinaryArray #9647

Uh oh!

Conversation

ivanvankov commented Mar 6, 2021

Uh oh!

github-actions bot commented Mar 6, 2021

Uh oh!

jorgecarleitao left a comment

Choose a reason for hiding this comment

Uh oh!

alamb commented Mar 7, 2021

Uh oh!

ivanvankov commented Mar 7, 2021

Uh oh!

jorgecarleitao left a comment

Choose a reason for hiding this comment

Uh oh!

jorgecarleitao Mar 7, 2021

Choose a reason for hiding this comment

Uh oh!

ivanvankov Mar 7, 2021

Choose a reason for hiding this comment

Uh oh!

jorgecarleitao Mar 7, 2021

Choose a reason for hiding this comment

Uh oh!

ivanvankov Mar 7, 2021

Choose a reason for hiding this comment

Uh oh!

ivanvankov commented Mar 7, 2021

Uh oh!

alamb commented Mar 7, 2021

Uh oh!

codecov-io commented Mar 7, 2021

Codecov Report

Uh oh!

ivanvankov commented Mar 8, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

jorgecarleitao left a comment

Choose a reason for hiding this comment

Uh oh!

alamb commented Mar 18, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ivanvankov commented Mar 8, 2021 •

edited

Loading