Skip to content

Conversation

@ivanvankov
Copy link

This Pr is follow up for #8853 (comment) .
I was not able to utilize TryFrom because of conflicting implementations, so instead I created two new functions try_from_sparse_iter and try_from_iter in place of impl From<Vec<Vec<u8>>> for FixedSizeBinaryArray and impl From<Vec<Option<Vec<u8>>>> for FixedSizeBinaryArray

@github-actions
Copy link

github-actions bot commented Mar 6, 2021

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ivanvankov . Thank you for your interest in this project and for your first PR!

I went through this PR and I think that you have the right ideas. I like the API for the try_from_iter, I arrived to a similar API independently :-)

There are some comments that I would like to add here:

  • in Arrow, the null bitmap is stored as bytes, where each byte represents 8 "items" of the array. In this context, null_buf.push(0) won't work because even though the type is correct (u8), we are adding 1 u8 per item, not 1 u8 per 8 items. Some bitmap logic is needed here. I think that the PrimitiveArray::from_iter already has that implemented, so you can borrow that.

  • the number of items of a FixedSizeBinaryArray includes its size, i.e. the invariants here are:

array.len() % size == 0;
array.len() == array.buffer()[0].len()
array.len() <= array.nulls().len() * 8 (we may have an extra byte for len not divisible by 8)

Note that when these are not fulfilled, we unfortunately reach to undefined behavior.

  • I did not understand the reason for the unsafe chunk about copying. Can't we use MutableBuffer::extend_from_slice?

  • I think that it would be beneficial to document the difference between try_from_sparse_iter and try_from_iter a bit better. I can't tell the difference from the docs, and the tests seem to be the same, no?

Regardless, overall this is heading in the right direction. We may also do a quick search on the tests to see if we can replace some custom array creations by this API.

@alamb
Copy link
Contributor

alamb commented Mar 7, 2021

BTW the test workspace check is also failing on master, so it may not be related to this PR. See more details on https://issues.apache.org/jira/browse/ARROW-11896

The integration test failure looks strange (we had fixed something like that previously and it looks like your branch has the fix). I retriggered the test jobs to see if I could get a passing run

@ivanvankov
Copy link
Author

Hi @jorgecarleitao, and thank you for the extended review and explanations.
Now, regarding comments you have:

  • It does actually push just one byte per each 8 items. But here I've noticed a mistake :-)
    null_buf.push(0) actually pushes 4 bytes, so I'll change it to null_buf.push(0u8) to fix this. The invariants are still held though.

  • Regarding invariants I don't see anything that breaks them in my implementation. If you see anything please point to it.

  • The part with unsafe is for the case when iterator returns None as first item. Since we need to add to buffer a sequence of zeros of length size we need to know size to do that. But we still don't know size value, we need to get to the first Some to determine it. So, I only save count of None elements until iterator returns first Some, and don't add to buffer anything until then. When iterator stops we need to add size * number of None zeros in the beginning of buffer, and for that the section with usafe is added, just to copy from one position to other and put zeros. So, I need to prepend data, not to append, that's why MutableBuffer::extend_from_slice can't be used.
    I think, for more performance it would be better to prepend immediately after we determined size, not after iterator finished.

  • I'll add examples to documentation section for the functions.

Feel free to add more remarks if you have any. Thanks.

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does actually push just one byte per each 8 items. But here I've noticed a mistake :-)
null_buf.push(0) actually pushes 4 bytes, so I'll change it to null_buf.push(0u8) to fix this. The invariants are still held though.

You are obviously right. Sorry about that.

Regarding invariants I don't see anything that breaks them in my implementation. If you see anything please point to it.

I am sorry, I should have been more explicit here; I've added an inline comment where I think there may be an issue.

I think, for more performance it would be better to prepend immediately after we determined size, not after iterator finished.

I also think so, and also because no unsafe would be used. Our guidelines about unsafe is that there must be good justification for it, and I am trying to understand what is the justification exactly.

E.g. since we have to validate that every entries' size are equal, couldn't we do it directly? e.g. something like this:

         Some(v) => {  // item is not null
                let bytes = *v;
                if let Some(size) = size {
                    if size != bytes.len() {
                        return Err(ArrowError::...);
                    }
                } else {
                    // first time we see a len
                    size = Some(bytes.len());
                    self.values
                        .extend_constant(&vec![0; bytes.len() * current_validity]);
                };
                self.values.extend_from_slice(bytes);
                self.validity.push(true);

where current_validity tracks the number of nones up to when we get a size, and size: Option<usize> represents the current size (if known).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider an iterator of size 4 whose every element is Some(&[0, 0, 0]).

I think that len is being increased by 1 on every group of size size. In the example above, won't len equal to 4 instead of 12?

I think that this should be len * size

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it will be equal to 4. But in the previous implementation, for example impl From<Vec<Vec<u8>>> for FixedSizeBinaryArray, when we pass for example a vec with 4 items to from then it will put 4 as len in the array builder. Isn't it the same there?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right (always learning something new here). In FixedSizeBinary[3] , each item is [a, b, c], pretty much like in an ListArray, and thus the len should be 4, not 12. So, the invariant is actually the other way around:

array.buffer()[0].len() % size == 0
array.len() * size == array.buffer()[0].len()
array.validity().len() == array.len()

Sorry about the noise and thank you for your patience 🥇

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work you do 👍

@ivanvankov
Copy link
Author

I also think so, and also because no unsafe would be used. Our guidelines about unsafe is that there must be good justification for it, and I am trying to understand what is the justification exactly.

E.g. since we have to validate that every entries' size are equal, couldn't we do it directly? e.g. something like this:

         Some(v) => {  // item is not null
                let bytes = *v;
                if let Some(size) = size {
                    if size != bytes.len() {
                        return Err(ArrowError::...);
                    }
                } else {
                    // first time we see a len
                    size = Some(bytes.len());
                    self.values
                        .extend_constant(&vec![0; bytes.len() * current_validity]);
                };
                self.values.extend_from_slice(bytes);
                self.validity.push(true);

where current_validity tracks the number of nones up to when we get a size, and size: Option<usize> represents the current size (if known).

You are actually absolutely right, no unsafe needed. Thanks!

@alamb
Copy link
Contributor

alamb commented Mar 7, 2021

FYI I merged #9653 / ARROW-11896 for the Rust CI checks which may affect this PR. If you see "Rust / AMD64 Debian 10 Rust stable test workspace" failing with a linker error or no logs, rebasing against master will hopefully fix the problem

@codecov-io
Copy link

Codecov Report

Merging #9647 (cb69ca1) into master (bfa99d9) will decrease coverage by 0.00%.
The diff coverage is 82.85%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9647      +/-   ##
==========================================
- Coverage   82.49%   82.49%   -0.01%     
==========================================
  Files         245      245              
  Lines       57347    57380      +33     
==========================================
+ Hits        47311    47335      +24     
- Misses      10036    10045       +9     
Impacted Files Coverage Δ
rust/arrow/src/array/array_binary.rs 90.35% <80.32%> (-1.47%) ⬇️
rust/arrow/src/array/transform/mod.rs 93.21% <100.00%> (+0.02%) ⬆️
rust/arrow/src/array/equal/utils.rs 76.47% <0.00%> (+0.98%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bfa99d9...cb69ca1. Read the comment docs.

@ivanvankov ivanvankov reopened this Mar 7, 2021
@ivanvankov
Copy link
Author

ivanvankov commented Mar 8, 2021

Sorry, closed this one by mistake. Let me know if anything else is to be done here. Thank you!

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just trolling through past PRs and I noticed this one seems to have gotten stuck. @ivanvankov this PR looks good from my perspective -- can you please rebase it against apache/master?

@jorgecarleitao any other comments?

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, this is ready to merge, yes 👍 Thanks a lot @ivanvankov for this!

@alamb
Copy link
Contributor

alamb commented Mar 18, 2021

I merged this code with apache/master locally and ran cargo test --all and it all looks good to me. Thus I am going to merge this one in. Thanks again @ivanvankov and sorry for the delay!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants