Skip to content

Conversation

@nevi-me
Copy link
Contributor

@nevi-me nevi-me commented Nov 22, 2020

This change aligns with the spec, in that a struct array's nulls should take precedence over a child array's nulls.
We carry across the parent's null buffer, and merge it with the child's.

This change aligns with the spec, in that a struct array's nulls should take precedence over a child array's nulls.
We carry across the parent's null buffer, and merge it with the child's.
@github-actions
Copy link

@nevi-me
Copy link
Contributor Author

nevi-me commented Nov 22, 2020

@jorgecarleitao @alamb I've only implemented this for <struct<primitive>> for now, I'd like to get some feedback on whether the approach that I'm taking is fine.

The change mainly keeps track of the null buffer of ArrayData for arrays, and combines the array's null buffer with the child's, when a nested type is encountered. I use BitAnd to combine the buffers.

Where we need null counts for comparisons, we have to recalculate the counts from the separate null buffer, as they might have been combined with the parent's null buffer (buffer-ception? let me know if my sentence is unclear).

I'll work on the below so we have sufficient test coverage:

  • <struct<struct<primitive>>>
  • <struct<struct<dictionary>>>
  • <list<primitive>>
  • <list<dictionary>>
  • <list<struct<list<primitive>>>>

@nevi-me nevi-me changed the title ARROW-10684: [Rust] Inherit struct nulls in child ARROW-10684: [Rust] Inherit struct nulls in child null equality Nov 22, 2020
equal_values(lhs, rhs, lhs_nulls, rhs_nulls, lhs_start, rhs_start, len)
} else {
// with nulls, we need to compare item by item whenever it is not null
(0..len).all(|i| {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be easier if we took into account the nulls on this statement here:

let lhs_is_null = lhs.is_null(lhs_pos);
let rhs_is_null = rhs.is_null(rhs_pos);

by replacing it with

let lhs_is_null = lhs.is_null(lhs_pos) && lhs.nullbitmap.isnull(lhs_pos);
let rhs_is_null = rhs.is_null(rhs_pos) && rhs.nullbitmap.isnull(rhs_pos);

(still keeping the re-count in place, to avoid wrongfully optimizing like this PR fixes)

The advantage is that for all types except struct, this PR merges two equal bitmaps twice (all types except struct now pass null_buffer() to equal_range. E.g. in list:

            lhs_values,
            rhs_values,
            lhs_values.null_buffer(),
            rhs_values.null_buffer(),

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb if we pass None where we don't have a struct, wouldn't we be describing a different position, which might be incorrect?
A None for the Option<&Buffer> indicates that there are no nulls, as arrays are permitted to omit the null buffer/bitmap altogether if there are no nulls. So if we pass None for lhs_nulls, we shouldn't have to look at lhs, as the presumption is that we've already extracted the null buffer, and there wasn't any.

Perhaps I don't understand you

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nevi-me I think I was confused -- I was trying to offer some way to avoid @jorgecarleitao 's concern of double merge concerns but now that I re-read my comments I am not sure that they make any sense. Sorry for the confusion

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @nevi-me -- I think this approach looks good. I had some potential optimization suggestion and I think @jorgecarleitao 's is worth considering

equal_values(lhs, rhs, lhs_nulls, rhs_nulls, lhs_start, rhs_start, len)
} else {
// with nulls, we need to compare item by item whenever it is not null
(0..len).all(|i| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if lhs_null_count == 0 && rhs_null_count == 0 {
equal_values(lhs, rhs, lhs_nulls, rhs_nulls, lhs_start, rhs_start, len)
} else {
// get a ref of the null buffer bytes, to use in testing for nullness
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorgecarleitao does this address your comment? I was incorrectly using the array's null buffer, instead of checking nullness from the merged one. If it doesn't, then it means I didn't understand what you were suggesting.

@nevi-me nevi-me marked this pull request as ready for review November 27, 2020 20:44
@nevi-me
Copy link
Contributor Author

nevi-me commented Nov 27, 2020

@alamb @jorgecarleitao this is blocking some work that I'm doing on parquet, so I'd like to limit the scope of the PR to just nested structs, as I want to submit a PR that correctly reads nested arrow structs in parquet.
I'm also a bit unsure if the struct null inheritance semantics also apply to lists, so I'd like to defer that to follow-ups (as I'd need to do the work when writing nested arrow lists to parquet).

Are you fine with reviewing the PR as is just for nested structs? I believe I've responded to or addressed comments raised so far, but I'll address more where they come, or where i haven't covered everything as yet.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this code looks good, well tested and commented, and we can always improve upon it in subsequent PRs. I think it is good to merge. FYI @jorgecarleitao

equal_values(lhs, rhs, lhs_nulls, rhs_nulls, lhs_start, rhs_start, len)
} else {
// with nulls, we need to compare item by item whenever it is not null
(0..len).all(|i| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nevi-me I think I was confused -- I was trying to offer some way to avoid @jorgecarleitao 's concern of double merge concerns but now that I re-read my comments I am not sure that they make any sense. Sorry for the confusion

@alamb
Copy link
Contributor

alamb commented Nov 28, 2020

The CI failures seem unrelated

For example:

https://github.com/apache/arrow/pull/8739/checks?check_run_id=1465465859

test result: ok. 56 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

+ popd
/d/a/arrow/arrow/rust /d/a/arrow/arrow
+ pushd arrow
/d/a/arrow/arrow/rust/arrow /d/a/arrow/arrow/rust /d/a/arrow/arrow
+ cargo run --example builders
   Compiling arrow v3.0.0-SNAPSHOT (D:\a\arrow\arrow\rust\arrow)
LLVM ERROR: IO failure on output stream: no space on device
error: could not compile `arrow`

I am going to retrigger the tests on github and then plan to merge this PR if they pass

@nevi-me
Copy link
Contributor Author

nevi-me commented Nov 28, 2020

I am going to retrigger the tests on github and then plan to merge this PR if they pass

@alamb something's filling up our space on CI, so the tests might still fail. I tried looking at it in the morning, but ended up parking it for later so I could do other work.

@nevi-me
Copy link
Contributor Author

nevi-me commented Nov 29, 2020

@jorgecarleitao may you please take a look at this again when you have a chance. I opted not to address lists for now, as their semantics might be fine. I'll look into other types as part of ARROW-10766, which I'll be working on in the evenings in the coming week.

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through the changes and they make sense.

I am still a bit unsure about why we need to perform this in Struct but not in List: the null bitmaps serve the same purpose in both cases and thus I would not expect the implementations to differ in this respect, but the tests demonstrate an improvement, so :shipit:

Thanks a lot for taking this, @nevi-me

alamb pushed a commit that referenced this pull request Nov 30, 2020
This is the change to enable writing (and reading) nested structs correctly, ala `<struct<struct<primitive>>>`.

The change introduces a busy `LevelInfo` struct, which is mostly useful for tracking changes across nested lists.
I opted to get structs right first, then refocus on lists, so our list IO support is still broken for now.

I also picked up an issue with our dictionary write support, in that some of the indexing didn't look right. i'm going to work on that in a separate commit, as I squashed all my work on this PR into one commit.

There's a nested struct test that's failing in this PR, but that's because I need #8739 in order for that test to pass.

I haven't focused on making this performant as yet, so there might be plenty of room to improve performance. I mainly focused on getting the nesting arithmetic correct. I think we should be able to paralleli(s|z)e the logic per array, but alas.

In future PR's, I'll work on list support.

Closes #8792 from nevi-me/ARROW-9728

Authored-by: Neville Dipale <nevilledips@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants