-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-9728: [Rust] [Parquet] Nested definition & repetition for structs #8792
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb @carols10cents I removed a lot of the dictionary code, because casting to a primitive, then writing that primitive, is a simpler approach.
I initially thought the code was meant to perform better than the cast, but I noticed that right before we write the dictionary, we manually cast it by iterating over the key-values to create an array. That convinced me that we could avoid all of that by casting from the onset.
What are your thoughts? If you have any benchmarks on IOx, it'd be great if you could check if this regresses you in any way. If it does, then there's likely a bug in the materialization that we need to look at again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't yet have any dictionary array benchmarks in IOx yet (as we haven't yet hooked up the array writer -- that is planned 🔜 ).
I defer to @carols10cents on the intent of the dictionary code here thought, as she did all the work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we perform a cast of the values ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then here we iterate through the keys to create the underlying primitives ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll fill this part in once I find a strategy for dealing with list arrays
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't say I reviewed / understood every line of this PR, but the overall structure and what I did review looks 👍 . Epic work @nevi-me 🏅
Some of the comments lead me to believe this implementation is not quite complete -- I personally recommend generating errors in cases we are not sure of so that we don't end up with hard to track down bugs later.
I am definitely not an expert in this code nor the parquet encoding of structures -- I (still) find the notion of repetition and definition levels somewhat confusing, but this PR seems like a good improvement and seems well tested.
It would help me (and maybe other reviewers) to have some more complete writeup (especially with examples of nullable and non-nullable structs and fields) of the algorithm this code is trying to implement -- I kind of get it from the comments and the code, but I am struggling to really understand deeply "what is being computed" so my attempted evaluation of if the "how it is being computed matches" is likely limited
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't yet have any dictionary array benchmarks in IOx yet (as we haven't yet hooked up the array writer -- that is planned 🔜 ).
I defer to @carols10cents on the intent of the dictionary code here thought, as she did all the work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is a runtime error a better behavior here than a TODO comment, just to warn potential users of the 'not yet implemented' status?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this really need to be updated? It seems like the time/duration/interval types can be treated as primitives for the null calculation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a symptom of me having worked on this on & off for about 3 months now. so some TODOs are quite old. I've cleaned up many, and those that still remain are to help me with list support, which I'm doing next.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests still pass after rebasing, so that's great. CC @jorgecarleitao
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it also worth testing with all fields non-nullable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, I renamed this to mixed_null, then added a non_null with all fields non-nullable
rust/parquet/src/arrow/levels.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand how definition and definition_mask differ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll have to write this up, as it becomes relevant when dealing with lists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some details, but if it's still unclear, I'll be able to better demostrate it with nested lists :)
rust/parquet/src/arrow/levels.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed this for now, stashed them somewhere, as these are relevant for list tests.
So the levels.rs is currently indirectly tested by the roundtrip tests.
rust/parquet/src/arrow/levels.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why #ignore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still failing as it tests lists, I'll add the reason
|
Looks like there's still a lot for me clean up on the |
|
@alamb I've now cleaned up the PR to strictly focus only on structs. |
save progress (11/11/2020)
save progress
Integrating level calculations in writer
Some tests are failing, still have a long way to go
fix lints
save progress
I'm nearly able to reproduce a `<struct<struct<primitive>>`
I'm writing one level too high for nulls, so my null counts differ.
Fixing this should result in nested struct roundtrip for the fully
nullable case.
Currently failing tests:
```rust
failures:
arrow::arrow_writer::tests::arrow_writer_2_level_struct
arrow::arrow_writer::tests::arrow_writer_complex
arrow::levels::tests::test_calculate_array_levels_2
arrow::levels::tests::test_calculate_array_levels_nested_list
arrow::levels::tests::test_calculate_one_level_2
```
They are mainly failing because we don't roundtrip lists correctly
save progress 19/20-11-2020
Structs that have nulls are working (need to revert non-null logic)
TODOs that need addressing later on
save progress
- Focused more on nested structs.
- Confident that writes are now fine
- Found issue with struct logical comparison, blocks this work
add failing arrow struct array test
a bit of cleanup for failing tests
Also document why dictionary test is failing
strip out list support, to be worked on separately
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is looking good enough to merge personally -- the existing test pass, and new tests added to show the new functionality. I haven't verified all the logic details (as I don't really understand what they should be) but the test cases look good to me.
I vote ![]()
This is the change to enable writing (and reading) nested structs correctly, ala
<struct<struct<primitive>>>.The change introduces a busy
LevelInfostruct, which is mostly useful for tracking changes across nested lists.I opted to get structs right first, then refocus on lists, so our list IO support is still broken for now.
I also picked up an issue with our dictionary write support, in that some of the indexing didn't look right. i'm going to work on that in a separate commit, as I squashed all my work on this PR into one commit.
There's a nested struct test that's failing in this PR, but that's because I need #8739 in order for that test to pass.
I haven't focused on making this performant as yet, so there might be plenty of room to improve performance. I mainly focused on getting the nesting arithmetic correct. I think we should be able to paralleli(s|z)e the logic per array, but alas.
In future PR's, I'll work on list support.