ARROW-9728: [Rust] [Parquet] Nested definition & repetition for structs #8792

nevi-me · 2020-11-28T11:59:30Z

This is the change to enable writing (and reading) nested structs correctly, à la `<struct<struct<primitive>>>`.

The change introduces a busy LevelInfo struct, which is mostly useful for tracking changes across nested lists.
I opted to get structs right first, then refocus on lists, so our list IO support is still broken for now.

I also picked up an issue with our dictionary write support, in that some of the indexing didn't look right. I'm going to work on that in a separate commit, as I squashed all my work on this PR into one commit.

There's a nested struct test that's failing in this PR, but that's because I need #8739 in order for that test to pass.

I haven't focused on making this performant as yet, so there might be plenty of room to improve performance. I mainly focused on getting the nesting arithmetic correct. I think we should be able to paralleli(s|z)e the logic per array, but alas.

In future PRs, I'll work on list support.

github-actions · 2020-11-28T12:07:24Z

https://issues.apache.org/jira/browse/ARROW-9728

rust/arrow/src/array/equal/mod.rs

nevi-me · 2020-11-28T12:25:59Z

rust/parquet/src/arrow/arrow_writer.rs

@alamb @carols10cents I removed a lot of the dictionary code, because casting to a primitive, then writing that primitive, is a simpler approach.

I initially thought the code was meant to perform better than the cast, but I noticed that right before we write the dictionary, we manually cast it by iterating over the key-values to create an array. That convinced me that we could avoid all of that by casting from the onset.
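To illustrate the manual materialization described above, here is a minimal sketch of expanding a dictionary-encoded column into a flat primitive column by walking its keys. The `DictColumn` type and `materialize` function are hypothetical stand-ins invented for this example; they are not the arrow crate's types, and the actual writer code may differ.

```rust
/// Hypothetical stand-in for a dictionary-encoded column (not an arrow type).
/// Each key indexes into `values`; a `None` key is a null slot.
struct DictColumn {
    keys: Vec<Option<usize>>,
    values: Vec<i32>,
}

/// Expand the dictionary into a plain nullable primitive column by
/// looking up each key in the values buffer. Null keys stay null.
fn materialize(col: &DictColumn) -> Vec<Option<i32>> {
    col.keys
        .iter()
        .map(|key| key.map(|k| col.values[k]))
        .collect()
}

fn main() {
    let col = DictColumn {
        keys: vec![Some(0), None, Some(2), Some(0)],
        values: vec![10, 20, 30],
    };
    // The flat column can then be written like any other primitive column.
    println!("{:?}", materialize(&col));
}
```

Since this lookup is essentially what a cast to the value type does, casting up front and writing the resulting primitive array should give the same result with less code.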

What are your thoughts? If you have any benchmarks on IOx, it'd be great if you could check if this regresses you in any way. If it does, then there's likely a bug in the materialization that we need to look at again.

We don't have any dictionary array benchmarks in IOx yet (as we haven't yet hooked up the array writer -- that is planned 🔜).

I defer to @carols10cents on the intent of the dictionary code here though, as she did all the work.

nevi-me · 2020-11-28T12:26:43Z

rust/parquet/src/arrow/arrow_writer.rs

Here we perform a cast of the values ...

nevi-me · 2020-11-28T12:27:13Z

rust/parquet/src/arrow/arrow_writer.rs

then here we iterate through the keys to create the underlying primitives ...

nevi-me · 2020-11-28T12:28:28Z

rust/parquet/src/arrow/arrow_writer.rs

I'll fill this part in once I find a strategy for dealing with list arrays

alamb

I can't say I reviewed / understood every line of this PR, but the overall structure and what I did review looks 👍 . Epic work @nevi-me 🏅

Some of the comments lead me to believe this implementation is not quite complete -- I personally recommend generating errors in cases we are not sure of so that we don't end up with hard to track down bugs later.

I am definitely not an expert in this code nor the parquet encoding of structures -- I (still) find the notion of repetition and definition levels somewhat confusing, but this PR seems like a good improvement and seems well tested.

It would help me (and maybe other reviewers) to have a more complete writeup of the algorithm this code is trying to implement, especially with examples of nullable and non-nullable structs and fields. I kind of get it from the comments and the code, but I am struggling to deeply understand what is being computed, so my evaluation of whether how it is being computed matches is likely limited.
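As a rough illustration of the kind of example requested here, the sketch below computes Parquet definition levels for a leaf reached through nested structs. It relies only on the general Parquet rule that each nullable field on the path adds one definition level and that counting stops at the first null. The `Level` type and `definition_levels` function are hypothetical names for this example; this is not the `LevelInfo` logic from the PR, and it ignores lists (so all repetition levels would be zero).

```rust
/// Hypothetical description of one field on the path from the root to a leaf.
/// `validity[i]` is true when slot `i` is non-null at this level.
struct Level {
    nullable: bool,
    validity: Vec<bool>,
}

/// Definition level per slot: the number of nullable fields on the path
/// that are defined, counting from the root and stopping at the first null.
fn definition_levels(path: &[Level], len: usize) -> Vec<i16> {
    (0..len)
        .map(|i| {
            let mut def = 0i16;
            for level in path {
                if !level.nullable {
                    // Required fields never contribute a definition level.
                    continue;
                }
                if !level.validity[i] {
                    // First null on the path: nothing below it is defined.
                    break;
                }
                def += 1;
            }
            def
        })
        .collect()
}

fn main() {
    // struct a { struct b { c: int32 } }, with a, b and c all nullable.
    let path = vec![
        Level { nullable: true, validity: vec![true, false, true, true] },
        Level { nullable: true, validity: vec![true, false, false, true] },
        Level { nullable: true, validity: vec![true, false, false, false] },
    ];
    println!("{:?}", definition_levels(&path, 4));
}
```

For the fully nullable case this prints `[3, 0, 1, 2]`: a value, a null `a`, a null `b`, and a null `c`. If `a` and `c` were non-nullable, the maximum definition level would drop to 1, and only a null `b` would produce a 0.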

rust/parquet/src/arrow/arrow_writer.rs

alamb · 2020-11-28T12:54:45Z

rust/parquet/src/arrow/arrow_writer.rs

We don't have any dictionary array benchmarks in IOx yet (as we haven't yet hooked up the array writer -- that is planned 🔜).

I defer to @carols10cents on the intent of the dictionary code here though, as she did all the work.

rust/parquet/src/arrow/arrow_writer.rs

alamb · 2020-11-28T13:46:59Z

rust/parquet/src/arrow/arrow_writer.rs

is a runtime error a better behavior here than a TODO comment, just to warn potential users of the 'not yet implemented' status?
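A minimal sketch of what this suggestion could look like, assuming a hypothetical `WriteError` type and `ColumnKind` enum invented for this example (neither is taken from the parquet crate): unsupported cases return an explicit error instead of falling through a TODO.

```rust
/// Hypothetical error type for this sketch (not the parquet crate's error type).
#[derive(Debug, PartialEq)]
enum WriteError {
    NotYetImplemented(String),
}

/// Hypothetical classification of the column being written.
#[derive(Debug)]
enum ColumnKind {
    Primitive,
    Struct,
    List,
}

/// Fail loudly for cases that are not handled yet, so callers learn about
/// the gap at write time rather than from a subtly wrong file.
fn write_levels(kind: &ColumnKind) -> Result<(), WriteError> {
    match kind {
        ColumnKind::Primitive | ColumnKind::Struct => Ok(()),
        ColumnKind::List => Err(WriteError::NotYetImplemented(
            "writing list arrays is not yet supported".to_string(),
        )),
    }
}

fn main() {
    for kind in [ColumnKind::Primitive, ColumnKind::Struct, ColumnKind::List] {
        println!("{:?} -> {:?}", kind, write_levels(&kind));
    }
}
```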

alamb · 2020-11-28T13:48:26Z

rust/parquet/src/arrow/arrow_writer.rs

Does this really need to be updated? It seems like the time/duration/interval types can be treated as primitives for the null calculation

This is a symptom of me having worked on this on & off for about 3 months now, so some TODOs are quite old. I've cleaned up many, and those that still remain are to help me with list support, which I'm doing next.

alamb · 2020-11-28T14:02:33Z

rust/parquet/src/arrow/arrow_writer.rs

FYIW the PR from @ch-sc in #8715 will almost certainly conflict with this PR

Tests still pass after rebasing, so that's great. CC @jorgecarleitao

alamb · 2020-11-28T14:03:28Z

rust/parquet/src/arrow/arrow_writer.rs

is it also worth testing with all fields non-nullable?

Done, I renamed this to mixed_null, then added a non_null with all fields non-nullable

alamb · 2020-11-28T14:04:26Z

rust/parquet/src/arrow/levels.rs

I don't understand how definition and definition_mask differ...

I'll have to write this up, as it becomes relevant when dealing with lists.

I added some details, but if it's still unclear, I'll be able to better demonstrate it with nested lists :)

alamb · 2020-11-28T14:05:00Z

rust/parquet/src/arrow/levels.rs

I've removed these for now and stashed them somewhere, as they are relevant for list tests.
So levels.rs is currently tested only indirectly, by the roundtrip tests.

alamb · 2020-11-28T14:05:16Z

rust/parquet/src/arrow/levels.rs

Why `#[ignore]`?

still failing as it tests lists, I'll add the reason

nevi-me · 2020-11-28T18:25:15Z

Looks like there's still a lot for me to clean up in the levels.rs file. I'll document in detail how the algorithm I've adopted works, with some examples.

nevi-me · 2020-11-29T13:28:53Z

@alamb I've now cleaned up the PR to strictly focus only on structs.
Working on both structs and lists was proving to be difficult, so I'll submit a separate PR on top of this work for lists.

save progress (11/11/2020)

save progress

Integrating level calculations in writer

Some tests are failing, still have a long way to go

fix lints

save progress

I'm nearly able to reproduce a `<struct<struct<primitive>>`
I'm writing one level too high for nulls, so my null counts differ.
Fixing this should result in nested struct roundtrip for the fully nullable case.

Currently failing tests:

```rust
failures:
    arrow::arrow_writer::tests::arrow_writer_2_level_struct
    arrow::arrow_writer::tests::arrow_writer_complex
    arrow::levels::tests::test_calculate_array_levels_2
    arrow::levels::tests::test_calculate_array_levels_nested_list
    arrow::levels::tests::test_calculate_one_level_2
```

They are mainly failing because we don't roundtrip lists correctly

save progress 19/20-11-2020

Structs that have nulls are working (need to revert non-null logic)
TODOs that need addressing later on

save progress

- Focused more on nested structs.
- Confident that writes are now fine
- Found issue with struct logical comparison, blocks this work

add failing arrow struct array test

a bit of cleanup for failing tests

Also document why dictionary test is failing

strip out list support, to be worked on separately

alamb

I think this is looking good enough to merge personally -- the existing tests pass, and new tests were added to show the new functionality. I haven't verified all the logic details (as I don't really understand what they should be) but the test cases look good to me.

I vote

github-actions bot added Component: Rust Component: Parquet labels Nov 28, 2020

nevi-me requested review from alamb and sunchao November 28, 2020 12:06

nevi-me commented Nov 28, 2020


nevi-me force-pushed the ARROW-9728 branch from 2fa2aef to 0558857 Compare November 28, 2020 12:31

nevi-me mentioned this pull request Nov 28, 2020

ARROW-10755: [Rust] [Parquet] Add support for writing boolean type #8790

Closed

alamb reviewed Nov 28, 2020


github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Nov 28, 2020

nevi-me changed the title ~~ARROW-9728: [Rust] [Parquet] Nested definition & repetition for structs~~ ARROW-9728: [WIP] [Rust] [Parquet] Nested definition & repetition for structs Nov 28, 2020

nevi-me force-pushed the ARROW-9728 branch from 0558857 to 58c54a6 Compare November 28, 2020 20:58

github-actions bot removed the needs-rebase A PR that needs to be rebased by the author label Nov 28, 2020

nevi-me force-pushed the ARROW-9728 branch from 58c54a6 to 33888ba Compare November 29, 2020 12:45

github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Nov 29, 2020

nevi-me force-pushed the ARROW-9728 branch from cbc02d3 to efffcca Compare November 29, 2020 13:20

github-actions bot removed the needs-rebase A PR that needs to be rebased by the author label Nov 29, 2020

nevi-me changed the title ~~ARROW-9728: [WIP] [Rust] [Parquet] Nested definition & repetition for structs~~ ARROW-9728: [Rust] [Parquet] Nested definition & repetition for structs Nov 29, 2020

github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Nov 30, 2020

nevi-me added 6 commits November 30, 2020 17:10

simplify dictionary writes

2231ef0

move things around

340079c

strip out list support, to be worked on separately

nested struct tests now pass

2afcf76

minor fixes to track max_definition correctly

16e0943

remove stray TODO to force CI to run

9d83df3

nevi-me force-pushed the ARROW-9728 branch from e8dd00f to 9d83df3 Compare November 30, 2020 15:10

github-actions bot removed the needs-rebase A PR that needs to be rebased by the author label Nov 30, 2020

alamb approved these changes Nov 30, 2020


alamb closed this in 4e4e938 Nov 30, 2020

asfimport mentioned this pull request Nov 30, 2020

[Rust] [Parquet] Compute nested definition and repetition for structs #17306

Closed
