Skip to content

Conversation

@nevi-me
Copy link
Contributor

@nevi-me nevi-me commented Nov 28, 2020

This is the change to enable writing (and reading) nested structs correctly, ala <struct<struct<primitive>>>.

The change introduces a busy LevelInfo struct, which is mostly useful for tracking changes across nested lists.
I opted to get structs right first, then refocus on lists, so our list IO support is still broken for now.

I also picked up an issue with our dictionary write support, in that some of the indexing didn't look right. i'm going to work on that in a separate commit, as I squashed all my work on this PR into one commit.

There's a nested struct test that's failing in this PR, but that's because I need #8739 in order for that test to pass.

I haven't focused on making this performant as yet, so there might be plenty of room to improve performance. I mainly focused on getting the nesting arithmetic correct. I think we should be able to paralleli(s|z)e the logic per array, but alas.

In future PR's, I'll work on list support.

@github-actions
Copy link

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb @carols10cents I removed a lot of the dictionary code, because casting to a primitive, then writing that primitive, is a simpler approach.

I initially thought the code was meant to perform better than the cast, but I noticed that right before we write the dictionary, we manually cast it by iterating over the key-values to create an array. That convinced me that we could avoid all of that by casting from the onset.

What are your thoughts? If you have any benchmarks on IOx, it'd be great if you could check if this regresses you in any way. If it does, then there's likely a bug in the materialization that we need to look at again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't yet have any dictionary array benchmarks in IOx yet (as we haven't yet hooked up the array writer -- that is planned 🔜 ).

I defer to @carols10cents on the intent of the dictionary code here thought, as she did all the work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we perform a cast of the values ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then here we iterate through the keys to create the underlying primitives ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll fill this part in once I find a strategy for dealing with list arrays

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't say I reviewed / understood every line of this PR, but the overall structure and what I did review looks 👍 . Epic work @nevi-me 🏅

Some of the comments lead me to believe this implementation is not quite complete -- I personally recommend generating errors in cases we are not sure of so that we don't end up with hard to track down bugs later.

I am definitely not an expert in this code nor the parquet encoding of structures -- I (still) find the notion of repetition and definition levels somewhat confusing, but this PR seems like a good improvement and seems well tested.

It would help me (and maybe other reviewers) to have some more complete writeup (especially with examples of nullable and non-nullable structs and fields) of the algorithm this code is trying to implement -- I kind of get it from the comments and the code, but I am struggling to really understand deeply "what is being computed" so my attempted evaluation of if the "how it is being computed matches" is likely limited

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't yet have any dictionary array benchmarks in IOx yet (as we haven't yet hooked up the array writer -- that is planned 🔜 ).

I defer to @carols10cents on the intent of the dictionary code here thought, as she did all the work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is a runtime error a better behavior here than a TODO comment, just to warn potential users of the 'not yet implemented' status?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this really need to be updated? It seems like the time/duration/interval types can be treated as primitives for the null calculation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a symptom of me having worked on this on & off for about 3 months now. so some TODOs are quite old. I've cleaned up many, and those that still remain are to help me with list support, which I'm doing next.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYIW the PR from @ch-sc in #8715 will almost certainly conflict with this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests still pass after rebasing, so that's great. CC @jorgecarleitao

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it also worth testing with all fields non-nullable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, I renamed this to mixed_null, then added a non_null with all fields non-nullable

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how definition and definition_mask differ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have to write this up, as it becomes relevant when dealing with lists.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some details, but if it's still unclear, I'll be able to better demostrate it with nested lists :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed this for now, stashed them somewhere, as these are relevant for list tests.
So the levels.rs is currently indirectly tested by the roundtrip tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why #ignore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still failing as it tests lists, I'll add the reason

@nevi-me
Copy link
Contributor Author

nevi-me commented Nov 28, 2020

Looks like there's still a lot for me clean up on the levels.rs file. I'll document in detail how the algorithm I've adopted works, with some examples.

@github-actions github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Nov 28, 2020
@nevi-me nevi-me changed the title ARROW-9728: [Rust] [Parquet] Nested definition & repetition for structs ARROW-9728: [WIP] [Rust] [Parquet] Nested definition & repetition for structs Nov 28, 2020
@github-actions github-actions bot removed the needs-rebase A PR that needs to be rebased by the author label Nov 28, 2020
@github-actions github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Nov 29, 2020
@github-actions github-actions bot removed the needs-rebase A PR that needs to be rebased by the author label Nov 29, 2020
@nevi-me nevi-me changed the title ARROW-9728: [WIP] [Rust] [Parquet] Nested definition & repetition for structs ARROW-9728: [Rust] [Parquet] Nested definition & repetition for structs Nov 29, 2020
@nevi-me
Copy link
Contributor Author

nevi-me commented Nov 29, 2020

@alamb I've now cleaned up the PR to strictly focus only on structs.
Working on both structs and lists was proving to be difficult, so I'll submit a separate PR on top of this work for lists.

@github-actions github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Nov 30, 2020
save progress (11/11/2020)

save progress

Integrating level calculations in writer

Some tests are failing, still have a long way to go

fix lints

save progress

I'm nearly able to reproduce a `<struct<struct<primitive>>`
I'm writing one level too high for nulls, so my null counts differ.
Fixing this should result in nested struct roundtrip for the fully
nullable case.

Currently failing tests:

```rust
failures:
    arrow::arrow_writer::tests::arrow_writer_2_level_struct
    arrow::arrow_writer::tests::arrow_writer_complex
    arrow::levels::tests::test_calculate_array_levels_2
    arrow::levels::tests::test_calculate_array_levels_nested_list
    arrow::levels::tests::test_calculate_one_level_2
```

They are mainly failing because we don't roundtrip lists correctly

save progress 19/20-11-2020

Structs that have nulls are working (need to revert non-null logic)

TODOs that need addressing later on

save progress

- Focused more on nested structs.
- Confident that writes are now fine
- Found issue with struct logical comparison, blocks this work

add failing arrow struct array test

a bit of cleanup for failing tests

Also document why dictionary test is failing
strip out list support, to be worked on separately
@github-actions github-actions bot removed the needs-rebase A PR that needs to be rebased by the author label Nov 30, 2020
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is looking good enough to merge personally -- the existing test pass, and new tests added to show the new functionality. I haven't verified all the logic details (as I don't really understand what they should be) but the test cases look good to me.

I vote :shipit:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants