Read nested Parquet 2-level lists correctly#6757
Conversation
|
|
||
| /// Returns `true` if this type is annotated as a list. | ||
| pub fn is_list(&self) -> bool { | ||
| match self.is_group() { |
There was a problem hiding this comment.
Since .is_group() returns a boolean wouldn't it be better semantics to put it inside a branch conditional just using if?
There was a problem hiding this comment.
I think Rust has warped me...I feel like I'm doing it wrong if I use if 😅. But you're right that it would be more succinct.
| use arrow::datatypes::ToByteSlice; | ||
|
|
||
| let testdata = arrow::util::test_util::parquet_test_data(); | ||
| // message my_record { |
There was a problem hiding this comment.
Nice, thank you for putting a comment with the test data. I like this :)
I wish I could 😅. It appears that test is borked for everyone due to some upstream shenanigans. And now fixed by #6745 |
tustvold
left a comment
There was a problem hiding this comment.
Looks good, mostly just some minor nits
| || repeated_field.name() == "array" | ||
| || repeated_field.name() == format!("{}_tuple", list_type.name()) | ||
| || (!repeated_field.is_list() | ||
| && !repeated_field.has_single_repeated_child() |
There was a problem hiding this comment.
| && !repeated_field.has_single_repeated_child() | |
| && items[0].get_basic_info().repetition() != Repetition::REPEATED |
There was a problem hiding this comment.
I didn't do this because repetition() can theoretically panic, even though by this point in the schema every node should have a repetition. Perhaps panicking here is warranted if the schema is invalid. I'll change if you'd prefer the panic (or return an error à la #6738).
| if items.len() != 1 | ||
| || repeated_field.name() == "array" | ||
| || repeated_field.name() == format!("{}_tuple", list_type.name()) | ||
| || (!repeated_field.is_list() |
There was a problem hiding this comment.
Why is this check necessary, given we're in visit_list?
There was a problem hiding this comment.
We're testing if the child of the list we're processing is also LIST annotated.
optional group my_list (LIST) { <---- this is `list_type`
repeated group array (LIST) { <---- this is `repeated_field`
repeated int32 array;
};
}
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
|
Looks like an improvement to me -- thank you @etseidl , @tustvold and @devanbenz |
* read nested 2-level lists correctly * address review comments * Apply suggestions from code review Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com> --------- Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
Which issue does this PR close?
Closes #6756.
Rationale for this change
See issue.
What changes are included in this PR?
Modifies both the arrow and record readers to check for nested lists before triggering the rule that repeated groups named "array" are treated as
list<OneTuple>.Are there any user-facing changes?
Changes the interpretation of some legacy schemas.