ARROW-7960: [C++][Parquet][WIP] Add schema translation for missing logicalTypes#6758
igorcalabria wants to merge 1 commit into apache:master
emkornfield left a comment
Can you please add schema tests to confirm this code?
Also, for the different list types in the original JIRA I mentioned determining them empirically, but if the schema is written to parquet (via metadata) we should be able to determine that exactly. Do you think you could look into that under a separate issue?
const auto& key_val_group = static_cast<const GroupNode&>(key_val_node);

if (key_val_group.field_count() != 2) {

Reading https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps, it seems that the value can be omitted. In that case I think we should probably use Null arrays.
Yes, it seems that an omitted value could be used when a "Map" represents a "Set". Using Null for the value type sounds right to me.
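To make the "Null value type for a key-only map" idea concrete, here is a minimal standalone sketch. It deliberately avoids the Arrow and Parquet APIs: `FieldSketch`, `MapTypeSketch`, `TranslateKeyValueGroup`, and the placeholder field name "value" are all hypothetical names invented for illustration, and the type names are plain strings. It only shows the branching being discussed, not how the PR would implement it.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are NOT Arrow or Parquet types.
struct FieldSketch {
  std::string name;
  std::string type;  // e.g. "utf8", "int32", "null"
};

struct MapTypeSketch {
  FieldSketch key;
  FieldSketch value;
};

// Translate the children of a MAP's repeated key_value group.
// - one child (key only): the map models a set, so the value type becomes "null"
// - two children: an ordinary key/value map
// - anything else: rejected (returns false and leaves *out untouched)
bool TranslateKeyValueGroup(const std::vector<FieldSketch>& children,
                            MapTypeSketch* out) {
  assert(out != nullptr);
  if (children.size() == 1) {
    // The field name "value" is an assumption made for this sketch.
    *out = MapTypeSketch{children[0], FieldSketch{"value", "null"}};
    return true;
  }
  if (children.size() == 2) {
    *out = MapTypeSketch{children[0], children[1]};
    return true;
  }
  return false;
}
```

With this shape, the `field_count() != 2` check would turn into "accept 1 or 2, reject everything else" rather than rejecting the key-only case outright.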
RETURN_NOT_OK(arrow::GroupToStruct(key_val_group, max_def_level, max_rep_level, ctx,
                                   out, child_field));

const auto key_field = child_field->children[0].field;

Don't use auto here or below; spell out the type.
    const SchemaField* parent, SchemaField* out) {
  if (node.logical_type()->is_list()) {
    return ListToSchemaField(node, max_def_level, max_rep_level, ctx, parent, out);
  } else if (node.logical_type()->is_map()) {

It doesn't seem like this handles the backward-compatibility case of "MAP_KEY_VALUE" listed in the spec.
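For reference, the backward-compatibility rule in LogicalTypes.md says that a group annotated with MAP_KEY_VALUE that is not contained by a MAP-annotated group should be handled as a MAP-annotated group. A rough standalone sketch of that dispatch decision follows; `AnnotationSketch` and `ShouldTreatAsMap` are hypothetical names, not part of the Parquet C++ API, and the real check would read the node's logical or converted type instead of an enum passed in by hand.

```cpp
// Hypothetical annotation tags for illustration; NOT the Parquet C++ enums.
enum class AnnotationSketch { kNone, kList, kMap, kMapKeyValue };

// Decide whether a group node should go down the map translation path.
// parent_treated_as_map: true when the enclosing group was itself handled
// as a map (either annotated MAP, or a legacy MAP_KEY_VALUE used in its place).
bool ShouldTreatAsMap(AnnotationSketch node, bool parent_treated_as_map) {
  if (node == AnnotationSketch::kMap) {
    // Structural validation of the children happens elsewhere.
    return true;
  }
  // Legacy writers sometimes used MAP_KEY_VALUE in place of MAP. Inside a map
  // the same tag only marks the repeated key_value group, so it is not a map.
  return node == AnnotationSketch::kMapKeyValue && !parent_treated_as_map;
}
```

Tracking whether the parent was handled as a map, rather than comparing against the parent's raw annotation, keeps the inner key_value group of a legacy map from being translated as a nested map.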
if (!key_val_node.is_repeated()) {
  return Status::NotImplemented(
      "Non-repeated nodes in MAP-annotated group are not supported.");

Suggested change:
-     "Non-repeated nodes in MAP-annotated group are not supported.");
+     "Non-repeated key_value node in MAP-annotated group is not supported.");

const Node& key_val_node = *group.field(0);

if (!key_val_node.is_repeated()) {
  return Status::NotImplemented(

I don't think NotImplemented is correct here; this should be considered invalid input. Consider providing the offending field name (and additional metadata if possible) in all these error messages; it will make diagnosing errors easier.
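A small standalone sketch of what "invalid input, with the offending field name" could look like. `StatusSketch`, `StatusCodeSketch`, and `CheckKeyValueNode` are hypothetical and stand in for the real status type so the example compiles on its own; the message wording and the choice to accept one or two children are assumptions, not what the PR does.

```cpp
#include <sstream>
#include <string>

// Hypothetical status type for illustration; NOT arrow::Status.
enum class StatusCodeSketch { kOk, kInvalid, kNotImplemented };

struct StatusSketch {
  StatusCodeSketch code;
  std::string message;
};

// Validate the single child of a MAP-annotated group.
// Malformed schemas are reported as kInvalid (bad input), not kNotImplemented,
// and every message names both the map and the offending child.
StatusSketch CheckKeyValueNode(const std::string& map_name,
                               const std::string& child_name,
                               bool child_is_repeated, int child_field_count) {
  if (!child_is_repeated) {
    std::ostringstream msg;
    msg << "Non-repeated key_value node '" << child_name
        << "' in MAP-annotated group '" << map_name << "'";
    return StatusSketch{StatusCodeSketch::kInvalid, msg.str()};
  }
  // Assumption: a key-only group (1 field) is allowed alongside key + value (2).
  if (child_field_count != 1 && child_field_count != 2) {
    std::ostringstream msg;
    msg << "key_value node '" << child_name << "' in MAP-annotated group '"
        << map_name << "' has " << child_field_count
        << " fields; expected 1 or 2";
    return StatusSketch{StatusCodeSketch::kInvalid, msg.str()};
  }
  return StatusSketch{StatusCodeSketch::kOk, std::string()};
}
```

Carrying the group and child names in the message means a failure on a large schema points at the exact column instead of a generic "not supported" string.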
What do you mean by "broke"? I would have expected more tests to fail in CI if things were actually broken.
if (key_val_group.field_count() != 2) {
  return Status::NotImplemented(
      "Only groups with 2 fields are supported on MAP-annoted key val group");

I'm guessing that he has a file containing Map data and it failed to instantiate a column reader. This patch will need some unit tests for the different Map cases.
I'm closing this until it can be picked up again.
WIP PR for adding missing LogicalTypes listed in https://issues.apache.org/jira/browse/ARROW-7960.
Adding direct translation to map broke the reader because there's no special case for maps. It should be straightforward to reuse the struct reader for it.