Skip to content
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
50 changes: 50 additions & 0 deletions cpp/src/parquet/arrow/reader_internal.cc
Original file line number Diff line number Diff line change
Expand Up @@ -402,6 +402,54 @@ Status GroupToStruct(const GroupNode& node, int16_t max_def_level, int16_t max_r
return Status::OK();
}

Status MapToSchemaField(const GroupNode& group, int16_t max_def_level,
int16_t max_rep_level, SchemaTreeContext* ctx,
const SchemaField* parent, SchemaField* out) {
if (group.field_count() != 1) {
return Status::NotImplemented(
"Only MAP-annotated groups with a single child can handled");
}
out->children.resize(1);
SchemaField* child_field = &out->children[0];
ctx->LinkParent(out, parent);
ctx->LinkParent(child_field, out);

const Node& key_val_node = *group.field(0);

if (!key_val_node.is_repeated()) {
return Status::NotImplemented(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think NotImplemented here is correct, this should be considered invalid input. Consider providing the offending field name (and additional metadata if possible) in all these error messages, it will make determining errors easier.

"Non-repeated nodes in MAP-annotated group are not supported.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Non-repeated nodes in MAP-annotated group are not supported.");
"Non-repeated key_value node in MAP-annotated group is not supported.");

}

if (!key_val_node.is_group()) {
return Status::NotImplemented(
"Non-group nodes in MAP-annotated group are not supported.");
}

++max_def_level;
++max_rep_level;

const auto& key_val_group = static_cast<const GroupNode&>(key_val_node);

if (key_val_group.field_count() != 2) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps it seems that value can be omitted in this case I think we should probably use Null arrays.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wesm thoughts on omitted value columns?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it seems that omitted value could be used when a "Map" represents a "Set" value. Using Null for the value type sounds right to me

return Status::NotImplemented(
"Only groups with 2 fields are supported on MAP-annoted key val group");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: annotated

}

RETURN_NOT_OK(arrow::GroupToStruct(key_val_group, max_def_level, max_rep_level, ctx,
out, child_field));

const auto key_field = child_field->children[0].field;
Copy link
Contributor

@emkornfield emkornfield Mar 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use auto here (spell out the type) and below.

const auto val_field = child_field->children[1].field;

out->field = ::arrow::field(group.name(), ::arrow::map(key_field->type(), val_field),
group.is_optional(), FieldIdMetadata(group.field_id()));
out->max_definition_level = max_def_level;
out->max_repetition_level = max_rep_level;

return Status::OK();
}

Status ListToSchemaField(const GroupNode& group, int16_t max_def_level,
int16_t max_rep_level, SchemaTreeContext* ctx,
const SchemaField* parent, SchemaField* out) {
Expand Down Expand Up @@ -487,6 +535,8 @@ Status GroupToSchemaField(const GroupNode& node, int16_t max_def_level,
const SchemaField* parent, SchemaField* out) {
if (node.logical_type()->is_list()) {
return ListToSchemaField(node, max_def_level, max_rep_level, ctx, parent, out);
} else if (node.logical_type()->is_map()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem like this handles the backward compatibility case of "MAP_KEY_VALUE" listed in the spec

return MapToSchemaField(node, max_def_level, max_rep_level, ctx, parent, out);
}
std::shared_ptr<DataType> type;
if (node.is_repeated()) {
Expand Down