[C++][Parquet] Field-level metadata are not supported? (ColumnMetadata.key_value_metadata) #31018

@asfimport

We are considering field-level metadata (as opposed to schema-level metadata) for an application, and we also want to be able to save this data to Parquet. I therefore looked into field-level metadata for Parquet, which I had assumed we supported.

We can roundtrip Arrow's field-level metadata to/from Parquet, as shown with this example:

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([pa.field("column_name", pa.int64(), metadata={"key": "value"})])
table = pa.table({'column_name': [0, 1, 2]}, schema=schema)
pq.write_table(table, "test_field_metadata.parquet")

>>> pq.read_table("test_field_metadata.parquet").schema
column_name: int64
  -- field metadata --
  key: 'value'

However, this metadata is restored only because it is part of the Arrow schema, which we store (by default) under the ARROW:schema key in the Parquet FileMetaData.key_value_metadata.

With a small patch that makes it possible to turn this off (currently it is hardcoded to be on in the Python bindings), it becomes clear that the field-level metadata is not restored on roundtrip without the stored Arrow schema:

pq.write_table(table, "test_field_metadata_without_schema.parquet", store_arrow_schema=False)

>>> pq.read_table("test_field_metadata_without_schema.parquet").schema
column_name: int64

So there is currently no mapping from Arrow's field-level metadata to Parquet's column-level metadata (ColumnMetaData.key_value_metadata in Parquet's Thrift structures).

This also means that roundtripping field-level metadata through Parquet only works as long as Arrow is used for both writing and reading, and not when such data has to be exchanged with non-Arrow Parquet implementations.

In addition, it seems we don't even expose this field in our C++ or Python bindings, so there is no way to access that data when reading a Parquet file (written by another implementation) that has key_value_metadata in its ColumnMetaData.

cc @emkornfield

Reporter: Joris Van den Bossche / @jorisvandenbossche

Note: This issue was originally created as ARROW-15548. Please see the migration documentation for further details.
