Description
We are considering using field-level metadata (as opposed to schema-level metadata) in an application, and also want to be able to save this data to Parquet. I therefore looked into "field-level metadata" support for Parquet, which I had assumed we already had.
We can roundtrip Arrow's field-level metadata to/from Parquet, as shown with this example:
import pyarrow as pa
import pyarrow.parquet as pq
schema = pa.schema([pa.field("column_name", pa.int64(), metadata={"key": "value"})])
table = pa.table({'column_name': [0, 1, 2]}, schema=schema)
pq.write_table(table, "test_field_metadata.parquet")
>>> pq.read_table("test_field_metadata.parquet").schema
column_name: int64
-- field metadata --
key: 'value'
However, this metadata is only restored because it is part of the serialized Arrow schema that we (by default) store under the ARROW:schema key in the Parquet FileMetaData.key_value_metadata.
With a small patch that makes it possible to turn this off (it is currently hardcoded to be on in the Python bindings), it is clear that this field-level metadata is not restored on roundtrip without the stored Arrow schema:
pq.write_table(table, "test_field_metadata_without_schema.parquet", store_arrow_schema=False)
>>> pq.read_table("test_field_metadata_without_schema.parquet").schema
column_name: int64
So there is currently no mapping from Arrow's field-level metadata to Parquet's column-level metadata (ColumnMetaData.key_value_metadata in Parquet's Thrift structures).
This also means that roundtripping field-level metadata through Parquet only works as long as Arrow is used for both writing and reading, and not if you also want to exchange such data with non-Arrow Parquet implementations.
In addition, it seems we don't even expose this field in our C++ or Python bindings, so there is no way to access that data from a Parquet file (for example, one written by another implementation) that has key_value_metadata in its ColumnMetaData.
cc @emkornfield
Reporter: Joris Van den Bossche / @jorisvandenbossche
Related issues:
- [C++] field's metadata is not written to Parquet file (duplicates)
- [C++] field's metadata is not written to Parquet file (relates to)
- [C++][Parquet] Parquet FileMetaData key_value_metadata not always mapped to Arrow Schema metadata (relates to)
Note: This issue was originally created as ARROW-15548. Please see the migration documentation for further details.