Is your feature request related to a problem or challenge? Please describe what you are trying to do.
As we work on various features of Parquet metadata it is becoming clear that working with the current code organization is challenging.
I wanted to write down some of my thoughts about how it all fits together.
Here are some challenges:
The naming is challenging: Consistent naming for Parquet page index structures #6097
There is no easy way to write the metadata to bytes outside the context of a Parquet file: Add ParquetMetadataWriter allow ad-hoc encoding of ParquetMetadata #6000
It is complicated to understand how to read optional parts of the metadata that are not inlined (e.g. OffsetIndexes): Document when the ParquetRecordBatchReader will re-read metadata #5887
If we ever wanted to speed up metadata parsing (e.g. Use custom thrift decoder to improve speed of parsing parquet metadata #5854), it would be hard with the current structure
There is not always a 1-1 correspondence between the structures in file::metadata and the thrift structures in format::metadata.
Describe the solution you'd like
I would like to propose that:
We continue to clarify the distinction between file::metadata and format::metadata
We improve the API for translating back and forth between the file::metadata structures and bytes, and de-emphasize the conversion to and from the thrift structures
Maybe this is clear to others, but it is not to me.
Here is how I see the structures involved:
                            ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐            ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
                            │   ┌────────────────┐   │            │  ┌───────────────────────┐  │
                            │   │  ColumnIndex   │   │            │  │    ParquetMetaData    │  │
                            │   └────────────────┘   │            │  └───────────────────────┘  │
┌──────────────┐            │   ┌────────────────┐   │            │  ┌───────────────────────┐  │
│   ..0x24..   │ ◀────────▶ │   │  OffsetIndex   │   │ ◀────────▶ │  │    ParquetMetaData    │  │
└──────────────┘            │   └────────────────┘   │            │  └───────────────────────┘  │
                            │          ...           │            │             ...             │
     bytes                  │   ┌────────────────┐   │            │  ┌───────────────────────┐  │
(thrift encoded)            │   │ FileMetaData*  │   │            │  │     FileMetaData*     │  │
                            │   └────────────────┘   │            │  └───────────────────────┘  │
                            └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘            └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
                             format::meta structures                 file::metadata structures

* Same name, different struct
I would like to focus on improving the API for going back and forth between bytes and the file::metadata structures:
                                       ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
                                       │  ┌───────────────────────┐  │
                                       │  │    ParquetMetaData    │  │
                                       │  └───────────────────────┘  │
┌──────────────┐                       │  ┌───────────────────────┐  │
│   ..0x24..   │ ◀───────────────────▶ │  │    ParquetMetaData    │  │
└──────────────┘                       │  └───────────────────────┘  │
                  Would like to focus  │                             │
     bytes        on this API to/from  │  ┌───────────────────────┐  │
(thrift encoded)     bytes and the     │  │     FileMetaData*     │  │
                    file::metadata     │  └───────────────────────┘  │
                                       └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
                                          file::metadata structures
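To make the shape of such an API concrete, here is a minimal sketch of a round trip between bytes and a file-level structure, in which callers never touch an intermediate format:: struct. Everything in it is an assumption for illustration: `MetadataSketch`, its two fields, and the `to_bytes` / `from_bytes` names are hypothetical, and the byte layout is a toy encoding rather than thrift.

```rust
/// Hypothetical stand-in for a `file::metadata` structure such as
/// `ParquetMetaData`. The fields are illustrative only.
#[derive(Debug, Clone, PartialEq)]
struct MetadataSketch {
    num_rows: i64,
    created_by: String,
}

impl MetadataSketch {
    /// Encode directly to bytes.
    /// Toy layout (NOT thrift): 8-byte little-endian row count,
    /// followed by the UTF-8 bytes of `created_by`.
    fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(8 + self.created_by.len());
        out.extend_from_slice(&self.num_rows.to_le_bytes());
        out.extend_from_slice(self.created_by.as_bytes());
        out
    }

    /// Decode directly from bytes, without exposing any intermediate
    /// representation to the caller.
    fn from_bytes(buf: &[u8]) -> Result<Self, String> {
        if buf.len() < 8 {
            return Err(format!("expected at least 8 bytes, got {}", buf.len()));
        }
        let (head, tail) = buf.split_at(8);
        let mut num_rows = [0u8; 8];
        num_rows.copy_from_slice(head);
        let created_by = String::from_utf8(tail.to_vec()).map_err(|e| e.to_string())?;
        Ok(Self {
            num_rows: i64::from_le_bytes(num_rows),
            created_by,
        })
    }
}
```

The point of the sketch is the signature pair, not the encoding: with `to_bytes` and `from_bytes` as the public surface, a faster decoder could replace the internals without changing callers.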
Describe alternatives you've considered
I think we probably need at least two different APIs:
Reading
One that reads from [u8] buffered in memory (decode_footer and decode_metadata)
One that reads from an AsyncReader or something equivalent (MetadataLoader may be enough, or may need some more information)
Writing
One that writes to [u8] (API for encoding/decoding ParquetMetadata with more control #6002)
One that writes to an AsyncWriter, perhaps
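As a rough illustration of the two reading paths, the sketch below locates the encoded metadata bytes first from an in-memory [u8] and then from a seekable reader. It relies on the Parquet footer layout (the file ends with a 4-byte little-endian metadata length followed by the magic bytes PAR1). The function names are hypothetical and are not the crate's API, a synchronous `Read + Seek` stands in for an AsyncReader, and decoding the returned bytes into metadata structures is left out.

```rust
use std::io::{Read, Seek, SeekFrom};

/// The footer is 8 bytes: 4-byte little-endian metadata length, then "PAR1".
const FOOTER_SIZE: usize = 8;
const MAGIC: &[u8; 4] = b"PAR1";

/// Hypothetical helper: validate the footer and return the length of the
/// encoded metadata that precedes it.
fn metadata_len_from_footer(footer: &[u8; FOOTER_SIZE]) -> Result<usize, String> {
    if footer[4..] != MAGIC[..] {
        return Err("invalid Parquet footer: bad magic bytes".to_string());
    }
    let len = u32::from_le_bytes([footer[0], footer[1], footer[2], footer[3]]);
    Ok(len as usize)
}

/// Reading path 1: the end of the file is already buffered in memory.
fn metadata_bytes_from_slice(buf: &[u8]) -> Result<&[u8], String> {
    if buf.len() < FOOTER_SIZE {
        return Err("buffer is smaller than the footer".to_string());
    }
    let footer_start = buf.len() - FOOTER_SIZE;
    let mut footer = [0u8; FOOTER_SIZE];
    footer.copy_from_slice(&buf[footer_start..]);
    let len = metadata_len_from_footer(&footer)?;
    if len > footer_start {
        return Err("metadata length exceeds buffer".to_string());
    }
    Ok(&buf[footer_start - len..footer_start])
}

/// Reading path 2: fetch only what is needed from a seekable source
/// (a synchronous stand-in for an async reader).
fn metadata_bytes_from_reader<R: Read + Seek>(reader: &mut R) -> Result<Vec<u8>, String> {
    // First read: the fixed-size footer at the end of the file.
    reader
        .seek(SeekFrom::End(-(FOOTER_SIZE as i64)))
        .map_err(|e| e.to_string())?;
    let mut footer = [0u8; FOOTER_SIZE];
    reader.read_exact(&mut footer).map_err(|e| e.to_string())?;
    let len = metadata_len_from_footer(&footer)?;

    // Second read: the encoded metadata immediately before the footer.
    reader
        .seek(SeekFrom::End(-((FOOTER_SIZE + len) as i64)))
        .map_err(|e| e.to_string())?;
    let mut buf = vec![0u8; len];
    reader.read_exact(&mut buf).map_err(|e| e.to_string())?;
    Ok(buf)
}
```

Both paths share the footer parsing and differ only in how bytes are obtained, which is one argument for keeping the byte-level decode separate from the I/O layer.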
Additional context