Description
Describe the enhancement requested
As a user of the Arrow Dataset API, I would like to write partitioned data while preserving Parquet schema information.
For example, I may have an arrow::Table containing Parquet INTERVAL data stored in its physical-type representation, a fixed_len_byte_array of length 12. Because no arrow::Schema type is a direct match, I use an arrow::FixedSizeBinaryBuilder to create the table (see the sketch below). The existing writer properties and arrow::dataset::FileSystemDataset::Write() do not support providing a native schema for the output file format. As a result, Parquet logical types that have no arrow::Schema equivalent are lost when the data is written.
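For concreteness, a minimal sketch of how such a table has to be built today; the function name and sample value are illustrative only:

```cpp
// Build a single-column arrow::Table holding a Parquet INTERVAL value in its
// physical form: a 12-byte fixed_len_byte_array packing three little-endian
// uint32 values (months, days, milliseconds). There is no Arrow logical type
// that maps back to Parquet INTERVAL, so fixed_size_binary(12) is the closest fit.
#include <arrow/api.h>

arrow::Result<std::shared_ptr<arrow::Table>> MakeIntervalTable() {
  auto type = arrow::fixed_size_binary(12);
  arrow::FixedSizeBinaryBuilder builder(type);

  // One interval: 1 month, 2 days, 3000 ms (0x0BB8), little-endian packed.
  const uint8_t value[12] = {1, 0, 0, 0, 2, 0, 0, 0, 0xB8, 0x0B, 0, 0};
  ARROW_RETURN_NOT_OK(builder.Append(value));

  std::shared_ptr<arrow::Array> array;
  ARROW_RETURN_NOT_OK(builder.Finish(&array));

  auto schema = arrow::schema({arrow::field("duration", type)});
  return arrow::Table::Make(schema, {array});
}
```

When this table is written through the dataset writer, the output column is plain FIXED_LEN_BYTE_ARRAY with no INTERVAL annotation, which is the information loss described above.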
Some Parquet logical types affected:
- INTERVAL
- UUID
- ENUM
- JSON
- BSON
Current behavior: when using the Arrow Dataset API, data-type round-tripping is limited to the types that arrow::Schema can represent.
Desired behavior: allow the user to supply a target Parquet schema for the output files, so that logical types without an Arrow equivalent are preserved.
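One possible shape for this, sketched purely as an illustration: the target_parquet_schema field below does not exist in the current API and is only a placeholder for the requested capability; everything else uses the existing dataset write options.

```cpp
// Hypothetical sketch of the requested enhancement. The commented-out
// target_parquet_schema field is NOT part of Arrow today; it stands in for
// a caller-supplied Parquet schema that would override the logical types
// otherwise derived from the Arrow schema.
#include <arrow/dataset/api.h>
#include <arrow/dataset/file_parquet.h>
#include <arrow/filesystem/api.h>

arrow::Status WriteWithTargetSchema(
    std::shared_ptr<arrow::dataset::Scanner> scanner,
    std::shared_ptr<arrow::fs::FileSystem> filesystem,
    std::shared_ptr<arrow::dataset::Partitioning> partitioning) {
  auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();
  auto options = std::static_pointer_cast<arrow::dataset::ParquetFileWriteOptions>(
      format->DefaultWriteOptions());

  // Hypothetical field (not in the current API): a target Parquet schema
  // whose logical types (INTERVAL, UUID, ...) are applied to the output.
  // options->target_parquet_schema = interval_group_node;

  arrow::dataset::FileSystemDatasetWriteOptions write_options;
  write_options.file_write_options = options;
  write_options.filesystem = std::move(filesystem);
  write_options.base_dir = "/tmp/out";
  write_options.partitioning = std::move(partitioning);
  write_options.basename_template = "part-{i}.parquet";
  return arrow::dataset::FileSystemDataset::Write(write_options, std::move(scanner));
}
```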
Component(s)
C++