[C++][Dataset] Parquet schema lost on dataset write #45969

@beryan

Description

Describe the enhancement requested

As a user of the Arrow Dataset API, I would like to write partitioned data while preserving Parquet schema information.

For example, I may have an arrow::Table containing Parquet INTERVAL data stored in its physical representation, a fixed_len_byte_array of length 12. Because no arrow::Schema type is a direct match, I use an arrow::FixedSizeBinaryBuilder to create the table. Neither the existing writer properties nor arrow::dataset::FileSystemDataset::Write() supports providing a native Parquet schema for the output file format. As a result, Parquet logical types that have no arrow::Schema equivalent are lost when the data is written.
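A minimal sketch of the current workflow (column name and output path are illustrative). The INTERVAL payload is built as plain fixed_size_binary(12), so the file produced by FileSystemDataset::Write() carries only the physical type, with no way to attach the INTERVAL annotation:

```cpp
#include <arrow/api.h>
#include <arrow/dataset/api.h>
#include <arrow/filesystem/api.h>

arrow::Status WriteIntervalDataset() {
  // INTERVAL's physical layout: fixed_len_byte_array(12) holding three
  // little-endian uint32 values (months, days, milliseconds).
  auto type = arrow::fixed_size_binary(12);
  arrow::FixedSizeBinaryBuilder builder(type);
  const uint8_t one_month[12] = {1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
  ARROW_RETURN_NOT_OK(builder.Append(one_month));
  std::shared_ptr<arrow::Array> array;
  ARROW_RETURN_NOT_OK(builder.Finish(&array));

  auto schema = arrow::schema({arrow::field("interval_col", type)});
  auto table = arrow::Table::Make(schema, {array});

  // Standard dataset write; nothing here accepts a target Parquet schema,
  // so the column is written as bare FIXED_LEN_BYTE_ARRAY(12).
  auto dataset = std::make_shared<arrow::dataset::InMemoryDataset>(table);
  ARROW_ASSIGN_OR_RAISE(auto scanner_builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, scanner_builder->Finish());

  auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();
  arrow::dataset::FileSystemDatasetWriteOptions write_options;
  write_options.file_write_options = format->DefaultWriteOptions();
  write_options.filesystem = std::make_shared<arrow::fs::LocalFileSystem>();
  write_options.base_dir = "/tmp/interval_dataset";  // illustrative path
  write_options.partitioning = arrow::dataset::Partitioning::Default();
  write_options.basename_template = "part-{i}.parquet";
  return arrow::dataset::FileSystemDataset::Write(write_options, scanner);
}
```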

Some Parquet logical types affected:

  • interval
  • uuid
  • enum
  • json
  • bson
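For reference, parquet-cpp already exposes factories for these annotations; the comments give the physical type each one decorates, per the Parquet LogicalTypes specification:

```cpp
#include <parquet/types.h>

// These annotations have no distinct arrow::DataType, so they are lost on
// dataset write even though parquet-cpp can represent them.
auto interval = parquet::LogicalType::Interval();  // FIXED_LEN_BYTE_ARRAY(12)
auto uuid     = parquet::LogicalType::UUID();      // FIXED_LEN_BYTE_ARRAY(16)
auto enum_    = parquet::LogicalType::Enum();      // BYTE_ARRAY (UTF-8)
auto json     = parquet::LogicalType::JSON();      // BYTE_ARRAY (UTF-8)
auto bson     = parquet::LogicalType::BSON();      // BYTE_ARRAY
```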

Current behavior: when using the Arrow Dataset API, data type round-tripping is limited to the types an arrow::Schema can represent.

Desired behavior: allow the user to supply a target Parquet schema for the written files, so that logical type annotations without an arrow::Schema equivalent are preserved.
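One possible shape for the enhancement, purely as a sketch: build the target schema with parquet-cpp's schema API and hand it to the dataset writer through ParquetFileWriteOptions. The `parquet_schema` member below is hypothetical; no such option exists in Arrow today.

```cpp
#include <arrow/dataset/file_parquet.h>
#include <parquet/schema.h>
#include <parquet/types.h>

std::shared_ptr<arrow::dataset::ParquetFileWriteOptions> MakeWriteOptions() {
  // Build the desired target schema with parquet-cpp's schema API.
  parquet::schema::NodeVector parquet_fields;
  parquet_fields.push_back(parquet::schema::PrimitiveNode::Make(
      "interval_col", parquet::Repetition::OPTIONAL,
      parquet::LogicalType::Interval(),  // the annotation to preserve
      parquet::Type::FIXED_LEN_BYTE_ARRAY, /*primitive_length=*/12));
  auto target_schema = std::static_pointer_cast<parquet::schema::GroupNode>(
      parquet::schema::GroupNode::Make("schema", parquet::Repetition::REQUIRED,
                                       parquet_fields));

  auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();
  auto options = std::static_pointer_cast<arrow::dataset::ParquetFileWriteOptions>(
      format->DefaultWriteOptions());
  // HYPOTHETICAL: a member like this is what the request amounts to; it does
  // not exist in Arrow today.
  // options->parquet_schema = target_schema;
  (void)target_schema;  // unused until such an option exists
  return options;
}
```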

Component(s)

C++
