Extension type <--> Parquet LogicalType registry / user defined mappings #8479

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The Parquet type system includes LogicalTypes without a direct Arrow equivalent, such as JSON, Variant, and UUID.

However, Arrow includes the idea of "Extension" types that add extra semantics to an existing Arrow physical type, and the arrow-rs parquet reader will automatically map the relevant Parquet types to a canonical Arrow extension type if the arrow_canonical_extension_types feature is enabled.
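For readers less familiar with extension types: in the Arrow format an extension type is a storage type plus field metadata under the `ARROW:extension:name` key (with optional `ARROW:extension:metadata`). A minimal sketch of that idea follows. `Field` and `StorageType` here are simplified stand-ins invented for illustration, not the real `arrow_schema` types, and `with_extension` / `extension_name` are hypothetical helpers.

```rust
use std::collections::HashMap;

// Simplified stand-in for an Arrow storage (physical) type. Assumption for illustration only.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum StorageType {
    FixedSizeBinary(usize),
    Utf8,
}

// Simplified stand-in for an Arrow field. Assumption for illustration only.
#[allow(dead_code)]
#[derive(Debug, Clone)]
struct Field {
    name: String,
    storage: StorageType,
    metadata: HashMap<String, String>,
}

// Metadata key the Arrow format uses to name an extension type.
const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";

// Hypothetical helper: tag a field as an extension type without changing its storage type.
fn with_extension(mut field: Field, extension_name: &str) -> Field {
    field
        .metadata
        .insert(EXTENSION_NAME_KEY.to_string(), extension_name.to_string());
    field
}

// Hypothetical helper: read the extension name back, if any.
fn extension_name(field: &Field) -> Option<&str> {
    field.metadata.get(EXTENSION_NAME_KEY).map(String::as_str)
}
```

For example, tagging a `FixedSizeBinary(16)` field with `"arrow.uuid"` leaves the storage type untouched and only adds metadata, which is why a Parquet UUID column can be surfaced either as plain bytes or as the canonical UUID extension type.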

Right now, though, that mapping of Parquet LogicalType --> Arrow (Canonical) ExtensionType is hard-coded, which is unfortunate as it means:

  1. Users cannot override the mapping (if they want to write their own implementation of Parquet LogicalTypes, for example)
  2. The code has a bunch of #[cfg(...)] sprinkled in it -- see Support parquet canonical extension type roundtrip #8409 for an example

Describe the solution you'd like
@paleolimbot suggested on https://github.com/apache/arrow-rs/pull/8409/files#r2371071848 that we could maintain some sort of registry that would be more ergonomic to configure and would allow user-defined extension types.

Describe alternatives you've considered
Quoting @paleolimbot on https://github.com/apache/arrow-rs/pull/8409/files#r2371071848:

you could also consider an injection approach like:

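pub trait ParquetArrowExtension {
    // Parquet -> Arrow: Ok(None) presumably means this extension does not apply.
    // (Implementations can still bind the argument as `mut arrow_field`.)
    fn try_from_logical_type(&self, arrow_field: Field, logical_type: &LogicalType) -> Result<Option<Field>>;
    // Arrow -> Parquet: Ok(None) presumably means this extension does not apply.
    fn try_to_logical_type(&self, arrow_field: &Field) -> Result<Option<LogicalType>>;
}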

...and maintain a registry of those in the reader/writer options. Then you don't need compile time flags to support the extensions (something like DataFusion or a derivative could wire it all together at runtime).
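To make the registry idea concrete, here is a rough sketch of how it might look. This is not an existing arrow-rs API: `ExtensionRegistry`, `UuidExtension`, and the first-match-wins lookup are all hypothetical, and `Field`, `LogicalType`, and `Result` are simplified stand-ins so the example compiles on its own without the `arrow` or `parquet` crates.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Simplified stand-ins for parquet's LogicalType and arrow's Field (assumptions).
#[derive(Debug, Clone, PartialEq)]
pub enum LogicalType {
    Json,
    Uuid,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub metadata: HashMap<String, String>,
}

pub type Result<T> = std::result::Result<T, String>;

// The trait from the quoted suggestion, with named parameters.
pub trait ParquetArrowExtension {
    fn try_from_logical_type(&self, arrow_field: Field, logical_type: &LogicalType) -> Result<Option<Field>>;
    fn try_to_logical_type(&self, arrow_field: &Field) -> Result<Option<LogicalType>>;
}

const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";

// Hypothetical extension mapping Parquet UUID <--> the canonical "arrow.uuid" extension.
pub struct UuidExtension;

impl ParquetArrowExtension for UuidExtension {
    fn try_from_logical_type(&self, mut arrow_field: Field, logical_type: &LogicalType) -> Result<Option<Field>> {
        if *logical_type != LogicalType::Uuid {
            return Ok(None); // not ours, let another extension try
        }
        arrow_field
            .metadata
            .insert(EXTENSION_NAME_KEY.to_string(), "arrow.uuid".to_string());
        Ok(Some(arrow_field))
    }

    fn try_to_logical_type(&self, arrow_field: &Field) -> Result<Option<LogicalType>> {
        let is_uuid = arrow_field.metadata.get(EXTENSION_NAME_KEY).map(String::as_str) == Some("arrow.uuid");
        Ok(is_uuid.then_some(LogicalType::Uuid))
    }
}

// Hypothetical registry that reader/writer options could carry.
#[derive(Default)]
pub struct ExtensionRegistry {
    extensions: Vec<Arc<dyn ParquetArrowExtension>>,
}

impl ExtensionRegistry {
    pub fn register(&mut self, extension: Arc<dyn ParquetArrowExtension>) {
        self.extensions.push(extension);
    }

    // Reader side: the first extension that returns Some wins; otherwise the field is unchanged.
    pub fn from_logical_type(&self, field: Field, logical_type: &LogicalType) -> Result<Field> {
        for extension in &self.extensions {
            if let Some(mapped) = extension.try_from_logical_type(field.clone(), logical_type)? {
                return Ok(mapped);
            }
        }
        Ok(field)
    }

    // Writer side: the first extension that recognizes the field decides the logical type.
    pub fn to_logical_type(&self, field: &Field) -> Result<Option<LogicalType>> {
        for extension in &self.extensions {
            if let Some(logical_type) = extension.try_to_logical_type(field)? {
                return Ok(Some(logical_type));
            }
        }
        Ok(None)
    }
}
```

With something like this, a user could register their own extension ahead of the built-in ones at runtime to override a mapping, and the canonical mappings would no longer need #[cfg(...)] guards at each call site. Whether registration order should decide precedence is an open design question; first-match-wins is just the simplest choice for the sketch.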
