Field metadata for dictionary-encoded extension types is lost #7982

@paleolimbot

Describe the bug

The Dictionary variant of the DataType enum holds two bare DataType values and has no place for field metadata, so extension types on the dictionary values are dropped when they pass through arrow-rs structures:

Dictionary(Box<DataType>, Box<DataType>),

The definitions of RunEndEncoded and some other variants use a FieldRef instead, and I'm wondering whether it was a deliberate choice not to do the same for Dictionary or whether it has just never come up.
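To make the structural difference concrete, here is a minimal sketch in plain Python. It is a toy model only: every class and the `value_metadata` helper are hypothetical stand-ins invented for illustration and are not part of arrow-rs, arro3, or pyarrow. It assumes only what is described above, namely that Dictionary holds two bare data types while RunEndEncoded holds fields.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


# Hypothetical, simplified stand-ins for the types discussed above.
@dataclass
class DataType:
    """Base class for the toy type hierarchy."""


@dataclass
class Int32(DataType):
    pass


@dataclass
class Binary(DataType):
    pass


@dataclass
class Field:
    """A named type plus key/value metadata (where extension info lives)."""
    name: str
    data_type: DataType
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class Dictionary(DataType):
    """Models Dictionary(Box<DataType>, Box<DataType>): bare types only."""
    key_type: DataType
    value_type: DataType


@dataclass
class RunEndEncoded(DataType):
    """Models a FieldRef-based variant: both children are full fields."""
    run_ends: Field
    values: Field


def value_metadata(data_type: DataType) -> Optional[Dict[str, str]]:
    """Return the metadata on the values child, if the shape can hold any."""
    if isinstance(data_type, RunEndEncoded):
        return data_type.values.metadata
    return None  # the Dictionary model has no slot for metadata


wkb = Field(
    "values",
    Binary(),
    {"ARROW:extension:name": "geoarrow.wkb", "ARROW:extension:metadata": "{}"},
)

ree = RunEndEncoded(Field("run_ends", Int32()), wkb)
# Only the storage type fits into the Dictionary model; the metadata is left behind.
dictionary = Dictionary(Int32(), wkb.data_type)

print(value_metadata(ree))
print(value_metadata(dictionary))
```

In this model the extension name survives in the field-based shape but cannot be represented in the type-only shape, which is the behavior the reproduction below shows.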

To Reproduce

I used arro3 to reproduce:

import arro3.core as a3
import geoarrow.pyarrow as ga
import nanoarrow as na
import pyarrow as pa

# Built with pyarrow: the extension metadata survives on the dictionary (values) schema
c_schema = na.c_schema(pa.dictionary(pa.int32(), ga.wkb()))

c_schema.metadata is None
#> True
c_schema.dictionary.metadata
#> <nanoarrow._schema.SchemaMetadata>
#> - b'ARROW:extension:name': b'geoarrow.wkb'
#> - b'ARROW:extension:metadata': b'{}'

# Built with arro3, which goes through the arrow-rs DataType: the metadata is gone
c_schema2 = na.c_schema(a3.DataType.dictionary(pa.int32(), ga.wkb()))
c_schema2.metadata is None
#> True
c_schema2.dictionary.metadata is None
#> True

Expected behavior

I would have expected the metadata to roundtrip through the arrow-rs data type representation.

Additional context

Parquet readers occasionally return dictionary-encoded arrays on read, so whether an array is dictionary-encoded is not entirely under the user's control.
