[C++, Parquet] dictionary(..., large_string) type not preserved when writing to Parquet #37875

@mattaubury

Description

(tested on pyarrow-13.0.0, Linux x64)

When a dictionary-encoded column whose values are of large_string type is written to a Parquet file and read back, the dictionary value type comes back as plain string instead of large_string.

Repro in Python but I see the same in C++:

>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> import pyarrow.parquet as pq

>>> strings = pc.dictionary_encode(pa.array(["foo", "bar", "foo"], pa.large_string()))
>>> table = pa.table([strings], ["strings"])
>>> table.schema
strings: dictionary<values=large_string, indices=int32, ordered=0>

>>> pq.write_table(table, "table.parquet")
>>> pq.read_table("table.parquet").schema
strings: dictionary<values=string, indices=int32, ordered=0>

I'd expect to get it back as a dictionary<values=large_string, indices=int32, ordered=0> type.

Component(s)

C++, Parquet, Python
