
[C++/Python] Writing dictionary encoded columns to parquet is extremely slow when using chunk size #21577

@asfimport

Description


Currently, a workaround is in place for writing dictionary-encoded columns to Parquet: the dictionary-encoded array is converted to its plain representation before it is written. This is painfully slow, because the entire array is converted again for every row group.
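As an illustration of what that conversion involves, here is a minimal sketch using pyarrow's dictionary_encode/dictionary_decode (the exact call path inside the Parquet writer may differ):

import pyarrow as pa

# A dictionary-encoded column, as produced from a pandas "category" dtype.
dict_arr = pa.array(["A", "B"] * 100000).dictionary_encode()

# The workaround materializes the plain values before writing. With small row
# groups, this conversion is repeated for every row group, so the total cost
# grows with (number of row groups) x (column length).
plain_arr = dict_arr.dictionary_decode()
print(plain_arr.type)  # string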

The following example is orders of magnitude slower than the non-dictionary-encoded version:

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
df = pd.DataFrame({"col": ["A", "B"] * 100000}).astype("category")
table = pa.Table.from_pandas(df)
buf = pa.BufferOutputStream()
pq.write_table(
    table,
    buf,
    chunk_size=100,
)
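
Continuing from the snippet above, the non-dictionary-encoded path can be approximated by casting the categorical column back to plain strings before writing. This is only a rough sketch of the comparison described above, not a suggested fix:

# Same write, but with the column converted back to plain strings first,
# which avoids the repeated per-row-group decode.
plain_table = pa.Table.from_pandas(df.astype({"col": "object"}))
plain_buf = pa.BufferOutputStream()
pq.write_table(
    plain_table,
    plain_buf,
    chunk_size=100,
)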
 

Reporter: Florian Jetter / @fjetter
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-5089. Please see the migration documentation for further details.
