
[C++/Python] Writing dictionary encoded columns to parquet is extremely slow when using chunk size #21577

@asfimport

Description


Currently, a workaround is in place for writing dictionary-encoded columns to Parquet: the dictionary-encoded array is converted to its plain representation before it is written. This is painfully slow, because the entire array is converted again for every row group.
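As an illustration of what that conversion involves, here is a minimal sketch using pyarrow's dictionary_encode/dictionary_decode (the exact call path inside the Parquet writer may differ):

import pyarrow as pa

# A dictionary-encoded column, as produced from a pandas "category" dtype.
dict_arr = pa.array(["A", "B"] * 100000).dictionary_encode()

# The workaround materializes the plain values before writing. With small row
# groups, this conversion is repeated for every row group, so the total cost
# grows with (number of row groups) x (column length).
plain_arr = dict_arr.dictionary_decode()
print(plain_arr.type)  # string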

The following example is orders of magnitude slower than the non-dictionary-encoded version:

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
df = pd.DataFrame({"col": ["A", "B"] * 100000}).astype("category")
table = pa.Table.from_pandas(df)
buf = pa.BufferOutputStream()
pq.write_table(
    table,
    buf,
    chunk_size=100,
)
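
Continuing from the snippet above, the non-dictionary-encoded path can be approximated by casting the categorical column back to plain strings before writing. This is only a rough sketch of the comparison described above, not a suggested fix:

# Same write, but with the column converted back to plain strings first,
# which avoids the repeated per-row-group decode.
plain_table = pa.Table.from_pandas(df.astype({"col": "object"}))
plain_buf = pa.BufferOutputStream()
pq.write_table(
    plain_table,
    plain_buf,
    chunk_size=100,
)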
 

Reporter: Florian Jetter / @fjetter
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-5089. Please see the migration documentation for further details.
