[Python] Reading a dictionary column from Parquet results in disproportionate memory usage

I'm using pyarrow to read a 40 MB Parquet file.

When I read every column except "body", the process peaks at 170 MB of memory.

Reading only the "body" column uses more than 6 GB of memory.

I made the file publicly accessible: s3://dhavivresearch/pyarrow/demofile.parquet

**Reporter**: [Daniel Haviv](https://issues.apache.org/jira/browse/ARROW-5993)
**Assignee**: [Wes McKinney](https://issues.apache.org/jira/browse/ARROW-5993) / @wesm
#### Related issues:
- [[Python] too large memory cost using pyarrow.parquet.read_table with use_threads=True](https://github.com/apache/arrow/issues/22462) (duplicates)
- [[Python] too large memory cost using pyarrow.parquet.read_table with use_threads=True](https://github.com/apache/arrow/issues/22462) (is caused by)
- [[C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray](https://github.com/apache/arrow/issues/20110) (relates to)

<sub>**Note**: *This issue was originally created as [ARROW-5993](https://issues.apache.org/jira/browse/ARROW-5993). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*</sub>

[Python] Reading a dictionary column from Parquet results in disproportionate memory usage #22400
