[Python] Reading a dictionary column from Parquet results in disproportionate memory usage #22400

@asfimport

I'm using pyarrow to read a 40MB parquet file.

When reading all of the columns except the "body" column, the process peaks at 170MB.

Reading only the "body" column results in over 6GB of memory used.

I made the file publicly accessible: s3://dhavivresearch/pyarrow/demofile.parquet

Reporter: Daniel Haviv
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-5993. Please see the migration documentation for further details.
