[Python] Regression memory issue when calling pandas.read_parquet #22461

@asfimport

Description

I have a ~3MB parquet file with the following schema:

bag_stamp: timestamp[ns]
transforms_[]_.header.seq: list<item: int64>
  child 0, item: int64
transforms_[]_.header.stamp: list<item: timestamp[ns]>
  child 0, item: timestamp[ns]
transforms_[]_.header.frame_id: list<item: string>
  child 0, item: string
transforms_[]_.child_frame_id: list<item: string>
  child 0, item: string
transforms_[]_.transform.translation.x: list<item: double>
  child 0, item: double
transforms_[]_.transform.translation.y: list<item: double>
  child 0, item: double
transforms_[]_.transform.translation.z: list<item: double>
  child 0, item: double
transforms_[]_.transform.rotation.x: list<item: double>
  child 0, item: double
transforms_[]_.transform.rotation.y: list<item: double>
  child 0, item: double
transforms_[]_.transform.rotation.z: list<item: double>
  child 0, item: double
transforms_[]_.transform.rotation.w: list<item: double>
  child 0, item: double

If I read it with pandas.read_parquet() using pyarrow 0.13.0, everything works and the file loads almost instantly. If I try the same with 0.14.0 or 0.14.1, it takes a long time and uses ~10GB of RAM. When there is not enough memory available, the process is often killed for running out of memory (OOM). However, the following snippet works fine with all of these versions:

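import pyarrow as pa
import pyarrow.parquet as pq

# input_file: path to the parquet file; columns: list of column names, or None for all columns
parquet_file = pq.ParquetFile(input_file)
tables = []
# Read one row group at a time instead of the whole file in a single call
for row_group in range(parquet_file.num_row_groups):
    tables.append(parquet_file.read_row_group(row_group, columns=columns, use_pandas_metadata=True))
df = pa.concat_tables(tables).to_pandas()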
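
To put numbers on the difference between the two code paths, one rough option is to record how much the process's peak resident memory grows around each load. The following is only a minimal sketch, assuming a Unix-like system (the standard-library resource module is not available on Windows). The helper names peak_rss_mib and measure_peak_growth are hypothetical and not part of pandas or pyarrow.

```python
import resource
import sys


def peak_rss_mib():
    """Return this process's peak resident set size in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def measure_peak_growth(load):
    """Call load() and return (result, growth of the peak RSS in MiB).

    `load` is any zero-argument callable, for example a lambda that
    wraps pandas.read_parquet or the row-group loop.
    """
    before = peak_rss_mib()
    result = load()
    return result, peak_rss_mib() - before
```

For example, calling measure_peak_growth(lambda: pd.read_parquet(input_file)) under each pyarrow version should give comparable figures. Because ru_maxrss is a high-water mark that never decreases, each measurement is best taken in a fresh interpreter process.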

Reporter: Francisco Sanchez

Note: This issue was originally created as ARROW-6059. Please see the migration documentation for further details.
