Method pyarrow.parquet.read_table has memory spikes from version 0.14 #22753

@asfimport

Description

Method pyarrow.parquet.read_table is very slow and causes RAM spikes from version 0.14.0

Reading a 40MB parquet file takes less than 1 second in versions 0.11, 0.12 and 0.13, whereas it takes from 6 to 30 seconds in versions 0.14.x.

This performance impact is easily measured. However, there is another problem that I could only detect on the htop screen: while opening a 40MB parquet file, the process occupies almost 16GB for some milliseconds. The resulting pyarrow table takes around 300MB in the Python process (measured using memory-profiler). This does not happen in versions 0.13 and earlier.
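A minimal sketch of how the read time and the short-lived memory spike could be captured in a script rather than by watching htop. It assumes a Linux host, where `ru_maxrss` is reported in KiB. The helper names `measure` and `profile_read_table` and the file path in the usage note are hypothetical and not part of the original report.

```python
import resource
import time


def measure(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds, peak_rss_kib)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    # ru_maxrss is the process-lifetime high-water mark of resident memory
    # (KiB on Linux), so it also records spikes lasting only milliseconds.
    peak_rss_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return result, elapsed, peak_rss_kib


def profile_read_table(path):
    """Time pyarrow.parquet.read_table on `path` and report peak RSS."""
    # Imported lazily so that `measure` stays usable without pyarrow installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    table, elapsed, peak_rss_kib = measure(pq.read_table, path)
    return {
        "pyarrow": pa.__version__,
        "rows": table.num_rows,
        "seconds": elapsed,
        "peak_rss_mib": peak_rss_kib / 1024,
    }
```

Calling `profile_read_table("example.parquet")` (placeholder path) once per installed pyarrow version gives numbers that can be compared directly. Because `ru_maxrss` never decreases within a process, each version should be measured in a fresh Python process.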

Environment: Ubuntu 18, 16GB RAM, 4 CPUs
Reporter: Renan Alves Fonseca

Related issues:

Note: This issue was originally created as ARROW-6380. Please see the migration documentation for further details.
