[Python] read_csv from Python is slow on some workloads #26299

@asfimport

Description

Hi!

I've noticed that pyarrow.csv.read_csv can be slow on real workloads, processing data at around 0.5 GiB/s. "Real workloads" here means many string, float, and all-null columns, and large files (5-10 GiB), though the file size didn't matter much.
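For reference, a minimal sketch of the kind of throughput measurement I mean (the path here is a placeholder, not the attached benchmark itself):

```python
import os
import time
import pyarrow.csv

# Placeholder path; the real benchmark uses generated 5-10 GiB files.
path = "data.csv"

start = time.perf_counter()
table = pyarrow.csv.read_csv(path)
elapsed = time.perf_counter() - start

size_gib = os.path.getsize(path) / 2**30
print(f"read {size_gib:.2f} GiB in {elapsed:.1f} s -> {size_gib / elapsed:.2f} GiB/s")
```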

Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of the time is spent in shared pointer lock mechanisms (though I'm not sure this is to be trusted). I've attached the dumps in SVG format.

I've also attached a script and a Dockerfile to run a benchmark, which reproduces the speeds I see. Building the Docker image and running it on a large Azure machine, I get speeds of around 0.3-1.0 GiB/s, mostly around 0.5 GiB/s.
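The attached script and Dockerfile (also in the repo below) are the authoritative benchmark; the following is only a rough sketch of how a file with a similar column mix (string, float, and all-null columns) could be generated, with made-up column counts and a much smaller row count than the 5-10 GiB files I tested:

```python
import numpy as np
import pandas as pd

# Rough sketch only: column counts and row count are placeholders,
# not the values used in the actual benchmark script.
n_rows = 1_000_000
rng = np.random.default_rng(0)

columns = {}
for i in range(10):
    columns[f"str_{i}"] = rng.integers(0, 1000, n_rows).astype(str)
for i in range(10):
    columns[f"float_{i}"] = rng.random(n_rows)
for i in range(10):
    columns[f"null_{i}"] = [None] * n_rows  # all-null columns

pd.DataFrame(columns).to_csv("data.csv", index=False)
```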

This is all also available here: https://github.com/drorspei/arrow-csv-benchmark

Environment:
Machine: Azure, 48 vCPUs, 384 GiB RAM
OS: Ubuntu 18.04
Dockerfile and script: attached, or here: https://github.com/drorspei/arrow-csv-benchmark

Reporter: Dror Speiser

Note: This issue was originally created as ARROW-10308. Please see the migration documentation for further details.
