Description
Hi!
I've noticed that pyarrow.csv.read_csv can be slow on real workloads, processing data at around 0.5 GiB/s. By "real workloads" I mean many string, float, and all-null columns and a large file size (5-10 GiB), though the file size didn't matter too much.
Moreover, profiling a bit with py-spy, it seems that maybe 30-50% of the time is spent in shared-pointer locking mechanisms (though I'm not sure this is to be trusted). I've attached the dumps in SVG format.
I've also attached a script and a Dockerfile that run a benchmark reproducing the speeds I see. Building the Docker image and running it on a large Azure machine, I get speeds of around 0.3-1.0 GiB/s, mostly around 0.5 GiB/s.
This is all also available here: https://github.com/drorspei/arrow-csv-benchmark
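For reference, a minimal sketch of how the read throughput could be measured with pyarrow.csv.read_csv is shown below. The file path "data.csv" and the timing details are assumptions for illustration only; the attached benchmark-csv.py and Dockerfile remain the actual reproduction.

```python
# Hypothetical timing sketch (not the attached benchmark-csv.py):
# time pyarrow.csv.read_csv on an existing file and report GiB/s.
import os
import time

import pyarrow.csv

path = "data.csv"  # assumed input; any large CSV with string/float/all-null columns
size_gib = os.path.getsize(path) / 2**30

start = time.perf_counter()
table = pyarrow.csv.read_csv(path)
elapsed = time.perf_counter() - start

print(f"{size_gib:.2f} GiB in {elapsed:.1f} s -> {size_gib / elapsed:.2f} GiB/s "
      f"({table.num_rows} rows, {table.num_columns} columns)")
```

To reproduce the flame graphs, a script like this could be run under py-spy, e.g. `py-spy record --native -o profile.svg -- python benchmark.py` (assuming py-spy is installed in the environment).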
Environment: Azure machine, 48 vCPUs, 384 GiB RAM
OS: Ubuntu 18.04
Dockerfile and script: attached, or here: https://github.com/drorspei/arrow-csv-benchmark
Reporter: Dror Speiser
Related issues:
- [C++] Consider using fast-double-parser (is related to)
- [C++] Improve UTF8 validation speed and CSV string conversion (is related to)
Original Issue Attachments:
- arrow-csv-benchmark-plot.png
- arrow-csv-benchmark-times.csv
- benchmark-csv.py
- Dockerfile
- profile1.svg
- profile2.svg
- profile3.svg
- profile4.svg
Note: This issue was originally created as ARROW-10308. Please see the migration documentation for further details.