Description
Hi!
I've noticed that pyarrow.csv.read_csv can be slow on real workloads, processing data at around 0.5 GiB/s. By "real workloads" I mean many string, float, and all-null columns and a large file size (5-10 GiB), though the file size didn't matter too much.
Moreover, profiling a bit with py-spy, it seems that maybe 30-50% of the time is spent in shared-pointer locking mechanisms (though I'm not sure this is to be trusted). I've attached the dumps in SVG format.
I've also attached a script and a Dockerfile that run a benchmark reproducing the speeds I see. Building the Docker image and running it on a large Azure machine, I get speeds of around 0.3-1.0 GiB/s, mostly around 0.5 GiB/s.
This is all also available here: https://github.com/drorspei/arrow-csv-benchmark
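For reference, a minimal sketch of how the read throughput could be measured with pyarrow.csv.read_csv is shown below. The file path "data.csv" and the timing details are assumptions for illustration only; the attached benchmark-csv.py and Dockerfile remain the actual reproduction.

```python
# Hypothetical timing sketch (not the attached benchmark-csv.py):
# time pyarrow.csv.read_csv on an existing file and report GiB/s.
import os
import time

import pyarrow.csv

path = "data.csv"  # assumed input; any large CSV with string/float/all-null columns
size_gib = os.path.getsize(path) / 2**30

start = time.perf_counter()
table = pyarrow.csv.read_csv(path)
elapsed = time.perf_counter() - start

print(f"{size_gib:.2f} GiB in {elapsed:.1f} s -> {size_gib / elapsed:.2f} GiB/s "
      f"({table.num_rows} rows, {table.num_columns} columns)")
```

To reproduce the flame graphs, a script like this could be run under py-spy, e.g. `py-spy record --native -o profile.svg -- python benchmark.py` (assuming py-spy is installed in the environment).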
Environment: Azure machine, 48 vCPUs, 384 GiB RAM
OS: Ubuntu 18.04
Dockerfile and script: attached, or here: https://github.com/drorspei/arrow-csv-benchmark
Reporter: Dror Speiser
Related issues:
- [C++] Consider using fast-double-parser (is related to)
- [C++] Improve UTF8 validation speed and CSV string conversion (is related to)
Original Issue Attachments:
- arrow-csv-benchmark-plot.png
- arrow-csv-benchmark-times.csv
- benchmark-csv.py
- Dockerfile
- profile1.svg
- profile2.svg
- profile3.svg
- profile4.svg
Note: This issue was originally created as ARROW-10308. Please see the migration documentation for further details.