Description
Describe the bug, including details regarding any error messages, version, and platform.
I'm debugging slow performance in Dask DataFrame and have tracked things down, I think, to slow Parquet deserialization in PyArrow. Based on what I know of Arrow, I expect to get GB/s, but I'm getting more like 100-200 MB/s. What's more, the speed seems to depend strongly on the environment (Linux vs. macOS). I could use help tracking this down.
Experiment
I've isolated the performance difference to the following simple experiment (notebook here):
# Create dataset
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np
import pandas as pd
import time
import io
x = np.random.randint(0, 100000, size=(1000000, 100))
df = pd.DataFrame(x)
t = pa.Table.from_pandas(df)
# Write to local parquet file
pq.write_table(t, "foo.parquet")
# Time Disk speeds
start = time.time()
with open("foo.parquet", mode="rb") as f:
bytes = f.read()
nbytes = len(bytes)
stop = time.time()
print("Disk Bandwidth:", int(nbytes / (stop - start) / 2**20), "MiB/s")
# Time Arrow Parquet Speeds
start = time.time()
_ = pq.read_table("foo.parquet")
stop = time.time()
print("PyArrow Read Bandwidth:", int(nbytes / (stop - start) / 2**20), "MiB/s")
# Time In-Memory Read Speeds
start = time.time()
pq.read_table(io.BytesIO(data))
stop = time.time()
print("PyArrow In-Memory Bandwidth:", int(nbytes / (stop - start) / 2**20), "MiB/s")
Results
I've tried this on a variety of cloud machines (Intel/ARM, VMs/metal, 8-core/64-core, AWS/GCP) and they all get fast disk speeds (probably cached), but only about 150 MB/s Parquet deserialization speeds. I've also tried this on two laptops, a MacBook Pro and a ThinkPad running Ubuntu, and I get the following:
- MacBook Pro: ~1 GiB/s PyArrow deserialization performance (what I expect)
- ThinkPad (Ubuntu): ~150 MB/s PyArrow deserialization performance
In all cases I've installed the latest release, PyArrow 13, from conda-forge.
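If it helps, this is the kind of environment report I can collect on each machine. It's a minimal sketch using standard PyArrow introspection calls (pa.cpu_count, pa.io_thread_count); nothing in the benchmark above depends on it:
# Report environment details that might explain the difference:
# library versions, platform, and Arrow's configured thread pools.
import platform
import numpy as np
import pyarrow as pa
print("pyarrow:", pa.__version__)
print("numpy:", np.__version__)
print("platform:", platform.platform())
print("arrow cpu threads:", pa.cpu_count())
print("arrow io threads:", pa.io_thread_count())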
Summary
I'm confused by this. I've seen Arrow go much faster than this. I've tried to isolate the problem as much as possible to identify something in my environment as the cause, but I can't. Everything seems to point to the conclusion that "PyArrow Parquet is just slow on Linux", which doesn't make sense to me.
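One further isolation step I can try, in case thread scheduling differs between these machines, is comparing single-threaded and multi-threaded reads. This is a minimal sketch using pa.set_cpu_count and the use_threads flag on read_table (both standard APIs); the thread counts are just for illustration:
# Compare single-threaded vs. multi-threaded reads of the same file to see
# whether the gap tracks parallelism or raw per-core decode speed.
import os
import time
import pyarrow as pa
import pyarrow.parquet as pq
nbytes = os.path.getsize("foo.parquet")
default_threads = pa.cpu_count()
for threads in (1, default_threads):
    pa.set_cpu_count(threads)
    start = time.time()
    pq.read_table("foo.parquet", use_threads=(threads > 1))
    stop = time.time()
    print(threads, "thread(s):", int(nbytes / (stop - start) / 2**20), "MiB/s")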
I'd welcome any help. Thank you all for your work over the years.
Component(s)
Parquet, Python