### Description
On Linux, pyarrow.parquet.write_to_dataset shows a large performance regression in Arrow 12.0 versus 11.0.
The following results were collected on Ubuntu 22.04.2 LTS (kernel 5.15.0-71-generic) with an Intel Haswell 4-core CPU @ 3.6 GHz, 16 GB RAM, and a Samsung 840 Pro SSD. Each entry is the elapsed time in seconds to write a single int64 column containing the integers [0, ..., length-1], with no compression and no multi-threading:
| Array length | Arrow 11 (s) | Arrow 12 (s) |
|---|---|---|
| 1,000,000 | 0.1 | 0.1 |
| 2,000,000 | 0.2 | 0.4 |
| 4,000,000 | 0.3 | 1.6 |
| 8,000,000 | 0.8 | 6.2 |
| 16,000,000 | 2.3 | 24.4 |
| 32,000,000 | 6.5 | 94.1 |
| 64,000,000 | 13.5 | 371.7 |
The output directory was deleted before each run. Note that Arrow 11 scales roughly linearly with array length, while Arrow 12 scales superlinearly: each doubling of the length roughly quadruples the elapsed time.
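The exact cleanup step isn't shown in the script below; a minimal sketch, assuming a plain `shutil` call against the same path:

```python
import shutil

# Remove any previous dataset output so each run starts from an empty directory.
shutil.rmtree('/tmp/test.parquet', ignore_errors=True)
```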
"""check.py"""
import sys
import time
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
def main():
path = '/tmp/test.parquet'
length = 10_000_000 if len(sys.argv) < 2 else int(sys.argv[1])
table = pa.Table.from_arrays([pa.array(np.arange(length))], names=['A'])
t0 = time.perf_counter()
pq.write_to_dataset(
table, path, schema=table.schema, use_legacy_dataset=False, use_threads=False, compression=None
)
duration = time.perf_counter() - t0
print(f'{duration:.2f}s')
if __name__ == '__main__':
main()
Running git bisect on local builds points to commit 660d259 ([C++] Add ordered/segmented aggregation Substrait extension, #34627).
Following that change, flame graphs show a large amount of additional time spent in arrow::util::EnsureAlignment calling glibc memcpy:

- Before (ddd0a33): ~1.3 s
- After (660d259): ~9.6 s
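That EnsureAlignment time suggests the new code path is copying buffers that fail an alignment check. A minimal sketch for inspecting the alignment of the numpy-backed data buffer from Python; the 64-byte boundary is my assumption about the required alignment, not confirmed:

```python
import numpy as np
import pyarrow as pa

arr = pa.array(np.arange(16_000_000))  # zero-copy view of the numpy buffer
buf = arr.buffers()[1]                 # buffers()[0] is the validity bitmap (None here)

# A nonzero remainder means the data buffer does not start on a 64-byte
# boundary, which would force an aligning copy (assumption: the new code
# path requires 64-byte alignment).
print(hex(buf.address), buf.address % 64)
```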
Reading and pyarrow.parquet.write_table appear unaffected.
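For reference, a minimal sketch of how the single-file write path can be timed on the same data for comparison (the file name and length here are illustrative):

```python
import time

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_arrays([pa.array(np.arange(64_000_000))], names=['A'])

# Single-file write, same uncompressed settings as the dataset benchmark.
t0 = time.perf_counter()
pq.write_table(table, '/tmp/test_single.parquet', compression=None)
print(f'write_table: {time.perf_counter() - t0:.2f}s')
```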
### Component(s)
C++, Parquet