
[C++][Parquet] Parquet write_to_dataset performance regression #35498

@alexhudspith

Description


Describe the bug, including details regarding any error messages, version, and platform.

On Linux, pyarrow.parquet.write_to_dataset shows a large performance regression in Arrow 12.0 versus 11.0.

The following results were collected on Ubuntu 22.04.2 LTS (5.15.0-71-generic), Intel Haswell 4-core @ 3.6 GHz, 16 GB RAM, Samsung 840 Pro SSD. They are elapsed times in seconds to write a single int64 column of integers [0, ..., length-1] with no compression and no multi-threading:

| Array length | Arrow 11 (s) | Arrow 12 (s) |
|-------------:|-------------:|-------------:|
| 1,000,000    | 0.1          | 0.1          |
| 2,000,000    | 0.2          | 0.4          |
| 4,000,000    | 0.3          | 1.6          |
| 8,000,000    | 0.8          | 6.2          |
| 16,000,000   | 2.3          | 24.4         |
| 32,000,000   | 6.5          | 94.1         |
| 64,000,000   | 13.5         | 371.7        |
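
Note that the Arrow 12 time roughly quadruples each time the array length doubles, i.e. the cost grows quadratically, while Arrow 11 stays much closer to linear. A quick check of the successive ratios, using the timings copied from the table above:

```python
# Timings (seconds) copied from the table above.
arrow11 = [0.1, 0.2, 0.3, 0.8, 2.3, 6.5, 13.5]
arrow12 = [0.1, 0.4, 1.6, 6.2, 24.4, 94.1, 371.7]

# Each row doubles the array length, so a ratio of ~2 between successive
# timings indicates linear scaling and ~4 indicates quadratic scaling.
for name, ts in [('Arrow 11', arrow11), ('Arrow 12', arrow12)]:
    ratios = [round(b / a, 1) for a, b in zip(ts, ts[1:])]
    print(name, ratios)
# Arrow 11 -> [2.0, 1.5, 2.7, 2.9, 2.8, 2.1] (roughly linear)
# Arrow 12 -> [4.0, 4.0, 3.9, 3.9, 3.9, 4.0] (time quadruples per doubling)
```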

The output directory was deleted before each run.

"""check.py"""
import sys
import time
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

def main():
    path = '/tmp/test.parquet'
    length = 10_000_000 if len(sys.argv) < 2 else int(sys.argv[1])
    table = pa.Table.from_arrays([pa.array(np.arange(length))], names=['A'])
    t0 = time.perf_counter()
    pq.write_to_dataset(
        table, path, schema=table.schema, use_legacy_dataset=False, use_threads=False, compression=None
    )
    duration = time.perf_counter() - t0
    print(f'{duration:.2f}s')

if __name__ == '__main__':
    main()

Running git bisect on local builds points to this commit: 660d259, "[C++] Add ordered/segmented aggregation Substrait extension" (#34627).

Following that change, flame graphs show a lot of additional time spent in arrow::util::EnsureAlignment calling glibc memcpy:

Before, ~1.3 s (ddd0a33): [flame graph: good-ddd0a33 perf]

After, ~9.6 s (660d259): [flame graph: bad-660d259 perf]

Reading and pyarrow.parquet.write_table appear unaffected.
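
As a rough way to probe the alignment hypothesis from Python, one can inspect the address of the data buffer that numpy hands to Arrow. This is only a sketch: the 64-byte modulus is illustrative, since I haven't confirmed what alignment EnsureAlignment actually enforces after that commit.

```python
import numpy as np
import pyarrow as pa

# Same single int64 column as check.py; pa.array() wraps the numpy
# buffer zero-copy, so the address below is numpy's allocation.
arr = pa.array(np.arange(1_000_000))

# buffers()[0] is the validity bitmap (None here, no nulls);
# buffers()[1] is the data buffer.
data_buf = arr.buffers()[1]
print(f'data buffer address: {data_buf.address:#x}')
# If this remainder is nonzero for whatever alignment Acero now checks,
# EnsureAlignment falls back to copying the buffer (the memcpy in the
# flame graph). 64 here is illustrative, not a confirmed requirement.
print(f'address % 64 = {data_buf.address % 64}')
```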

Component(s)

C++, Parquet
