Description
Currently, when writing shuffle data with many output partitions, each input batch gets split into many small per-partition batches (e.g. 8192 rows across 200 partitions ≈ ~41 rows per partition). Before serialization, we use Arrow's BatchCoalescer to combine these small batches into larger ones to reduce per-batch IPC overhead.
However, BatchCoalescer allocates new arrays and copies row data to produce the coalesced batch. This is unnecessary work — the data is about to be serialized anyway.
Proposed Optimization
Explore writing multiple small batches as a single IPC record batch message by concatenating their Arrow buffers directly in the IPC writer, avoiding the intermediate copy. This would be a "coalesce-on-write" approach:
- Instead of: small batches →
BatchCoalescer (alloc + copy) → large batch → IPC serialize
- Do: small batches → IPC serialize directly as one message (concatenate buffers)
This would eliminate one full data copy per batch in the shuffle write hot path.
Context
This applies to both the block-based and IPC stream shuffle formats. The BatchCoalescer is used in BufBatchWriter (block format) and in the IPC stream multi-partition write path.
Benchmark data (200 partitions, LZ4, 10M rows from TPC-H SF100 lineitem):
- Block format: 2.64M rows/s, 609 MiB output
- IPC stream format: 2.60M rows/s, 634 MiB output
Both paths use BatchCoalescer and would benefit from this optimization.
Description
Currently, when writing shuffle data with many output partitions, each input batch gets split into many small per-partition batches (e.g. 8192 rows across 200 partitions ≈ ~41 rows per partition). Before serialization, we use Arrow's
BatchCoalescerto combine these small batches into larger ones to reduce per-batch IPC overhead.However,
BatchCoalescerallocates new arrays and copies row data to produce the coalesced batch. This is unnecessary work — the data is about to be serialized anyway.Proposed Optimization
Explore writing multiple small batches as a single IPC record batch message by concatenating their Arrow buffers directly in the IPC writer, avoiding the intermediate copy. This would be a "coalesce-on-write" approach:
BatchCoalescer(alloc + copy) → large batch → IPC serializeThis would eliminate one full data copy per batch in the shuffle write hot path.
Context
This applies to both the block-based and IPC stream shuffle formats. The
BatchCoalesceris used inBufBatchWriter(block format) and in the IPC stream multi-partition write path.Benchmark data (200 partitions, LZ4, 10M rows from TPC-H SF100 lineitem):
Both paths use
BatchCoalescerand would benefit from this optimization.