Explore zero-copy batch coalescing during shuffle write #3836

@andygrove

Description

Currently, when writing shuffle data with many output partitions, each input batch gets split into many small per-partition batches (e.g. 8192 rows across 200 partitions ≈ 41 rows per partition). Before serialization, we use Arrow's BatchCoalescer to combine these small batches into larger ones to reduce per-batch IPC overhead.

However, BatchCoalescer allocates new arrays and copies row data to produce the coalesced batch. This is unnecessary work: the data is about to be serialized anyway.
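The cost of the current path can be sketched with plain byte buffers standing in for Arrow array data. This is an illustrative toy model, not the real arrow-rs `BatchCoalescer` or IPC writer API: coalescing allocates a fresh buffer and copies every small batch into it, and serialization then reads all of that data a second time.

```rust
/// Toy model of the current path: allocate a new buffer and copy every
/// small per-partition batch into it (the alloc + copy this issue targets).
fn coalesce(batches: &[Vec<u8>]) -> Vec<u8> {
    let total: usize = batches.iter().map(|b| b.len()).sum();
    let mut out = Vec::with_capacity(total); // fresh allocation
    for b in batches {
        out.extend_from_slice(b); // full copy of every byte
    }
    out
}

/// Toy serializer: the real path emits an Arrow IPC message; here we just
/// frame the payload with a little-endian length prefix.
fn serialize(batch: &[u8]) -> Vec<u8> {
    let mut msg = (batch.len() as u32).to_le_bytes().to_vec();
    msg.extend_from_slice(batch); // second pass over the same bytes
    msg
}

fn main() {
    let small_batches = vec![vec![1u8, 2], vec![3, 4, 5], vec![6]];
    // Every byte is copied once into the coalesced batch, then read again
    // during serialization.
    let message = serialize(&coalesce(&small_batches));
    println!("{} bytes on the wire", message.len());
}
```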

Proposed Optimization

Explore writing multiple small batches as a single IPC record batch message by concatenating their Arrow buffers directly in the IPC writer, avoiding the intermediate copy. This would be a "coalesce-on-write" approach:

  • Instead of: small batches → BatchCoalescer (alloc + copy) → large batch → IPC serialize
  • Do: small batches → IPC serialize directly as one message (concatenate buffers)

This would eliminate one full data copy per batch in the shuffle write hot path.
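Under the same toy model (byte buffers in place of Arrow buffers, and a length prefix in place of the real IPC flatbuffer message header), a hypothetical `serialize_many` shows the coalesce-on-write shape: one framed message is produced directly from many small buffers, touching each byte only once and never materializing the combined batch. The real task is harder than this sketch suggests: offset buffers of variable-width arrays would need rebasing and validity bitmaps are bit-packed, so raw byte concatenation only applies to some buffer types.

```rust
/// Hypothetical coalesce-on-write: emit one framed message directly from
/// many small buffers, skipping the intermediate coalesced batch entirely.
fn serialize_many(batches: &[&[u8]]) -> Vec<u8> {
    let total: usize = batches.iter().map(|b| b.len()).sum();
    // Reserve the final message size up front: 4-byte length prefix + payload.
    let mut msg = Vec::with_capacity(4 + total);
    msg.extend_from_slice(&(total as u32).to_le_bytes());
    for b in batches {
        // Each small buffer is written once, straight into the output message,
        // rather than once into a coalesced batch and again during serialization.
        msg.extend_from_slice(b);
    }
    msg
}

fn main() {
    let parts: Vec<&[u8]> = vec![&[1, 2], &[3, 4, 5], &[6]];
    let msg = serialize_many(&parts);
    // Identical wire format to coalesce-then-serialize, with one fewer
    // full pass over the data.
    assert_eq!(msg, [6, 0, 0, 0, 1, 2, 3, 4, 5, 6]);
    println!("one message, no intermediate copy: {:?}", msg);
}
```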

Context

This applies to both the block-based and IPC stream shuffle formats. The BatchCoalescer is used in BufBatchWriter (block format) and in the IPC stream multi-partition write path.

Benchmark data (200 partitions, LZ4, 10M rows from TPC-H SF100 lineitem):

  • Block format: 2.64M rows/s, 609 MiB output
  • IPC stream format: 2.60M rows/s, 634 MiB output

Both paths use BatchCoalescer and would benefit from this optimization.
