feat: add standalone shuffle benchmark tool #3752
Conversation
Add a `shuffle_bench` binary that benchmarks shuffle write and read performance independently from Spark, making it easy to profile with tools like `cargo flamegraph`, `perf`, or `instruments`. Supports reading Parquet files (e.g. TPC-H/TPC-DS) or generating synthetic data with configurable schema. Covers different scenarios including compression codecs, partition counts, partitioning schemes, and memory-constrained spilling.
- …arquet
- Add `spark.comet.exec.shuffle.maxBufferedBatches` config to limit the number of batches buffered before spilling, allowing earlier spilling to reduce peak memory usage on executors
- Fix too-many-open-files: close the spill file FD after each spill and reopen in append mode, rather than holding one FD open per partition
- Refactor shuffle_bench to stream directly from Parquet instead of loading all input data into memory; remove synthetic data generation
- Add `--max-buffered-batches` CLI arg to shuffle_bench
- Add shuffle benchmark documentation to README
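The too-many-open-files fix above can be sketched as follows. This is a hypothetical helper (`append_spill` is not the actual Comet code) showing the pattern: reopen the spill file in append mode for each spill and drop the handle afterwards, so no FD stays open per partition between spills.

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::path::Path;

// Instead of holding one open FD per partition for the lifetime of the
// writer, reopen the spill file in append mode for each spill, then let
// the handle drop so the FD is released between spills.
fn append_spill(path: &Path, bytes: &[u8]) -> std::io::Result<()> {
    let mut file = OpenOptions::new()
        .create(true)
        .append(true)
        .open(path)?;
    file.write_all(bytes)?;
    Ok(()) // `file` is dropped here, closing the FD
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("spill_demo.bin");
    let _ = std::fs::remove_file(&path);
    // Two spills: each call opens, appends, and closes the file.
    append_spill(&path, b"batch-1")?;
    append_spill(&path, b"batch-2")?;
    let data = std::fs::read(&path)?;
    assert_eq!(data, b"batch-1batch-2".to_vec());
    std::fs::remove_file(&path)?;
    Ok(())
}
```

The trade-off is one extra open/close syscall per spill, which is negligible compared to the spill I/O itself and keeps the FD count independent of partition count.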
Merge latest from apache/main, resolve conflicts, and strip out COMET_SHUFFLE_MAX_BUFFERED_BATCHES config and all related plumbing. This branch now only adds the shuffle benchmark binary.
Spawns N parallel shuffle tasks to simulate executor parallelism. Each task reads the same input and writes to its own output files. Extracts the core shuffle logic into a shared async helper to avoid code duplication between the single and concurrent paths.
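The task-spawning pattern described above can be sketched as follows. This is a simplified, synchronous stand-in using `std::thread` (the actual benchmark uses a shared async helper), with a summation as a placeholder for the per-task shuffle write; `run_concurrent_tasks` is a hypothetical name.

```rust
use std::sync::Arc;
use std::thread;

// N tasks share the same (read-only) input; each produces its own
// per-task result, so there is no contention on the write path.
fn run_concurrent_tasks(num_tasks: usize, input: Arc<Vec<u64>>) -> Vec<u64> {
    let handles: Vec<_> = (0..num_tasks)
        .map(|task_id| {
            let input = Arc::clone(&input);
            thread::spawn(move || {
                // Stand-in for the shuffle write: consume the shared
                // input, produce a per-task output.
                input.iter().sum::<u64>() + task_id as u64
            })
        })
        .collect();
    // Join in spawn order so results line up with task ids.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let input = Arc::new(vec![1u64, 2, 3]);
    let results = run_concurrent_tasks(4, input);
    assert_eq!(results, vec![6, 7, 8, 9]);
}
```

Sharing the input behind an `Arc` mirrors the benchmark's design: every task reads the same data, so only the write side scales with `--concurrent-tasks`.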
Force-pushed from cf56e39 to c469077
mbutrovich left a comment
Minor comments so far. Thanks for working on this @andygrove!
| Flag | Default | Description |
|---|---|---|
| `--zstd-level` | `1` | Zstd compression level (1–22) |
| `--batch-size` | `8192` | Batch size for reading Parquet data |
| `--memory-limit` | _(none)_ | Memory limit in bytes; triggers spilling when exceeded |
| `--max-buffered-batches` | `0` | Max batches to buffer before spilling (0 = memory-pool-only) |
Is this used or an old argument? I don't see it in Args anywhere.
### Basic usage

```sh
cargo run --release --features shuffle-bench --bin shuffle_bench -- \
```
I don't think we need to hide this behind a features flag. If the intention is that a dev need not build it every time, then I guess that is ok. Might be useful to add a target in the Makefile.
Or, perhaps, we can have a tools directory (maybe under dev) where we can add standalone tools and a target in the Makefile to build the tools directory.
The benchmark brings in additional dependencies (clap and parquet) so increases build time. That was the motivation for using a feature flag.
I like the feature flag, FWIW.
````rust
//! # Usage
//!
//! ```sh
//! cargo run --release --bin shuffle_bench -- \
````
Do we need to specify the features flag here?
```rust
})
.collect();

let data_bytes = fs::read(data_file).expect("Failed to read data file");
```
The data_file could be large. Would reading the entire file into memory emulate the behavior when we read the file in production? (For large files, we probably would/should use a buffered read.)
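A minimal sketch of the buffered-read approach this comment suggests, assuming a fixed 64 KiB buffer; `read_buffered` is a hypothetical helper, not part of the benchmark.

```rust
use std::fs::File;
use std::io::{BufReader, Read};

// Rather than fs::read() pulling the whole file into memory, stream it
// through a fixed-size buffer so peak memory stays bounded regardless
// of file size. Returns the total number of bytes seen.
fn read_buffered(path: &std::path::Path) -> std::io::Result<u64> {
    let mut reader = BufReader::with_capacity(64 * 1024, File::open(path)?);
    let mut chunk = [0u8; 8192];
    let mut total = 0u64;
    loop {
        let n = reader.read(&mut chunk)?;
        if n == 0 {
            break; // EOF
        }
        total += n as u64; // process `chunk[..n]` here
    }
    Ok(total)
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("bufread_demo.bin");
    std::fs::write(&path, vec![0u8; 100_000])?;
    assert_eq!(read_buffered(&path)?, 100_000);
    std::fs::remove_file(&path)?;
    Ok(())
}
```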
I reverted the shuffle read part of the benchmark so that this PR is just for profiling shuffle write (which is the complex part). We can add a shuffle read benchmark in the future as and when we need it.
Revert new shuffle metrics (interleave_time, coalesce_time, memcopy_time) to keep PR focused on the benchmark tool. Remove read-back functionality from shuffle_bench to focus on write performance. Remove undocumented --max-buffered-batches option from README.
parthchandra left a comment
This looks good to me, pending the outstanding comment being addressed and CI passing. (I tried re-running the failed pipeline, but it failed to start.)
Thanks for the approval @parthchandra. I'm not clear on which comment still needs to be addressed. @mbutrovich Could you take another look? You requested changes in your last review.
mbutrovich left a comment
Thanks for putting this together, @andygrove!
There was a comment from Matt that was still pending at that time. All good now.
Which issue does this PR close?
N/A
Rationale for this change
Profiling and optimizing Comet's native shuffle writer requires running it in isolation outside of Spark. This PR adds a standalone benchmark binary that streams data from Parquet files through the shuffle writer, and adds finer-grained metrics to identify bottlenecks.
What changes are included in this PR?
Standalone shuffle benchmark (`shuffle_bench`)

A new binary in the shuffle crate for benchmarking shuffle write and read performance outside of Spark. Streams input directly from Parquet files.
Key features:

- Concurrent shuffle tasks (`--concurrent-tasks`)

Finer-grained shuffle metrics
Added three new timing metrics to `ShufflePartitionerMetrics` and threaded them through the existing shuffle writers:

- `interleave_time`: time spent in `interleave_record_batch` gathering rows into shuffled batches
- `coalesce_time`: time spent coalescing small batches before serialization
- `memcopy_time`: time spent buffering partition indices and memory accounting

These metrics are reported by the benchmark tool to help identify which phase of shuffle writing is the bottleneck.
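A sketch of how such phase timers can be accumulated: wrap each timed phase with `Instant::now()`/`elapsed()` and add the duration into a per-metric counter. `PhaseTimer` is a hypothetical stand-in for the counters in `ShufflePartitionerMetrics`, not the actual implementation.

```rust
use std::time::{Duration, Instant};

// Accumulates wall-clock time across repeated invocations of one phase.
#[derive(Default)]
struct PhaseTimer {
    total: Duration,
}

impl PhaseTimer {
    // Time a single invocation of the phase and pass its result through.
    fn time<T>(&mut self, f: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = f();
        self.total += start.elapsed();
        out
    }
}

fn main() {
    let mut interleave_time = PhaseTimer::default();
    // Stand-in for a timed phase such as interleave_record_batch.
    let shuffled: Vec<i32> = interleave_time.time(|| (0..1000).rev().collect());
    assert_eq!(shuffled.len(), 1000);
    // sleep() guarantees at least this much elapsed time is recorded.
    interleave_time.time(|| std::thread::sleep(Duration::from_millis(5)));
    assert!(interleave_time.total >= Duration::from_millis(5));
}
```

Because each metric only accumulates the span it wraps, comparing the three totals against the overall write time shows which phase dominates.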
How are these changes tested?
Manual benchmarking with TPC-H SF100 data. The benchmark binary is gated behind the `shuffle-bench` cargo feature flag and does not affect production builds. Existing shuffle tests continue to pass.