
feat: add standalone shuffle benchmark tool#3752

Merged
andygrove merged 21 commits into apache:main from andygrove:shuffle-bench-binary
Apr 8, 2026

Conversation

@andygrove
Member

@andygrove andygrove commented Mar 21, 2026

Which issue does this PR close?

N/A

Rationale for this change

Profiling and optimizing Comet's native shuffle writer requires running it in isolation outside of Spark. This PR adds a standalone benchmark binary that streams data from Parquet files through the shuffle writer, and adds finer-grained metrics to identify bottlenecks.

What changes are included in this PR?

Standalone shuffle benchmark (shuffle_bench)

A new binary in the shuffle crate for benchmarking shuffle write and read performance outside of Spark. Streams input directly from Parquet files.

```sh
cargo run --release --features shuffle-bench --bin shuffle_bench -- \
  --input /data/tpch-sf100/lineitem/ \
  --partitions 200 \
  --codec zstd --zstd-level 1 \
  --hash-columns 0,3 \
  --memory-limit 2147000000
```

Key features:

  • Supports hash, single, and round-robin partitioning
  • Configurable compression (none, lz4, zstd, snappy)
  • Memory limit with spill support
  • Optional read-back benchmarking
  • Row limit for quick profiling runs
  • Multiple iterations with warmup
  • Concurrent task simulation (--concurrent-tasks)
  • Detailed metrics breakdown in output
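As a rough illustration of the partitioning schemes listed above, here is a minimal sketch of how a hash and a round-robin partitioner assign rows to output partitions. The function names are illustrative only, not the actual Comet implementation:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative hash partitioner: maps a row key to one of
/// `num_partitions` output partitions, as hash partitioning would.
fn hash_partition<K: Hash>(key: &K, num_partitions: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish() % num_partitions
}

/// Illustrative round-robin partitioner: cycles through partitions
/// by row index, ignoring the row's contents.
fn round_robin_partition(row_index: u64, num_partitions: u64) -> u64 {
    row_index % num_partitions
}
```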

Finer-grained shuffle metrics

Added three new timing metrics to ShufflePartitionerMetrics and threaded them through the existing shuffle writers:

  • interleave_time: Time spent in interleave_record_batch gathering rows into shuffled batches
  • coalesce_time: Time spent coalescing small batches before serialization
  • memcopy_time: Time spent buffering partition indices and memory accounting

These metrics are reported by the benchmark tool to help identify which phase of shuffle writing is the bottleneck.
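The per-phase timing pattern described above can be sketched as follows. This is a hypothetical accumulator mirroring the kind of measurement the PR adds; it is not the actual `ShufflePartitionerMetrics` struct:

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-phase timing accumulator (illustrative only).
#[derive(Default)]
struct PhaseTimings {
    interleave_time: Duration,
    coalesce_time: Duration,
    memcopy_time: Duration,
}

/// Run a closure, adding its elapsed wall-clock time to `counter`.
fn timed<T>(counter: &mut Duration, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let result = f();
    *counter += start.elapsed();
    result
}
```

A writer would then wrap each phase, e.g. `timed(&mut m.interleave_time, || interleave(...))`, and report the accumulated durations at the end of the run.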

How are these changes tested?

Manual benchmarking with TPC-H SF100 data. The benchmark binary is gated behind the shuffle-bench cargo feature flag and does not affect production builds. Existing shuffle tests continue to pass.

Add a `shuffle_bench` binary that benchmarks shuffle write and read
performance independently from Spark, making it easy to profile with
tools like `cargo flamegraph`, `perf`, or `instruments`.

Supports reading Parquet files (e.g. TPC-H/TPC-DS) or generating
synthetic data with configurable schema. Covers different scenarios
including compression codecs, partition counts, partitioning schemes,
and memory-constrained spilling.
@andygrove andygrove marked this pull request as draft March 26, 2026 14:00
…arquet

- Add `spark.comet.exec.shuffle.maxBufferedBatches` config to limit
  the number of batches buffered before spilling, allowing earlier
  spilling to reduce peak memory usage on executors
- Fix too-many-open-files: close spill file FD after each spill and
  reopen in append mode, rather than holding one FD open per partition
- Refactor shuffle_bench to stream directly from Parquet instead of
  loading all input data into memory; remove synthetic data generation
- Add --max-buffered-batches CLI arg to shuffle_bench
- Add shuffle benchmark documentation to README
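The too-many-open-files fix above (close the spill file descriptor after each spill, reopen in append mode) can be sketched like this; the function name and signature are illustrative, not the actual writer code:

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::path::Path;

/// Append one spill to its partition's file, then let the descriptor
/// close, instead of holding one FD open per partition for the whole
/// lifetime of the shuffle write.
fn append_spill(path: &Path, data: &[u8]) -> std::io::Result<()> {
    let mut file = OpenOptions::new()
        .create(true)
        .append(true)
        .open(path)?;
    file.write_all(data)
    // `file` is dropped here, closing the FD until the next spill
}
```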
Merge latest from apache/main, resolve conflicts, and strip out
COMET_SHUFFLE_MAX_BUFFERED_BATCHES config and all related plumbing.
This branch now only adds the shuffle benchmark binary.
@andygrove andygrove marked this pull request as ready for review March 27, 2026 18:08
Spawns N parallel shuffle tasks to simulate executor parallelism.
Each task reads the same input and writes to its own output files.
Extracts core shuffle logic into shared async helper to avoid
code duplication between single and concurrent paths.
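The concurrent-task simulation can be sketched with scoped threads standing in for the async tasks (the real tool uses a shared async helper; this simplified version just shows the fan-out/join shape):

```rust
use std::thread;

/// Spawn `num_tasks` parallel tasks that each run the same workload
/// (here a stand-in closure taking the task index) and collect their
/// results once all have finished.
fn run_concurrent_tasks<F>(num_tasks: usize, task: F) -> Vec<u64>
where
    F: Fn(usize) -> u64 + Send + Sync,
{
    let task = &task;
    thread::scope(|s| {
        let handles: Vec<_> = (0..num_tasks)
            .map(|i| s.spawn(move || task(i)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```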
@andygrove andygrove changed the title from "feat: add standalone shuffle benchmark binary for profiling" to "[EXPERIMENTAL] Add ImmediateModePartitioner for native shuffle" Mar 30, 2026
@andygrove andygrove marked this pull request as draft March 30, 2026 21:56
@andygrove andygrove changed the title from "[EXPERIMENTAL] Add ImmediateModePartitioner for native shuffle" to "feat: shuffle benchmark improvements" Mar 30, 2026
@andygrove andygrove force-pushed the shuffle-bench-binary branch from cf56e39 to c469077 on March 30, 2026 22:18
@andygrove andygrove changed the title from "feat: shuffle benchmark improvements" to "feat: add standalone shuffle benchmark tool and finer-grained shuffle metrics" Mar 30, 2026
@andygrove andygrove marked this pull request as ready for review March 30, 2026 22:20
@mbutrovich mbutrovich self-requested a review April 1, 2026 16:40
Contributor

@mbutrovich mbutrovich left a comment

Minor comments so far. Thanks for working on this @andygrove!

Comment thread: native/shuffle/README.md (outdated)
| `--zstd-level` | `1` | Zstd compression level (1–22) |
| `--batch-size` | `8192` | Batch size for reading Parquet data |
| `--memory-limit` | _(none)_ | Memory limit in bytes; triggers spilling when exceeded |
| `--max-buffered-batches` | `0` | Max batches to buffer before spilling (0 = memory-pool-only) |
Contributor

Is this used or an old argument? I don't see it in Args anywhere.

Member Author

Fixed

Comment thread: native/shuffle/src/bin/shuffle_bench.rs (outdated)
Comment thread: native/shuffle/src/metrics.rs (outdated)
Comment thread: native/shuffle/README.md
### Basic usage

```sh
cargo run --release --features shuffle-bench --bin shuffle_bench -- \
```
Contributor

I don't think we need to hide this behind a features flag. If the intention is that a dev need not build it every time, then I guess that is ok. Might be useful to add a target in the Makefile.
Or, perhaps, we can have a tools directory (maybe under dev) where we can add standalone tools and a target in the Makefile to build the tools directory.

Member Author

The benchmark brings in additional dependencies (clap and parquet) so increases build time. That was the motivation for using a feature flag.
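As a sketch, gating the binary and its extra dependencies behind the cargo feature could look like this in the crate's Cargo.toml (the version numbers and exact dependency entries are assumptions, not the actual manifest):

```toml
[features]
# Pull in the benchmark-only dependencies when this feature is enabled
shuffle-bench = ["dep:clap", "dep:parquet"]

[dependencies]
clap = { version = "4", features = ["derive"], optional = true }
parquet = { version = "55", optional = true }

[[bin]]
name = "shuffle_bench"
# cargo refuses to build this binary unless the feature is enabled,
# so production builds never compile it
required-features = ["shuffle-bench"]
```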

Contributor

I like the feature flag, FWIW.

//! # Usage
//!
//! ```sh
//! cargo run --release --bin shuffle_bench -- \
//! ```
Contributor

Do we need to specify the features flag here?

Member Author

No. Updated.

Comment thread: native/shuffle/src/bin/shuffle_bench.rs (outdated)
```rust
})
.collect();

let data_bytes = fs::read(data_file).expect("Failed to read data file");
```
Contributor

The data_file could be large. Would reading the entire file emulate the behavior when we read the file in production? (For large files, we probably would/should have a buffered read?)

Member Author

I reverted the shuffle read part of the benchmark so that this PR is just for profiling shuffle write (which is the complex part). We can add a shuffle read benchmark in the future as and when we need it.

Revert new shuffle metrics (interleave_time, coalesce_time,
memcopy_time) to keep PR focused on the benchmark tool. Remove
read-back functionality from shuffle_bench to focus on write
performance. Remove undocumented --max-buffered-batches option
from README.
Contributor

@parthchandra parthchandra left a comment

This looks good to me. Pending the comment to be addressed and ci. (I tried re-running the failed pipeline but it failed to start).

@andygrove andygrove changed the title from "feat: add standalone shuffle benchmark tool and finer-grained shuffle metrics" to "feat: add standalone shuffle benchmark tool" Apr 8, 2026
@andygrove
Member Author

> This looks good to me. Pending the comment to be addressed and ci. (I tried re-running the failed pipeline but it failed to start).

Thanks for the approval @parthchandra. I'm not clear on which comment still needs to be addressed.

@mbutrovich Could you take another look? You requested changes in your last review.

Contributor

@mbutrovich mbutrovich left a comment

Thanks for putting this together, @andygrove!

@andygrove andygrove merged commit 6260665 into apache:main Apr 8, 2026
159 checks passed
@andygrove andygrove deleted the shuffle-bench-binary branch April 8, 2026 15:31
andygrove added a commit to andygrove/datafusion-comet that referenced this pull request Apr 8, 2026
@parthchandra
Contributor

> This looks good to me. Pending the comment to be addressed and ci. (I tried re-running the failed pipeline but it failed to start).
>
> Thanks for the approval @parthchandra. I'm not clear on which comment still needs to be addressed.

There was a comment from Matt that was still pending at that time. All good now.
