perf: Parallelize chunk_and_generate with par_chunks #563
Codex Code Review
No issues found in the PR diff. The change is limited to parallelizing chunked trace/table generation behind the existing
Validation note: I attempted
    #[cfg(feature = "parallel")]
    {
        use rayon::prelude::*;
        ops.par_chunks(max_rows).map(&generate).collect()
This parallelizes within one chunk_and_generate call — across chunks of the same op type. But the 10 calls at the Phase 5 call site (cpu, memw, memw_aligned, memw_register, load, lt, shift, mul, dvrm, branch) are still sequential. When each op type has only one chunk (ops.len() ≤ max_rows), par_chunks yields a single item and the Rayon pool stays idle — no speedup at all for the common small-trace case.
The higher-leverage optimization would be running the 10 calls in parallel via rayon::scope or nested rayon::join, mirroring the pattern already used for pages/register/halt below. If that's deferred to PR #545, a note at the call site would help.
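A rough sketch of the call-site idea, for illustration only. It uses `std::thread::scope` from the standard library as a stand-in for `rayon::scope` / `rayon::join` so that it compiles without extra crates. The `Table` type, the `generate_all` helper, and the two-generator setup are hypothetical and not taken from the PR; the real Phase 5 call site has 10 generators with their own types.

```rust
use std::thread;

/// Hypothetical stand-in for one generated trace table.
#[derive(Debug, PartialEq)]
struct Table {
    name: &'static str,
    rows: usize,
}

/// Sequential chunking: one output per chunk of at most `max_rows` ops.
/// Panics if `max_rows` is zero, because `chunks` does.
fn chunk_and_generate<T, R>(ops: &[T], max_rows: usize, generate: impl Fn(&[T]) -> R) -> Vec<R> {
    ops.chunks(max_rows).map(generate).collect()
}

/// Runs two independent generators concurrently (assumed shape of the call site).
/// Even when each op type fits in a single chunk, the two calls still overlap.
fn generate_all(cpu_ops: &[u64], mul_ops: &[u64], max_rows: usize) -> (Vec<Table>, Vec<Table>) {
    thread::scope(|s| {
        // Spawn the cpu generator on a scoped thread; it may borrow `cpu_ops`.
        let cpu = s.spawn(|| {
            chunk_and_generate(cpu_ops, max_rows, |c| Table { name: "cpu", rows: c.len() })
        });
        // Run the mul generator on the current thread in the meantime.
        let mul = chunk_and_generate(mul_ops, max_rows, |c| Table { name: "mul", rows: c.len() });
        (cpu.join().expect("cpu generator panicked"), mul)
    })
}
```

With Rayon the same shape would presumably go through the existing pool instead of spawning OS threads, which avoids oversubscription if `chunk_and_generate` also parallelizes internally.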
/bench 10
Review of perf: Parallelize chunk_and_generate with par_chunks
No security issues. One functional concern about the parallelism model.
What the PR does
Adds
Correctness
The implementation is correct:
Performance concern
The 10 The higher-impact opportunity is running the 10 calls concurrently via
Benchmark: fib_iterative_8M (median of 3)
Table parallelism: 1
Commit: 4e6f5c0 · Baseline: built from main · Runner: self-hosted bench
/bench k=1
Optimization extracted from PR #545.
The 10 Phase 5 trace generators (cpu, memw, memw_aligned, memw_register, load, lt, shift, mul, dvrm, branch) run serially inside chunk_and_generate, leaving the rayon pool idle.
This PR replaces ops.chunks(max_rows).map(generate).collect() with par_chunks under #[cfg(feature = "parallel")], with a sequential fallback. Adds T: Sync and Sync + Send bounds on the generator.
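For readers without the diff open, the change described above might look roughly like the sketch below. The signature, the generic parameter names, and the `R: Send` bound are assumptions reconstructed from this description and the quoted diff lines, not the actual code. The parallel branch assumes the crate depends on `rayon` and defines a `parallel` feature; without that feature only the sequential branch is compiled.

```rust
/// Splits `ops` into chunks of at most `max_rows` items and runs `generate`
/// on each chunk, returning the results in chunk order.
/// Panics if `max_rows` is zero, as `chunks` and `par_chunks` both do.
fn chunk_and_generate<T, R, F>(ops: &[T], max_rows: usize, generate: F) -> Vec<R>
where
    T: Sync,                        // chunks are shared across worker threads
    R: Send,                        // results are moved back to the caller
    F: Fn(&[T]) -> R + Sync + Send, // the generator is called from several threads
{
    // Parallel path: compiled only when the `parallel` feature is enabled.
    #[cfg(feature = "parallel")]
    {
        use rayon::prelude::*;
        ops.par_chunks(max_rows).map(&generate).collect()
    }
    // Sequential fallback: same chunking, same output order.
    #[cfg(not(feature = "parallel"))]
    {
        ops.chunks(max_rows).map(&generate).collect()
    }
}
```

When `ops.len() <= max_rows` both branches produce a single chunk, so the parallel path only pays off for op types that span several chunks.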