perf: parallelize FRI fold with Rayon#597
Conversation
The FRI fold loop was fully sequential despite being embarrassingly parallel (each output element depends only on its input pair and twiddle factor). Parallelize with par_chunks_exact(2) for layers above 4096 elements, falling back to the sequential path for small final layers where Rayon overhead dominates.
Codex Code ReviewFindings:
Verification: I attempted |
| ) where | ||
| F: IsSubFieldOf<E> + Send + Sync, | ||
| E: IsField + Send + Sync, | ||
| FieldElement<E>: Send + Sync, | ||
| FieldElement<F>: Send + Sync, |
There was a problem hiding this comment.
Low – Send + Sync bounds leak into non-parallel builds
These four bounds are unconditional, so they apply even when the parallel feature is off. Any caller with a field element type that doesn't implement Send + Sync (e.g. a type containing Rc or Cell) will fail to compile, even in a single-threaded build. In practice, all concrete field element types in this crate are Send + Sync, but the bounds still unnecessarily restrict the API surface.
The cleanest fix without duplicating the function body is to gate them with a custom supertrait:
#[cfg(feature = "parallel")]
pub trait MaybeParallel: Send + Sync {}
#[cfg(feature = "parallel")]
impl<T: Send + Sync> MaybeParallel for T {}
#[cfg(not(feature = "parallel"))]
pub trait MaybeParallel {}
#[cfg(not(feature = "parallel"))]
impl<T> MaybeParallel for T {}then replace Send + Sync with MaybeParallel throughout. Alternatively, if all callers satisfy the bounds unconditionally, document that assumption and leave it as-is — the important thing is the intent is clear.
| .par_chunks_exact(2) | ||
| .zip(inv_twiddles.par_iter()) |
There was a problem hiding this comment.
Low – zip silently truncates if lengths diverge; sequential panics
.par_chunks_exact(2).zip(inv_twiddles.par_iter()) stops at the shorter iterator, so if inv_twiddles.len() < evals.len() / 2 the parallel path silently produces a shorter-than-expected vector and returns with a wrong result. The sequential path would catch that with an out-of-bounds index panic.
The caller (commit_phase_from_evaluations) always passes a correctly-sized inv_twiddles slice, so this won't trigger in practice. But a debug-mode assert would make the invariant explicit and catch misuse during development:
debug_assert_eq!(inv_twiddles.len(), evals.len() / 2);
Review: perf: parallelize FRI fold with RayonOverall: looks correct and clean. The parallel fold formula matches the sequential one exactly, the new-buffer approach for the parallel path is the right call to avoid read/write aliasing, and the threshold of 4096 is a reasonable starting point backed by benchmarks. Two low-severity issues flagged inline:
No correctness bugs, no security concerns, no unsafe code. |
|
/bench |
Benchmark — fib_iterative_8M (median of 5)Table parallelism: auto (cores / 3)
Commit: 225817d · Baseline: cached · Runner: self-hosted bench |
|
/bench 5 1 |
Summary
fold_evaluations_in_placeusingpar_chunks_exact(2)+par_iterfor layers with >= 4096 elementsBenchmark results
fib_iterative_4M, PARALLEL_TABLES=1, 5 samples:
Verification: baseline verifier accepts the proof.
Test plan
cargo test --release -p stark(124/124 pass)/benchon CI runner