perf: parallelize FRI fold with Rayon#597

Open
MauroToscano wants to merge 1 commit into main from opt/28-parallel-fri-fold

@MauroToscano
Contributor

Summary

  • Parallelize fold_evaluations_in_place using par_chunks_exact(2) + par_iter for layers with >= 4096 elements
  • Falls back to sequential for small final layers where Rayon overhead dominates
  • Each FRI fold iteration is embarrassingly parallel — the output at position j depends only on inputs at 2j, 2j+1 and the j-th twiddle factor
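
The independence described in the last bullet can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: it uses a toy prime field over `u64`, an assumed (commonly used) fold formula, and `std::thread::scope` as a stand-in for Rayon's `par_chunks_exact(2)`. The names `P`, `fold_pair`, `fold_sequential`, `fold_threaded`, and `beta` are hypothetical.

```rust
use std::thread;

// Toy prime modulus 2^31 - 1 (assumption; the PR works over the crate's field types).
const P: u64 = 2_147_483_647;

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }
fn mul(a: u64, b: u64) -> u64 { (a * b) % P }

// One folded value. It reads only the input pair and one twiddle, which is
// what makes the loop embarrassingly parallel. The formula is an assumption.
fn fold_pair(lo: u64, hi: u64, inv_twiddle: u64, beta: u64) -> u64 {
    add(add(lo, hi), mul(mul(beta, inv_twiddle), sub(lo, hi)))
}

// Sequential reference: output j uses evals[2j], evals[2j + 1], inv_twiddles[j].
fn fold_sequential(evals: &[u64], inv_twiddles: &[u64], beta: u64) -> Vec<u64> {
    assert_eq!(inv_twiddles.len(), evals.len() / 2);
    evals
        .chunks_exact(2)
        .zip(inv_twiddles)
        .map(|(pair, &t)| fold_pair(pair[0], pair[1], t, beta))
        .collect()
}

// Threaded version writing into a fresh buffer. Each worker owns a disjoint
// block of outputs and only reads the matching inputs and twiddles.
fn fold_threaded(evals: &[u64], inv_twiddles: &[u64], beta: u64, workers: usize) -> Vec<u64> {
    let half = evals.len() / 2;
    assert_eq!(inv_twiddles.len(), half);
    let mut out = vec![0u64; half];
    // Block size is at least 1 because chunks_mut(0) would panic.
    let block = half.div_ceil(workers.max(1)).max(1);
    thread::scope(|s| {
        for (i, out_block) in out.chunks_mut(block).enumerate() {
            let start = i * block;
            let end = start + out_block.len();
            let in_block = &evals[2 * start..2 * end];
            let tw_block = &inv_twiddles[start..end];
            s.spawn(move || {
                let pairs = in_block.chunks_exact(2);
                for ((dst, pair), &t) in out_block.iter_mut().zip(pairs).zip(tw_block) {
                    *dst = fold_pair(pair[0], pair[1], t, beta);
                }
            });
        }
    });
    out
}
```

Calling `fold_sequential` and `fold_threaded` on the same inputs should give identical vectors, since no output depends on another output. A real implementation would also pick the sequential path below some size threshold, as the PR does.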

Benchmark results

fib_iterative_4M, PARALLEL_TABLES=1, 5 samples:

| Metric | Before | After | Delta |
|---|---|---|---|
| Wall clock (median) | 38.6s | 36.8s | -4.7% |
| CV | 2.3% | 2.2% | |
| R4 fri::commit_phase (instruments) | 3.83s | 2.92s | -23.8% |
| Heap | 26,498 MB | 26,242 MB | -256 MB |

Verification: baseline verifier accepts the proof.

Test plan

  • cargo test --release -p stark (124/124 pass)
  • Proof verified by baseline verifier binary
  • /bench on CI runner

The FRI fold loop was fully sequential despite being embarrassingly
parallel (each output element depends only on its input pair and
twiddle factor). Parallelize with par_chunks_exact(2) for layers
above 4096 elements, falling back to the sequential path for small
final layers where Rayon overhead dominates.
@github-actions

Codex Code Review

Findings:

  • Potential Bug: crypto/stark/src/fri/fri_functions.rs:39
    The parallel branch uses .par_chunks_exact(2).zip(inv_twiddles.par_iter()), so if inv_twiddles.len() < evals.len() / 2, it silently computes fewer folded values and replaces evals with that shorter vector. The serial path indexes inv_twiddles[j] and would panic on the same invariant violation. Since commit_phase_from_evaluations is public and does not validate evals.len() == domain_size, this can turn a bad caller input into a silently malformed FRI commitment when parallel is enabled. Add an explicit invariant check before both branches, e.g. assert_eq!(inv_twiddles.len(), half), or return an error if this API should be fallible.
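
If the API were made fallible, as the finding suggests as one option, the length check could be factored into a small helper that runs before either branch. The sketch below is an assumption about what that might look like: `FoldError` and `check_fold_inputs` are hypothetical names, not part of the crate.

```rust
/// Hypothetical error type for invalid fold inputs.
#[derive(Debug, PartialEq)]
pub enum FoldError {
    /// The evaluation vector cannot be split into pairs.
    OddLength { evals: usize },
    /// The twiddle slice does not have exactly one entry per pair.
    TwiddleLengthMismatch { expected: usize, actual: usize },
}

/// Validates the invariant shared by the sequential and parallel paths and
/// returns the folded length (`evals_len / 2`) on success.
pub fn check_fold_inputs(evals_len: usize, inv_twiddles_len: usize) -> Result<usize, FoldError> {
    if evals_len % 2 != 0 {
        return Err(FoldError::OddLength { evals: evals_len });
    }
    let half = evals_len / 2;
    if inv_twiddles_len != half {
        return Err(FoldError::TwiddleLengthMismatch {
            expected: half,
            actual: inv_twiddles_len,
        });
    }
    Ok(half)
}
```

A caller would run this once and then dispatch to whichever branch applies, so both paths fail the same way on bad input instead of one panicking and the other truncating.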

Verification: I attempted cargo test -p stark fri --features parallel, but rustup failed before compiling because /home/runner/.rustup/tmp is read-only in this sandbox. git diff --check passed.

Comment on lines +27 to +31
) where
F: IsSubFieldOf<E> + Send + Sync,
E: IsField + Send + Sync,
FieldElement<E>: Send + Sync,
FieldElement<F>: Send + Sync,

Low – Send + Sync bounds leak into non-parallel builds

These four bounds are unconditional, so they apply even when the parallel feature is off. Any caller with a field element type that doesn't implement Send + Sync (e.g. a type containing Rc or Cell) will fail to compile, even in a single-threaded build. In practice, all concrete field element types in this crate are Send + Sync, but the bounds still unnecessarily restrict the API surface.

The cleanest fix without duplicating the function body is to gate them with a custom supertrait:

#[cfg(feature = "parallel")]
pub trait MaybeParallel: Send + Sync {}
#[cfg(feature = "parallel")]
impl<T: Send + Sync> MaybeParallel for T {}
#[cfg(not(feature = "parallel"))]
pub trait MaybeParallel {}
#[cfg(not(feature = "parallel"))]
impl<T> MaybeParallel for T {}

then replace Send + Sync with MaybeParallel throughout. Alternatively, if all callers satisfy the bounds unconditionally, document that assumption and leave it as-is — the important thing is the intent is clear.
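
To show what the gated supertrait buys, here is a small sketch of a generic helper bounded by `MaybeParallel`. The helper `combine_pairs` and the demo `rc_demo` are hypothetical; the demo is itself gated so that it only exists when the `parallel` feature is off, where `Rc<u32>` (which is not `Send`) satisfies the bound.

```rust
use std::rc::Rc;

#[cfg(feature = "parallel")]
pub trait MaybeParallel: Send + Sync {}
#[cfg(feature = "parallel")]
impl<T: Send + Sync> MaybeParallel for T {}
#[cfg(not(feature = "parallel"))]
pub trait MaybeParallel {}
#[cfg(not(feature = "parallel"))]
impl<T> MaybeParallel for T {}

// Hypothetical helper: combines adjacent pairs. The bound only requires
// Send + Sync when the `parallel` feature is enabled.
pub fn combine_pairs<T: MaybeParallel>(items: &[T], combine: impl Fn(&T, &T) -> T) -> Vec<T> {
    items
        .chunks_exact(2)
        .map(|pair| combine(&pair[0], &pair[1]))
        .collect()
}

// Compiles only in single-threaded builds: Rc<u32> is not Send.
#[cfg(not(feature = "parallel"))]
pub fn rc_demo() -> Vec<Rc<u32>> {
    let items = vec![Rc::new(1u32), Rc::new(2), Rc::new(3), Rc::new(4)];
    combine_pairs(&items, |a, b| Rc::new(**a + **b))
}
```

With the feature enabled, the same call with `Rc<u32>` would be rejected at compile time, which is the restriction the unconditional bounds currently impose on every build.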

Comment on lines +40 to +41
.par_chunks_exact(2)
.zip(inv_twiddles.par_iter())

Low – zip silently truncates if lengths diverge; sequential panics

.par_chunks_exact(2).zip(inv_twiddles.par_iter()) stops at the shorter iterator, so if inv_twiddles.len() < evals.len() / 2 the parallel path silently produces a shorter-than-expected vector and returns with a wrong result. The sequential path would catch that with an out-of-bounds index panic.

The caller (commit_phase_from_evaluations) always passes a correctly-sized inv_twiddles slice, so this won't trigger in practice. But a debug-mode assert would make the invariant explicit and catch misuse during development:

debug_assert_eq!(inv_twiddles.len(), evals.len() / 2);
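
The difference in failure mode can be reproduced with ordinary iterators, since `zip` on standard iterators stops at the shorter side just as Rayon's indexed `zip` does. The sketch below is illustrative only: `fold_with_zip` and `fold_with_index` are hypothetical names, and the arithmetic is a placeholder rather than the real fold formula.

```rust
// Zip-based fold: silently stops at the shorter of the two inputs.
fn fold_with_zip(evals: &[u64], inv_twiddles: &[u64]) -> Vec<u64> {
    evals
        .chunks_exact(2)
        .zip(inv_twiddles.iter())
        .map(|(pair, &t)| (pair[0] + pair[1]) * t)
        .collect()
}

// Index-based fold: panics with an out-of-bounds index if
// inv_twiddles has fewer than evals.len() / 2 entries.
fn fold_with_index(evals: &[u64], inv_twiddles: &[u64]) -> Vec<u64> {
    (0..evals.len() / 2)
        .map(|j| (evals[2 * j] + evals[2 * j + 1]) * inv_twiddles[j])
        .collect()
}
```

With eight evaluations and only two twiddles, `fold_with_zip` returns two values instead of four, while `fold_with_index` panics. When the lengths match, both return the same vector.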

@claude

claude Bot commented May 19, 2026

Review: perf: parallelize FRI fold with Rayon

Overall: looks correct and clean. The parallel fold formula matches the sequential one exactly, the new-buffer approach for the parallel path is the right call to avoid read/write aliasing, and the threshold of 4096 is a reasonable starting point backed by benchmarks.

Two low-severity issues flagged inline:

| # | Severity | Location | Issue |
|---|---|---|---|
| 1 | Low | fri_functions.rs:27-31 | Send + Sync bounds are unconditional (they apply even when the parallel feature is off), unnecessarily restricting the API |
| 2 | Low | fri_functions.rs:40-41 | zip silently truncates on length mismatch where sequential would panic; a debug_assert_eq!(inv_twiddles.len(), evals.len() / 2) would surface this during development |

No correctness bugs, no security concerns, no unsafe code.

@MauroToscano
Contributor Author

/bench

@github-actions

github-actions Bot commented May 21, 2026

Benchmark — fib_iterative_8M (median of 5)

Table parallelism: auto (cores / 3)

| Metric | main | PR | Δ |
|---|---|---|---|
| Peak heap | 52516 MB | 51894 MB | -622 MB (-1.2%) ⚪ |
| Prove time | 25.536s | 25.593s | +0.057s (+0.2%) ⚪ |

✅ No significant change.

✅ Low variance (time: 2.6%, heap: 1.0%)

Commit: 225817d · Baseline: cached · Runner: self-hosted bench

@MauroToscano
Contributor Author

/bench 5 1
