Misc optimizations #545

Open
gabrielbosio wants to merge 7 commits into main from perf/parallel-bitrev-batch-inverse-merkle

Conversation

@gabrielbosio
Collaborator

@gabrielbosio gabrielbosio commented Apr 23, 2026

Stacks five independent prover optimizations in R1 commit, trace build, and the FFT / batch-inverse pipelines.

  1. Sequential reads in commit_columns_bit_reversed: replaces scattered columns[col][br(row)] reads (~2 GB) with sequential reads plus a 64 MB post bit-reverse on the digest vector.
  2. par_chunks inside chunk_and_generate: runs the 10 trace-build Phase 5 generators in parallel on the idle rayon pool.
  3. Skip the redundant bit-reverse pair in the R4 deep-composition LDE: adds evaluate_fft_bit_reversed to drop both cancelling permutes.
  4. Parallel in_place_bit_reverse_permute: swap pairs (i, br(i)) with i < br(i) are disjoint, so a Send/Sync raw-pointer wrapper lets rayon drive them; sequential fallback below N=16K.
  5. Chunked parallel inplace_batch_inverse: chunks the Montgomery prefix product, trading K−1 extra inversions for K-way parallelism; sequential fallback below N=2^16.

@gabrielbosio gabrielbosio force-pushed the perf/parallel-bitrev-batch-inverse-merkle branch from 18e9c67 to 33e9573 on April 23, 2026 18:09

- pub fn inplace_batch_inverse(numbers: &mut [Self]) -> Result<(), FieldError> {
+ pub fn inplace_batch_inverse(numbers: &mut [Self]) -> Result<(), FieldError>
+ where
+     Self: Send + Sync,

Medium — Unconditional API-breaking bound

The where Self: Send + Sync constraint is on the public function signature unconditionally, not gated by #[cfg(feature = "parallel")]. This means every caller of inplace_batch_inverse must now satisfy the bound even on no_std targets or crates that never enable the parallel feature, which is a breaking change for any downstream field implementation whose BaseType is not Send + Sync.

Consider splitting:

Suggested change
- Self: Send + Sync,
+ pub fn inplace_batch_inverse(numbers: &mut [Self]) -> Result<(), FieldError> {

…and adding the Send + Sync bound only in the #[cfg(feature = "parallel")] branch, e.g. via an internal helper with the tighter bound, keeping the public API signature unchanged.
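
A minimal sketch of one way to confine the bound, on a toy type rather than the PR's code. On stable Rust a generic function cannot call a helper with tighter bounds than its own, so this sketch gates the whole signature on the feature instead. Only the `parallel` feature name and the function name come from the PR; `Invertible`, `Local`, and `invert_all` are hypothetical, and the inversion is a plain element-wise loop, not Montgomery's trick.

```rust
use std::rc::Rc;

/// Hypothetical stand-in for a field element's inversion API.
pub trait Invertible: Sized {
    fn inv(&self) -> Option<Self>;
}

// With the feature enabled, the thread-safety bound is part of the signature.
#[cfg(feature = "parallel")]
pub fn inplace_batch_inverse<T>(xs: &mut [T]) -> Result<(), &'static str>
where
    T: Invertible + Send + Sync,
{
    // Placeholder: a parallel implementation would go here.
    invert_all(xs)
}

// Without the feature, callers are not asked for Send + Sync at all.
#[cfg(not(feature = "parallel"))]
pub fn inplace_batch_inverse<T: Invertible>(xs: &mut [T]) -> Result<(), &'static str> {
    invert_all(xs)
}

/// Computes every inverse first and writes back only if all succeeded.
fn invert_all<T: Invertible>(xs: &mut [T]) -> Result<(), &'static str> {
    let inverses: Option<Vec<T>> = xs.iter().map(|x| x.inv()).collect();
    let inverses = inverses.ok_or("zero has no inverse")?;
    for (x, inv) in xs.iter_mut().zip(inverses) {
        *x = inv;
    }
    Ok(())
}

/// A deliberately non-Send, non-Sync element type (it holds an `Rc`).
pub struct Local(Rc<f64>);

impl Local {
    pub fn new(v: f64) -> Self {
        Local(Rc::new(v))
    }
    pub fn value(&self) -> f64 {
        *self.0
    }
}

impl Invertible for Local {
    fn inv(&self) -> Option<Self> {
        if *self.0 == 0.0 {
            None
        } else {
            Some(Local::new(1.0 / *self.0))
        }
    }
}
```

Built without the feature, `inplace_batch_inverse` accepts a slice of `Local`; with the feature on, the same call is rejected at compile time. The bound then becomes a documented consequence of enabling `parallel` rather than a cost every caller pays.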

}
}
unsafe impl<E> Send for SendPtr<E> {}
unsafe impl<E> Sync for SendPtr<E> {}

Low — Sync impl is broader than needed

for_each on a rayon ParallelIterator only requires the closure to be Send. Since ptr is Copy, each closure invocation gets its own copy of the raw pointer, so the closure is Send as long as SendPtr: Send. The Sync impl (&SendPtr<E> safe to share across threads) is never exercised by this code and makes a stronger safety claim than what is actually verified. It should be removed to keep the unsafe surface minimal.
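
A sketch of a `Send`-only pointer wrapper driving the disjoint swaps, using `std::thread::scope` instead of rayon so it is self-contained. Each scoped thread gets its own `FnOnce + Send` closure, so no `Sync` impl is needed here. Note that rayon's `ParallelIterator::for_each` is declared with `OP: Fn(Self::Item) + Sync + Send`, so a closure that captures the wrapper by value would also need the wrapper to be `Sync` under rayon; whether the impl can be dropped in the PR depends on that. `parallel_bit_reverse`, the `get` accessor, and this `reverse_index` are assumptions of the sketch, not the PR's code.

```rust
use std::thread;

/// Raw-pointer wrapper that is `Send` but deliberately not `Sync`.
struct SendPtr<T>(*mut T);

// Manual impls: `derive` would add an unwanted `T: Copy` bound.
impl<T> Clone for SendPtr<T> {
    fn clone(&self) -> Self {
        *self
    }
}
impl<T> Copy for SendPtr<T> {}

// SAFETY (for this sketch): each thread only dereferences slots it owns.
unsafe impl<T> Send for SendPtr<T> {}

impl<T> SendPtr<T> {
    /// Taking `self` makes closures capture the whole wrapper rather than
    /// the bare `*mut T` field (edition 2021 captures fields individually).
    fn get(self) -> *mut T {
        self.0
    }
}

/// Reverses the lowest `bits` bits of `i`.
fn reverse_index(i: usize, bits: u32) -> usize {
    if bits == 0 {
        0
    } else {
        i.reverse_bits() >> (usize::BITS - bits)
    }
}

/// Bit-reverse permutation with the index range split across `workers` threads.
fn parallel_bit_reverse<T: Send>(data: &mut [T], workers: usize) {
    let n = data.len();
    assert!(n.is_power_of_two(), "length must be a power of two");
    let bits = n.trailing_zeros();
    let ptr = SendPtr(data.as_mut_ptr());
    let workers = workers.max(1);
    let step = (n + workers - 1) / workers;
    thread::scope(|s| {
        for start in (0..n).step_by(step) {
            s.spawn(move || {
                for i in start..(start + step).min(n) {
                    let j = reverse_index(i, bits);
                    if i < j {
                        // SAFETY: i, j < n, and only the smaller index of each
                        // pair performs the swap, so no slot is touched twice.
                        unsafe { std::ptr::swap(ptr.get().add(i), ptr.get().add(j)) };
                    }
                }
            });
        }
    });
}
```

Applying the permutation twice restores the input, which makes a cheap sanity check for any parallel variant.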

Suggested change
- unsafe impl<E> Sync for SendPtr<E> {}
+ unsafe impl<E> Send for SendPtr<E> {}

@claude

claude Bot commented Apr 23, 2026

Review: Misc optimizations (perf/parallel-bitrev-batch-inverse-merkle)

Summary

Four orthogonal performance improvements: parallel bit-reverse permutation, parallel batch inverse, cache-friendly Merkle commit (sequential column reads + bit-reverse hashes), and evaluate_fft_bit_reversed to skip a redundant permute on the FRI path. The algorithmic correctness of all four changes is sound. A few issues below.


Accidental files — should not be in this PR

These look like artifacts from an automated optimization session:

| File | Issue |
| --- | --- |
| .claude/scheduled_tasks.lock | Internal Claude Code scheduler lock file (contains a session ID + PID). Not source code. |
| .claude/skills/optimize-prover/SKILL.md | AI-agent skill definition. Fine to keep in the repo if intentional, but worth a deliberate commit rather than bundling with perf work. |
| disable-advise-drop-cache.patch | Has // DEBUG: disabled to test MADV_DONTNEED hypothesis; a debug experiment patch, not a real change. |
| noop-spill.patch | Has // DEBUG: turn into a no-op to isolate spill mechanism; likewise a debug probe. |
| fix-fib-iterative-halt-exit-code.patch | Inserts li a0, 0 after mv a0, t1, discarding the Fibonacci result before the ecall; this is a benchmarking workaround, not a correctness fix. Should not land as-is. |
| others/optimize_report.md | Optimization session notes; fine as documentation but not a code artifact. |

Code issues

Medium — inplace_batch_inverse unconditional Send + Sync bound (see inline comment on element.rs:56)

The where Self: Send + Sync constraint appears on the public function signature without any #[cfg(feature = "parallel")] gate. Callers on platforms or crates that never enable parallel must now satisfy this bound, breaking any field implementation whose BaseType is not Send + Sync. The bound should be confined to the parallel branch (e.g. via an internal helper).

Low — unsafe impl Sync for SendPtr<E> is unnecessary (see inline on bit_reversing.rs:21)

The parallel swap pattern only requires SendPtr: Send (the closure is move, so each rayon task gets its own copy of the pointer). The Sync impl grants &SendPtr<E> cross-thread shareability without a corresponding correctness argument. Removing it keeps the unsafe surface minimal.


Correctness notes (no action needed)

  • Parallel bit-reverse safety: The proof is correct. For any i ≠ j, {i, rev(i)} ∩ {j, rev(j)} = ∅ except when i = rev(j), in which case the br > i guard ensures only the smaller-index thread executes the swap. No two threads touch the same slot.
  • Merkle commit reorder: Sequential column reads followed by in_place_bit_reverse_permute on 32-byte digests is equivalent to the previous scattered-read approach. Cache behaviour is strictly better.
  • Parallel chunk_and_generate: rayon::collect() preserves order on indexed iterators; result is identical to the sequential version provided generate is pure, which it appears to be.
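
The Merkle commit reorder note can be checked on a toy example. This sketch assumes a stand-in row hash built on `DefaultHasher` (the real code produces 32-byte digests) and hypothetical helpers `leaves_scattered` and `leaves_sequential`; it only illustrates that hashing rows in natural order and then bit-reversing the digest vector gives the same leaves as reading row `br(row)` inside the loop.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Reverses the lowest `bits` bits of `i`.
fn reverse_index(i: usize, bits: u32) -> usize {
    if bits == 0 {
        0
    } else {
        i.reverse_bits() >> (usize::BITS - bits)
    }
}

/// Sequential in-place bit-reverse permutation.
fn bit_reverse_permute<T>(v: &mut [T]) {
    let n = v.len();
    assert!(n.is_power_of_two());
    let bits = n.trailing_zeros();
    for i in 0..n {
        let j = reverse_index(i, bits);
        if i < j {
            v.swap(i, j);
        }
    }
}

/// Toy leaf hash over one row of all columns (stand-in for the real digest).
fn hash_row(columns: &[Vec<u64>], row: usize) -> u64 {
    let mut h = DefaultHasher::new();
    for col in columns {
        col[row].hash(&mut h);
    }
    h.finish()
}

/// Old shape: leaf `row` hashes the scattered row `br(row)`.
fn leaves_scattered(columns: &[Vec<u64>]) -> Vec<u64> {
    let n = columns[0].len();
    assert!(n.is_power_of_two());
    let bits = n.trailing_zeros();
    (0..n)
        .map(|row| hash_row(columns, reverse_index(row, bits)))
        .collect()
}

/// New shape: hash rows in natural order, then permute the digests once.
fn leaves_sequential(columns: &[Vec<u64>]) -> Vec<u64> {
    let n = columns[0].len();
    let mut leaves: Vec<u64> = (0..n).map(|row| hash_row(columns, row)).collect();
    bit_reverse_permute(&mut leaves);
    leaves
}
```

The second form moves the permutation from the wide column data onto the much smaller digest vector, which is where the PR's reported reduction in scattered reads comes from.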

@github-actions

Codex Code Review

Findings:

  • Low: crypto/math/src/field/element.rs:54
    inplace_batch_inverse() is no longer all-or-nothing on error. The old implementation only wrote back after the single inv() succeeded; the new parallel path uses par_chunks_mut(...).try_for_each(...), so if one chunk hits a zero and returns Err, other chunks may already have overwritten their inputs with inverses. That leaves callers with partially mutated data after a failure, which is a real behavioral regression for a public API. Either preflight for zero / stage the work before mutating, or keep the old semantics.

  • Low: crypto/math/src/fft/cpu/bit_reversing.rs:2, crypto/math/src/field/element.rs:54-56
    The PR adds Send / Sync bounds to public math APIs unconditionally, not just to the parallel-only path. That is a breaking API change for downstream users even when the parallel feature is disabled. The bounds should live on internal parallel helpers/branches, not on the public function signatures.
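
A sketch of the "preflight for zero" option from the first finding, on a toy field. It assumes plain `u64` values reduced modulo the Goldilocks prime and a sequential loop over chunks standing in for `par_chunks_mut`; `batch_inverse_seq` and `batch_inverse_all_or_nothing` are hypothetical names, not the PR's functions.

```rust
/// Goldilocks prime 2^64 - 2^32 + 1.
const P: u64 = 0xFFFF_FFFF_0000_0001;

fn mul(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % P as u128) as u64
}

/// Inverse via Fermat's little theorem, a^(P-2); `None` for zero.
fn inv(a: u64) -> Option<u64> {
    if a % P == 0 {
        return None;
    }
    let (mut base, mut exp, mut acc) = (a % P, P - 2, 1u64);
    while exp > 0 {
        if exp & 1 == 1 {
            acc = mul(acc, base);
        }
        base = mul(base, base);
        exp >>= 1;
    }
    Some(acc)
}

/// Montgomery's trick on one slice: a single field inversion in total.
fn batch_inverse_seq(xs: &mut [u64]) -> Result<(), &'static str> {
    let mut prefix = Vec::with_capacity(xs.len());
    let mut acc = 1u64;
    for &x in xs.iter() {
        prefix.push(acc); // product of all earlier elements
        acc = mul(acc, x);
    }
    // A zero anywhere makes the product zero, so this fails before any write.
    let mut inv_acc = inv(acc).ok_or("zero in batch")?;
    for i in (0..xs.len()).rev() {
        let x = xs[i];
        xs[i] = mul(inv_acc, prefix[i]);
        inv_acc = mul(inv_acc, x);
    }
    Ok(())
}

/// Chunked variant that scans for zeros before mutating anything.
fn batch_inverse_all_or_nothing(xs: &mut [u64], chunk: usize) -> Result<(), &'static str> {
    if xs.iter().any(|&x| x % P == 0) {
        return Err("zero in batch"); // buffer untouched
    }
    for c in xs.chunks_mut(chunk.max(1)) {
        batch_inverse_seq(c)?; // cannot fail after the preflight
    }
    Ok(())
}
```

The preflight costs one extra read pass over the slice, which is cheap next to the multiplications, and it restores the old guarantee that an `Err` leaves the caller's buffer unchanged.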

No high/medium security or correctness issue stood out in the FFT/FRI ordering changes themselves.

I couldn’t run cargo test in this sandbox because Cargo/Rustup needs write access under ~/.cargo / ~/.rustup, so this review is source-based.

@gabrielbosio
Collaborator Author

/bench

@github-actions

github-actions Bot commented Apr 23, 2026

Benchmark — fib_iterative_8M (median of 3)

Table parallelism: 1

| Metric | main | PR | Δ |
| --- | --- | --- | --- |
| Peak heap | 49533 MB | 49509 MB | -24 MB (-0.0%) ⚪ |
| Prove time | 70.451s | 59.769s | -10.682s (-15.2%) 🟢 |

🎉 Improvement detected — heap or time decreased by more than 5%.

✅ Low variance (time: 1.1%, heap: 0.0%)

Commit: 3de2694 · Baseline: built from main · Runner: self-hosted bench

@gabrielbosio
Collaborator Author

/bench k=4

@gabrielbosio
Collaborator Author

It only shows a speedup with low table parallelism. Closing.

@gabrielbosio gabrielbosio reopened this Apr 23, 2026
@gabrielbosio gabrielbosio force-pushed the perf/parallel-bitrev-batch-inverse-merkle branch from 33e9573 to 365d044 on April 23, 2026 19:18

Read columns at natural index k inside the parallel hashing loop, then apply
in_place_bit_reverse_permute to the Commitment vector before building the
Merkle tree. Same leaves as reading at br(row_idx) inside the loop; replaces
scattered column reads (~2GB volume on MEMW_R) with sequential reads plus a
64MB in-place bit-reverse pass.

Phase 5 of trace build invokes chunk_and_generate 10 times; each call
walked its chunks sequentially. MEMW alone produces ~12 chunks at
fib_iterative_2M, so there is substantial per-chunk parallelism available
on a free rayon pool (trace build runs before multi_prove).

fib_iterative_2M on Linux x86_64, 12 cores, 3 samples:
- prove wall-clock: 75.4s -> 74.3s median (-1.5%)
- Trace build sub-phase: 4.56s -> 3.96s (-13.2%)
- Verification against baseline binary: PASS

round_4 called evaluate_fft (which internally permutes the FFT output to
natural order) and then in_place_bit_reverse_permute on the result to
flip it back. Both permutes cancel. FRI commit_phase_from_evaluations
pairs evals as chunks_exact(2) expecting {f(x), f(-x)} adjacency, which
is exactly the bit-reversed output of the Bowers forward FFT.
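
The adjacency claim can be checked on a tiny field. This sketch assumes F_17 with the 8th root of unity ω = 2 (2^4 = 16 = -1 mod 17) and uses naive Horner evaluation followed by an index reorder, not a Bowers FFT. It shows that in bit-reversed order positions 2k and 2k+1 hold x and -x, because br(2k+1) = br(2k) + n/2 and ω^(n/2) = -1. All function names are hypothetical.

```rust
const P: u64 = 17;
const OMEGA: u64 = 2; // order 8 in F_17, since 2^4 = 16 = -1 (mod 17)
const N: usize = 8;

/// Reverses the lowest `bits` bits of `i`.
fn reverse_index(i: usize, bits: u32) -> usize {
    if bits == 0 {
        0
    } else {
        i.reverse_bits() >> (usize::BITS - bits)
    }
}

/// Domain points omega^br(k) for k = 0..N, i.e. the domain in bit-reversed order.
fn bit_reversed_domain() -> Vec<u64> {
    let mut natural = Vec::with_capacity(N);
    let mut x = 1u64;
    for _ in 0..N {
        natural.push(x);
        x = x * OMEGA % P;
    }
    let bits = N.trailing_zeros();
    (0..N).map(|k| natural[reverse_index(k, bits)]).collect()
}

/// Horner evaluation of a polynomial with coefficients in F_17 (lowest degree first).
fn eval(coeffs: &[u64], x: u64) -> u64 {
    coeffs.iter().rev().fold(0, |acc, &c| (acc * x + c) % P)
}

/// Evaluations over the bit-reversed domain: adjacent entries are f(x), f(-x).
fn evaluate_bit_reversed(coeffs: &[u64]) -> Vec<u64> {
    bit_reversed_domain()
        .into_iter()
        .map(|x| eval(coeffs, x))
        .collect()
}
```

With this ordering, `chunks_exact(2)` over the evaluations yields exactly the {f(x), f(-x)} pairs, and f(x) + f(-x) equals twice the even part of f evaluated at x².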

Added Polynomial::evaluate_fft_bit_reversed that skips the final permute,
and called it from round_4. Result: two ~24ms permutes (at 2N=4M per
table) eliminated per prove.

fib_iterative_2M on Linux x86_64, 12 cores, 5 samples:
- prove wall-clock: 75.4s -> 74.4s median (-1.3%), 75.5s -> 74.3s mean (-1.6%)
- R4 interpolate+evaluate_fft sub-phase: 2.73s -> 1.95s (-29%)
- CV 0.6% (2xCV=1.2% threshold, 1.3% improvement clears it)
- Verification against baseline binary: PASS

Every FFT call site ends with a sequential O(N) bit-reverse permutation.
At N=4M elements this is ~24ms on its own, called dozens of times per
prove across all column LDEs, composition-poly parts, and the R4 deep
LDE. This serial pass bottlenecks the otherwise-parallel FFT pipeline (Amdahl's law).

Swap pairs (i, br(i)) with i < br(i) are disjoint, so parallelization is
safe with a Send/Sync raw-pointer wrapper (the `i < br(i)` predicate
selects a unique owner per pair, so no two threads ever touch the same
slot). Sequential fallback retained for N < 16K.

fib_iterative_2M on Linux x86_64, 12 cores, 3 samples:
- prove wall-clock: 75.4s -> 73.9s median (-2.0%), 75.5s -> 74.1s mean (-1.9%)
- R2 decompose_and_extend_d2: 8.28s -> 7.78s (-6.0%)
- R4 interpolate+evaluate_fft: 2.73s -> 2.40s (-12.1%)
- CV 0.7% (2xCV=1.4% threshold, 2.0% improvement clears it comfortably)
- Verification against baseline binary: PASS
- All 121 stark lib tests + math bit_reverse tests pass

Montgomery batch inverse has a serial prefix-product dependency, but
chunks are independent: each chunk inverts its own elements without
needing values from other chunks. Trade K-1 extra field inversions
(~1000 mults each in Goldilocks, negligible next to the ~2N mults per
chunk) for K-way parallelism.
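
The trade described above can be sketched as follows, assuming reduced `u64` values modulo the Goldilocks prime and `std::thread::scope` in place of rayon's `par_chunks_mut`. Each helper returns the number of field inversions it performed so the K versus 1 count is visible; the function names are hypothetical and inputs are assumed nonzero.

```rust
use std::thread;

/// Goldilocks prime 2^64 - 2^32 + 1.
const P: u64 = 0xFFFF_FFFF_0000_0001;

fn mul(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % P as u128) as u64
}

/// Inverse of a nonzero element via a^(P-2).
fn inv(a: u64) -> u64 {
    let (mut base, mut exp, mut acc) = (a % P, P - 2, 1u64);
    while exp > 0 {
        if exp & 1 == 1 {
            acc = mul(acc, base);
        }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}

/// Montgomery's trick on one slice of nonzero elements.
/// Returns the number of field inversions performed (0 or 1).
fn batch_inverse(xs: &mut [u64]) -> usize {
    if xs.is_empty() {
        return 0;
    }
    let mut prefix = Vec::with_capacity(xs.len());
    let mut acc = 1u64;
    for &x in xs.iter() {
        prefix.push(acc); // serial dependency: product of earlier elements
        acc = mul(acc, x);
    }
    let mut inv_acc = inv(acc);
    for i in (0..xs.len()).rev() {
        let x = xs[i];
        xs[i] = mul(inv_acc, prefix[i]);
        inv_acc = mul(inv_acc, x);
    }
    1
}

/// Splits the slice into about `k` chunks and inverts each on its own thread.
/// Returns the total number of field inversions (one per chunk).
fn batch_inverse_chunked(xs: &mut [u64], k: usize) -> usize {
    let k = k.max(1);
    let chunk = ((xs.len() + k - 1) / k).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = xs
            .chunks_mut(chunk)
            .map(|c| s.spawn(move || batch_inverse(c)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("chunk worker panicked"))
            .sum::<usize>()
    })
}
```

For 1000 elements and k = 4 the chunked version performs 4 inversions instead of 1 and produces the same output, since each chunk's prefix products never reference another chunk.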

Threshold at 2^16 elements so short batches (single-FRI-layer twiddles
at smaller layers, inv arrays in small tables) keep the sequential
path. Above threshold, split into num_threads chunks and invert each
independently via par_chunks_mut.

This is surprisingly impactful because batch_inverse is called on large
inputs throughout the prover — coset-point inverses in R2 decompose,
constraint-denominator inverses, OOD x_i - z inverses, deep-composition
inv_h / inv_t, FRI coset twiddles, etc. Every summed-over-tables hot
denominator pipeline hits it.

fib_iterative_2M on Linux x86_64, 12 cores, 3 samples:
- prove wall-clock: 75.4s -> 72.45s median (-3.9%), 75.5s -> 72.4s mean (-4.1%)
- R3 OOD evaluation: 5.66s -> 4.30s (-24%)
- R4 deep_composition_poly_evals: 5.62s -> 4.42s (-21%)
- R4 queries & openings: 1.58s -> 1.14s (-28%)
- R4 interpolate+evaluate_fft: 2.87s -> 2.70s (-6%)
- CV 0.4% (2xCV=0.8%, 3.9% improvement clears it easily)
- Verification against baseline binary: PASS
- All math batch_inverse + stark 121 lib tests pass

@gabrielbosio gabrielbosio force-pushed the perf/parallel-bitrev-batch-inverse-merkle branch from 365d044 to c1dd556 on April 23, 2026 19:19
@gabrielbosio gabrielbosio reopened this Apr 24, 2026
@gabrielbosio gabrielbosio marked this pull request as ready for review April 24, 2026 14:27
Comment on lines +10 to +11
const PARALLEL_BITREV_THRESHOLD: usize = 1 << 14;
if n >= PARALLEL_BITREV_THRESHOLD {

Medium — Missing power-of-two guard before unsafe parallel swap

The SAFETY argument for core::ptr::swap relies on reverse_index being a bijection on [0, n), which only holds when n is a power of two. The sequential path just produces wrong output if the contract is broken; the parallel path invokes undefined behaviour (a data race) because two threads could swap the same element concurrently.
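
A small check of the bijection property, under an assumed definition of `reverse_index` (reverse all bits, then shift by the word size minus `n.trailing_zeros()`). The crate's real function may differ, so the exact failure mode for a non-power-of-two length is not claimed here; the sketch only shows that the assumed map stops being a permutation of [0, n) once n is not a power of two.

```rust
/// Assumed shape of the index map; not copied from the crate.
fn reverse_index(i: usize, n: usize) -> usize {
    let bits = n.trailing_zeros();
    if bits == 0 {
        i
    } else {
        i.reverse_bits() >> (usize::BITS - bits)
    }
}

/// True when `reverse_index(_, n)` hits every index in [0, n) exactly once.
fn is_bijection(n: usize) -> bool {
    let mut seen = vec![false; n];
    for i in 0..n {
        let j = reverse_index(i, n);
        if j >= n || seen[j] {
            return false;
        }
        seen[j] = true;
    }
    true
}

/// Sequential permute with the guard suggested in the review.
fn bit_reverse_permute_checked<T>(v: &mut [T]) {
    debug_assert!(
        v.len().is_power_of_two(),
        "bit-reverse permute requires a power-of-two length"
    );
    let n = v.len();
    for i in 0..n {
        let j = reverse_index(i, n);
        if i < j {
            v.swap(i, j);
        }
    }
}
```

Under this assumed definition, n = 12 maps every index into {0, 1, 2, 3}, so the disjoint-pairs argument no longer covers the whole slice.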

A debug_assert! catches violations in debug builds at zero release cost:

Suggested change
- const PARALLEL_BITREV_THRESHOLD: usize = 1 << 14;
- if n >= PARALLEL_BITREV_THRESHOLD {
+ const PARALLEL_BITREV_THRESHOLD: usize = 1 << 14;
+ debug_assert!(n.is_power_of_two(), "in_place_bit_reverse_permute requires a power-of-two length");
+ if n >= PARALLEL_BITREV_THRESHOLD {

Comment on lines +87 to +112
pub fn evaluate_fft_bit_reversed<F: IsFFTField + IsSubFieldOf<E>>(
    poly: &Polynomial<FieldElement<E>>,
    blowup_factor: usize,
    domain_size: Option<usize>,
) -> Result<Vec<FieldElement<E>>, FFTError>
where
    E: Send + Sync,
{
    let domain_size = domain_size.unwrap_or(0);
    let len = core::cmp::max(poly.coeff_len(), domain_size).next_power_of_two() * blowup_factor;
    if len.trailing_zeros() as u64 > F::TWO_ADICITY {
        return Err(FFTError::DomainSizeError(len.trailing_zeros() as usize));
    }
    if poly.coefficients().is_empty() {
        return Ok(vec![FieldElement::zero(); len]);
    }

    let mut coeffs = poly.coefficients().to_vec();
    coeffs.resize(len, FieldElement::zero());

    let order = len.trailing_zeros() as u64;
    let layer_twiddles =
        LayerTwiddles::<F>::new(order).ok_or(FFTError::DomainSizeError(order as usize))?;
    dispatch_fft(&mut coeffs, &layer_twiddles)?;
    Ok(coeffs)
}

Low — No test for the new public API

evaluate_fft_bit_reversed is used in a correctness-critical path (FRI commit phase) but has no unit test. The invariant to verify is simple: the output should equal evaluate_fft with in_place_bit_reverse_permute applied.

#[test]
fn evaluate_fft_bit_reversed_matches_evaluate_fft_permuted() {
    use crate::fft::cpu::bit_reversing::in_place_bit_reverse_permute;
    let coeffs: Vec<FE> = (0u64..8).map(FE::from).collect();
    let poly = Polynomial::new(&coeffs);
    let mut expected = Polynomial::evaluate_fft::<F>(&poly, 2, None).unwrap();
    in_place_bit_reverse_permute(&mut expected);
    let got = Polynomial::evaluate_fft_bit_reversed::<F>(&poly, 2, None).unwrap();
    assert_eq!(got, expected);
}

@github-actions

Codex Code Review

  • Medium: crypto/math/src/field/element.rs changes inplace_batch_inverse from all-or-nothing to partially mutating on error. The new par_chunks_mut(...).try_for_each(...) path can successfully invert and write back some chunks before another chunk hits a zero and returns Err. Previously the single-slice algorithm failed before any write-back. That makes failure handling nondeterministic and can leave the caller’s buffer corrupted on Err. Fix by validating the full slice first, or by computing chunk results into temporary storage and committing only after all chunks succeed.

  • Low: crypto/math/src/fft/cpu/bit_reversing.rs and crypto/math/src/field/element.rs add unconditional Send/Sync bounds to APIs that used to work in sequential builds. Even with parallel disabled, callers now need thread-safe element types for purely local operations. That is a source-compatibility regression with no runtime benefit on the non-parallel path. Keep those bounds on parallel-only helpers instead of the public function signatures.

No other concrete security, correctness, or significant performance issues stood out in the reviewed diff.

I couldn’t run targeted cargo test here because the sandbox blocks the toolchain/dependency writes and network access cargo attempted.

Comment on lines +87 to +112
pub fn evaluate_fft_bit_reversed<F: IsFFTField + IsSubFieldOf<E>>(
    poly: &Polynomial<FieldElement<E>>,
    blowup_factor: usize,
    domain_size: Option<usize>,
) -> Result<Vec<FieldElement<E>>, FFTError>
where
    E: Send + Sync,
{
    let domain_size = domain_size.unwrap_or(0);
    let len = core::cmp::max(poly.coeff_len(), domain_size).next_power_of_two() * blowup_factor;
    if len.trailing_zeros() as u64 > F::TWO_ADICITY {
        return Err(FFTError::DomainSizeError(len.trailing_zeros() as usize));
    }
    if poly.coefficients().is_empty() {
        return Ok(vec![FieldElement::zero(); len]);
    }

    let mut coeffs = poly.coefficients().to_vec();
    coeffs.resize(len, FieldElement::zero());

    let order = len.trailing_zeros() as u64;
    let layer_twiddles =
        LayerTwiddles::<F>::new(order).ok_or(FFTError::DomainSizeError(order as usize))?;
    dispatch_fft(&mut coeffs, &layer_twiddles)?;
    Ok(coeffs)
}

Low — Duplicate setup code with evaluate_fft

evaluate_fft_bit_reversed is identical to evaluate_fft except it skips the final in_place_bit_reverse_permute. Both functions share ~20 lines of setup (domain-size computation, empty-poly check, zero-padding, twiddle construction). If either is changed in the future the other will silently diverge.

Consider a private helper:

fn evaluate_fft_raw<F: IsFFTField + IsSubFieldOf<E>, E: IsField + Send + Sync>(
    poly: &Polynomial<FieldElement<E>>,
    blowup_factor: usize,
    domain_size: Option<usize>,
) -> Result<Vec<FieldElement<E>>, FFTError> {
    // shared setup + dispatch_fft, no permutation
}

Then evaluate_fft calls the helper and permutes, while evaluate_fft_bit_reversed just calls the helper.

@claude

claude Bot commented Apr 24, 2026

Review: Misc optimizations (PR #545)

Summary

The PR introduces four targeted performance improvements: parallel bit-reverse permutation, parallel batch inverse chunking, sequential-read + single bit-reverse permute for Merkle leaf hashing, and parallel trace chunk generation. The overall approach is sound and the correctness reasoning is solid. A few issues are worth addressing before merge.


Issues

Medium — Missing power-of-two guard before unsafe parallel swap (bit_reversing.rs)

The unsafe parallel swap relies on reverse_index being a bijection, which only holds when n is a power of two. Violating the (documented) contract in the sequential path gives wrong output; in the parallel path it causes undefined behaviour (a data race). A debug_assert!(n.is_power_of_two()) at the top of the parallel block catches this in debug builds at zero release cost. See inline comment.

Low — No test for evaluate_fft_bit_reversed (polynomial.rs)

This new public API sits in a correctness-critical path (FRI commit phase) but ships without a unit test. The property to verify is simple: output should equal evaluate_fft with in_place_bit_reverse_permute applied. See inline comment for a suggested test.

Low — Duplicate setup code between evaluate_fft and evaluate_fft_bit_reversed (polynomial.rs)

Both functions share ~20 lines of identical setup (domain-size computation, empty-poly check, zero-padding, twiddle construction). Extracting a private evaluate_fft_raw helper would prevent silent divergence if either is changed. See inline comment.


No issues found in

  • inplace_batch_inverse parallel chunking — correctness reasoning (K-1 extra inversions, each chunk independent) is sound and the threshold is reasonable.
  • commit_columns_bit_reversed rewrite — the equivalence of sequential hashing + bit-reverse permute over digests vs. scattered reads is correct (leaves[br(k)] = hash(col[k]) ⟺ leaves[j] = hash(col[br(j)])).
  • chunk_and_generate parallelisation — par_chunks().collect() preserves chunk order; &generate borrow is correct for the Sync bound.
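
The order-preservation point for chunk_and_generate can be illustrated without rayon. This sketch assumes a hypothetical `chunk_and_generate_sketch` that spawns one scoped thread per chunk and joins the handles in spawn order, so results come back in chunk order regardless of which thread finishes first. Sharing `generate` by reference across threads is what requires `G: Sync`.

```rust
use std::thread;

/// Runs `generate` on each chunk in its own scoped thread and returns the
/// results in chunk order. Hypothetical stand-in, not the PR's function.
fn chunk_and_generate_sketch<T, R, G>(rows: &[T], chunk_size: usize, generate: &G) -> Vec<R>
where
    T: Sync,
    R: Send,
    G: Fn(&[T]) -> R + Sync,
{
    thread::scope(|s| {
        // Spawn in chunk order and keep the handles in that same order.
        let handles: Vec<_> = rows
            .chunks(chunk_size.max(1))
            .map(|chunk| s.spawn(move || generate(chunk)))
            .collect();
        // Joining in spawn order fixes the output order.
        handles
            .into_iter()
            .map(|h| h.join().expect("generator panicked"))
            .collect()
    })
}

/// Sequential reference with the same output for a pure `generate`.
fn chunk_and_generate_seq<T, R, G>(rows: &[T], chunk_size: usize, generate: &G) -> Vec<R>
where
    G: Fn(&[T]) -> R,
{
    rows.chunks(chunk_size.max(1)).map(|c| generate(c)).collect()
}
```

Both versions agree whenever `generate` is pure, which is the condition the reviews rely on.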

@nicole-graus
Collaborator

/bench k=1


2 participants