Misc optimizations by gabrielbosio · Pull Request #545 · yetanotherco/lambda_vm

gabrielbosio · 2026-04-23T18:06:16Z

Stacks five independent prover optimizations in R1 commit, trace build, and the FFT / batch-inverse pipelines.

Sequential reads in commit_columns_bit_reversed: replaces scattered columns[col][br(row)] reads (~2 GB) with sequential reads plus a 64 MB post bit-reverse on the digest vector.
par_chunks inside chunk_and_generate: runs the 10 trace-build Phase 5 generators in parallel on the idle rayon pool.
Skip the redundant bit-reverse pair in the R4 deep-composition LDE: adds evaluate_fft_bit_reversed to drop both cancelling permutes.
Parallel in_place_bit_reverse_permute: swap pairs (i, br(i)) with i < br(i) are disjoint, so a Send/Sync raw-pointer wrapper lets rayon drive them; sequential fallback below N=16K.
Chunked parallel inplace_batch_inverse: chunks the Montgomery prefix product, trading K−1 extra inversions for K-way parallelism; sequential fallback below N=2^16.

claude · 2026-04-23T18:09:37Z

-    pub fn inplace_batch_inverse(numbers: &mut [Self]) -> Result<(), FieldError> {
+    pub fn inplace_batch_inverse(numbers: &mut [Self]) -> Result<(), FieldError>
+    where
+        Self: Send + Sync,


Medium — Unconditional API-breaking bound

The where Self: Send + Sync constraint is on the public function signature unconditionally, not gated by #[cfg(feature = "parallel")]. This means every caller of inplace_batch_inverse must now satisfy the bound even on no_std targets or crates that never enable the parallel feature, which is a breaking change for any downstream field implementation whose BaseType is not Send + Sync.

Consider splitting:

Suggested change

Self: Send + Sync,

pub fn inplace_batch_inverse(numbers: &mut [Self]) -> Result<(), FieldError> {

…and adding the Send + Sync bound only in the #[cfg(feature = "parallel")] branch, e.g. via an internal helper with the tighter bound, keeping the public API signature unchanged.

claude · 2026-04-23T18:09:40Z

+                }
+            }
+            unsafe impl<E> Send for SendPtr<E> {}
+            unsafe impl<E> Sync for SendPtr<E> {}


Low — Sync impl is broader than needed

for_each on a rayon ParallelIterator only requires the closure to be Send. Since ptr is Copy, each closure invocation gets its own copy of the raw pointer, so the closure is Send as long as SendPtr: Send. The Sync impl (&SendPtr<E> safe to share across threads) is never exercised by this code and makes a stronger safety claim than what is actually verified. It should be removed to keep the unsafe surface minimal.

Suggested change

unsafe impl<E> Sync for SendPtr<E> {}

unsafe impl<E> Send for SendPtr<E> {}

claude · 2026-04-23T18:10:01Z

Review: Misc optimizations (perf/parallel-bitrev-batch-inverse-merkle)

Summary

Four orthogonal performance improvements: parallel bit-reverse permutation, parallel batch inverse, cache-friendly Merkle commit (sequential column reads + bit-reverse hashes), and evaluate_fft_bit_reversed to skip a redundant permute on the FRI path. The algorithmic correctness of all four changes is sound. A few issues below.

Accidental files — should not be in this PR

These look like artifacts from an automated optimization session:

| File | Issue |
| --- | --- |
| `.claude/scheduled_tasks.lock` | Internal Claude Code scheduler lock file (contains a session ID + PID). Not source code. |
| `.claude/skills/optimize-prover/SKILL.md` | AI-agent skill definition. Fine to keep in the repo if intentional, but worth a deliberate commit rather than bundling with perf work. |
| `disable-advise-drop-cache.patch` | Has `// DEBUG: disabled to test MADV_DONTNEED hypothesis`: a debug experiment patch, not a real change. |
| `noop-spill.patch` | Has `// DEBUG: turn into a no-op to isolate spill mechanism`: likewise a debug probe. |
| `fix-fib-iterative-halt-exit-code.patch` | Inserts `li a0, 0` after `mv a0, t1`, discarding the Fibonacci result before the ecall. This is a benchmarking workaround, not a correctness fix. Should not land as-is. |
| `others/optimize_report.md` | Optimization session notes; fine as documentation but not a code artifact. |

Code issues

Medium — inplace_batch_inverse unconditional Send + Sync bound (see inline comment on element.rs:56)

The where Self: Send + Sync constraint appears on the public function signature without any #[cfg(feature = "parallel")] gate. Callers on platforms or crates that never enable parallel must now satisfy this bound, breaking any field implementation whose BaseType is not Send + Sync. The bound should be confined to the parallel branch (e.g. via an internal helper).

Low — unsafe impl Sync for SendPtr<E> is unnecessary (see inline on bit_reversing.rs:21)

The parallel swap pattern only requires SendPtr: Send (the closure is move, so each rayon task gets its own copy of the pointer). The Sync impl grants &SendPtr<E> cross-thread shareability without a corresponding correctness argument. Removing it keeps the unsafe surface minimal.

Correctness notes (no action needed)

Parallel bit-reverse safety: The proof is correct. For any i ≠ j, {i, rev(i)} ∩ {j, rev(j)} = ∅ except when i = rev(j), in which case the br > i guard ensures only the smaller-index thread executes the swap. No two threads touch the same slot.
Merkle commit reorder: Sequential column reads followed by in_place_bit_reverse_permute on 32-byte digests is equivalent to the previous scattered-read approach. Cache behaviour is strictly better.
Parallel chunk_and_generate: rayon::collect() preserves order on indexed iterators; result is identical to the sequential version provided generate is pure, which it appears to be.

github-actions · 2026-04-23T18:10:33Z

Codex Code Review

Findings:

Low: crypto/math/src/field/element.rs:54
inplace_batch_inverse() is no longer all-or-nothing on error. The old implementation only wrote back after the single inv() succeeded; the new parallel path uses par_chunks_mut(...).try_for_each(...), so if one chunk hits a zero and returns Err, other chunks may already have overwritten their inputs with inverses. That leaves callers with partially mutated data after a failure, which is a real behavioral regression for a public API. Either preflight for zero / stage the work before mutating, or keep the old semantics.
Low: crypto/math/src/fft/cpu/bit_reversing.rs:2, crypto/math/src/field/element.rs:54-56
The PR adds Send / Sync bounds to public math APIs unconditionally, not just to the parallel-only path. That is a breaking API change for downstream users even when the parallel feature is disabled. The bounds should live on internal parallel helpers/branches, not on the public function signatures.

No high/medium security or correctness issue stood out in the FFT/FRI ordering changes themselves.

I couldn’t run cargo test in this sandbox because Cargo/Rustup needs write access under ~/.cargo / ~/.rustup, so this review is source-based.

gabrielbosio · 2026-04-23T18:28:47Z

/bench

github-actions · 2026-04-23T18:35:11Z

Benchmark — fib_iterative_8M (median of 3)

Table parallelism: 1

| Metric | main | PR | Δ |
| --- | --- | --- | --- |
| Peak heap | 49533 MB | 49509 MB | -24 MB (-0.0%) ⚪ |
| Prove time | 70.451s | 59.769s | -10.682s (-15.2%) 🟢 |

🎉 Improvement detected — heap or time decreased by more than 5%.

✅ Low variance (time: 1.1%, heap: 0.0%)

Commit: 3de2694 · Baseline: built from main · Runner: self-hosted bench

gabrielbosio · 2026-04-23T18:47:14Z

/bench k=4

gabrielbosio · 2026-04-23T19:02:35Z

It only shows a speedup with low table parallelism. Closing.

Read columns at natural index k inside the parallel hashing loop, then apply in_place_bit_reverse_permute to the Commitment vector before building the Merkle tree. Same leaves as reading at br(row_idx) inside the loop; replaces scattered column reads (~2GB volume on MEMW_R) with sequential reads plus a 64MB in-place bit-reverse pass.
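The equivalence claimed in this commit message can be checked on a toy example. The sketch below is not the prover's code: `toy_hash` stands in for the real leaf hash, a single `u64` column stands in for the trace columns, and `reverse_index` and `bit_reverse_permute` are hypothetical helpers written only for this illustration.

```rust
/// Reverses the low log2(n) bits of `i`. Assumes `n` is a power of two.
fn reverse_index(i: usize, n: usize) -> usize {
    if n <= 1 {
        i
    } else {
        i.reverse_bits() >> (usize::BITS - n.trailing_zeros())
    }
}

/// In-place bit-reverse permutation: swaps each pair (i, br(i)) once.
fn bit_reverse_permute<T>(data: &mut [T]) {
    let n = data.len();
    for i in 0..n {
        let j = reverse_index(i, n);
        if i < j {
            data.swap(i, j);
        }
    }
}

/// Placeholder for the real leaf hash (assumption: any deterministic
/// function is enough to show the index bookkeeping).
fn toy_hash(x: u64) -> u64 {
    x.wrapping_mul(0x9E37_79B9_7F4A_7C15).rotate_left(17) ^ 0xA5A5_A5A5
}

/// Scattered pattern: leaf j reads the column at index br(j).
fn leaves_scattered(col: &[u64]) -> Vec<u64> {
    let n = col.len();
    (0..n).map(|j| toy_hash(col[reverse_index(j, n)])).collect()
}

/// Sequential pattern: hash rows in natural order, then permute the
/// digest vector once.
fn leaves_sequential(col: &[u64]) -> Vec<u64> {
    let mut leaves: Vec<u64> = col.iter().map(|&x| toy_hash(x)).collect();
    bit_reverse_permute(&mut leaves);
    leaves
}
```

Both functions produce the same leaf vector because the bit-reverse permutation is its own inverse: after the permute, slot `j` holds the digest that was computed from row `br(j)`.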

Phase 5 of trace build invokes chunk_and_generate 10 times; each call walked its chunks sequentially. MEMW alone produces ~12 chunks at fib_iterative_2M, so there is substantial per-chunk parallelism available on a free rayon pool (trace build runs before multi_prove).

fib_iterative_2M on Linux x86_64, 12 cores, 3 samples:
- prove wall-clock: 75.4s -> 74.3s median (-1.5%)
- Trace build sub-phase: 4.56s -> 3.96s (-13.2%)
- Verification against baseline binary: PASS
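A minimal sketch of the per-chunk parallelism described above, under stated assumptions: it uses `std::thread::scope` with one thread per chunk instead of rayon's `par_chunks`, and `generate` is a hypothetical pure stand-in for a trace generator, not the real one.

```rust
use std::thread;

/// Hypothetical stand-in for a per-chunk generator; assumed to be pure.
fn generate(chunk: &[u64]) -> u64 {
    chunk
        .iter()
        .fold(0u64, |acc, &x| acc.wrapping_mul(31).wrapping_add(x))
}

/// Baseline: walk the chunks one after another.
/// `chunk_len` must be non-zero (`chunks` panics on 0).
fn chunk_and_generate_sequential(rows: &[u64], chunk_len: usize) -> Vec<u64> {
    rows.chunks(chunk_len).map(generate).collect()
}

/// Parallel variant: spawn one scoped thread per chunk, then join the
/// handles in spawn order so the output order matches the baseline.
fn chunk_and_generate_parallel(rows: &[u64], chunk_len: usize) -> Vec<u64> {
    thread::scope(|s| {
        let handles: Vec<_> = rows
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || generate(chunk)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("generator thread panicked"))
            .collect()
    })
}
```

The results match the sequential version only because `generate` has no side effects and the handles are joined in the order the chunks were produced.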

round_4 called evaluate_fft (which internally permutes the FFT output to natural order) and then in_place_bit_reverse_permute on the result to flip it back. Both permutes cancel. FRI commit_phase_from_evaluations pairs evals as chunks_exact(2) expecting {f(x), f(-x)} adjacency, which is exactly the bit-reversed output of the Bowers forward FFT. Added Polynomial::evaluate_fft_bit_reversed that skips the final permute, and called it from round_4. Result: two ~24ms permutes (at 2N=4M per table) eliminated per prove.

fib_iterative_2M on Linux x86_64, 12 cores, 5 samples:
- prove wall-clock: 75.4s -> 74.4s median (-1.3%), 75.5s -> 74.3s mean (-1.6%)
- R4 interpolate+evaluate_fft sub-phase: 2.73s -> 1.95s (-29%)
- CV 0.6% (2xCV=1.2% threshold, 1.3% improvement clears it)
- Verification against baseline binary: PASS
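The {f(x), f(-x)} adjacency can be seen on a tiny domain. The sketch below is an illustration only, assuming the prime field of size 17 with generator 2 of the order-8 subgroup (2^8 = 256 = 1 mod 17, 2^4 = 16 = -1 mod 17); the function names are hypothetical and not part of the PR.

```rust
const P: u64 = 17;
const OMEGA: u64 = 2; // element of multiplicative order 8 modulo 17
const N: usize = 8;

/// Reverses the low log2(n) bits of `i`. Assumes `n` is a power of two.
fn reverse_index(i: usize, n: usize) -> usize {
    if n <= 1 {
        i
    } else {
        i.reverse_bits() >> (usize::BITS - n.trailing_zeros())
    }
}

/// Evaluation points in natural order: OMEGA^0, OMEGA^1, ..., OMEGA^(N-1).
fn domain_natural() -> Vec<u64> {
    let mut points = Vec::with_capacity(N);
    let mut x = 1;
    for _ in 0..N {
        points.push(x);
        x = x * OMEGA % P;
    }
    points
}

/// The same points listed in bit-reversed index order.
fn domain_bit_reversed() -> Vec<u64> {
    let natural = domain_natural();
    (0..N).map(|i| natural[reverse_index(i, N)]).collect()
}

/// True when every chunk of two adjacent points is of the form {x, -x}.
fn pairs_are_negations(points: &[u64]) -> bool {
    points
        .chunks_exact(2)
        .all(|pair| (pair[0] + pair[1]) % P == 0)
}
```

Slots 2k and 2k+1 of the bit-reversed order hold natural indices m and m + N/2, and multiplying by OMEGA^(N/2) = -1 negates the point, which is why `chunks_exact(2)` sees {x, -x} pairs there but not in natural order.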

Every FFT call site ends with a sequential O(N) bit-reverse permutation. At N=4M elements this is ~24ms on its own, called dozens of times per prove across all column LDEs, composition-poly parts, and the R4 deep LDE. Bottlenecks the otherwise-parallel FFT pipeline (Amdahl). Swap pairs (i, br(i)) with i < br(i) are disjoint, so parallelization is safe with a Send/Sync raw-pointer wrapper (the `i < br(i)` predicate selects a unique owner per pair, so no two threads ever touch the same slot). Sequential fallback retained for N < 16K.

fib_iterative_2M on Linux x86_64, 12 cores, 3 samples:
- prove wall-clock: 75.4s -> 73.9s median (-2.0%), 75.5s -> 74.1s mean (-1.9%)
- R2 decompose_and_extend_d2: 8.28s -> 7.78s (-6.0%)
- R4 interpolate+evaluate_fft: 2.73s -> 2.40s (-12.1%)
- CV 0.7% (2xCV=1.4% threshold, 2.0% improvement clears it comfortably)
- Verification against baseline binary: PASS
- All 121 stark lib tests + math bit_reverse tests pass

Montgomery batch inverse has a serial prefix-product dependency, but chunks are independent: each chunk inverts its own elements without needing values from other chunks. Trade K-1 extra field inversions (~1000 mults each in Goldilocks, negligible next to the ~2N mults per chunk) for K-way parallelism. Threshold at 2^16 elements so short batches (single-FRI-layer twiddles at smaller layers, inv arrays in small tables) keep the sequential path. Above threshold, split into num_threads chunks and invert each independently via par_chunks_mut.

This is surprisingly impactful because batch_inverse is called on large inputs throughout the prover: coset-point inverses in R2 decompose, constraint-denominator inverses, OOD x_i - z inverses, deep-composition inv_h / inv_t, FRI coset twiddles, etc. Every summed-over-tables hot denominator pipeline hits it.

fib_iterative_2M on Linux x86_64, 12 cores, 3 samples:
- prove wall-clock: 75.4s -> 72.45s median (-3.9%), 75.5s -> 72.4s mean (-4.1%)
- R3 OOD evaluation: 5.66s -> 4.30s (-24%)
- R4 deep_composition_poly_evals: 5.62s -> 4.42s (-21%)
- R4 queries & openings: 1.58s -> 1.14s (-28%)
- R4 interpolate+evaluate_fft: 2.87s -> 2.70s (-6%)
- CV 0.4% (2xCV=0.8%, 3.9% improvement clears it easily)
- Verification against baseline binary: PASS
- All math batch_inverse + stark 121 lib tests pass
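A minimal sketch of the chunking trade-off, under stated assumptions: plain `u64` values reduced modulo the Goldilocks prime 2^64 - 2^32 + 1 with `u128` reduction (not Montgomery form), Fermat inversion, and `std::thread::scope` in place of rayon's `par_chunks_mut`. `ZeroError` and all function names are hypothetical.

```rust
use std::thread;

const P: u64 = 0xFFFF_FFFF_0000_0001; // 2^64 - 2^32 + 1

#[derive(Debug, PartialEq)]
struct ZeroError;

fn mul(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % P as u128) as u64
}

/// Fermat inversion: a^(P-2) mod P. Returns None for zero.
fn inv(a: u64) -> Option<u64> {
    if a == 0 {
        return None;
    }
    let (mut base, mut exp, mut acc) = (a, P - 2, 1u64);
    while exp > 0 {
        if exp & 1 == 1 {
            acc = mul(acc, base);
        }
        base = mul(base, base);
        exp >>= 1;
    }
    Some(acc)
}

/// Sequential batch inverse: one inversion plus about 3n multiplications.
/// Nothing is written back unless the single inversion succeeds.
fn batch_inverse(xs: &mut [u64]) -> Result<(), ZeroError> {
    // prefix[i] = xs[0] * ... * xs[i-1]
    let mut prefix = Vec::with_capacity(xs.len());
    let mut acc = 1u64;
    for &x in xs.iter() {
        prefix.push(acc);
        acc = mul(acc, x);
    }
    let mut inv_acc = inv(acc).ok_or(ZeroError)?;
    for i in (0..xs.len()).rev() {
        let x = xs[i];
        xs[i] = mul(inv_acc, prefix[i]); // (x_0..x_i)^-1 * (x_0..x_{i-1})
        inv_acc = mul(inv_acc, x);
    }
    Ok(())
}

/// Chunked variant: about k independent batches, so k inversions instead of 1.
fn batch_inverse_chunked(xs: &mut [u64], k: usize) -> Result<(), ZeroError> {
    if xs.is_empty() {
        return Ok(());
    }
    let chunk_len = xs.len().div_ceil(k.max(1));
    thread::scope(|s| {
        let handles: Vec<_> = xs
            .chunks_mut(chunk_len)
            .map(|chunk| s.spawn(move || batch_inverse(chunk)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .collect()
    })
}
```

In this sketch a zero in one chunk fails only that chunk: calling `batch_inverse_chunked` on `[1, 2, 3, 0]` with `k = 2` returns `Err(ZeroError)` while the first chunk has already been overwritten with inverses, which is the partial-mutation behaviour the reviews point out.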

claude · 2026-04-24T14:29:59Z

+        const PARALLEL_BITREV_THRESHOLD: usize = 1 << 14;
+        if n >= PARALLEL_BITREV_THRESHOLD {


Medium — Missing power-of-two guard before unsafe parallel swap

The SAFETY argument for core::ptr::swap relies on reverse_index being a bijection on [0, n), which only holds when n is a power of two. The sequential path just produces wrong output if the contract is broken; the parallel path invokes undefined behaviour (a data race) because two threads could swap the same element concurrently.

A debug_assert! catches violations in debug builds at zero release cost:

Suggested change

const PARALLEL_BITREV_THRESHOLD: usize = 1 << 14;

if n >= PARALLEL_BITREV_THRESHOLD {

const PARALLEL_BITREV_THRESHOLD: usize = 1 << 14;

debug_assert!(n.is_power_of_two(), "in_place_bit_reverse_permute requires a power-of-two length");

if n >= PARALLEL_BITREV_THRESHOLD {

claude · 2026-04-24T14:30:08Z

+    pub fn evaluate_fft_bit_reversed<F: IsFFTField + IsSubFieldOf<E>>(
+        poly: &Polynomial<FieldElement<E>>,
+        blowup_factor: usize,
+        domain_size: Option<usize>,
+    ) -> Result<Vec<FieldElement<E>>, FFTError>
+    where
+        E: Send + Sync,
+    {
+        let domain_size = domain_size.unwrap_or(0);
+        let len = core::cmp::max(poly.coeff_len(), domain_size).next_power_of_two() * blowup_factor;
+        if len.trailing_zeros() as u64 > F::TWO_ADICITY {
+            return Err(FFTError::DomainSizeError(len.trailing_zeros() as usize));
+        }
+        if poly.coefficients().is_empty() {
+            return Ok(vec![FieldElement::zero(); len]);
+        }
+
+        let mut coeffs = poly.coefficients().to_vec();
+        coeffs.resize(len, FieldElement::zero());
+
+        let order = len.trailing_zeros() as u64;
+        let layer_twiddles =
+            LayerTwiddles::<F>::new(order).ok_or(FFTError::DomainSizeError(order as usize))?;
+        dispatch_fft(&mut coeffs, &layer_twiddles)?;
+        Ok(coeffs)
+    }


Low — No test for the new public API

evaluate_fft_bit_reversed is used in a correctness-critical path (FRI commit phase) but has no unit test. The invariant to verify is simple: the output should equal evaluate_fft with in_place_bit_reverse_permute applied.

    #[test]
    fn evaluate_fft_bit_reversed_matches_evaluate_fft_permuted() {
        use crate::fft::cpu::bit_reversing::in_place_bit_reverse_permute;

        let coeffs: Vec<FE> = (0u64..8).map(FE::from).collect();
        let poly = Polynomial::new(&coeffs);

        let mut expected = Polynomial::evaluate_fft::<F>(&poly, 2, None).unwrap();
        in_place_bit_reverse_permute(&mut expected);

        let got = Polynomial::evaluate_fft_bit_reversed::<F>(&poly, 2, None).unwrap();
        assert_eq!(got, expected);
    }

github-actions · 2026-04-24T14:30:16Z

Codex Code Review

Medium: crypto/math/src/field/element.rs changes inplace_batch_inverse from all-or-nothing to partially mutating on error. The new par_chunks_mut(...).try_for_each(...) path can successfully invert and write back some chunks before another chunk hits a zero and returns Err. Previously the single-slice algorithm failed before any write-back. That makes failure handling nondeterministic and can leave the caller’s buffer corrupted on Err. Fix by validating the full slice first, or by computing chunk results into temporary storage and committing only after all chunks succeed.
Low: crypto/math/src/fft/cpu/bit_reversing.rs and crypto/math/src/field/element.rs add unconditional Send/Sync bounds to APIs that used to work in sequential builds. Even with parallel disabled, callers now need thread-safe element types for purely local operations. That is a source-compatibility regression with no runtime benefit on the non-parallel path. Keep those bounds on parallel-only helpers instead of the public function signatures.
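One way to read the "temporary storage" suggestion is a generic staging wrapper. The sketch below is an assumption-laden illustration, not code from the PR: `commit_on_success` is a hypothetical helper, and it pays for a full copy of the slice, which may or may not be acceptable on the prover's hot paths.

```rust
/// Runs `op` on a scratch copy of `xs` and writes the result back only if
/// `op` succeeds, so callers never observe a partially mutated buffer.
fn commit_on_success<T: Clone, E>(
    xs: &mut [T],
    op: impl FnOnce(&mut [T]) -> Result<(), E>,
) -> Result<(), E> {
    let mut staged = xs.to_vec();
    op(&mut staged)?; // on error, `xs` has not been touched
    xs.clone_from_slice(&staged);
    Ok(())
}
```

A cheaper alternative mentioned in the review is to scan the slice for zeros before starting any chunk, which avoids the extra allocation at the cost of one additional read pass.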

No other concrete security, correctness, or significant performance issues stood out in the reviewed diff.

I couldn’t run targeted cargo test here because the sandbox blocks the toolchain/dependency writes and network access cargo attempted.

claude · 2026-04-24T14:30:20Z

+    pub fn evaluate_fft_bit_reversed<F: IsFFTField + IsSubFieldOf<E>>(
+        poly: &Polynomial<FieldElement<E>>,
+        blowup_factor: usize,
+        domain_size: Option<usize>,
+    ) -> Result<Vec<FieldElement<E>>, FFTError>
+    where
+        E: Send + Sync,
+    {
+        let domain_size = domain_size.unwrap_or(0);
+        let len = core::cmp::max(poly.coeff_len(), domain_size).next_power_of_two() * blowup_factor;
+        if len.trailing_zeros() as u64 > F::TWO_ADICITY {
+            return Err(FFTError::DomainSizeError(len.trailing_zeros() as usize));
+        }
+        if poly.coefficients().is_empty() {
+            return Ok(vec![FieldElement::zero(); len]);
+        }
+
+        let mut coeffs = poly.coefficients().to_vec();
+        coeffs.resize(len, FieldElement::zero());
+
+        let order = len.trailing_zeros() as u64;
+        let layer_twiddles =
+            LayerTwiddles::<F>::new(order).ok_or(FFTError::DomainSizeError(order as usize))?;
+        dispatch_fft(&mut coeffs, &layer_twiddles)?;
+        Ok(coeffs)
+    }


Low — Duplicate setup code with evaluate_fft

evaluate_fft_bit_reversed is identical to evaluate_fft except it skips the final in_place_bit_reverse_permute. Both functions share ~20 lines of setup (domain-size computation, empty-poly check, zero-padding, twiddle construction). If either is changed in the future the other will silently diverge.

Consider a private helper:

    fn evaluate_fft_raw<F: IsFFTField + IsSubFieldOf<E>, E: IsField + Send + Sync>(
        poly: &Polynomial<FieldElement<E>>,
        blowup_factor: usize,
        domain_size: Option<usize>,
    ) -> Result<Vec<FieldElement<E>>, FFTError> {
        // shared setup + dispatch_fft, no permutation
    }

Then evaluate_fft calls the helper and permutes, while evaluate_fft_bit_reversed just calls the helper.
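The shape of that refactor can be sketched on a toy field. Everything below is an assumption made for illustration: arithmetic is modulo 17, `fft_raw` is a textbook radix-2 decimation-in-frequency transform rather than the crate's `dispatch_fft`, and the function names are hypothetical.

```rust
const P: u64 = 17;

/// Reverses the low log2(n) bits of `i`. Assumes `n` is a power of two.
fn reverse_index(i: usize, n: usize) -> usize {
    if n <= 1 {
        i
    } else {
        i.reverse_bits() >> (usize::BITS - n.trailing_zeros())
    }
}

fn bit_reverse_permute<T>(data: &mut [T]) {
    let n = data.len();
    for i in 0..n {
        let j = reverse_index(i, n);
        if i < j {
            data.swap(i, j);
        }
    }
}

/// Shared core: natural-order coefficients in, bit-reversed evaluations out.
/// `omega` must be a primitive n-th root of unity mod P, n a power of two.
fn fft_raw(coeffs: &[u64], omega: u64) -> Vec<u64> {
    let mut a = coeffs.to_vec();
    let n = a.len();
    let (mut len, mut root) = (n, omega);
    while len >= 2 {
        let half = len / 2;
        for start in (0..n).step_by(len) {
            let mut w = 1;
            for j in 0..half {
                let (u, v) = (a[start + j], a[start + j + half]);
                a[start + j] = (u + v) % P;
                a[start + j + half] = (u + P - v) % P * w % P;
                w = w * root % P;
            }
        }
        root = root * root % P;
        len = half;
    }
    a
}

/// Bit-reversed variant: just the shared core.
fn evaluate_bit_reversed(coeffs: &[u64], omega: u64) -> Vec<u64> {
    fft_raw(coeffs, omega)
}

/// Natural-order variant: shared core plus one permute.
fn evaluate_natural(coeffs: &[u64], omega: u64) -> Vec<u64> {
    let mut evals = fft_raw(coeffs, omega);
    bit_reverse_permute(&mut evals);
    evals
}

/// Reference: Horner evaluation at omega^0, omega^1, ...
fn evaluate_naive(coeffs: &[u64], omega: u64) -> Vec<u64> {
    let mut x = 1;
    let mut out = Vec::with_capacity(coeffs.len());
    for _ in 0..coeffs.len() {
        out.push(coeffs.iter().rev().fold(0, |acc, &c| (acc * x + c) % P));
        x = x * omega % P;
    }
    out
}
```

With the setup living in one place, the two public entry points differ only by the final permute, which is also the property the suggested unit test checks.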

claude · 2026-04-24T14:30:37Z

Review: Misc optimizations (PR #545)

Summary

The PR introduces four targeted performance improvements: parallel bit-reverse permutation, parallel batch inverse chunking, sequential-read + single bit-reverse permute for Merkle leaf hashing, and parallel trace chunk generation. The overall approach is sound and the correctness reasoning is solid. A few issues are worth addressing before merge.

Issues

Medium — Missing power-of-two guard before unsafe parallel swap (bit_reversing.rs)

The unsafe parallel swap relies on reverse_index being a bijection, which only holds when n is a power of two. Violating the (documented) contract in the sequential path gives wrong output; in the parallel path it causes undefined behaviour (a data race). A debug_assert!(n.is_power_of_two()) at the top of the parallel block catches this in debug builds at zero release cost. See inline comment.

Low — No test for evaluate_fft_bit_reversed (polynomial.rs)

This new public API sits in a correctness-critical path (FRI commit phase) but ships without a unit test. The property to verify is simple: output should equal evaluate_fft with in_place_bit_reverse_permute applied. See inline comment for a suggested test.

Low — Duplicate setup code between evaluate_fft and evaluate_fft_bit_reversed (polynomial.rs)

Both functions share ~20 lines of identical setup (domain-size computation, empty-poly check, zero-padding, twiddle construction). Extracting a private evaluate_fft_raw helper would prevent silent divergence if either is changed. See inline comment.

No issues found in

inplace_batch_inverse parallel chunking — correctness reasoning (K-1 extra inversions, each chunk independent) is sound and the threshold is reasonable.
commit_columns_bit_reversed rewrite — the equivalence of sequential hashing + bit-reverse permute over digests vs. scattered reads is correct (leaves[br(k)] = hash(col[k]) ⟺ leaves[j] = hash(col[br(j)])).
chunk_and_generate parallelisation — par_chunks().collect() preserves chunk order; &generate borrow is correct for the Sync bound.

nicole-graus · 2026-04-27T19:00:43Z

/bench k=1

gabrielbosio force-pushed the perf/parallel-bitrev-batch-inverse-merkle branch from 18e9c67 to 33e9573 Compare April 23, 2026 18:09

claude Bot reviewed Apr 23, 2026

gabrielbosio closed this Apr 23, 2026

gabrielbosio reopened this Apr 23, 2026

gabrielbosio force-pushed the perf/parallel-bitrev-batch-inverse-merkle branch from 33e9573 to 365d044 Compare April 23, 2026 19:18

gabrielbosio added 6 commits April 23, 2026 16:19

import in_place_bit_reverse_permute in prover

c1dd556

gabrielbosio force-pushed the perf/parallel-bitrev-batch-inverse-merkle branch from 365d044 to c1dd556 Compare April 23, 2026 19:19

gabrielbosio closed this Apr 23, 2026

gabrielbosio reopened this Apr 24, 2026

gabrielbosio mentioned this pull request Apr 24, 2026

Merkle cache reads and skip R4 permute #547

Closed

gabrielbosio marked this pull request as ready for review April 24, 2026 14:27

claude Bot reviewed Apr 24, 2026

This was referenced Apr 24, 2026

perf: Sequential reads in commit_columns_bit_reversed #560

Draft

perf: Parallelize chunk_and_generate with par_chunks #563

Draft

nicole-graus mentioned this pull request Apr 27, 2026

perf: skip redundant bit-reverse pair in R4 deep-composition LDE #566

Merged

Merge branch 'main' into perf/parallel-bitrev-batch-inverse-merkle

3de2694
