Merkle cache reads and skip R4 permute by gabrielbosio · Pull Request #547 · yetanotherco/lambda_vm

gabrielbosio · 2026-04-23T19:21:39Z

No description provided.

Read columns at natural index k inside the parallel hashing loop, then apply in_place_bit_reverse_permute to the Commitment vector before building the Merkle tree. Same leaves as reading at br(row_idx) inside the loop; replaces scattered column reads (~2GB volume on MEMW_R) with sequential reads plus a 64MB in-place bit-reverse pass.

round_4 called evaluate_fft (which internally permutes the FFT output to natural order) and then in_place_bit_reverse_permute on the result to flip it back. Both permutes cancel. FRI commit_phase_from_evaluations pairs evals as chunks_exact(2) expecting {f(x), f(-x)} adjacency, which is exactly the bit-reversed output of the Bowers forward FFT. Added Polynomial::evaluate_fft_bit_reversed that skips the final permute, and called it from round_4. Result: two ~24ms permutes (at 2N=4M per table) eliminated per prove. fib_iterative_2M on Linux x86_64, 12 cores, 5 samples: - prove wall-clock: 75.4s -> 74.4s median (-1.3%), 75.5s -> 74.3s mean (-1.6%) - R4 interpolate+evaluate_fft sub-phase: 2.73s -> 1.95s (-29%) - CV 0.6% (2xCV=1.2% threshold, 1.3% improvement clears it) - Verification against baseline binary: PASS

gabrielbosio · 2026-04-23T19:21:50Z

/bench k=4

github-actions · 2026-04-23T19:23:26Z

Codex Code Review

No findings in the PR diff.

The bit-reversed FFT path appears consistent with the existing consumers, and the commit_columns_bit_reversed rewrite preserves the same leaf ordering while improving locality. Residual risk: I could not run targeted Rust tests in this environment because cargo/rustup failed trying to write under /home/runner/.rustup on a read-only filesystem, so this review is static only.

github-actions · 2026-04-23T19:27:03Z

Benchmark — fib_iterative_8M (median of 3)

_{Table parallelism: 32 (auto = cores / 3)}

Metric	main	PR	Δ
Peak heap	64562 MB	66101 MB	+1539 MB (+2.4%) ⚪
Prove time	31.902s	31.530s	-0.372s (-1.2%) ⚪

✅ No significant change.

⚠️ Baseline heap spread: 5.4% (66868 MB / 64562 MB / 63409 MB) — comparison may be less reliable

_{Commit: 8c4b9c7 · Baseline: built from main · Runner: self-hosted bench}

gabrielbosio · 2026-04-23T19:28:57Z

/bench

gabrielbosio · 2026-04-23T20:03:36Z

/claude

gabrielbosio · 2026-04-23T20:11:33Z

/claude

gabrielbosio · 2026-04-23T21:27:44Z

/claude

gabrielbosio · 2026-04-23T22:10:35Z

/claude

claude · 2026-04-23T22:16:42Z

Code Review

Summary: Two targeted performance optimizations in the STARK prover — both are correct, well-tested, and the equivalence proofs are straightforward.

Change 1: Cache-friendly Merkle commitment (`commit_columns_bit_reversed`)

Before: hash columns[col][br(k)] for each row k → scattered reads across column arrays, one cache miss per row.
After: hash columns[col][k] sequentially → full cache-line utilization, then a single in_place_bit_reverse_permute pass over 32-byte digests.

The equivalence holds because bit-reversal is its own inverse:

old: leaf[k] = hash(col[br(k)])
new (before permute): leaf[k] = hash(col[k]); after permute: leaf[br(k)] = hash(col[k]) ⟹ with j=br(k): leaf[j] = hash(col[br(j)]) ✓

The test commit_columns_bit_reversed_matches_naive_reference confirms this. Works correctly under the parallel feature too since collect() preserves order and the permute runs sequentially after.

Change 2: Skip redundant permutations in deep-poly LDE (R4)

The old pattern was:

evaluate_fft → internally permutes to natural order
in_place_bit_reverse_permute → permutes back to bit-reversed

That's two full O(n) passes that cancel each other out. The new evaluate_fft_bit_reversed simply omits the internal natural-order permute, and evaluate_fft_cpu_inner correctly gates it with permute_after. ✓

Minor observations (no blocking issues)

Validation order in evaluate_fft_bit_reversed: The trailing_zeros > TWO_ADICITY check fires before !is_power_of_two(). In theory, a non-power-of-two blowup could return DomainSizeError instead of InputError if the polynomial is large enough to exhaust adicity first. In practice (Goldilocks TWO_ADICITY=32), this requires a polynomial with >2^32 coefficients — completely unreachable. No action needed.

Empty-polynomial test gap: evaluate_fft_bit_reversed_matches_evaluate_fft_then_permute doesn't cover the empty-polynomial fast path. The result is trivially correct (all-zeros is invariant under any permutation), but a case Polynomial::<FE>::new(&[])... blowup=4 wouldn't hurt.

is_power_of_two in evaluate_fft_cpu_inner: The check is now guaranteed to pass when called from evaluate_fft_bit_reversed (outer function pre-validates), but it's still needed for the evaluate_fft_cpu call path. Correct as-is.

Overall: LGTM. Logic is sound, the cache optimization is real for large domain sizes, and the test coverage is solid.

gabrielbosio · 2026-04-23T22:27:12Z

/bench

gabrielbosio · 2026-04-23T22:36:21Z

/bench

gabrielbosio · 2026-04-23T22:42:32Z

/bench

claude · 2026-04-23T23:46:23Z

            })
            .collect();

+        in_place_bit_reverse_permute(&mut hashed_leaves);


Low – Sequential permute after parallel hashing

When the parallel feature is enabled, the iter.map(...).collect() runs on a Rayon thread pool, but in_place_bit_reverse_permute is always sequential. For large num_rows this single-threaded pass could become a meaningful bottleneck relative to the parallelized hash work above it.

The bit-reverse permutation is a small, branch-heavy scatter/gather that parallelises poorly at typical domain sizes, so this is only worth addressing if profiling shows it as a hot spot. Just noting it so the trade-off is explicit.

claude · 2026-04-23T23:46:28Z

+        E: Send + Sync,
+    {
+        let domain_size = domain_size.unwrap_or(0);
+        let len = core::cmp::max(poly.coeff_len(), domain_size).next_power_of_two() * blowup_factor;


Low – blowup_factor = 0 produces a confusing error

let len = core::cmp::max(poly.coeff_len(), domain_size).next_power_of_two() * blowup_factor;

When blowup_factor = 0, len = 0. On a 64-bit target 0usize.trailing_zeros() returns 64, so the first guard fires and returns DomainSizeError(64) — a misleading message for a caller that passed a zero blowup.

evaluate_fft has the same behaviour, so this isn't a regression, but since evaluate_fft_bit_reversed is a new public API it's a good place to add an early guard:

Suggested change

let len = core::cmp::max(poly.coeff_len(), domain_size).next_power_of_two() * blowup_factor;

if blowup_factor == 0 {

return Err(FFTError::InputError(0));

}

let len = core::cmp::max(poly.coeff_len(), domain_size).next_power_of_two() * blowup_factor;

claude · 2026-04-23T23:46:46Z

Review: Merkle cache reads and skip R4 permute

Summary: Two related optimisations — (1) sequential column reads + post-hoc bit-reverse permutation of the 32-byte digest vector in commit_columns_bit_reversed, and (2) a new evaluate_fft_bit_reversed that skips the internal natural-order permute, used to eliminate the redundant permute pair in the deep-poly LDE path.

Correctness ✅

Both optimisations are mathematically sound:

commit_columns_bit_reversed: Reading at sequential index k and then applying in_place_bit_reverse_permute to the hash vector is equivalent to reading at br(k) directly — in_place_bit_reverse_permute sets buf[i] = buf_old[br(i)], which matches the old loop body exactly.
evaluate_fft_bit_reversed: The Cooley-Tukey DIF implementation (dispatch_fft) already emits results in bit-reversed order; the existing in_place_bit_reverse_permute in evaluate_fft_cpu was converting to natural order. Skipping it correctly preserves the native bit-reversed layout that fold_evaluations_in_place expects (consecutive pairs (2j, 2j+1) are conjugates).
Both are directly covered by the new regression tests.

Issues

See inline comments for details. In brief:

Severity	Location	Issue
Low	`polynomial.rs:100`	`blowup_factor = 0` silently maps to `DomainSizeError(64)` rather than an informative error — an early guard would clarify the API
Low	`prover.rs:408`	`in_place_bit_reverse_permute` is sequential even when `parallel` feature is active; fine at current domain sizes but worth tracking

Other observations

The permute_after: bool flag in the private evaluate_fft_cpu_inner is a reasonable choice for a private helper with exactly two call-sites. No duplication concern.
The docstring accurately captures the stricter validation difference vs evaluate_fft (the is_power_of_two check before the empty-polynomial fast path).
Test coverage is solid: correctness of the new function against evaluate_fft + permute, rejection of non-power-of-two blowup, and a reference-vs-optimised Merkle root comparison.

No critical or high severity issues. The optimisation is clean and well-tested. The two low-severity notes are optional polish.

gabrielbosio · 2026-04-24T14:26:55Z

Reopening #545 which is a superset of this PR.

gabrielbosio added 3 commits April 23, 2026 16:03

import in_place_bit_reverse_permute in prover

9555385

claude Bot reviewed Apr 23, 2026

View reviewed changes

Comment thread crypto/math/src/fft/polynomial.rs Outdated

gabrielbosio added 2 commits April 23, 2026 16:56

guard non-power-of-two len in evaluate_fft_bit_reversed

74587c6

test evaluate_fft_bit_reversed invariants

b6f798b

yetanotherco deleted a comment from claude Bot Apr 23, 2026

claude Bot reviewed Apr 23, 2026

View reviewed changes

Comment thread crypto/math/src/fft/polynomial.rs Outdated

hoist non-power-of-two guard before empty check

8266846

yetanotherco deleted a comment from claude Bot Apr 23, 2026

claude Bot reviewed Apr 23, 2026

View reviewed changes

Comment thread crypto/math/src/fft/polynomial.rs

claude Bot reviewed Apr 23, 2026

View reviewed changes

Comment thread crypto/math/src/fft/polynomial.rs Outdated

gabrielbosio added 2 commits April 23, 2026 18:23

share FFT implementation via evaluate_fft_cpu_inner

1fe9a22

note stricter validation in evaluate_fft_bit_reversed docs

d7b0dd3

yetanotherco deleted a comment from claude Bot Apr 23, 2026

claude Bot reviewed Apr 23, 2026

View reviewed changes

Comment thread crypto/math/src/fft/polynomial.rs

claude Bot reviewed Apr 23, 2026

View reviewed changes

Comment thread crypto/stark/src/prover.rs

gabrielbosio added 2 commits April 23, 2026 19:09

cover zero-padding case in evaluate_fft_bit_reversed test

d2cb6ea

test commit_columns_bit_reversed matches naive reference

8c4b9c7

yetanotherco deleted a comment from claude Bot Apr 23, 2026

gabrielbosio marked this pull request as ready for review April 23, 2026 23:42

claude Bot reviewed Apr 23, 2026

View reviewed changes

gabrielbosio closed this Apr 24, 2026

diegokingston deleted the perf/merkle-cache-reads-and-skip-r4-permute branch May 20, 2026 12:49

Conversation

gabrielbosio commented Apr 23, 2026

Uh oh!

gabrielbosio commented Apr 23, 2026

Uh oh!

github-actions Bot commented Apr 23, 2026

Codex Code Review

Uh oh!

Uh oh!

github-actions Bot commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark — fib_iterative_8M (median of 3)

Uh oh!

gabrielbosio commented Apr 23, 2026

Uh oh!

gabrielbosio commented Apr 23, 2026

Uh oh!

Uh oh!

gabrielbosio commented Apr 23, 2026

Uh oh!

Uh oh!

Uh oh!

gabrielbosio commented Apr 23, 2026

Uh oh!

Uh oh!

Uh oh!

gabrielbosio commented Apr 23, 2026

Uh oh!

claude Bot commented Apr 23, 2026

Code Review

Change 1: Cache-friendly Merkle commitment (commit_columns_bit_reversed)

Change 2: Skip redundant permutations in deep-poly LDE (R4)

Minor observations (no blocking issues)

Uh oh!

gabrielbosio commented Apr 23, 2026

Uh oh!

gabrielbosio commented Apr 23, 2026

Uh oh!

gabrielbosio commented Apr 23, 2026

Uh oh!

claude Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot commented Apr 23, 2026

Review: Merkle cache reads and skip R4 permute

Correctness ✅

Issues

Other observations

Uh oh!

gabrielbosio commented Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions Bot commented Apr 23, 2026 •

edited

Loading

Change 1: Cache-friendly Merkle commitment (`commit_columns_bit_reversed`)