Merkle cache reads and skip R4 permute #547
Read columns at natural index k inside the parallel hashing loop, then apply in_place_bit_reverse_permute to the Commitment vector before building the Merkle tree. Same leaves as reading at br(row_idx) inside the loop; replaces scattered column reads (~2GB volume on MEMW_R) with sequential reads plus a 64MB in-place bit-reverse pass.
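The equivalence claimed above can be checked on a toy model. This is a minimal sketch, not the PR's code: `toy_hash`, `leaves_scattered`, `leaves_sequential`, and the local `bit_reverse_permute` are hypothetical stand-ins for the real hasher, the old and new loops, and in_place_bit_reverse_permute.

```rust
/// Reverse the low `bits` bits of `i`.
fn bit_reverse(i: usize, bits: u32) -> usize {
    if bits == 0 {
        return 0; // avoid shifting by the full word width
    }
    i.reverse_bits() >> (usize::BITS - bits)
}

/// Hypothetical stand-in for in_place_bit_reverse_permute:
/// afterwards, v[k] holds the element that was at index br(k).
fn bit_reverse_permute<T>(v: &mut [T]) {
    let n = v.len();
    if n <= 1 {
        return;
    }
    assert!(n.is_power_of_two());
    let bits = n.trailing_zeros();
    for i in 0..n {
        let j = bit_reverse(i, bits);
        if i < j {
            v.swap(i, j); // each pair is swapped exactly once
        }
    }
}

/// Toy digest, standing in for the real leaf hash (assumption).
fn toy_hash(row: u64) -> u64 {
    row.wrapping_mul(0x9E37_79B9_7F4A_7C15) ^ 0xA5A5
}

/// Old shape: leaf k hashes the row at br(k) (scattered reads).
fn leaves_scattered(column: &[u64]) -> Vec<u64> {
    let bits = column.len().trailing_zeros();
    (0..column.len())
        .map(|k| toy_hash(column[bit_reverse(k, bits)]))
        .collect()
}

/// New shape: hash rows in natural order, then permute the digests.
fn leaves_sequential(column: &[u64]) -> Vec<u64> {
    let mut leaves: Vec<u64> = column.iter().map(|&r| toy_hash(r)).collect();
    bit_reverse_permute(&mut leaves);
    leaves
}
```

Because bit reversal is an involution, permuting the digests after hashing gives leaf k = hash(row br(k)), which is what the scattered loop computed directly.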
round_4 called evaluate_fft (which internally permutes the FFT output to
natural order) and then in_place_bit_reverse_permute on the result to
flip it back. Both permutes cancel. FRI commit_phase_from_evaluations
pairs evals as chunks_exact(2) expecting {f(x), f(-x)} adjacency, which
is exactly the bit-reversed output of the Bowers forward FFT.
Added Polynomial::evaluate_fft_bit_reversed that skips the final permute,
and called it from round_4. Result: two ~24ms permutes (at 2N=4M per
table) eliminated per prove.
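A small index-only sketch of why bit-reversed order gives the {f(x), f(-x)} adjacency, assuming the evaluation domain is generated by an element w of order n, so that w^(m + n/2) = -w^m (a coset offset does not change this). The helper names are hypothetical and only model indices, not field elements.

```rust
/// Reverse the low `bits` bits of `i`.
fn bit_reverse(i: usize, bits: u32) -> usize {
    if bits == 0 {
        return 0;
    }
    i.reverse_bits() >> (usize::BITS - bits)
}

/// Natural-order indices in the order they appear in a bit-reversed vector.
fn bit_reversed_indices(log_n: u32) -> Vec<usize> {
    let n = 1usize << log_n;
    (0..n).map(|k| bit_reverse(k, log_n)).collect()
}

/// True if every chunk of two holds natural indices m and m + n/2,
/// i.e. evaluations at w^m and w^(m + n/2) = -w^m.
fn pairs_are_negations(log_n: u32) -> bool {
    let n = 1usize << log_n;
    bit_reversed_indices(log_n)
        .chunks_exact(2)
        .all(|p| p[1] == p[0] + n / 2)
}

/// Applying the bit-reverse map twice returns every index to itself,
/// which is why the two permutes in round_4 cancelled.
fn double_permute_is_identity(log_n: u32) -> bool {
    let n = 1usize << log_n;
    (0..n).all(|k| bit_reverse(bit_reverse(k, log_n), log_n) == k)
}
```

For n = 8 the bit-reversed index order is 0, 4, 2, 6, 1, 5, 3, 7, so chunks_exact(2) sees the pairs (0, 4), (2, 6), (1, 5), (3, 7).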
fib_iterative_2M on Linux x86_64, 12 cores, 5 samples:
- prove wall-clock: 75.4s -> 74.4s median (-1.3%), 75.5s -> 74.3s mean (-1.6%)
- R4 interpolate+evaluate_fft sub-phase: 2.73s -> 1.95s (-29%)
- CV 0.6% (2xCV=1.2% threshold, 1.3% improvement clears it)
- Verification against baseline binary: PASS
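The percentages above can be reproduced with a few lines. This is only a sketch of the arithmetic; `pct_change` and `clears_noise` are hypothetical helper names, and the 2xCV rule is taken from the line above rather than from any benchmark tooling.

```rust
/// Relative change from `before` to `after`, in percent (negative = faster).
fn pct_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// Noise gate as stated above: the improvement must exceed twice the CV.
fn clears_noise(improvement_pct: f64, cv_pct: f64) -> bool {
    improvement_pct > 2.0 * cv_pct
}
```

With the reported numbers, pct_change(75.4, 74.4) is about -1.33, pct_change(2.73, 1.95) is about -28.6, and clears_noise(1.3, 0.6) holds because 1.3 > 1.2.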
/bench k=4
Codex Code Review
No findings in the PR diff. The bit-reversed FFT path appears consistent with the existing consumers, and the
Benchmark: fib_iterative_8M (median of 3)
Table parallelism: 32 (auto = cores / 3)
Commit: 8c4b9c7 · Baseline: built from main · Runner: self-hosted bench
/bench
/claude
Code Review
Summary: Two targeted performance optimizations in the STARK prover. Both are correct, well-tested, and the equivalence proofs are straightforward.
Change 1: Cache-friendly Merkle commitment (
/bench
        })
        .collect();

    in_place_bit_reverse_permute(&mut hashed_leaves);
Low – Sequential permute after parallel hashing
When the parallel feature is enabled, the iter.map(...).collect() runs on a Rayon thread pool, but in_place_bit_reverse_permute is always sequential. For large num_rows this single-threaded pass could become a meaningful bottleneck relative to the parallelized hash work above it.
The bit-reverse permutation is a small, branch-heavy scatter/gather that parallelises poorly at typical domain sizes, so this is only worth addressing if profiling shows it as a hot spot. Just noting it so the trade-off is explicit.
        E: Send + Sync,
    {
        let domain_size = domain_size.unwrap_or(0);
        let len = core::cmp::max(poly.coeff_len(), domain_size).next_power_of_two() * blowup_factor;
Low – blowup_factor = 0 produces a confusing error
    let len = core::cmp::max(poly.coeff_len(), domain_size).next_power_of_two() * blowup_factor;

When blowup_factor = 0, len = 0. On a 64-bit target 0usize.trailing_zeros() returns 64, so the first guard fires and returns DomainSizeError(64), a misleading message for a caller that passed a zero blowup.
evaluate_fft has the same behaviour, so this isn't a regression, but since evaluate_fft_bit_reversed is a new public API it's a good place to add an early guard:
    - let len = core::cmp::max(poly.coeff_len(), domain_size).next_power_of_two() * blowup_factor;
    + if blowup_factor == 0 {
    +     return Err(FFTError::InputError(0));
    + }
    + let len = core::cmp::max(poly.coeff_len(), domain_size).next_power_of_two() * blowup_factor;
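To make the zero-blowup case concrete, here is a self-contained sketch of the early guard. `SketchError` and `checked_domain_len` are hypothetical names for illustration; they are not the crate's FFTError or its real signature, and the real function takes a polynomial rather than a raw coefficient count.

```rust
/// Hypothetical error type for this sketch only (not the crate's FFTError).
#[derive(Debug, PartialEq)]
enum SketchError {
    ZeroBlowup,
}

/// Computes the evaluation length with the same formula as the review
/// snippet, but rejects blowup_factor == 0 up front.
fn checked_domain_len(
    coeff_len: usize,
    domain_size: usize,
    blowup_factor: usize,
) -> Result<usize, SketchError> {
    if blowup_factor == 0 {
        // Without this guard len would be 0, and 0usize.trailing_zeros()
        // equals usize::BITS (64 on a 64-bit target), which is where the
        // confusing "64" in the downstream error comes from.
        return Err(SketchError::ZeroBlowup);
    }
    let len = core::cmp::max(coeff_len, domain_size).next_power_of_two() * blowup_factor;
    Ok(len)
}
```

For example, checked_domain_len(5, 0, 4) rounds 5 up to 8 and returns Ok(32), while a zero blowup fails immediately with an error that names the actual problem.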
Review: Merkle cache reads and skip R4 permute
Summary: Two related optimisations: (1) sequential column reads + post-hoc bit-reverse permutation of the 32-byte digest vector in
Correctness ✅
Both optimisations are mathematically sound:
Issues
See inline comments for details. In brief:
Other observations
No critical or high severity issues. The optimisation is clean and well-tested. The two low-severity notes are optional polish.
Reopening #545, which is a superset of this PR.