Skip to content

Merkle cache reads and skip R4 permute#547

Closed
gabrielbosio wants to merge 10 commits into
mainfrom
perf/merkle-cache-reads-and-skip-r4-permute
Closed

Merkle cache reads and skip R4 permute#547
gabrielbosio wants to merge 10 commits into
mainfrom
perf/merkle-cache-reads-and-skip-r4-permute

Conversation

@gabrielbosio
Copy link
Copy Markdown
Collaborator

No description provided.

Read columns at natural index k inside the parallel hashing loop, then apply
in_place_bit_reverse_permute to the Commitment vector before building the
Merkle tree. Same leaves as reading at br(row_idx) inside the loop; replaces
scattered column reads (~2GB volume on MEMW_R) with sequential reads plus a
64MB in-place bit-reverse pass.
round_4 called evaluate_fft (which internally permutes the FFT output to
natural order) and then in_place_bit_reverse_permute on the result to
flip it back. Both permutes cancel. FRI commit_phase_from_evaluations
pairs evals as chunks_exact(2) expecting {f(x), f(-x)} adjacency, which
is exactly the bit-reversed output of the Bowers forward FFT.

Added Polynomial::evaluate_fft_bit_reversed that skips the final permute,
and called it from round_4. Result: two ~24ms permutes (at 2N=4M per
table) eliminated per prove.

fib_iterative_2M on Linux x86_64, 12 cores, 5 samples:
- prove wall-clock: 75.4s -> 74.4s median (-1.3%), 75.5s -> 74.3s mean (-1.6%)
- R4 interpolate+evaluate_fft sub-phase: 2.73s -> 1.95s (-29%)
- CV 0.6% (2xCV=1.2% threshold, 1.3% improvement clears it)
- Verification against baseline binary: PASS
@gabrielbosio
Copy link
Copy Markdown
Collaborator Author

/bench k=4

@github-actions
Copy link
Copy Markdown

Codex Code Review

No findings in the PR diff.

The bit-reversed FFT path appears consistent with the existing consumers, and the commit_columns_bit_reversed rewrite preserves the same leaf ordering while improving locality. Residual risk: I could not run targeted Rust tests in this environment because cargo/rustup failed trying to write under /home/runner/.rustup on a read-only filesystem, so this review is static only.

Comment thread crypto/math/src/fft/polynomial.rs Outdated
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 23, 2026

Benchmark — fib_iterative_8M (median of 3)

Table parallelism: 32 (auto = cores / 3)

Metric main PR Δ
Peak heap 64562 MB 66101 MB +1539 MB (+2.4%) ⚪
Prove time 31.902s 31.530s -0.372s (-1.2%) ⚪

✅ No significant change.

⚠️ Baseline heap spread: 5.4% (66868 MB / 64562 MB / 63409 MB) — comparison may be less reliable

Commit: 8c4b9c7 · Baseline: built from main · Runner: self-hosted bench

@gabrielbosio
Copy link
Copy Markdown
Collaborator Author

/bench

@yetanotherco yetanotherco deleted a comment from claude Bot Apr 23, 2026
@gabrielbosio
Copy link
Copy Markdown
Collaborator Author

/claude

Comment thread crypto/math/src/fft/polynomial.rs Outdated
@yetanotherco yetanotherco deleted a comment from claude Bot Apr 23, 2026
@gabrielbosio
Copy link
Copy Markdown
Collaborator Author

/claude

Comment thread crypto/math/src/fft/polynomial.rs
Comment thread crypto/math/src/fft/polynomial.rs Outdated
@yetanotherco yetanotherco deleted a comment from claude Bot Apr 23, 2026
@gabrielbosio
Copy link
Copy Markdown
Collaborator Author

/claude

Comment thread crypto/math/src/fft/polynomial.rs
Comment thread crypto/stark/src/prover.rs
@yetanotherco yetanotherco deleted a comment from claude Bot Apr 23, 2026
@gabrielbosio
Copy link
Copy Markdown
Collaborator Author

/claude

@claude
Copy link
Copy Markdown

claude Bot commented Apr 23, 2026

Code Review

Summary: Two targeted performance optimizations in the STARK prover — both are correct, well-tested, and the equivalence proofs are straightforward.


Change 1: Cache-friendly Merkle commitment (commit_columns_bit_reversed)

Before: hash columns[col][br(k)] for each row k → scattered reads across column arrays, one cache miss per row.
After: hash columns[col][k] sequentially → full cache-line utilization, then a single in_place_bit_reverse_permute pass over 32-byte digests.

The equivalence holds because bit-reversal is its own inverse:

  • old: leaf[k] = hash(col[br(k)])
  • new (before permute): leaf[k] = hash(col[k]); after permute: leaf[br(k)] = hash(col[k]) ⟹ with j=br(k): leaf[j] = hash(col[br(j)])

The test commit_columns_bit_reversed_matches_naive_reference confirms this. Works correctly under the parallel feature too since collect() preserves order and the permute runs sequentially after.


Change 2: Skip redundant permutations in deep-poly LDE (R4)

The old pattern was:

  1. evaluate_fft → internally permutes to natural order
  2. in_place_bit_reverse_permute → permutes back to bit-reversed

That's two full O(n) passes that cancel each other out. The new evaluate_fft_bit_reversed simply omits the internal natural-order permute, and evaluate_fft_cpu_inner correctly gates it with permute_after. ✓


Minor observations (no blocking issues)

Validation order in evaluate_fft_bit_reversed: The trailing_zeros > TWO_ADICITY check fires before !is_power_of_two(). In theory, a non-power-of-two blowup could return DomainSizeError instead of InputError if the polynomial is large enough to exhaust adicity first. In practice (Goldilocks TWO_ADICITY=32), this requires a polynomial with >2^32 coefficients — completely unreachable. No action needed.

Empty-polynomial test gap: evaluate_fft_bit_reversed_matches_evaluate_fft_then_permute doesn't cover the empty-polynomial fast path. The result is trivially correct (all-zeros is invariant under any permutation), but a case Polynomial::<FE>::new(&[])... blowup=4 wouldn't hurt.

is_power_of_two in evaluate_fft_cpu_inner: The check is now guaranteed to pass when called from evaluate_fft_bit_reversed (outer function pre-validates), but it's still needed for the evaluate_fft_cpu call path. Correct as-is.


Overall: LGTM. Logic is sound, the cache optimization is real for large domain sizes, and the test coverage is solid.

@gabrielbosio
Copy link
Copy Markdown
Collaborator Author

/bench

@gabrielbosio
Copy link
Copy Markdown
Collaborator Author

/bench

1 similar comment
@gabrielbosio
Copy link
Copy Markdown
Collaborator Author

/bench

@gabrielbosio gabrielbosio marked this pull request as ready for review April 23, 2026 23:42
})
.collect();

in_place_bit_reverse_permute(&mut hashed_leaves);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low – Sequential permute after parallel hashing

When the parallel feature is enabled, the iter.map(...).collect() runs on a Rayon thread pool, but in_place_bit_reverse_permute is always sequential. For large num_rows this single-threaded pass could become a meaningful bottleneck relative to the parallelized hash work above it.

The bit-reverse permutation is a small, branch-heavy scatter/gather that parallelises poorly at typical domain sizes, so this is only worth addressing if profiling shows it as a hot spot. Just noting it so the trade-off is explicit.

E: Send + Sync,
{
let domain_size = domain_size.unwrap_or(0);
let len = core::cmp::max(poly.coeff_len(), domain_size).next_power_of_two() * blowup_factor;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low – blowup_factor = 0 produces a confusing error

let len = core::cmp::max(poly.coeff_len(), domain_size).next_power_of_two() * blowup_factor;

When blowup_factor = 0, len = 0. On a 64-bit target 0usize.trailing_zeros() returns 64, so the first guard fires and returns DomainSizeError(64) — a misleading message for a caller that passed a zero blowup.

evaluate_fft has the same behaviour, so this isn't a regression, but since evaluate_fft_bit_reversed is a new public API it's a good place to add an early guard:

Suggested change
let len = core::cmp::max(poly.coeff_len(), domain_size).next_power_of_two() * blowup_factor;
if blowup_factor == 0 {
return Err(FFTError::InputError(0));
}
let len = core::cmp::max(poly.coeff_len(), domain_size).next_power_of_two() * blowup_factor;

@claude
Copy link
Copy Markdown

claude Bot commented Apr 23, 2026

Review: Merkle cache reads and skip R4 permute

Summary: Two related optimisations — (1) sequential column reads + post-hoc bit-reverse permutation of the 32-byte digest vector in commit_columns_bit_reversed, and (2) a new evaluate_fft_bit_reversed that skips the internal natural-order permute, used to eliminate the redundant permute pair in the deep-poly LDE path.

Correctness ✅

Both optimisations are mathematically sound:

  • commit_columns_bit_reversed: Reading at sequential index k and then applying in_place_bit_reverse_permute to the hash vector is equivalent to reading at br(k) directly — in_place_bit_reverse_permute sets buf[i] = buf_old[br(i)], which matches the old loop body exactly.
  • evaluate_fft_bit_reversed: The Cooley-Tukey DIF implementation (dispatch_fft) already emits results in bit-reversed order; the existing in_place_bit_reverse_permute in evaluate_fft_cpu was converting to natural order. Skipping it correctly preserves the native bit-reversed layout that fold_evaluations_in_place expects (consecutive pairs (2j, 2j+1) are conjugates).
  • Both are directly covered by the new regression tests.

Issues

See inline comments for details. In brief:

Severity Location Issue
Low polynomial.rs:100 blowup_factor = 0 silently maps to DomainSizeError(64) rather than an informative error — an early guard would clarify the API
Low prover.rs:408 in_place_bit_reverse_permute is sequential even when parallel feature is active; fine at current domain sizes but worth tracking

Other observations

  • The permute_after: bool flag in the private evaluate_fft_cpu_inner is a reasonable choice for a private helper with exactly two call-sites. No duplication concern.
  • The docstring accurately captures the stricter validation difference vs evaluate_fft (the is_power_of_two check before the empty-polynomial fast path).
  • Test coverage is solid: correctness of the new function against evaluate_fft + permute, rejection of non-power-of-two blowup, and a reference-vs-optimised Merkle root comparison.

No critical or high severity issues. The optimisation is clean and well-tested. The two low-severity notes are optional polish.

@gabrielbosio
Copy link
Copy Markdown
Collaborator Author

Reopening #545 which is a superset of this PR.

@diegokingston diegokingston deleted the perf/merkle-cache-reads-and-skip-r4-permute branch May 20, 2026 12:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant