Misc optimizations#545
Conversation
18e9c67 to
33e9573
Compare
| pub fn inplace_batch_inverse(numbers: &mut [Self]) -> Result<(), FieldError> { | ||
| pub fn inplace_batch_inverse(numbers: &mut [Self]) -> Result<(), FieldError> | ||
| where | ||
| Self: Send + Sync, |
There was a problem hiding this comment.
Medium — Unconditional API-breaking bound
The where Self: Send + Sync constraint is on the public function signature unconditionally, not gated by #[cfg(feature = "parallel")]. This means every caller of inplace_batch_inverse must now satisfy the bound even on no_std targets or crates that never enable the parallel feature, which is a breaking change for any downstream field implementation whose BaseType is not Send + Sync.
Consider splitting:
| Self: Send + Sync, | |
| pub fn inplace_batch_inverse(numbers: &mut [Self]) -> Result<(), FieldError> { |
…and adding the Send + Sync bound only in the #[cfg(feature = "parallel")] branch, e.g. via an internal helper with the tighter bound, keeping the public API signature unchanged.
| } | ||
| } | ||
| unsafe impl<E> Send for SendPtr<E> {} | ||
| unsafe impl<E> Sync for SendPtr<E> {} |
There was a problem hiding this comment.
Low — Sync impl is broader than needed
for_each on a rayon ParallelIterator only requires the closure to be Send. Since ptr is Copy, each closure invocation gets its own copy of the raw pointer, so the closure is Send as long as SendPtr: Send. The Sync impl (&SendPtr<E> safe to share across threads) is never exercised by this code and makes a stronger safety claim than what is actually verified. It should be removed to keep the unsafe surface minimal.
| unsafe impl<E> Sync for SendPtr<E> {} | |
| unsafe impl<E> Send for SendPtr<E> {} |
Review: Misc optimizations (perf/parallel-bitrev-batch-inverse-merkle)SummaryFour orthogonal performance improvements: parallel bit-reverse permutation, parallel batch inverse, cache-friendly Merkle commit (sequential column reads + bit-reverse hashes), and Accidental files — should not be in this PRThese look like artifacts from an automated optimization session:
Code issuesMedium — The Low — The parallel swap pattern only requires Correctness notes (no action needed)
|
Codex Code ReviewFindings:
No high/medium security or correctness issue stood out in the FFT/FRI ordering changes themselves. I couldn’t run |
|
/bench |
Benchmark — fib_iterative_8M (median of 3)Table parallelism: 1
Commit: 3de2694 · Baseline: built from main · Runner: self-hosted bench |
|
/bench k=4 |
|
It only shows a speedup with low table parallelism. Closing. |
33e9573 to
365d044
Compare
Read columns at natural index k inside the parallel hashing loop, then apply in_place_bit_reverse_permute to the Commitment vector before building the Merkle tree. Same leaves as reading at br(row_idx) inside the loop; replaces scattered column reads (~2GB volume on MEMW_R) with sequential reads plus a 64MB in-place bit-reverse pass.
Phase 5 of trace build invokes chunk_and_generate 10 times; each call walked its chunks sequentially. MEMW alone produces ~12 chunks at fib_iterative_2M, so there is substantial per-chunk parallelism available on a free rayon pool (trace build runs before multi_prove). fib_iterative_2M on Linux x86_64, 12 cores, 3 samples: - prove wall-clock: 75.4s -> 74.3s median (-1.5%) - Trace build sub-phase: 4.56s -> 3.96s (-13.2%) - Verification against baseline binary: PASS
round_4 called evaluate_fft (which internally permutes the FFT output to
natural order) and then in_place_bit_reverse_permute on the result to
flip it back. Both permutes cancel. FRI commit_phase_from_evaluations
pairs evals as chunks_exact(2) expecting {f(x), f(-x)} adjacency, which
is exactly the bit-reversed output of the Bowers forward FFT.
Added Polynomial::evaluate_fft_bit_reversed that skips the final permute,
and called it from round_4. Result: two ~24ms permutes (at 2N=4M per
table) eliminated per prove.
fib_iterative_2M on Linux x86_64, 12 cores, 5 samples:
- prove wall-clock: 75.4s -> 74.4s median (-1.3%), 75.5s -> 74.3s mean (-1.6%)
- R4 interpolate+evaluate_fft sub-phase: 2.73s -> 1.95s (-29%)
- CV 0.6% (2xCV=1.2% threshold, 1.3% improvement clears it)
- Verification against baseline binary: PASS
Every FFT call site ends with a sequential O(N) bit-reverse permutation. At N=4M elements this is ~24ms on its own, called dozens of times per prove across all column LDEs, composition-poly parts, and the R4 deep LDE. Bottlenecks the otherwise-parallel FFT pipeline (Amdahl). Swap pairs (i, br(i)) with i < br(i) are disjoint, so parallelization is safe with a Send/Sync raw-pointer wrapper (the `i < br(i)` predicate selects a unique owner per pair, so no two threads ever touch the same slot). Sequential fallback retained for N < 16K. fib_iterative_2M on Linux x86_64, 12 cores, 3 samples: - prove wall-clock: 75.4s -> 73.9s median (-2.0%), 75.5s -> 74.1s mean (-1.9%) - R2 decompose_and_extend_d2: 8.28s -> 7.78s (-6.0%) - R4 interpolate+evaluate_fft: 2.73s -> 2.40s (-12.1%) - CV 0.7% (2xCV=1.4% threshold, 2.0% improvement clears it comfortably) - Verification against baseline binary: PASS - All 121 stark lib tests + math bit_reverse tests pass
Montgomery batch inverse has a serial prefix-product dependency, but chunks are independent: each chunk inverts its own elements without needing values from other chunks. Trade K-1 extra field inversions (~1000 mults each in Goldilocks, negligible next to the ~2N mults per chunk) for K-way parallelism. Threshold at 2^16 elements so short batches (single-FRI-layer twiddles at smaller layers, inv arrays in small tables) keep the sequential path. Above threshold, split into num_threads chunks and invert each independently via par_chunks_mut. This is surprisingly impactful because batch_inverse is called on large inputs throughout the prover — coset-point inverses in R2 decompose, constraint-denominator inverses, OOD x_i - z inverses, deep-composition inv_h / inv_t, FRI coset twiddles, etc. Every summed-over-tables hot denominator pipeline hits it. fib_iterative_2M on Linux x86_64, 12 cores, 3 samples: - prove wall-clock: 75.4s -> 72.45s median (-3.9%), 75.5s -> 72.4s mean (-4.1%) - R3 OOD evaluation: 5.66s -> 4.30s (-24%) - R4 deep_composition_poly_evals: 5.62s -> 4.42s (-21%) - R4 queries & openings: 1.58s -> 1.14s (-28%) - R4 interpolate+evaluate_fft: 2.87s -> 2.70s (-6%) - CV 0.4% (2xCV=0.8%, 3.9% improvement clears it easily) - Verification against baseline binary: PASS - All math batch_inverse + stark 121 lib tests pass
365d044 to
c1dd556
Compare
| const PARALLEL_BITREV_THRESHOLD: usize = 1 << 14; | ||
| if n >= PARALLEL_BITREV_THRESHOLD { |
There was a problem hiding this comment.
Medium — Missing power-of-two guard before unsafe parallel swap
The SAFETY argument for core::ptr::swap relies on reverse_index being a bijection on [0, n), which only holds when n is a power of two. The sequential path just produces wrong output if the contract is broken; the parallel path invokes undefined behaviour (a data race) because two threads could swap the same element concurrently.
A debug_assert! catches violations in debug builds at zero release cost:
| const PARALLEL_BITREV_THRESHOLD: usize = 1 << 14; | |
| if n >= PARALLEL_BITREV_THRESHOLD { | |
| const PARALLEL_BITREV_THRESHOLD: usize = 1 << 14; | |
| debug_assert!(n.is_power_of_two(), "in_place_bit_reverse_permute requires a power-of-two length"); | |
| if n >= PARALLEL_BITREV_THRESHOLD { |
| pub fn evaluate_fft_bit_reversed<F: IsFFTField + IsSubFieldOf<E>>( | ||
| poly: &Polynomial<FieldElement<E>>, | ||
| blowup_factor: usize, | ||
| domain_size: Option<usize>, | ||
| ) -> Result<Vec<FieldElement<E>>, FFTError> | ||
| where | ||
| E: Send + Sync, | ||
| { | ||
| let domain_size = domain_size.unwrap_or(0); | ||
| let len = core::cmp::max(poly.coeff_len(), domain_size).next_power_of_two() * blowup_factor; | ||
| if len.trailing_zeros() as u64 > F::TWO_ADICITY { | ||
| return Err(FFTError::DomainSizeError(len.trailing_zeros() as usize)); | ||
| } | ||
| if poly.coefficients().is_empty() { | ||
| return Ok(vec![FieldElement::zero(); len]); | ||
| } | ||
|
|
||
| let mut coeffs = poly.coefficients().to_vec(); | ||
| coeffs.resize(len, FieldElement::zero()); | ||
|
|
||
| let order = len.trailing_zeros() as u64; | ||
| let layer_twiddles = | ||
| LayerTwiddles::<F>::new(order).ok_or(FFTError::DomainSizeError(order as usize))?; | ||
| dispatch_fft(&mut coeffs, &layer_twiddles)?; | ||
| Ok(coeffs) | ||
| } |
There was a problem hiding this comment.
Low — No test for the new public API
evaluate_fft_bit_reversed is used in a correctness-critical path (FRI commit phase) but has no unit test. The invariant to verify is simple: the output should equal evaluate_fft with in_place_bit_reverse_permute applied.
#[test]
fn evaluate_fft_bit_reversed_matches_evaluate_fft_permuted() {
use crate::fft::cpu::bit_reversing::in_place_bit_reverse_permute;
let coeffs: Vec<FE> = (0u64..8).map(FE::from).collect();
let poly = Polynomial::new(&coeffs);
let mut expected = Polynomial::evaluate_fft::<F>(&poly, 2, None).unwrap();
in_place_bit_reverse_permute(&mut expected);
let got = Polynomial::evaluate_fft_bit_reversed::<F>(&poly, 2, None).unwrap();
assert_eq!(got, expected);
}
Codex Code Review
No other concrete security, correctness, or significant performance issues stood out in the reviewed diff. I couldn’t run targeted |
| pub fn evaluate_fft_bit_reversed<F: IsFFTField + IsSubFieldOf<E>>( | ||
| poly: &Polynomial<FieldElement<E>>, | ||
| blowup_factor: usize, | ||
| domain_size: Option<usize>, | ||
| ) -> Result<Vec<FieldElement<E>>, FFTError> | ||
| where | ||
| E: Send + Sync, | ||
| { | ||
| let domain_size = domain_size.unwrap_or(0); | ||
| let len = core::cmp::max(poly.coeff_len(), domain_size).next_power_of_two() * blowup_factor; | ||
| if len.trailing_zeros() as u64 > F::TWO_ADICITY { | ||
| return Err(FFTError::DomainSizeError(len.trailing_zeros() as usize)); | ||
| } | ||
| if poly.coefficients().is_empty() { | ||
| return Ok(vec![FieldElement::zero(); len]); | ||
| } | ||
|
|
||
| let mut coeffs = poly.coefficients().to_vec(); | ||
| coeffs.resize(len, FieldElement::zero()); | ||
|
|
||
| let order = len.trailing_zeros() as u64; | ||
| let layer_twiddles = | ||
| LayerTwiddles::<F>::new(order).ok_or(FFTError::DomainSizeError(order as usize))?; | ||
| dispatch_fft(&mut coeffs, &layer_twiddles)?; | ||
| Ok(coeffs) | ||
| } |
There was a problem hiding this comment.
Low — Duplicate setup code with evaluate_fft
evaluate_fft_bit_reversed is identical to evaluate_fft except it skips the final in_place_bit_reverse_permute. Both functions share ~20 lines of setup (domain-size computation, empty-poly check, zero-padding, twiddle construction). If either is changed in the future the other will silently diverge.
Consider a private helper:
fn evaluate_fft_raw<F: IsFFTField + IsSubFieldOf<E>, E: IsField + Send + Sync>(
poly: &Polynomial<FieldElement<E>>,
blowup_factor: usize,
domain_size: Option<usize>,
) -> Result<Vec<FieldElement<E>>, FFTError> {
// shared setup + dispatch_fft, no permutation
}Then evaluate_fft calls the helper and permutes, while evaluate_fft_bit_reversed just calls the helper.
Review: Misc optimizations (PR #545)SummaryThe PR introduces four targeted performance improvements: parallel bit-reverse permutation, parallel batch inverse chunking, sequential-read + single bit-reverse permute for Merkle leaf hashing, and parallel trace chunk generation. The overall approach is sound and the correctness reasoning is solid. A few issues are worth addressing before merge. IssuesMedium — Missing power-of-two guard before unsafe parallel swap ( The Low — No test for This new public API sits in a correctness-critical path (FRI commit phase) but ships without a unit test. The property to verify is simple: output should equal Low — Duplicate setup code between Both functions share ~20 lines of identical setup (domain-size computation, empty-poly check, zero-padding, twiddle construction). Extracting a private No issues found in
|
|
/bench k=1 |
Stacks five independent prover optimizations in R1 commit, trace build, and the FFT / batch-inverse pipelines.
commit_columns_bit_reversed: replaces scatteredcolumns[col][br(row)]reads (~2 GB) with sequential reads plus a 64 MB post bit-reverse on the digest vector.par_chunksinsidechunk_and_generate: runs the 10 trace-build Phase 5 generators in parallel on the idle rayon pool.evaluate_fft_bit_reversedto drop both cancelling permutes.in_place_bit_reverse_permute: swap pairs(i, br(i))withi < br(i)are disjoint, so a Send/Sync raw-pointer wrapper lets rayon drive them; sequential fallback below N=16K.inplace_batch_inverse: chunks the Montgomery prefix product, trading K−1 extra inversions for K-way parallelism; sequential fallback below N=2^16.