logup parallelism and twiddle deduplication#518
Conversation
Add `par_batch_inverse` to `FieldElement` (math crate, parallel feature) that splits the input into per-thread chunks and runs one Montgomery batch inversion per chunk, trading K extra inversions for O(N/K) sequential work per thread. Falls back to sequential for inputs < 1024. Use it in `compute_logup_term_column` and `compute_logup_batched_term_column` in lookup.rs (guarded by #[cfg(feature = "parallel")]). Also add `math/parallel` to stark's `parallel` feature so the new method is visible when stark is compiled with parallelism enabled.
|
/bench 3 |
Codex Code ReviewNo concrete issues found in the PR diff. I reviewed the changed paths for security (VM/crypto/rust safety), correctness, significant performance regressions, and unnecessary complexity, and I don’t see actionable problems in the introduced parallelization logic. Residual risk/testing gap: I could not run |
Benchmark — fib_iterative_8M (median of 3)Table parallelism: 32 (auto = cores / 3)
Commit: 95c7718 · Baseline: cached · Runner: self-hosted bench |
| #[cfg(not(feature = "parallel"))] | ||
| let mut fingerprints: Vec<FieldElement<E>> = (0..trace_len) | ||
| .map(|row| { | ||
| let mut linear_combination = &bus_id_f * &alpha_powers[0]; | ||
| let mut alpha_offset = 1; | ||
| for bv in &table_interaction.values { | ||
| let consumed = bv.accumulate_fingerprint( | ||
| main_segment_cols, | ||
| row, | ||
| &alpha_powers, | ||
| alpha_offset, | ||
| &mut linear_combination, | ||
| &shifts, | ||
| ); | ||
| alpha_offset += consumed; | ||
| } | ||
| z - &linear_combination | ||
| }) | ||
| .collect(); |
There was a problem hiding this comment.
Medium – Simplicity: massive duplication with #[cfg] pairs
The sequential block is a verbatim copy of the parallel block above — only into_par_iter() vs .map() differs. The same pattern is repeated twice more for fingerprints_a and fingerprints_b below. Every future change to the inner computation must be applied in two places.
Extract the body to a closure first, then the #[cfg] branches reduce to a single line each:
let compute_fingerprint = |row: usize| {
let mut linear_combination = &bus_id_f * &alpha_powers[0];
let mut alpha_offset = 1;
for bv in &table_interaction.values {
let consumed = bv.accumulate_fingerprint(
main_segment_cols,
row,
&alpha_powers,
alpha_offset,
&mut linear_combination,
&shifts,
);
alpha_offset += consumed;
}
z - &linear_combination
};
#[cfg(feature = "parallel")]
let mut fingerprints: Vec<FieldElement<E>> = (0..trace_len).into_par_iter().map(compute_fingerprint).collect();
#[cfg(not(feature = "parallel"))]
let mut fingerprints: Vec<FieldElement<E>> = (0..trace_len).map(compute_fingerprint).collect();Same fix applies to the fingerprints_a/fingerprints_b blocks lower in this file.
| #[cfg(feature = "parallel")] | ||
| let fingerprints_a: Vec<FieldElement<E>> = (0..trace_len) | ||
| .into_par_iter() | ||
| .map(|row| { | ||
| let mut lc_a = &bus_id_a * &alpha_powers[0]; | ||
| let mut alpha_offset = 1; | ||
| for bv in &interaction_a.values { | ||
| let consumed = bv.accumulate_fingerprint( | ||
| main_segment_cols, | ||
| row, | ||
| &alpha_powers, | ||
| alpha_offset, | ||
| &mut lc_a, | ||
| &shifts, | ||
| ); | ||
| alpha_offset += consumed; | ||
| } | ||
| z - &lc_a | ||
| }) | ||
| .collect(); | ||
| #[cfg(not(feature = "parallel"))] | ||
| let fingerprints_a: Vec<FieldElement<E>> = (0..trace_len) | ||
| .map(|row| { | ||
| let mut lc_a = &bus_id_a * &alpha_powers[0]; | ||
| let mut alpha_offset = 1; | ||
| for bv in &interaction_a.values { | ||
| let consumed = bv.accumulate_fingerprint( | ||
| main_segment_cols, | ||
| row, | ||
| &alpha_powers, | ||
| alpha_offset, | ||
| &mut lc_a, | ||
| &shifts, | ||
| ); | ||
| alpha_offset += consumed; | ||
| } | ||
| z - &lc_a | ||
| }) | ||
| .collect(); | ||
|
|
||
| for row in 0..trace_len { | ||
| let mut lc_a = &bus_id_a * &alpha_powers[0]; | ||
| let mut alpha_offset = 1; | ||
| for bv in &interaction_a.values { | ||
| let consumed = bv.accumulate_fingerprint( | ||
| main_segment_cols, | ||
| row, | ||
| &alpha_powers, | ||
| alpha_offset, | ||
| &mut lc_a, | ||
| &shifts, | ||
| ); | ||
| alpha_offset += consumed; | ||
| } | ||
| all_fingerprints.push(z - &lc_a); | ||
| } | ||
| for row in 0..trace_len { | ||
| let mut lc_b = &bus_id_b * &alpha_powers[0]; | ||
| let mut alpha_offset = 1; | ||
| for bv in &interaction_b.values { | ||
| let consumed = bv.accumulate_fingerprint( | ||
| main_segment_cols, | ||
| row, | ||
| &alpha_powers, | ||
| alpha_offset, | ||
| &mut lc_b, | ||
| &shifts, | ||
| ); | ||
| alpha_offset += consumed; | ||
| } | ||
| all_fingerprints.push(z - &lc_b); | ||
| } | ||
| #[cfg(feature = "parallel")] | ||
| let fingerprints_b: Vec<FieldElement<E>> = (0..trace_len) | ||
| .into_par_iter() | ||
| .map(|row| { | ||
| let mut lc_b = &bus_id_b * &alpha_powers[0]; | ||
| let mut alpha_offset = 1; | ||
| for bv in &interaction_b.values { | ||
| let consumed = bv.accumulate_fingerprint( | ||
| main_segment_cols, | ||
| row, | ||
| &alpha_powers, | ||
| alpha_offset, | ||
| &mut lc_b, | ||
| &shifts, | ||
| ); | ||
| alpha_offset += consumed; | ||
| } | ||
| z - &lc_b | ||
| }) | ||
| .collect(); | ||
| #[cfg(not(feature = "parallel"))] | ||
| let fingerprints_b: Vec<FieldElement<E>> = (0..trace_len) | ||
| .map(|row| { | ||
| let mut lc_b = &bus_id_b * &alpha_powers[0]; | ||
| let mut alpha_offset = 1; | ||
| for bv in &interaction_b.values { | ||
| let consumed = bv.accumulate_fingerprint( | ||
| main_segment_cols, | ||
| row, | ||
| &alpha_powers, | ||
| alpha_offset, | ||
| &mut lc_b, | ||
| &shifts, | ||
| ); | ||
| alpha_offset += consumed; | ||
| } | ||
| z - &lc_b | ||
| }) | ||
| .collect(); |
There was a problem hiding this comment.
Low – Performance: fingerprints_a and fingerprints_b run sequentially
Both parallel loops are fully independent (different bus_id, different interaction.values, same immutable main_segment_cols). Running them back-to-back leaves half the potential parallelism on the table. Use rayon::join to execute both concurrently:
let (fingerprints_a, fingerprints_b) = rayon::join(
|| (0..trace_len).into_par_iter().map(|row| { /* lc_a ... */ z - &lc_a }).collect::<Vec<_>>(),
|| (0..trace_len).into_par_iter().map(|row| { /* lc_b ... */ z - &lc_b }).collect::<Vec<_>>(),
);(This also halves the number of #[cfg] blocks needed here.)
| return Self::inplace_batch_inverse(numbers); | ||
| } | ||
| use rayon::prelude::*; | ||
| let num_chunks = rayon::current_num_threads().min(numbers.len() / 256); |
There was a problem hiding this comment.
Low – Correctness edge case: num_chunks can be 0 if the global thread pool size is 0
rayon::current_num_threads() always returns ≥ 1 in practice, and the < 1024 guard ensures numbers.len() / 256 ≥ 4, so this is safe today. But the invariant is implicit. A one-liner makes it explicit and guards against exotic pool configurations:
| let num_chunks = rayon::current_num_threads().min(numbers.len() / 256); | |
| let num_chunks = rayon::current_num_threads().max(1).min(numbers.len() / 256); |
Review: perf/parallel LogUp + par_batch_inverseNo security or correctness issues. Two structural concerns worth addressing before merge. Medium – Simplicity:
|
…sted Rayon over-subscription)
Replace the column-parallel LogUp auxiliary trace build (which caused Rayon over-subscription when called from an already-parallel context) with a chunk-local approach inspired by Plonky3. Key changes: - New `compute_logup_batched_term_column_chunked` and `compute_logup_term_column_chunked` functions process rows in chunks of 1024, fusing fingerprint computation + batch inverse + term evaluation per chunk for L2 cache locality - Parallelism is across row-chunks (par_chunks_mut), not across interaction pairs, avoiding nested Rayon over-subscription - New `compute_multiplicity_for_row` helper avoids materializing full Vec<FieldElement> per interaction in the chunked path - `build_accumulated_column_from_terms` now uses parallel reduction for table_contribution and 3-phase parallel prefix sum for the accumulated column - Sequential (non-parallel) path unchanged, using original functions - All 121 stark tests pass with and without parallel feature
|
/bench 3 |
Two optimizations to the FRI commit phase: 1. Precompute zeta * inv_twiddles[j] once per layer (F×E = 3 base muls each). The per-row fold then uses one E×E multiply (9 base muls) instead of E×E + F×E (12 base muls). Saves ~25% of fold arithmetic. 2. Hash FRI leaves directly from evals pairs via build_from_hashed_leaves, eliminating the intermediate Vec<[FieldElement; 2]> allocation (~24MB at FRI layer 0).
Use Rayon map_init to allocate one byte buffer per thread (reused across all rows) instead of vec![0u8; N] per row. For CPU table (74 cols × 2^21 rows), this eliminates ~2M heap allocations. Applied to both commit_columns_bit_reversed (main/aux trace) and commit_composition_polynomial (composition poly).
|
/bench 3 |
…s isolated benchmarking)
|
/bench 3 |
…ot the bottleneck, Keccak dominates)
…n in chunked LogUp
|
/bench 3 |
FRI produces ~190 small tree builds per proof (layers 10-18 have 2-512 leaves). Rayon scheduling overhead exceeds computation for these tiny trees. Add a 1024-node threshold: below it, use sequential iteration for both leaf hashing and internal node construction.
|
/bench 3 |
…, blowup) Tables with the same domain size (e.g., 7+ tables at 2^20) were each creating their own Domain (~24 MB) and LdeTwiddles (~32 MB). With ~20 tables and only 4-5 distinct sizes, this wasted ~300 MB of memory and redundant root-of-unity + twiddle generation. Now uses a HashMap cache keyed by (trace_length, blowup_factor). Domain and LdeTwiddles are shared via Arc across all tables with the same parameters.
|
/bench 3 |
2 similar comments
|
/bench 3 |
|
/bench 3 |
116583e to
9eec069
Compare
|
/bench 3 |
9eec069 to
8e70557
Compare
|
/bench 3 |
|
/bench 3 |
Codex Code ReviewFindings
Security-specific note
|
| let domain = new_domain(*air, trace_length); | ||
| let twiddles = LdeTwiddles::new(&domain); | ||
| let blowup = air.options().blowup_factor as usize; | ||
| let key = (trace_length, blowup); |
There was a problem hiding this comment.
Medium — incorrect cache key: coset_offset is missing
new_domain uses air.options().coset_offset (in addition to trace_length and blowup_factor) to build the LDE coset and lde_roots_of_unity_coset. If two AIRs share the same (trace_length, blowup) but use different coset offsets, the second one silently receives the wrong Domain (and therefore wrong LdeTwiddles), producing an incorrect proof.
In the current codebase all AIRs appear to use coset_offset = 3, but this assumption is not enforced here and would be a silent correctness bug if it ever changed.
Fix: include the coset offset in the key:
| let key = (trace_length, blowup); | |
| let coset_offset = air.options().coset_offset as usize; | |
| let key = (trace_length, blowup, coset_offset); |
(And update the HashMap type annotation accordingly.)
MauroToscano
left a comment
There was a problem hiding this comment.
@nicole-graus is splitting it and checking the optimization one by one
|
/bench |
|
/bench 10 |
|
Pushed fixes for the two concrete findings to (H) (M) eliminate 2N allocations in Trade-off: one extra pass over Verified on
Would be good to re- |
No description provided.