
logup parallelism and twiddle deduplication#518

Closed
diegokingston wants to merge 18 commits into
mainfrom
investigate/prover-performance

@diegokingston
Collaborator

No description provided.

Add `par_batch_inverse` to `FieldElement` (math crate, parallel feature)
that splits the input into per-thread chunks and runs one Montgomery
batch inversion per chunk, trading K extra inversions for O(N/K)
sequential work per thread. Falls back to sequential for inputs < 1024.

Use it in `compute_logup_term_column` and `compute_logup_batched_term_column`
in lookup.rs (guarded by #[cfg(feature = "parallel")]). Also add
`math/parallel` to stark's `parallel` feature so the new method is
visible when stark is compiled with parallelism enabled.
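
The chunking idea in this commit message can be sketched as follows. This is a minimal illustration, not the code from the PR: it uses plain `u64` arithmetic modulo the Goldilocks prime instead of `FieldElement`, `std::thread::scope` instead of Rayon, and hypothetical names (`mul`, `inv`, `batch_inverse`, and the `num_threads` parameter). Only the 1024-element fallback and the 256-element minimum chunk size are taken from the conversation.

```rust
use std::thread;

/// Toy modulus (the Goldilocks prime 2^64 - 2^32 + 1); an assumption for illustration.
const P: u64 = 0xFFFF_FFFF_0000_0001;

fn mul(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % P as u128) as u64
}

/// Modular inverse via Fermat's little theorem: a^(P-2) mod P.
fn inv(a: u64) -> u64 {
    let (mut base, mut exp, mut acc) = (a, P - 2, 1u64);
    while exp > 0 {
        if exp & 1 == 1 {
            acc = mul(acc, base);
        }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}

/// Montgomery's trick: inverts every element in place using one field inversion.
/// All inputs must be nonzero.
fn batch_inverse(xs: &mut [u64]) {
    if xs.is_empty() {
        return;
    }
    // prefix[i] = xs[0] * ... * xs[i-1]
    let mut prefix = Vec::with_capacity(xs.len());
    let mut acc = 1u64;
    for &x in xs.iter() {
        prefix.push(acc);
        acc = mul(acc, x);
    }
    // Walk backwards, peeling one factor off the inverted product per step.
    let mut acc_inv = inv(acc);
    for i in (0..xs.len()).rev() {
        let x = xs[i];
        xs[i] = mul(acc_inv, prefix[i]);
        acc_inv = mul(acc_inv, x);
    }
}

/// One batch inversion per chunk, one chunk per thread.
fn par_batch_inverse(xs: &mut [u64], num_threads: usize) {
    if xs.len() < 1024 {
        return batch_inverse(xs); // too small to be worth spawning threads
    }
    // max(1) keeps num_chunks nonzero even for a degenerate thread count.
    let num_chunks = num_threads.max(1).min(xs.len() / 256);
    let chunk_len = (xs.len() + num_chunks - 1) / num_chunks;
    thread::scope(|s| {
        for chunk in xs.chunks_mut(chunk_len) {
            s.spawn(move || batch_inverse(chunk));
        }
    });
}
```

Each chunk pays for its own field inversion, so K chunks cost K inversions in total instead of one, in exchange for running the multiplications of the K chunks concurrently.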
@diegokingston
Collaborator Author

/bench 3

@github-actions

Codex Code Review

No concrete issues found in the PR diff.

I reviewed the changed paths for security (VM/crypto/rust safety), correctness, significant performance regressions, and unnecessary complexity, and I don’t see actionable problems in the introduced parallelization logic.

Residual risk/testing gap: I could not run cargo check in this environment due to a rustup temp-file filesystem restriction, so this is a static review only.

@github-actions

github-actions Bot commented Apr 21, 2026

Benchmark — fib_iterative_8M (median of 3)

Table parallelism: 32 (auto = cores / 3)

| Metric | main | PR | Δ |
|---|---|---|---|
| Peak heap | 64049 MB | 62570 MB | -1479 MB (-2.3%) ⚪ |
| Prove time | 31.930s | 30.078s | -1.852s (-5.8%) 🟢 |

🎉 Improvement detected — heap or time decreased by more than 5%.

⚠️ Baseline heap spread: 5.5% (66704 MB / 63180 MB / 64049 MB) — comparison may be less reliable

Commit: 95c7718 · Baseline: cached · Runner: self-hosted bench

Comment thread crypto/stark/src/lookup.rs Outdated
Comment on lines +1489 to +1507
#[cfg(not(feature = "parallel"))]
let mut fingerprints: Vec<FieldElement<E>> = (0..trace_len)
.map(|row| {
let mut linear_combination = &bus_id_f * &alpha_powers[0];
let mut alpha_offset = 1;
for bv in &table_interaction.values {
let consumed = bv.accumulate_fingerprint(
main_segment_cols,
row,
&alpha_powers,
alpha_offset,
&mut linear_combination,
&shifts,
);
alpha_offset += consumed;
}
z - &linear_combination
})
.collect();

Medium – Simplicity: massive duplication with #[cfg] pairs

The sequential block is a verbatim copy of the parallel block above; the only difference is the .into_par_iter() call in front of .map(). The same pattern is repeated twice more for fingerprints_a and fingerprints_b below. Every future change to the inner computation must be applied in two places.

Extract the body to a closure first, then the #[cfg] branches reduce to a single line each:

let compute_fingerprint = |row: usize| {
    let mut linear_combination = &bus_id_f * &alpha_powers[0];
    let mut alpha_offset = 1;
    for bv in &table_interaction.values {
        let consumed = bv.accumulate_fingerprint(
            main_segment_cols,
            row,
            &alpha_powers,
            alpha_offset,
            &mut linear_combination,
            &shifts,
        );
        alpha_offset += consumed;
    }
    z - &linear_combination
};

#[cfg(feature = "parallel")]
let mut fingerprints: Vec<FieldElement<E>> = (0..trace_len).into_par_iter().map(compute_fingerprint).collect();
#[cfg(not(feature = "parallel"))]
let mut fingerprints: Vec<FieldElement<E>> = (0..trace_len).map(compute_fingerprint).collect();

Same fix applies to the fingerprints_a/fingerprints_b blocks lower in this file.

Comment thread crypto/stark/src/lookup.rs Outdated
Comment on lines +1646 to +1724
#[cfg(feature = "parallel")]
let fingerprints_a: Vec<FieldElement<E>> = (0..trace_len)
.into_par_iter()
.map(|row| {
let mut lc_a = &bus_id_a * &alpha_powers[0];
let mut alpha_offset = 1;
for bv in &interaction_a.values {
let consumed = bv.accumulate_fingerprint(
main_segment_cols,
row,
&alpha_powers,
alpha_offset,
&mut lc_a,
&shifts,
);
alpha_offset += consumed;
}
z - &lc_a
})
.collect();
#[cfg(not(feature = "parallel"))]
let fingerprints_a: Vec<FieldElement<E>> = (0..trace_len)
.map(|row| {
let mut lc_a = &bus_id_a * &alpha_powers[0];
let mut alpha_offset = 1;
for bv in &interaction_a.values {
let consumed = bv.accumulate_fingerprint(
main_segment_cols,
row,
&alpha_powers,
alpha_offset,
&mut lc_a,
&shifts,
);
alpha_offset += consumed;
}
z - &lc_a
})
.collect();

for row in 0..trace_len {
let mut lc_a = &bus_id_a * &alpha_powers[0];
let mut alpha_offset = 1;
for bv in &interaction_a.values {
let consumed = bv.accumulate_fingerprint(
main_segment_cols,
row,
&alpha_powers,
alpha_offset,
&mut lc_a,
&shifts,
);
alpha_offset += consumed;
}
all_fingerprints.push(z - &lc_a);
}
for row in 0..trace_len {
let mut lc_b = &bus_id_b * &alpha_powers[0];
let mut alpha_offset = 1;
for bv in &interaction_b.values {
let consumed = bv.accumulate_fingerprint(
main_segment_cols,
row,
&alpha_powers,
alpha_offset,
&mut lc_b,
&shifts,
);
alpha_offset += consumed;
}
all_fingerprints.push(z - &lc_b);
}
#[cfg(feature = "parallel")]
let fingerprints_b: Vec<FieldElement<E>> = (0..trace_len)
.into_par_iter()
.map(|row| {
let mut lc_b = &bus_id_b * &alpha_powers[0];
let mut alpha_offset = 1;
for bv in &interaction_b.values {
let consumed = bv.accumulate_fingerprint(
main_segment_cols,
row,
&alpha_powers,
alpha_offset,
&mut lc_b,
&shifts,
);
alpha_offset += consumed;
}
z - &lc_b
})
.collect();
#[cfg(not(feature = "parallel"))]
let fingerprints_b: Vec<FieldElement<E>> = (0..trace_len)
.map(|row| {
let mut lc_b = &bus_id_b * &alpha_powers[0];
let mut alpha_offset = 1;
for bv in &interaction_b.values {
let consumed = bv.accumulate_fingerprint(
main_segment_cols,
row,
&alpha_powers,
alpha_offset,
&mut lc_b,
&shifts,
);
alpha_offset += consumed;
}
z - &lc_b
})
.collect();

Low – Performance: fingerprints_a and fingerprints_b run sequentially

Both parallel loops are fully independent (different bus_id, different interaction.values, same immutable main_segment_cols). Running them back-to-back leaves half the potential parallelism on the table. Use rayon::join to execute both concurrently:

let (fingerprints_a, fingerprints_b) = rayon::join(
    || (0..trace_len).into_par_iter().map(|row| { /* lc_a ... */ z - &lc_a }).collect::<Vec<_>>(),
    || (0..trace_len).into_par_iter().map(|row| { /* lc_b ... */ z - &lc_b }).collect::<Vec<_>>(),
);

(This also halves the number of #[cfg] blocks needed here.)

Comment thread crypto/math/src/field/element.rs Outdated
return Self::inplace_batch_inverse(numbers);
}
use rayon::prelude::*;
let num_chunks = rayon::current_num_threads().min(numbers.len() / 256);

Low – Correctness edge case: num_chunks can be 0 if the global thread pool size is 0

rayon::current_num_threads() always returns ≥ 1 in practice, and the < 1024 guard ensures numbers.len() / 256 ≥ 4, so this is safe today. But the invariant is implicit. A one-liner makes it explicit and guards against exotic pool configurations:

Suggested change
- let num_chunks = rayon::current_num_threads().min(numbers.len() / 256);
+ let num_chunks = rayon::current_num_threads().max(1).min(numbers.len() / 256);

@claude

claude Bot commented Apr 21, 2026

Review: perf/parallel LogUp + par_batch_inverse

No security or correctness issues. Two structural concerns worth addressing before merge.

Medium – Simplicity: #[cfg] duplication across all computation blocks

Every fingerprint computation in lookup.rs is written out twice, once under #[cfg(feature = "parallel")] and once under #[cfg(not(feature = "parallel"))], and the only difference is the .into_par_iter() call in front of .map(). This affects 4 blocks (~120 duplicated lines total). Any future fix to the inner computation must be applied in two places. Fix: extract each computation to a closure before the #[cfg] branches. See inline comment on lines 1489–1507.

Low – Performance: fingerprints_a / fingerprints_b run sequentially

The two parallel loops in compute_logup_batched_term_column are independent and could run concurrently with rayon::join, recovering another ~2x within that function. See inline comment on lines 1646–1724.

Low – Defensive guard in par_batch_inverse

num_chunks relies on rayon::current_num_threads() >= 1 to avoid a later integer overflow. Adding .max(1) makes the invariant explicit. See inline comment on element.rs:86.


The Cargo.toml addition of math/parallel to the parallel feature and the overall parallelization strategy (embarrassingly-parallel rows, single batch inversion per chunk) are correct.

Replace the column-parallel LogUp auxiliary trace build (which caused
Rayon over-subscription when called from an already-parallel context)
with a chunk-local approach inspired by Plonky3.

Key changes:
- New `compute_logup_batched_term_column_chunked` and
  `compute_logup_term_column_chunked` functions process rows in chunks
  of 1024, fusing fingerprint computation + batch inverse + term
  evaluation per chunk for L2 cache locality
- Parallelism is across row-chunks (par_chunks_mut), not across
  interaction pairs, avoiding nested Rayon over-subscription
- New `compute_multiplicity_for_row` helper avoids materializing full
  Vec<FieldElement> per interaction in the chunked path
- `build_accumulated_column_from_terms` now uses parallel reduction
  for table_contribution and 3-phase parallel prefix sum for the
  accumulated column
- Sequential (non-parallel) path unchanged, using original functions
- All 121 stark tests pass with and without parallel feature
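
The commit message does not spell out the three phases of the prefix sum, so the following is a sketch of the usual scheme under stated assumptions: wrapping `u64` addition stands in for extension-field addition, `std::thread::scope` stands in for Rayon, and `par_prefix_sum` and its `num_chunks` parameter are hypothetical names.

```rust
use std::thread;

/// In-place inclusive prefix sum in three phases.
fn par_prefix_sum(data: &mut [u64], num_chunks: usize) {
    if data.is_empty() {
        return;
    }
    let num_chunks = num_chunks.max(1).min(data.len());
    let chunk_len = (data.len() + num_chunks - 1) / num_chunks;

    // Phase 1 (parallel): local prefix sum inside each chunk, returning the chunk total.
    let totals: Vec<u64> = thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks_mut(chunk_len)
            .map(|chunk| {
                s.spawn(move || {
                    let mut acc = 0u64;
                    for x in chunk.iter_mut() {
                        acc = acc.wrapping_add(*x);
                        *x = acc;
                    }
                    acc
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .collect::<Vec<u64>>()
    });

    // Phase 2 (sequential): exclusive scan over the chunk totals gives each chunk's offset.
    let mut offsets = Vec::with_capacity(totals.len());
    let mut running = 0u64;
    for t in &totals {
        offsets.push(running);
        running = running.wrapping_add(*t);
    }

    // Phase 3 (parallel): shift every element of a chunk by that chunk's offset.
    thread::scope(|s| {
        for (chunk, &offset) in data.chunks_mut(chunk_len).zip(&offsets) {
            s.spawn(move || {
                for x in chunk.iter_mut() {
                    *x = x.wrapping_add(offset);
                }
            });
        }
    });
}
```

Only phase 2 is sequential, and it touches one value per chunk rather than one per row.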
@diegokingston
Collaborator Author

/bench 3

Two optimizations to the FRI commit phase:

1. Precompute zeta * inv_twiddles[j] once per layer (F×E = 3 base muls
   each). The per-row fold then uses one E×E multiply (9 base muls)
   instead of E×E + F×E (12 base muls). Saves ~25% of fold arithmetic.

2. Hash FRI leaves directly from evals pairs via build_from_hashed_leaves,
   eliminating the intermediate Vec<[FieldElement; 2]> allocation
   (~24MB at FRI layer 0).

Use Rayon map_init to allocate one byte buffer per thread (reused
across all rows) instead of vec![0u8; N] per row. For CPU table
(74 cols × 2^21 rows), this eliminates ~2M heap allocations.

Applied to both commit_columns_bit_reversed (main/aux trace) and
commit_composition_polynomial (composition poly).
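
The buffer-reuse pattern described above can be sketched without Rayon. In this illustration, which is not the PR's code, each scoped thread plays the role of the `map_init` state: it allocates one `Vec<u8>` and reuses it for every row it processes. `DefaultHasher` stands in for the real leaf hash, and `hash_row` and `hash_rows` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::thread;

/// Serializes one row into `buf` (little-endian) and hashes the bytes.
fn hash_row(row: &[u64], buf: &mut Vec<u8>) -> u64 {
    buf.clear(); // keeps the capacity, so rows after the first do not reallocate
    for v in row {
        buf.extend_from_slice(&v.to_le_bytes());
    }
    let mut hasher = DefaultHasher::new();
    hasher.write(buf.as_slice());
    hasher.finish()
}

/// Hashes all rows, splitting them across threads with one scratch buffer per thread.
fn hash_rows(rows: &[Vec<u64>], num_threads: usize) -> Vec<u64> {
    if rows.is_empty() {
        return Vec::new();
    }
    let threads = num_threads.max(1);
    let chunk_len = (rows.len() + threads - 1) / threads;
    let mut out = vec![0u64; rows.len()];
    thread::scope(|s| {
        for (rows_chunk, out_chunk) in rows.chunks(chunk_len).zip(out.chunks_mut(chunk_len)) {
            s.spawn(move || {
                let mut buf = Vec::new(); // allocated once per thread, not once per row
                for (row, slot) in rows_chunk.iter().zip(out_chunk.iter_mut()) {
                    *slot = hash_row(row, &mut buf);
                }
            });
        }
    });
    out
}
```

The number of scratch allocations then scales with the thread count rather than with the number of rows.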
@diegokingston
Collaborator Author

/bench 3

@diegokingston
Collaborator Author

/bench 3

@diegokingston
Collaborator Author

/bench 3

FRI produces ~190 small tree builds per proof (layers 10-18 have
2-512 leaves). Rayon scheduling overhead exceeds computation for
these tiny trees. Add a 1024-node threshold: below it, use
sequential iteration for both leaf hashing and internal node
construction.
@diegokingston
Collaborator Author

/bench 3

@diegokingston diegokingston changed the title from "docs: add prover parallelism improvement plan (6 tasks)" to "logup parallelism" Apr 21, 2026
…, blowup)

Tables with the same domain size (e.g., 7+ tables at 2^20) were each
creating their own Domain (~24 MB) and LdeTwiddles (~32 MB). With
~20 tables and only 4-5 distinct sizes, this wasted ~300 MB of memory
and repeated the root-of-unity and twiddle generation for every table.

Now uses a HashMap cache keyed by (trace_length, blowup_factor).
Domain and LdeTwiddles are shared via Arc across all tables with the
same parameters.
@diegokingston
Collaborator Author

/bench 3

@diegokingston
Collaborator Author

/bench 3

@diegokingston
Collaborator Author

/bench 3

@diegokingston diegokingston force-pushed the investigate/prover-performance branch from 116583e to 9eec069 Compare April 21, 2026 21:02
@diegokingston
Collaborator Author

/bench 3

@diegokingston diegokingston force-pushed the investigate/prover-performance branch from 9eec069 to 8e70557 Compare April 21, 2026 21:22
@diegokingston
Collaborator Author

/bench 3

@diegokingston diegokingston changed the title from "logup parallelism" to "logup parallelism and twiddle deduplication" Apr 21, 2026
@diegokingston
Collaborator Author

/bench 3

@diegokingston diegokingston marked this pull request as ready for review April 22, 2026 13:02
@github-actions

Codex Code Review

Findings

  1. High (Crypto correctness bug): domain cache key is incomplete and can reuse the wrong domain
  2. Medium (Performance/memory regression): accumulated-column build now materializes ~3x trace-length buffers

Security-specific note

  • No new unsafe/memory-safety issues observed in this diff.
  • The first finding is still security-relevant as a cryptographic correctness issue (domain mismatch).

let domain = new_domain(*air, trace_length);
let twiddles = LdeTwiddles::new(&domain);
let blowup = air.options().blowup_factor as usize;
let key = (trace_length, blowup);

Medium — incorrect cache key: coset_offset is missing

new_domain uses air.options().coset_offset (in addition to trace_length and blowup_factor) to build the LDE coset and lde_roots_of_unity_coset. If two AIRs share the same (trace_length, blowup) but use different coset offsets, the second one silently receives the wrong Domain (and therefore wrong LdeTwiddles), producing an incorrect proof.

In the current codebase all AIRs appear to use coset_offset = 3, but this assumption is not enforced here and would be a silent correctness bug if it ever changed.

Fix: include the coset offset in the key:

Suggested change
- let key = (trace_length, blowup);
+ let coset_offset = air.options().coset_offset as usize;
+ let key = (trace_length, blowup, coset_offset);

(And update the HashMap type annotation accordingly.)
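
A minimal sketch of a cache keyed on all three parameters, assuming a stand-in `Domain` struct and a hypothetical `DomainCache` type (the real `Domain`, `LdeTwiddles`, and `new_domain` are not reproduced here). The `builds` counter exists only to make cache hits observable.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for the real domain; it only records the parameters it was built from.
#[derive(Debug)]
struct Domain {
    trace_length: usize,
    blowup_factor: usize,
    coset_offset: u64,
}

/// Every parameter that influences the domain must be part of the key.
type DomainKey = (usize, usize, u64);

#[derive(Default)]
struct DomainCache {
    entries: HashMap<DomainKey, Arc<Domain>>,
    builds: usize, // how many domains were actually constructed
}

impl DomainCache {
    fn get_or_build(
        &mut self,
        trace_length: usize,
        blowup_factor: usize,
        coset_offset: u64,
    ) -> Arc<Domain> {
        let key = (trace_length, blowup_factor, coset_offset);
        if let Some(domain) = self.entries.get(&key) {
            return Arc::clone(domain); // shared, no rebuild
        }
        self.builds += 1;
        let domain = Arc::new(Domain {
            trace_length,
            blowup_factor,
            coset_offset,
        });
        self.entries.insert(key, Arc::clone(&domain));
        domain
    }
}
```

With this key, two requests that differ only in `coset_offset` get distinct domains, while identical requests share one `Arc`.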

Contributor

@MauroToscano MauroToscano left a comment

@nicole-graus is splitting it and checking the optimizations one by one

@nicole-graus
Collaborator

/bench

@nicole-graus
Collaborator

/bench 10

@diegokingston
Collaborator Author

Pushed fixes for the two concrete findings to triage/pr518-fixes — two focused commits on top of the PR head. Cherry-pick or merge as you prefer.

(H) coset_offset in domain cache key (c7f6aaef)
Adds coset_offset (u64) to the HashMap key tuple so that AIRs with the same (trace_length, blowup) but different offsets don't silently alias onto the same Domain/LdeTwiddles. Zero runtime cost, 12-line diff in prover.rs. Current call sites thread one ProofOptions everywhere, so this is purely defensive, but the cache is in a generic crate and other callers shouldn't need to keep the invariant in their head.

(M) eliminate 2N allocations in build_accumulated_column_from_terms (04602bb3)
Restructures from row_sums → chunk_data → acc_col → trace.set_aux (three N-sized live buffers) to chunk_totals → chunk_data → trace.set_aux (one N-sized live buffer, plus O(num_chunks) scalars). Peak transient storage drops from ~3N to ~N extension-field elements per table — roughly 670 MiB less allocator pressure on a 14-table proof at N=2^20.

Trade-off: one extra pass over term_columns in Pass 1 (just chunk totals, parallel). The final scatter into trace.set_aux is now sequential because TraceTable is row-major + &mut self-only; one extension-field add per row is ~10 ms/table — rounding error next to the Round 1 LDE FFTs.
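
The "roughly 670 MiB" figure can be reproduced with a short calculation, assuming 24-byte extension-field elements (a cubic extension over a 64-bit base field). That element size is an assumption; the comment above does not state it. `bytes_saved` is a hypothetical helper.

```rust
/// Bytes no longer allocated when two of the three N-sized buffers per table are removed.
/// `element_bytes = 24` is an assumed size for one extension-field element.
fn bytes_saved(rows: u64, element_bytes: u64, tables: u64) -> u64 {
    2 * rows * element_bytes * tables
}

/// Converts a byte count to whole MiB.
fn to_mib(bytes: u64) -> u64 {
    bytes / (1024 * 1024)
}
```

Under that assumption, `to_mib(bytes_saved(1 << 20, 24, 14))` evaluates to 672, in line with the estimate for a 14-table proof at N=2^20.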

Verified on triage/pr518-fixes:

  • cargo check --workspace + cargo check -p stark --no-default-features — clean
  • make lint — clean (fmt + clippy default + debug-checks)
  • cargo test -p stark --release --lib — 121/121 pass
  • cargo test -p lambda-vm-prover --release --lib non-ELF subset — 247/247 pass

Would be good to re-/bench once these land on the PR branch to confirm the −5.8% prove time from commit 95c7718 holds (expect at worst flat, at best a small additional heap win).

@diegokingston diegokingston deleted the investigate/prover-performance branch May 20, 2026 12:51