logup parallelism and twiddle deduplication by diegokingston · Pull Request #518 · yetanotherco/lambda_vm

diegokingston · 2026-04-21T18:58:03Z

No description provided.

Add `par_batch_inverse` to `FieldElement` (math crate, parallel feature) that splits the input into per-thread chunks and runs one Montgomery batch inversion per chunk, trading K extra inversions for O(N/K) sequential work per thread. Falls back to sequential for inputs < 1024. Use it in `compute_logup_term_column` and `compute_logup_batched_term_column` in lookup.rs (guarded by #[cfg(feature = "parallel")]). Also add `math/parallel` to stark's `parallel` feature so the new method is visible when stark is compiled with parallelism enabled.
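As a rough illustration of the chunked batch inversion described above, the sketch below runs Montgomery's trick once per chunk, one chunk per thread. It is not the PR's code. It assumes a toy prime field over u64 (modulus 2^61 - 1) in place of `FieldElement`, uses `std::thread::scope` instead of rayon, and takes the thread count as a parameter. The helper names `mul`, `inv`, `batch_inverse` and this signature of `par_batch_inverse` are hypothetical. Only the 1024 fallback threshold and the `len / 256` cap come from the PR.

```rust
use std::thread;

/// Toy prime modulus (2^61 - 1); an assumption for illustration only.
const P: u64 = (1u64 << 61) - 1;

fn mul(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % P as u128) as u64
}

/// Single inversion via Fermat's little theorem: a^(P-2) mod P.
fn inv(a: u64) -> u64 {
    let (mut base, mut exp, mut acc) = (a, P - 2, 1u64);
    while exp > 0 {
        if exp & 1 == 1 {
            acc = mul(acc, base);
        }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}

/// Montgomery's trick: inverts every element in place using one field
/// inversion plus O(n) multiplications. All inputs must be nonzero.
fn batch_inverse(xs: &mut [u64]) {
    if xs.is_empty() {
        return;
    }
    // prefix[i] = xs[0] * ... * xs[i]
    let mut prefix = Vec::with_capacity(xs.len());
    let mut acc = 1u64;
    for &x in xs.iter() {
        acc = mul(acc, x);
        prefix.push(acc);
    }
    // inv_acc starts as the inverse of the full product and is peeled
    // back one element at a time.
    let mut inv_acc = inv(acc);
    for i in (1..xs.len()).rev() {
        let x = xs[i];
        xs[i] = mul(inv_acc, prefix[i - 1]);
        inv_acc = mul(inv_acc, x);
    }
    xs[0] = inv_acc;
}

/// One batch inversion per chunk, one chunk per thread. This costs one
/// extra inversion per chunk compared with a single sequential pass.
fn par_batch_inverse(xs: &mut [u64], threads: usize) {
    if xs.len() < 1024 {
        return batch_inverse(xs); // sequential fallback for small inputs
    }
    let num_chunks = threads.max(1).min(xs.len() / 256);
    let chunk_len = (xs.len() + num_chunks - 1) / num_chunks;
    thread::scope(|s| {
        for chunk in xs.chunks_mut(chunk_len) {
            s.spawn(move || batch_inverse(chunk));
        }
    });
}
```

Calling `par_batch_inverse(&mut values, 4)` on a slice of nonzero residues leaves each entry replaced by its inverse modulo `P`.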

diegokingston · 2026-04-21T18:58:11Z

/bench 3

github-actions · 2026-04-21T18:59:18Z

Codex Code Review

No concrete issues found in the PR diff.

I reviewed the changed paths for security (VM/crypto/rust safety), correctness, significant performance regressions, and unnecessary complexity, and I don’t see actionable problems in the introduced parallelization logic.

Residual risk/testing gap: I could not run cargo check in this environment due to a rustup temp-file filesystem restriction, so this is a static review only.

github-actions · 2026-04-21T19:00:29Z

Benchmark — fib_iterative_8M (median of 3)

Table parallelism: 32 (auto = cores / 3)

Metric	main	PR	Δ
Peak heap	64049 MB	62570 MB	-1479 MB (-2.3%) ⚪
Prove time	31.930s	30.078s	-1.852s (-5.8%) 🟢

🎉 Improvement detected — heap or time decreased by more than 5%.

⚠️ Baseline heap spread: 5.5% (66704 MB / 63180 MB / 64049 MB) — comparison may be less reliable

Commit: 95c7718 · Baseline: cached · Runner: self-hosted bench

claude · 2026-04-21T19:03:04Z

+    #[cfg(not(feature = "parallel"))]
+    let mut fingerprints: Vec<FieldElement<E>> = (0..trace_len)
+        .map(|row| {
+            let mut linear_combination = &bus_id_f * &alpha_powers[0];
+            let mut alpha_offset = 1;
+            for bv in &table_interaction.values {
+                let consumed = bv.accumulate_fingerprint(
+                    main_segment_cols,
+                    row,
+                    &alpha_powers,
+                    alpha_offset,
+                    &mut linear_combination,
+                    &shifts,
+                );
+                alpha_offset += consumed;
+            }
+            z - &linear_combination
+        })
+        .collect();


Medium – Simplicity: massive duplication with #[cfg] pairs

The sequential block is a verbatim copy of the parallel block above — only into_par_iter() vs .map() differs. The same pattern is repeated twice more for fingerprints_a and fingerprints_b below. Every future change to the inner computation must be applied in two places.

Extract the body to a closure first, then the #[cfg] branches reduce to a single line each:

    let compute_fingerprint = |row: usize| {
        let mut linear_combination = &bus_id_f * &alpha_powers[0];
        let mut alpha_offset = 1;
        for bv in &table_interaction.values {
            let consumed = bv.accumulate_fingerprint(
                main_segment_cols,
                row,
                &alpha_powers,
                alpha_offset,
                &mut linear_combination,
                &shifts,
            );
            alpha_offset += consumed;
        }
        z - &linear_combination
    };
    #[cfg(feature = "parallel")]
    let mut fingerprints: Vec<FieldElement<E>> =
        (0..trace_len).into_par_iter().map(compute_fingerprint).collect();
    #[cfg(not(feature = "parallel"))]
    let mut fingerprints: Vec<FieldElement<E>> =
        (0..trace_len).map(compute_fingerprint).collect();

Same fix applies to the fingerprints_a/fingerprints_b blocks lower in this file.

claude · 2026-04-21T19:03:08Z

+    #[cfg(feature = "parallel")]
+    let fingerprints_a: Vec<FieldElement<E>> = (0..trace_len)
+        .into_par_iter()
+        .map(|row| {
+            let mut lc_a = &bus_id_a * &alpha_powers[0];
+            let mut alpha_offset = 1;
+            for bv in &interaction_a.values {
+                let consumed = bv.accumulate_fingerprint(
+                    main_segment_cols,
+                    row,
+                    &alpha_powers,
+                    alpha_offset,
+                    &mut lc_a,
+                    &shifts,
+                );
+                alpha_offset += consumed;
+            }
+            z - &lc_a
+        })
+        .collect();
+    #[cfg(not(feature = "parallel"))]
+    let fingerprints_a: Vec<FieldElement<E>> = (0..trace_len)
+        .map(|row| {
+            let mut lc_a = &bus_id_a * &alpha_powers[0];
+            let mut alpha_offset = 1;
+            for bv in &interaction_a.values {
+                let consumed = bv.accumulate_fingerprint(
+                    main_segment_cols,
+                    row,
+                    &alpha_powers,
+                    alpha_offset,
+                    &mut lc_a,
+                    &shifts,
+                );
+                alpha_offset += consumed;
+            }
+            z - &lc_a
+        })
+        .collect();

-    for row in 0..trace_len {
-        let mut lc_a = &bus_id_a * &alpha_powers[0];
-        let mut alpha_offset = 1;
-        for bv in &interaction_a.values {
-            let consumed = bv.accumulate_fingerprint(
-                main_segment_cols,
-                row,
-                &alpha_powers,
-                alpha_offset,
-                &mut lc_a,
-                &shifts,
-            );
-            alpha_offset += consumed;
-        }
-        all_fingerprints.push(z - &lc_a);
-    }
-    for row in 0..trace_len {
-        let mut lc_b = &bus_id_b * &alpha_powers[0];
-        let mut alpha_offset = 1;
-        for bv in &interaction_b.values {
-            let consumed = bv.accumulate_fingerprint(
-                main_segment_cols,
-                row,
-                &alpha_powers,
-                alpha_offset,
-                &mut lc_b,
-                &shifts,
-            );
-            alpha_offset += consumed;
-        }
-        all_fingerprints.push(z - &lc_b);
-    }
+    #[cfg(feature = "parallel")]
+    let fingerprints_b: Vec<FieldElement<E>> = (0..trace_len)
+        .into_par_iter()
+        .map(|row| {
+            let mut lc_b = &bus_id_b * &alpha_powers[0];
+            let mut alpha_offset = 1;
+            for bv in &interaction_b.values {
+                let consumed = bv.accumulate_fingerprint(
+                    main_segment_cols,
+                    row,
+                    &alpha_powers,
+                    alpha_offset,
+                    &mut lc_b,
+                    &shifts,
+                );
+                alpha_offset += consumed;
+            }
+            z - &lc_b
+        })
+        .collect();
+    #[cfg(not(feature = "parallel"))]
+    let fingerprints_b: Vec<FieldElement<E>> = (0..trace_len)
+        .map(|row| {
+            let mut lc_b = &bus_id_b * &alpha_powers[0];
+            let mut alpha_offset = 1;
+            for bv in &interaction_b.values {
+                let consumed = bv.accumulate_fingerprint(
+                    main_segment_cols,
+                    row,
+                    &alpha_powers,
+                    alpha_offset,
+                    &mut lc_b,
+                    &shifts,
+                );
+                alpha_offset += consumed;
+            }
+            z - &lc_b
+        })
+        .collect();


Low – Performance: fingerprints_a and fingerprints_b run sequentially

Both parallel loops are fully independent (different bus_id, different interaction.values, same immutable main_segment_cols). Running them back-to-back leaves half the potential parallelism on the table. Use rayon::join to execute both concurrently:

    let (fingerprints_a, fingerprints_b) = rayon::join(
        || (0..trace_len)
            .into_par_iter()
            .map(|row| { /* lc_a ... */ z - &lc_a })
            .collect::<Vec<_>>(),
        || (0..trace_len)
            .into_par_iter()
            .map(|row| { /* lc_b ... */ z - &lc_b })
            .collect::<Vec<_>>(),
    );

(This also halves the number of #[cfg] blocks needed here.)

claude · 2026-04-21T19:03:11Z

+            return Self::inplace_batch_inverse(numbers);
+        }
+        use rayon::prelude::*;
+        let num_chunks = rayon::current_num_threads().min(numbers.len() / 256);


Low – Correctness edge case: num_chunks can be 0 if the global thread pool size is 0

rayon::current_num_threads() always returns ≥ 1 in practice, and the < 1024 guard ensures numbers.len() / 256 ≥ 4, so this is safe today. But the invariant is implicit. A one-liner makes it explicit and guards against exotic pool configurations:

Suggested change

-        let num_chunks = rayon::current_num_threads().min(numbers.len() / 256);
+        let num_chunks = rayon::current_num_threads().max(1).min(numbers.len() / 256);
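To make the arithmetic concrete, here is a small sketch with the thread count passed in as a plain parameter (an assumption, so that it runs without rayon). Both function names are hypothetical. The first mirrors the suggested expression. The second is an alternative ordering, not part of the review, that clamps the final result and therefore also covers the case where `len / 256` is 0, which the 1024 fallback otherwise rules out.

```rust
/// Mirrors the suggested change: guards only against a zero thread count.
fn chunks_as_suggested(threads: usize, len: usize) -> usize {
    threads.max(1).min(len / 256)
}

/// Alternative ordering (an assumption, not from the PR): clamp last, so the
/// result is at least 1 even when len / 256 == 0.
fn chunks_clamped_last(threads: usize, len: usize) -> usize {
    threads.min(len / 256).max(1)
}
```

With `len >= 1024` both versions agree. They differ only for inputs that the sequential fallback already handles.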

claude · 2026-04-21T19:03:29Z

Review: perf/parallel LogUp + par_batch_inverse

No security or correctness issues. Two structural concerns worth addressing before merge.

Medium – Simplicity: `#[cfg]` duplication across all computation blocks

Every fingerprint computation in lookup.rs is written out twice — once under #[cfg(feature = "parallel")] and once under #[cfg(not(feature = "parallel"))] — with the only difference being into_par_iter() vs .map(). This affects 4 blocks (~120 duplicated lines total). Any future fix to the inner computation must be applied in two places. Fix: extract each computation to a closure before the #[cfg] branches. See inline comment on lines 1489–1507.

Low – Performance: `fingerprints_a` / `fingerprints_b` run sequentially

The two parallel loops in compute_logup_batched_term_column are independent and could run concurrently with rayon::join, recovering another ~2x within that function. See inline comment on lines 1646–1724.

Low – Defensive guard in `par_batch_inverse`

num_chunks relies on rayon::current_num_threads() >= 1 to avoid a later integer overflow. Adding .max(1) makes the invariant explicit. See inline comment on element.rs:86.

The Cargo.toml addition of math/parallel to the parallel feature and the overall parallelization strategy (embarrassingly-parallel rows, single batch inversion per chunk) are correct.

…sted Rayon over-subscription)

Replace the column-parallel LogUp auxiliary trace build (which caused Rayon over-subscription when called from an already-parallel context) with a chunk-local approach inspired by Plonky3.

Key changes:
- New `compute_logup_batched_term_column_chunked` and `compute_logup_term_column_chunked` functions process rows in chunks of 1024, fusing fingerprint computation + batch inverse + term evaluation per chunk for L2 cache locality
- Parallelism is across row-chunks (par_chunks_mut), not across interaction pairs, avoiding nested Rayon over-subscription
- New `compute_multiplicity_for_row` helper avoids materializing full Vec<FieldElement> per interaction in the chunked path
- `build_accumulated_column_from_terms` now uses parallel reduction for table_contribution and 3-phase parallel prefix sum for the accumulated column
- Sequential (non-parallel) path unchanged, using original functions
- All 121 stark tests pass with and without parallel feature
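The commit message does not spell out the three phases of the prefix sum, so the following is only a sketch of the standard construction: local scans per chunk, a short sequential scan over chunk totals, then a parallel fix-up. It assumes wrapping u64 addition as a stand-in for extension-field addition and `std::thread::scope` in place of rayon. The name `par_prefix_sum` and its signature are hypothetical.

```rust
use std::thread;

/// Inclusive prefix sum over `terms`, computed in three phases.
/// Spawns one thread per chunk, which is fine for a sketch but would be
/// replaced by a work-stealing pool in real code.
fn par_prefix_sum(terms: &[u64], chunk_len: usize) -> Vec<u64> {
    let mut out = terms.to_vec();
    if out.is_empty() {
        return out;
    }
    let chunk_len = chunk_len.max(1);

    // Phase 1 (parallel): inclusive scan inside each chunk; each thread
    // returns its chunk total.
    let totals: Vec<u64> = thread::scope(|s| {
        let handles: Vec<_> = out
            .chunks_mut(chunk_len)
            .map(|chunk| {
                s.spawn(move || {
                    let mut acc = 0u64;
                    for v in chunk.iter_mut() {
                        acc = acc.wrapping_add(*v);
                        *v = acc;
                    }
                    acc
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // Phase 2 (sequential): exclusive scan over the chunk totals, giving the
    // offset that each chunk must add to its local results.
    let mut offsets = Vec::with_capacity(totals.len());
    let mut running = 0u64;
    for t in &totals {
        offsets.push(running);
        running = running.wrapping_add(*t);
    }

    // Phase 3 (parallel): add each chunk's offset to its local prefix sums.
    thread::scope(|s| {
        for (chunk, off) in out.chunks_mut(chunk_len).zip(offsets.iter().copied()) {
            s.spawn(move || {
                for v in chunk.iter_mut() {
                    *v = v.wrapping_add(off);
                }
            });
        }
    });
    out
}
```

Only Phase 2 is sequential, and it touches one value per chunk rather than one per row.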

diegokingston · 2026-04-21T19:26:55Z

/bench 3

Two optimizations to the FRI commit phase:
1. Precompute zeta * inv_twiddles[j] once per layer (F×E = 3 base muls each). The per-row fold then uses one E×E multiply (9 base muls) instead of E×E + F×E (12 base muls). Saves ~25% of fold arithmetic.
2. Hash FRI leaves directly from evals pairs via build_from_hashed_leaves, eliminating the intermediate Vec<[FieldElement; 2]> allocation (~24MB at FRI layer 0).

Use Rayon map_init to allocate one byte buffer per thread (reused across all rows) instead of vec![0u8; N] per row. For CPU table (74 cols × 2^21 rows), this eliminates ~2M heap allocations. Applied to both commit_columns_bit_reversed (main/aux trace) and commit_composition_polynomial (composition poly).

diegokingston · 2026-04-21T19:57:03Z

/bench 3

…s isolated benchmarking)

diegokingston · 2026-04-21T20:03:28Z

/bench 3

…ot the bottleneck, Keccak dominates)

…n in chunked LogUp

diegokingston · 2026-04-21T20:25:20Z

/bench 3

FRI produces ~190 small tree builds per proof (layers 10-18 have 2-512 leaves). Rayon scheduling overhead exceeds computation for these tiny trees. Add a 1024-node threshold: below it, use sequential iteration for both leaf hashing and internal node construction.

diegokingston · 2026-04-21T20:30:03Z

/bench 3

…, blowup)

Tables with the same domain size (e.g., 7+ tables at 2^20) were each creating their own Domain (~24 MB) and LdeTwiddles (~32 MB). With ~20 tables and only 4-5 distinct sizes, this wasted ~300 MB of memory and redundant root-of-unity + twiddle generation. Now uses a HashMap cache keyed by (trace_length, blowup_factor). Domain and LdeTwiddles are shared via Arc across all tables with the same parameters.

diegokingston · 2026-04-21T20:41:56Z

/bench 3

diegokingston · 2026-04-21T20:52:33Z

/bench 3

diegokingston · 2026-04-21T21:00:45Z

/bench 3

diegokingston · 2026-04-21T21:09:42Z

/bench 3

diegokingston · 2026-04-21T21:23:20Z

/bench 3

diegokingston · 2026-04-21T21:39:48Z

/bench 3

github-actions · 2026-04-22T13:03:56Z

Codex Code Review

Findings

High (Crypto correctness bug): domain cache key is incomplete and can reuse the wrong domain

The new dedup cache in multi_prove keys by only (trace_length, blowup_factor), but new_domain() also depends on coset_offset.
If two AIRs share length/blowup but have different coset_offset, one AIR will get the other’s domain/twiddles, which can produce incorrect LDE evaluations/commitments (at best proof failure, at worst subtle protocol misuse).
References:

Medium (Performance/memory regression): accumulated-column build now materializes ~3x trace-length buffers

In build_accumulated_column_from_terms (parallel path), code allocates:
- row_sums (N)
- chunk_data local-prefix vectors (total N)
- final acc_col (N)
This significantly increases peak memory versus prior streaming approach and can become a practical bottleneck/OOM on large traces.
References:

Security-specific note

No new unsafe/memory-safety issues observed in this diff.
The first finding is still security-relevant as a cryptographic correctness issue (domain mismatch).

claude · 2026-04-22T13:07:52Z

-            let domain = new_domain(*air, trace_length);
-            let twiddles = LdeTwiddles::new(&domain);
+            let blowup = air.options().blowup_factor as usize;
+            let key = (trace_length, blowup);


Medium — incorrect cache key: coset_offset is missing

new_domain uses air.options().coset_offset (in addition to trace_length and blowup_factor) to build the LDE coset and lde_roots_of_unity_coset. If two AIRs share the same (trace_length, blowup) but use different coset offsets, the second one silently receives the wrong Domain (and therefore wrong LdeTwiddles), producing an incorrect proof.

In the current codebase all AIRs appear to use coset_offset = 3, but this assumption is not enforced here and would be a silent correctness bug if it ever changed.

Fix: include the coset offset in the key:

Suggested change

-            let key = (trace_length, blowup);
+            let coset_offset = air.options().coset_offset as usize;
+            let key = (trace_length, blowup, coset_offset);

(And update the HashMap type annotation accordingly.)
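A minimal sketch of what such a cache could look like with the coset offset included in the key. The `Domain` struct, `DomainCache`, and `get_or_build` below are hypothetical stand-ins, not the types in prover.rs, and the offset is kept as a u64 rather than cast. The point is only that every input that affects the built value appears in the key.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for the real Domain; it just records its parameters.
#[derive(Debug)]
struct Domain {
    trace_length: usize,
    blowup: usize,
    coset_offset: u64,
}

/// Every parameter that influences the built Domain is part of the key.
type DomainKey = (usize, usize, u64);

#[derive(Default)]
struct DomainCache {
    entries: HashMap<DomainKey, Arc<Domain>>,
    builds: usize, // counts how many domains were actually constructed
}

impl DomainCache {
    /// Returns the shared Domain for these parameters, building it on a miss.
    fn get_or_build(
        &mut self,
        trace_length: usize,
        blowup: usize,
        coset_offset: u64,
    ) -> Arc<Domain> {
        let key = (trace_length, blowup, coset_offset);
        if let Some(existing) = self.entries.get(&key) {
            return Arc::clone(existing);
        }
        self.builds += 1;
        let domain = Arc::new(Domain { trace_length, blowup, coset_offset });
        self.entries.insert(key, Arc::clone(&domain));
        domain
    }
}
```

Two requests with the same length and blowup but different offsets now produce two distinct entries instead of aliasing onto one.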

MauroToscano

@nicole-graus is splitting it and checking the optimization one by one

nicole-graus · 2026-04-23T15:33:31Z

/bench

nicole-graus · 2026-04-23T16:06:04Z

/bench 10

diegokingston · 2026-04-23T20:54:28Z

Pushed fixes for the two concrete findings to triage/pr518-fixes — two focused commits on top of the PR head. Cherry-pick or merge as you prefer.

(H) coset_offset in domain cache key — c7f6aaef
Adds coset_offset (u64) to the HashMap key tuple so that AIRs with same (trace_length, blowup) but different offsets don't silently alias onto the same Domain/LdeTwiddles. Zero runtime cost, 12-line diff in prover.rs. Current call sites thread one ProofOptions everywhere, so this is purely defensive — but the cache is in a generic crate and other callers shouldn't need to keep the invariant in their head.

(M) eliminate 2N allocations in build_accumulated_column_from_terms — 04602bb3
Restructures from row_sums → chunk_data → acc_col → trace.set_aux (three N-sized live buffers) to chunk_totals → chunk_data → trace.set_aux (one N-sized live buffer, plus O(num_chunks) scalars). Peak transient storage drops from ~3N to ~N extension-field elements per table — roughly 670 MiB less allocator pressure on a 14-table proof at N=2^20.

Trade-off: one extra pass over term_columns in Pass 1 (just chunk totals, parallel). The final scatter into trace.set_aux is now sequential because TraceTable is row-major + &mut self-only; one extension-field add per row is ~10 ms/table — rounding error next to the Round 1 LDE FFTs.

Verified on triage/pr518-fixes:

cargo check --workspace + cargo check -p stark --no-default-features — clean
make lint — clean (fmt + clippy default + debug-checks)
cargo test -p stark --release --lib — 121/121 pass
cargo test -p lambda-vm-prover --release --lib non-ELF subset — 247/247 pass

Would be good to re-/bench once these land on the PR branch to confirm the −5.8% prove time from commit 95c7718 holds (expect at worst flat, at best a small additional heap win).

diegokingston added 4 commits April 21, 2026 15:42

docs: add prover parallelism improvement plan (6 tasks)

41316ff

perf: parallelize LogUp fingerprint computation with rayon

1451d7c

perf: parallelize table_contribution sum with rayon reduce

3ab5504

claude Bot reviewed Apr 21, 2026

diegokingston added 2 commits April 21, 2026 16:03

revert: undo LogUp parallelization (caused +3.9% regression due to nested Rayon over-subscription)

35f8614

diegokingston added 2 commits April 21, 2026 16:54

revert: undo FRI fold twiddle precomputation (caused regression, needs isolated benchmarking)

bd7bc46

diegokingston added 2 commits April 21, 2026 17:10

revert: undo per-row buffer reuse in Merkle hashing (allocation was not the bottleneck, Keccak dominates)

64f3839

fix: extract fingerprint computation closure to reduce cfg duplication in chunked LogUp

10fe99f

diegokingston changed the title ~~docs: add prover parallelism improvement plan (6 tasks)~~ logup parallelism Apr 21, 2026

diegokingston force-pushed the investigate/prover-performance branch from 116583e to 9eec069 Compare April 21, 2026 21:02

diegokingston force-pushed the investigate/prover-performance branch from 9eec069 to 8e70557 Compare April 21, 2026 21:22

diegokingston changed the title ~~logup parallelism~~ logup parallelism and twiddle deduplication Apr 21, 2026

diegokingston added 2 commits April 21, 2026 18:23

chore: remove outdated parallelism plan (approach changed)

2ed6460

fix: resolve clippy warnings (div_ceil, loop variable, complex type)

895f8aa

style: cargo fmt --all

85e5c61

diegokingston marked this pull request as ready for review April 22, 2026 13:02

claude Bot reviewed Apr 22, 2026

diegokingston and others added 2 commits April 22, 2026 10:18

fix: update run_debug_checks to accept Arc<Domain> and Arc<LdeTwiddles>

4ec3d92

Merge branch 'main' into investigate/prover-performance

95c7718

MauroToscano requested changes Apr 23, 2026

jotabulacios closed this Apr 24, 2026

diegokingston deleted the investigate/prover-performance branch May 20, 2026 12:51
