perf: parallel prefix sum for accumulated LogUp column by nicole-graus · Pull Request #549 · yetanotherco/lambda_vm

nicole-graus · 2026-04-23T20:10:37Z

Replaces the sequential accumulated column construction in build_accumulated_column_from_terms with a 3-phase parallel prefix sum: chunk-local sums in parallel, sequential offset scan, then parallel offset application using chunks of 1024 rows.

Optimization extracted from PR #518.

github-actions · 2026-04-23T20:12:25Z

Codex Code Review

Medium (performance): crypto/stark/src/lookup.rs, 1754, 1780 turn the old streaming accumulation into a fully buffered one. The previous code used O(1) extra memory; this version always allocates row_sums and accumulated_col, and under parallel also keeps all chunk-local prefixes in chunk_data. For prover-scale trace lengths, that is a large transient memory regression and can easily become hundreds of MB to GB of extra allocation in a hot path. This is a real risk for large proofs even if the arithmetic is correct.

No security or correctness issues stood out in the diff beyond that. I could not run cargo check here because the environment blocks rustup from creating temp files under ~/.rustup.

claude · 2026-04-23T20:13:10Z

+    let row_sums: Vec<FieldElement<E>> = (0..trace_len)
+        .map(|row| {
+            let mut s = FieldElement::<E>::zero();
+            for col in term_columns {
+                s = s + &col[row];
+            }
+            s
+        })
+        .collect();


Medium – row_sums is sequential even in the parallel build path

The row_sums computation uses a plain (0..trace_len).map(...) iterator regardless of the parallel feature flag. For a large trace with many term columns this is the dominant work unit, yet it doesn't benefit from rayon at all.

Suggested change

let row_sums: Vec<FieldElement<E>> = (0..trace_len)

.map(|row| {

let mut s = FieldElement::<E>::zero();

for col in term_columns {

s = s + &col[row];

}

s

})

.collect();

#[cfg(feature = "parallel")]

let row_sums: Vec<FieldElement<E>> = {

use rayon::prelude::IntoParallelIterator;

(0..trace_len)

.into_par_iter()

.map(|row| {

let mut s = FieldElement::<E>::zero();

for col in term_columns {

s = s + &col[row];

}

s

})

.collect()

};

#[cfg(not(feature = "parallel"))]

let row_sums: Vec<FieldElement<E>> = (0..trace_len)

.map(|row| {

let mut s = FieldElement::<E>::zero();

for col in term_columns {

s = s + &col[row];

}

s

})

.collect();

claude · 2026-04-23T20:13:19Z

+    let accumulated_col = {
+        let num_chunks = trace_len.div_ceil(LOGUP_CHUNK_SIZE);
+
+        // Phase 1: Compute chunk-local prefix sums
+        let chunk_data: Vec<(Vec<FieldElement<E>>, FieldElement<E>)> = (0..num_chunks)
+            .into_par_iter()
+            .map(|chunk_idx| {
+                let start = chunk_idx * LOGUP_CHUNK_SIZE;
+                let end = (start + LOGUP_CHUNK_SIZE).min(trace_len);
+
+                let mut local_prefix = Vec::with_capacity(end - start);
+                let mut acc = FieldElement::<E>::zero();
+                for rs in &row_sums[start..end] {
+                    acc = &acc + rs - &offset_per_row;
+                    local_prefix.push(acc.clone());
+                }
+                let chunk_total = acc;
+                (local_prefix, chunk_total)
+            })
+            .collect();
+
+        // Phase 2: Sequential scan of chunk totals to get per-chunk offsets
+        let mut chunk_offsets = Vec::with_capacity(num_chunks);
+        let mut running = FieldElement::<E>::zero();
+        for (_, chunk_total) in &chunk_data {
+            chunk_offsets.push(running.clone());
+            running = &running + chunk_total;
        }
-        accumulated = &accumulated + &row_sum - &offset_per_row;
-        trace.set_aux(row, acc_column_idx, accumulated.clone());
+
+        // Phase 3: Build final accumulated vector (parallel across chunks)
+        let mut acc_col = vec![FieldElement::<E>::zero(); trace_len];


Low – ~3× memory overhead vs. original

The parallel path now materialises three trace_len-sized allocations simultaneously:

row_sums – trace_len elements

chunk_data – all local_prefix vecs combined hold trace_len elements

acc_col – trace_len elements

The original sequential code used O(1) extra space (just a scalar accumulator). This is an inherent tradeoff for the parallelism, but worth calling out for callers on memory-constrained environments.

One small win: chunk_data could be split into a flat partial_sums: Vec<FieldElement<E>> (reusing the same memory as acc_col with an in-place Phase-3 update), which would drop the overhead from 3× to 2×. Not blocking, just noting the headroom.

claude · 2026-04-23T20:13:25Z

+    // Build circular accumulated column using 3-phase parallel prefix sum.
+    //
+    // Phase 1: Compute chunk-local prefix sums in parallel.
+    //   Each chunk computes partial_sums[i] = Σ(row_sums[j] - offset) for j in chunk.


Nit – comment describes "sum" but the code computes a prefix sum

partial_sums[i] = Σ(row_sums[j] - offset) for j in chunk

That reads like a total sum. It's actually a running prefix sum: element i holds Σ(row_sums[j] - offset) for j in start..start+i+1. The distinction matters because Phase 3 relies on this being a prefix (not a per-chunk total), so clarifying it avoids future confusion.

claude · 2026-04-23T20:13:39Z

Review: perf: parallel prefix sum for accumulated LogUp column

Summary: The 3-phase parallel prefix sum algorithm is mathematically correct. The circular constraint acc[N-1] = L - N·(L/N) = 0 is preserved: chunk offsets are constructed so that the final element across all chunks reduces correctly to zero.

Issues

Severity	Issue
Medium	`row_sums` computation is sequential even with `feature = "parallel"` enabled — see inline
Low	~3× peak memory vs. original O(1) accumulator — see inline
Nit	Phase 1 block-comment says "sum" but means "prefix sum" — see inline

No issues found with

Algorithm correctness (prefix sum identity verified algebraically)
Rayon data-race safety (par_chunks_mut gives disjoint mutable slices; chunk_data/chunk_offsets are read-only in Phase 3)
Panic safety (n.inv().unwrap() on trace_len = 0 is pre-existing and guarded upstream)
reduce identity: FieldElement::zero is correct for additive reduction

nicole-graus · 2026-04-23T20:14:58Z

/bench 10

github-actions · 2026-04-23T20:37:45Z

Benchmark — fib_iterative_8M (median of 10)

_{Table parallelism: 32 (auto = cores / 3)}

Metric	main	PR	Δ
Peak heap	63703 MB	65018 MB	+1315 MB (+2.1%) ⚪
Prove time	31.958s	32.063s	+0.105s (+0.3%) ⚪

✅ No significant change.

⚠️ Heap spread: 8.2% (66935 MB / 62569 MB / 65018 MB / 67920 MB / 62857 MB / 65087 MB / 64809 MB / 64163 MB / 65090 MB / 67740 MB)
⚠️ Baseline heap spread: 5.5% (62951 MB / 64057 MB / 66435 MB / 63703 MB / 63283 MB / 64822 MB / 65277 MB / 63543 MB / 63307 MB / 63922 MB) — comparison may be less reliable
Consider re-running /bench

_{Commit: 676e93a · Baseline: built from main · Runner: self-hosted bench}

build accumulated LogUp column with 3-phase parallel prefix sum

676e93a

claude Bot reviewed Apr 23, 2026

View reviewed changes

nicole-graus closed this Apr 24, 2026

diegokingston deleted the logup-parallel-prefix-sum branch May 20, 2026 12:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: parallel prefix sum for accumulated LogUp column#549

perf: parallel prefix sum for accumulated LogUp column#549
nicole-graus wants to merge 1 commit into
mainfrom
logup-parallel-prefix-sum

nicole-graus commented Apr 23, 2026

Uh oh!

github-actions Bot commented Apr 23, 2026

Uh oh!

claude Bot Apr 23, 2026

Uh oh!

claude Bot Apr 23, 2026

Uh oh!

claude Bot Apr 23, 2026

Uh oh!

claude Bot commented Apr 23, 2026

Uh oh!

nicole-graus commented Apr 23, 2026

Uh oh!

github-actions Bot commented Apr 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

nicole-graus commented Apr 23, 2026

Uh oh!

github-actions Bot commented Apr 23, 2026

Codex Code Review

Uh oh!

claude Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot commented Apr 23, 2026

Review: perf: parallel prefix sum for accumulated LogUp column

Issues

No issues found with

Uh oh!

nicole-graus commented Apr 23, 2026

Uh oh!

github-actions Bot commented Apr 23, 2026

Benchmark — fib_iterative_8M (median of 10)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant