Skip to content

perf: parallel prefix sum for accumulated LogUp column#549

Closed
nicole-graus wants to merge 1 commit into
mainfrom
logup-parallel-prefix-sum
Closed

perf: parallel prefix sum for accumulated LogUp column#549
nicole-graus wants to merge 1 commit into
mainfrom
logup-parallel-prefix-sum

Conversation

@nicole-graus
Copy link
Copy Markdown
Collaborator

Replaces the sequential accumulated column construction in build_accumulated_column_from_terms with a 3-phase parallel prefix sum: chunk-local sums in parallel, sequential offset scan, then parallel offset application using chunks of 1024 rows.

Optimization extracted from PR #518.

@github-actions
Copy link
Copy Markdown

Codex Code Review

  • Medium (performance): crypto/stark/src/lookup.rs, 1754, 1780 turn the old streaming accumulation into a fully buffered one. The previous code used O(1) extra memory; this version always allocates row_sums and accumulated_col, and under parallel also keeps all chunk-local prefixes in chunk_data. For prover-scale trace lengths, that is a large transient memory regression and can easily become hundreds of MB to GB of extra allocation in a hot path. This is a real risk for large proofs even if the arithmetic is correct.

No security or correctness issues stood out in the diff beyond that. I could not run cargo check here because the environment blocks rustup from creating temp files under ~/.rustup.

Comment on lines +1714 to +1722
let row_sums: Vec<FieldElement<E>> = (0..trace_len)
.map(|row| {
let mut s = FieldElement::<E>::zero();
for col in term_columns {
s = s + &col[row];
}
s
})
.collect();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium – row_sums is sequential even in the parallel build path

The row_sums computation uses a plain (0..trace_len).map(...) iterator regardless of the parallel feature flag. For a large trace with many term columns this is the dominant work unit, yet it doesn't benefit from rayon at all.

Suggested change
let row_sums: Vec<FieldElement<E>> = (0..trace_len)
.map(|row| {
let mut s = FieldElement::<E>::zero();
for col in term_columns {
s = s + &col[row];
}
s
})
.collect();
#[cfg(feature = "parallel")]
let row_sums: Vec<FieldElement<E>> = {
use rayon::prelude::IntoParallelIterator;
(0..trace_len)
.into_par_iter()
.map(|row| {
let mut s = FieldElement::<E>::zero();
for col in term_columns {
s = s + &col[row];
}
s
})
.collect()
};
#[cfg(not(feature = "parallel"))]
let row_sums: Vec<FieldElement<E>> = (0..trace_len)
.map(|row| {
let mut s = FieldElement::<E>::zero();
for col in term_columns {
s = s + &col[row];
}
s
})
.collect();

Comment on lines +1750 to +1780
let accumulated_col = {
let num_chunks = trace_len.div_ceil(LOGUP_CHUNK_SIZE);

// Phase 1: Compute chunk-local prefix sums
let chunk_data: Vec<(Vec<FieldElement<E>>, FieldElement<E>)> = (0..num_chunks)
.into_par_iter()
.map(|chunk_idx| {
let start = chunk_idx * LOGUP_CHUNK_SIZE;
let end = (start + LOGUP_CHUNK_SIZE).min(trace_len);

let mut local_prefix = Vec::with_capacity(end - start);
let mut acc = FieldElement::<E>::zero();
for rs in &row_sums[start..end] {
acc = &acc + rs - &offset_per_row;
local_prefix.push(acc.clone());
}
let chunk_total = acc;
(local_prefix, chunk_total)
})
.collect();

// Phase 2: Sequential scan of chunk totals to get per-chunk offsets
let mut chunk_offsets = Vec::with_capacity(num_chunks);
let mut running = FieldElement::<E>::zero();
for (_, chunk_total) in &chunk_data {
chunk_offsets.push(running.clone());
running = &running + chunk_total;
}
accumulated = &accumulated + &row_sum - &offset_per_row;
trace.set_aux(row, acc_column_idx, accumulated.clone());

// Phase 3: Build final accumulated vector (parallel across chunks)
let mut acc_col = vec![FieldElement::<E>::zero(); trace_len];
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low – ~3× memory overhead vs. original

The parallel path now materialises three trace_len-sized allocations simultaneously:

  1. row_sumstrace_len elements
  2. chunk_data – all local_prefix vecs combined hold trace_len elements
  3. acc_coltrace_len elements

The original sequential code used O(1) extra space (just a scalar accumulator). This is an inherent tradeoff for the parallelism, but worth calling out for callers on memory-constrained environments.

One small win: chunk_data could be split into a flat partial_sums: Vec<FieldElement<E>> (reusing the same memory as acc_col with an in-place Phase-3 update), which would drop the overhead from 3× to 2×. Not blocking, just noting the headroom.

// Build circular accumulated column using 3-phase parallel prefix sum.
//
// Phase 1: Compute chunk-local prefix sums in parallel.
// Each chunk computes partial_sums[i] = Σ(row_sums[j] - offset) for j in chunk.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit – comment describes "sum" but the code computes a prefix sum

partial_sums[i] = Σ(row_sums[j] - offset) for j in chunk

That reads like a total sum. It's actually a running prefix sum: element i holds Σ(row_sums[j] - offset) for j in start..start+i+1. The distinction matters because Phase 3 relies on this being a prefix (not a per-chunk total), so clarifying it avoids future confusion.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 23, 2026

Review: perf: parallel prefix sum for accumulated LogUp column

Summary: The 3-phase parallel prefix sum algorithm is mathematically correct. The circular constraint acc[N-1] = L - N·(L/N) = 0 is preserved: chunk offsets are constructed so that the final element across all chunks reduces correctly to zero.

Issues

Severity Issue
Medium row_sums computation is sequential even with feature = "parallel" enabled — see inline
Low ~3× peak memory vs. original O(1) accumulator — see inline
Nit Phase 1 block-comment says "sum" but means "prefix sum" — see inline

No issues found with

  • Algorithm correctness (prefix sum identity verified algebraically)
  • Rayon data-race safety (par_chunks_mut gives disjoint mutable slices; chunk_data/chunk_offsets are read-only in Phase 3)
  • Panic safety (n.inv().unwrap() on trace_len = 0 is pre-existing and guarded upstream)
  • reduce identity: FieldElement::zero is correct for additive reduction

@nicole-graus
Copy link
Copy Markdown
Collaborator Author

/bench 10

@github-actions
Copy link
Copy Markdown

Benchmark — fib_iterative_8M (median of 10)

Table parallelism: 32 (auto = cores / 3)

Metric main PR Δ
Peak heap 63703 MB 65018 MB +1315 MB (+2.1%) ⚪
Prove time 31.958s 32.063s +0.105s (+0.3%) ⚪

✅ No significant change.

⚠️ Heap spread: 8.2% (66935 MB / 62569 MB / 65018 MB / 67920 MB / 62857 MB / 65087 MB / 64809 MB / 64163 MB / 65090 MB / 67740 MB)
⚠️ Baseline heap spread: 5.5% (62951 MB / 64057 MB / 66435 MB / 63703 MB / 63283 MB / 64822 MB / 65277 MB / 63543 MB / 63307 MB / 63922 MB) — comparison may be less reliable
Consider re-running /bench

Commit: 676e93a · Baseline: built from main · Runner: self-hosted bench

@diegokingston diegokingston deleted the logup-parallel-prefix-sum branch May 20, 2026 12:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant