perf: parallel prefix sum for accumulated LogUp column #549
Codex Code Review
No security or correctness issues stood out in the diff beyond that. I could not run |
| let row_sums: Vec<FieldElement<E>> = (0..trace_len) | ||
| .map(|row| { | ||
| let mut s = FieldElement::<E>::zero(); | ||
| for col in term_columns { | ||
| s = s + &col[row]; | ||
| } | ||
| s | ||
| }) | ||
| .collect(); |
Medium – row_sums is sequential even in the parallel build path
The row_sums computation uses a plain (0..trace_len).map(...) iterator regardless of the parallel feature flag. For a large trace with many term columns this is the dominant work unit, yet it doesn't benefit from rayon at all.
| let row_sums: Vec<FieldElement<E>> = (0..trace_len) | |
| .map(|row| { | |
| let mut s = FieldElement::<E>::zero(); | |
| for col in term_columns { | |
| s = s + &col[row]; | |
| } | |
| s | |
| }) | |
| .collect(); | |
| #[cfg(feature = "parallel")] | |
| let row_sums: Vec<FieldElement<E>> = { | |
| use rayon::prelude::{IntoParallelIterator, ParallelIterator}; | |
| (0..trace_len) | |
| .into_par_iter() | |
| .map(|row| { | |
| let mut s = FieldElement::<E>::zero(); | |
| for col in term_columns { | |
| s = s + &col[row]; | |
| } | |
| s | |
| }) | |
| .collect() | |
| }; | |
| #[cfg(not(feature = "parallel"))] | |
| let row_sums: Vec<FieldElement<E>> = (0..trace_len) | |
| .map(|row| { | |
| let mut s = FieldElement::<E>::zero(); | |
| for col in term_columns { | |
| s = s + &col[row]; | |
| } | |
| s | |
| }) | |
| .collect(); |
| let accumulated_col = { | ||
| let num_chunks = trace_len.div_ceil(LOGUP_CHUNK_SIZE); | ||
| // Phase 1: Compute chunk-local prefix sums | ||
| let chunk_data: Vec<(Vec<FieldElement<E>>, FieldElement<E>)> = (0..num_chunks) | ||
| .into_par_iter() | ||
| .map(|chunk_idx| { | ||
| let start = chunk_idx * LOGUP_CHUNK_SIZE; | ||
| let end = (start + LOGUP_CHUNK_SIZE).min(trace_len); | ||
| let mut local_prefix = Vec::with_capacity(end - start); | ||
| let mut acc = FieldElement::<E>::zero(); | ||
| for rs in &row_sums[start..end] { | ||
| acc = &acc + rs - &offset_per_row; | ||
| local_prefix.push(acc.clone()); | ||
| } | ||
| let chunk_total = acc; | ||
| (local_prefix, chunk_total) | ||
| }) | ||
| .collect(); | ||
| // Phase 2: Sequential scan of chunk totals to get per-chunk offsets | ||
| let mut chunk_offsets = Vec::with_capacity(num_chunks); | ||
| let mut running = FieldElement::<E>::zero(); | ||
| for (_, chunk_total) in &chunk_data { | ||
| chunk_offsets.push(running.clone()); | ||
| running = &running + chunk_total; | ||
| } | ||
| accumulated = &accumulated + &row_sum - &offset_per_row; | ||
| trace.set_aux(row, acc_column_idx, accumulated.clone()); | ||
| // Phase 3: Build final accumulated vector (parallel across chunks) | ||
| let mut acc_col = vec![FieldElement::<E>::zero(); trace_len]; |
Low – ~3× memory overhead vs. original
The parallel path now materialises three trace_len-sized allocations simultaneously:
- row_sums – trace_len elements
- chunk_data – all local_prefix vecs combined hold trace_len elements
- acc_col – trace_len elements
The original sequential code used O(1) extra space (just a scalar accumulator). This is an inherent tradeoff for the parallelism, but worth calling out for callers in memory-constrained environments.
One small win: chunk_data could be split into a flat partial_sums: Vec<FieldElement<E>> (reusing the same memory as acc_col with an in-place Phase-3 update), which would drop the overhead from 3× to 2×. Not blocking, just noting the headroom.
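A minimal sketch of that in-place idea, under loud assumptions: `u64` arithmetic modulo a toy prime stands in for `FieldElement<E>`, `std::thread::scope` with one thread per chunk stands in for rayon's pool, and the function name `accumulate_in_place` is hypothetical. It is not the PR's code; it only shows that Phase 1 can write local prefixes straight into the output buffer and Phase 3 can patch them in place, so only `row_sums` and `acc_col` are `trace_len`-sized.

```rust
use std::thread;

/// Toy prime modulus standing in for the real field (assumption, not from the PR).
const P: u64 = 2_147_483_647; // 2^31 - 1
/// Chunk size; the PR description mentions chunks of 1024 rows.
const CHUNK: usize = 1024;

/// Returns acc[i] = sum over j <= i of (row_sums[j] - offset) mod P,
/// using a single trace_len-sized output buffer.
fn accumulate_in_place(row_sums: &[u64], offset: u64) -> Vec<u64> {
    let offset = offset % P;
    let mut acc_col = vec![0u64; row_sums.len()];

    // Phase 1: chunk-local prefix sums, written directly into acc_col.
    let totals: Vec<u64> = thread::scope(|s| {
        let handles: Vec<_> = acc_col
            .chunks_mut(CHUNK)
            .zip(row_sums.chunks(CHUNK))
            .map(|(out, input)| {
                s.spawn(move || {
                    let mut acc = 0u64;
                    for (o, rs) in out.iter_mut().zip(input) {
                        acc = (acc + *rs % P + P - offset) % P;
                        *o = acc;
                    }
                    acc // chunk total = last local prefix
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // Phase 2: sequential exclusive scan over the chunk totals.
    let mut offsets = Vec::with_capacity(totals.len());
    let mut running = 0u64;
    for t in &totals {
        offsets.push(running);
        running = (running + t) % P;
    }

    // Phase 3: add each chunk's offset to its local prefixes, in place.
    thread::scope(|s| {
        for (out, off) in acc_col.chunks_mut(CHUNK).zip(offsets) {
            s.spawn(move || {
                for o in out.iter_mut() {
                    *o = (*o + off) % P;
                }
            });
        }
    });

    acc_col
}
```

Comparing the result against a plain sequential running sum on the same input is a quick way to check a variant like this. With rayon, the same shape would presumably use `par_chunks_mut` instead of spawning a thread per chunk.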
| // Build circular accumulated column using 3-phase parallel prefix sum. | ||
| // | ||
| // Phase 1: Compute chunk-local prefix sums in parallel. | ||
| // Each chunk computes partial_sums[i] = Σ(row_sums[j] - offset) for j in chunk. |
Nit – comment describes "sum" but the code computes a prefix sum
partial_sums[i] = Σ(row_sums[j] - offset) for j in chunk
That reads like a total sum. It's actually a running prefix sum: element i holds Σ(row_sums[j] - offset) for j in start..start+i+1. The distinction matters because Phase 3 relies on this being a prefix (not a per-chunk total), so clarifying it avoids future confusion.
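For reference, the three phases can be written out as formulas. The notation here is an assumption for illustration, not taken from the PR: $r_j$ is `row_sums[j]`, $o$ is `offset_per_row`, $s_c$ is the first row of chunk $c$, $n_c$ is its length, $p_c$ is its `local_prefix`, and $T_c$ is its `chunk_total`.

```latex
\begin{aligned}
\text{Phase 1:}\quad & p_c[i] = \sum_{j=s_c}^{s_c+i} (r_j - o), \qquad T_c = p_c[n_c - 1] \\
\text{Phase 2:}\quad & O_c = \sum_{k<c} T_k \\
\text{Phase 3:}\quad & \mathrm{acc}[s_c + i] = O_c + p_c[i] = \sum_{j=0}^{s_c+i} (r_j - o)
\end{aligned}
```

The last equality holds only because $p_c[i]$ is a running prefix inside the chunk; if Phase 1 stored just the per-chunk total, Phase 3 could not recover the per-row values.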
Review: perf: parallel prefix sum for accumulated LogUp column
Summary: The 3-phase parallel prefix sum algorithm is mathematically correct. The circular constraint
Issues
No issues found with
/bench 10
Benchmark: fib_iterative_8M (median of 10)
Table parallelism: 32 (auto = cores / 3)
Commit: 676e93a · Baseline: built from main · Runner: self-hosted bench
Replaces the sequential accumulated column construction in build_accumulated_column_from_terms with a 3-phase parallel prefix sum: chunk-local sums in parallel, sequential offset scan, then parallel offset application using chunks of 1024 rows.

Optimization extracted from PR #518.