Skip to content

perf: sparse gather for LogUp aux trace when multiplicity is selector…#565

Open
diegokingston wants to merge 8 commits into
mainfrom
perf/sparse-logup-aux-trace
Open

perf: sparse gather for LogUp aux trace when multiplicity is selector…#565
diegokingston wants to merge 8 commits into
mainfrom
perf/sparse-logup-aux-trace

Conversation

@diegokingston
Copy link
Copy Markdown
Collaborator

…-gated

Many bus interactions are gated by mutually-exclusive selector columns (e.g., AND/OR/XOR op flags, ADD/SUB/MUL op flags): when the selector is zero, the row's contribution to the term column is zero regardless of the fingerprint. We now detect Column/Sum/Sum3 multiplicities and, when fewer than 7/8 of a chunk's rows are active, compute fingerprints only at the active rows before batch-inverting. Inactive rows stay at zero from the initial vec![zero; trace_len] allocation.

Soundness: sparse_active_rows marks a row as active whenever any contributing column is non-zero, which is an over-approximation of m != 0 (cancellation can only cause us to process more rows, never fewer). Dense multiplicities (One, Negated, Diff, Linear) fall back to the unchanged contiguous path.

Applies to both single (compute_logup_term_column) and paired (compute_logup_batched_term_column) term-column builders. Verified via all 71 stark bus soundness tests and 265 prover lib tests.

…-gated

Many bus interactions are gated by mutually-exclusive selector columns
(e.g., AND/OR/XOR op flags, ADD/SUB/MUL op flags): when the selector is
zero, the row's contribution to the term column is zero regardless of
the fingerprint. We now detect Column/Sum/Sum3 multiplicities and, when
fewer than 7/8 of a chunk's rows are active, compute fingerprints only
at the active rows before batch-inverting. Inactive rows stay at zero
from the initial `vec![zero; trace_len]` allocation.

Soundness: `sparse_active_rows` marks a row as active whenever any
contributing column is non-zero, which is an over-approximation of
`m != 0` (cancellation can only cause us to process more rows, never
fewer). Dense multiplicities (One, Negated, Diff, Linear) fall back to
the unchanged contiguous path.

Applies to both single (`compute_logup_term_column`) and paired
(`compute_logup_batched_term_column`) term-column builders. Verified
via all 71 stark bus soundness tests and 265 prover lib tests.
@diegokingston
Copy link
Copy Markdown
Collaborator Author

/bench 3

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 24, 2026

Benchmark — fib_iterative_8M (median of 3)

Table parallelism: auto (cores / 3)

Metric main PR Δ
Peak heap 57694 MB 56843 MB -851 MB (-1.5%) ⚪
Prove time 26.343s 25.760s -0.583s (-2.2%) ⚪

✅ No significant change.

✅ Low variance (time: 0.8%, heap: 0.9%)

Commit: 7047d4d · Baseline: cached · Runner: self-hosted bench

@github-actions
Copy link
Copy Markdown

Codex Code Review

Findings

  • Medium - Performance/OOM risk: crypto/stark/src/lookup.rs:1489
    In default non-parallel builds, process_chunk(0, &mut result) makes chunk_len == trace_len. For dense Column/Sum/Sum3 multiplicities, sparse_active_rows first allocates and fills a Vec<usize> up to the full trace size, then discards it and reruns the dense path. On large traces and many interactions this can be a significant regression or memory spike. Consider bounded sequential chunks too, or early-abort sparse collection once the threshold is exceeded without retaining a full active list.

No security vulnerabilities or correctness bugs found in the PR diff.

I attempted cargo check, but this environment could not complete it: rustup/cargo needed writes under read-only home dirs and then network/git dependencies were unavailable.

let active: Vec<usize> = (0..chunk_len)
.filter(|&i| c[chunk_start + i] != zero)
.collect();
collect_if_sparse(active)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium – Wasted allocation on the dense fallback path

The full active: Vec<usize> is built (up to chunk_len elements) before the threshold check. When more than 7/8 of the rows are active (the dense case), the Vec is immediately discarded by collect_if_sparse returning None. This allocation+drop happens on every chunk for every interaction that turns out to be dense — exactly the hot path you're trying to keep fast.

A cheaper approach: count active rows first, bail early, then collect only when it's known to be worthwhile:

Multiplicity::Column(col) => {
    let c = &main_segment_cols[*col];
    let count = (0..chunk_len).filter(|&i| c[chunk_start + i] != zero).count();
    if count * SPARSE_DEN > chunk_len * SPARSE_NUM {
        return None;
    }
    let active: Vec<usize> = (0..chunk_len)
        .filter(|&i| c[chunk_start + i] != zero)
        .collect();
    Some(active)
}

(Same pattern for Sum and Sum3.) The double pass over chunk_len elements is cheap compared to a heap allocation + copy, and chunk_len = 1024 fits in L1.

Comment on lines +1808 to 1815
let indices_a: Vec<usize> = match &active_a {
Some(v) => v.clone(),
None => (0..chunk_len).collect(),
};
let indices_b: Vec<usize> = match &active_b {
Some(v) => v.clone(),
None => (0..chunk_len).collect(),
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium – Unnecessary Vec clone

active_a and active_b are owned Option<Vec<usize>> values that are never used again after this point. Matching on &active_a forces a .clone() in the Some branch. Use unwrap_or_else to consume them directly:

Suggested change
let indices_a: Vec<usize> = match &active_a {
Some(v) => v.clone(),
None => (0..chunk_len).collect(),
};
let indices_b: Vec<usize> = match &active_b {
Some(v) => v.clone(),
None => (0..chunk_len).collect(),
};
let indices_a = active_a.unwrap_or_else(|| (0..chunk_len).collect());
let indices_b = active_b.unwrap_or_else(|| (0..chunk_len).collect());

This avoids a heap allocation + copy for every chunk in the half-sparse case (one sparse, one dense interaction).

/// is non-zero — which is an over-approximation of `m != 0` (we don't miss any
/// non-zero rows, we just might process some m=0 rows due to cancellation).
///
/// The threshold check: if more than `SPARSE_ACTIVE_FRAC` of the chunk is active,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low – Doc refers to a non-existent constant name

The comment says SPARSE_ACTIVE_FRAC but no such constant exists; the actual constants are SPARSE_NUM and SPARSE_DEN defined inside the function body. Update to match:

Suggested change
/// The threshold check: if more than `SPARSE_ACTIVE_FRAC` of the chunk is active,
/// The threshold check: if more than `SPARSE_NUM / SPARSE_DEN` (7/8) of the chunk is active,

@claude
Copy link
Copy Markdown

claude Bot commented Apr 24, 2026

Review of perf/sparse-logup-aux-trace

Overview: Adds a sparse fast-path to the LogUp auxiliary trace builder. When a multiplicity is structurally identifiable as sparse (Column/Sum/Sum3 gated by selectors with <7/8 of rows active), it skips fingerprint computation for zero-multiplicity rows and batch-inverts only the active subset.

Soundness: The over-approximation invariant holds — sparse_active_rows always includes every row where m≠0, so no non-zero contribution is silently dropped. The dense fallback for One/Negated/Diff/Linear is correct.

Correctness: The scatter-then-add pattern in the half-sparse batched path relies on result_chunk starting at zero, which is guaranteed by the vec![FieldElement::<E>::zero(); trace_len] initialization at the call site. This implicit contract is safe as long as result is never reused.

Issues found (inline comments posted):

Severity Location Issue
Medium sparse_active_rows (Column/Sum/Sum3 arms) Full Vec<usize> is allocated before the density threshold check — wasted allocation on every dense chunk
Medium batched sparse path, lines 1808-1815 v.clone() on owned Option<Vec<usize>> — use unwrap_or_else to consume directly
Low function docstring, line 1459 References constant SPARSE_ACTIVE_FRAC which doesn't exist

No security or correctness bugs found. The soundness argument is well-reasoned and the existing test suite (71 bus + 265 prover tests) covers the behavioral contract.

One design note: the half-sparse case (one dense, one sparse interaction) still falls through to the "sparse path" which allocates a full (0..chunk_len) index Vec for the dense interaction. If this case is common, a dedicated (dense, sparse) branch would avoid that allocation entirely — but only worth it if profiling confirms it.

…debug_assertions

- `cargo fmt` on sparse LogUp changes
- Polynomial import in constraints/evaluator.rs is only used inside
  `#[cfg(all(debug_assertions, not(feature = "parallel")))]`; match the
  import cfg to the use site so release builds without `parallel` don't
  warn. (Pre-existing on origin/main but surfaced by CI lint here.)
@jotabulacios
Copy link
Copy Markdown
Collaborator

/bench

@diegokingston
Copy link
Copy Markdown
Collaborator Author

/bench 5

@diegokingston diegokingston marked this pull request as ready for review May 20, 2026 12:18
@github-actions
Copy link
Copy Markdown

Codex Code Review

Findings

  • Performance: crypto/stark/src/lookup.rs:1489 and crypto/stark/src/lookup.rs:1807 can allocate Vec<usize> proportional to the full trace length in non-parallel builds, because process_chunk(0, &mut result) uses the entire trace as one chunk. For dense Column/Sum/Sum3 multiplicities, sparse_active_rows first collects nearly all rows before deciding to fall back to dense, and the half-sparse batched path materializes (0..chunk_len).collect() for the dense side. On large STARK traces this is a real memory/time regression and can OOM. Prefer early-aborting collection once the sparse threshold is exceeded, and avoid materializing dense index ranges in the half-sparse path.

No security vulnerabilities or correctness bugs found in the changed logic.

I attempted to run cargo test -p stark --no-default-features lookup --lib, but the environment could not update rustup because /home/runner/.rustup is read-only.

Comment on lines +1807 to 1814
let indices_a: Vec<usize> = match &active_a {
Some(v) => v.clone(),
None => (0..chunk_len).collect(),
};
let indices_b: Vec<usize> = match &active_b {
Some(v) => v.clone(),
None => (0..chunk_len).collect(),
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low: unnecessary clone + silent perf regression in the half-sparse case

active_a/active_b are not used after these bindings, so the .clone() is avoidable. More importantly, when exactly one side is None (e.g. A is dense, B is sparse), this path allocates a full (0..chunk_len).collect() and then processes the dense interaction with indirect indexing — which is strictly worse than the dedicated dense branch above that uses sequential access. The dense-path early-return only triggers when both are None; the half-sparse case silently regresses.

Suggested fixes:

  1. Consume the Option to avoid the clone: active_a.unwrap_or_else(|| (0..chunk_len).collect()).
  2. Add dedicated half-sparse branches (active_a.is_some() && active_b.is_none() and vice versa) that only scatter the sparse side, similar to how compute_logup_term_column handles its sparse path. The dense side can use the same sequential loop as the dense-path branch.

@claude
Copy link
Copy Markdown

claude Bot commented May 20, 2026

Review: perf/sparse-logup-aux-trace

The optimization is sound. The over-approximation argument in the PR description is correct: marking a row active when any contributing column is non-zero guarantees every row with non-zero multiplicity is processed; field cancellation (e.g. col_a + col_b = 0 with both non-zero) just causes harmless extra work (0 * fp_inv = 0). The empty-fingerprint edge case (all rows inactive) is handled correctly by inplace_batch_inverse_sequential's early return. The evaluator.rs import gate change is correct — Polynomial is used only in the debug_assertions block.

Low: half-sparse inefficiency in compute_logup_batched_term_column — see inline comment. When exactly one interaction is dense (None) and the other is sparse (Some), the current code falls to the sparse path and allocates a full (0..chunk_len) index vec for the dense side, then processes it with indirect indexing. This is both an extra allocation and slightly slower than sequential access. The single-interaction function (compute_logup_term_column) doesn't have this issue since it only has one arm.

No correctness or security issues found.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants