Skip to content

optimize rounds 3-4 — direct 2N DEEP evaluation + OOD stride reads#522

Merged
MauroToscano merged 17 commits into
mainfrom
perf/ood-evaluation
Apr 30, 2026
Merged

optimize rounds 3-4 — direct 2N DEEP evaluation + OOD stride reads#522
MauroToscano merged 17 commits into
mainfrom
perf/ood-evaluation

Conversation

@diegokingston
Copy link
Copy Markdown
Collaborator

@diegokingston diegokingston commented Apr 22, 2026

Reduce memory and compute in rounds 3–4:

  • OOD: pass shared scalars from round_3, hoist col_scale
    precompute, read LDE with stride (no Vec copies).
  • DEEP: column compression + direct evaluation at all 2N LDE points
    (no iFFT+FFT).

…, and stride reads

Three optimizations to get_trace_evaluations_from_lde:

1. Deduplicate coset_points/coset_offset_pow_n/n_inv/g_n_inv: these were
   computed identically in both round_3 (prover.rs) and the trace function.
   Now computed once in round_3 and passed as parameters.

2. Precompute col_scale[i] = coset_point[i] * inv_denom[i] once per eval
   point, shared across all columns. Eliminates N redundant F×E multiplies
   per column (saves ~95*N*2 F×E muls for CPU table).

3. Read LDE columns directly with stride (lde_col[i*bf]) instead of
   extracting N-element Vec copies per column. Eliminates num_cols * N
   field element allocations and copies.
@diegokingston
Copy link
Copy Markdown
Collaborator Author

/bench 3

@github-actions
Copy link
Copy Markdown

Codex Code Review

  1. Medium – Missing input consistency checks can cause silent wrong evaluations or panic
  • Where: trace.rs:374, trace.rs:411, trace.rs:431, trace.rs:452
  • Issue: get_trace_evaluations_from_lde now trusts externally supplied coset_points, coset_offset_pow_n, n_inv, g_n_inv with no validation against domain.
    • If coset_points.len() < n, interpolation is silently computed on a truncated domain (incorrect result, no error).
    • If coset_points.len() > n, lde_col[i * bf] can go out of bounds and panic.
  • Why this matters: This function previously derived these values internally, so mismatches were impossible. The refactor introduced a new correctness/panic surface.
  • Fix: Add explicit checks (coset_points.len() == n, expected LDE column lengths) and consider returning Result on invalid input; at minimum use strong assertions with clear messages.

No other concrete security/performance/simplicity issues found in this diff.

I couldn’t run tests in this environment because cargo failed due a read-only rustup temp path (/home/runner/.rustup/tmp/...).

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 22, 2026

Benchmark — fib_iterative_8M (median of 5)

Table parallelism: auto (cores / 3)

Metric main PR Δ
Peak heap 56461 MB 53059 MB -3402 MB (-6.0%) 🟢
Prove time 26.060s 25.527s -0.533s (-2.0%) ⚪

🎉 Improvement detected — heap or time decreased by more than 5%.

✅ Low variance (time: 1.7%, heap: 1.3%)

Commit: 5f4c79f · Baseline: cached · Runner: self-hosted bench

Comment thread crypto/stark/src/trace.rs
Comment thread crypto/stark/src/trace.rs
@diegokingston
Copy link
Copy Markdown
Collaborator Author

/bench 3

@diegokingston
Copy link
Copy Markdown
Collaborator Author

/bench 3

Instead of computing DEEP at N trace-coset points then extending to 2N
via iFFT(N) + FFT(2N), compute directly at all 2N LDE points. The extra
N point evaluations (~8 ext ops each) are far cheaper than the 2 FFTs
they replace (O(N log N) each).

Combined with column compression from the previous commit, the DEEP
polynomial is now: compress all columns at 2N points (parallel), then
2 ext ops per row for the quotient. No FFTs in Round 4.
@diegokingston
Copy link
Copy Markdown
Collaborator Author

/bench 3

…ound-trip

build_auxiliary_trace now returns column-major aux data alongside
BusPublicInputs. In multi_prove Pass 2, the pre-built columns feed
directly into LDE + commit, eliminating an O(N * num_aux_cols)
column-to-row-to-column copy via trace.extract_columns_aux.
@diegokingston
Copy link
Copy Markdown
Collaborator Author

/bench 3

@diegokingston
Copy link
Copy Markdown
Collaborator Author

/bench 3

@diegokingston diegokingston marked this pull request as ready for review April 22, 2026 13:02
@github-actions
Copy link
Copy Markdown

Codex Code Review

  1. Medium - Performance/Availability: large memory amplification in DEEP computation
  • The new compressed precompute materializes num_eval_points * lde_size extension elements in memory before the hot loop. For large traces this can become very large and cause OOM/major slowdown.
  • Reference: prover.rs:1253, prover.rs:1277
  • Why it matters: previously the trace-term accumulation was streamed per-row; now memory is multiplied by both lde_size and num_eval_points.
  • Action: compute compression in chunks (or per-row/per-block) instead of storing the full 2D matrix.
  1. Low - Robustness: new public API relies on caller invariants without checks
  • get_trace_evaluations_from_lde now accepts precomputed coset_points, n_inv, etc., but does not validate that coset_points.len() == domain.interpolation_domain_size (and related consistency). A mismatched caller input can panic/out-of-bounds or produce invalid evaluations.
  • Reference: trace.rs:368, trace.rs:431, trace.rs:452
  • Action: add debug_assert!/checked assertions for key invariants (lengths and domain consistency) at function entry.

No direct memory-safety/unsafe/crypto implementation vulnerability was evident in the diff itself.

Comment thread crypto/stark/src/prover.rs
@claude
Copy link
Copy Markdown

claude Bot commented Apr 22, 2026

Review: perf/ood-evaluation

Overall: The math is sound and the changes are correct. The three main transforms (barycentric scalar sharing, stride-free LDE access, direct 2N DEEP evaluation) are all equivalent to the original formulations. No security issues found. One documentation bug and one missing test worth addressing.


Bug (Low) — Comment says i * bf but code uses no stride

See inline comment on prover.rs:1238. The Phase 1 block comment describes the old N-point access pattern (trace_col_j[i * bf]), but i now ranges over all lde_size points directly via get_main(i, j). This will mislead anyone who diffs this against the old path.


Missing test (Low–Medium) — No unit test for the DEEP poly at 2N points

The most algorithmically significant change — evaluating deep(x) at all 2N LDE points instead of N trace-coset points then iFFT+FFT-extending — has no dedicated comparison test. The prove/verify roundtrip provides indirect coverage, but the PR already adds test_decompose_and_extend_d2_matches_original as a precedent for exactly this kind of "new path == old path" check. A similar test for compute_deep_composition_poly_evaluations would give the same confidence.


Design note (Low) — compressed doubles trace evaluation work

compressed[k][i] is precomputed for all num_eval_points × lde_size = 2 × 2N rows across all ~95 columns. The original hot loop ran over only N trace-coset rows. Combined with the 2N evaluation domain, trace-term multiplications roughly double (380N vs 190N). The eliminated iFFT(N)+FFT(2N) saves O(N log N) ≈ 20–30N ops for typical N, which does not offset the extra 190N. Whether this is a net win in wall-clock time depends on parallelism and cache effects — benchmarks should confirm before merging to main.


Verified correct ✓

  • Barycentric formula refactor in trace.rs: vanishing_factor × Σ_i (coset_pt[i] × inv_denom[i]) × lde_col[i*bf] is algebraically identical to the inlined interpolate_coset_eval_with_g_n_inv it replaces.
  • compressed[k][i] - ood_compressed[k] correctly factors the column sum out of the denominator loop since inv_t_k_i doesn't depend on j.
  • lde_composition_poly_evaluations[j] has lde_size elements (produced by decompose_and_extend_d2 / extend_half_to_lde), so [j][i] for i in 0..lde_size is in-bounds.
  • Denominator slice indexing denoms[(1 + k) * lde_size + i] is within the allocated (1 + num_eval_points) * lde_size length.

@jotabulacios
Copy link
Copy Markdown
Collaborator

/bench 5

@jotabulacios jotabulacios changed the title perf: optimize OOD evaluation with scalar dedup, col_scale precompute… optimize rounds 3-4 — direct 2N DEEP evaluation + OOD stride reads Apr 23, 2026
@ColoCarletti
Copy link
Copy Markdown
Collaborator

/bench

@diegokingston
Copy link
Copy Markdown
Collaborator Author

/bench 3

@diegokingston
Copy link
Copy Markdown
Collaborator Author

/bench 10

@MauroToscano
Copy link
Copy Markdown
Contributor

/bench 5

@MauroToscano MauroToscano added this pull request to the merge queue Apr 30, 2026
Merged via the queue into main with commit 196cfac Apr 30, 2026
11 checks passed
@MauroToscano MauroToscano deleted the perf/ood-evaluation branch April 30, 2026 18:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants