optimize rounds 3-4 — direct 2N DEEP evaluation + OOD stride reads by diegokingston · Pull Request #522 · yetanotherco/lambda_vm

diegokingston · 2026-04-22T02:07:15Z

Reduce memory and compute in rounds 3–4:

OOD: pass shared scalars from round_3, hoist col_scale
precompute, read LDE with stride (no Vec copies).
DEEP: column compression + direct evaluation at all 2N LDE points
(no iFFT+FFT).

…, and stride reads Three optimizations to get_trace_evaluations_from_lde: 1. Deduplicate coset_points/coset_offset_pow_n/n_inv/g_n_inv: these were computed identically in both round_3 (prover.rs) and the trace function. Now computed once in round_3 and passed as parameters. 2. Precompute col_scale[i] = coset_point[i] * inv_denom[i] once per eval point, shared across all columns. Eliminates N redundant F×E multiplies per column (saves ~95*N*2 F×E muls for CPU table). 3. Read LDE columns directly with stride (lde_col[i*bf]) instead of extracting N-element Vec copies per column. Eliminates num_cols * N field element allocations and copies.

diegokingston · 2026-04-22T02:07:25Z

/bench 3

github-actions · 2026-04-22T02:08:48Z

Codex Code Review

Medium – Missing input consistency checks can cause silent wrong evaluations or panic

Where: trace.rs:374, trace.rs:411, trace.rs:431, trace.rs:452
Issue: get_trace_evaluations_from_lde now trusts externally supplied coset_points, coset_offset_pow_n, n_inv, g_n_inv with no validation against domain.
- If coset_points.len() < n, interpolation is silently computed on a truncated domain (incorrect result, no error).
- If coset_points.len() > n, lde_col[i * bf] can go out of bounds and panic.
Why this matters: This function previously derived these values internally, so mismatches were impossible. The refactor introduced a new correctness/panic surface.
Fix: Add explicit checks (coset_points.len() == n, expected LDE column lengths) and consider returning Result on invalid input; at minimum use strong assertions with clear messages.

No other concrete security/performance/simplicity issues found in this diff.

I couldn’t run tests in this environment because cargo failed due a read-only rustup temp path (/home/runner/.rustup/tmp/...).

github-actions · 2026-04-22T02:09:37Z

Benchmark — fib_iterative_8M (median of 5)

_{Table parallelism: auto (cores / 3)}

Metric	main	PR	Δ
Peak heap	56461 MB	53059 MB	-3402 MB (-6.0%) 🟢
Prove time	26.060s	25.527s	-0.533s (-2.0%) ⚪

🎉 Improvement detected — heap or time decreased by more than 5%.

✅ Low variance (time: 1.7%, heap: 1.3%)

_{Commit: 5f4c79f · Baseline: cached · Runner: self-hosted bench}

diegokingston · 2026-04-22T02:21:24Z

/bench 3

diegokingston · 2026-04-22T02:27:59Z

/bench 3

Instead of computing DEEP at N trace-coset points then extending to 2N via iFFT(N) + FFT(2N), compute directly at all 2N LDE points. The extra N point evaluations (~8 ext ops each) are far cheaper than the 2 FFTs they replace (O(N log N) each). Combined with column compression from the previous commit, the DEEP polynomial is now: compress all columns at 2N points (parallel), then 2 ext ops per row for the quotient. No FFTs in Round 4.

diegokingston · 2026-04-22T02:36:23Z

/bench 3

…ound-trip build_auxiliary_trace now returns column-major aux data alongside BusPublicInputs. In multi_prove Pass 2, the pre-built columns feed directly into LDE + commit, eliminating an O(N * num_aux_cols) column-to-row-to-column copy via trace.extract_columns_aux.

diegokingston · 2026-04-22T03:27:24Z

/bench 3

diegokingston · 2026-04-22T03:46:48Z

/bench 3

github-actions · 2026-04-22T13:04:15Z

Codex Code Review

Medium - Performance/Availability: large memory amplification in DEEP computation

The new compressed precompute materializes num_eval_points * lde_size extension elements in memory before the hot loop. For large traces this can become very large and cause OOM/major slowdown.
Reference: prover.rs:1253, prover.rs:1277
Why it matters: previously the trace-term accumulation was streamed per-row; now memory is multiplied by both lde_size and num_eval_points.
Action: compute compression in chunks (or per-row/per-block) instead of storing the full 2D matrix.

Low - Robustness: new public API relies on caller invariants without checks

get_trace_evaluations_from_lde now accepts precomputed coset_points, n_inv, etc., but does not validate that coset_points.len() == domain.interpolation_domain_size (and related consistency). A mismatched caller input can panic/out-of-bounds or produce invalid evaluations.
Reference: trace.rs:368, trace.rs:431, trace.rs:452
Action: add debug_assert!/checked assertions for key invariants (lengths and domain consistency) at function entry.

No direct memory-safety/unsafe/crypto implementation vulnerability was evident in the diff itself.

…guments)

claude · 2026-04-22T13:12:19Z

Review: perf/ood-evaluation

Overall: The math is sound and the changes are correct. The three main transforms (barycentric scalar sharing, stride-free LDE access, direct 2N DEEP evaluation) are all equivalent to the original formulations. No security issues found. One documentation bug and one missing test worth addressing.

Bug (Low) — Comment says `i * bf` but code uses no stride

See inline comment on prover.rs:1238. The Phase 1 block comment describes the old N-point access pattern (trace_col_j[i * bf]), but i now ranges over all lde_size points directly via get_main(i, j). This will mislead anyone who diffs this against the old path.

Missing test (Low–Medium) — No unit test for the DEEP poly at 2N points

The most algorithmically significant change — evaluating deep(x) at all 2N LDE points instead of N trace-coset points then iFFT+FFT-extending — has no dedicated comparison test. The prove/verify roundtrip provides indirect coverage, but the PR already adds test_decompose_and_extend_d2_matches_original as a precedent for exactly this kind of "new path == old path" check. A similar test for compute_deep_composition_poly_evaluations would give the same confidence.

Design note (Low) — `compressed` doubles trace evaluation work

compressed[k][i] is precomputed for all num_eval_points × lde_size = 2 × 2N rows across all ~95 columns. The original hot loop ran over only N trace-coset rows. Combined with the 2N evaluation domain, trace-term multiplications roughly double (380N vs 190N). The eliminated iFFT(N)+FFT(2N) saves O(N log N) ≈ 20–30N ops for typical N, which does not offset the extra 190N. Whether this is a net win in wall-clock time depends on parallelism and cache effects — benchmarks should confirm before merging to main.

Verified correct ✓

Barycentric formula refactor in trace.rs: vanishing_factor × Σ_i (coset_pt[i] × inv_denom[i]) × lde_col[i*bf] is algebraically identical to the inlined interpolate_coset_eval_with_g_n_inv it replaces.
compressed[k][i] - ood_compressed[k] correctly factors the column sum out of the denominator loop since inv_t_k_i doesn't depend on j.
lde_composition_poly_evaluations[j] has lde_size elements (produced by decompose_and_extend_d2 / extend_half_to_lde), so [j][i] for i in 0..lde_size is in-bounds.
Denominator slice indexing denoms[(1 + k) * lde_size + i] is within the allocated (1 + num_eval_points) * lde_size length.

coset_points length against domain size

jotabulacios · 2026-04-23T18:52:43Z

/bench 5

extended path

ColoCarletti · 2026-04-24T18:44:45Z

/bench

diegokingston · 2026-04-24T20:48:45Z

/bench 3

diegokingston · 2026-04-27T15:20:30Z

/bench 10

MauroToscano · 2026-04-30T17:38:44Z

/bench 5

claude Bot reviewed Apr 22, 2026

View reviewed changes

Comment thread crypto/stark/src/trace.rs

claude Bot reviewed Apr 22, 2026

View reviewed changes

Comment thread crypto/stark/src/trace.rs

perf: column-compressed DEEP polynomial (Plonky3-style)

ef1ee41

perf: parallelize DEEP column compression over rows

d551c5c

revert: undo aux column round-trip elimination (caused regression)

6a34bb6

diegokingston marked this pull request as ready for review April 22, 2026 13:02

fix: resolve clippy warnings (unused vars, clone on Copy, too_many_ar…

75771c0

…guments)

claude Bot reviewed Apr 22, 2026

View reviewed changes

Comment thread crypto/stark/src/prover.rs

Merge branch 'main' into perf/ood-evaluation

45e348e

jotabulacios force-pushed the perf/ood-evaluation branch from d4ee91a to 45e348e Compare April 23, 2026 12:51

jotabulacios added 3 commits April 23, 2026 09:51

Merge branch 'main' into perf/ood-evaluation

f153847

fix stale stride comment in column compression and debug_assert

82ab479

coset_points length against domain size

Merge branch 'main' into perf/ood-evaluation

567a31c

jotabulacios changed the title ~~perf: optimize OOD evaluation with scalar dedup, col_scale precompute…~~ optimize rounds 3-4 — direct 2N DEEP evaluation + OOD stride reads Apr 23, 2026

jotabulacios added 2 commits April 24, 2026 14:59

Merge branch 'main' into perf/ood-evaluation

10108df

Add differential test for direct 2N DEEP evaluation vs iFFT+FFT

8dfa6d0

extended path

ColoCarletti approved these changes Apr 24, 2026

View reviewed changes

ColoCarletti and others added 2 commits April 24, 2026 16:36

Merge branch 'main' into perf/ood-evaluation

d062172

Merge branch 'main' into perf/ood-evaluation

75c56f7

jotabulacios added 2 commits April 30, 2026 11:02

solve conflicts

875d37c

lint

5f4c79f

MauroToscano approved these changes Apr 30, 2026

View reviewed changes

MauroToscano added this pull request to the merge queue Apr 30, 2026

Merged via the queue into main with commit 196cfac Apr 30, 2026
11 checks passed

MauroToscano deleted the perf/ood-evaluation branch April 30, 2026 18:42

Conversation

diegokingston commented Apr 22, 2026 • edited by jotabulacios Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

diegokingston commented Apr 22, 2026

Uh oh!

github-actions Bot commented Apr 22, 2026

Codex Code Review

Uh oh!

github-actions Bot commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark — fib_iterative_8M (median of 5)

Uh oh!

Uh oh!

Uh oh!

diegokingston commented Apr 22, 2026

Uh oh!

diegokingston commented Apr 22, 2026

Uh oh!

diegokingston commented Apr 22, 2026

Uh oh!

diegokingston commented Apr 22, 2026

Uh oh!

diegokingston commented Apr 22, 2026

Uh oh!

github-actions Bot commented Apr 22, 2026

Codex Code Review

Uh oh!

Uh oh!

claude Bot commented Apr 22, 2026

Review: perf/ood-evaluation

Bug (Low) — Comment says i * bf but code uses no stride

Missing test (Low–Medium) — No unit test for the DEEP poly at 2N points

Design note (Low) — compressed doubles trace evaluation work

Verified correct ✓

Uh oh!

jotabulacios commented Apr 23, 2026

Uh oh!

ColoCarletti commented Apr 24, 2026

Uh oh!

diegokingston commented Apr 24, 2026

Uh oh!

diegokingston commented Apr 27, 2026

Uh oh!

MauroToscano commented Apr 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

diegokingston commented Apr 22, 2026 •

edited by jotabulacios

Loading

github-actions Bot commented Apr 22, 2026 •

edited

Loading

Bug (Low) — Comment says `i * bf` but code uses no stride

Design note (Low) — `compressed` doubles trace evaluation work