optimize rounds 3-4: direct 2N DEEP evaluation + OOD stride reads #522
…, and stride reads

Three optimizations to get_trace_evaluations_from_lde:

1. Deduplicate coset_points/coset_offset_pow_n/n_inv/g_n_inv: these were computed identically in both round_3 (prover.rs) and the trace function. They are now computed once in round_3 and passed as parameters.
2. Precompute col_scale[i] = coset_point[i] * inv_denom[i] once per eval point, shared across all columns. This eliminates N redundant F×E multiplies per column (saves ~95*N*2 F×E muls for the CPU table).
3. Read LDE columns directly with stride (lde_col[i*bf]) instead of extracting N-element Vec copies per column. This eliminates num_cols * N field element allocations and copies.
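For readers following along, optimizations 2 and 3 can be sketched roughly as below. This is a minimal illustration, not the PR's code: it assumes a toy prime field (65537) in place of the real base/extension fields, a single evaluation point z, and hypothetical helper names (prefactor, eval_column, demo). Only col_scale and the lde_col[i * bf] access pattern are taken from the commit message. The identity used is the coset barycentric formula p(z) = (z^N - h^N) / (N * h^N) * sum_i p(x_i) * x_i / (z - x_i).

```rust
// Toy prime field (2^16 + 1); an assumption for illustration, not the PR's field.
const P: u64 = 65_537;

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }
fn mul(a: u64, b: u64) -> u64 { a * b % P }
fn pow(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1;
    while exp > 0 {
        if exp & 1 == 1 { acc = mul(acc, base); }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}
fn inv(a: u64) -> u64 { pow(a, P - 2) } // Fermat inverse; a must be non-zero
fn horner(coeffs: &[u64], x: u64) -> u64 {
    coeffs.iter().rev().fold(0, |acc, &c| add(mul(acc, x), c))
}

/// Shared per-point scalars: scale[i] = x_i / (z - x_i), computed once per z
/// and reused for every column.
fn col_scale(coset_points: &[u64], z: u64) -> Vec<u64> {
    coset_points.iter().map(|&x| mul(x, inv(sub(z, x)))).collect()
}

/// (z^N - h^N) / (N * h^N); hypothetical helper, also independent of the column.
fn prefactor(z: u64, offset: u64, n: u64) -> u64 {
    let h_n = pow(offset, n);
    mul(sub(pow(z, n), h_n), inv(mul(n, h_n)))
}

/// Barycentric evaluation of one column at z, reading every `bf`-th LDE entry
/// in place instead of copying the N trace-coset values into a new Vec.
fn eval_column(lde_col: &[u64], bf: usize, scale: &[u64], pre: u64) -> u64 {
    let sum = scale
        .iter()
        .enumerate()
        .fold(0, |acc, (i, &s)| add(acc, mul(lde_col[i * bf], s)));
    mul(pre, sum)
}

/// Small example: returns (barycentric value, direct Horner value).
fn demo() -> (u64, u64) {
    let (n, bf) = (4usize, 2usize);
    let offset = 3; // coset offset h
    let omega = pow(3, (P - 1) / (n * bf) as u64); // primitive 8th root of unity
    let coeffs: [u64; 4] = [5, 7, 11, 13]; // p(x) = 5 + 7x + 11x^2 + 13x^3
    let lde_points: Vec<u64> =
        (0..n * bf).map(|j| mul(offset, pow(omega, j as u64))).collect();
    let lde_col: Vec<u64> = lde_points.iter().map(|&x| horner(&coeffs, x)).collect();
    let coset_points: Vec<u64> = (0..n).map(|i| lde_points[i * bf]).collect();
    let z = 2; // out-of-domain point
    let scale = col_scale(&coset_points, z);
    let pre = prefactor(z, offset, n as u64);
    (eval_column(&lde_col, bf, &scale, pre), horner(&coeffs, z))
}
```

In this sketch the per-column work is one multiply-add per trace row, since scale and pre are built once per evaluation point. A real prover would presumably batch the inversions rather than call inv per point.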
/bench 3
Codex Code Review
No other concrete security/performance/simplicity issues found in this diff. I couldn’t run tests in this environment because …
Benchmark: fib_iterative_8M (median of 5)
Table parallelism: auto (cores / 3)
Commit: 5f4c79f · Baseline: cached · Runner: self-hosted bench
/bench 3
/bench 3
Instead of computing DEEP at N trace-coset points and then extending to 2N via iFFT(N) + FFT(2N), compute it directly at all 2N LDE points. The extra N point evaluations (~8 ext ops each) are far cheaper than the 2 FFTs they replace (O(N log N) each). Combined with column compression from the previous commit, the DEEP polynomial computation is now: compress all columns at 2N points (in parallel), then 2 ext ops per row for the quotient. No FFTs in Round 4.
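A rough sketch of the direct evaluation described above, under simplifying assumptions: a toy prime field (65537) instead of the extension field, a single out-of-domain point z, and hypothetical function names (compress, deep_direct, divide_by_linear, demo) that are not taken from the PR. It compresses the columns row by row, forms (h(x_i) - h(z)) / (x_i - z) at every LDE point, and compares the result with a coefficient-space reference quotient.

```rust
// Toy prime field (2^16 + 1); an assumption for illustration, not the PR's field.
const P: u64 = 65_537;

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }
fn mul(a: u64, b: u64) -> u64 { a * b % P }
fn pow(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1;
    while exp > 0 {
        if exp & 1 == 1 { acc = mul(acc, base); }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}
fn inv(a: u64) -> u64 { pow(a, P - 2) } // Fermat inverse; a must be non-zero
fn horner(coeffs: &[u64], x: u64) -> u64 {
    coeffs.iter().rev().fold(0, |acc, &c| add(mul(acc, x), c))
}

/// Column compression: h[i] = sum_j gammas[j] * lde_cols[j][i], one pass per row.
fn compress(lde_cols: &[Vec<u64>], gammas: &[u64]) -> Vec<u64> {
    let rows = lde_cols[0].len();
    (0..rows)
        .map(|i| {
            lde_cols
                .iter()
                .zip(gammas)
                .fold(0, |acc, (col, &g)| add(acc, mul(g, col[i])))
        })
        .collect()
}

/// Direct DEEP quotient at every LDE point: one subtraction and one multiply
/// by 1 / (x_i - z) per row, with no FFT. A real prover would presumably
/// batch-invert the denominators.
fn deep_direct(h_lde: &[u64], lde_points: &[u64], h_at_z: u64, z: u64) -> Vec<u64> {
    h_lde
        .iter()
        .zip(lde_points)
        .map(|(&h, &x)| mul(sub(h, h_at_z), inv(sub(x, z))))
        .collect()
}

/// Reference only: quotient of h(x) - h(z) by (x - z) via synthetic division.
fn divide_by_linear(coeffs: &[u64], z: u64) -> Vec<u64> {
    let mut q = vec![0; coeffs.len() - 1];
    let mut carry = 0;
    for k in (1..coeffs.len()).rev() {
        carry = add(coeffs[k], mul(z, carry));
        q[k - 1] = carry;
    }
    q
}

/// Small example with N = 4 and 2N = 8 LDE points: returns (direct, reference).
fn demo() -> (Vec<u64>, Vec<u64>) {
    let offset = 3; // coset offset
    let omega = pow(3, (P - 1) / 8); // primitive 8th root of unity
    let lde_points: Vec<u64> = (0..8).map(|j| mul(offset, pow(omega, j))).collect();
    let cols: [[u64; 4]; 2] = [[1, 2, 3, 4], [5, 6, 7, 8]]; // column coefficients
    let gammas: [u64; 2] = [9, 10]; // stand-ins for the random combination weights
    let z = 2; // out-of-domain point
    let lde_cols: Vec<Vec<u64>> = cols
        .iter()
        .map(|c| lde_points.iter().map(|&x| horner(c, x)).collect())
        .collect();
    let h_lde = compress(&lde_cols, &gammas);
    let h_coeffs: Vec<u64> = (0..4)
        .map(|k| add(mul(gammas[0], cols[0][k]), mul(gammas[1], cols[1][k])))
        .collect();
    let direct = deep_direct(&h_lde, &lde_points, horner(&h_coeffs, z), z);
    let q = divide_by_linear(&h_coeffs, z);
    let reference: Vec<u64> = lde_points.iter().map(|&x| horner(&q, x)).collect();
    (direct, reference)
}
```

In the actual prover the value h(z) would come from the round 3 OOD evaluations rather than from coefficients. The sketch recomputes it with horner only so that the example is self-checking.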
/bench 3
…ound-trip

build_auxiliary_trace now returns column-major aux data alongside BusPublicInputs. In multi_prove Pass 2, the pre-built columns feed directly into LDE + commit, eliminating an O(N * num_aux_cols) column-to-row-to-column copy via trace.extract_columns_aux.
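The copy being removed can be pictured with a small sketch. Everything here is hypothetical (the RowMajor type, build_aux_columns, round_trip, and the running-sum stand-in for the aux computation); it only illustrates why handing column-major data straight to the next stage avoids an O(N * num_aux_cols) pass, and is not the repository's API.

```rust
/// Hypothetical row-major table: element (row, col) lives at data[row * width + col].
struct RowMajor {
    width: usize,
    data: Vec<u64>,
}

impl RowMajor {
    /// Column extraction: one allocation per column and a strided pass over all rows.
    fn extract_columns(&self) -> Vec<Vec<u64>> {
        let rows = self.data.len() / self.width;
        (0..self.width)
            .map(|c| (0..rows).map(|r| self.data[r * self.width + c]).collect())
            .collect()
    }
}

/// Stand-in aux builder: a running sum of each main column, produced
/// directly in column-major form.
fn build_aux_columns(main_cols: &[Vec<u64>]) -> Vec<Vec<u64>> {
    main_cols
        .iter()
        .map(|col| {
            col.iter()
                .scan(0u64, |acc, &v| {
                    *acc = acc.wrapping_add(v);
                    Some(*acc)
                })
                .collect()
        })
        .collect()
}

/// The round trip the commit removes: columns -> row-major table -> columns.
fn round_trip(cols: &[Vec<u64>]) -> Vec<Vec<u64>> {
    let width = cols.len();
    let rows = cols[0].len();
    let mut data = Vec::with_capacity(rows * width);
    for r in 0..rows {
        for c in cols {
            data.push(c[r]);
        }
    }
    RowMajor { width, data }.extract_columns()
}
```

Since round_trip returns exactly the columns it was given, a consumer that needs columns (such as LDE + commit) can take the builder's output directly and skip both copies.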
/bench 3
/bench 3
Codex Code Review
No direct memory-safety/unsafe/crypto implementation vulnerability was evident in the diff itself.
Review: perf/ood-evaluation

Overall: The math is sound and the changes are correct. The three main transforms (barycentric scalar sharing, stride-free LDE access, direct 2N DEEP evaluation) are all equivalent to the original formulations. No security issues found. One documentation bug and one missing test worth addressing.

Bug (Low): Comment says …
d4ee91a to 45e348e
coset_points length against domain size
/bench 5
/bench
/bench 3
/bench 10
/bench 5
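Reduce memory and compute in rounds 3–4:
- Round 3 (OOD evaluation): precompute the shared per-point scalars once, read LDE with stride (no Vec copies).
- Round 4 (DEEP): evaluate directly at all 2N LDE points (no iFFT+FFT).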