Parallelize inplace_batch_inverse by jotabulacios · Pull Request #561 · yetanotherco/lambda_vm

jotabulacios · 2026-04-24T19:47:28Z

This PR adds a parallel path to inplace_batch_inverse: when the parallel feature is on and the slice has ≥ 2^16 elements, it splits the input into one chunk per Rayon thread and runs the existing Montgomery batch-inverse on each chunk in parallel. Otherwise it falls back to the sequential version

jotabulacios · 2026-04-24T19:48:00Z

/bench

github-actions · 2026-04-24T19:48:54Z

Codex Code Review

Found one issue.

Low - Partial mutation on Err in parallel path
crypto/math/src/field/element.rs: try_for_each(Self::inplace_batch_inverse_sequential) mutates each successful chunk independently. If any chunk contains zero, the method returns Err, but other chunks may already have been inverted. The previous sequential implementation returned Err before mutating numbers, so callers that handle the error now get a partially modified slice. Consider pre-checking for zero, documenting the new behavior, or writing into temporary chunk buffers and committing only after all chunks succeed.

I didn’t find security issues in the successful inversion path.

Verification note: I attempted cargo check -p math --all-targets and a targeted math test, but both were blocked because rustup tried to write under read-only /home/runner/.rustup/tmp.

claude · 2026-04-24T19:49:34Z

Review: Parallelize inplace_batch_inverse

Good direction overall — Rayon-based chunked parallelism is a natural fit here and the algorithm is correct (each chunk is fully independent). Two issues to address before merging:

Medium — Unconditional `Send + Sync` bound (breaking change)

The where Self: Send + Sync clause sits outside the #[cfg(feature = "parallel")] block, so it is compiled in for every feature configuration. Any downstream field type whose BaseType is !Send or !Sync will silently stop compiling, even if they never enable the parallel feature. The bound must be conditional. The cleanest fix is two #[cfg]-gated function bodies (see inline comment).

Low — Threshold is magic and likely too high

1 << 16 = 65536 elements is far above many real call-sites in the prover (e.g. FRI and constraint evaluator paths call this on slices well under that size). Each chunk adds a full field inversion (~300x the cost of a multiplication), so the threshold must be large enough that thread-level speedup exceeds K−1 extra inversions — but 65536 is so conservative it leaves nearly all production calls on the serial path. A benchmark-backed threshold around 1 << 12–1 << 13 is likely more effective.

Low — No test for the parallel path

The existing proptest generates slices up to 1 << 10 = 1024 elements, never reaching PARALLEL_BATCH_INV_THRESHOLD. A dedicated test (or bumped proptest bound) should exercise the parallel branch to guard against future regressions.

github-actions · 2026-04-24T19:53:21Z

Benchmark — fib_iterative_8M (median of 3)

_{Table parallelism: 1}

Metric	main	PR	Δ
Peak heap	49533 MB	49522 MB	-11 MB (+-0.0%) ⚪
Prove time	70.672s	63.496s	-7.176s (-10.2%) 🟢

🎉 Improvement detected — heap or time decreased by more than 5%.

✅ Low variance (time: 1.8%, heap: 0.0%)

_{Commit: 52e16f2 · Baseline: built from main · Runner: self-hosted bench}

jotabulacios · 2026-04-24T20:18:23Z

/bench 10

jotabulacios · 2026-04-27T19:26:32Z

/bench k=1

github-actions · 2026-04-27T20:04:19Z

Codex Code Review

Found one issue.

Potential Bug / Low security impact
crypto/math/src/field/element.rs:68

The new parallel path can partially mutate numbers before returning Err if any chunk contains a non-invertible element. The old sequential implementation computed the final prefix inverse before writing back, so on inv()? failure the input slice was left unchanged. With par_chunks_mut(...).try_for_each(...), one chunk can fail while other chunks have already replaced their elements with inverses.

This is especially risky because callers may handle the error and still reuse the original buffer. Keep the error behavior all-or-nothing, for example by validating/inverting chunk products before mutating chunks, or by falling back to the old sequential path when preserving that behavior matters.

I couldn’t run cargo check because rustup tried to write under /home/runner/.rustup, which is read-only in this environment.

claude · 2026-04-27T20:05:56Z

Review: Parallelize inplace_batch_inverse

Overall: The approach is correct — chunked batch inversion is mathematically sound since each element's inverse is independent. The implementation is clean and consistent with how the existing FFT parallelism is structured.

Medium — No test coverage for the parallel path

The existing tests use vectors of ≤10 elements (far below the 65536 threshold), so the new #[cfg(feature = "parallel")] branch is never exercised by the test suite. A regression (wrong inverses, silent data corruption) in the parallel path would go undetected.

The FFT code sets a good precedent in bowers_fft_tests.rs — it uses rayon::ThreadPoolBuilder to run correctness assertions with explicit thread counts (1, 2, 4, 8). A similar test here should:

build a slice of ≥65536 non-zero elements,
call inplace_batch_inverse,
verify each result satisfies x * inv == 1.

Low — `where Self: Send + Sync` bound is redundant

IsField already requires type BaseType: ... + Send + Sync (traits.rs:101). Because FieldElement<F> contains only F::BaseType, Rust auto-derives Send + Sync for any F: IsField. The bound is always satisfied and won't break callers, but it is unnecessary API surface noise. The parallel code inside the #[cfg(feature = "parallel")] block still needs it for the compiler — consider limiting it to an inner function or trait object cast rather than propagating it to the public signature.

Low — No benchmark for the 2^16 threshold

There is no benchmark validating the threshold or measuring actual speedup. The FFT uses 2^14; batch inverse uses 2^16 — likely intentional due to different per-element cost, but without a benchmark it is hard to verify the threshold is well-calibrated. Consider adding an entry to the existing goldilocks_benchmark to document the rationale.

jotabulacios · 2026-04-28T12:40:26Z

/claude

claude · 2026-04-28T12:44:29Z

Review Summary

Overall: The approach is correct and well-tested. The Montgomery batch-inverse per chunk is a sound strategy — each chunk is independent (no zero divisors in a field, so the product of non-zero elements is non-zero within a chunk), and the all-or-nothing invariant is properly preserved via the zero pre-scan.

What works well

Correct results: chunk-based inversion gives the same per-element results as a full-array inversion ✓
All-or-nothing semantics properly enforced via pre-scan before any mutations ✓
Good test coverage: multiple thread counts, zero-error case, mutation-on-error prevention ✓
K extra inversions are correctly documented as negligible vs ~2N multiplications ✓
inplace_batch_inverse_sequential kept private ✓

Issues

Medium — Double data traversal overhead
The zero pre-scan (par_iter().any(|x| x == &zero)) adds a full extra O(N) pass over the data before the actual computation. For the common case (no zeros), this is ~50% extra work and doubles cache pressure. At N = 2^20 across 8 threads, the pre-scan is significant. See inline comment.

Low — try_for_each is misleading
After a successful zero pre-scan, inplace_batch_inverse_sequential can never return Err (field has no zero divisors). The try_for_each signals fallible intent that doesn't exist. Also worth noting: Rayon's try_for_each does not abort in-flight parallel tasks on error — it only short-circuits result collection — so the all-or-nothing guarantee relies entirely on the pre-scan, not on try_for_each's behavior. See inline comment.

Low — Benchmark zero exclusion biases toward 1

if v == 0 { v = 1; }

Maps 0 (probability ~1/2^64) to 1, very slightly over-representing 1 in the benchmark. Negligible in practice but worth a comment or using NonZeroU64 / v.max(1).

Parallelize inplace_batch_inverse

2896c6b

claude Bot reviewed Apr 24, 2026

View reviewed changes

Comment thread crypto/math/src/field/element.rs Outdated

claude Bot reviewed Apr 24, 2026

View reviewed changes

Comment thread crypto/math/src/field/element.rs

Merge branch 'main' into perf/chunked-parallel-batch-inverse

5522b81

Merge branch 'main' into perf/chunked-parallel-batch-inverse

52e16f2

jotabulacios marked this pull request as ready for review April 27, 2026 20:02

claude Bot reviewed Apr 27, 2026

View reviewed changes

Comment thread crypto/math/src/field/element.rs

claude Bot reviewed Apr 27, 2026

View reviewed changes

Comment thread crypto/math/src/field/element.rs Outdated

addres comments

396dbde

claude Bot reviewed Apr 28, 2026

View reviewed changes

Comment thread crypto/math/src/field/element.rs

claude Bot reviewed Apr 28, 2026

View reviewed changes

Comment thread crypto/math/src/field/element.rs

nicole-graus approved these changes Apr 28, 2026

View reviewed changes

MauroToscano approved these changes Apr 28, 2026

View reviewed changes

MauroToscano added this pull request to the merge queue Apr 28, 2026

Merged via the queue into main with commit 5bcca15 Apr 28, 2026
10 checks passed

MauroToscano deleted the perf/chunked-parallel-batch-inverse branch April 28, 2026 18:36

Conversation

jotabulacios commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jotabulacios commented Apr 24, 2026

Uh oh!

github-actions Bot commented Apr 24, 2026

Codex Code Review

Uh oh!

Uh oh!

Uh oh!

claude Bot commented Apr 24, 2026

Review: Parallelize inplace_batch_inverse

Medium — Unconditional Send + Sync bound (breaking change)

Low — Threshold is magic and likely too high

Low — No test for the parallel path

Uh oh!

github-actions Bot commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark — fib_iterative_8M (median of 3)

Uh oh!

jotabulacios commented Apr 24, 2026

Uh oh!

jotabulacios commented Apr 27, 2026

Uh oh!

github-actions Bot commented Apr 27, 2026

Codex Code Review

Uh oh!

claude Bot commented Apr 27, 2026

Review: Parallelize inplace_batch_inverse

Medium — No test coverage for the parallel path

Low — where Self: Send + Sync bound is redundant

Low — No benchmark for the 2^16 threshold

Uh oh!

Uh oh!

Uh oh!

jotabulacios commented Apr 28, 2026

Uh oh!

Uh oh!

Uh oh!

claude Bot commented Apr 28, 2026

Review Summary

What works well

Issues

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jotabulacios commented Apr 24, 2026 •

edited

Loading

Medium — Unconditional `Send + Sync` bound (breaking change)

github-actions Bot commented Apr 24, 2026 •

edited

Loading

Low — `where Self: Send + Sync` bound is redundant