Skip to content

Parallelize inplace_batch_inverse#561

Merged
MauroToscano merged 4 commits into
mainfrom
perf/chunked-parallel-batch-inverse
Apr 28, 2026
Merged

Parallelize inplace_batch_inverse#561
MauroToscano merged 4 commits into
mainfrom
perf/chunked-parallel-batch-inverse

Conversation

@jotabulacios
Copy link
Copy Markdown
Collaborator

@jotabulacios jotabulacios commented Apr 24, 2026

This PR adds a parallel path to inplace_batch_inverse: when the parallel feature is on and the slice has ≥ 2^16 elements, it splits the input into one chunk per Rayon thread and runs the existing Montgomery batch-inverse on each chunk in parallel. Otherwise it falls back to the sequential version

@jotabulacios
Copy link
Copy Markdown
Collaborator Author

/bench

@github-actions
Copy link
Copy Markdown

Codex Code Review

Found one issue.

Low - Partial mutation on Err in parallel path
crypto/math/src/field/element.rs: try_for_each(Self::inplace_batch_inverse_sequential) mutates each successful chunk independently. If any chunk contains zero, the method returns Err, but other chunks may already have been inverted. The previous sequential implementation returned Err before mutating numbers, so callers that handle the error now get a partially modified slice. Consider pre-checking for zero, documenting the new behavior, or writing into temporary chunk buffers and committing only after all chunks succeed.

I didn’t find security issues in the successful inversion path.

Verification note: I attempted cargo check -p math --all-targets and a targeted math test, but both were blocked because rustup tried to write under read-only /home/runner/.rustup/tmp.

Comment thread crypto/math/src/field/element.rs Outdated
Comment thread crypto/math/src/field/element.rs
@claude
Copy link
Copy Markdown

claude Bot commented Apr 24, 2026

Review: Parallelize inplace_batch_inverse

Good direction overall — Rayon-based chunked parallelism is a natural fit here and the algorithm is correct (each chunk is fully independent). Two issues to address before merging:

Medium — Unconditional Send + Sync bound (breaking change)

The where Self: Send + Sync clause sits outside the #[cfg(feature = "parallel")] block, so it is compiled in for every feature configuration. Any downstream field type whose BaseType is !Send or !Sync will silently stop compiling, even if they never enable the parallel feature. The bound must be conditional. The cleanest fix is two #[cfg]-gated function bodies (see inline comment).

Low — Threshold is magic and likely too high

1 << 16 = 65536 elements is far above many real call-sites in the prover (e.g. FRI and constraint evaluator paths call this on slices well under that size). Each chunk adds a full field inversion (~300x the cost of a multiplication), so the threshold must be large enough that thread-level speedup exceeds K−1 extra inversions — but 65536 is so conservative it leaves nearly all production calls on the serial path. A benchmark-backed threshold around 1 << 121 << 13 is likely more effective.

Low — No test for the parallel path

The existing proptest generates slices up to 1 << 10 = 1024 elements, never reaching PARALLEL_BATCH_INV_THRESHOLD. A dedicated test (or bumped proptest bound) should exercise the parallel branch to guard against future regressions.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 24, 2026

Benchmark — fib_iterative_8M (median of 3)

Table parallelism: 1

Metric main PR Δ
Peak heap 49533 MB 49522 MB -11 MB (+-0.0%) ⚪
Prove time 70.672s 63.496s -7.176s (-10.2%) 🟢

🎉 Improvement detected — heap or time decreased by more than 5%.

✅ Low variance (time: 1.8%, heap: 0.0%)

Commit: 52e16f2 · Baseline: built from main · Runner: self-hosted bench

@jotabulacios
Copy link
Copy Markdown
Collaborator Author

/bench 10

@jotabulacios
Copy link
Copy Markdown
Collaborator Author

/bench k=1

@jotabulacios jotabulacios marked this pull request as ready for review April 27, 2026 20:02
@github-actions
Copy link
Copy Markdown

Codex Code Review

Found one issue.

Potential Bug / Low security impact
crypto/math/src/field/element.rs:68

The new parallel path can partially mutate numbers before returning Err if any chunk contains a non-invertible element. The old sequential implementation computed the final prefix inverse before writing back, so on inv()? failure the input slice was left unchanged. With par_chunks_mut(...).try_for_each(...), one chunk can fail while other chunks have already replaced their elements with inverses.

This is especially risky because callers may handle the error and still reuse the original buffer. Keep the error behavior all-or-nothing, for example by validating/inverting chunk products before mutating chunks, or by falling back to the old sequential path when preserving that behavior matters.

I couldn’t run cargo check because rustup tried to write under /home/runner/.rustup, which is read-only in this environment.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 27, 2026

Review: Parallelize inplace_batch_inverse

Overall: The approach is correct — chunked batch inversion is mathematically sound since each element's inverse is independent. The implementation is clean and consistent with how the existing FFT parallelism is structured.

Medium — No test coverage for the parallel path

The existing tests use vectors of ≤10 elements (far below the 65536 threshold), so the new #[cfg(feature = "parallel")] branch is never exercised by the test suite. A regression (wrong inverses, silent data corruption) in the parallel path would go undetected.

The FFT code sets a good precedent in bowers_fft_tests.rs — it uses rayon::ThreadPoolBuilder to run correctness assertions with explicit thread counts (1, 2, 4, 8). A similar test here should:

  • build a slice of ≥65536 non-zero elements,
  • call inplace_batch_inverse,
  • verify each result satisfies x * inv == 1.

Low — where Self: Send + Sync bound is redundant

IsField already requires type BaseType: ... + Send + Sync (traits.rs:101). Because FieldElement<F> contains only F::BaseType, Rust auto-derives Send + Sync for any F: IsField. The bound is always satisfied and won't break callers, but it is unnecessary API surface noise. The parallel code inside the #[cfg(feature = "parallel")] block still needs it for the compiler — consider limiting it to an inner function or trait object cast rather than propagating it to the public signature.

Low — No benchmark for the 2^16 threshold

There is no benchmark validating the threshold or measuring actual speedup. The FFT uses 2^14; batch inverse uses 2^16 — likely intentional due to different per-element cost, but without a benchmark it is hard to verify the threshold is well-calibrated. Consider adding an entry to the existing goldilocks_benchmark to document the rationale.

Comment thread crypto/math/src/field/element.rs
Comment thread crypto/math/src/field/element.rs Outdated
@jotabulacios
Copy link
Copy Markdown
Collaborator Author

/claude

Comment thread crypto/math/src/field/element.rs
Comment thread crypto/math/src/field/element.rs
@claude
Copy link
Copy Markdown

claude Bot commented Apr 28, 2026

Review Summary

Overall: The approach is correct and well-tested. The Montgomery batch-inverse per chunk is a sound strategy — each chunk is independent (no zero divisors in a field, so the product of non-zero elements is non-zero within a chunk), and the all-or-nothing invariant is properly preserved via the zero pre-scan.

What works well

  • Correct results: chunk-based inversion gives the same per-element results as a full-array inversion ✓
  • All-or-nothing semantics properly enforced via pre-scan before any mutations ✓
  • Good test coverage: multiple thread counts, zero-error case, mutation-on-error prevention ✓
  • K extra inversions are correctly documented as negligible vs ~2N multiplications ✓
  • inplace_batch_inverse_sequential kept private ✓

Issues

Medium — Double data traversal overhead
The zero pre-scan (par_iter().any(|x| x == &zero)) adds a full extra O(N) pass over the data before the actual computation. For the common case (no zeros), this is ~50% extra work and doubles cache pressure. At N = 2^20 across 8 threads, the pre-scan is significant. See inline comment.

Low — try_for_each is misleading
After a successful zero pre-scan, inplace_batch_inverse_sequential can never return Err (field has no zero divisors). The try_for_each signals fallible intent that doesn't exist. Also worth noting: Rayon's try_for_each does not abort in-flight parallel tasks on error — it only short-circuits result collection — so the all-or-nothing guarantee relies entirely on the pre-scan, not on try_for_each's behavior. See inline comment.

Low — Benchmark zero exclusion biases toward 1

if v == 0 { v = 1; }

Maps 0 (probability ~1/2^64) to 1, very slightly over-representing 1 in the benchmark. Negligible in practice but worth a comment or using NonZeroU64 / v.max(1).

@MauroToscano MauroToscano added this pull request to the merge queue Apr 28, 2026
Merged via the queue into main with commit 5bcca15 Apr 28, 2026
10 checks passed
@MauroToscano MauroToscano deleted the perf/chunked-parallel-batch-inverse branch April 28, 2026 18:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants