Skip to content

feat(cuda): math-cuda crate scaffolding — arith + NTT + coset LDE#576

Merged
ColoCarletti merged 9 commits into
mainfrom
feat/cuda-pr1a-math-cuda-foundation
May 13, 2026
Merged

feat(cuda): math-cuda crate scaffolding — arith + NTT + coset LDE#576
ColoCarletti merged 9 commits into
mainfrom
feat/cuda-pr1a-math-cuda-foundation

Conversation

@ColoCarletti
Copy link
Copy Markdown
Collaborator

Summary

Introduces a new math-cuda crate that ships the foundation kernels for
GPU-accelerated polynomial work: Goldilocks/ext3 field arithmetic,
batched NTT (single-level + 8-level fused), and plain coset LDE entry
points. Adds a cuda feature flag to crypto/stark, opt-in and disabled
by default. No prover code consumes it yet — the kernels and dispatch
surface land here so future PRs can wire them into Round 1 LDE/commit,
Round 3 OOD evaluation, Round 4 DEEP/FRI, etc.

The whole crate is gated. CPU-only consumers (-p lambda-vm-prover,
-p cli) don't pull math-cuda in and don't require nvcc to build.

What's in

  • New crate crypto/math-cuda/ (~2.5k LoC):
    • kernels/{goldilocks.cuh, ext3.cuh, arith.cu, ntt.cu} — hand-written CUDA
    • src/device.rs — process-wide Backend singleton (32-stream pool,
      pinned host staging, twiddle cache, event-tracking disabled)
    • src/ntt.rs — host wrappers for forward/inverse NTT
    • src/lde.rscoset_lde_base, coset_lde_batch_base[_into],
      coset_lde_batch_ext3_into, evaluate_poly_coset_batch_ext3_into[_keep]
    • GpuLdeBase / GpuLdeExt3 device-handle types — kept on device
      across calls so downstream consumers can read LDE buffers without
      re-paying PCIe transfer
  • 30 parity tests asserting byte-identical u64 output vs the CPU
    reference: arith, ext3, NTT round-trip, single-column LDE,
    batched LDE, ext3 LDE, coset evaluate.
  • Workspace wiring: crypto/math-cuda added to [workspace] members,
    math-cuda = { path, optional = true } + cuda = ["dep:math-cuda"]
    feature added to crypto/stark.

Why this design

  • Feature-gated, not replacement. The cuda flag is opt-in. The CPU
    path is the default and untouched.
  • Parity tests are the contract. Every kernel produces bit-identical
    u64 output to the CPU reference. Tests assert on raw u64s, not on
    FieldElements, so non-canonical representations must match exactly.
  • Single backend, stream pool. One CUDA context per process; a
    32-stream round-robin pool lets rayon-parallel callers overlap kernel
    launches. Twiddles are computed once on host and uploaded once per
    log_n.
  • Device-resident LDE handles. GpuLdeBase / GpuLdeExt3 wrap an
    Arc<CudaSlice<u64>> so the device buffer survives past the call.
    This lets later passes read the LDE without H2D transferring it again
    (a 240 MB+ saving per round at prover scale).

Verification

Check Result
cargo build --workspace --release
cargo build -p stark --features cuda --release
cargo build -p lambda-vm-prover --release (no cuda, CPU consumer) ✓ — does not pull math-cuda
cargo test -p math-cuda --release 30 / 30 parity tests pass

Run on RTX 5090 (sm_120 / Blackwell), driver 595.58.03, CUDA 13.1
toolchain. PTX targets compute_89 (Ada virtual arch); the driver
JIT-compiles for the actual GPU at module load, so the same PTX runs on
Ada / Hopper / Blackwell.

Microbench (bench_quick.rs, --ignored)

Workload CPU rayon GPU Ratio
Single-col LDE 2^18, blowup=4 115 ms 13.6 ms 8.48×
Batched 64 cols, log_n=16, blowup=4 100 ms 13.0 ms 7.73×
Prover-scale 20 cols × 2^20, blowup=4 (median of 10) 344 ms 207 ms 1.66×

Single-shot timings (other than the median-of-10 row) are directional
only.

Test plan

  • cargo build --workspace --release builds clean
  • CPU-only consumer crates build without invoking nvcc
  • cargo build -p stark --features cuda compiles cleanly
  • Every crypto/math-cuda/tests/*.rs parity test passes (30 tests)
  • bench_quick runs end-to-end on the GPU

Notes for reviewers

  • cudarc 0.19 features: driver, nvrtc, std, cuda-12080, dynamic-loading.
    dynamic-loading means libcuda is loaded at runtime, not linked, so
    the binary is portable across driver versions. cuda-12080 selects the
    CUDA 12.8 ABI bindings; verified working against a CUDA 13.1 toolchain
    • driver 595.58.03.
  • Workspace builds require nvcc: because crypto/math-cuda is a
    workspace member, cargo build --workspace invokes its build.rs,
    which calls nvcc. Consumer-crate builds via cargo build -p <crate>
    for lambda-vm-prover/cli do not pull in math-cuda. CPU-only
    dev-loop: use -p <crate> instead of --workspace.
  • Parity contract: bit-identical u64 output to the CPU reference,
    not field-equal. Catches non-canonical-representation drift early.
  • GpuLdeBase / GpuLdeExt3 ship without direct test coverage in
    this PR — they're constructed only via the _keep variants and consumed
    by integrations that aren't here yet. The byte-equivalence contract on
    the LDE buffers themselves is enforced by the existing parity tests.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 6, 2026

Codex Code Review

Findings

  • High - Build blocker: Cargo.toml adds crypto/math-cuda as a default workspace member, but build.rs always invokes nvcc and panics if CUDA is unavailable. This means ordinary workspace builds/tests can fail on non-CUDA machines even though math-cuda is only an optional stark dependency. Consider excluding it from default workspace builds via default-members, or gating PTX generation behind an explicit CUDA build path.

  • Medium - Runtime error handling: device.rs panics on CUDA initialization failure despite all public APIs returning Result<T>. Missing drivers, no GPU, or context creation failure will abort caller control flow instead of returning a recoverable error. Store/return the initialization Result from backend() so callers can handle failure.

  • Low - Incorrect empty-input handling: Several functions check is_power_of_two() before their empty-input branch, so empty inputs panic even though the code appears to intend returning empty results. Example: ntt.rs and lde.rs. Move the n == 0 checks before the assertions or remove the dead empty branches.

No security vulnerabilities found in the changed lines.

I did not run the full test suite; nvcc is not installed in this environment, which is the build issue above.

Comment thread crypto/math-cuda/src/lde.rs
Comment thread crypto/math-cuda/src/ntt.rs
Comment thread crypto/math-cuda/src/device.rs
Comment thread crypto/math-cuda/src/lde.rs
@claude
Copy link
Copy Markdown

claude Bot commented May 6, 2026

Review: feat(cuda): math-cuda crate scaffolding — arith + NTT + coset LDE

Overview

Introduces a new math-cuda crate with CUDA-accelerated Goldilocks / ext3 field arithmetic, batched NTT, and coset LDE kernels for the STARK prover. Feature-gated (cuda), opt-in, CPU path untouched. 30 parity tests assert bit-identical u64 output against the CPU reference. Clean design overall.


Security / Correctness

[Medium] n as u32 silently truncates at maximum Goldilocks domain size
See inline comment on lde.rs:106. LaunchConfig::for_num_elems(lde_size as u32) — and all the equivalent (n as u32).div_ceil(256) grid computations — truncate to zero when n = 2^32 (the top of Goldilocks TWO_ADICITY). The affected kernel launches do nothing, producing silently wrong output rather than an error. Affects coset_lde_base, ntt_inplace, and the batched variants. Recommend u32::try_from(n).expect(...) or an upfront assertion.

[Low] twiddles_forward(0) / twiddles_inverse(0) panic
See inline comment on ntt.rs:24. 1usize << (log_n - 1) underflows for log_n = 0. Both functions are pub, so callers can trigger this. Easy fix: add assert!(log_n >= 1) at the top of each.

[Low] Unbounded log_n in cached_twiddles causes OOB panic
See inline comment on device.rs:216. guard[log_n as usize] panics for log_n >= 33. A debug_assert before the indexing would catch regressions without runtime cost in release builds.


Bugs / Edge Cases

[Low] Concurrent twiddle uploads share util_stream without a mutex
See inline comment on device.rs:216. Two threads computing twiddles for different log_n values both enqueue onto util_stream concurrently. CUDA serializes the operations correctly, but each thread's synchronize() waits for the other's H2D transfer too — invisible coupling that adds latency under parallel prover invocations.


Design

Code duplication in LDE pipeline
The 5-step sequence (bit-reverse -> iNTT -> pointwise-mul -> bit-reverse -> fNTT) is copy-pasted across coset_lde_batch_base, coset_lde_batch_base_into, coset_lde_batch_ext3_into, and evaluate_poly_coset_batch_ext3_into_inner. Each variant differs only in whether it has debug timing, where it writes output, and whether ext3 de-interleaving happens. A shared run_batched_lde_pipeline helper with callback-style pre/post steps would eliminate ~300 lines of near-identical unsafe kernel launches and reduce the risk of variants drifting.

GpuLdeBase / GpuLdeExt3 ship without any tests
PR notes this explicitly. They are constructed only through _keep variants that are themselves untested here. Recommend at least a smoke test that round-trips through evaluate_poly_coset_batch_ext3_into_keep and reads back a known value from the returned handle.

Debug timing only in coset_lde_batch_base
See inline comment on lde.rs:149. The MATH_CUDA_PHASE_TIMING env-var check also calls std::env::var (a syscall) on every invocation; consider a OnceLock<bool> flag cached at backend init.


What looks good

  • Parity tests are thorough and the bit-identical contract is the right level of precision.
  • Stream-pool round-robin and single shared pinned staging are well-motivated tradeoffs.
  • ext3.cuh dot3 overflow handling is correct: hi <= 2^64-2 (proven from 64x64 mul bounds) ensures hi + carry never wraps, so the two-overflow-bit accumulation is sound.
  • The rayon+raw-pointer parallel-copy pattern is safe: disjoint regions, staging lock held for the lifetime of the slices.
  • Feature gating is clean — CPU-only consumers are not affected.

Comment thread crypto/math-cuda/src/device.rs Outdated
Copy link
Copy Markdown
Contributor

@MauroToscano MauroToscano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cudarc cuda-13010 feature is too strict (real bug). Every test panics on CUDA-13.0 drivers because cudarc tries to dlsym cuDevSmResourceSplit, which only
exists in 13.1+. The driver here only exports cuDevSmResourceSplitByCount. Suggest cuda-13000 or feature-detect.

Default compute_89 PTX JIT-fails on Blackwell. Comment claims one PTX works across Ada/Hopper/Blackwell — empirically false with nvcc 13.0 + driver 580 +
RTX 5090, which rejects the PTX as "unsupported toolchain". Setting CUDARC_NVCC_ARCH=sm_120 fixes it. Either emit a fat binary or detect arch in build.rs.

The static note about "build.rs empty-PTX stub fallback" is wrong. No fallback exists; cargo check --workspace panics on no-nvcc CI runners. Per-crate non-cuda builds still work (math-cuda is optional = true).

@MauroToscano
Copy link
Copy Markdown
Contributor

cudarc cuda-13010 feature is too strict (real bug). Every test panics on CUDA-13.0 drivers because cudarc tries to dlsym cuDevSmResourceSplit, which only exists in 13.1+. The driver here only exports cuDevSmResourceSplitByCount. Suggest cuda-13000 or feature-detect.

Default compute_89 PTX JIT-fails on Blackwell. Comment claims one PTX works across Ada/Hopper/Blackwell — empirically false with nvcc 13.0 + driver 580 + RTX 5090, which rejects the PTX as "unsupported toolchain". Setting CUDARC_NVCC_ARCH=sm_120 fixes it. Either emit a fat binary or detect arch in build.rs.

The static note about "build.rs empty-PTX stub fallback" is wrong. No fallback exists; cargo check --workspace panics on no-nvcc CI runners. Per-crate non-cuda builds still work (math-cuda is optional = true).

Actually instead of setting it to 13000 check adding this:

cudarc = { version = "0.19", default-features = false, features = [
"driver",
"nvrtc",
"std",
"cuda-version-from-build-system", # auto-pick based on local nvcc
"fallback-latest", # don't panic if nvcc missing
"dynamic-loading",
] }

MauroToscano and others added 2 commits May 12, 2026 15:40
  - Add ext3_sub_kernel + ext3_sub_u64 wrapper (test infrastructure;
    the ext3::sub device function was previously unreachable from Rust).
  - Add tests/ext3_edge.rs: 7 adversarial tests for ext3::mul dot3
    overflow tracking — (p-1)^3, u64::MAX^3, non-canonical p
    representations, identity, and 98 base-field edge pairs in
    a/b/c slots.
  - Add tests/ext3_sub.rs: parity test for the new sub wrapper.
  - Add tests/ntt_known.rs: known-polynomial tests for p(x) = 1+x at
    sizes 16 and 256, and p(x) = x^(N/2) for the alternating ±1
    pattern.
  - Add tests/lde_batch_into.rs: direct parity test for
    coset_lde_batch_base_into vs coset_lde_batch_base.

  15 new tests total, all green on RTX 5090.
Comment thread crypto/math-cuda/src/lde.rs Outdated
Comment thread crypto/math-cuda/src/lde.rs Outdated
Comment thread crypto/math-cuda/src/lde.rs Outdated
Comment thread crypto/math-cuda/src/lde.rs Outdated
Comment thread crypto/math-cuda/src/lde.rs Outdated
Comment thread crypto/math-cuda/src/lde.rs Outdated
Comment thread crypto/math-cuda/src/lde.rs Outdated
Comment thread crypto/math-cuda/src/lde.rs Outdated
Comment thread crypto/math-cuda/tests/ext3_edge.rs Outdated
Comment thread crypto/math-cuda/tests/ext3_sub.rs Outdated
Comment thread crypto/math-cuda/tests/lde.rs Outdated
Comment thread crypto/math-cuda/tests/lde_batch_into.rs Outdated
Comment thread crypto/math-cuda/tests/ntt.rs Outdated
Comment thread crypto/math-cuda/tests/ntt.rs Outdated
Comment thread crypto/math-cuda/tests/ntt_known.rs Outdated
Comment thread crypto/math-cuda/tests/ntt_known.rs
Comment thread crypto/math-cuda/build.rs Outdated
Comment thread crypto/math-cuda/build.rs Outdated
Comment thread crypto/math-cuda/build.rs Outdated
Copy link
Copy Markdown
Collaborator

@gabrielbosio gabrielbosio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if this PR adds some make commands to run tests from math-cuda. #575 already has those.

Comment thread crypto/math-cuda/kernels/ntt.cu Outdated
Comment thread crypto/math-cuda/src/ntt.rs Outdated
Comment thread crypto/math-cuda/tests/lde.rs Outdated
@ColoCarletti ColoCarletti requested a review from MauroToscano May 13, 2026 15:03
@ColoCarletti ColoCarletti added this pull request to the merge queue May 13, 2026
Merged via the queue into main with commit 8ebca95 May 13, 2026
11 checks passed
@ColoCarletti ColoCarletti deleted the feat/cuda-pr1a-math-cuda-foundation branch May 13, 2026 19:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants