feat(cuda): math-cuda crate scaffolding — arith + NTT + coset LDE by ColoCarletti · Pull Request #576 · yetanotherco/lambda_vm

ColoCarletti · 2026-05-06T18:21:32Z

Summary

Introduces a new math-cuda crate that ships the foundation kernels for
GPU-accelerated polynomial work: Goldilocks/ext3 field arithmetic,
batched NTT (single-level + 8-level fused), and plain coset LDE entry
points. Adds a cuda feature flag to crypto/stark, opt-in and disabled
by default. No prover code consumes it yet — the kernels and dispatch
surface land here so future PRs can wire them into Round 1 LDE/commit,
Round 3 OOD evaluation, Round 4 DEEP/FRI, etc.

The whole crate is gated. CPU-only consumers (-p lambda-vm-prover,
-p cli) don't pull math-cuda in and don't require nvcc to build.

What's in

New crate crypto/math-cuda/ (~2.5k LoC):
- kernels/{goldilocks.cuh, ext3.cuh, arith.cu, ntt.cu} — hand-written CUDA
- src/device.rs — process-wide Backend singleton (32-stream pool,
  pinned host staging, twiddle cache, event-tracking disabled)
- src/ntt.rs — host wrappers for forward/inverse NTT
- src/lde.rs — coset_lde_base, coset_lde_batch_base[_into],
  coset_lde_batch_ext3_into, evaluate_poly_coset_batch_ext3_into[_keep]
- GpuLdeBase / GpuLdeExt3 device-handle types — kept on device
  across calls so downstream consumers can read LDE buffers without
  re-paying PCIe transfer
30 parity tests asserting byte-identical u64 output vs the CPU
reference: arith, ext3, NTT round-trip, single-column LDE,
batched LDE, ext3 LDE, coset evaluate.
Workspace wiring: crypto/math-cuda added to [workspace] members,
math-cuda = { path, optional = true } + cuda = ["dep:math-cuda"]
feature added to crypto/stark.

Why this design

Feature-gated, not replacement. The cuda flag is opt-in. The CPU
path is the default and untouched.
Parity tests are the contract. Every kernel produces bit-identical
u64 output to the CPU reference. Tests assert on raw u64s, not on
FieldElements, so non-canonical representations must match exactly.
Single backend, stream pool. One CUDA context per process; a
32-stream round-robin pool lets rayon-parallel callers overlap kernel
launches. Twiddles are computed once on host and uploaded once per
log_n.
Device-resident LDE handles. GpuLdeBase / GpuLdeExt3 wrap an
Arc<CudaSlice<u64>> so the device buffer survives past the call.
This lets later passes read the LDE without H2D transferring it again
(a 240 MB+ saving per round at prover scale).

Verification

Check	Result
`cargo build --workspace --release`	✓
`cargo build -p stark --features cuda --release`	✓
`cargo build -p lambda-vm-prover --release` (no cuda, CPU consumer)	✓ — does not pull `math-cuda`
`cargo test -p math-cuda --release`	✓ 30 / 30 parity tests pass

Run on RTX 5090 (sm_120 / Blackwell), driver 595.58.03, CUDA 13.1
toolchain. PTX targets compute_89 (Ada virtual arch); the driver
JIT-compiles for the actual GPU at module load, so the same PTX runs on
Ada / Hopper / Blackwell.

Microbench (`bench_quick.rs`, `--ignored`)

Workload	CPU rayon	GPU	Ratio
Single-col LDE 2^18, blowup=4	115 ms	13.6 ms	8.48×
Batched 64 cols, log_n=16, blowup=4	100 ms	13.0 ms	7.73×
Prover-scale 20 cols × 2^20, blowup=4 (median of 10)	344 ms	207 ms	1.66×

Single-shot timings (other than the median-of-10 row) are directional
only.

Test plan

cargo build --workspace --release builds clean
CPU-only consumer crates build without invoking nvcc
cargo build -p stark --features cuda compiles cleanly
Every crypto/math-cuda/tests/*.rs parity test passes (30 tests)
bench_quick runs end-to-end on the GPU

Notes for reviewers

cudarc 0.19 features: driver, nvrtc, std, cuda-12080, dynamic-loading.
dynamic-loading means libcuda is loaded at runtime, not linked, so
the binary is portable across driver versions. cuda-12080 selects the
CUDA 12.8 ABI bindings; verified working against a CUDA 13.1 toolchain
- driver 595.58.03.
Workspace builds require nvcc: because crypto/math-cuda is a
workspace member, cargo build --workspace invokes its build.rs,
which calls nvcc. Consumer-crate builds via cargo build -p <crate>
for lambda-vm-prover/cli do not pull in math-cuda. CPU-only
dev-loop: use -p <crate> instead of --workspace.
Parity contract: bit-identical u64 output to the CPU reference,
not field-equal. Catches non-canonical-representation drift early.
GpuLdeBase / GpuLdeExt3 ship without direct test coverage in
this PR — they're constructed only via the _keep variants and consumed
by integrations that aren't here yet. The byte-equivalence contract on
the LDE buffers themselves is enforced by the existing parity tests.

github-actions · 2026-05-06T18:24:04Z

Codex Code Review

Findings

High - Build blocker: Cargo.toml adds crypto/math-cuda as a default workspace member, but build.rs always invokes nvcc and panics if CUDA is unavailable. This means ordinary workspace builds/tests can fail on non-CUDA machines even though math-cuda is only an optional stark dependency. Consider excluding it from default workspace builds via default-members, or gating PTX generation behind an explicit CUDA build path.
Medium - Runtime error handling: device.rs panics on CUDA initialization failure despite all public APIs returning Result<T>. Missing drivers, no GPU, or context creation failure will abort caller control flow instead of returning a recoverable error. Store/return the initialization Result from backend() so callers can handle failure.
Low - Incorrect empty-input handling: Several functions check is_power_of_two() before their empty-input branch, so empty inputs panic even though the code appears to intend returning empty results. Example: ntt.rs and lde.rs. Move the n == 0 checks before the assertions or remove the dead empty branches.

No security vulnerabilities found in the changed lines.

I did not run the full test suite; nvcc is not installed in this environment, which is the build issue above.

claude · 2026-05-06T18:32:10Z

Review: feat(cuda): math-cuda crate scaffolding — arith + NTT + coset LDE

Overview

Introduces a new math-cuda crate with CUDA-accelerated Goldilocks / ext3 field arithmetic, batched NTT, and coset LDE kernels for the STARK prover. Feature-gated (cuda), opt-in, CPU path untouched. 30 parity tests assert bit-identical u64 output against the CPU reference. Clean design overall.

Security / Correctness

[Medium] n as u32 silently truncates at maximum Goldilocks domain size
See inline comment on lde.rs:106. LaunchConfig::for_num_elems(lde_size as u32) — and all the equivalent (n as u32).div_ceil(256) grid computations — truncate to zero when n = 2^32 (the top of Goldilocks TWO_ADICITY). The affected kernel launches do nothing, producing silently wrong output rather than an error. Affects coset_lde_base, ntt_inplace, and the batched variants. Recommend u32::try_from(n).expect(...) or an upfront assertion.

[Low] twiddles_forward(0) / twiddles_inverse(0) panic
See inline comment on ntt.rs:24. 1usize << (log_n - 1) underflows for log_n = 0. Both functions are pub, so callers can trigger this. Easy fix: add assert!(log_n >= 1) at the top of each.

[Low] Unbounded log_n in cached_twiddles causes OOB panic
See inline comment on device.rs:216. guard[log_n as usize] panics for log_n >= 33. A debug_assert before the indexing would catch regressions without runtime cost in release builds.

Bugs / Edge Cases

[Low] Concurrent twiddle uploads share util_stream without a mutex
See inline comment on device.rs:216. Two threads computing twiddles for different log_n values both enqueue onto util_stream concurrently. CUDA serializes the operations correctly, but each thread's synchronize() waits for the other's H2D transfer too — invisible coupling that adds latency under parallel prover invocations.

Design

Code duplication in LDE pipeline
The 5-step sequence (bit-reverse -> iNTT -> pointwise-mul -> bit-reverse -> fNTT) is copy-pasted across coset_lde_batch_base, coset_lde_batch_base_into, coset_lde_batch_ext3_into, and evaluate_poly_coset_batch_ext3_into_inner. Each variant differs only in whether it has debug timing, where it writes output, and whether ext3 de-interleaving happens. A shared run_batched_lde_pipeline helper with callback-style pre/post steps would eliminate ~300 lines of near-identical unsafe kernel launches and reduce the risk of variants drifting.

GpuLdeBase / GpuLdeExt3 ship without any tests
PR notes this explicitly. They are constructed only through _keep variants that are themselves untested here. Recommend at least a smoke test that round-trips through evaluate_poly_coset_batch_ext3_into_keep and reads back a known value from the returned handle.

Debug timing only in coset_lde_batch_base
See inline comment on lde.rs:149. The MATH_CUDA_PHASE_TIMING env-var check also calls std::env::var (a syscall) on every invocation; consider a OnceLock<bool> flag cached at backend init.

What looks good

Parity tests are thorough and the bit-identical contract is the right level of precision.
Stream-pool round-robin and single shared pinned staging are well-motivated tradeoffs.
ext3.cuh dot3 overflow handling is correct: hi <= 2^64-2 (proven from 64x64 mul bounds) ensures hi + carry never wraps, so the two-overflow-bit accumulation is sound.
The rayon+raw-pointer parallel-copy pattern is safe: disjoint regions, staging lock held for the lifetime of the slices.
Feature gating is clean — CPU-only consumers are not affected.

MauroToscano

cudarc cuda-13010 feature is too strict (real bug). Every test panics on CUDA-13.0 drivers because cudarc tries to dlsym cuDevSmResourceSplit, which only
exists in 13.1+. The driver here only exports cuDevSmResourceSplitByCount. Suggest cuda-13000 or feature-detect.

Default compute_89 PTX JIT-fails on Blackwell. Comment claims one PTX works across Ada/Hopper/Blackwell — empirically false with nvcc 13.0 + driver 580 +
RTX 5090, which rejects the PTX as "unsupported toolchain". Setting CUDARC_NVCC_ARCH=sm_120 fixes it. Either emit a fat binary or detect arch in build.rs.

The static note about "build.rs empty-PTX stub fallback" is wrong. No fallback exists; cargo check --workspace panics on no-nvcc CI runners. Per-crate non-cuda builds still work (math-cuda is optional = true).

MauroToscano · 2026-05-12T13:21:08Z

cudarc cuda-13010 feature is too strict (real bug). Every test panics on CUDA-13.0 drivers because cudarc tries to dlsym cuDevSmResourceSplit, which only exists in 13.1+. The driver here only exports cuDevSmResourceSplitByCount. Suggest cuda-13000 or feature-detect.

Default compute_89 PTX JIT-fails on Blackwell. Comment claims one PTX works across Ada/Hopper/Blackwell — empirically false with nvcc 13.0 + driver 580 + RTX 5090, which rejects the PTX as "unsupported toolchain". Setting CUDARC_NVCC_ARCH=sm_120 fixes it. Either emit a fat binary or detect arch in build.rs.

The static note about "build.rs empty-PTX stub fallback" is wrong. No fallback exists; cargo check --workspace panics on no-nvcc CI runners. Per-crate non-cuda builds still work (math-cuda is optional = true).

Actually instead of setting it to 13000 check adding this:

cudarc = { version = "0.19", default-features = false, features = [
"driver",
"nvrtc",
"std",
"cuda-version-from-build-system", # auto-pick based on local nvcc
"fallback-latest", # don't panic if nvcc missing
"dynamic-loading",
] }

- Add ext3_sub_kernel + ext3_sub_u64 wrapper (test infrastructure; the ext3::sub device function was previously unreachable from Rust). - Add tests/ext3_edge.rs: 7 adversarial tests for ext3::mul dot3 overflow tracking — (p-1)^3, u64::MAX^3, non-canonical p representations, identity, and 98 base-field edge pairs in a/b/c slots. - Add tests/ext3_sub.rs: parity test for the new sub wrapper. - Add tests/ntt_known.rs: known-polynomial tests for p(x) = 1+x at sizes 16 and 256, and p(x) = x^(N/2) for the alternating ±1 pattern. - Add tests/lde_batch_into.rs: direct parity test for coset_lde_batch_base_into vs coset_lde_batch_base. 15 new tests total, all green on RTX 5090.

gabrielbosio

It would be nice if this PR adds some make commands to run tests from math-cuda. #575 already has those.

add first cuda files

d1a0abf

fmt

79634ff

claude Bot reviewed May 6, 2026

View reviewed changes

Comment thread crypto/math-cuda/src/lde.rs

claude Bot reviewed May 6, 2026

View reviewed changes

Comment thread crypto/math-cuda/src/ntt.rs

claude Bot reviewed May 6, 2026

View reviewed changes

Comment thread crypto/math-cuda/src/device.rs

claude Bot reviewed May 6, 2026

View reviewed changes

Comment thread crypto/math-cuda/src/lde.rs

fix clippy

ac6fbb5

ColoCarletti mentioned this pull request May 8, 2026

feat(cuda): Round 1 GPU LDE+commit dispatch + device-resident handles #582

Open

ColoCarletti added 2 commits May 8, 2026 12:28

fix comments

0944a83

fix comments

12292ce

gabrielbosio reviewed May 11, 2026

View reviewed changes

Comment thread crypto/math-cuda/src/device.rs Outdated

MauroToscano requested changes May 12, 2026

View reviewed changes

MauroToscano and others added 2 commits May 12, 2026 15:40

Merge branch 'main' into feat/cuda-pr1a-math-cuda-foundation

5e3932c

gabrielbosio reviewed May 12, 2026

View reviewed changes

Comment thread crypto/math-cuda/src/lde.rs Outdated

gabrielbosio reviewed May 12, 2026

View reviewed changes

Comment thread crypto/math-cuda/src/lde.rs Outdated

gabrielbosio reviewed May 12, 2026

View reviewed changes

Comment thread crypto/math-cuda/src/lde.rs Outdated

gabrielbosio reviewed May 12, 2026

View reviewed changes

Comment thread crypto/math-cuda/src/lde.rs Outdated

gabrielbosio reviewed May 12, 2026

View reviewed changes

Comment thread crypto/math-cuda/src/lde.rs Outdated

gabrielbosio reviewed May 12, 2026

View reviewed changes

Comment thread crypto/math-cuda/src/lde.rs Outdated

gabrielbosio reviewed May 12, 2026

View reviewed changes

Comment thread crypto/math-cuda/src/lde.rs Outdated

gabrielbosio reviewed May 12, 2026

View reviewed changes

Comment thread crypto/math-cuda/src/lde.rs Outdated