feat(cuda): math-cuda crate scaffolding — arith + NTT + coset LDE#576
Codex Code Review
Findings
No security vulnerabilities found in the changed lines. I did not run the full test suite;
Review: feat(cuda): math-cuda crate scaffolding - arith + NTT + coset LDE
Overview
Introduces a new
Security / Correctness
- [Medium]
- [Low]
- [Low] Unbounded
Bugs / Edge Cases
- [Low] Concurrent twiddle uploads share
Design
- Code duplication in LDE pipeline
- Debug timing only in
What looks good
MauroToscano
left a comment
- The cudarc cuda-13010 feature is too strict (real bug). Every test panics on CUDA 13.0 drivers because cudarc tries to dlsym cuDevSmResourceSplit, which only exists in 13.1+. The driver here only exports cuDevSmResourceSplitByCount. Suggest cuda-13000 or feature detection.
- The default compute_89 PTX JIT-fails on Blackwell. The comment claims one PTX works across Ada/Hopper/Blackwell, which is empirically false with nvcc 13.0 + driver 580 + RTX 5090: the driver rejects the PTX as "unsupported toolchain". Setting CUDARC_NVCC_ARCH=sm_120 fixes it. Either emit a fat binary or detect the arch in build.rs.
- The static note about the "build.rs empty-PTX stub fallback" is wrong. No fallback exists; cargo check --workspace panics on CI runners without nvcc. Per-crate non-cuda builds still work (math-cuda is optional = true).
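A minimal sketch of the "detect arch in build.rs" suggestion, assuming the build script is free to read CUDARC_NVCC_ARCH itself. The function names `nvcc_arch` and `arch_from_env` are hypothetical and not taken from this PR; only the environment variable name and the compute_89 default come from the discussion above.

```rust
/// Pick the value for nvcc's `-arch` flag.
/// Falls back to the PR's default virtual arch when no usable override is given.
/// (Hypothetical helper, not part of math-cuda.)
pub fn nvcc_arch(override_arch: Option<&str>) -> String {
    match override_arch.map(str::trim) {
        // Accept real (sm_XX) and virtual (compute_XX) architecture names.
        Some(a) if a.starts_with("sm_") || a.starts_with("compute_") => a.to_string(),
        // Unset or malformed: keep the default from the PR description.
        _ => "compute_89".to_string(),
    }
}

/// What a build.rs could call before assembling the nvcc command line.
pub fn arch_from_env() -> String {
    // Ask cargo to re-run the build script when the override changes.
    println!("cargo:rerun-if-env-changed=CUDARC_NVCC_ARCH");
    nvcc_arch(std::env::var("CUDARC_NVCC_ARCH").ok().as_deref())
}
```

With this shape, `CUDARC_NVCC_ARCH=sm_120 cargo build -p math-cuda` would select the Blackwell target, while an unset variable keeps the current behavior. Whether the PTX should instead be replaced by a fat binary is a separate decision that this sketch does not address.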
Actually, instead of setting it to 13000, check adding this: cudarc = { version = "0.19", default-features = false, features = [
- Add ext3_sub_kernel + ext3_sub_u64 wrapper (test infrastructure;
the ext3::sub device function was previously unreachable from Rust).
- Add tests/ext3_edge.rs: 7 adversarial tests for ext3::mul dot3
overflow tracking — (p-1)^3, u64::MAX^3, non-canonical p
representations, identity, and 98 base-field edge pairs in
a/b/c slots.
- Add tests/ext3_sub.rs: parity test for the new sub wrapper.
- Add tests/ntt_known.rs: known-polynomial tests for p(x) = 1+x at
sizes 16 and 256, and p(x) = x^(N/2) for the alternating ±1
pattern.
- Add tests/lde_batch_into.rs: direct parity test for
coset_lde_batch_base_into vs coset_lde_batch_base.
15 new tests total, all green on RTX 5090.
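To illustrate what the known-polynomial NTT tests check, here is a small CPU-only sketch using a naive O(n^2) transform over Goldilocks. It is not the code in tests/ntt_known.rs: all function names are hypothetical, the root of unity is derived from the generator 7, and the output is assumed to be in natural (not bit-reversed) order, which may differ from the GPU kernels.

```rust
/// Goldilocks prime p = 2^64 - 2^32 + 1.
pub const P: u64 = 0xFFFF_FFFF_0000_0001;

pub fn add(a: u64, b: u64) -> u64 {
    ((a as u128 + b as u128) % P as u128) as u64
}

pub fn mul(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % P as u128) as u64
}

/// Square-and-multiply exponentiation mod p.
pub fn pow(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1u64;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = mul(acc, base);
        }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}

/// Primitive n-th root of unity for a power-of-two n <= 2^32.
/// Assumes 7 generates the multiplicative group of the field.
pub fn root_of_unity(n: u64) -> u64 {
    pow(7, (P - 1) / n)
}

/// Reference transform: out[i] = sum_j coeffs[j] * w^(i*j), natural order.
pub fn naive_ntt(coeffs: &[u64]) -> Vec<u64> {
    let n = coeffs.len() as u64;
    let w = root_of_unity(n);
    (0..n)
        .map(|i| {
            let x = pow(w, i);
            // Horner evaluation of the polynomial at x = w^i.
            coeffs.iter().rev().fold(0u64, |acc, &c| add(mul(acc, x), c))
        })
        .collect()
}
```

Under these assumptions, p(x) = 1 + x transforms to 1 + w^i at index i, and p(x) = x^(N/2) transforms to the alternating pattern 1, p-1, 1, p-1, ... because w^(N/2) = -1.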
gabrielbosio
left a comment
It would be nice if this PR added some make commands to run the math-cuda tests. #575 already has those.
Summary
Introduces a new math-cuda crate that ships the foundation kernels for GPU-accelerated polynomial work: Goldilocks/ext3 field arithmetic, batched NTT (single-level + 8-level fused), and plain coset LDE entry points. Adds a cuda feature flag to crypto/stark, opt-in and disabled by default. No prover code consumes it yet; the kernels and dispatch surface land here so future PRs can wire them into Round 1 LDE/commit, Round 3 OOD evaluation, Round 4 DEEP/FRI, etc.
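As a rough picture of how an opt-in cuda feature keeps the CPU path as the default, the sketch below gates two definitions of the same function on the feature. The function name `lde_backend` and its return values are hypothetical; the PR itself does not wire any dispatch into the prover yet.

```rust
// Compiled only when the crate is built with `--features cuda`.
// A real implementation would call into math-cuda here.
#[cfg(feature = "cuda")]
pub fn lde_backend() -> &'static str {
    "gpu"
}

// Default build: no GPU dependency is referenced at all.
#[cfg(not(feature = "cuda"))]
pub fn lde_backend() -> &'static str {
    "cpu"
}
```

Because the GPU variant is removed at compile time when the feature is off, a default build never references the optional dependency, which is what lets CPU-only consumers build without nvcc.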
The whole crate is gated. CPU-only consumers (-p lambda-vm-prover, -p cli) don't pull math-cuda in and don't require nvcc to build.
What's in
- crypto/math-cuda/ (~2.5k LoC):
  - kernels/{goldilocks.cuh, ext3.cuh, arith.cu, ntt.cu}: hand-written CUDA
  - src/device.rs: process-wide Backend singleton (32-stream pool, pinned host staging, twiddle cache, event-tracking disabled)
  - src/ntt.rs: host wrappers for forward/inverse NTT
  - src/lde.rs: coset_lde_base, coset_lde_batch_base[_into], coset_lde_batch_ext3_into, evaluate_poly_coset_batch_ext3_into[_keep]
  - GpuLdeBase / GpuLdeExt3 device-handle types: kept on device across calls so downstream consumers can read LDE buffers without re-paying PCIe transfer
  - tests/*.rs: parity tests against the CPU reference: arith, ext3, NTT round-trip, single-column LDE, batched LDE, ext3 LDE, coset evaluate.
- crypto/math-cuda added to [workspace] members; math-cuda = { path, optional = true } + cuda = ["dep:math-cuda"] feature added to crypto/stark.
Why this design
- The cuda flag is opt-in. The CPU path is the default and untouched.
- Kernels are checked for parity of their raw u64 output to the CPU reference. Tests assert on raw u64s, not on FieldElements, so non-canonical representations must match exactly.
- A 32-stream round-robin pool lets rayon-parallel callers overlap kernel launches. Twiddles are computed once on host and uploaded once per log_n.
- GpuLdeBase / GpuLdeExt3 wrap an Arc<CudaSlice<u64>> so the device buffer survives past the call. This lets later passes read the LDE without H2D transferring it again (a 240 MB+ saving per round at prover scale).
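The round-robin stream pool can be pictured with a small std-only sketch. This is not the Backend type from src/device.rs: `RoundRobinPool` is a hypothetical generic stand-in, with plain values in place of CUDA streams, meant only to show how an atomic counter lets many threads pick slots without a lock.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in for a fixed pool of streams handed out in rotation.
pub struct RoundRobinPool<T> {
    slots: Vec<T>,
    next: AtomicUsize,
}

impl<T> RoundRobinPool<T> {
    pub fn new(slots: Vec<T>) -> Self {
        assert!(!slots.is_empty(), "pool needs at least one slot");
        Self { slots, next: AtomicUsize::new(0) }
    }

    /// Return the next slot. The counter is bumped atomically, so concurrent
    /// callers each get a distinct index until the pool wraps around.
    pub fn acquire(&self) -> &T {
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.slots.len();
        &self.slots[i]
    }
}
```

A pool built over 32 slots returns slot 0 again on the 33rd call. Relaxed ordering is enough here because the counter only selects a slot and does not publish any other data between threads.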
Verification
- cargo build --workspace --release
- cargo build -p stark --features cuda --release
- cargo build -p lambda-vm-prover --release (no cuda, CPU consumer)
- math-cuda: cargo test -p math-cuda --release
Run on RTX 5090 (sm_120 / Blackwell), driver 595.58.03, CUDA 13.1 toolchain. PTX targets compute_89 (Ada virtual arch); the driver JIT-compiles for the actual GPU at module load, so the same PTX runs on Ada / Hopper / Blackwell.
Microbench (bench_quick.rs, --ignored)
Single-shot timings (other than the median-of-10 row) are directional only.
Test plan
- cargo build --workspace --release builds clean
- cargo build -p stark --features cuda compiles cleanly
- crypto/math-cuda/tests/*.rs parity tests pass (30 tests)
- bench_quick runs end-to-end on the GPU
Notes for reviewers
- cudarc 0.19 features: driver, nvrtc, std, cuda-12080, dynamic-loading. dynamic-loading means libcuda is loaded at runtime, not linked, so the binary is portable across driver versions. cuda-12080 selects the CUDA 12.8 ABI bindings; verified working against a CUDA 13.1 toolchain.
- Because crypto/math-cuda is a workspace member, cargo build --workspace invokes its build.rs, which calls nvcc. Consumer-crate builds via cargo build -p <crate> for lambda-vm-prover / cli do not pull in math-cuda. For a CPU-only dev loop, use -p <crate> instead of --workspace.
- Parity tests are byte-equal, not field-equal. This catches non-canonical-representation drift early.
- GpuLdeBase / GpuLdeExt3 ship without direct test coverage in this PR: they're constructed only via the _keep variants and consumed by integrations that aren't here yet. The byte-equivalence contract on the LDE buffers themselves is enforced by the existing parity tests.
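The difference between byte-equal and field-equal can be made concrete with a short sketch. The helper names below are hypothetical and not the crate's test utilities; the only fact assumed is the Goldilocks modulus p = 2^64 - 2^32 + 1, under which a u64 in the range [p, 2^64) is a non-canonical encoding of a smaller value.

```rust
/// Goldilocks prime p = 2^64 - 2^32 + 1.
pub const P: u64 = 0xFFFF_FFFF_0000_0001;

/// Reduce a possibly non-canonical u64 into [0, p).
pub fn canonical(x: u64) -> u64 {
    if x >= P { x - P } else { x }
}

/// Field equality: equal after reduction mod p.
pub fn field_eq(a: u64, b: u64) -> bool {
    canonical(a) == canonical(b)
}

/// Byte equality over whole buffers: index of the first raw u64 that differs.
pub fn first_byte_mismatch(gpu: &[u64], cpu: &[u64]) -> Option<usize> {
    assert_eq!(gpu.len(), cpu.len(), "buffers must have the same length");
    gpu.iter().zip(cpu).position(|(g, c)| g != c)
}
```

For example, 0 and p are the same field element, so a field-equal comparison would accept a GPU result of p where the CPU produced 0, while the byte-equal check reports a mismatch at that index.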