feat(cuda): Round 1 GPU LDE+commit dispatch + device-resident handles#582
ColoCarletti wants to merge 20 commits
Review: feat(cuda) Round 1 GPU LDE+commit dispatch
Good structural work: the CPU fallback is clean, feature-gating is consistent, and the fused LDE+leaf-hash+tree pipeline avoids the extra H2D that a separate step would need. A few issues need attention before merge.
High
1.
2.
Medium
3. GPU kernel failures panic (see inline on
4.
5.
Low
6.
7. Pervasive code duplication across
Codex Code Review
Findings
I could not run
Codex Code Review
Findings:
I attempted
Review summary
Five issues found across the new GPU dispatch layer and the Merkle helper.
High
Medium
Low
Codex Code Review
Findings
No security vulnerabilities found in the reviewed diff beyond these compile-blocking correctness issues. I attempted
Review: feat/cuda-pr2-r1-gpu-commits
This PR wires in GPU (CUDA) acceleration for Round 1 LDE + Merkle commitments via a new
High
H1
H2
Medium
M1: GPU kernel failures panic the prover with no fallback
M2
M3: Plain
Low
L1: Only
L2
solved in 01aa5e4
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Summary
Wires crypto/stark into the math-cuda crate (PR-1a + PR-1b) for the first time: Round 1 main + aux trace LDE + Merkle commits run on GPU when the cuda feature is
enabled and the table is above threshold. The committed LDE buffer is kept on device as a GpuLdeBase / GpuLdeExt3 handle, attached to LDETraceTable so future
work (R3 OOD, R4 DEEP, R4 FRI) can read the LDE without re-paying PCIe transfer.
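The idea of attaching an optional device-resident handle to the trace table can be sketched as below. This is a minimal illustration, not the PR's code: `LdeTraceTable`, its `main_columns` field, the `from_host` constructor, and the shape of `GpuLdeBase` are all simplified stand-ins, and storing the handle as an `Option` is an assumption.

```rust
/// Hypothetical stand-in for the device-resident base-field LDE handle.
/// Only compiled when the `cuda` feature is enabled.
#[cfg(feature = "cuda")]
#[derive(Debug)]
pub struct GpuLdeBase {
    /// Number of field elements held on the device (illustrative only).
    pub len: usize,
}

/// Simplified stand-in for the LDE trace table; the real type has more fields.
pub struct LdeTraceTable {
    /// Host-side LDE columns (assumed to be always present in this sketch).
    pub main_columns: Vec<Vec<u64>>,
    /// Device-resident copy, set only when the GPU path produced the LDE.
    #[cfg(feature = "cuda")]
    pub gpu_main: Option<GpuLdeBase>,
}

impl LdeTraceTable {
    /// Build a table from host data only, as the CPU path would.
    pub fn from_host(main_columns: Vec<Vec<u64>>) -> Self {
        Self {
            main_columns,
            #[cfg(feature = "cuda")]
            gpu_main: None,
        }
    }

    /// With the feature on, report whether a device handle is attached.
    #[cfg(feature = "cuda")]
    pub fn has_device_lde(&self) -> bool {
        self.gpu_main.is_some()
    }

    /// Without the feature, there is never a device handle.
    #[cfg(not(feature = "cuda"))]
    pub fn has_device_lde(&self) -> bool {
        false
    }
}
```

Because the field is compiled out when the feature is off, the default build keeps the same struct layout as before, which matches the claim that the CPU path is untouched.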
The cuda feature stays opt-in. The CPU path is the default and untouched. With the feature on, the dispatch falls through to CPU when the type isn't
Goldilocks/ext3, the size is below threshold, or the GPU path returns None.
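The fall-through rule described above can be sketched as follows. All names here (`commit_round1`, `try_gpu_commit`, `GpuHandle`, `CommitPath`) are hypothetical and exist only for illustration; the real dispatch sites and signatures live in the PR. Treating a size exactly equal to the threshold as eligible for the GPU is also an assumption.

```rust
/// Hypothetical stand-in for a device-resident LDE handle.
#[derive(Debug, PartialEq)]
pub struct GpuHandle {
    pub len: usize,
}

/// Which path produced the Round 1 commitment.
#[derive(Debug, PartialEq)]
pub enum CommitPath {
    Gpu(GpuHandle),
    Cpu,
}

/// Hypothetical GPU entry point: returns `None` when the GPU path declines.
/// The `gpu_available` flag simulates that outcome for the sketch.
fn try_gpu_commit(trace_len: usize, gpu_available: bool) -> Option<GpuHandle> {
    if gpu_available {
        Some(GpuHandle { len: trace_len })
    } else {
        None
    }
}

/// Try the GPU only for supported field types at or above the threshold;
/// every other case, including a `None` from the GPU, uses the CPU path.
pub fn commit_round1(
    trace_len: usize,
    is_supported_field: bool,
    threshold: usize,
    gpu_available: bool,
) -> CommitPath {
    if is_supported_field && trace_len >= threshold {
        if let Some(handle) = try_gpu_commit(trace_len, gpu_available) {
            return CommitPath::Gpu(handle);
        }
    }
    CommitPath::Cpu
}
```

For example, `commit_round1(1 << 20, true, 1 << 19, true)` takes the GPU branch in this sketch, while flipping any one of the three conditions lands on `CommitPath::Cpu`.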
This is the structural piece — the headline wall-time win arrives in later rounds (R4 DEEP + FRI), but the device-resident handle plumbing has to land first.
What's in
try_expand_leaf_and_tree_batched_ext3_keep (R1 aux), try_extend_two_halves_gpu (R2 quotient extend, dormant for current sizes but ships now), threshold + atomic
call counters (LAMBDA_VM_GPU_LDE_THRESHOLD, default 2^19).
Lde<F, FE> gains #[cfg(feature = "cuda")] gpu_main / gpu_aux fields; dispatch sites in expand_columns_to_lde and the multi_prove aux-build chunk; GPU handles
thread through multi_prove into Round1Commitments and LDETraceTable.
the R3 GPU dispatch.
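The threshold and call counters mentioned in the list above might look roughly like this. It is a sketch under assumptions: that `LAMBDA_VM_GPU_LDE_THRESHOLD` is an environment variable holding a raw element count (not a log2 value), and that invalid values fall back to the default. The names `DEFAULT_GPU_LDE_THRESHOLD`, `GPU_LDE_CALLS`, `parse_threshold`, `gpu_lde_threshold`, and `record_gpu_lde_call` are hypothetical.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Default threshold from the PR description: 2^19.
pub const DEFAULT_GPU_LDE_THRESHOLD: usize = 1 << 19;

/// Hypothetical process-wide counter of GPU LDE dispatches.
pub static GPU_LDE_CALLS: AtomicU64 = AtomicU64::new(0);

/// Parse an optional override; missing or invalid input yields the default.
pub fn parse_threshold(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .unwrap_or(DEFAULT_GPU_LDE_THRESHOLD)
}

/// Read the threshold, assuming the override comes from the environment.
pub fn gpu_lde_threshold() -> usize {
    parse_threshold(std::env::var("LAMBDA_VM_GPU_LDE_THRESHOLD").ok().as_deref())
}

/// Bump the counter and return the new total. Relaxed ordering is enough
/// because the counter does not synchronize any other memory.
pub fn record_gpu_lde_call() -> u64 {
    GPU_LDE_CALLS.fetch_add(1, Ordering::Relaxed) + 1
}
```

Keeping the parsing separate from the environment lookup makes the fallback behavior easy to check in a test without mutating process state.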
Verification
cargo build --workspace --release
cargo build -p stark --features cuda --release
cargo build -p lambda-vm-prover --release (no cuda)
math-cuda
cargo fmt --check --all
cargo clippy --workspace --all-targets -- -D warnings -A clippy::op_ref
cargo test -p math-cuda --release
cargo test -p lambda-vm-prover --features stark/cuda --release -- --test-threads=1
Run on RTX 5090, driver 595.58.03, CUDA 13.1.
Known limitation: parallel cuda tests deadlock
cargo test -p lambda-vm-prover --features cuda deadlocks under default rayon parallelism (rayon-on-rayon contention while holding math-cuda's pinned_staging
Mutex). Workaround: run with --test-threads=1. Inherited from the source CUDA work; proper fix lives on the math-cuda side and is out of scope here.
Continuation of
Builds on PR-1a (#576) and PR-1b (#578). Base branch is feat/cuda-pr1b-keccak-merkle, not main.