Skip to content

feat(cuda): Round 1 GPU LDE+commit dispatch + device-resident handles#582

Open
ColoCarletti wants to merge 20 commits into
mainfrom
feat/cuda-pr2-r1-gpu-commits
Open

feat(cuda): Round 1 GPU LDE+commit dispatch + device-resident handles#582
ColoCarletti wants to merge 20 commits into
mainfrom
feat/cuda-pr2-r1-gpu-commits

Conversation

@ColoCarletti
Copy link
Copy Markdown
Collaborator

Summary

Wires crypto/stark into the math-cuda crate (PR-1a + PR-1b) for the first time: Round 1 main + aux trace LDE + Merkle commits run on GPU when the cuda feature is
enabled and the table is above threshold. The committed LDE buffer is kept on device as a GpuLdeBase / GpuLdeExt3 handle, attached to LDETraceTable so future
work (R3 OOD, R4 DEEP, R4 FRI) can read the LDE without re-paying PCIe transfer.

The cuda feature stays opt-in. The CPU path is the default and untouched. With the feature on, the dispatch falls through to CPU when the type isn't
Goldilocks/ext3, the size is below threshold, or the GPU path returns None.

This is the structural piece — the headline wall-time win arrives in later rounds (R4 DEEP + FRI), but the device-resident handle plumbing has to land first.

What's in

  • New file crypto/stark/src/gpu_lde.rs (~960 LoC) — dispatch surface: try_expand_columns_batched, try_expand_leaf_and_tree_batched_keep (R1 main),
    try_expand_leaf_and_tree_batched_ext3_keep (R1 aux), try_extend_two_halves_gpu (R2 quotient extend, dormant for current sizes but ships now), threshold + atomic
    call counters (LAMBDA_VM_GPU_LDE_THRESHOLD, default 2^19).
  • prover.rs (~250 LoC of changes): commit_main_trace / commit_preprocessed_trace return type refactored from a 6-tuple to a MainTraceCommitResult struct;
    Lde<F, FE> gains #[cfg(feature = "cuda")] gpu_main / gpu_aux fields; dispatch sites in expand_columns_to_lde and the multi_prove aux-build chunk; GPU handles
    thread through multi_prove into Round1Commitments and LDETraceTable.
  • trace.rs (~40 LoC): LDETraceTable gains gpu_main / gpu_aux: Option<...> fields + setters/accessors. R3 GPU branches are NOT added here — they'll land alongside
    the R3 GPU dispatch.
  • prover/tests/bench_single.rs (new) — small #[ignore]'d prove-fib bench so reviewers can verify R1 wall time on the GPU path.
  • crypto/crypto/src/merkle_tree/merkle.rs (+24 LoC) — small helper to construct a BatchedMerkleTree from a flat device-produced node buffer.

Verification

Check Result
cargo build --workspace --release
cargo build -p stark --features cuda --release
cargo build -p lambda-vm-prover --release (no cuda) ✓ — does not pull math-cuda
cargo fmt --check --all
cargo clippy --workspace --all-targets -- -D warnings -A clippy::op_ref
cargo test -p math-cuda --release 35 / 35 parity tests pass
cargo test -p lambda-vm-prover --features stark/cuda --release -- --test-threads=1 339 / 339

Run on RTX 5090, driver 595.58.03, CUDA 13.1.

Known limitation: parallel cuda tests deadlock

cargo test -p lambda-vm-prover --features cuda deadlocks under default rayon parallelism (rayon-on-rayon contention while holding math-cuda's pinned_staging
Mutex). Workaround: run with --test-threads=1. Inherited from the source CUDA work; proper fix lives on the math-cuda side and is out of scope here.

Continuation of

Builds on PR-1a (#576) and PR-1b (#578). Base branch is feat/cuda-pr1b-keccak-merkle, not main.

Comment thread crypto/stark/src/gpu_lde.rs
Comment thread crypto/stark/src/gpu_lde.rs
Comment thread crypto/stark/src/gpu_lde.rs
Comment thread crypto/stark/src/gpu_lde.rs
Comment thread crypto/crypto/src/merkle_tree/merkle.rs Outdated
Comment thread crypto/stark/src/prover.rs
@claude
Copy link
Copy Markdown

claude Bot commented May 8, 2026

Review: feat(cuda) Round 1 GPU LDE+commit dispatch

Good structural work — the CPU fallback is clean, feature-gating is consistent, and the fused LDE+leaf-hash+tree pipeline avoids the extra H2D that a separate step would need. A few issues need attention before merge.

High

1. type_name used to justify unsafe memory casts (see inline on gpu_lde.rs:99)
type_name::<T>() is explicitly documented as not stable across compiler versions and not suitable for type identity. Every unsafe cast in the file (*(e.value() as *const _ as *const u64), from_raw_parts on ext3 slices) is guarded solely by these string comparisons. A sealed-trait approach makes the same dispatch sound at compile time and removes the runtime fragility.

2. debug_assert protecting unsafe set_len (see inline on gpu_lde.rs:130)
debug_assert! is a no-op in release builds. Every col.set_len(lde_size) pattern in the file relies on the capacity precondition being true; if a caller regresses, this is UB (GPU writes past the allocation) with no diagnostic in production. Use a hard assert! or restructure with spare_capacity_mut + MaybeUninit.

Medium

3. GPU kernel failures panic (see inline on gpu_lde.rs:156)
All GPU calls end with .expect("..."). A transient CUDA error (OOM, context loss) kills the process. The functions should return Result so errors surface as ProvingError instead.

4. squared_offset parameter silently ignored (see inline on gpu_lde.rs:204)
try_extend_two_halves_gpu accepts squared_offset but discards it and rebuilds the offset from domain.coset_offset. This silently produces wrong weights if a future caller passes a different squared offset.

5. from_precomputed_nodes accepts unverified GPU output (see inline on merkle.rs:78)
Only the length is validated. A kernel bug or CUDA memory corruption produces a structurally-valid but cryptographically-wrong commitment with no detection. At minimum add a debug_assert-mode spot-check of the root node.

Low

6. AuxResult tuple uses () as phantom fourth field (see inline on prover.rs:1836)
Adds noise and forces #[allow(clippy::let_unit_value)]. Mirroring the main_gpu_handles side-vector pattern is cleaner.

7. Pervasive code duplication across gpu_lde.rs
try_expand_leaf_and_tree_batched, _keep, _ext3, _ext3_keep, try_expand_and_leaf_hash_batched, _ext3 share identical prologue blocks (~50 lines each). A small internal helper that validates preconditions and extracts the raw slices would halve the file size and make future changes safer.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 8, 2026

Codex Code Review

Findings

  • Medium Security: gpu_lde.rs uses type_name checks plus raw pointer casts to reinterpret Vec<FieldElement<E>> as &mut [u64]. Rust does not guarantee FieldElement<GoldilocksField> has u64 layout because FieldElement is not #[repr(transparent)]; the ext3 paths make the same assumption for [u64; 3]. This is unsafe layout-dependent code and can become UB if the type layout changes. Fix by making the layout contract explicit with repr(transparent) and size/alignment assertions, or by using explicit conversion buffers.

  • Bug: prover.rs constructs Lde { main, aux } inside #[cfg(feature = "debug-checks")], but Lde has extra gpu_main and gpu_aux fields under feature = "cuda". Building with both cuda and debug-checks will fail. Add cfg-gated None fields there.

I could not run cargo check; rustup failed creating temp files under /home/runner/.rustup because that path is read-only in this environment.

@ColoCarletti ColoCarletti marked this pull request as ready for review May 13, 2026 21:07
@ColoCarletti ColoCarletti marked this pull request as draft May 13, 2026 21:08
Comment thread crypto/stark/src/gpu_lde.rs
@github-actions
Copy link
Copy Markdown

Codex Code Review

Findings:

  1. [Medium][Security] Unsafe layout casts are not backed by a Rust layout guarantee
    crypto/stark/src/gpu_lde.rs repeatedly treats FieldElement<GoldilocksField> / FieldElement<Degree3GoldilocksExtensionField> memory as u64 slices based on type_name checks. type_name is not a type/layout contract, and FieldElement is repr(Rust), so these raw casts can become UB if layout changes. This affects both reads and writes, including from_raw_parts_mut output views. Add explicit layout guarantees, e.g. #[repr(transparent)] where valid plus compile-time size/align assertions, or convert through safe field/component APIs.

  2. Bug: cuda + debug-checks feature combo does not construct Lde correctly
    crypto/stark/src/prover.rs builds Lde { main, aux } inside reconstruct_round1, but Lde has extra gpu_main and gpu_aux fields under #[cfg(feature = "cuda")] at crypto/stark/src/prover.rs. Enabling both features should fail to compile. Add #[cfg(feature = "cuda")] gpu_main: None, gpu_aux: None there.

I attempted cargo check -p stark --features 'cuda debug-checks', but this environment’s rustup temp dir is read-only, so I could not confirm via compiler output.

Comment thread crypto/stark/src/gpu_lde.rs
Comment thread crypto/stark/src/gpu_lde.rs Outdated
Comment thread crypto/stark/src/gpu_lde.rs Outdated
Comment thread crypto/crypto/src/merkle_tree/merkle.rs
@claude
Copy link
Copy Markdown

claude Bot commented May 13, 2026

Review summary

Five issues found across the new GPU dispatch layer and the Merkle helper.

High

  • type_name::<T>() is used as the sole proof that a generic type is memory-equivalent to u64 / [u64;3] before a pile of unsafe transmutes. The Rust docs explicitly say this string must not be considered a unique identifier of a type. The fix is a sealed unsafe trait IsU64Repr {} implemented only for the concrete types you control — the type system then enforces the invariant instead of a runtime string compare. All type_name + raw-pointer cast pairs throughout gpu_lde.rs share this issue.

Medium

  • debug_assert!(col.capacity() >= lde_size) immediately before unsafe { col.set_len(lde_size) } — the assert is stripped in --release, which is the only mode the GPU path runs. A capacity violation in release is silent UB. Use assert! or make the invariant structurally impossible to violate.
  • Every math_cuda call uses .expect(). A GPU OOM or driver error crashes the whole process; these should map to ProvingError and let the caller recover.

Low

  • try_extend_two_halves_gpu accepts squared_offset but immediately discards it (let _ = squared_offset), silently falling back to deriving the same value from domain. The parameter should either be removed or used — leaving it makes future callers believe they can influence the computation when they cannot.
  • from_precomputed_nodes validates only array length. A GPU bit-flip or kernel bug produces a structurally invalid MerkleTree with no diagnostic — consider a #[cfg(debug_assertions)] integrity walk to catch GPU correctness regressions in CI.

Comment thread crypto/crypto/src/merkle_tree/merkle.rs Outdated
Base automatically changed from feat/cuda-pr1b-keccak-merkle to main May 18, 2026 22:30
@ColoCarletti ColoCarletti marked this pull request as ready for review May 19, 2026 19:17
@github-actions
Copy link
Copy Markdown

Codex Code Review

Findings

  1. Build break: moved result is used after drop(result)
    prover/src/lib.rs:680 adds a second debug_assert_eq! that reads result.return_values after line 622 drops result. This will fail to compile. Remove the duplicate assertion or move it before drop(result).

  2. Build break with disk-spill: missing mmap_backing field
    crypto/crypto/src/merkle_tree/merkle.rs:147 constructs MerkleTree { root, nodes }, but under feature = "disk-spill" the struct also has mmap_backing. Add #[cfg(feature = "disk-spill")] mmap_backing: None.

  3. Build break with cuda + debug-checks
    crypto/stark/src/prover.rs:835 constructs Lde { main, aux }, but with cuda enabled Lde now also requires gpu_main and gpu_aux. Add None for both behind the same cfg.

  4. Test build break: missing Prover import
    prover/src/tests/prove_elfs_tests.rs:110 uses Prover::multi_prove without importing stark::prover::Prover.

No security vulnerabilities found in the reviewed diff beyond these compile-blocking correctness issues. I attempted cargo check -p lambda-vm-prover, but the sandbox cannot write to /home/runner/.rustup/tmp, so it could not run here.

@claude
Copy link
Copy Markdown

claude Bot commented May 19, 2026

Review: feat/cuda-pr2-r1-gpu-commits

This PR wires in GPU (CUDA) acceleration for Round 1 LDE + Merkle commitments via a new gpu_lde.rs (~960 lines) and refactors prover.rs to carry GPU handles through the commit pipeline. Issues found:


High

H1 — debug_assert! guards unsafe set_len — stripped in release builds
Every unsafe { col.set_len(lde_size) } is gated only by debug_assert!(col.capacity() >= lde_size). In release builds the assertion is compiled out. If a caller ever returns columns with insufficient capacity, set_len will point the Vec's length past its allocation — undefined behaviour (heap corruption). This is a safety invariant; use assert! unconditionally.
Affected lines: gpu_lde.rs:128, 297, 393, 482, 568, 656, 743, 822

H2 — type_name::<F>() used as a memory-safety gate
All unsafe casts and reinterpret-slices are enabled/disabled by comparing core::any::type_name::<F>() strings. The docs explicitly say the string is best-effort with no stability guarantee. The correct mechanism is std::any::TypeId::of::<F>() == TypeId::of::<GoldilocksField>(), which is the stable runtime type identity check. A type alias or a future compiler version could make two distinct types share a name, silently enabling wrong casts.
Affected lines: gpu_lde.rs:84, 97, 101, 190, 193, 273, 276, 364, 367, ...


Medium

M1 — GPU kernel failures panic the prover with no fallback
Every CUDA call terminates with .expect("GPU … failed"). A transient GPU error, OOM, or driver failure will panic the entire prover thread. These entry-point functions already return Option; they should return None (or Result) on kernel failure so the caller can fall back to the CPU path.

M2 — squared_offset parameter is silently discarded
try_extend_two_halves_gpu accepts squared_offset: &FieldElement<F> but immediately does let _ = squared_offset and re-derives the coset weight from domain.coset_offset. If a caller ever passes a domain whose coset_offset differs from the one used to compute squared_offset, the LDE output is silently wrong. Remove the parameter or use it.
Line: gpu_lde.rs:199

M3 — Plain * on total_nodes * 32 can overflow and silently truncate the slice
lde_size is computed with saturating_mul, but total_nodes = 2 * lde_size - 1 and then total_nodes * 32 use plain arithmetic. If lde_size is large (possible on a misconfigured domain), the multiplication wraps in release mode to a small number, the from_raw_parts_mut slice is smaller than the allocation, and the GPU writes past it — silent UB. Use checked_mul and return None on overflow.
Affected lines: gpu_lde.rs:412, 500, 586, 674, 760


Low

L1 — Only GPU_LDE_CALLS has a reset function
reset_gpu_lde_calls() resets GPU_LDE_CALLS but GPU_EXTEND_HALVES_CALLS, GPU_LEAF_HASH_CALLS, and GPU_MERKLE_TREE_CALLS accumulate forever. The bench test runs a warm-up pass then a profiled pass, so those three counters will be doubled relative to GPU_LDE_CALLS.

L2 — from_precomputed_nodes does not validate internal node consistency
The new constructor only checks that the node count forms a valid power-of-two tree. A GPU Merkle kernel bug producing an internally inconsistent tree (root doesn't match children) would be undetected — the prover commits to a corrupt tree and the verifier rejects the proof with no actionable diagnostic.

Comment thread crypto/stark/src/gpu_lde.rs Outdated
@ColoCarletti
Copy link
Copy Markdown
Collaborator Author

solved in 01aa5e4

Comment thread crypto/stark/src/gpu_lde.rs Outdated
Comment thread crypto/stark/src/gpu_lde.rs Outdated
Comment thread crypto/stark/src/gpu_lde.rs Outdated
ColoCarletti and others added 5 commits May 21, 2026 18:32
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Comment thread crypto/stark/src/gpu_lde.rs Outdated
Comment thread crypto/stark/src/gpu_lde.rs Outdated
Comment thread crypto/stark/src/gpu_lde.rs Outdated
Comment thread crypto/stark/src/gpu_lde.rs Outdated
Comment thread crypto/stark/src/gpu_lde.rs Outdated
Comment thread crypto/stark/src/gpu_lde.rs Outdated
Comment thread crypto/stark/src/gpu_lde.rs Outdated
Comment thread crypto/stark/src/gpu_lde.rs Outdated
Comment thread crypto/stark/src/gpu_lde.rs Outdated
Comment thread crypto/stark/src/gpu_lde.rs Outdated
Comment thread crypto/stark/src/gpu_lde.rs Outdated
Comment thread crypto/stark/src/gpu_lde.rs Outdated
Comment thread crypto/stark/src/gpu_lde.rs Outdated
Comment thread crypto/stark/src/gpu_lde.rs Outdated
Comment thread crypto/stark/src/gpu_lde.rs Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants