feat(cuda): Round 1 GPU LDE+commit dispatch + device-resident handles by ColoCarletti · Pull Request #582 · yetanotherco/lambda_vm

ColoCarletti · 2026-05-08T14:44:13Z

Summary

Wires crypto/stark into the math-cuda crate (PR-1a + PR-1b) for the first time: Round 1 main + aux trace LDE + Merkle commits run on GPU when the cuda feature is
enabled and the table is above threshold. The committed LDE buffer is kept on device as a GpuLdeBase / GpuLdeExt3 handle, attached to LDETraceTable so future
work (R3 OOD, R4 DEEP, R4 FRI) can read the LDE without re-paying PCIe transfer.

The cuda feature stays opt-in. The CPU path is the default and untouched. With the feature on, the dispatch falls through to CPU when the type isn't
Goldilocks/ext3, the size is below threshold, or the GPU path returns None.

This is the structural piece — the headline wall-time win arrives in later rounds (R4 DEEP + FRI), but the device-resident handle plumbing has to land first.

What's in

New file crypto/stark/src/gpu_lde.rs (~960 LoC) — dispatch surface: try_expand_columns_batched, try_expand_leaf_and_tree_batched_keep (R1 main),
try_expand_leaf_and_tree_batched_ext3_keep (R1 aux), try_extend_two_halves_gpu (R2 quotient extend, dormant for current sizes but ships now), threshold + atomic
call counters (LAMBDA_VM_GPU_LDE_THRESHOLD, default 2^19).
prover.rs (~250 LoC of changes): commit_main_trace / commit_preprocessed_trace return type refactored from a 6-tuple to a MainTraceCommitResult struct;
Lde<F, FE> gains #[cfg(feature = "cuda")] gpu_main / gpu_aux fields; dispatch sites in expand_columns_to_lde and the multi_prove aux-build chunk; GPU handles
thread through multi_prove into Round1Commitments and LDETraceTable.
trace.rs (~40 LoC): LDETraceTable gains gpu_main / gpu_aux: Option<...> fields + setters/accessors. R3 GPU branches are NOT added here — they'll land alongside
the R3 GPU dispatch.
prover/tests/bench_single.rs (new) — small #[ignore]'d prove-fib bench so reviewers can verify R1 wall time on the GPU path.
crypto/crypto/src/merkle_tree/merkle.rs (+24 LoC) — small helper to construct a BatchedMerkleTree from a flat device-produced node buffer.

Verification

Check	Result
`cargo build --workspace --release`	✓
`cargo build -p stark --features cuda --release`	✓
`cargo build -p lambda-vm-prover --release` (no cuda)	✓ — does not pull `math-cuda`
`cargo fmt --check --all`	✓
`cargo clippy --workspace --all-targets -- -D warnings -A clippy::op_ref`	✓
`cargo test -p math-cuda --release`	✓ 35 / 35 parity tests pass
`cargo test -p lambda-vm-prover --features stark/cuda --release -- --test-threads=1`	✓ 339 / 339

Run on RTX 5090, driver 595.58.03, CUDA 13.1.

Known limitation: parallel cuda tests deadlock

cargo test -p lambda-vm-prover --features cuda deadlocks under default rayon parallelism (rayon-on-rayon contention while holding math-cuda's pinned_staging
Mutex). Workaround: run with --test-threads=1. Inherited from the source CUDA work; proper fix lives on the math-cuda side and is out of scope here.

Continuation of

Builds on PR-1a (#576) and PR-1b (#578). Base branch is feat/cuda-pr1b-keccak-merkle, not main.

claude · 2026-05-08T14:47:14Z

Review: feat(cuda) Round 1 GPU LDE+commit dispatch

Good structural work — the CPU fallback is clean, feature-gating is consistent, and the fused LDE+leaf-hash+tree pipeline avoids the extra H2D that a separate step would need. A few issues need attention before merge.

High

1. type_name used to justify unsafe memory casts (see inline on gpu_lde.rs:99)
type_name::<T>() is explicitly documented as not stable across compiler versions and not suitable for type identity. Every unsafe cast in the file (*(e.value() as *const _ as *const u64), from_raw_parts on ext3 slices) is guarded solely by these string comparisons. A sealed-trait approach makes the same dispatch sound at compile time and removes the runtime fragility.

2. debug_assert protecting unsafe set_len (see inline on gpu_lde.rs:130)
debug_assert! is a no-op in release builds. Every col.set_len(lde_size) pattern in the file relies on the capacity precondition being true; if a caller regresses, this is UB (GPU writes past the allocation) with no diagnostic in production. Use a hard assert! or restructure with spare_capacity_mut + MaybeUninit.

Medium

3. GPU kernel failures panic (see inline on gpu_lde.rs:156)
All GPU calls end with .expect("..."). A transient CUDA error (OOM, context loss) kills the process. The functions should return Result so errors surface as ProvingError instead.

4. squared_offset parameter silently ignored (see inline on gpu_lde.rs:204)
try_extend_two_halves_gpu accepts squared_offset but discards it and rebuilds the offset from domain.coset_offset. This silently produces wrong weights if a future caller passes a different squared offset.

5. from_precomputed_nodes accepts unverified GPU output (see inline on merkle.rs:78)
Only the length is validated. A kernel bug or CUDA memory corruption produces a structurally-valid but cryptographically-wrong commitment with no detection. At minimum add a debug_assert-mode spot-check of the root node.

Low

6. AuxResult tuple uses () as phantom fourth field (see inline on prover.rs:1836)
Adds noise and forces #[allow(clippy::let_unit_value)]. Mirroring the main_gpu_handles side-vector pattern is cleaner.

7. Pervasive code duplication across gpu_lde.rs
try_expand_leaf_and_tree_batched, _keep, _ext3, _ext3_keep, try_expand_and_leaf_hash_batched, _ext3 share identical prologue blocks (~50 lines each). A small internal helper that validates preconditions and extracts the raw slices would halve the file size and make future changes safer.

github-actions · 2026-05-08T14:47:23Z

Codex Code Review

Findings

Medium Security: gpu_lde.rs uses type_name checks plus raw pointer casts to reinterpret Vec<FieldElement<E>> as &mut [u64]. Rust does not guarantee FieldElement<GoldilocksField> has u64 layout because FieldElement is not #[repr(transparent)]; the ext3 paths make the same assumption for [u64; 3]. This is unsafe layout-dependent code and can become UB if the type layout changes. Fix by making the layout contract explicit with repr(transparent) and size/alignment assertions, or by using explicit conversion buffers.
Bug: prover.rs constructs Lde { main, aux } inside #[cfg(feature = "debug-checks")], but Lde has extra gpu_main and gpu_aux fields under feature = "cuda". Building with both cuda and debug-checks will fail. Add cfg-gated None fields there.

I could not run cargo check; rustup failed creating temp files under /home/runner/.rustup because that path is read-only in this environment.

github-actions · 2026-05-13T21:10:53Z

Codex Code Review

Findings:

[Medium][Security] Unsafe layout casts are not backed by a Rust layout guarantee
crypto/stark/src/gpu_lde.rs repeatedly treats FieldElement<GoldilocksField> / FieldElement<Degree3GoldilocksExtensionField> memory as u64 slices based on type_name checks. type_name is not a type/layout contract, and FieldElement is repr(Rust), so these raw casts can become UB if layout changes. This affects both reads and writes, including from_raw_parts_mut output views. Add explicit layout guarantees, e.g. #[repr(transparent)] where valid plus compile-time size/align assertions, or convert through safe field/component APIs.
Bug: cuda + debug-checks feature combo does not construct Lde correctly
crypto/stark/src/prover.rs builds Lde { main, aux } inside reconstruct_round1, but Lde has extra gpu_main and gpu_aux fields under #[cfg(feature = "cuda")] at crypto/stark/src/prover.rs. Enabling both features should fail to compile. Add #[cfg(feature = "cuda")] gpu_main: None, gpu_aux: None there.

I attempted cargo check -p stark --features 'cuda debug-checks', but this environment’s rustup temp dir is read-only, so I could not confirm via compiler output.

claude · 2026-05-13T21:11:44Z

Review summary

Five issues found across the new GPU dispatch layer and the Merkle helper.

High

type_name::<T>() is used as the sole proof that a generic type is memory-equivalent to u64 / [u64;3] before a pile of unsafe transmutes. The Rust docs explicitly say this string must not be considered a unique identifier of a type. The fix is a sealed unsafe trait IsU64Repr {} implemented only for the concrete types you control — the type system then enforces the invariant instead of a runtime string compare. All type_name + raw-pointer cast pairs throughout gpu_lde.rs share this issue.

Medium

debug_assert!(col.capacity() >= lde_size) immediately before unsafe { col.set_len(lde_size) } — the assert is stripped in --release, which is the only mode the GPU path runs. A capacity violation in release is silent UB. Use assert! or make the invariant structurally impossible to violate.
Every math_cuda call uses .expect(). A GPU OOM or driver error crashes the whole process; these should map to ProvingError and let the caller recover.

Low

try_extend_two_halves_gpu accepts squared_offset but immediately discards it (let _ = squared_offset), silently falling back to deriving the same value from domain. The parameter should either be removed or used — leaving it makes future callers believe they can influence the computation when they cannot.
from_precomputed_nodes validates only array length. A GPU bit-flip or kernel bug produces a structurally invalid MerkleTree with no diagnostic — consider a #[cfg(debug_assertions)] integrity walk to catch GPU correctness regressions in CI.

github-actions · 2026-05-19T19:19:44Z

Codex Code Review

Findings

Build break: moved result is used after drop(result)
prover/src/lib.rs:680 adds a second debug_assert_eq! that reads result.return_values after line 622 drops result. This will fail to compile. Remove the duplicate assertion or move it before drop(result).
Build break with disk-spill: missing mmap_backing field
crypto/crypto/src/merkle_tree/merkle.rs:147 constructs MerkleTree { root, nodes }, but under feature = "disk-spill" the struct also has mmap_backing. Add #[cfg(feature = "disk-spill")] mmap_backing: None.
Build break with cuda + debug-checks
crypto/stark/src/prover.rs:835 constructs Lde { main, aux }, but with cuda enabled Lde now also requires gpu_main and gpu_aux. Add None for both behind the same cfg.
Test build break: missing Prover import
prover/src/tests/prove_elfs_tests.rs:110 uses Prover::multi_prove without importing stark::prover::Prover.

No security vulnerabilities found in the reviewed diff beyond these compile-blocking correctness issues. I attempted cargo check -p lambda-vm-prover, but the sandbox cannot write to /home/runner/.rustup/tmp, so it could not run here.

claude · 2026-05-19T19:23:03Z

Review: feat/cuda-pr2-r1-gpu-commits

This PR wires in GPU (CUDA) acceleration for Round 1 LDE + Merkle commitments via a new gpu_lde.rs (~960 lines) and refactors prover.rs to carry GPU handles through the commit pipeline. Issues found:

High

H1 — debug_assert! guards unsafe set_len — stripped in release builds
Every unsafe { col.set_len(lde_size) } is gated only by debug_assert!(col.capacity() >= lde_size). In release builds the assertion is compiled out. If a caller ever returns columns with insufficient capacity, set_len will point the Vec's length past its allocation — undefined behaviour (heap corruption). This is a safety invariant; use assert! unconditionally.
Affected lines: gpu_lde.rs:128, 297, 393, 482, 568, 656, 743, 822

H2 — type_name::<F>() used as a memory-safety gate
All unsafe casts and reinterpret-slices are enabled/disabled by comparing core::any::type_name::<F>() strings. The docs explicitly say the string is best-effort with no stability guarantee. The correct mechanism is std::any::TypeId::of::<F>() == TypeId::of::<GoldilocksField>(), which is the stable runtime type identity check. A type alias or a future compiler version could make two distinct types share a name, silently enabling wrong casts.
Affected lines: gpu_lde.rs:84, 97, 101, 190, 193, 273, 276, 364, 367, ...

Medium

M1 — GPU kernel failures panic the prover with no fallback
Every CUDA call terminates with .expect("GPU … failed"). A transient GPU error, OOM, or driver failure will panic the entire prover thread. These entry-point functions already return Option; they should return None (or Result) on kernel failure so the caller can fall back to the CPU path.

M2 — squared_offset parameter is silently discarded
try_extend_two_halves_gpu accepts squared_offset: &FieldElement<F> but immediately does let _ = squared_offset and re-derives the coset weight from domain.coset_offset. If a caller ever passes a domain whose coset_offset differs from the one used to compute squared_offset, the LDE output is silently wrong. Remove the parameter or use it.
Line: gpu_lde.rs:199

M3 — Plain * on total_nodes * 32 can overflow and silently truncate the slice
lde_size is computed with saturating_mul, but total_nodes = 2 * lde_size - 1 and then total_nodes * 32 use plain arithmetic. If lde_size is large (possible on a misconfigured domain), the multiplication wraps in release mode to a small number, the from_raw_parts_mut slice is smaller than the allocation, and the GPU writes past it — silent UB. Use checked_mul and return None on overflow.
Affected lines: gpu_lde.rs:412, 500, 586, 674, 760

Low

L1 — Only GPU_LDE_CALLS has a reset function
reset_gpu_lde_calls() resets GPU_LDE_CALLS but GPU_EXTEND_HALVES_CALLS, GPU_LEAF_HASH_CALLS, and GPU_MERKLE_TREE_CALLS accumulate forever. The bench test runs a warm-up pass then a profiled pass, so those three counters will be doubled relative to GPU_LDE_CALLS.

L2 — from_precomputed_nodes does not validate internal node consistency
The new constructor only checks that the node count forms a valid power-of-two tree. A GPU Merkle kernel bug producing an internally inconsistent tree (root doesn't match children) would be undetected — the prover commits to a corrupt tree and the verifier rejects the proof with no actionable diagnostic.

ColoCarletti · 2026-05-21T19:59:04Z

solved in 01aa5e4

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

…commits

ColoCarletti added 5 commits May 6, 2026 15:12

add first cuda files

d1a0abf

fmt

79634ff

fix clippy

ac6fbb5

gpu 2nd part

2ceb3b0

feat(cuda): Round 1 GPU LDE+commit dispatch + device-resident handles

affceb1

claude Bot reviewed May 8, 2026

View reviewed changes

Comment thread crypto/stark/src/gpu_lde.rs

claude Bot reviewed May 8, 2026

View reviewed changes

Comment thread crypto/stark/src/gpu_lde.rs

claude Bot reviewed May 8, 2026

View reviewed changes

Comment thread crypto/stark/src/gpu_lde.rs

claude Bot reviewed May 8, 2026

View reviewed changes

Comment thread crypto/stark/src/gpu_lde.rs

claude Bot reviewed May 8, 2026

View reviewed changes

Comment thread crypto/crypto/src/merkle_tree/merkle.rs Outdated

claude Bot reviewed May 8, 2026

View reviewed changes

Comment thread crypto/stark/src/prover.rs

ColoCarletti marked this pull request as ready for review May 13, 2026 21:07

ColoCarletti marked this pull request as draft May 13, 2026 21:08

claude Bot reviewed May 13, 2026

View reviewed changes

Comment thread crypto/stark/src/gpu_lde.rs

claude Bot reviewed May 13, 2026

View reviewed changes

Comment thread crypto/stark/src/gpu_lde.rs

claude Bot reviewed May 13, 2026

View reviewed changes

Comment thread crypto/stark/src/gpu_lde.rs Outdated

claude Bot reviewed May 13, 2026

View reviewed changes

Comment thread crypto/stark/src/gpu_lde.rs Outdated

claude Bot reviewed May 13, 2026

View reviewed changes

Comment thread crypto/crypto/src/merkle_tree/merkle.rs

gabrielbosio reviewed May 18, 2026

View reviewed changes

Comment thread crypto/crypto/src/merkle_tree/merkle.rs Outdated

Base automatically changed from feat/cuda-pr1b-keccak-merkle to main May 18, 2026 22:30

ColoCarletti added 2 commits May 19, 2026 15:45

merge main

01172f2

Merge branch 'main' into feat/cuda-pr2-r1-gpu-commits

c4627e1

ColoCarletti marked this pull request as ready for review May 19, 2026 19:17

claude Bot reviewed May 19, 2026

View reviewed changes

Comment thread crypto/stark/src/gpu_lde.rs Outdated

gabrielbosio reviewed May 21, 2026

View reviewed changes

Comment thread crypto/stark/src/gpu_lde.rs Outdated

gabrielbosio reviewed May 21, 2026

View reviewed changes

Comment thread crypto/stark/src/gpu_lde.rs Outdated

gabrielbosio reviewed May 21, 2026

View reviewed changes

Comment thread crypto/stark/src/gpu_lde.rs Outdated

ColoCarletti and others added 5 commits May 21, 2026 18:32

Update crypto/stark/src/gpu_lde.rs

ea5696f

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/stark/src/gpu_lde.rs

a8cf265

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/stark/src/gpu_lde.rs

fb8d31f

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/stark/src/gpu_lde.rs

a79f2b5

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/stark/src/gpu_lde.rs

761a2c0

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

gabrielbosio reviewed May 21, 2026

View reviewed changes

Comment thread crypto/stark/src/gpu_lde.rs Outdated

gabrielbosio reviewed May 21, 2026

View reviewed changes

Comment thread crypto/stark/src/gpu_lde.rs Outdated

gabrielbosio reviewed May 21, 2026

View reviewed changes

Comment thread crypto/stark/src/gpu_lde.rs Outdated

gabrielbosio reviewed May 21, 2026

View reviewed changes

Comment thread crypto/stark/src/gpu_lde.rs Outdated

gabrielbosio reviewed May 21, 2026

View reviewed changes

Comment thread crypto/stark/src/gpu_lde.rs Outdated

gabrielbosio reviewed May 21, 2026

View reviewed changes

Comment thread crypto/stark/src/gpu_lde.rs Outdated

gabrielbosio reviewed May 21, 2026

View reviewed changes

Comment thread crypto/stark/src/gpu_lde.rs Outdated

gabrielbosio reviewed May 21, 2026

View reviewed changes

Comment thread crypto/stark/src/gpu_lde.rs Outdated

gabrielbosio reviewed May 21, 2026

View reviewed changes

Comment thread crypto/stark/src/gpu_lde.rs Outdated

gabrielbosio reviewed May 21, 2026

View reviewed changes

Comment thread crypto/stark/src/gpu_lde.rs Outdated

gabrielbosio reviewed May 21, 2026

View reviewed changes

Comment thread crypto/stark/src/gpu_lde.rs Outdated

address reviews

e066e9d

gabrielbosio reviewed May 21, 2026

View reviewed changes

Comment thread crypto/stark/src/gpu_lde.rs Outdated

gabrielbosio reviewed May 21, 2026

View reviewed changes

Comment thread crypto/stark/src/gpu_lde.rs Outdated

gabrielbosio reviewed May 21, 2026

View reviewed changes

Comment thread crypto/stark/src/gpu_lde.rs Outdated

gabrielbosio reviewed May 21, 2026

View reviewed changes

Comment thread crypto/stark/src/gpu_lde.rs Outdated

ColoCarletti and others added 5 commits May 22, 2026 12:44

fix review comments

7d3d0f0

Merge remote-tracking branch 'origin/main' into feat/cuda-pr2-r1-gpu-…

cf80771

…commits

address doc comment suggestions

71aba0d

Merge branch 'main' into feat/cuda-pr2-r1-gpu-commits

83d91b8

fix

34cae4b

Conversation

ColoCarletti commented May 8, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

claude Bot commented May 8, 2026

Review: feat(cuda) Round 1 GPU LDE+commit dispatch

High

Medium

Low

Uh oh!

github-actions Bot commented May 8, 2026

Codex Code Review

Uh oh!

Uh oh!

github-actions Bot commented May 13, 2026

Codex Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

claude Bot commented May 13, 2026

Review summary

Uh oh!

Uh oh!

github-actions Bot commented May 19, 2026

Codex Code Review

Uh oh!

claude Bot commented May 19, 2026

Review: feat/cuda-pr2-r1-gpu-commits

High

Medium

Low

Uh oh!

Uh oh!

ColoCarletti commented May 21, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants