Skip to content

feat(cuda): math-cuda crate Keccak + fused LDE+leaves+tree pipeline#578

Merged
MauroToscano merged 31 commits into
mainfrom
feat/cuda-pr1b-keccak-merkle
May 18, 2026
Merged

feat(cuda): math-cuda crate Keccak + fused LDE+leaves+tree pipeline#578
MauroToscano merged 31 commits into
mainfrom
feat/cuda-pr1b-keccak-merkle

Conversation

@ColoCarletti
Copy link
Copy Markdown
Collaborator

@ColoCarletti ColoCarletti commented May 6, 2026

Summary

Extends the math-cuda crate with Keccak-256 leaf hashing (4 column-layout
variants), a per-level Merkle tree builder kernel, and the fused
LDE+leaves+tree entry points. Single device-side pipeline: iNTT → coset
shift → NTT → leaf hash → Merkle tree, with the LDE buffer kept device-
resident throughout.

cuda feature stays opt-in. No prover code consumes the new entry points
yet — they ship here so future work can wire them into the prover commit
path.

What's in

  • kernels/keccak.cu — 5 kernels: base/ext3/comp-poly/FRI leaf hashing
    • keccak_merkle_level for inner-tree pair hashing.
  • src/merkle.rs — host wrappers: keccak_leaves_base,
    keccak_leaves_ext3, build_merkle_tree_on_device,
    build_comp_poly_tree_from_evals_ext3,
    build_fri_layer_tree_from_evals_ext3, plus pub(crate) launchers.
  • src/lde.rs — 9 fused entry points: base + ext3 variants of
    _with_leaf_hash, _with_merkle_tree, _with_merkle_tree_keep, and
    evaluate_poly_coset_batch_ext3_into_with_merkle_tree.
  • Backend — new pinned_hashes buffer for Merkle-leaf D2H staging at
    PCIe line-rate.
  • 5 new parity tests (35 total in math-cuda).

Verification

Check Result
cargo build --workspace --release
cargo build -p stark --features cuda --release
cargo fmt --check --all
cargo clippy --workspace --all-targets -- -D warnings -A clippy::op_ref ✓ (with and without nvcc)
cargo test -p math-cuda --release 35 / 35 parity tests pass
cargo test --workspace --release (ELFs pre-built) 953 / 953 pass
cargo test -p lambda-vm-prover --features stark/cuda --release 339 / 339

Run on RTX 5090, driver 595.58.03, CUDA 13.1.

New parity tests

  • keccak_leaves_base_matches_cpu / keccak_leaves_ext3_matches_cpu
    byte-identical to sha3::Keccak256 over FieldElement::write_bytes_be.
  • merkle_tree_small / medium / largebuild_merkle_tree_on_device
    vs CPU pair-hash inner tree, log₂(N) ∈ {1..=6, 10, 12, 14, 18}.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 6, 2026

Codex Code Review

Findings

  1. Medium security: safe API can trigger CUDA out-of-bounds reads
    crypto/math-cuda/src/merkle.rs and crypto/math-cuda/src/merkle.rs validate buffer length against num_cols * col_stride, but never require col_stride >= num_rows. The kernels index c * col_stride + br where br < num_rows, so a caller can pass col_stride < num_rows and cause device out-of-bounds reads from a safe Rust function. Add assert!(col_stride >= num_rows) to both base and ext3 entry points, and preferably use checked multiplication for the length assertions.

  2. Potential bug: row-pair Merkle API leaves most of the output buffer unwritten
    crypto/math-cuda/src/lde.rs requires merkle_nodes_out.len() == (2 * lde_size - 1) * 32, but the row-pair tree only has lde_size / 2 leaves, so the real node count is lde_size - 1. Later, only tight_total_nodes * 32 bytes are copied at crypto/math-cuda/src/lde.rs, leaving the rest stale. Since the function returns (), it does not actually “expose the tight byte count”. Either require the tight length up front or return the written length/Vec so callers cannot accidentally build proofs from unwritten bytes.

No other concrete issues found in the changed diff.

Comment thread crypto/math-cuda/src/lde.rs Outdated
Comment thread crypto/math-cuda/kernels/keccak.cu
Comment thread crypto/math-cuda/src/lde.rs Outdated
Comment thread crypto/math-cuda/src/merkle.rs Outdated
Comment thread crypto/math-cuda/src/lde.rs Outdated
@claude
Copy link
Copy Markdown

claude Bot commented May 6, 2026

Review: GPU Keccak-256 Merkle Tree

The CUDA Keccak-f[1600] implementation is correct — round constants, rho offsets, theta/pi/chi/iota steps, padding delimiter (0x01 for pre-SHA-3 Keccak), and the inner-tree addressing all check out. The CPU parity tests are a solid safety net. A few issues to address:

Medium

  • Buffer contract mismatch (lde.rs:1700): evaluate_poly_coset_batch_ext3_into_with_merkle_tree asserts a (2*lde_size-1)*32-byte buffer but only writes (lde_size-1)*32 bytes (the R2 row-pair tree has lde_size/2 leaves). The upper half of the caller's buffer is left uninitialized. The doc comment and assertion need to match the actual tight size.

  • Undefined behavior: shift by 64 (keccak.cu:164, 201, 243-244): __brevll(tid) >> (64 - log_num_rows) is UB when log_num_rows == 0 (num_rows == 1). PTX wraps the shift so it works on current hardware, but it's latent UB. Guard with (log_num_rows == 0) ? 0 : (__brevll(tid) >> (64 - log_num_rows)).

Low

  • Misleading comment (lde.rs:649): "in parallel on the same stream" — CUDA stream operations are sequential. Two D2H copies on the same stream run back-to-back, not concurrently.

  • Dead code (merkle.rs:331): if tight_total_nodes == 0 is unreachable because the assert above guarantees num_leaves >= 1.

  • Dead code (lde.rs:1701): if n == 0 guards in ext3 functions are unreachable; assert!(n.is_power_of_two()) already panics for n == 0.

@ColoCarletti ColoCarletti changed the title gpu 2nd part feat(cuda): math-cuda crate Keccak + fused LDE+leaves+tree pipeline May 6, 2026
ColoCarletti and others added 6 commits May 8, 2026 12:28
  - Add ext3_sub_kernel + ext3_sub_u64 wrapper (test infrastructure;
    the ext3::sub device function was previously unreachable from Rust).
  - Add tests/ext3_edge.rs: 7 adversarial tests for ext3::mul dot3
    overflow tracking — (p-1)^3, u64::MAX^3, non-canonical p
    representations, identity, and 98 base-field edge pairs in
    a/b/c slots.
  - Add tests/ext3_sub.rs: parity test for the new sub wrapper.
  - Add tests/ntt_known.rs: known-polynomial tests for p(x) = 1+x at
    sizes 16 and 256, and p(x) = x^(N/2) for the alternating ±1
    pattern.
  - Add tests/lde_batch_into.rs: direct parity test for
    coset_lde_batch_base_into vs coset_lde_batch_base.

  15 new tests total, all green on RTX 5090.
Comment thread crypto/math-cuda/kernels/keccak.cu Outdated
Comment thread crypto/math-cuda/kernels/keccak.cu Outdated
Comment thread crypto/math-cuda/kernels/keccak.cu
Comment thread crypto/math-cuda/kernels/keccak.cu Outdated
Comment thread crypto/math-cuda/kernels/keccak.cu Outdated
Base automatically changed from feat/cuda-pr1a-math-cuda-foundation to main May 13, 2026 19:24
Comment thread crypto/math-cuda/src/merkle.rs Outdated
Comment thread crypto/math-cuda/src/merkle.rs Outdated
Comment thread crypto/math-cuda/src/merkle.rs Outdated
ColoCarletti and others added 3 commits May 18, 2026 11:30
Copy link
Copy Markdown
Collaborator

@gabrielbosio gabrielbosio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About code deduplication, found five places to extract:

  1. iNTT, coset scale, NTT launch sequence: 7 full sites in lde.rs (220, 397, 542, 766, 1009, 1258, 1972) and 2 NTT-only sites (1542, 1710). Same five-launch chain; only m vs mb = 3*m differs. Extract run_lde_pipeline(stream, &mut buf, inv_tw, fwd_tw, weights_dev, n, lde_size, batch).

  2. Inner tree level scan: 6 sites at lde.rs:861, :1345, :1779, merkle.rs:169, :290, :366. Extract pub(crate) fn launch_inner_tree(stream, &mut nodes_dev, num_leaves) in merkle.rs. KECCAK_BLOCK_DIM already exists but 5 of the 6 sites bypass it.

  3. ext3 de-interleave / re-interleave: 5 sites each. De-interleave at lde.rs:976, :1225, :1510, :1679, :1927. Re-interleave at lde.rs:1112, :1401, :1612, :1835, :2060. Replace with stage_ext3_pinned and unstage_ext3_pinned.

  4. Pinned hash D2H plus chunked memcpy: 5 sites at lde.rs:614, :877, :1076, :1356, :1618. Each: lock pinned_hashes, ensure capacity, cast to bytes, memcpy_dtoh, parallel chunked copy to caller. Extract staged_d2h_bytes(stream, &dev_buf, &mut out).

  5. _with_merkle_tree and _with_merkle_tree_keep already share an _inner(.., keep_device_buf: bool) helper. Apply the same shape to _with_leaf_hash; it's a strict subset of _with_merkle_tree_inner without the tree build.

Comment thread crypto/math-cuda/tests/keccak_leaves.rs
Comment thread crypto/math-cuda/tests/merkle_tree.rs
Comment thread crypto/crypto/src/merkle_tree/merkle.rs Outdated
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Comment thread crypto/math-cuda/tests/keccak_leaves.rs Outdated
Comment thread crypto/math-cuda/tests/keccak_leaves.rs
Comment thread crypto/stark/src/prover.rs Outdated
Comment thread crypto/math-cuda/tests/merkle_tree.rs Outdated
Comment thread crypto/stark/src/prover.rs Outdated
@MauroToscano MauroToscano added this pull request to the merge queue May 18, 2026
Merged via the queue into main with commit 57a8ab9 May 18, 2026
11 checks passed
@MauroToscano MauroToscano deleted the feat/cuda-pr1b-keccak-merkle branch May 18, 2026 22:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants