feat(cuda): math-cuda crate Keccak + fused LDE+leaves+tree pipeline #578
Codex Code Review
Findings
No other concrete issues found in the changed diff.
Review: GPU Keccak-256 Merkle Tree
The CUDA Keccak-f[1600] implementation is correct: round constants, rho offsets, theta/pi/chi/iota steps, padding delimiter (0x01 for pre-SHA-3 Keccak), and the inner-tree addressing all check out. The CPU parity tests are a solid safety net. A few issues to address:
Medium
Low
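The padding delimiter mentioned in the review is the one byte that separates original Keccak-256 from NIST SHA3-256. A minimal host-side sketch of the pad10*1 rule for a 136-byte rate follows; the function and constant names are illustrative and are not taken from `kernels/keccak.cu`.

```rust
/// Rate of Keccak-256 / SHA3-256 in bytes: (1600 - 2 * 256) / 8.
pub const RATE_BYTES: usize = 136;

/// Domain-separation byte for original (pre-SHA-3) Keccak.
pub const KECCAK_DELIM: u8 = 0x01;
/// Domain-separation byte for NIST SHA-3 fixed-length hashes.
pub const SHA3_DELIM: u8 = 0x06;

/// Pads `msg` so its length becomes a non-zero multiple of RATE_BYTES.
/// Hypothetical helper for illustration only.
pub fn pad(msg: &[u8], delim: u8) -> Vec<u8> {
    // At least one padding byte is always added, so a message that
    // already fills a block gains a whole extra block.
    let padded_len = (msg.len() / RATE_BYTES + 1) * RATE_BYTES;
    let mut out = vec![0u8; padded_len];
    out[..msg.len()].copy_from_slice(msg);
    // First padding byte carries the delimiter.
    out[msg.len()] ^= delim;
    // Last byte of the final block carries the closing 1 bit.
    // If both land on the same byte they combine (0x81 for Keccak).
    out[padded_len - 1] ^= 0x80;
    out
}
```

A message of exactly 135 bytes is the edge case worth a parity test: the delimiter and the closing bit share one byte, giving 0x81 for Keccak and 0x86 for SHA-3.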
- Add ext3_sub_kernel + ext3_sub_u64 wrapper (test infrastructure;
the ext3::sub device function was previously unreachable from Rust).
- Add tests/ext3_edge.rs: 7 adversarial tests for ext3::mul dot3
overflow tracking — (p-1)^3, u64::MAX^3, non-canonical p
representations, identity, and 98 base-field edge pairs in
a/b/c slots.
- Add tests/ext3_sub.rs: parity test for the new sub wrapper.
- Add tests/ntt_known.rs: known-polynomial tests for p(x) = 1+x at
sizes 16 and 256, and p(x) = x^(N/2) for the alternating ±1
pattern.
- Add tests/lde_batch_into.rs: direct parity test for
coset_lde_batch_base_into vs coset_lde_batch_base.
15 new tests total, all green on RTX 5090.
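The known-polynomial checks listed above can be reproduced on the CPU with a naive DFT. The sketch below assumes the base field is Goldilocks (p = 2^64 - 2^32 + 1, multiplicative generator 7), which the commit message does not state, and it produces evaluations in natural order; a GPU NTT may emit bit-reversed order instead. All names are illustrative.

```rust
/// Goldilocks prime 2^64 - 2^32 + 1 (ASSUMED field; not named in the PR).
pub const P: u64 = 0xFFFF_FFFF_0000_0001;

pub fn add(a: u64, b: u64) -> u64 {
    ((a as u128 + b as u128) % P as u128) as u64
}

pub fn mul(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % P as u128) as u64
}

/// Square-and-multiply exponentiation mod P.
pub fn pow(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1u64;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = mul(acc, base);
        }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}

/// Primitive n-th root of unity for n a power of two, n <= 2^32.
pub fn root_of_unity(n: u64) -> u64 {
    assert!(n.is_power_of_two() && n <= (1u64 << 32));
    pow(7, (P - 1) / n)
}

/// O(n^2) reference: evaluates `coeffs` at w^0, w^1, ..., w^(n-1).
pub fn naive_dft(coeffs: &[u64]) -> Vec<u64> {
    let n = coeffs.len() as u64;
    let w = root_of_unity(n);
    (0..n)
        .map(|i| {
            let x = pow(w, i);
            // Horner's rule, highest coefficient first.
            coeffs.iter().rev().fold(0u64, |acc, &c| add(mul(acc, x), c))
        })
        .collect()
}
```

For p(x) = x^(N/2) the expected output alternates between 1 and p - 1, because w^(N/2) = -1; for p(x) = 1 + x the i-th evaluation is 1 + w^i.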
…ion' into feat/cuda-pr1b-keccak-merkle
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
gabrielbosio left a comment
About code deduplication, found five places to extract:
- iNTT, coset scale, NTT launch sequence: 7 full sites in `lde.rs` (220, 397, 542, 766, 1009, 1258, 1972) and 2 NTT-only sites (1542, 1710). Same five-launch chain; only `m` vs `mb = 3*m` differs. Extract `run_lde_pipeline(stream, &mut buf, inv_tw, fwd_tw, weights_dev, n, lde_size, batch)`.
- Inner tree level scan: 6 sites at `lde.rs:861`, `:1345`, `:1779`, `merkle.rs:169`, `:290`, `:366`. Extract `pub(crate) fn launch_inner_tree(stream, &mut nodes_dev, num_leaves)` in `merkle.rs`. `KECCAK_BLOCK_DIM` already exists but 5 of the 6 sites bypass it.
- ext3 de-interleave / re-interleave: 5 sites each. De-interleave at `lde.rs:976`, `:1225`, `:1510`, `:1679`, `:1927`. Re-interleave at `lde.rs:1112`, `:1401`, `:1612`, `:1835`, `:2060`. Replace with `stage_ext3_pinned` and `unstage_ext3_pinned`.
- Pinned hash D2H plus chunked memcpy: 5 sites at `lde.rs:614`, `:877`, `:1076`, `:1356`, `:1618`. Each: lock `pinned_hashes`, ensure capacity, cast to bytes, `memcpy_dtoh`, parallel chunked copy to caller. Extract `staged_d2h_bytes(stream, &dev_buf, &mut out)`.
- `_with_merkle_tree` and `_with_merkle_tree_keep` already share an `_inner(.., keep_device_buf: bool)` helper. Apply the same shape to `_with_leaf_hash`; it's a strict subset of `_with_merkle_tree_inner` without the tree build.
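To make the ext3 de-interleave / re-interleave suggestion concrete, here is a host-only sketch of what such a helper pair could look like. It assumes ext3 elements are stored as three consecutive `u64` limbs and that the staged layout is three planar slabs; neither layout is confirmed by the review, and the function names are stand-ins rather than the proposed `stage_ext3_pinned` / `unstage_ext3_pinned`, which would also handle the pinned buffer.

```rust
/// Splits [a0, b0, c0, a1, b1, c1, ...] into three planar slabs
/// [a0, a1, ...], [b0, b1, ...], [c0, c1, ...].
/// ASSUMED layout; illustrative only.
pub fn deinterleave_ext3(interleaved: &[u64]) -> [Vec<u64>; 3] {
    assert_eq!(interleaved.len() % 3, 0, "length must be a multiple of 3");
    let n = interleaved.len() / 3;
    let mut planes = [
        Vec::with_capacity(n),
        Vec::with_capacity(n),
        Vec::with_capacity(n),
    ];
    for elem in interleaved.chunks_exact(3) {
        // Route limb k of each element to plane k.
        for (plane, &limb) in planes.iter_mut().zip(elem) {
            plane.push(limb);
        }
    }
    planes
}

/// Inverse of `deinterleave_ext3`.
pub fn reinterleave_ext3(planes: &[Vec<u64>; 3]) -> Vec<u64> {
    let n = planes[0].len();
    assert!(planes.iter().all(|p| p.len() == n), "planes must match in length");
    let mut out = Vec::with_capacity(3 * n);
    for i in 0..n {
        for plane in planes {
            out.push(plane[i]);
        }
    }
    out
}
```

Keeping the two directions side by side in one module makes the round-trip property easy to test once, instead of re-checking it at each of the ten call sites.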
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Summary
Extends the `math-cuda` crate with Keccak-256 leaf hashing (4 column-layout variants), a per-level Merkle tree builder kernel, and the fused LDE+leaves+tree entry points. Single device-side pipeline: iNTT → coset shift → NTT → leaf hash → Merkle tree, with the LDE buffer kept device-resident throughout.
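The coset-shift step in the pipeline rests on a simple identity: evaluating p at g·x is the same as evaluating the polynomial whose i-th coefficient has been multiplied by g^i at x. A CPU sketch of that step follows, assuming a Goldilocks base field (not stated in this PR) and using illustrative names that do not come from `src/lde.rs`.

```rust
/// Goldilocks prime 2^64 - 2^32 + 1 (ASSUMED field; not named in the PR).
pub const P: u64 = 0xFFFF_FFFF_0000_0001;

pub fn add(a: u64, b: u64) -> u64 {
    ((a as u128 + b as u128) % P as u128) as u64
}

pub fn mul(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % P as u128) as u64
}

/// Square-and-multiply exponentiation mod P.
pub fn pow(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1u64;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = mul(acc, base);
        }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}

/// Horner evaluation of `coeffs` (lowest degree first) at `x`.
pub fn eval(coeffs: &[u64], x: u64) -> u64 {
    coeffs.iter().rev().fold(0u64, |acc, &c| add(mul(acc, x), c))
}

/// Coset shift: returns coefficients of q(x) = p(g * x),
/// i.e. coefficient i is scaled by g^i.
pub fn coset_scale(coeffs: &[u64], g: u64) -> Vec<u64> {
    let mut g_pow = 1u64;
    coeffs
        .iter()
        .map(|&c| {
            let scaled = mul(c, g_pow);
            g_pow = mul(g_pow, g);
            scaled
        })
        .collect()
}
```

After this scaling, a plain forward NTT over the larger domain yields evaluations on the coset, which is why the shift sits between the iNTT and the NTT.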
`cuda` feature stays opt-in. No prover code consumes the new entry points yet; they ship here so future work can wire them into the prover commit path.
What's in
- `kernels/keccak.cu`: 5 kernels (base/ext3/comp-poly/FRI leaf hashing, plus `keccak_merkle_level` for inner-tree pair hashing).
- `src/merkle.rs`: host wrappers `keccak_leaves_base`, `keccak_leaves_ext3`, `build_merkle_tree_on_device`, `build_comp_poly_tree_from_evals_ext3`, `build_fri_layer_tree_from_evals_ext3`, plus pub(crate) launchers.
- `src/lde.rs`: 9 fused entry points: base + ext3 variants of `_with_leaf_hash`, `_with_merkle_tree`, `_with_merkle_tree_keep`, and `evaluate_poly_coset_batch_ext3_into_with_merkle_tree`.
- `Backend`: new `pinned_hashes` buffer for Merkle-leaf D2H staging at PCIe line-rate.
Verification
- `cargo build --workspace --release`
- `cargo build -p stark --features cuda --release`
- `cargo fmt --check --all`
- `cargo clippy --workspace --all-targets -- -D warnings -A clippy::op_ref`
- `cargo test -p math-cuda --release`
- `cargo test --workspace --release` (ELFs pre-built)
- `cargo test -p lambda-vm-prover --features stark/cuda --release`
Run on RTX 5090, driver 595.58.03, CUDA 13.1.
New parity tests
- `keccak_leaves_base_matches_cpu` / `keccak_leaves_ext3_matches_cpu`: byte-identical to `sha3::Keccak256` over `FieldElement::write_bytes_be`.
- `merkle_tree_small` / `medium` / `large`: `build_merkle_tree_on_device` vs CPU pair-hash inner tree, log₂(N) ∈ {1..=6, 10, 12, 14, 18}.
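For readers unfamiliar with the CPU side of the tree parity tests, the sketch below shows one way a pair-hash inner tree can be built level by level. It assumes a flat heap layout (root at index 0, leaves in the last N slots), which this PR does not confirm, and it uses a toy mixing function in place of Keccak-256 so the block stays dependency-free. A real reference would pass a Keccak-256 pair hash as `hash`.

```rust
pub type Digest = [u8; 32];

/// Toy stand-in for the pair hash. NOT Keccak and NOT cryptographic;
/// it exists only so the example runs without external crates.
pub fn toy_pair_hash(left: &Digest, right: &Digest) -> Digest {
    let mut out = [0u8; 32];
    for i in 0..32 {
        out[i] = left[i].rotate_left(3) ^ right[31 - i].wrapping_add(i as u8);
    }
    out
}

/// Builds a tree of 2N - 1 nodes over N leaves (N a power of two, N >= 2).
/// ASSUMED layout: node i has children 2i + 1 and 2i + 2.
pub fn build_tree<H>(leaves: &[Digest], hash: H) -> Vec<Digest>
where
    H: Fn(&Digest, &Digest) -> Digest,
{
    let n = leaves.len();
    assert!(n.is_power_of_two() && n >= 2);
    let mut nodes = vec![[0u8; 32]; 2 * n - 1];
    nodes[n - 1..].copy_from_slice(leaves);

    // Walk levels bottom-up. A level of `width` nodes occupies indices
    // [width - 1, 2 * width - 1); on a GPU each level would be one launch.
    let mut width = n / 2;
    while width >= 1 {
        let start = width - 1;
        for i in start..start + width {
            let parent = hash(&nodes[2 * i + 1], &nodes[2 * i + 2]);
            nodes[i] = parent;
        }
        width /= 2;
    }
    nodes
}
```

Nodes within a level are independent of each other, which is what allows one kernel launch per level; only the levels themselves must run in order.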