Phase 9: CPU↔GPU parity harness + kernels SSOT + FTZ doc #57

Merged
gvonness-apolitical merged 2 commits into main from fix/harden-cycle-6-phase9
Apr 18, 2026
@gvonness-apolitical
Summary

Phase 9 (final phase of Cycle 6). Three deliverables:

  • R1 (tests/gpu_cpu_parity.rs): table-driven property test running 26 tapes across 3 backends (wgpu f32, CUDA f32, CUDA f64) at 3-4 input points each, asserting value + gradient ULP distances stay within per-case budgets. Catches future CPU-GPU drift introduced by CPU formula changes not mirrored into shaders, or shader refactors that silently change behaviour.
  • R2 (src/kernels/mod.rs): start factoring per-opcode math formulas into a shared module. Currently exports hypot_partials, atan2_partials, atan_deriv, asinh_deriv, acosh_deriv. The bytecode-tape opcode.rs dispatcher delegates to these helpers so CPU-internal drift across AD types becomes harder. Copies in dual.rs, dual_vec.rs, etc. remain for now — future commits can migrate them incrementally.
  • L12: Tape::reverse doc comment explains why the library does not set x86 MXCSR flush-to-zero, with guidance for callers who want to opt in.

Test plan

  • cargo test --features "bytecode,gpu-wgpu,taylor,laurent" --test gpu_cpu_parity (wgpu Metal on M4 Max)
  • CUDA f32 + f64 parity green on A100 via vast.ai
  • Full echidna test suite green with all relevant features
  • Clippy clean with -D warnings

Two Phase 9 deliverables: a property test suite catching CPU-GPU drift,
and a doc note capturing why the library doesn't set MXCSR FTZ itself.

R1: tests/gpu_cpu_parity.rs defines a `PARITY_CASES` table of 26 hand-
picked tapes covering every major op family (arithmetic, unary
algebraic, exp/log, trig, hyperbolic, binary max/min/hypot/atan2, plus
three composite tapes — Rosenbrock, sin(x²+1)·exp(-x), log-sum-exp).
Each case has per-op ULP tolerances (typically 2-16 ULPs f32, 2-16
ULPs f64; 64 for the deepest composites). Three runners — wgpu,
CUDA f32, CUDA f64 — iterate the whole table at 3-4 input points per
case and assert value + gradient ULP distances stay inside the
budget. A divergence names the case, point index, and computed ULPs
in the panic message so regressions point straight at the bad op.
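
The budget check described above can be sketched roughly as follows. This is a minimal illustration, not the code in tests/gpu_cpu_parity.rs: the names `ordered_bits`, `ulp_distance`, and `check_point` are hypothetical, and it assumes the f64 CPU reference is rounded to f32 before it is compared with an f32 GPU result.

```rust
/// Map f32 bit patterns onto a monotonically ordered integer line so that
/// adjacent representable floats differ by exactly 1 (hypothetical helper).
fn ordered_bits(x: f32) -> i64 {
    let b = x.to_bits() as i32;
    // Negative floats have the sign bit set; mirror them so that
    // -0.0 maps to 0 and more negative values map to smaller keys.
    let key = if b < 0 { i32::MIN.wrapping_sub(b) } else { b };
    key as i64
}

/// Number of representable f32 values between `a` and `b`; None if either is NaN.
fn ulp_distance(a: f32, b: f32) -> Option<u64> {
    if a.is_nan() || b.is_nan() {
        return None;
    }
    Some((ordered_bits(a) - ordered_bits(b)).unsigned_abs())
}

/// Compare one CPU reference value against one GPU value under a ULP budget.
/// The error string names the case, point index, and computed ULPs.
fn check_point(case: &str, point: usize, cpu: f64, gpu: f32, max_ulps: u64) -> Result<(), String> {
    // Assumption: the reference is rounded to f32 before measuring distance.
    match ulp_distance(cpu as f32, gpu) {
        Some(d) if d <= max_ulps => Ok(()),
        Some(d) => Err(format!(
            "case {case}, point {point}: {d} ULPs exceeds budget {max_ulps}"
        )),
        None => Err(format!("case {case}, point {point}: NaN in comparison")),
    }
}
```

A table-driven runner would call something like `check_point` once for the value and once per gradient component, and panic on the first `Err`.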

The test was written to catch future CPU-GPU drift: adding a new
opcode path in CPU that isn't mirrored into shaders, or a shader
refactor that silently changes behaviour. Points deliberately avoid
catastrophic-cancellation regimes (near-π for sin/cos, near-unity
subtraction) where f64→f32 input rounding alone dominates the ULP
metric and would give false positives.
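
To see why points near π are excluded, the following sketch (the helper name `sin_after_f32_rounding` is hypothetical) evaluates sin at an f64 input and at the same input after a round trip through f32. Near π the rounding of the input alone changes the sign of the result, so no reasonable ULP budget could pass there.

```rust
/// Returns (sin(x), sin(x rounded to f32 and widened back to f64)).
/// Illustrative only: isolates the effect of f64 -> f32 input rounding.
fn sin_after_f32_rounding(x: f64) -> (f64, f64) {
    let x32 = x as f32 as f64;
    (x.sin(), x32.sin())
}
```

At `std::f64::consts::PI` the two results have opposite signs and differ in magnitude by many orders, while at a benign point such as 0.7 they agree to roughly f32 precision.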

L12: src/tape.rs `Tape::reverse` doc comment now explains that the
library doesn't flip the x86 MXCSR flush-to-zero bit. On chains with
subnormal adjoints this costs 10-100× on x86 due to microcode
emulation, but setting FTZ in the library would change numerical
semantics for callers that depend on subnormal precision. Callers
can opt in themselves via `core::arch::x86_64::_mm_setcsr(0x9FC0)`
(the default `0x1F80` plus the FTZ and DAZ bits) around the
reverse-sweep call. ARM64 flushes by default so the note is
x86-specific.
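
A caller-side opt-in might look like the sketch below. It is an illustration under stated assumptions, not part of this library: `FtzGuard` and `with_flush_to_zero` are hypothetical names, the guard ORs the FTZ (`0x8000`) and DAZ (`0x0040`) bits into the current MXCSR value instead of writing a fixed constant, and it restores the previous value on drop. Recent Rust releases mark `_mm_getcsr` and `_mm_setcsr` as deprecated in favour of inline assembly, because the compiler assumes the default floating-point environment, so treat this as best-effort.

```rust
/// RAII guard: sets FTZ and DAZ in MXCSR on creation and restores the
/// previous MXCSR value when dropped (hypothetical helper, x86_64 only).
#[cfg(target_arch = "x86_64")]
struct FtzGuard {
    saved: u32,
}

#[cfg(target_arch = "x86_64")]
#[allow(deprecated)]
impl FtzGuard {
    fn new() -> Self {
        use core::arch::x86_64::{_mm_getcsr, _mm_setcsr};
        const FTZ: u32 = 0x8000; // flush subnormal results to zero
        const DAZ: u32 = 0x0040; // treat subnormal inputs as zero
        let saved = unsafe { _mm_getcsr() };
        unsafe { _mm_setcsr(saved | FTZ | DAZ) };
        FtzGuard { saved }
    }
}

#[cfg(target_arch = "x86_64")]
#[allow(deprecated)]
impl Drop for FtzGuard {
    fn drop(&mut self) {
        // Restore the caller's original floating-point control state.
        unsafe { core::arch::x86_64::_mm_setcsr(self.saved) };
    }
}

/// Runs `f` with FTZ/DAZ enabled on x86_64; on other targets it just calls `f`.
pub fn with_flush_to_zero<R>(f: impl FnOnce() -> R) -> R {
    #[cfg(target_arch = "x86_64")]
    let _guard = FtzGuard::new();
    f()
}
```

The reverse sweep would then be wrapped in the closure passed to `with_flush_to_zero`; the exact call on the tape is left out here because its signature is not shown in this PR.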

Verified on M4 Max (wgpu Metal) and A100 via vast.ai (CUDA f32 + f64).

Phase 9 R2: start factoring out per-opcode math formulas that today
live in both `opcode.rs` (bytecode-tape dispatch) and the various AD
type impls (`dual.rs`, `dual_vec.rs`, `reverse.rs`, `breverse.rs`,
`laurent.rs`, `traits/num_traits_impls.rs`). Phase 7 found three cases
of silent drift between copies: atan at large |a|, div at small |b|,
hypot Inf handling on the GPU side. R1 (tests/gpu_cpu_parity.rs,
committed earlier in this branch) catches drift between CPU and GPU;
this commit starts catching drift within the CPU tree.

The `src/kernels/mod.rs` module currently exports:
- `hypot_partials(a, b, r)` — `(a/r, b/r)` with `r == 0` → `(0, 0)`.
- `atan2_partials(a, b)` — `(b/h/h, -a/h/h)` with `h = hypot(a, b)`.
- `atan_deriv(a)` — `1/(1+a²)` with the `|a| > 1e8` inv-based fallback.
- `asinh_deriv(a)`, `acosh_deriv(a)` — matching large-|a| fallbacks.

`opcode.rs` delegates Atan2, Hypot, Atan, Asinh, Acosh to these
helpers. The duplicate copies in `dual.rs`, `dual_vec.rs`, and friends
remain for now — a future commit can migrate them. Every helper is
generic over `num_traits::Float` so any future migration needs zero
signature changes.
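
For orientation, the formulas listed above could be written roughly as below. This is a sketch, not the contents of `src/kernels/mod.rs`: it is specialised to `f64` to stay dependency-free (the real helpers are generic over `num_traits::Float`), and the exact shape of the large-|a| fallback in `atan_deriv` is an assumption based on the "inv-based" description.

```rust
/// Partials of r = hypot(a, b) with respect to a and b.
/// The r == 0 case returns (0, 0) instead of NaN from 0/0.
pub fn hypot_partials(a: f64, b: f64, r: f64) -> (f64, f64) {
    if r == 0.0 {
        (0.0, 0.0)
    } else {
        (a / r, b / r)
    }
}

/// Partials of atan2(a, b) with respect to a and b.
/// Dividing by h twice avoids forming h*h, which can overflow.
/// Behaviour at a == b == 0 is not specified here (this sketch yields NaN).
pub fn atan2_partials(a: f64, b: f64) -> (f64, f64) {
    let h = a.hypot(b);
    (b / h / h, -a / h / h)
}

/// d/da atan(a) = 1 / (1 + a^2).
/// Assumed fallback: for |a| > 1e8 rewrite in terms of inv = 1/a, since
/// 1/(1 + a^2) = inv^2 / (1 + inv^2), which does not overflow in a^2.
pub fn atan_deriv(a: f64) -> f64 {
    if a.abs() > 1e8 {
        let inv = 1.0 / a;
        inv * inv / (1.0 + inv * inv)
    } else {
        1.0 / (1.0 + a * a)
    }
}
```

With a single definition like this, a dispatcher and each AD type can call the same function, so a fix to one fallback applies everywhere at once.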

Verified on M4 Max (wgpu Metal) and A100 via vast.ai (CUDA f32 + f64,
gpu_cpu_parity table green across all three backends).
@gvonness-apolitical gvonness-apolitical merged commit acbc885 into main Apr 18, 2026
6 checks passed
@gvonness-apolitical gvonness-apolitical deleted the fix/harden-cycle-6-phase9 branch April 18, 2026 20:39