Phase 9: CPU↔GPU parity harness + kernels SSOT + FTZ doc by gvonness-apolitical · Pull Request #57 · Entrolution/echidna

gvonness-apolitical · 2026-04-18T20:35:47Z

Summary

Phase 9 (final phase of Cycle 6). Three deliverables:

R1 (tests/gpu_cpu_parity.rs): table-driven property test running 26 tapes across 3 backends (wgpu f32, CUDA f32, CUDA f64) at 3-4 input points each, asserting value + gradient ULP distances stay within per-case budgets. Catches future CPU-GPU drift introduced by CPU formula changes not mirrored into shaders, or shader refactors that silently change behaviour.
R2 (src/kernels/mod.rs): start factoring per-opcode math formulas into a shared module. Currently exports hypot_partials, atan2_partials, atan_deriv, asinh_deriv, acosh_deriv. The bytecode-tape opcode.rs dispatcher delegates to these helpers so CPU-internal drift across AD types becomes harder. Copies in dual.rs, dual_vec.rs, etc. remain for now — future commits can migrate them incrementally.
L12: Tape::reverse doc comment explains why the library does not set x86 MXCSR flush-to-zero, with guidance for callers who want to opt in.
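
As a rough illustration of the opt-in that the doc comment describes, a caller could scope the MXCSR change around its own reverse-sweep call. This is a minimal sketch, not part of echidna: `with_ftz` is a hypothetical helper name, and it ORs the FTZ (bit 15) and DAZ (bit 6) flags into the current register value, which gives `0x9FC0` when starting from the x86 default of `0x1F80`. Recent Rust releases deprecate `_mm_getcsr` / `_mm_setcsr`, and their documentation cautions against running ordinary floating-point code under a non-default MXCSR, so treat this as a picture of the idea and not as a recommendation.

```rust
/// Hypothetical helper (not part of echidna): runs `f` with the x86-64 MXCSR
/// flush-to-zero and denormals-are-zero bits set, then restores the old value.
#[cfg(target_arch = "x86_64")]
#[allow(deprecated)] // `_mm_getcsr` / `_mm_setcsr` are deprecated in recent Rust
pub fn with_ftz<R>(f: impl FnOnce() -> R) -> R {
    use core::arch::x86_64::{_mm_getcsr, _mm_setcsr};

    const FTZ: u32 = 0x8000; // MXCSR bit 15: flush subnormal results to zero
    const DAZ: u32 = 0x0040; // MXCSR bit 6: treat subnormal inputs as zero

    // SAFETY: SSE is part of the x86_64 baseline, so the intrinsics are available.
    let saved = unsafe { _mm_getcsr() };
    unsafe { _mm_setcsr(saved | FTZ | DAZ) };
    let out = f();
    // Restore the caller's original control word. A panic inside `f` would
    // skip this line; a drop guard would be needed to cover that case.
    unsafe { _mm_setcsr(saved) };
    out
}

/// On other architectures the sketch leaves the floating-point mode untouched.
#[cfg(not(target_arch = "x86_64"))]
pub fn with_ftz<R>(f: impl FnOnce() -> R) -> R {
    f()
}
```

A caller would place its `Tape::reverse` call inside the closure, so the changed semantics stay confined to that one sweep.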

Test plan

cargo test --features "bytecode,gpu-wgpu,taylor,laurent" --test gpu_cpu_parity (wgpu Metal on M4 Max)
CUDA f32 + f64 parity green on A100 via vast.ai
Full echidna test suite green with all relevant features
Clippy clean with -D warnings

Two Phase 9 deliverables: a property test suite catching CPU-GPU drift, and a doc note capturing why the library doesn't set MXCSR FTZ itself.

R1: tests/gpu_cpu_parity.rs defines a `PARITY_CASES` table of 26 hand-picked tapes covering every major op family (arithmetic, unary algebraic, exp/log, trig, hyperbolic, binary max/min/hypot/atan2, plus three composite tapes: Rosenbrock, sin(x²+1)·exp(-x), log-sum-exp). Each case has per-op ULP tolerances (typically 2-16 ULPs f32, 2-16 ULPs f64; 64 for the deepest composites). Three runners (wgpu, CUDA f32, CUDA f64) iterate the whole table at 3-4 input points per case and assert value + gradient ULP distances stay inside the budget. A divergence names the case, point index, and computed ULPs in the panic message so regressions point straight at the bad op.

The test was written to catch future CPU-GPU drift: adding a new opcode path in CPU that isn't mirrored into shaders, or a shader refactor that silently changes behaviour. Points deliberately avoid catastrophic-cancellation regimes (near-π for sin/cos, near-unity subtraction) where f64→f32 input rounding alone dominates the ULP metric and would give false positives.

L12: src/tape.rs `Tape::reverse` doc comment now explains that the library doesn't flip the x86 MXCSR flush-to-zero bit. On chains with subnormal adjoints this costs 10-100× on x86 due to microcode emulation, but setting FTZ in the library would change numerical semantics for callers that depend on subnormal precision. Callers can opt in themselves via `core::arch::x86_64::_mm_setcsr(0x9FC0)` around the reverse-sweep call. ARM64 flushes by default so the note is x86-specific.

Verified on M4 Max (wgpu Metal) and A100 via vast.ai (CUDA f32 + f64).
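
The ULP-budget check behind the parity table can be sketched roughly as follows. This is an illustrative reduction, not the code in tests/gpu_cpu_parity.rs: `ParityCase`, `ulp_distance`, and `check_case` are hypothetical names, the two function pointers stand in for the CPU reference and the backend under test, and only values are compared (the real harness also compares gradients and runs bytecode tapes on wgpu and CUDA).

```rust
/// Maps an f32 onto an integer line where adjacent floats differ by exactly 1
/// and +0.0 / -0.0 coincide.
fn ordered_bits(x: f32) -> i64 {
    let b = x.to_bits() as i32;
    // Negative floats are sign-magnitude, so mirror them below zero.
    if b < 0 { (i32::MIN as i64) - (b as i64) } else { b as i64 }
}

/// Number of representable f32 values between `a` and `b`.
/// Any NaN is reported as the maximum distance (an assumption of this sketch).
fn ulp_distance(a: f32, b: f32) -> u64 {
    if a.is_nan() || b.is_nan() {
        return u64::MAX;
    }
    (ordered_bits(a) - ordered_bits(b)).unsigned_abs()
}

/// One row of a hypothetical parity table.
struct ParityCase {
    name: &'static str,
    reference: fn(f32) -> f32, // stand-in for the CPU evaluation
    candidate: fn(f32) -> f32, // stand-in for the GPU backend under test
    points: &'static [f32],
    max_ulps: u64,
}

/// Checks every point of a case and reports the first one over budget,
/// naming the case, the point index, and the measured ULP distance.
fn check_case(case: &ParityCase) -> Result<(), String> {
    for (i, &x) in case.points.iter().enumerate() {
        let ulps = ulp_distance((case.reference)(x), (case.candidate)(x));
        if ulps > case.max_ulps {
            return Err(format!(
                "case `{}` drifted at point {} (x = {}): {} ULPs > budget {}",
                case.name, i, x, ulps, case.max_ulps
            ));
        }
    }
    Ok(())
}
```

Measuring distance in ULPs keeps the budget meaningful across magnitudes, which a fixed absolute tolerance would not.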

Phase 9 R2: start factoring out per-opcode math formulas that today live in both `opcode.rs` (bytecode-tape dispatch) and the various AD type impls (`dual.rs`, `dual_vec.rs`, `reverse.rs`, `breverse.rs`, `laurent.rs`, `traits/num_traits_impls.rs`). Phase 7 found three cases of silent drift between copies: atan at large |a|, div at small |b|, hypot Inf handling on the GPU side. R1 (tests/gpu_cpu_parity.rs, committed earlier in this branch) catches drift between CPU and GPU; this commit starts catching drift within the CPU tree.

The `src/kernels/mod.rs` module currently exports:

- `hypot_partials(a, b, r)`: `(a/r, b/r)` with `r == 0` → `(0, 0)`.
- `atan2_partials(a, b)`: `(b/h/h, -a/h/h)` with `h = hypot(a, b)`.
- `atan_deriv(a)`: `1/(1+a²)` with the `|a| > 1e8` inv-based fallback.
- `asinh_deriv(a)`, `acosh_deriv(a)`: matching large-|a| fallbacks.

`opcode.rs` delegates Atan2, Hypot, Atan, Asinh, Acosh to these helpers. The duplicate copies in `dual.rs`, `dual_vec.rs`, and friends remain for now; a future commit can migrate them. Every helper is generic over `num_traits::Float` so any future migration needs zero signature changes.

Verified on M4 Max (wgpu Metal) and A100 via vast.ai (CUDA f32 + f64, gpu_cpu_parity table green across all three backends).
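
To make the shape of these helpers concrete, here is a minimal sketch of three of them. It is specialised to `f64` so that it compiles without external crates, whereas the commit message says the real helpers are generic over `num_traits::Float`. The exact form of the large-|a| branch in `atan_deriv` is an assumption: the message only calls it "inv-based", and the rewrite through `inv = 1/a` below is one algebraically equivalent way to avoid overflowing `a²`. `asinh_deriv` and `acosh_deriv` are left out because their fallbacks are not spelled out.

```rust
/// Partials of hypot(a, b) given the already computed result `r`.
/// At the origin the gradient is defined as (0, 0) instead of 0/0.
pub fn hypot_partials(a: f64, b: f64, r: f64) -> (f64, f64) {
    if r == 0.0 {
        (0.0, 0.0)
    } else {
        (a / r, b / r)
    }
}

/// Partials of atan2(a, b): (b / h², -a / h²) with h = hypot(a, b).
/// Dividing by `h` twice avoids forming h², which could overflow.
/// No guard for h == 0 is described in the source, so none is added here.
pub fn atan2_partials(a: f64, b: f64) -> (f64, f64) {
    let h = a.hypot(b);
    (b / h / h, -a / h / h)
}

/// Derivative of atan(a), i.e. 1 / (1 + a²).
pub fn atan_deriv(a: f64) -> f64 {
    if a.abs() > 1e8 {
        // Assumed form of the fallback: with inv = 1/a,
        // 1 / (1 + a²) == inv² / (1 + inv²), and a² is never formed.
        let inv = 1.0 / a;
        inv * inv / (1.0 + inv * inv)
    } else {
        1.0 / (1.0 + a * a)
    }
}
```

Keeping each formula in one function means a fix such as the large-|a| branch only has to be made once, which is the drift the commit is trying to prevent.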

gvonness-apolitical added 2 commits April 18, 2026 21:27

gvonness-apolitical merged commit acbc885 into main Apr 18, 2026
6 checks passed

gvonness-apolitical deleted the fix/harden-cycle-6-phase9 branch April 18, 2026 20:39
