Phase 9: CPU↔GPU parity harness + kernels SSOT + FTZ doc #57
Merged
gvonness-apolitical merged 2 commits into main on Apr 18, 2026
Two Phase 9 deliverables: a property test suite that catches CPU-GPU drift, and a doc note capturing why the library doesn't set MXCSR FTZ itself.

R1: tests/gpu_cpu_parity.rs defines a `PARITY_CASES` table of 26 hand-picked tapes covering every major op family (arithmetic, unary algebraic, exp/log, trig, hyperbolic, binary max/min/hypot/atan2, plus three composite tapes: Rosenbrock, sin(x²+1)·exp(-x), log-sum-exp). Each case has per-op ULP tolerances (typically 2-16 ULPs for both f32 and f64; 64 for the deepest composites). Three runners (wgpu, CUDA f32, CUDA f64) iterate the whole table at 3-4 input points per case and assert that value and gradient ULP distances stay inside the budget. A divergence names the case, point index, and computed ULPs in the panic message, so regressions point straight at the bad op.

The test was written to catch future CPU-GPU drift: a new opcode path added on the CPU that isn't mirrored into shaders, or a shader refactor that silently changes behaviour. Points deliberately avoid catastrophic-cancellation regimes (near-π for sin/cos, near-unity subtraction) where f64→f32 input rounding alone dominates the ULP metric and would give false positives.

L12: the src/tape.rs `Tape::reverse` doc comment now explains that the library doesn't flip the x86 MXCSR flush-to-zero bit. On chains with subnormal adjoints this costs 10-100× on x86 due to microcode emulation, but setting FTZ in the library would change numerical semantics for callers that depend on subnormal precision. Callers can opt in themselves via `core::arch::x86_64::_mm_setcsr(0x9FC0)` around the reverse-sweep call. ARM64 flushes by default so the note is x86-specific.

Verified on M4 Max (wgpu Metal) and A100 via vast.ai (CUDA f32 + f64).
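The caller-side opt-in described in the L12 note could look roughly like the sketch below. This is an illustration only, not code from the PR: `with_ftz_daz` and `MxcsrGuard` are hypothetical names, and the closure stands in for whatever reverse-sweep call the caller makes. Instead of writing the literal `0x9FC0`, the sketch ORs the FTZ and DAZ bits (`0x8040`) into the current MXCSR value and restores the previous value afterwards. Recent Rust toolchains mark `_mm_getcsr`/`_mm_setcsr` as deprecated, so treat this as a sketch under that caveat.

```rust
/// Restores the saved MXCSR value when dropped, including on unwind (x86_64 only).
#[cfg(target_arch = "x86_64")]
struct MxcsrGuard(u32);

#[cfg(target_arch = "x86_64")]
impl Drop for MxcsrGuard {
    #[allow(deprecated)]
    fn drop(&mut self) {
        // SAFETY: SSE is part of the x86_64 baseline; we write back a value
        // previously read from MXCSR.
        unsafe { core::arch::x86_64::_mm_setcsr(self.0) }
    }
}

/// Hypothetical helper: runs `f` with flush-to-zero (bit 15) and
/// denormals-are-zero (bit 6) enabled on x86_64. On other targets it just calls `f`.
#[allow(deprecated)]
pub fn with_ftz_daz<R>(f: impl FnOnce() -> R) -> R {
    #[cfg(target_arch = "x86_64")]
    let _guard = {
        use core::arch::x86_64::{_mm_getcsr, _mm_setcsr};
        const FTZ_DAZ: u32 = 0x8040;
        // SAFETY: SSE is part of the x86_64 baseline.
        let saved = unsafe { _mm_getcsr() };
        unsafe { _mm_setcsr(saved | FTZ_DAZ) };
        MxcsrGuard(saved)
    };
    // The guard (if any) is dropped after `f` returns, restoring MXCSR.
    f()
}
```

A caller would wrap only the reverse sweep, for example `with_ftz_daz(|| tape.reverse(seed))` with whatever arguments the real call takes, so the rest of the program keeps IEEE subnormal behaviour.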
Phase 9 R2: start factoring out per-opcode math formulas that today live in both `opcode.rs` (bytecode-tape dispatch) and the various AD type impls (`dual.rs`, `dual_vec.rs`, `reverse.rs`, `breverse.rs`, `laurent.rs`, `traits/num_traits_impls.rs`). Phase 7 found three cases of silent drift between copies: atan at large |a|, div at small |b|, hypot Inf handling on the GPU side. R1 (tests/gpu_cpu_parity.rs, committed earlier in this branch) catches drift between CPU and GPU; this commit starts catching drift within the CPU tree.

The `src/kernels/mod.rs` module currently exports:

- `hypot_partials(a, b, r)`: `(a/r, b/r)` with `r == 0` → `(0, 0)`.
- `atan2_partials(a, b)`: `(b/h/h, -a/h/h)` with `h = hypot(a, b)`.
- `atan_deriv(a)`: `1/(1+a²)` with the `|a| > 1e8` inv-based fallback.
- `asinh_deriv(a)`, `acosh_deriv(a)`: matching large-|a| fallbacks.

`opcode.rs` delegates Atan2, Hypot, Atan, Asinh, Acosh to these helpers. The duplicate copies in `dual.rs`, `dual_vec.rs`, and friends remain for now; a future commit can migrate them. Every helper is generic over `num_traits::Float` so any future migration needs zero signature changes.

Verified on M4 Max (wgpu Metal) and A100 via vast.ai (CUDA f32 + f64, gpu_cpu_parity table green across all three backends).
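To make the helper formulas listed above concrete, here is a minimal sketch of three of them. It is not the code in `src/kernels/mod.rs`: the real helpers are generic over `num_traits::Float`, while this version is fixed to `f64` so it compiles without external crates, and the exact shape of the inv-based fallback in `atan_deriv` is an assumption (rewriting `1/(1+a²)` in terms of `inv = 1/a`).

```rust
/// Partials of hypot(a, b) given the already-computed result r = hypot(a, b).
/// Returns (0, 0) at r == 0 instead of dividing by zero.
pub fn hypot_partials(a: f64, b: f64, r: f64) -> (f64, f64) {
    if r == 0.0 {
        (0.0, 0.0)
    } else {
        (a / r, b / r)
    }
}

/// Partials of atan2(a, b): (b/h², -a/h²), written as two divisions by
/// h = hypot(a, b) so that h² is never formed explicitly.
pub fn atan2_partials(a: f64, b: f64) -> (f64, f64) {
    let h = a.hypot(b);
    (b / h / h, -a / h / h)
}

/// Derivative of atan(a): 1/(1+a²). For |a| > 1e8 it uses the algebraically
/// equal form inv²/(1+inv²) with inv = 1/a (assumed form of the fallback).
pub fn atan_deriv(a: f64) -> f64 {
    if a.abs() > 1e8 {
        let inv = 1.0 / a;
        let inv2 = inv * inv;
        inv2 / (1.0 + inv2)
    } else {
        1.0 / (1.0 + a * a)
    }
}
```

With a shared helper like this, a dispatcher arm and an AD type impl would both call the same function, which is the drift-prevention idea the commit describes.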
Summary
Phase 9 (final phase of Cycle 6). Three deliverables:
- CPU↔GPU parity harness (`tests/gpu_cpu_parity.rs`): table-driven property test running 26 tapes across 3 backends (wgpu f32, CUDA f32, CUDA f64) at 3-4 input points each, asserting value + gradient ULP distances stay within per-case budgets. Catches future CPU-GPU drift introduced by CPU formula changes not mirrored into shaders, or shader refactors that silently change behaviour.
- Kernels SSOT (`src/kernels/mod.rs`): start factoring per-opcode math formulas into a shared module. Currently exports `hypot_partials`, `atan2_partials`, `atan_deriv`, `asinh_deriv`, `acosh_deriv`. The bytecode-tape `opcode.rs` dispatcher delegates to these helpers so CPU-internal drift across AD types becomes harder. Copies in `dual.rs`, `dual_vec.rs`, etc. remain for now; future commits can migrate them incrementally.
- FTZ doc: the `Tape::reverse` doc comment explains why the library does not set x86 MXCSR flush-to-zero, with guidance for callers who want to opt in.

Test plan
- `cargo test --features "bytecode,gpu-wgpu,taylor,laurent" --test gpu_cpu_parity` (wgpu Metal on M4 Max)
- `-D warnings`
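The parity test run above budgets its comparisons in ULPs. The harness's own metric is not shown in this PR text, so the following is only a sketch of one common way to measure ULP distance for `f32`; `ordered_bits` and `ulp_distance_f32` are hypothetical names.

```rust
/// Maps an f32 to an integer whose ordering matches the float ordering
/// (sign-magnitude bits to a signed value), so adjacent floats differ by 1.
fn ordered_bits(x: f32) -> i64 {
    let bits = x.to_bits();
    let magnitude = (bits & 0x7FFF_FFFF) as i64;
    if bits >> 31 == 1 {
        -magnitude
    } else {
        magnitude
    }
}

/// Number of representable f32 steps between `a` and `b`.
/// Returns None if either input is NaN; +0.0 and -0.0 are 0 ULPs apart.
pub fn ulp_distance_f32(a: f32, b: f32) -> Option<u64> {
    if a.is_nan() || b.is_nan() {
        return None;
    }
    Some((ordered_bits(a) - ordered_bits(b)).unsigned_abs())
}
```

A per-case check would then compare the returned distance against that case's budget and, on failure, report the case name, point index, and computed distance, as the commit message describes.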