Cycle 6 bug-hunt: 19 commits of numerical / GPU / solver hardening #56
Merged
gvonness-apolitical merged 21 commits into main on Apr 18, 2026
Conversation
`BtapeGuard` and `TapeGuard` previously stored a raw `*mut Tape<F>` /
`*mut BytecodeTape<F>` without a lifetime parameter, so the borrow
checker accepted programs in which the guard outlived its tape. The
next `with_active_{b,}tape` call would then dereference freed memory
(Miri confirmed the UAF).
Add a `'a` lifetime + `PhantomData<&'a mut ...>` so the guard carries
the tape borrow statically. The `compile_fail` doctests on each guard
regression-check the dangling pattern. Add `#[must_use]` on both
constructors to match the crate convention and warn against
`let _ = Guard::new(&mut t)`, which would deactivate the tape
immediately.
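A minimal, self-contained analogue of the fix (the `Tape`/`Guard` types here are illustrative stand-ins for the crate's real tape and guard types, not its API): the raw pointer alone carries no lifetime, so the `PhantomData<&'a mut Tape>` field is what ties the guard to the borrow it wraps.

```rust
use std::marker::PhantomData;

struct Tape {
    entries: Vec<f64>,
}

// Without `_borrow`, the borrow checker would accept a guard that outlives
// its tape; with it, the guard statically carries the `&'a mut` borrow.
struct Guard<'a> {
    tape: *mut Tape,
    _borrow: PhantomData<&'a mut Tape>,
}

impl<'a> Guard<'a> {
    #[must_use]
    fn new(tape: &'a mut Tape) -> Self {
        Guard { tape, _borrow: PhantomData }
    }

    fn push(&self, v: f64) {
        // SAFETY: the PhantomData lifetime guarantees the tape outlives the guard.
        unsafe { (*self.tape).entries.push(v) }
    }
}

fn main() {
    let mut t = Tape { entries: Vec::new() };
    let g = Guard::new(&mut t);
    g.push(1.0);
    drop(g); // borrow of `t` ends here; using `t` before this would not compile
    assert_eq!(t.entries.len(), 1);
}
```

With this shape, the dangling pattern (`let g = { let mut t = …; Guard::new(&mut t) };`) is rejected at compile time, which is exactly what the `compile_fail` doctests pin.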
Call sites that previously accessed the tape directly after guard
construction (`record`, `record_multi`, `composed_hvp`, `vjp_step`, and
six integration tests) are restructured to scope the guard around the
closure invocation that actually needs the thread-local; subsequent
`tape.reverse(...)` / `tape.set_output(...)` calls then run on the
free borrow as intended.
The existing SAFETY comments in `with_active_btape` /
`with_active_tape` are tightened to reflect that the static lifetime
now enforces what the old comment only asserted by convention.
The closed-form partials for `atan2`, `asinh`, and `acosh` each re-square a quantity that `hypot` was expressly designed to avoid: `h*h`, `a*a+1`, `a*a-1`. In f64 these overflow for `|h| > ~1.3e154` and `|a| > ~1.3e154`, silently collapsing the derivative to 0 (or, for `atan2`, sometimes to `+Inf` with the wrong sign). The `atan2` case also underflows for `|h| < ~1.5e-154` — the naive `h*h == 0` guard fired, and callers received a zero tangent where the true partial was representable as a large finite value.

Replace:
- `atan2` partials with the algebraically equivalent `(y/h)/h` and `(-x/h)/h` factoring. `y/h` and `x/h` are bounded by 1 (since `h = hypot(x, y)`), so no intermediate step overflows for any `h > 0`. The zero-guard now checks `h == 0` instead of `h*h == 0`, which is the only remaining singular point.
- `asinh`/`acosh` partials with a `|x| > 1e8` switch to the form `|1/x| / sqrt(1 ± (1/x)²)`, mirroring the existing guard already in place on `atan`. For small `x`, the direct formula is kept.

Applied at four layers: the `Dual` and `DualVec` forward-mode types, the `Reverse` adept-style reverse-mode type, and the `OpCode::*` reverse partials used by `BReverse` / tape-replay. The GPU backends (WGSL + CUDA + Taylor codegen) have the same class of bug in f32 (overflow threshold near `|x| = 1.8e19`) but are left for a follow-up commit that touches the shared codegen paths.

Tests cover `(1e200, 1e200)`, `(1e-200, 1e-200)`, mixed magnitudes, origin convention, NaN propagation, the acosh(1) boundary, and small-input regression (old behaviour unchanged) across Dual/DualVec/Reverse/BReverse.
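A tiny numeric demonstration of why the factoring matters (standalone sketch; `partial_naive` / `partial_factored` are illustrative names, not crate functions):

```rust
// Naive form: h*h overflows to +inf for |h| > ~1.3e154, so the partial collapses to 0.
fn partial_naive(y: f64, h: f64) -> f64 {
    y / (h * h)
}

// Factored form: y/h is bounded by 1, so nothing overflows for any h > 0.
fn partial_factored(y: f64, h: f64) -> f64 {
    (y / h) / h
}

fn main() {
    let (x, y) = (1e200_f64, 1e200_f64);
    let h = x.hypot(y); // ≈ 1.414e200, comfortably finite
    assert_eq!(partial_naive(y, h), 0.0); // derivative silently collapses
    let p = partial_factored(y, h);
    assert!(p.is_finite() && p > 0.0); // ≈ 5e-201, a representable value
}
```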
`x^n` for integer `n` is a real-valued polynomial with real derivatives
at any real `x`, including `x < 0`. The `exp(b · ln(a))` implementation
behind `powf` produced a NaN tangent in that regime: `ln(a)` is NaN for
`a < 0`, and IEEE's `NaN * 0 = NaN` prevented the `dy * n.eps` factor
from collapsing to zero even when the exponent had no live tangent at
all. `(-2.0_f64).powf(3.0)` thus returned `Dual { re: -8, eps: NaN }`
instead of `Dual { re: -8, eps: 12 }`.
Fix by detecting a losslessly-representable integer exponent at the
forward-mode types (`Dual`, `DualVec`, `Taylor`, `Laurent`) and the
Adept-style `Reverse`, and dispatching to `powi`, which takes the
squaring path and handles negative bases natively. The integer check
uses `to_i32()` + round-trip `F::from(ni) == n.value` to guarantee no
precision is lost in the conversion.
For the bytecode-tape opcode `OpCode::Powf`, tape-replay semantics
preclude changing the recorded opcode, so the reverse partial gains a
small safety net: when `a <= 0` and the reverse pass needs `d/db`, return
0 instead of `r * ln(a)`. The convention matches the forward-mode
fast path — finite `r` with `a < 0` implies `b` was integer, and the
classical derivative w.r.t. `b` at an integer `b` is not defined; 0 is
the standard choice.
Non-integer exponents of a negative base (e.g. `(-2).powf(2.5)`) remain
NaN — the dispatch only fires for losslessly-representable integers.
Live-tangent exponents continue through the existing `exp(b·ln(a))`
path and are unaffected.
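A minimal sketch of the dispatch for a scalar (value, tangent) pair, assuming only the base carries a live tangent; the function name and pair representation are illustrative, not the crate's types:

```rust
// Losslessly-representable integer exponents take the powi path, which handles
// negative bases natively; everything else keeps the exp(n·ln a) regime.
fn powf_value_and_tangent(a: f64, da: f64, n: f64) -> (f64, f64) {
    // `as` saturates, so huge or NaN exponents fail the round-trip check
    // and fall through to the general path.
    let ni = n as i32;
    if f64::from(ni) == n {
        return (a.powi(ni), f64::from(ni) * a.powi(ni - 1) * da);
    }
    // General path: tangent is NaN for a < 0, as before.
    (a.powf(n), n * a.powf(n - 1.0) * da)
}

fn main() {
    // (-2)^3: value -8, derivative 3·(-2)² = 12 — previously a NaN tangent.
    assert_eq!(powf_value_and_tangent(-2.0, 1.0, 3.0), (-8.0, 12.0));
    // Non-integer exponent of a negative base stays NaN by design.
    assert!(powf_value_and_tangent(-2.0, 1.0, 2.5).1.is_nan());
}
```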
Three independent guards, each addressing an input pattern that the
current code handled by panicking, asserting, or silently walking off
into a non-finite region.
clarke_jacobian (src/bytecode_tape/jacobian.rs)
-----------------------------------------------
The combo enumeration over active kinks computes `1usize << k` and
panics (debug) or wraps to 1 (release) whenever `k >= usize::BITS`. A
caller passing `max_active_kinks = Some(64)` on a 64-bit target with
an unusually-kinky tape would hit this. Cap the effective limit at
`usize::BITS - 1` and return `TooManyKinks { limit: cap }` instead.
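The cap can be sketched standalone (here `None` stands in for the crate's `TooManyKinks` error; the function name is illustrative):

```rust
// `1usize << k` panics in debug builds and wraps (via shift masking) in
// release once k reaches usize::BITS, so cap the effective limit below that.
fn combo_count(active_kinks: u32) -> Option<usize> {
    if active_kinks >= usize::BITS {
        return None; // caller maps this to a TooManyKinks-style error
    }
    Some(1usize << active_kinks)
}

fn main() {
    assert_eq!(combo_count(3), Some(8)); // 2³ sign combinations
    assert_eq!(combo_count(usize::BITS), None); // previously a panic / wrap to 1
}
```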
Laurent Add/Sub (src/traits/laurent_std_ops.rs)
-----------------------------------------------
Both impls aligned the operands to `min(p1, p2)` and asserted the
resulting shift fit in `K`. `x + Laurent::zero()` where `|x.pole_order|
>= K` would panic, breaking the Zero trait contract and any generic
`.sum()` built on top of it (e.g. iterator-based totals). Short-circuit
both impls on either operand being all-zero.
backtracking_armijo (echidna-optim/src/line_search.rs)
------------------------------------------------------
The Armijo sufficient-decrease check `f_new <= f_x + c·α·g^T·d` is
trivially true when `f_new = -Inf`, so the solver accepted a step off
the domain boundary and walked toward -Inf. Reject trial points with
non-finite `f_new` or `g_new` — treat them as infeasible and
backtrack past. Common trigger: log-barrier or log-likelihood
objectives returning `-Inf` outside the feasible region.
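A sketch of the hardened acceptance test (names are illustrative; the analogous check on a non-finite `g_new` is omitted for brevity):

```rust
// Armijo sufficient-decrease check. g_dot_d is g^T·d (negative for a descent
// direction), c the Armijo constant, alpha the trial step length.
fn armijo_accepts(f_x: f64, f_new: f64, g_dot_d: f64, alpha: f64, c: f64) -> bool {
    // A -inf trial satisfies the decrease inequality trivially; treat any
    // non-finite objective value as infeasible and keep backtracking.
    if !f_new.is_finite() {
        return false;
    }
    f_new <= f_x + c * alpha * g_dot_d
}

fn main() {
    // -inf used to be "accepted": the guard now rejects it.
    assert!(!armijo_accepts(1.0, f64::NEG_INFINITY, -1.0, 0.5, 1e-4));
    // An ordinary sufficient-decrease step still passes.
    assert!(armijo_accepts(1.0, 0.9, -1.0, 0.5, 1e-4));
}
```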
Each fix ships with a focused regression test (2 clarke + 5 laurent +
2 line_search), all passing. Full test suite remains green.
The WGSL shaders and CUDA kernel had kept the pre-`atan2`-fix, pre-`asinh`-overflow-guard, pre-`powf`-safety-net formulations, plus `fract` / ABS / max-min conventions that diverged from the CPU. This lands the mirror fixes so that any expression recorded on the CPU tape evaluates equivalently on the GPU at every magnitude the CPU can represent.

WGSL (`reverse.wgsl`, `tangent_forward.wgsl`, `tangent_reverse.wgsl`):
- `atan2` partials: replace `(b*at - a*bt) / (a²+b²)` with a form normalized by `max(|a|,|b|)` so neither `a*a` nor `b*b` can overflow. The original `(y/h)/h` pattern used on the CPU isn't available in WGSL (no `hypot`), so the normalized form is the equivalent.
- `asinh` / `acosh`: switch to `|1/a|/sqrt(1 ± (1/a)²)` for `|a| > 1e8`.
- `fract`: `x - trunc(x)` instead of the built-in `fract(x) = x - floor(x)` so negative inputs match Rust's `f32::fract`.
- `abs` derivative: detect the sign via bit pattern instead of `a >= 0.0`, which returns `true` for both `+0.0` and `-0.0` — matches Rust's `f32::signum` semantics.
- `max` / `min`: route the adjoint via `bitcast<u32>(r) == bitcast<u32>(a)` (reverse.wgsl) or an explicit NaN bit-pattern check (tangent paths); `b != b` got folded away by naga.
- Taylor codegen `1.0 / 0.0` and `0.0 / 0.0` literals are rejected by naga as abstract-typed `inf` / `nan`; emit the canonical bit patterns (`bitcast<f32>(0x7f800000u)` / `0x7fc00000u`) instead.

CUDA (`tape_eval.cu`):
- Same `atan2` / `asinh` / `acosh` / `powf` / `fract` / max-min fixes. CUDA has `hypot` and `isnan` available, so atan2 uses `(b/h)/h` directly (no rescale needed) and NaN routing is straightforward. ABS was already correct via `copysign`.

Tests: `tests/gpu_kernel_parity.rs` adds 7 parity tests covering `atan2` at 1e20, `asinh`/`acosh` at 1e20, `powf(-2, 3)`, `max(x, NaN)`, `|-0.0|`, and `fract(-1.3)`. The existing 30-test wgpu suite stays green. CUDA changes are code-reviewed; runtime verification deferred to vast.ai.
Mirror of `tests/gpu_kernel_parity.rs` that exercises the same six numerical edge cases through the CUDA reverse / tangent-forward / tangent-reverse paths. Validated on A100 + CUDA 12.8: `atan2` at 1e20, `asinh`/`acosh` at 1e20, `powf(-2, 3)`, `max(x, NaN)`, `fract(-1.3)`. The existing CUDA suite (15 tests) and STDE-on-GPU suite (11 tests) stay green.
The runtime-codegenned Taylor shaders (WGSL + CUDA) had three
emissions that diverged from the CPU reference after Phase 2B:
POWF db at a ≤ 0 (H6):
Emitted `val * log(|a|)` (CUDA) / `val * log(abs(a.v[0]))` (WGSL),
which is finite for a < 0 and diverges from CPU's `OpCode::Powf`
safety net (`db = 0` for a ≤ 0). Set db = 0 in both emitters so the
first-order Taylor coefficient matches `da * a.v[1] + 0 * b.v[1]`,
the same value the CPU computes.
POWI at a = 0 with |n| ≥ 5 (M18):
Only ni ∈ {2, 3, 4} were special-cased via chained `jet_mul`. For
|ni| ≥ 5 the fall-through `a.v[0] <= 0` branch computed
`log|0| = -∞` and polluted every higher-order coefficient with NaN,
while `x^5` at `x = 0` has a clean Taylor expansion (first nonzero
coefficient at index 5). Extend the squaring chain to cover
|ni| ∈ {2,…,8}, matching CPU `taylor_powi_squaring`.
ABS Taylor sign scan (M20):
Emitted `sg = select(sign(a.v[0]), 1.0, a.v[0] == 0.0)`, always
returning +1 at zero primal. CPU `Taylor::abs` scans coefficients
for the first non-zero and uses its sign, so that |f(t)| for f(t)
starting `[0, -v, …]` has sign -1. Emit an else-if chain across
all k coefficients, defaulting to +1 for the identically-zero case.
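The CPU-side sign convention the emitter now mirrors can be sketched in a few lines (illustrative function, not the crate's `Taylor::abs`):

```rust
// The sign of |f(t)| near t = 0 comes from the first non-zero Taylor
// coefficient; an identically-zero jet defaults to +1.
fn abs_sign(coeffs: &[f64]) -> f64 {
    for &c in coeffs {
        if c != 0.0 {
            return c.signum();
        }
    }
    1.0
}

fn main() {
    assert_eq!(abs_sign(&[0.0, -3.0, 1.0]), -1.0); // f(t) = -3t + …  ⇒  |f| = -f near 0⁺
    assert_eq!(abs_sign(&[2.0, -3.0]), 1.0);       // nonzero primal decides immediately
    assert_eq!(abs_sign(&[0.0, 0.0]), 1.0);        // identically-zero case
}
```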
Generated shader still compiles cleanly on naga (Metal) and NVRTC
(PTX). 30 wgpu tests + 21 CUDA tests stay green on both M4 Max Metal
and A100 CUDA 12.8. Dead-store warnings for `sg`/`db` at k=1 are
pre-existing cosmetic artifacts of the fixed-width jet emission.
Every `(X * Y) as usize` in the CUDA dispatch macros used u32 × u32
multiplication first, then widened the result. For batch × nv × K in
a K=5 Taylor run at ~100k variables and batch >= 8k, the u32 product
overflows before the widening; `alloc_zeros` then reserves a truncated
buffer and the kernel (which indexes with `unsigned long long`) writes
past the end → device memory corruption that isn't caught on the host.
Rewrite each site as `(X as usize) * (Y as usize)` so the multiplication
happens in 64-bit on 64-bit targets. Fifteen sites in `cuda_backend.rs`;
the wgpu analogues already cast explicitly via `u64` were correct.
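The wrap is easy to reproduce in isolation (sizes chosen for illustration; the real sites multiply batch × nv × K, and `wrapping_mul` models the release-mode behaviour of the old `(X * Y) as usize`):

```rust
fn main() {
    // Illustrative sizes: batch × nv × K exceeds u32::MAX.
    let (batch, nv, k): (u32, u32, u32) = (10_000, 100_000, 5);

    // Wrong: the product is computed in u32 and wraps before the widening cast.
    let wrapped = batch.wrapping_mul(nv).wrapping_mul(k) as usize;

    // Right: widen each operand first so the multiply happens in 64-bit.
    let correct = (batch as usize) * (nv as usize) * (k as usize);

    assert_eq!(correct, 5_000_000_000);
    assert_eq!(wrapped, 705_032_704); // 5e9 mod 2^32 — a truncated allocation size
    assert_ne!(wrapped, correct);
}
```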
On the wgpu side, the shaders still use u32 for `bid * num_variables (* K)`
indexing. Add a `(batch * nv [* K]) <= u32::MAX` guard to two dispatch
paths that previously relied on an implicit argument:
- `taylor_forward_kth_batch` (K ∈ {1..5}, the K multiplier previously
wasn't guarded).
- `sparse_jacobian` (effective batch = `num_colors`, not the caller's
`batch_size`; the upstream guard didn't apply).
Also rewrite five residual `(batch_size * _) as usize` sites in
`wgpu_backend.rs` to match the CUDA form for defense-in-depth — each
was already covered by an adjacent u64-arithmetic guard plus the
`ni ≤ nv` tape invariant, so no observable behavior change, but a
future relocation of the guard would otherwise re-expose the wrap.
CUDA suite (15 + 6) green on A100 + CUDA 12.8. wgpu suite (30 + 7)
green on M4 Max / Metal.
Eight related fixes across the forward-mode AD types and supporting
ops. Each addresses a distinct NaN / Inf / catastrophic-cancellation
regime that was producing wrong derivatives.
Hypot tangent factoring (Dual, DualVec):
Replaced `(x·dx + y·dy) / h` with `(x/h)·dx + (y/h)·dy`. `x/h` and
`y/h` are each bounded by 1, so no intermediate step overflows —
matches the overflow-safe pattern `hypot` uses for the primal.
Prior form produced Inf tangents for inputs whose true tangent was
representable.
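A standalone comparison of the two factorings (illustrative functions, not the crate's `Dual` implementation):

```rust
// Prior form: the numerator x·dx + y·dy can overflow even when the true
// tangent is representable.
fn tangent_naive(x: f64, dx: f64, y: f64, dy: f64) -> f64 {
    let h = x.hypot(y);
    (x * dx + y * dy) / h
}

// Fixed form: x/h and y/h are each bounded by 1, so no step overflows.
fn tangent_factored(x: f64, dx: f64, y: f64, dy: f64) -> f64 {
    let h = x.hypot(y);
    (x / h) * dx + (y / h) * dy
}

fn main() {
    let (x, y) = (1e308_f64, 1e308_f64);
    // h ≈ 1.414e308 is still finite, but x·dx + y·dy = 2e308 overflows first.
    assert!(tangent_naive(x, 1.0, y, 1.0).is_infinite());
    // The factored form recovers the true tangent ≈ sqrt(2).
    let t = tangent_factored(x, 1.0, y, 1.0);
    assert!((t - 2f64.sqrt()).abs() < 1e-12);
}
```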
Laurent::hypot rescaling:
`(self*self + other*other).sqrt()` squared both operands before
summing; the squaring overflowed for any leading coefficient above
sqrt(f64::MAX). Rescale both by the max absolute coefficient before
squaring, then multiply the result back — keeps intermediate ranges
bounded and preserves higher-order Taylor coefficients.
Acosh near the domain boundary (Dual, DualVec):
Replaced the small-|x| `x*x - 1` term with the factored
`(x-1)(x+1)`. Near x = 1 the naive form suffers catastrophic
cancellation (x² rounds to 1 and loses the ε contribution); the
factored form stays distinct down to the ULP.
Recip at the singularity (Dual, DualVec):
Skip the `0·Inf = NaN` chain when the input eps is 0. At x = 0 with
eps = 0, the derivative is the "constant zero" convention rather
than a NaN that would poison downstream. Non-zero eps still produces
the classical Inf derivative.
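The convention can be sketched on a bare (value, tangent) pair (illustrative function, not the crate's `Dual::recip`):

```rust
// d(1/x) = -x⁻²·eps. At x = 0 with a zero tangent seed, (-inf²)·0 would be
// NaN; the guard keeps the "constant zero" convention instead.
fn recip_dual(x: f64, eps: f64) -> (f64, f64) {
    let r = x.recip();
    let d = if eps == 0.0 { 0.0 } else { -(r * r) * eps };
    (r, d)
}

fn main() {
    assert_eq!(recip_dual(0.0, 0.0), (f64::INFINITY, 0.0)); // no NaN poisoning
    assert_eq!(recip_dual(2.0, 1.0), (0.5, -0.25));          // ordinary case
    assert!(recip_dual(0.0, 1.0).1.is_infinite());           // classical Inf survives
}
```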
Powf partial at Inf (opcode reverse_partials):
Extended the `a == 0 || r == 0` fallback to `!a.is_finite() ||
!r.is_finite()`. Prior code computed `b * r / a = 2·Inf / Inf = NaN`
for `Inf^2`; fallback `b * a.powf(b - 1) = 2·Inf = Inf` is finite
and matches the true derivative.
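A sketch of the widened fallback condition, with illustrative names (`r` is the recorded primal `a^b`):

```rust
// Reverse partial d/da of a^b: prefer b·r/a when everything is finite and
// nonzero; otherwise recompute via b·a^(b-1), which avoids the Inf/Inf = NaN.
fn powf_da(a: f64, b: f64, r: f64) -> f64 {
    if a == 0.0 || r == 0.0 || !a.is_finite() || !r.is_finite() {
        b * a.powf(b - 1.0)
    } else {
        b * r / a
    }
}

fn main() {
    // Inf^2: prior code got 2·Inf / Inf = NaN; fallback gives 2·Inf = +Inf.
    let d = powf_da(f64::INFINITY, 2.0, f64::INFINITY);
    assert!(d.is_infinite() && d > 0.0);
    // Ordinary case unchanged: d/da 3² = 2·9/3 = 6.
    assert_eq!(powf_da(3.0, 2.0, 9.0), 6.0);
}
```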
Laurent::powi edges:
- `powi(0) = 1` for any input (including all-zero), matching stdlib
`f64::powi(0, 0) = 1`. Was previously `nan_laurent()` for the
all-zero operand.
- Pole-order overflow returns `nan_laurent()` instead of panicking
via `checked_mul().expect()`. A `Float`-trait method panicking on
extreme-but-representable i32 exponents violated the trait
contract; returning NaN lets generic consumers recover.
taylor_hypot smooth-at-zero:
When both inputs vanish at t=0 with non-zero higher-order terms
(e.g., a(t) = 3t, b(t) = 4t), the function `hypot(a,b) = |t|·…`
is smooth away from t=0, and the Taylor expansion around t=0 from
the right is well-defined. Detect the common t-factor, recurse on
the shifted series, and shift the result back. Prior code went
through `taylor_sqrt` with a zero leading coefficient and produced
+Inf for every higher-order coefficient.
13 new regression tests in `tests/numerical_robustness.rs` cover each
fix. One pre-existing regression test
(`regression_22_laurent_powi_pole_order_overflow_panics`) was flipped
from `should_panic` to a returns-NaN assertion to match the new
trait-conformant behaviour.
Prior behaviour on these input shapes was to silently produce wrong
derivatives or return a jet that looked structurally valid but was
biased toward zero in higher-order terms. Each fix here converts the
problem into an actionable panic (for internal invariants) or
Result::Err (for the trait-dispatched GPU backends), matching the
pattern already in place on the sibling `hessian_vec`,
`sparse_hessian_vec`, and `sparse_jacobian_vec` methods.
Custom-op guards (M14 + M15):
- `jacobian_forward` now asserts `custom_ops.is_empty()`. Previously
it dispatched through `forward_tangent`, which linearizes custom
ops around recording-time primals — correct at `x = x_record`,
O(‖x − x_record‖) biased everywhere else. Users who need a
Jacobian through custom ops should use the reverse-mode
`jacobian`, which calls `CustomOp::partials` exactly.
- `taylor_grad` and `ode_taylor_step` assert the same. The bias
here is worse than first-order: the linearization handler drops
every second- and higher-order custom-op contribution, so the
ODE integrator's local truncation estimate is systematically
wrong through custom ops. `ode_taylor_step` additionally gains a
`const { assert!(K >= 1) }` compile-time check to reject the
previously-UB `K = 0` instantiation.
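The compile-time check uses Rust's inline-const blocks (stable since 1.79); a minimal sketch with an illustrative function name:

```rust
// The assert runs at monomorphization time, so instantiating with K = 0
// is a compile error rather than a runtime invariant violation.
fn taylor_step_order<const K: usize>() -> usize {
    const { assert!(K >= 1) };
    K
}

fn main() {
    assert_eq!(taylor_step_order::<3>(), 3);
    // `taylor_step_order::<0>()` would fail to compile.
}
```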
Single-output guards (M16):
- `hessian`, `hvp`, and `hessian_vec` assert `num_outputs() == 1`.
Multi-output tapes previously returned the Hessian/HVP of
output_indices[0] alone — the caller got a numerically correct
result for the wrong function.
- `laplacian_gpu` and `laplacian_with_control_gpu` return
`Err(GpuError)` on multi-output tapes. Their per-batch `c2s[i]`
indexing assumes a single scalar output; on multi-output tapes
the backend lays out coefficients as `[batch, output]` row-major
and the estimator silently aggregated the wrong stride. Required
adding a `num_outputs(tape)` method to the `GpuBackend` trait so
the check is backend-agnostic.
Input-contiguity guard (M17):
- `forward_into` now carries the same `debug_assert!(opcodes[..n]
are Input)` that `forward` already had. Without it, a tape built
by mixing `new_input` and `push_op` out of order would silently
overwrite non-input slots on parallel evaluation paths.
Cross-country extract_jacobian (M11):
- Elevated the `non-input predecessor remains after elimination`
debug_assert to a hard assert with an actionable message.
Triggers when one declared output is an ancestor of another
(cross-country's elimination protocol doesn't handle this
topology); the release-mode skip would silently produce zero
rows for the dependent output.
Documentation (M13 + M44):
- Added a "Determinism requirement" section to `checkpoint.rs`
explaining that `step` must be bit-reproducible on re-execution
for the adjoint to match the forward trajectory.
- Added a "Custom-ops caveat" section to `stde/mod.rs` warning
that all estimators take `&tape` and never call `forward(x)`, so
custom-op tapes are locked to the recording point `x_record`.
Seven new tests in `tests/structural_asserts.rs` verify each panic
fires on the bad input shape, plus a "scalar-output still works"
sanity check. One existing test (`custom_op_jvp`) was updated to
use `tape.jacobian` (reverse mode) instead of `jacobian_forward`
since the latter now rightly rejects custom ops.
…ecks

Phase 6 addresses a batch of mid-severity findings from the bug hunt:
- LU singularity tolerance now anchors on the matrix infinity norm rather than a running max-pivot-seen, so an early small pivot no longer lowers the bar for every subsequent column.
- L-BFGS curvature filter uses the dimensionally consistent Cauchy-Schwarz form `sy > eps·sqrt(ss·yy)` and clamps the initial gamma to [1e-3, 1e3] so pairs that just pass the filter don't collapse the search direction.
- Newton falls back to steepest descent when the LU solve fails or yields a non-descent direction, keeping the solver usable on non-convex problems instead of reporting NumericalError at the first saddle.
- Trust-region and piggyback iterations detect non-finite predicted/ρ and non-finite tangent/adjoint components, covering ratio-converging divergences the primal norm check would miss.
- Kahan/Neumaier compensated summation replaces naive accumulation in `norm` and `dot` once the vector crosses 64 elements, so ULP noise doesn't leak into the convergence test on long vectors.
- Sparse implicit probe switches to a mixed-sign RHS plus residual check so a near-singular F_z can't slide through via a null-space-orthogonal all-ones test.
- `record_multi` rejects zero-output closures; `hessian`/`hessian_vec` and the sparse siblings recover the constant value from a primal pass on zero-input tapes; `mixed_partial` asserts that `orders.len()` matches the tape inputs; and `extraction_prefactor` falls back to log-domain accumulation so an overflowing integer product returns a clean `+inf` rather than NaN.

Each change has a regression test, and the new assertions mirror the existing structural-assert style from Phase 5.
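The compensated-summation change can be illustrated standalone (Neumaier's variant; the 64-element switch threshold and the crate integration are not shown):

```rust
// Neumaier summation: the compensation term picks up the low-order bits
// that naive accumulation discards.
fn neumaier_sum(xs: &[f64]) -> f64 {
    let (mut sum, mut comp) = (0.0_f64, 0.0_f64);
    for &x in xs {
        let t = sum + x;
        comp += if sum.abs() >= x.abs() {
            (sum - t) + x // low-order bits of x were lost
        } else {
            (x - t) + sum // low-order bits of sum were lost
        };
        sum = t;
    }
    sum + comp
}

fn main() {
    let xs = [1e16, 1.0, -1e16];
    // Naive accumulation loses the 1.0 entirely: ulp(1e16) = 2.
    let naive: f64 = xs.iter().sum();
    assert_eq!(naive, 0.0);
    // The compensated sum recovers it exactly.
    assert_eq!(neumaier_sum(&xs), 1.0);
}
```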
Five low-severity latent issues in the series arithmetic surface as silent NaN/Inf or two's-complement wraparound today. Each now routes through an explicit degenerate value so downstream code sees a uniformly bad result instead of a mixed-state tape.

- `Laurent::recip` uses `checked_neg` on pole_order so `i32::MIN` inputs fold to a NaN Laurent rather than silently wrapping back to `i32::MIN`.
- Laurent Mul/Div switch to `checked_add`/`checked_sub` on the combined pole_order; overflow returns NaN via the existing `nan_pub` path.
- `taylor_sqrt` on a negative leading coefficient fills the whole series with NaN; previously the primal went NaN while higher coefficients kept computing from `1/(2·NaN)` and looked finite to callers.
- `Taylor::rem` with a zero divisor returns an all-NaN Taylor instead of a mix of Inf and NaN slots.
- `taylor_ln_1p` gets a `debug_assert` on the buffer length so misuse surfaces with an actionable message instead of a raw slice panic.

Regression tests live in tests/phase8_taylor_laurent.rs.
… at 0

Three reverse-mode partials misbehaved at boundary or out-of-domain inputs in ways that produced numerically plausible but semantically meaningless gradients.

- Ln / Log2 / Log10 at `a < 0` returned a finite partial like `1/-1 = -1`, letting an out-of-domain primal's gradient flow downstream as if it were real. They now return `(NaN, 0)` when strictly below the domain. The boundary `a = 0` still evaluates via IEEE `1/0 = +Inf` to preserve the one-sided derivative limit callers rely on.
- Ln1p guards `a < -1` the same way; Atanh guards `|a| > 1`.
- Abs at ±0 used `a.signum()`, which returns ±1 depending on the sign bit of its argument — so algebraically equivalent tape positions produced different subgradients. The kink now resolves to 0 (the symmetric midpoint of the Clarke subdifferential [-1, 1]) regardless of sign bit. The forced_reverse_partials path, which the tape uses to pin a specific branch at a kink, keeps its ±1 behaviour via a separate code path.

L16 from the Phase 8 plan was dropped: `reverse_partials(Powi, ...)` is a public entry point exercised by `opcode_edge_cases::reverse_powi`, so the inline dispatch in the tape runtime coexists with the standalone function rather than deprecating it.

Regression tests live in tests/phase8_opcode.rs.
Six convergence/robustness cleanups in the optim crate.

- Trust-region radius collapse: when `rho < 1/4` (or an uphill predicted step) would shrink the radius below `config.min_radius`, the solver now returns `NumericalError` instead of clamping to `min_radius` and spinning to `MaxIterations`. Matches the doc contract on `TrustRegionConfig::min_radius`.
- The `rho` tiny-predicted threshold scales with `(1 + |f|)`; the previous `F::epsilon()` floor tripped the special-case branch immediately on large-magnitude objectives.
- `boundary_tau` thresholds scale with problem magnitude: `dd` gated by `eps * (||s||² + radius² + 1)`, round-off-negative discriminants tolerated up to `eps * b²`, and the Vieta-vs-direct cutover uses `eps * radius` instead of a fixed `eps` so tiny-scale subproblems don't stall on a spurious `tau = 0`.
- Steihaug CG drops a dead-code pre-loop tolerance check that was tautologically `||r|| < sqrt(eps)·||r||`.
- `func_tol` in trust_region, L-BFGS, and Newton compares `|f_prev - f_val| < tol * (1 + |f|)` so a user-set 1e-6 means "one part per million" regardless of whether |f| is around 1 or 1e10.

Regression tests live in echidna-optim/tests/phase8_solvers.rs.
Seven low-severity / latent items cleared with guards, sibling fixes, doc alignments, and one breaking-but-correct API signature change.

- `set_outputs` bounds-checks every index against `self.values.len()` with an actionable panic message instead of silently propagating out-of-range output slots into downstream accessors. Duplicate indices remain valid (a tape may expose the same variable under two outputs).
- `hessian_diagonal_with_buf` gets the `n == 0` early-return treatment that `hessian` / `hessian_vec` / `sparse_hessian` already received — empty-input tapes now recover the constant output value via a primal pass.
- The ndarray wrappers copy inputs element-wise via a small `to_vec` helper, matching the faer/nalgebra adapter pattern. The old `.as_slice().unwrap()` panicked on any non-contiguous layout (stepped slices, transposed views).
- `sparse_jacobian_ndarray` now returns `(Array1<F>, pattern, Vec<F>)` instead of `(pattern, Vec<F>)`, surfacing the outputs the old signature discarded. Breaking change, acceptable in 0.x.
- `eval_dyn`'s `Σx[i]` placeholder becomes `F::zero()` — the placeholder was defensive dead code, but `F::zero()` is a safer fallback if the JetPlan invariant ever slips.
- Checkpoint docs: Revolve truncation's O(n²) worst case is now called out, and `grad_checkpointed_online` correctly documents the pre-loop `stop(x0, 0)` call.
- The `mixed_partial` docstring no longer claims a nonexistent panic on all-zero orders; it documents the identity-operator behaviour instead.

L13 (active_kinks excluding non-finite switching values) was dropped: an earlier bug-hunt regression test pins the opposite contract (conservative inclusion of NaN/Inf kinks), so reversing it would break documented behaviour. The tension is flagged in the active_kinks doc comment for future review.

Regression tests live in tests/phase8_misc.rs.
Six numerical / safety fixes to the GPU correctness baseline, applied symmetrically to the WGSL hand-written shaders (forward, reverse, tangent_forward, tangent_reverse), the CUDA NVRTC kernel, and the Taylor codegen that emits both families.

- ATAN: the `1/(1+a²)` derivative overflows in f32 for `|a| > 1.84e19`, collapsing to 0 via `1/inf`. Switch to the inv-based form `inv²/(1+inv²)` for `|a| > 1e8` (and its second-order analogue in the tangent_reverse kernel). Mirrors the CPU fix applied in Phase 2A.
- DIV: `db = -a*inv²` overflows for small `|b|` (`inv²` alone can exceed f32::MAX before the multiply), and the tangent_reverse second order uses `inv³`, which is worse. Factor through the primal `r = a/b` so each term drops one `inv` — `db = -r*inv` for first order, and the `inv³` term becomes `r*inv²` for second order.
- POWI n=1 at a=0: the tangent_reverse second derivative is the general `n*(n-1)*pow(a, n-2)*at`, which for n=1 evaluates `0 * pow(0, -1) * at = 0 * inf = NaN`. Short-circuit `n == 1` to `da_eps = 0`, matching the mathematically exact zero second derivative of a linear function.
- EXPM1 / LN1P (WGSL tangent kernels only): both primals used the cancellation-prone `exp(a)-1` / `log(1+a)` forms, losing ~7 digits for `|a| < 1e-4`. The forward-shader helpers `expm1_f32` / `ln1p_f32` (Taylor shortcut below threshold) are now duplicated into the two tangent shaders. Tangent kernels read the primal when forming `rt`, so a drifted primal propagates.
- HYPOT in Taylor codegen: the emitted `sqrt(a*a + b*b)` primal patch overflows at large leading-order magnitude. WGSL now computes `h * sqrt((a/h)² + (b/h)²)` via max-rescale; CUDA uses the `hypot(a, b)` builtin. Higher-order jet coefficients still pass through `jet_mul`/`jet_add` and remain overflow-risky; a full jet-wide rescale is a follow-up.
- CUDA upload_tape with empty output_indices: the CUDA path crashed on `clone_htod(&[])` when a downstream caller cleared `GpuTapeData.output_indices` (valid since the struct fields are public). The defensive fallback now synthesises `vec![data.output_index]` with `num_outputs = 1`, matching wgpu. Applied to both `upload_tape` (f32) and `upload_tape_f64`.

Regression tests live in tests/phase7_gpu_fixes.rs. WGSL runs on Metal (M4 Max) locally, CUDA runs on A100 via vast.ai. Both backends pass the Phase 7 tests plus the full pre-existing GPU suite.
Surfaces GPU driver failures and locks down CUDA-CPU numerical parity.
- wgpu device.poll at all five submission-wait sites now returns the
`Result<PollStatus, PollError>` through `?` into `GpuError::Other`
instead of `let _ = device.poll(...)`. On device loss (driver reset,
another submission OOM), callers used to see the subsequent
`rx.recv()` block forever, looking like a deadlock. The actual
driver error now propagates up to the caller.
- NVRTC compilation for every kernel family (f32, f64, Taylor codegen)
now passes `CompileOptions { fmad: Some(false), ..Default::default() }`.
The CUDA default contracts `a*b + c` into an FMA that rounds once
instead of twice, producing sub-ULP drift vs the CPU's unfused
arithmetic. Disabling FMA gives bit-exact CPU-GPU parity on f64
(exercised by `l25_cuda_fmad_disabled_bit_exact_with_cpu_f64`). The
perf cost is 1-2% in our kernel mix; callers chasing raw throughput
can revert the option.
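The contraction drift is easy to see on the CPU with `mul_add`, which Rust documents as a single-rounding fused multiply-add — the same shape nvcc's default `fmad` contraction emits (illustrative values):

```rust
fn main() {
    let a = 1.0_f64 + 2f64.powi(-30);

    // Unfused: a*a rounds once (dropping the 2⁻⁶⁰ term), the subtract rounds again.
    let unfused = a * a - 1.0;

    // Fused: one rounding of the exact a² - 1, so the 2⁻⁶⁰ term survives.
    let fused = a.mul_add(a, -1.0);

    assert_eq!(unfused, 2f64.powi(-29));
    assert_eq!(fused, 2f64.powi(-29) + 2f64.powi(-60));
    assert_ne!(unfused, fused); // sub-ULP drift: why fmad=false restores parity
}
```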
Additionally pins behaviour confirmed correct during Phase 7 exploration:
- `m26_wgpu_abs_at_positive_zero` and `_at_negative_zero`: the WGSL
shaders already use `bitcast<u32>` to inspect the sign bit, so Abs
at -0.0 gives -1 and at +0.0 gives +1. No code change; the tests
lock this in against future refactors.
- `m28_cuda_rem_quotient_boundary_consistent`: the CUDA REM primal
(`fmod`) and reverse partials (`db = -trunc(a/b)`) are internally
consistent. No code change; regression test pins the convention.
M23's runtime path (device loss) can't be simulated in a unit test —
the error-propagation code is exercised by the compile-gate alone.
Tests: tests/phase7_gpu_parity.rs. WGSL verified on Metal/M4 Max,
CUDA verified on A100 via vast.ai.
Three real bugs surfaced by a retroactive multi-agent review of the Phase 7 + Phase 8 commits (d0bbab9..4dbee62).

- WGSL `tangent_reverse.wgsl` POWI was missing the `n == 1` short-circuit added to the CUDA kernel in Phase 7. HVP on `x.powi(1)` at x=0 returned NaN on the wgpu backend while CUDA returned 0, breaking CPU-GPU and CUDA-WGSL parity on tapes built with the public `OpCode::Powi` API.
- `TaylorDyn::rem` missed the Phase 8 `Taylor::rem` zero-divisor guard. The dynamic counterpart silently produced a mix of finite and NaN/Inf coefficients when `rhs.coeffs[0] == 0`.
- Taylor codegen WGSL HYPOT primal produced NaN for `hypot(±Inf, x)` (the rescale formula gave `Inf/Inf * 0 = NaN`). IEEE requires `hypot(±Inf, x) = +Inf`, and the CUDA builtin and CPU both honour it. Added an explicit `aa == inf || bb == inf` early return that short-circuits to +Inf before the rescale.

New regression tests:
- `l23_wgpu_powi_n1_at_zero_second_derivative_finite` — WGSL sibling of the CUDA L23 test.
- `m29_wgpu_atan_moderate_large_a_positive_normal_f32` — strengthens the earlier M29 wgpu test by using `a = 3e9`, where the f32 result is a normal (not denormal) float, so Metal's denormal flush doesn't mask a regression to `0.0`.
- `l9_taylor_dyn_rem_zero_divisor_returns_nan` — pins the TaylorDyn sibling of the existing `Taylor::rem` test.

Out of scope, noted for a follow-up: the scalar `forward.wgsl hypot_f32` and `tangent_forward.wgsl` HYPOT primal arms still use the unguarded `sqrt(a*a + b*b)` form. They pre-date Phase 7, and a separate pass should bring them in line with the Taylor codegen rescale.
Two pre-existing WGSL HYPOT issues flagged out-of-scope during the Phase 7/8 retroactive review.

- `forward.wgsl hypot_f32`: the rescaled formula computes `mn/mx = Inf/Inf = NaN` when both operands are Inf, then returns `Inf * NaN = NaN`. IEEE 754-2008 mandates `hypot(±Inf, x) = +Inf` for any x (including NaN). Added an `ax == inf || ay == inf` short-circuit before the rescale. `forward.wgsl` is the source of the `r` value that reverse.wgsl and tangent_reverse.wgsl read back, so fixing this helper lifts the IEEE contract across the whole reverse-primal chain.
- `tangent_forward.wgsl` HYPOT arm: the primal was `sqrt(a*a + b*b)` unguarded, overflowing for `|a|` or `|b|` > sqrt(f32::MAX) ≈ 1.84e19. Replaced with a call to `hypot_f32`, and added the helper (plus the same IEEE Inf guard) to this shader. Brings tangent_forward's HYPOT primal bit-for-bit in line with the forward kernel.

Regression tests (all wgpu) in tests/phase7_gpu_fixes.rs:
- `wgpu_hypot_inf_finite_returns_inf`
- `wgpu_hypot_inf_inf_returns_inf` — the triple-backend divergence case that motivated the fix (CPU ✓, CUDA ✓, WGSL was NaN).
- `wgpu_tangent_forward_hypot_large_magnitude_primal_finite` — pins the rescale so `hypot(1e20, 1e20)` stays at ≈1.41e20 instead of overflowing to +Inf.
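In Rust terms, the guarded helper behaves like this sketch (the shader itself is WGSL; `hypot_guarded` is an illustrative mirror of `hypot_f32`, not crate code):

```rust
// Max-rescaled hypot with the IEEE Inf short-circuit: hypot(±inf, y) = +inf
// for any y (including NaN), and the rescale never squares a large operand.
fn hypot_guarded(x: f32, y: f32) -> f32 {
    let (ax, ay) = (x.abs(), y.abs());
    if ax == f32::INFINITY || ay == f32::INFINITY {
        return f32::INFINITY; // IEEE 754-2008 contract, checked before the rescale
    }
    let mx = ax.max(ay);
    if mx == 0.0 {
        return 0.0; // avoid 0/0 in the rescale
    }
    let (p, q) = (ax / mx, ay / mx); // both in [0, 1]
    mx * (p * p + q * q).sqrt()
}

fn main() {
    assert_eq!(hypot_guarded(f32::INFINITY, f32::NAN), f32::INFINITY);
    assert_eq!(hypot_guarded(3.0, 4.0), 5.0);
    assert!(hypot_guarded(1e20, 1e20).is_finite()); // naive a*a would overflow
}
```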
Two CI failures on PR #56:

- `cargo fmt --check` flagged ~22 files where the added `checked_add` / Kahan / relative-tol expressions exceeded rustfmt's single-line threshold differently than expected.
- The MSRV (1.93) build failed because `tests/phase8_opcode.rs` imports `echidna::opcode::{reverse_partials, OpCode}`, which is only re-exported when the `bytecode` feature is enabled. The test now gates the whole module on `feature = "bytecode"`, matching the other Phase 8 test files.
…hecked_ops
Clippy 1.95 added stricter lints that the CI runner enforces via
`-D warnings`. None of the flagged sites are bugs — they're style
preferences newer Clippy opted into — but the PR gate fails on them.
- `src/bytecode_tape/mod.rs` OpCode::Sub / OpCode::Div optimization
  arms: collapsed the inner `if` guard into the match pattern
  (`OpCode::Sub if … => { … }`).
- `src/taylor_ops.rs` hypot branch: replaced two manual 1..n loops
with `copy_from_slice`.
- `src/gpu/mod.rs` chunk-size cap: replaced the `if nv_k > 0` guard
with `(u32::MAX as u64).checked_div(nv_k)`.
No behaviour changes.
Summary
Cycle 6 of the bug-hunt sweep. Nine phases of fixes across the AD core, optimizer crate, GPU backends, and public API. Every commit is review-fix-gated (multi-agent review with score ≥80 threshold), and every behaviour change has a discriminating regression test.
Phase roll-up:
Stats:
Test plan
cargo build -p echidna --features "bytecode,faer,ndarray,nalgebra,serde,parallel,stde,diffop,taylor,laurent,gpu-wgpu"
cargo build -p echidna-optim --features "parallel,sparse-implicit"
cargo test -p echidna --features "bytecode,faer,ndarray,nalgebra,serde,parallel,stde,diffop,taylor,laurent,gpu-wgpu" (954 passed)
cargo test -p echidna-optim --features "parallel,sparse-implicit" (116 passed)