
Cycle 6 bug-hunt: 19 commits of numerical / GPU / solver hardening#56

Merged
gvonness-apolitical merged 21 commits into main from fix/harden-cycle-6
Apr 18, 2026

Conversation

@gvonness-apolitical
Contributor

Summary

Cycle 6 of the bug-hunt sweep. Nine phases of fixes across the AD core, optimizer crate, GPU backends, and public API. Every commit is review-fix-gated (multi-agent review with score ≥80 threshold), and every behaviour change has a discriminating regression test.

Phase roll-up:

  • Phase 1 — critical UB: BtapeGuard/TapeGuard lifetimes tied to their tape.
  • Phase 2 — high-impact CPU correctness: atan2/asinh/acosh extreme-magnitude partials, powf negative-base safety, Laurent add/sub zero short-circuit, Clarke shift overflow.
  • Phase 3 — high-impact GPU/CPU parity: reverse-partial fixes mirrored into hand-written GPU kernels, Taylor codegen aligned with CPU fixes for POWF/POWI/ABS, wgpu u32 shader indexing + CUDA runtime-verified buffer sizes.
  • Phase 4 — medium numerical robustness: hypot/recip/acosh/powf/Laurent::powi.
  • Phase 5 — structural: silent-wrong shapes turned into loud asserts across the tape API.
  • Phase 6 — solver/optim + tape API: piggyback finiteness, L-BFGS curvature filter + gamma clamp, Newton descent-direction fallback, Kahan summation, trust-region NaN detection, hessian/hessian_vec/sparse_hessian empty-input sibling fixes.
  • Phase 8 — low-severity latent: Taylor/Laurent domain guards with checked pole_order arithmetic, opcode domain NaN guards + symmetric Abs subgradient, trust-region radius-collapse → NumericalError, solver func_tol relativised, ndarray wrappers copy element-wise, set_outputs bounds-checked, hessian_diagonal empty-input sibling.
  • Phase 7 — medium GPU remaining: GPU ATAN large-|a| inv-based form, DIV small-|b| via r*inv factoring, POWI n==1 short-circuit, WGSL EXPM1/LN1P precision helpers in tangent kernels, CUDA upload_tape defensive fallback, wgpu device.poll error propagation, CUDA NVRTC fmad=false for CPU-GPU bit-parity.
  • Retroactive review-fix on Phases 7 + 8: WGSL tangent_reverse POWI n==1 sibling gap, TaylorDyn::rem zero-divisor sibling gap, WGSL HYPOT Inf handling (both the Taylor codegen path and the scalar forward/tangent_forward path).

Stats:

  • 19 commits
  • ~90 files touched
  • ~100 regression tests added (65 on echidna, 20 on echidna-optim, 11 GPU parity + correctness)
  • 1070 total tests passing (954 echidna + 116 echidna-optim)
  • CUDA verified on A100 via vast.ai at each GPU-touching commit
  • wgpu verified on Metal / M4 Max throughout

Test plan

  • cargo build -p echidna --features "bytecode,faer,ndarray,nalgebra,serde,parallel,stde,diffop,taylor,laurent,gpu-wgpu"
  • cargo build -p echidna-optim --features "parallel,sparse-implicit"
  • cargo test -p echidna --features "bytecode,faer,ndarray,nalgebra,serde,parallel,stde,diffop,taylor,laurent,gpu-wgpu" (954 passed)
  • cargo test -p echidna-optim --features "parallel,sparse-implicit" (116 passed)
  • CUDA tests on A100 (phase7_gpu_fixes, phase7_gpu_parity, gpu_cuda_tests, gpu_cuda_kernel_parity all green)
  • Review CHANGELOG for any user-visible API break notes (sparse_jacobian_ndarray return type changed — 0.x acceptable but flag for release notes)

`BtapeGuard` and `TapeGuard` previously stored a raw `*mut Tape<F>` /
`*mut BytecodeTape<F>` without a lifetime parameter, so the borrow
checker accepted programs in which the guard outlived its tape. The
next `with_active_{b,}tape` call would then dereference freed memory
(Miri confirmed the UAF).

Add a `'a` lifetime + `PhantomData<&'a mut ...>` so the guard carries
the tape borrow statically. The `compile_fail` doctests on each guard
regression-check the dangling pattern. Add `#[must_use]` on both
constructors to match the crate convention and warn against
`let _ = Guard::new(&mut t)`, which would deactivate the tape
immediately.
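The pattern reduces to a lifetime-carrying guard over a raw pointer. A minimal standalone sketch with simplified stand-ins (`Tape`, `TapeGuard::new`, and the `active` flag here are illustrative, not the crate's actual types):

```rust
use std::marker::PhantomData;

// Simplified stand-in for the crate's tape type.
struct Tape {
    active: bool,
}

// The `'a` borrow carried via PhantomData ties the guard's lifetime to the
// tape it activates, so the borrow checker rejects a guard outliving its tape.
#[must_use]
struct TapeGuard<'a> {
    tape: *mut Tape,
    _borrow: PhantomData<&'a mut Tape>,
}

impl<'a> TapeGuard<'a> {
    fn new(tape: &'a mut Tape) -> Self {
        tape.active = true;
        TapeGuard { tape: tape as *mut Tape, _borrow: PhantomData }
    }
}

impl Drop for TapeGuard<'_> {
    fn drop(&mut self) {
        // SAFETY: the 'a borrow guarantees the tape is still alive here.
        unsafe { (*self.tape).active = false; }
    }
}

fn main() {
    let mut t = Tape { active: false };
    {
        let _guard = TapeGuard::new(&mut t);
        // tape is active only within this scope
    }
    assert!(!t.active); // guard dropped => tape deactivated
}
```

With the `PhantomData<&'a mut Tape>` field in place, letting the guard escape the tape's scope is a compile error rather than a use-after-free, and `#[must_use]` warns when the guard is dropped immediately.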

Call sites that previously accessed the tape directly after guard
construction (`record`, `record_multi`, `composed_hvp`, `vjp_step`, and
six integration tests) are restructured to scope the guard around the
closure invocation that actually needs the thread-local; subsequent
`tape.reverse(...)` / `tape.set_output(...)` calls then run on the
free borrow as intended.

The existing SAFETY comments in `with_active_btape` /
`with_active_tape` are tightened to reflect that the static lifetime
now enforces what the old comment only asserted by convention.
The closed-form partials for `atan2`, `asinh`, and `acosh` each re-square
a quantity that `hypot` was expressly designed to avoid: `h*h`, `a*a+1`,
`a*a-1`. In f64 these overflow for `|h| > ~1.3e154` and for `|a| >
~1.3e154`, silently collapsing the derivative to 0 (or, for `atan2`,
sometimes to `+Inf` with the wrong sign). The `atan2` case also
underflows for `|h| < ~1.5e-154` — the naive `h*h == 0` guard fired, and
callers received a zero tangent where the true partial was representable
as a large finite value.

Replace:
- `atan2` partials with the algebraically equivalent `(y/h)/h` and
  `(-x/h)/h` factoring. `y/h` and `x/h` are bounded by 1 (since
  `h = hypot(x, y)`), so no intermediate step overflows for any `h > 0`.
  The zero-guard now checks `h == 0` instead of `h*h == 0`, which is the
  only remaining singular point.
- `asinh`/`acosh` partials with a `|x| > 1e8` switch to the form
  `|1/x| / sqrt(1 ± (1/x)²)`, mirroring the existing guard already in
  place on `atan`. For small `x`, the direct formula is kept.
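A minimal sketch of the factored atan2 partials, written as a free function for illustration (the crate applies this inside its dual/reverse types):

```rust
// d/dx atan2(y, x) = -y / (x^2 + y^2) = (-y/h) / h
// d/dy atan2(y, x) =  x / (x^2 + y^2) =  (x/h) / h
// with h = hypot(x, y); y/h and x/h are bounded by 1, so no intermediate
// overflows for any h > 0.
fn atan2_partials(y: f64, x: f64) -> (f64, f64) {
    let h = x.hypot(y);
    if h == 0.0 {
        return (0.0, 0.0); // the only remaining singular point
    }
    ((-y / h) / h, (x / h) / h)
}

fn main() {
    // Naive x*x + y*y overflows to +Inf at 1e200; the factored form stays
    // finite and keeps the correct signs.
    let (dx, dy) = atan2_partials(1e200, 1e200);
    assert!(dx.is_finite() && dy.is_finite());
    assert!(dx < 0.0 && dy > 0.0);
    // Small magnitudes match the direct -y/(x^2+y^2) formula.
    let (dx2, _) = atan2_partials(1.0, 2.0);
    assert!((dx2 - (-1.0 / 5.0)).abs() < 1e-15);
}
```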

Applied at four layers: the `Dual` and `DualVec` forward-mode types, the
`Reverse` adept-style reverse-mode type, and the `OpCode::*` reverse
partials used by `BReverse` / tape-replay. The GPU backends (WGSL + CUDA
+ Taylor codegen) have the same class of bug on f32 (overflow threshold
near `|x| = 1.8e19`) but are left for a follow-up commit that touches
the shared codegen paths.

Tests cover `(1e200, 1e200)`, `(1e-200, 1e-200)`, mixed magnitudes,
origin convention, NaN propagation, acosh(1) boundary, and small-input
regression (old behaviour unchanged) across Dual/DualVec/Reverse/BReverse.
`x^n` for integer `n` is a real-valued polynomial with real derivatives
at any real `x`, including `x < 0`. The `exp(b · ln(a))` implementation
behind `powf` produced a NaN tangent in that regime: `ln(a)` is NaN for
`a < 0`, and IEEE's `NaN * 0 = NaN` prevented the `dy * n.eps` factor
from collapsing to zero even when the exponent had no live tangent at
all. `(-2.0_f64).powf(3.0)` thus returned `Dual { re: -8, eps: NaN }`
instead of `Dual { re: -8, eps: 12 }`.

Fix by detecting a losslessly-representable integer exponent at the
forward-mode types (`Dual`, `DualVec`, `Taylor`, `Laurent`) and the
Adept-style `Reverse`, and dispatching to `powi`, which takes the
squaring path and handles negative bases natively. The integer check
uses `to_i32()` + round-trip `F::from(ni) == n.value` to guarantee no
precision is lost in the conversion.

For the bytecode-tape opcode `OpCode::Powf`, tape-replay semantics
preclude changing the recorded opcode, so the reverse partial gains a
small safety net: when `a <= 0` and the reverse pass needs `d/db`, return
0 instead of `r * ln(a)`. The convention matches the forward-mode
fast path — finite `r` with `a < 0` implies `b` was integer, and the
classical derivative w.r.t. `b` at an integer `b` is not defined; 0 is
the standard choice.

Non-integer exponents of a negative base (e.g. `(-2).powf(2.5)`) remain
NaN — the dispatch only fires for losslessly-representable integers.
Live-tangent exponents continue through the existing `exp(b·ln(a))`
path and are unaffected.
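The dispatch can be sketched on a toy dual number `(re, eps)`; names are illustrative and the exponent is treated as a constant (no live tangent), unlike the crate's full implementation:

```rust
// Toy forward-mode powf with the integer-exponent dispatch.
fn dual_powf(re: f64, eps: f64, n: f64) -> (f64, f64) {
    // Detect a losslessly representable integer exponent via round-trip.
    let ni = n as i32;
    if ni as f64 == n {
        // powi path: real-valued for negative bases; d/dx x^n = n * x^(n-1).
        return (re.powi(ni), eps * f64::from(ni) * re.powi(ni - 1));
    }
    // General exp(n * ln(re)) path: NaN for re < 0, as before.
    (re.powf(n), eps * n * re.powf(n - 1.0))
}

fn main() {
    // (-2)^3 = -8 with tangent 3 * (-2)^2 = 12; no NaN leaks in.
    assert_eq!(dual_powf(-2.0, 1.0, 3.0), (-8.0, 12.0));
    // Non-integer exponent of a negative base remains NaN.
    assert!(dual_powf(-2.0, 1.0, 2.5).0.is_nan());
}
```

The round-trip check also rejects huge exponents: `1e20 as i32` saturates, so `ni as f64 == n` fails and the general path is taken.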
Three independent guards, each addressing an input pattern that the
current code handled by panicking, asserting, or silently walking off
into a non-finite region.

clarke_jacobian (src/bytecode_tape/jacobian.rs)
-----------------------------------------------
The combo enumeration over active kinks computes `1usize << k` and
panics (debug) or wraps to 1 (release) whenever `k >= usize::BITS`. A
caller passing `max_active_kinks = Some(64)` on a 64-bit target with
an unusually-kinky tape would hit this. Cap the effective limit at
`usize::BITS - 1` and return `TooManyKinks { limit: cap }` instead.
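The guard in isolation (illustrative names; the error stands in for the crate's `TooManyKinks` variant):

```rust
// Shifting 1usize by >= usize::BITS panics in debug builds and masks the
// shift amount in release (1 << 64 == 1), so cap and report instead.
fn combo_count(active_kinks: u32) -> Result<usize, u32> {
    let cap = usize::BITS - 1;
    if active_kinks > cap {
        return Err(cap); // stand-in for TooManyKinks { limit: cap }
    }
    Ok(1usize << active_kinks)
}

fn main() {
    assert_eq!(combo_count(3), Ok(8));
    assert!(combo_count(200).is_err()); // exceeds the cap on any target
}
```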

Laurent Add/Sub (src/traits/laurent_std_ops.rs)
-----------------------------------------------
Both impls aligned the operands to `min(p1, p2)` and asserted the
resulting shift fit in `K`. `x + Laurent::zero()` where `|x.pole_order|
>= K` would panic, breaking the Zero trait contract and any generic
`.sum()` built on top of it (e.g. iterator-based totals). Short-circuit
both impls on either operand being all-zero.

backtracking_armijo (echidna-optim/src/line_search.rs)
------------------------------------------------------
The Armijo sufficient-decrease check `f_new <= f_x + c·α·g^T·d` is
trivially true when `f_new = -Inf`, so the solver accepted a step off
the domain boundary and walked toward -Inf. Reject trial points with
non-finite `f_new` or `g_new` — treat them as infeasible and
backtrack past. Common trigger: log-barrier or log-likelihood
objectives returning `-Inf` outside the feasible region.
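A minimal 1-D backtracking sketch with the non-finite rejection added; all names and constants are illustrative, not echidna-optim's API:

```rust
// Backtracking Armijo line search on a scalar problem. g is the gradient at
// x, d the search direction, alpha the initial step.
fn backtracking_armijo(
    f: impl Fn(f64) -> f64,
    x: f64,
    g: f64,
    d: f64,
    mut alpha: f64,
) -> Option<(f64, f64)> {
    let (c, shrink) = (1e-4, 0.5);
    let fx = f(x);
    for _ in 0..50 {
        let x_new = x + alpha * d;
        let f_new = f(x_new);
        // Reject non-finite trials: f_new = -Inf trivially satisfies the
        // sufficient-decrease test and would walk the iterate off the domain.
        if f_new.is_finite() && f_new <= fx + c * alpha * g * d {
            return Some((x_new, f_new));
        }
        alpha *= shrink;
    }
    None
}

fn main() {
    // Objective modelling the failure mode: -Inf outside the region x > 0.
    let f = |x: f64| if x > 0.0 { x } else { f64::NEG_INFINITY };
    // At x = 0.5 with gradient 1 and direction -1, the first trials land at
    // x <= 0 (f = -Inf) and are backtracked past instead of accepted.
    let (x_new, f_new) = backtracking_armijo(f, 0.5, 1.0, -1.0, 4.0).unwrap();
    assert_eq!((x_new, f_new), (0.25, 0.25));
}
```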

Each fix ships with a focused regression test (2 clarke + 5 laurent +
2 line_search), all passing. Full test suite remains green.
The WGSL shaders and CUDA kernel had kept the pre-`atan2`-fix, pre-`asinh`
overflow-guard, pre-`powf`-safety-net formulations, plus the `fract` /
ABS / max-min conventions that diverged from CPU. This commit lands the
mirror fixes so that any expression recorded on the CPU tape evaluates
equivalently on the GPU at every magnitude the CPU can represent.

WGSL (`reverse.wgsl`, `tangent_forward.wgsl`, `tangent_reverse.wgsl`):
- `atan2` partials: replace `(b*at - a*bt) / (a²+b²)` with a form
  normalized by `max(|a|,|b|)` so neither `a*a` nor `b*b` can overflow.
  The original `(y/h)/h` pattern used on the CPU isn't available in
  WGSL (no `hypot`), so the normalized form is the equivalent.
- `asinh` / `acosh`: switch to `|1/a|/sqrt(1 ± (1/a)²)` for `|a| > 1e8`.
- `fract`: `x - trunc(x)` instead of the built-in `fract(x) = x - floor(x)`
  so negative inputs match Rust's `f32::fract`.
- `abs` derivative: detect sign via bit-pattern instead of `a >= 0.0`,
  which returns `true` for both `+0.0` and `-0.0` — matches Rust's
  `f32::signum` semantics.
- `max` / `min`: route adjoint via `bitcast<u32>(r) == bitcast<u32>(a)`
  (reverse.wgsl) or explicit NaN bit-pattern check (tangent paths);
  `b != b` got folded away by naga.
- Taylor codegen `1.0 / 0.0` and `0.0 / 0.0` literals rejected by naga
  as abstract-typed `inf` / `nan`; emit the canonical bit patterns
  (`bitcast<f32>(0x7f800000u)` / `0x7fc00000u`) instead.
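The two `fract` conventions, rendered in Rust for illustration (the actual fix is in WGSL):

```rust
fn fract_trunc(x: f32) -> f32 { x - x.trunc() }  // matches Rust's f32::fract
fn fract_floor(x: f32) -> f32 { x - x.floor() }  // WGSL builtin behaviour

fn main() {
    let x = -1.3_f32;
    assert_eq!(fract_trunc(x), x.fract()); // -0.3: sign follows the input
    assert!(fract_floor(x) > 0.0);         //  0.7: always non-negative
    // The two conventions agree for non-negative inputs.
    assert_eq!(fract_trunc(2.5), fract_floor(2.5));
}
```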

CUDA (`tape_eval.cu`):
- Same `atan2` / `asinh` / `acosh` / `powf` / `fract` / max-min fixes.
  CUDA has `hypot` and `isnan` available, so the atan2 uses
  `(b/h)/h` directly (no rescale needed) and NaN routing is
  straightforward. ABS was already correct via `copysign`.

Tests: `tests/gpu_kernel_parity.rs` adds 7 parity tests covering
`atan2` at 1e20, `asinh`/`acosh` at 1e20, `powf(-2, 3)`, `max(x, NaN)`,
`|-0.0|`, and `fract(-1.3)`. Existing 30-test wgpu suite stays green.
CUDA changes code-reviewed; runtime verification deferred to vast.ai.
Mirror of `tests/gpu_kernel_parity.rs` that exercises the same six
numerical edge cases through the CUDA reverse/tangent-forward/tangent-
reverse paths. Validated on A100 + CUDA 12.8: atan2 at 1e20, asinh/acosh
at 1e20, powf(-2, 3), max(x, NaN), fract(-1.3). Existing CUDA suite (15
tests) and STDE-on-GPU suite (11 tests) stay green.
The runtime-codegenned Taylor shaders (WGSL + CUDA) had three
emissions that diverged from the CPU reference after Phase 2B:

POWF db at a ≤ 0 (H6):
  Emitted `val * log(|a|)` (CUDA) / `val * log(abs(a.v[0]))` (WGSL),
  which is finite for a < 0 and diverges from CPU's `OpCode::Powf`
  safety net (`db = 0` for a ≤ 0). Set db = 0 in both emitters so the
  first-order Taylor coefficient matches `da * a.v[1] + 0 * b.v[1]`,
  the same value the CPU computes.

POWI at a = 0 with |n| ≥ 5 (M18):
  Only ni ∈ {2, 3, 4} were special-cased via chained `jet_mul`. For
  |ni| ≥ 5 the fall-through `a.v[0] <= 0` branch computed
  `log|0| = -∞` and polluted every higher-order coefficient with NaN,
  while `x^5` at `x = 0` has a clean Taylor expansion (first nonzero
  coefficient at index 5). Extend the squaring chain to cover
  |ni| ∈ {2,…,8}, matching CPU `taylor_powi_squaring`.

ABS Taylor sign scan (M20):
  Emitted `sg = select(sign(a.v[0]), 1.0, a.v[0] == 0.0)`, always
  returning +1 at zero primal. CPU `Taylor::abs` scans coefficients
  for the first non-zero and uses its sign, so that |f(t)| for f(t)
  starting `[0, -v, …]` has sign -1. Emit an else-if chain across
  all k coefficients, defaulting to +1 for the identically-zero case.

Generated shader still compiles cleanly on naga (Metal) and NVRTC
(PTX). 30 wgpu tests + 21 CUDA tests stay green on both M4 Max Metal
and A100 CUDA 12.8. Dead-store warnings for `sg`/`db` at k=1 are
pre-existing cosmetic artifacts of the fixed-width jet emission.
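The ABS sign scan in isolation, as a Rust sketch of what the codegen now emits as a WGSL/CUDA else-if chain:

```rust
// The sign of |f(t)| for t -> 0+ is the sign of the first non-zero Taylor
// coefficient, not sign(f(0)), which is ambiguous at a zero primal.
fn taylor_abs_sign(coeffs: &[f64]) -> f64 {
    for &c in coeffs {
        if c != 0.0 {
            return if c < 0.0 { -1.0 } else { 1.0 };
        }
    }
    1.0 // identically-zero series: default to +1
}

fn main() {
    // f(t) = -2t + t^2 is negative for small t > 0, so |f| flips the sign.
    assert_eq!(taylor_abs_sign(&[0.0, -2.0, 1.0]), -1.0);
    assert_eq!(taylor_abs_sign(&[3.0, -5.0]), 1.0);
    assert_eq!(taylor_abs_sign(&[0.0, 0.0]), 1.0);
}
```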
Every `(X * Y) as usize` in the CUDA dispatch macros used u32 × u32
multiplication first, then widened the result. For batch × nv × K in
a K=5 Taylor run at ~100k variables and batch >= 8k, the u32 product
overflows before the widening; `alloc_zeros` then reserves a truncated
buffer and the kernel (which indexes with `unsigned long long`) writes
past the end → device memory corruption that isn't caught on the host.

Rewrite each site as `(X as usize) * (Y as usize)` so the multiplication
happens in 64-bit on 64-bit targets. Fifteen sites in `cuda_backend.rs`;
the wgpu analogues already cast explicitly via `u64` were correct.
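The wrap is easy to reproduce; the sizes below are illustrative (slightly larger than the batch/nv/K figures above so the product actually crosses `u32::MAX`):

```rust
fn main() {
    // batch * nv * K: the product exceeds u32::MAX but fits in usize.
    let (batch, nv, k): (u32, u32, u32) = (16_384, 100_000, 5);

    // Buggy pattern: multiply in u32 first, widen the (already wrapped)
    // result. wrapping_mul stands in for what release builds do silently.
    let truncated = batch.wrapping_mul(nv).wrapping_mul(k) as usize;

    // Fixed pattern: widen each operand so the multiply happens in 64-bit.
    let correct = (batch as usize) * (nv as usize) * (k as usize);

    assert_eq!(correct, 8_192_000_000);
    assert_eq!(truncated, 8_192_000_000 % (1usize << 32)); // truncated size
    assert_ne!(truncated, correct);
}
```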

On the wgpu side, the shaders still use u32 for `bid * num_variables (* K)`
indexing. Add a `(batch * nv [* K]) <= u32::MAX` guard to two dispatch
paths that previously relied on an implicit argument:
- `taylor_forward_kth_batch` (K ∈ {1..5}, the K multiplier previously
  wasn't guarded).
- `sparse_jacobian` (effective batch = `num_colors`, not the caller's
  `batch_size`; the upstream guard didn't apply).

Also rewrite five residual `(batch_size * _) as usize` sites in
`wgpu_backend.rs` to match the CUDA form for defense-in-depth — each
was already covered by an adjacent u64-arithmetic guard plus the
`ni ≤ nv` tape invariant, so no observable behavior change, but a
future relocation of the guard would otherwise re-expose the wrap.

CUDA suite (15 + 6) green on A100 + CUDA 12.8. wgpu suite (30 + 7)
green on M4 Max / Metal.
Eight related fixes across the forward-mode AD types and supporting
ops. Each addresses a distinct class of NaN / Inf / catastrophic-
cancellation regimes that produced wrong derivatives.

Hypot tangent factoring (Dual, DualVec):
  Replaced `(x·dx + y·dy) / h` with `(x/h)·dx + (y/h)·dy`. `x/h` and
  `y/h` are each bounded by 1, so no intermediate step overflows —
  matches the overflow-safe pattern `hypot` uses for the primal.
  Prior form produced Inf tangents for inputs whose true tangent was
  representable.
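The two factorings side by side, as free functions for illustration (dx, dy are the input tangents):

```rust
fn hypot_tangent_naive(x: f64, y: f64, dx: f64, dy: f64) -> f64 {
    // x*dx + y*dy can overflow even when the true tangent is ~1.
    (x * dx + y * dy) / x.hypot(y)
}

fn hypot_tangent_factored(x: f64, y: f64, dx: f64, dy: f64) -> f64 {
    let h = x.hypot(y);
    // x/h and y/h are each bounded by 1, so nothing overflows.
    (x / h) * dx + (y / h) * dy
}

fn main() {
    let (x, y) = (1e308, 1e308);
    // hypot keeps h finite here, but x*dx + y*dy = 2e308 is already Inf.
    assert!(hypot_tangent_naive(x, y, 1.0, 1.0).is_infinite());
    let t = hypot_tangent_factored(x, y, 1.0, 1.0);
    assert!((t - 2.0_f64.sqrt()).abs() < 1e-12); // true tangent = sqrt(2)
}
```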

Laurent::hypot rescaling:
  `(self*self + other*other).sqrt()` squared both operands before
  summing; the squaring overflowed for any leading coefficient above
  sqrt(f64::MAX). Rescale both by the max absolute coefficient before
  squaring, then multiply the result back — keeps intermediate ranges
  bounded and preserves higher-order Taylor coefficients.

Acosh near the domain boundary (Dual, DualVec):
  Replaced the small-|x| `x*x - 1` term with the factored
  `(x-1)(x+1)`. Near x = 1 the naive form suffers catastrophic
  cancellation (x² rounds to 1 and loses the ε contribution); the
  factored form stays distinct down to the ULP.

Recip at the singularity (Dual, DualVec):
  Skip the `0·Inf = NaN` chain when the input eps is 0. At x = 0 with
  eps = 0, the derivative is the "constant zero" convention rather
  than a NaN that would poison downstream. Non-zero eps still produces
  the classical Inf derivative.

Powf partial at Inf (opcode reverse_partials):
  Extended the `a == 0 || r == 0` fallback to `!a.is_finite() ||
  !r.is_finite()`. Prior code computed `b * r / a = 2·Inf / Inf = NaN`
  for `Inf^2`; fallback `b * a.powf(b - 1) = 2·Inf = Inf` is finite
  and matches the true derivative.

Laurent::powi edges:
  - `powi(0) = 1` for any input (including all-zero), matching stdlib
    `f64::powi(0, 0) = 1`. Was previously `nan_laurent()` for the
    all-zero operand.
  - Pole-order overflow returns `nan_laurent()` instead of panicking
    via `checked_mul().expect()`. A `Float`-trait method panicking on
    extreme-but-representable i32 exponents violated the trait
    contract; returning NaN lets generic consumers recover.

taylor_hypot smooth-at-zero:
  When both inputs vanish at t=0 with non-zero higher-order terms
  (e.g., a(t) = 3t, b(t) = 4t), the function `hypot(a,b) = |t|·…`
  is smooth away from t=0, and the Taylor expansion around t=0 from
  the right is well-defined. Detect the common t-factor, recurse on
  the shifted series, and shift the result back. Prior code went
  through `taylor_sqrt` with a zero leading coefficient and produced
  +Inf for every higher-order coefficient.
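The extended powf fallback can be sketched as a standalone function (illustrative shape; the crate's version lives in the opcode reverse partials):

```rust
// Reverse partial of a^b w.r.t. the base a, with the extended fallback.
fn powf_da(a: f64, b: f64) -> f64 {
    let r = a.powf(b);
    if a == 0.0 || r == 0.0 || !a.is_finite() || !r.is_finite() {
        // Fallback: b * a^(b-1). For Inf^2 this is 2*Inf = Inf, whereas the
        // cheap form below would compute 2*Inf/Inf = NaN.
        b * a.powf(b - 1.0)
    } else {
        b * r / a // cheap form reusing the primal r
    }
}

fn main() {
    let da = powf_da(f64::INFINITY, 2.0);
    assert!(da.is_infinite() && da > 0.0);
    // Ordinary inputs are unchanged: d/da a^2 = 2a = 6 at a = 3.
    assert_eq!(powf_da(3.0, 2.0), 6.0);
}
```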

13 new regression tests in `tests/numerical_robustness.rs` cover each
fix. One pre-existing regression test
(`regression_22_laurent_powi_pole_order_overflow_panics`) was flipped
from `should_panic` to a returns-NaN assertion to match the new
trait-conformant behaviour.
Prior behaviour on these input shapes was to silently produce wrong
derivatives or return a jet that looked structurally valid but was
biased toward zero in higher-order terms. Each fix here converts the
problem into an actionable panic (for internal invariants) or
Result::Err (for the trait-dispatched GPU backends), matching the
pattern already in place on the sibling `hessian_vec`,
`sparse_hessian_vec`, and `sparse_jacobian_vec` methods.

Custom-op guards (M14 + M15):
- `jacobian_forward` now asserts `custom_ops.is_empty()`. Previously
  it dispatched through `forward_tangent`, which linearizes custom
  ops around recording-time primals — correct at `x = x_record`,
  O(‖x − x_record‖) biased everywhere else. Users who need a
  Jacobian through custom ops should use the reverse-mode
  `jacobian`, which calls `CustomOp::partials` exactly.
- `taylor_grad` and `ode_taylor_step` assert the same. The bias
  here is worse than first-order: the linearization handler drops
  every second- and higher-order custom-op contribution, so the
  ODE integrator's local truncation estimate is systematically
  wrong through custom ops. `ode_taylor_step` additionally gains a
  `const { assert!(K >= 1) }` compile-time check to reject the
  previously-UB `K = 0` instantiation.

Single-output guards (M16):
- `hessian`, `hvp`, and `hessian_vec` assert `num_outputs() == 1`.
  Multi-output tapes previously returned the Hessian/HVP of
  output_indices[0] alone — the caller got a numerically correct
  result for the wrong function.
- `laplacian_gpu` and `laplacian_with_control_gpu` return
  `Err(GpuError)` on multi-output tapes. Their per-batch `c2s[i]`
  indexing assumes a single scalar output; on multi-output tapes
  the backend lays out coefficients as `[batch, output]` row-major
  and the estimator silently aggregated the wrong stride. Required
  adding a `num_outputs(tape)` method to the `GpuBackend` trait so
  the check is backend-agnostic.

Input-contiguity guard (M17):
- `forward_into` now carries the same `debug_assert!(opcodes[..n]
  are Input)` that `forward` already had. Without it, a tape built
  by mixing `new_input` and `push_op` out of order would silently
  overwrite non-input slots on parallel evaluation paths.

Cross-country extract_jacobian (M11):
- Elevated the `non-input predecessor remains after elimination`
  debug_assert to a hard assert with an actionable message.
  Triggers when one declared output is an ancestor of another
  (cross-country's elimination protocol doesn't handle this
  topology); the release-mode skip would silently produce zero
  rows for the dependent output.

Documentation (M13 + M44):
- Added a "Determinism requirement" section to `checkpoint.rs`
  explaining that `step` must be bit-reproducible on re-execution
  for the adjoint to match the forward trajectory.
- Added a "Custom-ops caveat" section to `stde/mod.rs` warning
  that all estimators take `&tape` and never call `forward(x)`, so
  custom-op tapes are locked to the recording point `x_record`.

Seven new tests in `tests/structural_asserts.rs` verify each panic
fires on the bad input shape, plus a "scalar-output still works"
sanity check. One existing test (`custom_op_jvp`) was updated to
use `tape.jacobian` (reverse mode) instead of `jacobian_forward`
since the latter now rightly rejects custom ops.

Phase 6 addresses a batch of mid-severity findings from the bug hunt:

- LU singularity tolerance now anchors on the matrix infinity norm rather
  than a running max-pivot-seen, so an early small pivot no longer lowers
  the bar for every subsequent column.
- L-BFGS curvature filter uses the dimensionally-consistent Cauchy-Schwarz
  form `sy > eps·sqrt(ss·yy)` and clamps the initial gamma to [1e-3, 1e3]
  so pairs that just pass the filter don't collapse the search direction.
- Newton falls back to steepest descent when the LU solve fails or yields
  a non-descent direction, keeping the solver usable on non-convex
  problems instead of reporting NumericalError at the first saddle.
- Trust-region and piggyback iterations detect non-finite predicted/ρ
  and non-finite tangent/adjoint components, covering ratio-converging
  divergences the primal norm check would miss.
- Kahan/Neumaier compensated summation replaces naive accumulation in
  `norm` and `dot` once the vector crosses 64 elements, so ULP noise
  doesn't leak into the convergence test on long vectors.
- Sparse implicit probe switches to a mixed-sign RHS plus residual check
  so near-singular F_z can't slide through via a null-space-orthogonal
  all-ones test.
- `record_multi` rejects zero-output closures, `hessian`/`hessian_vec`
  and the sparse siblings recover the constant value from a primal pass
  on zero-input tapes, `mixed_partial` asserts that `orders.len()`
  matches the tape inputs, and `extraction_prefactor` falls back to
  log-domain accumulation so an overflowing integer product returns a
  clean `+inf` rather than NaN.

Each change has a regression test and the new assertions mirror the
existing structural-assert style from Phase 5.
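The accumulation pattern behind the `norm`/`dot` change is Neumaier's variant of Kahan summation; a standalone sketch (names illustrative):

```rust
// Neumaier compensated summation: tracks the low-order bits the naive
// running sum discards and folds them back in at the end.
fn neumaier_sum(xs: &[f64]) -> f64 {
    let mut sum = 0.0;
    let mut comp = 0.0; // running compensation
    for &x in xs {
        let t = sum + x;
        if sum.abs() >= x.abs() {
            comp += (sum - t) + x; // low-order bits of x were lost
        } else {
            comp += (x - t) + sum; // low-order bits of sum were lost
        }
        sum = t;
    }
    sum + comp
}

fn main() {
    // Naive left-to-right accumulation loses the 1.0 entirely.
    let xs = [1e100, 1.0, -1e100];
    let naive: f64 = xs.iter().sum();
    assert_eq!(naive, 0.0);
    assert_eq!(neumaier_sum(&xs), 1.0);
}
```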
Five low-severity latent issues in the series arithmetic surface as
silent NaN/Inf or two's-complement wraparound today. Each now routes
through an explicit degenerate value so downstream code sees a uniformly
bad result instead of a mixed-state tape.

- Laurent::recip uses checked_neg on pole_order so i32::MIN inputs fold
  to a NaN Laurent rather than silently wrapping back to i32::MIN.
- Laurent Mul/Div switch to checked_add/checked_sub on the combined
  pole_order; overflow returns NaN via the existing nan_pub path.
- taylor_sqrt on a negative leading coefficient fills the whole series
  with NaN; previously the primal went NaN while higher coefficients
  kept computing from `1/(2·NaN)` and looked finite to callers.
- Taylor::rem with a zero divisor returns an all-NaN Taylor instead of
  a mix of Inf and NaN slots.
- taylor_ln_1p gets a debug_assert on the buffer length so misuse
  surfaces with an actionable message instead of a raw slice panic.

Regression tests live in tests/phase8_taylor_laurent.rs.

Three reverse-mode partials misbehaved at boundary or out-of-domain inputs
in ways that produced numerically plausible but semantically meaningless
gradients.

- Ln / Log2 / Log10 at a < 0 returned a finite partial like `1/-1 = -1`,
  letting an out-of-domain primal's gradient flow downstream as if it were
  real. They now return `(NaN, 0)` when strictly below the domain. The
  boundary a = 0 still evaluates via IEEE `1/0 = +Inf` to preserve the
  one-sided derivative limit callers rely on.
- Ln1p guards `a < -1` the same way; Atanh guards `|a| > 1`.
- Abs at ±0 used `a.signum()`, which returns ±0 depending on the sign bit
  of its argument — so algebraically equivalent tape positions produced
  different subgradients. The kink now resolves to 0 (symmetric midpoint
  of the Clarke subdifferential [-1, 1]) regardless of sign bit. The
  forced_reverse_partials path, which the tape uses to pin a specific
  branch at a kink, keeps its ±1 behaviour via a separate code path.
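The kink convention in isolation (a free function for illustration):

```rust
// Subgradient of |a|: resolve the kink at +/-0 to the symmetric midpoint 0
// of the Clarke subdifferential [-1, 1], independent of the sign bit.
fn abs_partial(a: f64) -> f64 {
    // a == 0.0 is true for -0.0 as well, unlike a.signum(), which returns
    // -1.0 for -0.0 and +1.0 for +0.0.
    if a == 0.0 { 0.0 } else { a.signum() }
}

fn main() {
    assert_eq!(abs_partial(0.0), 0.0);
    assert_eq!(abs_partial(-0.0), 0.0); // signum would have given -1 here
    assert_eq!(abs_partial(-3.0), -1.0);
    assert_eq!(abs_partial(2.0), 1.0);
}
```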

L16 from the Phase 8 plan was dropped: `reverse_partials(Powi, ...)` is a
public entry point exercised by opcode_edge_cases::reverse_powi, so the
inline dispatch in the tape runtime coexists with the standalone function
rather than deprecating it.

Regression tests live in tests/phase8_opcode.rs.
Six convergence/robustness cleanups in the optim crate.

- Trust-region radius collapse: when `rho < 1/4` (or an uphill predicted
  step) would shrink the radius below `config.min_radius`, the solver
  now returns `NumericalError` instead of clamping to `min_radius` and
  spinning to `MaxIterations`. Matches the doc contract on
  `TrustRegionConfig::min_radius`.
- `rho` tiny-predicted threshold scales with `(1 + |f|)`; the previous
  `F::epsilon()` floor tripped the special-case branch immediately on
  large-magnitude objectives.
- `boundary_tau` thresholds scale with problem magnitude: `dd` gated by
  `eps * (||s||² + radius² + 1)`, round-off-negative discriminants
  tolerated up to `eps * b²`, and the Vieta-vs-direct cutover uses
  `eps * radius` instead of a fixed `eps` so tiny-scale subproblems
  don't stall on a spurious `tau = 0`.
- Steihaug CG drops a dead-code pre-loop tolerance check that was
  tautologically `||r|| < sqrt(eps)·||r||`.
- `func_tol` in trust_region, L-BFGS, and Newton compares
  `|f_prev - f_val| < tol * (1 + |f|)` so a user-set 1e-6 means
  "one part per million" regardless of whether |f| is around 1 or 1e10.

Regression tests live in echidna-optim/tests/phase8_solvers.rs.
Seven low-severity / latent items cleared with guards, sibling fixes,
doc alignments, and one breaking-but-correct API signature change.

- set_outputs bounds-checks every index against `self.values.len()`
  with an actionable panic message instead of silently propagating
  out-of-range output slots into downstream accessors. Duplicate
  indices remain valid (a tape may expose the same variable under two
  outputs).
- hessian_diagonal_with_buf gets the n==0 early-return treatment that
  hessian / hessian_vec / sparse_hessian already received — empty-input
  tapes now recover the constant output value via a primal pass.
- ndarray wrappers copy inputs element-wise via a small `to_vec`
  helper, matching the faer/nalgebra adapter pattern. The old
  `.as_slice().unwrap()` panicked on any non-contiguous layout
  (stepped slices, transposed views).
- sparse_jacobian_ndarray now returns `(Array1<F>, pattern, Vec<F>)`
  instead of `(pattern, Vec<F>)`, surfacing the outputs the old
  signature discarded. Breaking change, acceptable in 0.x.
- eval_dyn's `Σx[i]` placeholder becomes `F::zero()` — the placeholder
  was defensive dead code, but `F::zero()` is a safer fallback if the
  JetPlan invariant ever slips.
- Checkpoint docs: Revolve truncation's O(n²) worst-case is now
  called out, and `grad_checkpointed_online` correctly documents the
  pre-loop `stop(x0, 0)` call.
- mixed_partial docstring no longer claims a nonexistent panic on
  all-zero orders; documents the identity-operator behaviour instead.

L13 (active_kinks excluding non-finite switching values) was dropped:
an earlier bug-hunt regression test pins the opposite contract
(conservative inclusion of NaN/Inf kinks), so reversing it would
break documented behaviour. Flagged the tension in the active_kinks
doc comment for future review.

Regression tests live in tests/phase8_misc.rs.
Six numerical / safety fixes to the GPU correctness baseline, applied
symmetrically to the WGSL hand-written shaders (forward, reverse,
tangent_forward, tangent_reverse), the CUDA NVRTC kernel, and the
Taylor codegen that emits both families.

- ATAN: the `1/(1+a²)` derivative overflows in f32 for |a|>1.84e19,
  collapsing to 0 via `1/inf`. Switch to the inv-based form
  `inv²/(1+inv²)` for |a|>1e8 (and its second-order analogue in the
  tangent_reverse kernel). Mirrors the CPU fix applied in Phase 2A.
- DIV: `db = -a*inv²` overflows for small |b| (inv² alone can exceed
  f32::MAX before the multiply), and the tangent_reverse second-order
  uses `inv³`, worse. Factor through the primal `r = a/b` so each
  term drops one `inv` — `db = -r*inv` for first order and the
  `inv³` term becomes `r*inv²` for second order.
- POWI n=1 at a=0: the tangent_reverse second derivative is the
  general `n*(n-1)*pow(a, n-2)*at` which for n=1 evaluates
  `0 * pow(0, -1) * at = 0 * inf = NaN`. Short-circuit n==1 to
  `da_eps=0`, matching the mathematically exact zero second derivative
  of a linear function.
- EXPM1 / LN1P (WGSL tangent kernels only): both primals used the
  cancellation-prone `exp(a)-1` / `log(1+a)` forms, losing ~7 digits
  for |a|<1e-4. The forward-shader helpers `expm1_f32` / `ln1p_f32`
  (Taylor shortcut below threshold) are now duplicated into the two
  tangent shaders. Tangent kernels read the primal when forming
  `rt`, so a drifted primal propagates.
- HYPOT in Taylor codegen: the emitted `sqrt(a*a + b*b)` primal patch
  overflows at large leading-order magnitude. WGSL now computes
  `h * sqrt((a/h)² + (b/h)²)` via max-rescale, CUDA uses the
  `hypot(a, b)` builtin. Higher-order jet coefficients still pass
  through `jet_mul`/`jet_add` and remain overflow-risky; a full jet-
  wide rescale is a follow-up.
- CUDA upload_tape empty output_indices: the CUDA path was crashing
  on `clone_htod(&[])` when a downstream caller cleared
  `GpuTapeData.output_indices` (valid since the struct fields are
  public). Defensive fallback now synthesises
  `vec![data.output_index]` with `num_outputs = 1`, matching wgpu.
  Applied to both `upload_tape` (f32) and `upload_tape_f64`.

Regression tests live in tests/phase7_gpu_fixes.rs. WGSL runs on Metal
(M4 Max) locally, CUDA runs on A100 via vast.ai. Both backends pass
the Phase 7 tests plus the full pre-existing GPU suite.
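The DIV factoring is easy to demonstrate in host-side f32 (illustrative free functions; the real fix is in the shader/kernel sources):

```rust
fn div_db_naive(a: f32, b: f32) -> f32 {
    let inv = 1.0 / b;
    let inv2 = inv * inv; // inv^2 alone can exceed f32::MAX
    -a * inv2
}

fn div_db_factored(a: f32, b: f32) -> f32 {
    let inv = 1.0 / b;
    let r = a / b; // reuse the primal
    -r * inv       // one fewer power of inv per term
}

fn main() {
    // True db = -a/b^2 = -1e35 is representable in f32, but inv^2 = 1e40
    // is not, so the naive form overflows on the way there.
    let (a, b) = (1.0e-5_f32, 1.0e-20_f32);
    assert!(div_db_naive(a, b).is_infinite());
    let db = div_db_factored(a, b);
    assert!(db.is_finite());
    assert!((db as f64 + 1.0e35).abs() / 1.0e35 < 1e-5);
}
```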
Surfaces GPU driver failures and locks down CUDA-CPU numerical parity.

- wgpu device.poll at all five submission-wait sites now returns the
  `Result<PollStatus, PollError>` through `?` into `GpuError::Other`
  instead of `let _ = device.poll(...)`. On device loss (driver reset,
  another submission OOM), callers used to see the subsequent
  `rx.recv()` block forever, looking like a deadlock. The actual
  driver error now propagates up to the caller.
- NVRTC compilation for every kernel family (f32, f64, Taylor codegen)
  now passes `CompileOptions { fmad: Some(false), ..Default::default() }`.
  The CUDA default contracts `a*b + c` into an FMA that rounds once
  instead of twice, producing sub-ULP drift vs the CPU's unfused
  arithmetic. Disabling FMA gives bit-exact CPU-GPU parity on f64
  (exercised by `l25_cuda_fmad_disabled_bit_exact_with_cpu_f64`). The
  perf cost is 1-2% in our kernel mix; callers chasing raw throughput
  can revert the option.

Additionally pins behaviour confirmed correct during Phase 7 exploration:

- `m26_wgpu_abs_at_positive_zero` and `_at_negative_zero`: the WGSL
  shaders already use `bitcast<u32>` to inspect the sign bit, so Abs
  at -0.0 gives -1 and at +0.0 gives +1. No code change; the tests
  lock this in against future refactors.
- `m28_cuda_rem_quotient_boundary_consistent`: the CUDA REM primal
  (`fmod`) and reverse partials (`db = -trunc(a/b)`) are internally
  consistent. No code change; regression test pins the convention.

M23's runtime path (device loss) can't be simulated in a unit test —
the error-propagation code is exercised by the compile-gate alone.

Tests: tests/phase7_gpu_parity.rs. WGSL verified on Metal/M4 Max,
CUDA verified on A100 via vast.ai.

Three real bugs surfaced by a retroactive multi-agent review of the
Phase 7 + Phase 8 commits (d0bbab9..4dbee62):

- WGSL `tangent_reverse.wgsl` POWI was missing the `n == 1` short-
  circuit added to the CUDA kernel in Phase 7. HVP on `x.powi(1)` at
  x=0 returned NaN on the wgpu backend while CUDA returned 0, breaking
  CPU-GPU and CUDA-WGSL parity on tapes built with the public
  `OpCode::Powi` API.
- `TaylorDyn::rem` missed the Phase 8 Taylor::rem zero-divisor guard.
  The dynamic counterpart silently produced a mix of finite and
  NaN/Inf coefficients when `rhs.coeffs[0] == 0`.
- Taylor codegen WGSL HYPOT primal produced NaN for `hypot(±Inf, x)`
  (the rescale formula gave `Inf/Inf * 0 = NaN`). IEEE requires
  `hypot(±Inf, x) = +Inf`, and the CUDA builtin and CPU both honour
  it. Added an explicit `aa == inf || bb == inf` early-return that
  short-circuits to +Inf before the rescale.
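Why the POWI `n == 1` short-circuit matters can be seen in the second-derivative formula `n·(n-1)·x^(n-2)`: at `n = 1`, `x = 0` the unguarded form evaluates `0 · x^(-1) = 0 · Inf = NaN` even though the true value is 0. A CPU-side sketch (the function name is illustrative, not the kernel's):

```rust
// Second partial of x.powi(n), with the guard the WGSL kernel was missing.
fn powi_second_partial(x: f64, n: i32) -> f64 {
    if n == 1 {
        return 0.0; // d²/dx² of x¹ is identically zero
    }
    f64::from(n) * f64::from(n - 1) * x.powi(n - 2)
}
```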

New regression tests:
- `l23_wgpu_powi_n1_at_zero_second_derivative_finite` — WGSL sibling
  of the CUDA L23 test.
- `m29_wgpu_atan_moderate_large_a_positive_normal_f32` — strengthens
  the earlier M29 wgpu test by using `a=3e9` where the f32 result is
  a normal (not denormal) float, so Metal's denormal flush doesn't
  mask a regression to `0.0`.
- `l9_taylor_dyn_rem_zero_divisor_returns_nan` — pins the TaylorDyn
  sibling of the existing Taylor::rem test.
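The normal-vs-denormal distinction behind the strengthened M29 test can be checked on the CPU (the reverse partial for large `|a|` scales like `1/a²`; the values here are illustrative):

```rust
// For a = 3e9, 1/a² ≈ 1.1e-19 is a normal f32, so a denormal-flushing
// GPU cannot silently return 0.0 and still pass. For a = 1e19, 1/a²
// ≈ 1e-38 is subnormal and Metal's flush-to-zero would mask a bug.
fn inv_square_f32(a: f32) -> f32 {
    1.0f32 / (a * a)
}
```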

Out of scope, noted for a follow-up: the scalar `forward.wgsl
hypot_f32` and `tangent_forward.wgsl` HYPOT primal arms still use the
unguarded `sqrt(a*a + b*b)` form. They pre-date Phase 7 and a separate
pass should bring them in line with the Taylor codegen rescale.
Two pre-existing WGSL HYPOT issues that were flagged out-of-scope
during the Phase 7/8 retroactive review, now fixed:

- `forward.wgsl hypot_f32`: the rescaled formula computes
  `mn/mx = Inf/Inf = NaN` when both operands are Inf, then returns
  `Inf * NaN = NaN`. IEEE 754-2008 mandates `hypot(±Inf, x) = +Inf`
  for any x (including NaN). Added an `ax == inf || ay == inf`
  short-circuit before the rescale. `forward.wgsl` is the source of
  the `r` value that reverse.wgsl and tangent_reverse.wgsl read back,
  so fixing this helper lifts the IEEE contract across the whole
  reverse-primal chain.
- `tangent_forward.wgsl` HYPOT arm: primal was `sqrt(a*a + b*b)`
  unguarded, overflowing for `|a|` or `|b|` > sqrt(f32::MAX) ≈ 1.84e19.
  Replaced with a call to `hypot_f32`, and added the helper (plus the
  same IEEE Inf guard) to this shader. Brings tangent_forward's
  HYPOT primal bit-for-bit in line with the forward kernel.
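A CPU-side sketch of the guarded rescale (not the shader itself, but the same arithmetic):

```rust
// Rescaled hypot: avoids overflow of a*a + b*b, with the IEEE 754-2008
// Inf short-circuit that the unguarded rescale violates
// (mn/mx = Inf/Inf = NaN when both operands are infinite).
fn hypot_sketch(a: f32, b: f32) -> f32 {
    let (aa, bb) = (a.abs(), b.abs());
    // IEEE: hypot(±Inf, x) = +Inf for any x, including NaN.
    if aa == f32::INFINITY || bb == f32::INFINITY {
        return f32::INFINITY;
    }
    let (mn, mx) = if aa < bb { (aa, bb) } else { (bb, aa) };
    if mx == 0.0 {
        return 0.0;
    }
    let r = mn / mx;
    mx * (1.0 + r * r).sqrt()
}
```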

Regression tests (all wgpu) in tests/phase7_gpu_fixes.rs:
- `wgpu_hypot_inf_finite_returns_inf`
- `wgpu_hypot_inf_inf_returns_inf` — the triple-backend divergence
  case that motivated the fix (CPU ✓, CUDA ✓, WGSL was NaN).
- `wgpu_tangent_forward_hypot_large_magnitude_primal_finite` —
  pins the rescale so `hypot(1e20, 1e20)` stays at ≈1.41e20 instead
  of overflowing to +Inf.

Two CI failures on PR #56:
- `cargo fmt --check` flagged ~22 files where the added `checked_add`
  / Kahan / relative-tol expressions exceeded rustfmt's single-line
  width threshold and had been wrapped by hand rather than by rustfmt.
- MSRV (1.93) build failed because `tests/phase8_opcode.rs` imports
  `echidna::opcode::{reverse_partials, OpCode}`, which is only re-
  exported when the `bytecode` feature is enabled. The test now
  gates the whole module on `feature = "bytecode"`, matching the
  other Phase 8 test files.

Clippy 1.95 added stricter lints that the CI runner enforces via
`-D warnings`. None of the flagged sites are bugs — they're style
preferences newer Clippy opted into — but the PR gate fails on them.

- `src/bytecode_tape/mod.rs` OpCode::Sub / OpCode::Div optimization
  arms: collapsed the inner `if` guard into the match pattern
  (`OpCode::Sub if … => { … }`).
- `src/taylor_ops.rs` hypot branch: replaced two manual 1..n loops
  with `copy_from_slice`.
- `src/gpu/mod.rs` chunk-size cap: replaced the `if nv_k > 0` guard
  with `(u32::MAX as u64).checked_div(nv_k)`.
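Sketches of the first and third rewrites (enum and function names hypothetical; the real arms carry the actual folding logic):

```rust
// Collapsed match guard: the condition moves from a nested `if` into
// the pattern, which is what newer Clippy's lint asks for.
enum OpCode { Sub, Div }

fn fold(op: &OpCode, same_operand: bool) -> &'static str {
    match op {
        OpCode::Sub if same_operand => "zero", // was: OpCode::Sub => { if same_operand { … } }
        _ => "keep",
    }
}

// `checked_div` replaces the manual zero guard: None on nv_k == 0,
// otherwise the same quotient as before.
fn chunk_cap(nv_k: u64) -> Option<u64> {
    (u32::MAX as u64).checked_div(nv_k)
}
```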

No behaviour changes.
@gvonness-apolitical gvonness-apolitical merged commit e22c080 into main Apr 18, 2026
7 checks passed
@gvonness-apolitical gvonness-apolitical deleted the fix/harden-cycle-6 branch April 18, 2026 20:13