
Cycle 6 bug-hunt: 19 commits of numerical / GPU / solver hardening#56

Merged
gvonness-apolitical merged 21 commits into main from fix/harden-cycle-6
Apr 18, 2026

Conversation

@gvonness-apolitical
Contributor

Summary

Cycle 6 of the bug-hunt sweep. Nine phases of fixes across the AD core, optimizer crate, GPU backends, and public API. Every commit is review-fix-gated (multi-agent review with score ≥80 threshold), and every behaviour change has a discriminating regression test.

Phase roll-up:

  • Phase 1 — critical UB: BtapeGuard/TapeGuard lifetimes tied to their tape.
  • Phase 2 — high-impact CPU correctness: atan2/asinh/acosh extreme-magnitude partials, powf negative-base safety, Laurent add/sub zero short-circuit, Clarke shift overflow.
  • Phase 3 — high-impact GPU/CPU parity: reverse-partial fixes mirrored into hand-written GPU kernels, Taylor codegen aligned with CPU fixes for POWF/POWI/ABS, wgpu u32 shader indexing + CUDA runtime-verified buffer sizes.
  • Phase 4 — medium numerical robustness: hypot/recip/acosh/powf/Laurent::powi.
  • Phase 5 — structural: silent-wrong shapes turned into loud asserts across the tape API.
  • Phase 6 — solver/optim + tape API: piggyback finiteness, L-BFGS curvature filter + gamma clamp, Newton descent-direction fallback, Kahan summation, trust-region NaN detection, hessian/hessian_vec/sparse_hessian empty-input sibling fixes.
  • Phase 8 — low-severity latent: Taylor/Laurent domain guards with checked pole_order arithmetic, opcode domain NaN guards + symmetric Abs subgradient, trust-region radius-collapse → NumericalError, solver func_tol relativised, ndarray wrappers copy element-wise, set_outputs bounds-checked, hessian_diagonal empty-input sibling.
  • Phase 7 — medium GPU remaining: GPU ATAN large-|a| inv-based form, DIV small-|b| via r*inv factoring, POWI n==1 short-circuit, WGSL EXPM1/LN1P precision helpers in tangent kernels, CUDA upload_tape defensive fallback, wgpu device.poll error propagation, CUDA NVRTC fmad=false for CPU-GPU bit-parity.
  • Retroactive review-fix on Phases 7 + 8: WGSL tangent_reverse POWI n==1 sibling gap, TaylorDyn::rem zero-divisor sibling gap, WGSL HYPOT Inf handling (both the Taylor codegen path and the scalar forward/tangent_forward path).

Stats:

  • 19 commits
  • ~90 files touched
  • ~100 regression tests added (65 on echidna, 20 on echidna-optim, 11 GPU parity + correctness)
  • 1070 total tests passing (954 echidna + 116 echidna-optim)
  • CUDA verified on A100 via vast.ai at each GPU-touching commit
  • wgpu verified on Metal / M4 Max throughout

Test plan

  • cargo build -p echidna --features "bytecode,faer,ndarray,nalgebra,serde,parallel,stde,diffop,taylor,laurent,gpu-wgpu"
  • cargo build -p echidna-optim --features "parallel,sparse-implicit"
  • cargo test -p echidna --features "bytecode,faer,ndarray,nalgebra,serde,parallel,stde,diffop,taylor,laurent,gpu-wgpu" (954 passed)
  • cargo test -p echidna-optim --features "parallel,sparse-implicit" (116 passed)
  • CUDA tests on A100 (phase7_gpu_fixes, phase7_gpu_parity, gpu_cuda_tests, gpu_cuda_kernel_parity all green)
  • Review CHANGELOG for any user-visible API break notes (sparse_jacobian_ndarray return type changed — 0.x acceptable but flag for release notes)

`BtapeGuard` and `TapeGuard` previously stored a raw `*mut Tape<F>` /
`*mut BytecodeTape<F>` without a lifetime parameter, so the borrow
checker accepted programs in which the guard outlived its tape. The
next `with_active_{b,}tape` call would then dereference freed memory
(Miri confirmed the UAF).

Add a `'a` lifetime + `PhantomData<&'a mut ...>` so the guard carries
the tape borrow statically. The `compile_fail` doctests on each guard
regression-check the dangling pattern. Add `#[must_use]` on both
constructors to match the crate convention and warn against
`let _ = Guard::new(&mut t)`, which would deactivate the tape
immediately.
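The pattern reduces to a lifetime-carrying guard over a raw pointer. A minimal standalone sketch with simplified stand-ins (`Tape`, `TapeGuard::new`, and the `active` flag here are illustrative, not the crate's actual types):

```rust
use std::marker::PhantomData;

// Simplified stand-in for the crate's tape type.
struct Tape {
    active: bool,
}

// The `'a` borrow carried via PhantomData ties the guard's lifetime to the
// tape it activates, so the borrow checker rejects a guard outliving its tape.
#[must_use]
struct TapeGuard<'a> {
    tape: *mut Tape,
    _borrow: PhantomData<&'a mut Tape>,
}

impl<'a> TapeGuard<'a> {
    fn new(tape: &'a mut Tape) -> Self {
        tape.active = true;
        TapeGuard { tape: tape as *mut Tape, _borrow: PhantomData }
    }
}

impl Drop for TapeGuard<'_> {
    fn drop(&mut self) {
        // SAFETY: the 'a borrow guarantees the tape is still alive here.
        unsafe { (*self.tape).active = false; }
    }
}

fn main() {
    let mut t = Tape { active: false };
    {
        let _guard = TapeGuard::new(&mut t);
        // tape is active only within this scope
    }
    assert!(!t.active); // guard dropped => tape deactivated
}
```

With the `PhantomData<&'a mut Tape>` field in place, letting the guard escape the tape's scope is a compile error rather than a use-after-free, and `#[must_use]` warns when the guard is dropped immediately.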

Call sites that previously accessed the tape directly after guard
construction (`record`, `record_multi`, `composed_hvp`, `vjp_step`, and
six integration tests) are restructured to scope the guard around the
closure invocation that actually needs the thread-local; subsequent
`tape.reverse(...)` / `tape.set_output(...)` calls then run on the
free borrow as intended.

The existing SAFETY comments in `with_active_btape` /
`with_active_tape` are tightened to reflect that the static lifetime
now enforces what the old comment only asserted by convention.
The closed-form partials for `atan2`, `asinh`, and `acosh` each re-square
a quantity that `hypot` was expressly designed to avoid: `h*h`, `a*a+1`,
`a*a-1`. In f64 these overflow for `|h| > ~1.3e154` and for `|a| >
~1.3e154`, silently collapsing the derivative to 0 (or, for `atan2`,
sometimes to `+Inf` with the wrong sign). The `atan2` case also
underflows for `|h| < ~1.5e-154` — the naive `h*h == 0` guard fired, and
callers received a zero tangent where the true partial was representable
as a large finite value.

Replace:
- `atan2` partials with the algebraically equivalent `(y/h)/h` and
  `(-x/h)/h` factoring. `y/h` and `x/h` are bounded by 1 (since
  `h = hypot(x, y)`), so no intermediate step overflows for any `h > 0`.
  The zero-guard now checks `h == 0` instead of `h*h == 0`, which is the
  only remaining singular point.
- `asinh`/`acosh` partials with a `|x| > 1e8` switch to the form
  `|1/x| / sqrt(1 ± (1/x)²)`, mirroring the existing guard already in
  place on `atan`. For small `x`, the direct formula is kept.
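A minimal sketch of the factored atan2 partials, written as a free function for illustration (the crate applies this inside its dual/reverse types):

```rust
// d/dx atan2(y, x) = -y / (x^2 + y^2) = (-y/h) / h
// d/dy atan2(y, x) =  x / (x^2 + y^2) =  (x/h) / h
// with h = hypot(x, y); y/h and x/h are bounded by 1, so no intermediate
// overflows for any h > 0.
fn atan2_partials(y: f64, x: f64) -> (f64, f64) {
    let h = x.hypot(y);
    if h == 0.0 {
        return (0.0, 0.0); // the only remaining singular point
    }
    ((-y / h) / h, (x / h) / h)
}

fn main() {
    // Naive x*x + y*y overflows to +Inf at 1e200; the factored form stays
    // finite and keeps the correct signs.
    let (dx, dy) = atan2_partials(1e200, 1e200);
    assert!(dx.is_finite() && dy.is_finite());
    assert!(dx < 0.0 && dy > 0.0);
    // Small magnitudes match the direct -y/(x^2+y^2) formula.
    let (dx2, _) = atan2_partials(1.0, 2.0);
    assert!((dx2 - (-1.0 / 5.0)).abs() < 1e-15);
}
```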

Applied at four layers: the `Dual` and `DualVec` forward-mode types, the
`Reverse` adept-style reverse-mode type, and the `OpCode::*` reverse
partials used by `BReverse` / tape-replay. The GPU backends (WGSL + CUDA
+ Taylor codegen) have the same class of bug on f32 (overflow threshold
near `|x| = 1.8e19`) but are left for a follow-up commit that touches
the shared codegen paths.

Tests cover `(1e200, 1e200)`, `(1e-200, 1e-200)`, mixed magnitudes,
origin convention, NaN propagation, acosh(1) boundary, and small-input
regression (old behaviour unchanged) across Dual/DualVec/Reverse/BReverse.
`x^n` for integer `n` is a real-valued polynomial with real derivatives
at any real `x`, including `x < 0`. The `exp(b · ln(a))` implementation
behind `powf` produced a NaN tangent in that regime: `ln(a)` is NaN for
`a < 0`, and IEEE's `NaN * 0 = NaN` prevented the `dy * n.eps` factor
from collapsing to zero even when the exponent had no live tangent at
all. `(-2.0_f64).powf(3.0)` thus returned `Dual { re: -8, eps: NaN }`
instead of `Dual { re: -8, eps: 12 }`.

Fix by detecting a losslessly-representable integer exponent at the
forward-mode types (`Dual`, `DualVec`, `Taylor`, `Laurent`) and the
Adept-style `Reverse`, and dispatching to `powi`, which takes the
squaring path and handles negative bases natively. The integer check
uses `to_i32()` + round-trip `F::from(ni) == n.value` to guarantee no
precision is lost in the conversion.

For the bytecode-tape opcode `OpCode::Powf`, tape-replay semantics
preclude changing the recorded opcode, so the reverse partial gains a
small safety net: when `a <= 0` and the reverse pass needs `d/db`, return
0 instead of `r * ln(a)`. The convention matches the forward-mode
fast path — finite `r` with `a < 0` implies `b` was integer, and the
classical derivative w.r.t. `b` at an integer `b` is not defined; 0 is
the standard choice.

Non-integer exponents of a negative base (e.g. `(-2).powf(2.5)`) remain
NaN — the dispatch only fires for losslessly-representable integers.
Live-tangent exponents continue through the existing `exp(b·ln(a))`
path and are unaffected.
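The dispatch can be sketched on a toy dual number `(re, eps)`; names are illustrative and the exponent is treated as a constant (no live tangent), unlike the crate's full implementation:

```rust
// Toy forward-mode powf with the integer-exponent dispatch.
fn dual_powf(re: f64, eps: f64, n: f64) -> (f64, f64) {
    // Detect a losslessly representable integer exponent via round-trip.
    let ni = n as i32;
    if ni as f64 == n {
        // powi path: real-valued for negative bases; d/dx x^n = n * x^(n-1).
        return (re.powi(ni), eps * f64::from(ni) * re.powi(ni - 1));
    }
    // General exp(n * ln(re)) path: NaN for re < 0, as before.
    (re.powf(n), eps * n * re.powf(n - 1.0))
}

fn main() {
    // (-2)^3 = -8 with tangent 3 * (-2)^2 = 12; no NaN leaks in.
    assert_eq!(dual_powf(-2.0, 1.0, 3.0), (-8.0, 12.0));
    // Non-integer exponent of a negative base remains NaN.
    assert!(dual_powf(-2.0, 1.0, 2.5).0.is_nan());
}
```

The round-trip check also rejects huge exponents: `1e20 as i32` saturates, so `ni as f64 == n` fails and the general path is taken.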
Three independent guards, each addressing an input pattern that the
current code handled by panicking, asserting, or silently walking off
into a non-finite region.

clarke_jacobian (src/bytecode_tape/jacobian.rs)
-----------------------------------------------
The combo enumeration over active kinks computes `1usize << k` and
panics (debug) or wraps to 1 (release) whenever `k >= usize::BITS`. A
caller passing `max_active_kinks = Some(64)` on a 64-bit target with
an unusually-kinky tape would hit this. Cap the effective limit at
`usize::BITS - 1` and return `TooManyKinks { limit: cap }` instead.
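The guard in isolation (illustrative names; the error stands in for the crate's `TooManyKinks` variant):

```rust
// Shifting 1usize by >= usize::BITS panics in debug builds and masks the
// shift amount in release (1 << 64 == 1), so cap and report instead.
fn combo_count(active_kinks: u32) -> Result<usize, u32> {
    let cap = usize::BITS - 1;
    if active_kinks > cap {
        return Err(cap); // stand-in for TooManyKinks { limit: cap }
    }
    Ok(1usize << active_kinks)
}

fn main() {
    assert_eq!(combo_count(3), Ok(8));
    assert!(combo_count(200).is_err()); // exceeds the cap on any target
}
```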

Laurent Add/Sub (src/traits/laurent_std_ops.rs)
-----------------------------------------------
Both impls aligned the operands to `min(p1, p2)` and asserted the
resulting shift fit in `K`. `x + Laurent::zero()` where `|x.pole_order|
>= K` would panic, breaking the Zero trait contract and any generic
`.sum()` built on top of it (e.g. iterator-based totals). Short-circuit
both impls on either operand being all-zero.

backtracking_armijo (echidna-optim/src/line_search.rs)
------------------------------------------------------
The Armijo sufficient-decrease check `f_new <= f_x + c·α·g^T·d` is
trivially true when `f_new = -Inf`, so the solver accepted a step off
the domain boundary and walked toward -Inf. Reject trial points with
non-finite `f_new` or `g_new` — treat them as infeasible and
backtrack past. Common trigger: log-barrier or log-likelihood
objectives returning `-Inf` outside the feasible region.
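A minimal 1-D backtracking sketch with the non-finite rejection added; all names and constants are illustrative, not echidna-optim's API:

```rust
// Backtracking Armijo line search on a scalar problem. g is the gradient at
// x, d the search direction, alpha the initial step.
fn backtracking_armijo(
    f: impl Fn(f64) -> f64,
    x: f64,
    g: f64,
    d: f64,
    mut alpha: f64,
) -> Option<(f64, f64)> {
    let (c, shrink) = (1e-4, 0.5);
    let fx = f(x);
    for _ in 0..50 {
        let x_new = x + alpha * d;
        let f_new = f(x_new);
        // Reject non-finite trials: f_new = -Inf trivially satisfies the
        // sufficient-decrease test and would walk the iterate off the domain.
        if f_new.is_finite() && f_new <= fx + c * alpha * g * d {
            return Some((x_new, f_new));
        }
        alpha *= shrink;
    }
    None
}

fn main() {
    // Objective modelling the failure mode: -Inf outside the region x > 0.
    let f = |x: f64| if x > 0.0 { x } else { f64::NEG_INFINITY };
    // At x = 0.5 with gradient 1 and direction -1, the first trials land at
    // x <= 0 (f = -Inf) and are backtracked past instead of accepted.
    let (x_new, f_new) = backtracking_armijo(f, 0.5, 1.0, -1.0, 4.0).unwrap();
    assert_eq!((x_new, f_new), (0.25, 0.25));
}
```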

Each fix ships with a focused regression test (2 clarke + 5 laurent +
2 line_search), all passing. Full test suite remains green.
The WGSL shaders and CUDA kernel had kept the pre-`atan2`-fix, pre-`asinh`
overflow-guard, pre-`powf`-safety-net formulations, plus the `fract` /
ABS / max-min conventions that diverged from CPU. This commit lands the
mirror fixes so that any expression recorded on the CPU tape evaluates
equivalently on the GPU at every magnitude the CPU can represent.

WGSL (`reverse.wgsl`, `tangent_forward.wgsl`, `tangent_reverse.wgsl`):
- `atan2` partials: replace `(b*at - a*bt) / (a²+b²)` with a form
  normalized by `max(|a|,|b|)` so neither `a*a` nor `b*b` can overflow.
  The original `(y/h)/h` pattern used on the CPU isn't available in
  WGSL (no `hypot`), so the normalized form is the equivalent.
- `asinh` / `acosh`: switch to `|1/a|/sqrt(1 ± (1/a)²)` for `|a| > 1e8`.
- `fract`: `x - trunc(x)` instead of the built-in `fract(x) = x - floor(x)`
  so negative inputs match Rust's `f32::fract`.
- `abs` derivative: detect sign via bit-pattern instead of `a >= 0.0`,
  which returns `true` for both `+0.0` and `-0.0` — matches Rust's
  `f32::signum` semantics.
- `max` / `min`: route adjoint via `bitcast<u32>(r) == bitcast<u32>(a)`
  (reverse.wgsl) or explicit NaN bit-pattern check (tangent paths);
  `b != b` got folded away by naga.
- Taylor codegen `1.0 / 0.0` and `0.0 / 0.0` literals rejected by naga
  as abstract-typed `inf` / `nan`; emit the canonical bit patterns
  (`bitcast<f32>(0x7f800000u)` / `0x7fc00000u`) instead.
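The two `fract` conventions, rendered in Rust for illustration (the actual fix is in WGSL):

```rust
fn fract_trunc(x: f32) -> f32 { x - x.trunc() }  // matches Rust's f32::fract
fn fract_floor(x: f32) -> f32 { x - x.floor() }  // WGSL builtin behaviour

fn main() {
    let x = -1.3_f32;
    assert_eq!(fract_trunc(x), x.fract()); // -0.3: sign follows the input
    assert!(fract_floor(x) > 0.0);         //  0.7: always non-negative
    // The two conventions agree for non-negative inputs.
    assert_eq!(fract_trunc(2.5), fract_floor(2.5));
}
```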

CUDA (`tape_eval.cu`):
- Same `atan2` / `asinh` / `acosh` / `powf` / `fract` / max-min fixes.
  CUDA has `hypot` and `isnan` available, so the atan2 uses
  `(b/h)/h` directly (no rescale needed) and NaN routing is
  straightforward. ABS was already correct via `copysign`.

Tests: `tests/gpu_kernel_parity.rs` adds 7 parity tests covering
`atan2` at 1e20, `asinh`/`acosh` at 1e20, `powf(-2, 3)`, `max(x, NaN)`,
`|-0.0|`, and `fract(-1.3)`. Existing 30-test wgpu suite stays green.
CUDA changes code-reviewed; runtime verification deferred to vast.ai.
Mirror of `tests/gpu_kernel_parity.rs` that exercises the same six
numerical edge cases through the CUDA reverse/tangent-forward/tangent-
reverse paths. Validated on A100 + CUDA 12.8: atan2 at 1e20, asinh/acosh
at 1e20, powf(-2, 3), max(x, NaN), fract(-1.3). Existing CUDA suite (15
tests) and STDE-on-GPU suite (11 tests) stay green.
The runtime-codegenned Taylor shaders (WGSL + CUDA) had three
emissions that diverged from the CPU reference after Phase 2B:

POWF db at a ≤ 0 (H6):
  Emitted `val * log(|a|)` (CUDA) / `val * log(abs(a.v[0]))` (WGSL),
  which is finite for a < 0 and diverges from CPU's `OpCode::Powf`
  safety net (`db = 0` for a ≤ 0). Set db = 0 in both emitters so the
  first-order Taylor coefficient matches `da * a.v[1] + 0 * b.v[1]`,
  the same value the CPU computes.

POWI at a = 0 with |n| ≥ 5 (M18):
  Only ni ∈ {2, 3, 4} were special-cased via chained `jet_mul`. For
  |ni| ≥ 5 the fall-through `a.v[0] <= 0` branch computed
  `log|0| = -∞` and polluted every higher-order coefficient with NaN,
  while `x^5` at `x = 0` has a clean Taylor expansion (first nonzero
  coefficient at index 5). Extend the squaring chain to cover
  |ni| ∈ {2,…,8}, matching CPU `taylor_powi_squaring`.

ABS Taylor sign scan (M20):
  Emitted `sg = select(sign(a.v[0]), 1.0, a.v[0] == 0.0)`, always
  returning +1 at zero primal. CPU `Taylor::abs` scans coefficients
  for the first non-zero and uses its sign, so that |f(t)| for f(t)
  starting `[0, -v, …]` has sign -1. Emit an else-if chain across
  all k coefficients, defaulting to +1 for the identically-zero case.

Generated shader still compiles cleanly on naga (Metal) and NVRTC
(PTX). 30 wgpu tests + 21 CUDA tests stay green on both M4 Max Metal
and A100 CUDA 12.8. Dead-store warnings for `sg`/`db` at k=1 are
pre-existing cosmetic artifacts of the fixed-width jet emission.
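The ABS sign scan in isolation, as a Rust sketch of what the codegen now emits as a WGSL/CUDA else-if chain:

```rust
// The sign of |f(t)| for t -> 0+ is the sign of the first non-zero Taylor
// coefficient, not sign(f(0)), which is ambiguous at a zero primal.
fn taylor_abs_sign(coeffs: &[f64]) -> f64 {
    for &c in coeffs {
        if c != 0.0 {
            return if c < 0.0 { -1.0 } else { 1.0 };
        }
    }
    1.0 // identically-zero series: default to +1
}

fn main() {
    // f(t) = -2t + t^2 is negative for small t > 0, so |f| flips the sign.
    assert_eq!(taylor_abs_sign(&[0.0, -2.0, 1.0]), -1.0);
    assert_eq!(taylor_abs_sign(&[3.0, -5.0]), 1.0);
    assert_eq!(taylor_abs_sign(&[0.0, 0.0]), 1.0);
}
```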
Every `(X * Y) as usize` in the CUDA dispatch macros used u32 × u32
multiplication first, then widened the result. For batch × nv × K in
a K=5 Taylor run at ~100k variables and batch >= 8k, the u32 product
overflows before the widening; `alloc_zeros` then reserves a truncated
buffer and the kernel (which indexes with `unsigned long long`) writes
past the end → device memory corruption that isn't caught on the host.

Rewrite each site as `(X as usize) * (Y as usize)` so the multiplication
happens in 64-bit on 64-bit targets. Fifteen sites in `cuda_backend.rs`;
the wgpu analogues already cast explicitly via `u64` were correct.
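The wrap is easy to reproduce; the sizes below are illustrative (slightly larger than the batch/nv/K figures above so the product actually crosses `u32::MAX`):

```rust
fn main() {
    // batch * nv * K: the product exceeds u32::MAX but fits in usize.
    let (batch, nv, k): (u32, u32, u32) = (16_384, 100_000, 5);

    // Buggy pattern: multiply in u32 first, widen the (already wrapped)
    // result. wrapping_mul stands in for what release builds do silently.
    let truncated = batch.wrapping_mul(nv).wrapping_mul(k) as usize;

    // Fixed pattern: widen each operand so the multiply happens in 64-bit.
    let correct = (batch as usize) * (nv as usize) * (k as usize);

    assert_eq!(correct, 8_192_000_000);
    assert_eq!(truncated, 8_192_000_000 % (1usize << 32)); // truncated size
    assert_ne!(truncated, correct);
}
```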

On the wgpu side, the shaders still use u32 for `bid * num_variables (* K)`
indexing. Add a `(batch * nv [* K]) <= u32::MAX` guard to two dispatch
paths that previously relied on an implicit argument:
- `taylor_forward_kth_batch` (K ∈ {1..5}, the K multiplier previously
  wasn't guarded).
- `sparse_jacobian` (effective batch = `num_colors`, not the caller's
  `batch_size`; the upstream guard didn't apply).

Also rewrite five residual `(batch_size * _) as usize` sites in
`wgpu_backend.rs` to match the CUDA form for defense-in-depth — each
was already covered by an adjacent u64-arithmetic guard plus the
`ni ≤ nv` tape invariant, so no observable behavior change, but a
future relocation of the guard would otherwise re-expose the wrap.

CUDA suite (15 + 6) green on A100 + CUDA 12.8. wgpu suite (30 + 7)
green on M4 Max / Metal.
Eight related fixes across the forward-mode AD types and supporting
ops. Each addresses a distinct class of NaN / Inf / catastrophic-
cancellation regimes that produced wrong derivatives.

Hypot tangent factoring (Dual, DualVec):
  Replaced `(x·dx + y·dy) / h` with `(x/h)·dx + (y/h)·dy`. `x/h` and
  `y/h` are each bounded by 1, so no intermediate step overflows —
  matches the overflow-safe pattern `hypot` uses for the primal.
  Prior form produced Inf tangents for inputs whose true tangent was
  representable.
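The two factorings side by side, as free functions for illustration (dx, dy are the input tangents):

```rust
fn hypot_tangent_naive(x: f64, y: f64, dx: f64, dy: f64) -> f64 {
    // x*dx + y*dy can overflow even when the true tangent is ~1.
    (x * dx + y * dy) / x.hypot(y)
}

fn hypot_tangent_factored(x: f64, y: f64, dx: f64, dy: f64) -> f64 {
    let h = x.hypot(y);
    // x/h and y/h are each bounded by 1, so nothing overflows.
    (x / h) * dx + (y / h) * dy
}

fn main() {
    let (x, y) = (1e308, 1e308);
    // hypot keeps h finite here, but x*dx + y*dy = 2e308 is already Inf.
    assert!(hypot_tangent_naive(x, y, 1.0, 1.0).is_infinite());
    let t = hypot_tangent_factored(x, y, 1.0, 1.0);
    assert!((t - 2.0_f64.sqrt()).abs() < 1e-12); // true tangent = sqrt(2)
}
```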

Laurent::hypot rescaling:
  `(self*self + other*other).sqrt()` squared both operands before
  summing; the squaring overflowed for any leading coefficient above
  sqrt(f64::MAX). Rescale both by the max absolute coefficient before
  squaring, then multiply the result back — keeps intermediate ranges
  bounded and preserves higher-order Taylor coefficients.

Acosh near the domain boundary (Dual, DualVec):
  Replaced the small-|x| `x*x - 1` term with the factored
  `(x-1)(x+1)`. Near x = 1 the naive form suffers catastrophic
  cancellation (x² rounds to 1 and loses the ε contribution); the
  factored form stays distinct down to the ULP.

Recip at the singularity (Dual, DualVec):
  Skip the `0·Inf = NaN` chain when the input eps is 0. At x = 0 with
  eps = 0, the derivative is the "constant zero" convention rather
  than a NaN that would poison downstream. Non-zero eps still produces
  the classical Inf derivative.

Powf partial at Inf (opcode reverse_partials):
  Extended the `a == 0 || r == 0` fallback to `!a.is_finite() ||
  !r.is_finite()`. Prior code computed `b * r / a = 2·Inf / Inf = NaN`
  for `Inf^2`; fallback `b * a.powf(b - 1) = 2·Inf = Inf` is finite
  and matches the true derivative.

Laurent::powi edges:
  - `powi(0) = 1` for any input (including all-zero), matching stdlib
    `f64::powi(0, 0) = 1`. Was previously `nan_laurent()` for the
    all-zero operand.
  - Pole-order overflow returns `nan_laurent()` instead of panicking
    via `checked_mul().expect()`. A `Float`-trait method panicking on
    extreme-but-representable i32 exponents violated the trait
    contract; returning NaN lets generic consumers recover.

taylor_hypot smooth-at-zero:
  When both inputs vanish at t=0 with non-zero higher-order terms
  (e.g., a(t) = 3t, b(t) = 4t), the function `hypot(a,b) = |t|·…`
  is smooth away from t=0, and the Taylor expansion around t=0 from
  the right is well-defined. Detect the common t-factor, recurse on
  the shifted series, and shift the result back. Prior code went
  through `taylor_sqrt` with a zero leading coefficient and produced
  +Inf for every higher-order coefficient.
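The extended powf fallback can be sketched as a standalone function (illustrative shape; the crate's version lives in the opcode reverse partials):

```rust
// Reverse partial of a^b w.r.t. the base a, with the extended fallback.
fn powf_da(a: f64, b: f64) -> f64 {
    let r = a.powf(b);
    if a == 0.0 || r == 0.0 || !a.is_finite() || !r.is_finite() {
        // Fallback: b * a^(b-1). For Inf^2 this is 2*Inf = Inf, whereas the
        // cheap form below would compute 2*Inf/Inf = NaN.
        b * a.powf(b - 1.0)
    } else {
        b * r / a // cheap form reusing the primal r
    }
}

fn main() {
    let da = powf_da(f64::INFINITY, 2.0);
    assert!(da.is_infinite() && da > 0.0);
    // Ordinary inputs are unchanged: d/da a^2 = 2a = 6 at a = 3.
    assert_eq!(powf_da(3.0, 2.0), 6.0);
}
```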

13 new regression tests in `tests/numerical_robustness.rs` cover each
fix. One pre-existing regression test
(`regression_22_laurent_powi_pole_order_overflow_panics`) was flipped
from `should_panic` to a returns-NaN assertion to match the new
trait-conformant behaviour.
Prior behaviour on these input shapes was to silently produce wrong
derivatives or return a jet that looked structurally valid but was
biased toward zero in higher-order terms. Each fix here converts the
problem into an actionable panic (for internal invariants) or
Result::Err (for the trait-dispatched GPU backends), matching the
pattern already in place on the sibling `hessian_vec`,
`sparse_hessian_vec`, and `sparse_jacobian_vec` methods.

Custom-op guards (M14 + M15):
- `jacobian_forward` now asserts `custom_ops.is_empty()`. Previously
  it dispatched through `forward_tangent`, which linearizes custom
  ops around recording-time primals — correct at `x = x_record`,
  O(‖x − x_record‖) biased everywhere else. Users who need a
  Jacobian through custom ops should use the reverse-mode
  `jacobian`, which calls `CustomOp::partials` exactly.
- `taylor_grad` and `ode_taylor_step` assert the same. The bias
  here is worse than first-order: the linearization handler drops
  every second- and higher-order custom-op contribution, so the
  ODE integrator's local truncation estimate is systematically
  wrong through custom ops. `ode_taylor_step` additionally gains a
  `const { assert!(K >= 1) }` compile-time check to reject the
  previously-UB `K = 0` instantiation.

Single-output guards (M16):
- `hessian`, `hvp`, and `hessian_vec` assert `num_outputs() == 1`.
  Multi-output tapes previously returned the Hessian/HVP of
  output_indices[0] alone — the caller got a numerically correct
  result for the wrong function.
- `laplacian_gpu` and `laplacian_with_control_gpu` return
  `Err(GpuError)` on multi-output tapes. Their per-batch `c2s[i]`
  indexing assumes a single scalar output; on multi-output tapes
  the backend lays out coefficients as `[batch, output]` row-major
  and the estimator silently aggregated the wrong stride. Required
  adding a `num_outputs(tape)` method to the `GpuBackend` trait so
  the check is backend-agnostic.

Input-contiguity guard (M17):
- `forward_into` now carries the same `debug_assert!(opcodes[..n]
  are Input)` that `forward` already had. Without it, a tape built
  by mixing `new_input` and `push_op` out of order would silently
  overwrite non-input slots on parallel evaluation paths.

Cross-country extract_jacobian (M11):
- Elevated the `non-input predecessor remains after elimination`
  debug_assert to a hard assert with an actionable message.
  Triggers when one declared output is an ancestor of another
  (cross-country's elimination protocol doesn't handle this
  topology); the release-mode skip would silently produce zero
  rows for the dependent output.

Documentation (M13 + M44):
- Added a "Determinism requirement" section to `checkpoint.rs`
  explaining that `step` must be bit-reproducible on re-execution
  for the adjoint to match the forward trajectory.
- Added a "Custom-ops caveat" section to `stde/mod.rs` warning
  that all estimators take `&tape` and never call `forward(x)`, so
  custom-op tapes are locked to the recording point `x_record`.

Seven new tests in `tests/structural_asserts.rs` verify each panic
fires on the bad input shape, plus a "scalar-output still works"
sanity check. One existing test (`custom_op_jvp`) was updated to
use `tape.jacobian` (reverse mode) instead of `jacobian_forward`
since the latter now rightly rejects custom ops.

Phase 6 addresses a batch of mid-severity findings from the bug hunt:

- LU singularity tolerance now anchors on the matrix infinity norm rather
  than a running max-pivot-seen, so an early small pivot no longer lowers
  the bar for every subsequent column.
- L-BFGS curvature filter uses the dimensionally-consistent Cauchy-Schwarz
  form `sy > eps·sqrt(ss·yy)` and clamps the initial gamma to [1e-3, 1e3]
  so pairs that just pass the filter don't collapse the search direction.
- Newton falls back to steepest descent when the LU solve fails or yields
  a non-descent direction, keeping the solver usable on non-convex
  problems instead of reporting NumericalError at the first saddle.
- Trust-region and piggyback iterations detect non-finite predicted/ρ
  and non-finite tangent/adjoint components, covering ratio-converging
  divergences the primal norm check would miss.
- Kahan/Neumaier compensated summation replaces naive accumulation in
  `norm` and `dot` once the vector crosses 64 elements, so ULP noise
  doesn't leak into the convergence test on long vectors.
- Sparse implicit probe switches to a mixed-sign RHS plus residual check
  so near-singular F_z can't slide through via a null-space-orthogonal
  all-ones test.
- `record_multi` rejects zero-output closures, `hessian`/`hessian_vec`
  and the sparse siblings recover the constant value from a primal pass
  on zero-input tapes, `mixed_partial` asserts that `orders.len()`
  matches the tape inputs, and `extraction_prefactor` falls back to
  log-domain accumulation so an overflowing integer product returns a
  clean `+inf` rather than NaN.

Each change has a regression test and the new assertions mirror the
existing structural-assert style from Phase 5.
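The accumulation pattern behind the `norm`/`dot` change is Neumaier's variant of Kahan summation; a standalone sketch (names illustrative):

```rust
// Neumaier compensated summation: tracks the low-order bits the naive
// running sum discards and folds them back in at the end.
fn neumaier_sum(xs: &[f64]) -> f64 {
    let mut sum = 0.0;
    let mut comp = 0.0; // running compensation
    for &x in xs {
        let t = sum + x;
        if sum.abs() >= x.abs() {
            comp += (sum - t) + x; // low-order bits of x were lost
        } else {
            comp += (x - t) + sum; // low-order bits of sum were lost
        }
        sum = t;
    }
    sum + comp
}

fn main() {
    // Naive left-to-right accumulation loses the 1.0 entirely.
    let xs = [1e100, 1.0, -1e100];
    let naive: f64 = xs.iter().sum();
    assert_eq!(naive, 0.0);
    assert_eq!(neumaier_sum(&xs), 1.0);
}
```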
Five low-severity latent issues in the series arithmetic surface as
silent NaN/Inf or two's-complement wraparound today. Each now routes
through an explicit degenerate value so downstream code sees a uniformly
bad result instead of a mixed-state tape.

- Laurent::recip uses checked_neg on pole_order so i32::MIN inputs fold
  to a NaN Laurent rather than silently wrapping back to i32::MIN.
- Laurent Mul/Div switch to checked_add/checked_sub on the combined
  pole_order; overflow returns NaN via the existing nan_pub path.
- taylor_sqrt on a negative leading coefficient fills the whole series
  with NaN; previously the primal went NaN while higher coefficients
  kept computing from `1/(2·NaN)` and looked finite to callers.
- Taylor::rem with a zero divisor returns an all-NaN Taylor instead of
  a mix of Inf and NaN slots.
- taylor_ln_1p gets a debug_assert on the buffer length so misuse
  surfaces with an actionable message instead of a raw slice panic.

Regression tests live in tests/phase8_taylor_laurent.rs.

Three reverse-mode partials misbehaved at boundary or out-of-domain inputs
in ways that produced numerically plausible but semantically meaningless
gradients.

- Ln / Log2 / Log10 at a < 0 returned a finite partial like `1/-1 = -1`,
  letting an out-of-domain primal's gradient flow downstream as if it were
  real. They now return `(NaN, 0)` when strictly below the domain. The
  boundary a = 0 still evaluates via IEEE `1/0 = +Inf` to preserve the
  one-sided derivative limit callers rely on.
- Ln1p guards `a < -1` the same way; Atanh guards `|a| > 1`.
- Abs at ±0 used `a.signum()`, which returns ±0 depending on the sign bit
  of its argument — so algebraically equivalent tape positions produced
  different subgradients. The kink now resolves to 0 (symmetric midpoint
  of the Clarke subdifferential [-1, 1]) regardless of sign bit. The
  forced_reverse_partials path, which the tape uses to pin a specific
  branch at a kink, keeps its ±1 behaviour via a separate code path.
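The kink convention in isolation (a free function for illustration):

```rust
// Subgradient of |a|: resolve the kink at +/-0 to the symmetric midpoint 0
// of the Clarke subdifferential [-1, 1], independent of the sign bit.
fn abs_partial(a: f64) -> f64 {
    // a == 0.0 is true for -0.0 as well, unlike a.signum(), which returns
    // -1.0 for -0.0 and +1.0 for +0.0.
    if a == 0.0 { 0.0 } else { a.signum() }
}

fn main() {
    assert_eq!(abs_partial(0.0), 0.0);
    assert_eq!(abs_partial(-0.0), 0.0); // signum would have given -1 here
    assert_eq!(abs_partial(-3.0), -1.0);
    assert_eq!(abs_partial(2.0), 1.0);
}
```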

L16 from the Phase 8 plan was dropped: `reverse_partials(Powi, ...)` is a
public entry point exercised by opcode_edge_cases::reverse_powi, so the
inline dispatch in the tape runtime coexists with the standalone function
rather than deprecating it.

Regression tests live in tests/phase8_opcode.rs.
Six convergence/robustness cleanups in the optim crate.

- Trust-region radius collapse: when `rho < 1/4` (or an uphill predicted
  step) would shrink the radius below `config.min_radius`, the solver
  now returns `NumericalError` instead of clamping to `min_radius` and
  spinning to `MaxIterations`. Matches the doc contract on
  `TrustRegionConfig::min_radius`.
- `rho` tiny-predicted threshold scales with `(1 + |f|)`; the previous
  `F::epsilon()` floor tripped the special-case branch immediately on
  large-magnitude objectives.
- `boundary_tau` thresholds scale with problem magnitude: `dd` gated by
  `eps * (||s||² + radius² + 1)`, round-off-negative discriminants
  tolerated up to `eps * b²`, and the Vieta-vs-direct cutover uses
  `eps * radius` instead of a fixed `eps` so tiny-scale subproblems
  don't stall on a spurious `tau = 0`.
- Steihaug CG drops a dead-code pre-loop tolerance check that was
  tautologically `||r|| < sqrt(eps)·||r||`.
- `func_tol` in trust_region, L-BFGS, and Newton compares
  `|f_prev - f_val| < tol * (1 + |f|)` so a user-set 1e-6 means
  "one part per million" regardless of whether |f| is around 1 or 1e10.

Regression tests live in echidna-optim/tests/phase8_solvers.rs.
Seven low-severity / latent items cleared with guards, sibling fixes,
doc alignments, and one breaking-but-correct API signature change.

- set_outputs bounds-checks every index against `self.values.len()`
  with an actionable panic message instead of silently propagating
  out-of-range output slots into downstream accessors. Duplicate
  indices remain valid (a tape may expose the same variable under two
  outputs).
- hessian_diagonal_with_buf gets the n==0 early-return treatment that
  hessian / hessian_vec / sparse_hessian already received — empty-input
  tapes now recover the constant output value via a primal pass.
- ndarray wrappers copy inputs element-wise via a small `to_vec`
  helper, matching the faer/nalgebra adapter pattern. The old
  `.as_slice().unwrap()` panicked on any non-contiguous layout
  (stepped slices, transposed views).
- sparse_jacobian_ndarray now returns `(Array1<F>, pattern, Vec<F>)`
  instead of `(pattern, Vec<F>)`, surfacing the outputs the old
  signature discarded. Breaking change, acceptable in 0.x.
- eval_dyn's `Σx[i]` placeholder becomes `F::zero()` — the placeholder
  was defensive dead code, but `F::zero()` is a safer fallback if the
  JetPlan invariant ever slips.
- Checkpoint docs: Revolve truncation's O(n²) worst-case is now
  called out, and `grad_checkpointed_online` correctly documents the
  pre-loop `stop(x0, 0)` call.
- mixed_partial docstring no longer claims a nonexistent panic on
  all-zero orders; documents the identity-operator behaviour instead.

L13 (active_kinks excluding non-finite switching values) was dropped:
an earlier bug-hunt regression test pins the opposite contract
(conservative inclusion of NaN/Inf kinks), so reversing it would
break documented behaviour. Flagged the tension in the active_kinks
doc comment for future review.

Regression tests live in tests/phase8_misc.rs.
Six numerical / safety fixes to the GPU correctness baseline, applied
symmetrically to the WGSL hand-written shaders (forward, reverse,
tangent_forward, tangent_reverse), the CUDA NVRTC kernel, and the
Taylor codegen that emits both families.

- ATAN: the `1/(1+a²)` derivative overflows in f32 for |a|>1.84e19,
  collapsing to 0 via `1/inf`. Switch to the inv-based form
  `inv²/(1+inv²)` for |a|>1e8 (and its second-order analogue in the
  tangent_reverse kernel). Mirrors the CPU fix applied in Phase 2A.
- DIV: `db = -a*inv²` overflows for small |b| (inv² alone can exceed
  f32::MAX before the multiply), and the tangent_reverse second-order
  uses `inv³`, worse. Factor through the primal `r = a/b` so each
  term drops one `inv` — `db = -r*inv` for first order and the
  `inv³` term becomes `r*inv²` for second order.
- POWI n=1 at a=0: the tangent_reverse second derivative is the
  general `n*(n-1)*pow(a, n-2)*at` which for n=1 evaluates
  `0 * pow(0, -1) * at = 0 * inf = NaN`. Short-circuit n==1 to
  `da_eps=0`, matching the mathematically exact zero second derivative
  of a linear function.
- EXPM1 / LN1P (WGSL tangent kernels only): both primals used the
  cancellation-prone `exp(a)-1` / `log(1+a)` forms, losing ~7 digits
  for |a|<1e-4. The forward-shader helpers `expm1_f32` / `ln1p_f32`
  (Taylor shortcut below threshold) are now duplicated into the two
  tangent shaders. Tangent kernels read the primal when forming
  `rt`, so a drifted primal propagates.
- HYPOT in Taylor codegen: the emitted `sqrt(a*a + b*b)` primal patch
  overflows at large leading-order magnitude. WGSL now computes
  `h * sqrt((a/h)² + (b/h)²)` via max-rescale, CUDA uses the
  `hypot(a, b)` builtin. Higher-order jet coefficients still pass
  through `jet_mul`/`jet_add` and remain overflow-risky; a full jet-
  wide rescale is a follow-up.
- CUDA upload_tape empty output_indices: the CUDA path was crashing
  on `clone_htod(&[])` when a downstream caller cleared
  `GpuTapeData.output_indices` (valid since the struct fields are
  public). Defensive fallback now synthesises
  `vec![data.output_index]` with `num_outputs = 1`, matching wgpu.
  Applied to both `upload_tape` (f32) and `upload_tape_f64`.

Regression tests live in tests/phase7_gpu_fixes.rs. WGSL runs on Metal
(M4 Max) locally, CUDA runs on A100 via vast.ai. Both backends pass
the Phase 7 tests plus the full pre-existing GPU suite.
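The DIV factoring is easy to demonstrate in host-side f32 (illustrative free functions; the real fix is in the shader/kernel sources):

```rust
fn div_db_naive(a: f32, b: f32) -> f32 {
    let inv = 1.0 / b;
    let inv2 = inv * inv; // inv^2 alone can exceed f32::MAX
    -a * inv2
}

fn div_db_factored(a: f32, b: f32) -> f32 {
    let inv = 1.0 / b;
    let r = a / b; // reuse the primal
    -r * inv       // one fewer power of inv per term
}

fn main() {
    // True db = -a/b^2 = -1e35 is representable in f32, but inv^2 = 1e40
    // is not, so the naive form overflows on the way there.
    let (a, b) = (1.0e-5_f32, 1.0e-20_f32);
    assert!(div_db_naive(a, b).is_infinite());
    let db = div_db_factored(a, b);
    assert!(db.is_finite());
    assert!((db as f64 + 1.0e35).abs() / 1.0e35 < 1e-5);
}
```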
Surfaces GPU driver failures and locks down CUDA-CPU numerical parity.

- wgpu device.poll at all five submission-wait sites now returns the
  `Result<PollStatus, PollError>` through `?` into `GpuError::Other`
  instead of `let _ = device.poll(...)`. On device loss (driver reset,
  another submission OOM), callers used to see the subsequent
  `rx.recv()` block forever, looking like a deadlock. The actual
  driver error now propagates up to the caller.
- NVRTC compilation for every kernel family (f32, f64, Taylor codegen)
  now passes `CompileOptions { fmad: Some(false), ..Default::default() }`.
  The CUDA default contracts `a*b + c` into an FMA that rounds once
  instead of twice, producing sub-ULP drift vs the CPU's unfused
  arithmetic. Disabling FMA gives bit-exact CPU-GPU parity on f64
  (exercised by `l25_cuda_fmad_disabled_bit_exact_with_cpu_f64`). The
  perf cost is 1-2% in our kernel mix; callers chasing raw throughput
  can revert the option.

Additionally pins behaviour confirmed correct during Phase 7 exploration:

- `m26_wgpu_abs_at_positive_zero` and `_at_negative_zero`: the WGSL
  shaders already use `bitcast<u32>` to inspect the sign bit, so Abs
  at -0.0 gives -1 and at +0.0 gives +1. No code change; the tests
  lock this in against future refactors.
- `m28_cuda_rem_quotient_boundary_consistent`: the CUDA REM primal
  (`fmod`) and reverse partials (`db = -trunc(a/b)`) are internally
  consistent. No code change; regression test pins the convention.

M23's runtime path (device loss) can't be simulated in a unit test —
the error-propagation code is exercised by the compile-gate alone.

Tests: tests/phase7_gpu_parity.rs. WGSL verified on Metal/M4 Max,
CUDA verified on A100 via vast.ai.

Three real bugs surfaced by a retroactive multi-agent review of the
Phase 7 + Phase 8 commits (d0bbab9..4dbee62):

- WGSL `tangent_reverse.wgsl` POWI was missing the `n == 1` short-
  circuit added to the CUDA kernel in Phase 7. HVP on `x.powi(1)` at
  x=0 returned NaN on the wgpu backend while CUDA returned 0, breaking
  CPU-GPU and CUDA-WGSL parity on tapes built with the public
  `OpCode::Powi` API.
- `TaylorDyn::rem` missed the Phase 8 Taylor::rem zero-divisor guard.
  The dynamic counterpart silently produced a mix of finite and
  NaN/Inf coefficients when `rhs.coeffs[0] == 0`.
- Taylor codegen WGSL HYPOT primal produced NaN for `hypot(±Inf, x)`
  (the rescale formula gave `Inf/Inf * 0 = NaN`). IEEE requires
  `hypot(±Inf, x) = +Inf`, and the CUDA builtin and CPU both honour
  it. Added an explicit `aa == inf || bb == inf` early-return that
  short-circuits to +Inf before the rescale.
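Why the POWI `n == 1` short-circuit matters can be seen in the second-derivative formula `n·(n-1)·x^(n-2)`: at `n = 1`, `x = 0` the unguarded form evaluates `0 · x^(-1) = 0 · Inf = NaN` even though the true value is 0. A CPU-side sketch (the function name is illustrative, not the kernel's):

```rust
// Second partial of x.powi(n), with the guard the WGSL kernel was missing.
fn powi_second_partial(x: f64, n: i32) -> f64 {
    if n == 1 {
        return 0.0; // d²/dx² of x¹ is identically zero
    }
    f64::from(n) * f64::from(n - 1) * x.powi(n - 2)
}
```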

New regression tests:
- `l23_wgpu_powi_n1_at_zero_second_derivative_finite` — WGSL sibling
  of the CUDA L23 test.
- `m29_wgpu_atan_moderate_large_a_positive_normal_f32` — strengthens
  the earlier M29 wgpu test by using `a=3e9` where the f32 result is
  a normal (not denormal) float, so Metal's denormal flush doesn't
  mask a regression to `0.0`.
- `l9_taylor_dyn_rem_zero_divisor_returns_nan` — pins the TaylorDyn
  sibling of the existing Taylor::rem test.
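The normal-vs-denormal distinction behind the strengthened M29 test can be checked on the CPU (the reverse partial for large `|a|` scales like `1/a²`; the values here are illustrative):

```rust
// For a = 3e9, 1/a² ≈ 1.1e-19 is a normal f32, so a denormal-flushing
// GPU cannot silently return 0.0 and still pass. For a = 1e19, 1/a²
// ≈ 1e-38 is subnormal and Metal's flush-to-zero would mask a bug.
fn inv_square_f32(a: f32) -> f32 {
    1.0f32 / (a * a)
}
```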

Out of scope, noted for a follow-up: the scalar `forward.wgsl
hypot_f32` and `tangent_forward.wgsl` HYPOT primal arms still use the
unguarded `sqrt(a*a + b*b)` form. They pre-date Phase 7 and a separate
pass should bring them in line with the Taylor codegen rescale.
Two pre-existing WGSL HYPOT issues that were flagged out-of-scope
during the Phase 7/8 retroactive review, now fixed:

- `forward.wgsl hypot_f32`: the rescaled formula computes
  `mn/mx = Inf/Inf = NaN` when both operands are Inf, then returns
  `Inf * NaN = NaN`. IEEE 754-2008 mandates `hypot(±Inf, x) = +Inf`
  for any x (including NaN). Added an `ax == inf || ay == inf`
  short-circuit before the rescale. `forward.wgsl` is the source of
  the `r` value that reverse.wgsl and tangent_reverse.wgsl read back,
  so fixing this helper lifts the IEEE contract across the whole
  reverse-primal chain.
- `tangent_forward.wgsl` HYPOT arm: primal was `sqrt(a*a + b*b)`
  unguarded, overflowing for `|a|` or `|b|` > sqrt(f32::MAX) ≈ 1.84e19.
  Replaced with a call to `hypot_f32`, and added the helper (plus the
  same IEEE Inf guard) to this shader. Brings tangent_forward's
  HYPOT primal bit-for-bit in line with the forward kernel.
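A CPU-side sketch of the guarded rescale (not the shader itself, but the same arithmetic):

```rust
// Rescaled hypot: avoids overflow of a*a + b*b, with the IEEE 754-2008
// Inf short-circuit that the unguarded rescale violates
// (mn/mx = Inf/Inf = NaN when both operands are infinite).
fn hypot_sketch(a: f32, b: f32) -> f32 {
    let (aa, bb) = (a.abs(), b.abs());
    // IEEE: hypot(±Inf, x) = +Inf for any x, including NaN.
    if aa == f32::INFINITY || bb == f32::INFINITY {
        return f32::INFINITY;
    }
    let (mn, mx) = if aa < bb { (aa, bb) } else { (bb, aa) };
    if mx == 0.0 {
        return 0.0;
    }
    let r = mn / mx;
    mx * (1.0 + r * r).sqrt()
}
```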

Regression tests (all wgpu) in tests/phase7_gpu_fixes.rs:
- `wgpu_hypot_inf_finite_returns_inf`
- `wgpu_hypot_inf_inf_returns_inf` — the triple-backend divergence
  case that motivated the fix (CPU ✓, CUDA ✓, WGSL was NaN).
- `wgpu_tangent_forward_hypot_large_magnitude_primal_finite` —
  pins the rescale so `hypot(1e20, 1e20)` stays at ≈1.41e20 instead
  of overflowing to +Inf.

Two CI failures on PR #56:
- `cargo fmt --check` flagged ~22 files where the added `checked_add`
  / Kahan / relative-tol expressions exceeded rustfmt's single-line
  width threshold and had been wrapped by hand rather than by rustfmt.
- MSRV (1.93) build failed because `tests/phase8_opcode.rs` imports
  `echidna::opcode::{reverse_partials, OpCode}`, which is only re-
  exported when the `bytecode` feature is enabled. The test now
  gates the whole module on `feature = "bytecode"`, matching the
  other Phase 8 test files.

Clippy 1.95 added stricter lints that the CI runner enforces via
`-D warnings`. None of the flagged sites are bugs — they're style
preferences newer Clippy opted into — but the PR gate fails on them.

- `src/bytecode_tape/mod.rs` OpCode::Sub / OpCode::Div optimization
  arms: collapsed the inner `if` guard into the match pattern
  (`OpCode::Sub if … => { … }`).
- `src/taylor_ops.rs` hypot branch: replaced two manual 1..n loops
  with `copy_from_slice`.
- `src/gpu/mod.rs` chunk-size cap: replaced the `if nv_k > 0` guard
  with `(u32::MAX as u64).checked_div(nv_k)`.
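Sketches of the first and third rewrites (enum and function names hypothetical; the real arms carry the actual folding logic):

```rust
// Collapsed match guard: the condition moves from a nested `if` into
// the pattern, which is what newer Clippy's lint asks for.
enum OpCode { Sub, Div }

fn fold(op: &OpCode, same_operand: bool) -> &'static str {
    match op {
        OpCode::Sub if same_operand => "zero", // was: OpCode::Sub => { if same_operand { … } }
        _ => "keep",
    }
}

// `checked_div` replaces the manual zero guard: None on nv_k == 0,
// otherwise the same quotient as before.
fn chunk_cap(nv_k: u64) -> Option<u64> {
    (u32::MAX as u64).checked_div(nv_k)
}
```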

No behaviour changes.
@gvonness-apolitical gvonness-apolitical merged commit e22c080 into main Apr 18, 2026
7 checks passed
@gvonness-apolitical gvonness-apolitical deleted the fix/harden-cycle-6 branch April 18, 2026 20:13