Two correctness fixes (1: a dynamo dead-elimination bug, 2: a no-silent-fallback hardening), plus C2 gate wiring (3) and debug harnesses (4):
1) residual_buf + gate_buf dynamo dead-elimination
Both qwen3_5.py call sites for the BF16 residual / gate mirror
`.copy_()` lived inside `try/except` blocks whose protected line
`get_forward_context().attn_metadata[layer_name]` raises at
torch.compile trace time (forward_context is None). Dynamo
concluded the try body was always-caught dead code, and the
captured PIECEWISE graph dropped the `.copy_()`. At runtime the
buffers stayed at the CUDA-graph-allocator-zeroed value →
β-coop / paged read zeros → gibberish. Verified 2026-04-26
via /tmp/nvllm-dumps: residual_in absmax=0.0 across all 16
full-attn layers pre-fix.
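The failure shape can be sketched in miniature (illustrative, not the real qwen3_5.py code): the protected attribute access raises whenever the context is None, so the mirror copy never executes on that path. Under torch.compile, the context is None at trace time, dynamo treats the try body as always-caught dead code, and the captured graph drops the copy even for real runtime inputs.

```python
import torch

def forward_fragment(residual_buf: torch.Tensor,
                     hidden: torch.Tensor, ctx) -> None:
    try:
        _ = ctx.attn_metadata["layer"]   # raises when ctx is None (trace time)
        residual_buf.copy_(hidden)       # the mirror dynamo dead-eliminated
    except Exception:
        pass                             # always-caught on the traced path

class FakeCtx:                           # stand-in for a live forward context
    attn_metadata = {"layer": object()}

buf = torch.zeros(2)
forward_fragment(buf, torch.ones(2), None)       # trace-time shape: no copy
forward_fragment(buf, torch.ones(2), FakeCtx())  # runtime shape: copy fires
```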
Fix: new `cute_residual_mirror` opaque op in _mlp_op.py with
`mutates_args=["residual_buf"]`. The first-pass attempt with
`mutates_args=[]` was still dead-eliminated — the mutates_args
declaration is what tells torch.compile the op has a real
side effect on a tracked tensor. Both qwen3_5.py call sites
(Qwen3_5DecoderLayer.forward residual_buf @L427, Qwen3_5Attention.forward
gate_buf @L253) now route through the op.
This was an actual bug present before β-coop ever fired:
paged kernel was silently reading zero residual_buf in any
PIECEWISE deployment using fusion. Standalone correctness win.
2) β-coop predicate hard-gate (no-silent-fallback)
`_will_fire_beta_coop_pre` and `_use_beta_coop` previously
bypassed the `(64 * num_seqs) <= _resident_cap` cooperative-launch
fitness check when forced_path == "coop", under the assumption
"user asked for coop, they know what they're doing." But on
multi-seq decode (e.g. nat=3 batches) the fixed grid exceeds
the resident cap → CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE →
except-handler fallthrough to β-lite. β-lite is MLP-only with
no attention → silent gibberish.
Fix: cooperative-launch fitness is now a HARD gate regardless
of forced_path. If the grid won't fit, paged_attention_forward
stays in the decode path. Predicate is duplicated at two sites
(`_will_fire_beta_coop_pre` for the paged-skip decision and
`_use_beta_coop` for the dispatch) — kept in sync via comment
cross-refs. Per memory:feedback_no_silent_fallbacks.
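The hard gate reduces to one predicate; this sketch uses the 64-CTAs-per-sequence grid and resident-cap check named above, while the helper names and the simplified dispatch are illustrative, not the real `_use_beta_coop`.

```python
CTAS_PER_SEQ = 64   # fixed grid: 64 CTAs per sequence (from the commit)

def coop_launch_fits(num_seqs: int, resident_cap: int) -> bool:
    """Cooperative launches need the whole grid co-resident on the device;
    exceeding the cap fails with CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE
    instead of queuing like a normal launch."""
    return CTAS_PER_SEQ * num_seqs <= resident_cap

def use_beta_coop(num_seqs: int, resident_cap: int, forced: bool) -> bool:
    # HARD gate: checked even when the user forces the coop path, so an
    # oversized grid stays on the paged decode path rather than crashing
    # into the attention-free beta-lite fallback.
    return forced and coop_launch_fits(num_seqs, resident_cap)

# e.g. a 148-block resident cap fits 2 sequences (128 CTAs) but not 3 (192):
assert use_beta_coop(2, 148, forced=True)
assert not use_beta_coop(3, 148, forced=True)
```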
3) C2 attn-output-gate wired through β-coop kernel
phase_e_kernel.py: gate_ptr + gate_fused flag added to
PhaseE_Beta_Kernel.run_beta_coop_full and to the JIT signature.
gate_fused == 0 disables the multiply (back-compat for callers
that don't supply gate_buf). _backend.py β-coop dispatch passes
self.gate_buf[:nat]. Mirrors paged kernel.py:1555-1569.
This is the consumer side of fix #1 — without #1 the gate buffer
was always zero so the flag couldn't have been observed.
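The back-compat contract is easiest to see as a Python branch (the real code is a flag on a JIT kernel, not Python): `gate_buf` is only read when `gate_fused == 1`, so legacy callers that supply no gate keep their old numerics bit-for-bit.

```python
import torch

def apply_attn_output_gate(attn_out: torch.Tensor,
                           gate_buf, gate_fused: int) -> torch.Tensor:
    if gate_fused == 0 or gate_buf is None:
        return attn_out                 # multiply disabled (legacy path)
    return attn_out * gate_buf          # C2 attn-output gate

out = torch.full((2, 4), 2.0)
gate = torch.full((2, 4), 0.5)
assert torch.equal(apply_attn_output_gate(out, None, 0), out)
assert torch.equal(apply_attn_output_gate(out, gate, 1),
                   torch.full((2, 4), 1.0))
```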
4) Env-gated tensor dump harness (kept per feedback_keep_debug_harnesses)
_backend.py β-coop branch: CUTE_DUMP_TENSORS=1 dumps
{residual_in, query, gate, residual_out, rmsnorm_out} per
(layer × decode step), bounded to 3 steps × 16 layers.
Files land in /tmp/nvllm-dumps/. serve-cute.sh adds the
bind mount and env passthrough. Used to bisect this bug;
keeping for the next graph-capture investigation.
Also: BETA_DIFF harness clones paged's wo_output / rmsnorm_output /
residual_output before β-coop overwrites them, then logs the
delta. Gated on CUTE_DEBUG_FUSION=1, only fires in dual-fire mode
(skipped when paged is gated off). Verified BETA_DIFF=0 with
FIXED inputs — β-coop math byte-identical to paged.
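The diff harness logic, sketched with illustrative helper names (buffer names follow the commit): snapshot the paged kernel's outputs before β-coop overwrites the shared buffers, then report the absmax delta per buffer, where 0.0 means byte-identical math on fixed inputs.

```python
import torch

def snapshot(bufs: dict) -> dict:
    # Must run BEFORE the coop kernel: both paths write the same buffers.
    return {k: v.detach().clone() for k, v in bufs.items()}

def beta_diff(paged: dict, coop: dict) -> float:
    # 0.0 means byte-identical math between the two paths on fixed inputs
    return max((paged[k] - coop[k]).abs().max().item() for k in paged)

paged_outs = {"wo_output": torch.ones(4)}
saved = snapshot(paged_outs)            # clone before the overwrite
paged_outs["wo_output"].mul_(2.0)       # coop path clobbers the live buffer
assert beta_diff(saved, paged_outs) == 1.0
```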
Validation matrix (2026-04-26 EOD, ig1/Qwen3.5-27B-NVFP4):
- PIECEWISE + paged-only: COHERENT ✓
- PIECEWISE + dual-fire (paged + β-coop): COHERENT ✓ BETA_DIFF=0
- PIECEWISE + solo β-coop: GIBBERISH ✗ (remaining)
- EAGER + solo β-coop: COHERENT ✓
The remaining solo-β-coop gibberish under PIECEWISE is upstream of
β-coop entirely — layer 3 inputs (the first full-attn layer, after
3 untouched linear-attn layers) differ between dual-fire and solo
modes for the same prompt + seed. Captured CUDA graph layout / compile
artifact differs depending on whether paged is also in the captured
segment. Investigation paths in memory:project_beta_coop_residual_solo_bug.
Side-by-side dumps preserved at /tmp/nvllm-dumps-{dualfire,solo}
(80 files each) for next session.
Refs: memory:project_beta_coop_residual_solo_bug
memory:project_uber_kernel_migration
memory:feedback_no_silent_fallbacks
memory:feedback_keep_debug_harnesses
memory:feedback_layer_output_contract
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>