Two correctness fixes (1: a dynamo dead-elimination bug, 2: a no-silent-fallback hardening), plus C2 gate wiring (3) and debug harnesses (4):
1) residual_buf + gate_buf dynamo dead-elimination
Both qwen3_5.py call sites for the BF16 residual / gate mirror
`.copy_()` lived inside `try/except` blocks whose protected line
`get_forward_context().attn_metadata[layer_name]` raises at
torch.compile trace time (forward_context is None). Dynamo
concluded the try body was always-caught dead code, and the
captured PIECEWISE graph dropped the `.copy_()`. At runtime the
buffers stayed at the CUDA-graph-allocator-zeroed value →
β-coop / paged read zeros → gibberish. Verified 2026-04-26
via /tmp/nvllm-dumps: residual_in absmax=0.0 across all 16
full-attn layers pre-fix.
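The failure shape can be sketched in miniature (illustrative, not the real qwen3_5.py code): the protected attribute access raises whenever the context is None, so the mirror copy never executes on that path. Under torch.compile, the context is None at trace time, dynamo treats the try body as always-caught dead code, and the captured graph drops the copy even for real runtime inputs.

```python
import torch

def forward_fragment(residual_buf: torch.Tensor,
                     hidden: torch.Tensor, ctx) -> None:
    try:
        _ = ctx.attn_metadata["layer"]   # raises when ctx is None (trace time)
        residual_buf.copy_(hidden)       # the mirror dynamo dead-eliminated
    except Exception:
        pass                             # always-caught on the traced path

class FakeCtx:                           # stand-in for a live forward context
    attn_metadata = {"layer": object()}

buf = torch.zeros(2)
forward_fragment(buf, torch.ones(2), None)       # trace-time shape: no copy
forward_fragment(buf, torch.ones(2), FakeCtx())  # runtime shape: copy fires
```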
Fix: new `cute_residual_mirror` opaque op in _mlp_op.py with
`mutates_args=["residual_buf"]`. The first-pass attempt with
`mutates_args=[]` was still dead-eliminated — the mutates_args
declaration is what tells torch.compile the op has a real
side effect on a tracked tensor. Both qwen3_5.py call sites
(Qwen3_5DecoderLayer.forward residual_buf @L427, Qwen3_5Attention.forward
gate_buf @L253) now route through the op.
This was an actual bug present before β-coop ever fired:
paged kernel was silently reading zero residual_buf in any
PIECEWISE deployment using fusion. Standalone correctness win.
2) β-coop predicate hard-gate (no-silent-fallback)
`_will_fire_beta_coop_pre` and `_use_beta_coop` previously
bypassed the `(64 * num_seqs) <= _resident_cap` cooperative-launch
fitness check when forced_path == "coop", under the assumption
"user asked for coop, they know what they're doing." But on
multi-seq decode (e.g. nat=3 batches) the fixed grid exceeds
the resident cap → CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE →
except-handler fallthrough to β-lite. β-lite is MLP-only with
no attention → silent gibberish.
Fix: cooperative-launch fitness is now a HARD gate regardless
of forced_path. If the grid won't fit, paged_attention_forward
stays in the decode path. Predicate is duplicated at two sites
(`_will_fire_beta_coop_pre` for the paged-skip decision and
`_use_beta_coop` for the dispatch) — kept in sync via comment
cross-refs. Per memory:feedback_no_silent_fallbacks.
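The hard gate reduces to one predicate; this sketch uses the 64-CTAs-per-sequence grid and resident-cap check named above, while the helper names and the simplified dispatch are illustrative, not the real `_use_beta_coop`.

```python
CTAS_PER_SEQ = 64   # fixed grid: 64 CTAs per sequence (from the commit)

def coop_launch_fits(num_seqs: int, resident_cap: int) -> bool:
    """Cooperative launches need the whole grid co-resident on the device;
    exceeding the cap fails with CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE
    instead of queuing like a normal launch."""
    return CTAS_PER_SEQ * num_seqs <= resident_cap

def use_beta_coop(num_seqs: int, resident_cap: int, forced: bool) -> bool:
    # HARD gate: checked even when the user forces the coop path, so an
    # oversized grid stays on the paged decode path rather than crashing
    # into the attention-free beta-lite fallback.
    return forced and coop_launch_fits(num_seqs, resident_cap)

# e.g. a 148-block resident cap fits 2 sequences (128 CTAs) but not 3 (192):
assert use_beta_coop(2, 148, forced=True)
assert not use_beta_coop(3, 148, forced=True)
```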
3) C2 attn-output-gate wired through β-coop kernel
phase_e_kernel.py: gate_ptr + gate_fused flag added to
PhaseE_Beta_Kernel.run_beta_coop_full and to the JIT signature.
gate_fused == 0 disables the multiply (back-compat for callers
that don't supply gate_buf). _backend.py β-coop dispatch passes
self.gate_buf[:nat]. Mirrors paged kernel.py:1555-1569.
This is the consumer side of fix #1 — without #1 the gate buffer
was always zero so the flag couldn't have been observed.
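The back-compat contract is easiest to see as a Python branch (the real code is a flag on a JIT kernel, not Python): `gate_buf` is only read when `gate_fused == 1`, so legacy callers that supply no gate keep their old numerics bit-for-bit.

```python
import torch

def apply_attn_output_gate(attn_out: torch.Tensor,
                           gate_buf, gate_fused: int) -> torch.Tensor:
    if gate_fused == 0 or gate_buf is None:
        return attn_out                 # multiply disabled (legacy path)
    return attn_out * gate_buf          # C2 attn-output gate

out = torch.full((2, 4), 2.0)
gate = torch.full((2, 4), 0.5)
assert torch.equal(apply_attn_output_gate(out, None, 0), out)
assert torch.equal(apply_attn_output_gate(out, gate, 1),
                   torch.full((2, 4), 1.0))
```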
4) Env-gated tensor dump harness (kept per feedback_keep_debug_harnesses)
_backend.py β-coop branch: CUTE_DUMP_TENSORS=1 dumps
{residual_in, query, gate, residual_out, rmsnorm_out} per
(layer × decode step), bounded to 3 steps × 16 layers.
Files land in /tmp/nvllm-dumps/. serve-cute.sh adds the
bind mount and env passthrough. Used to bisect this bug;
keeping for the next graph-capture investigation.
Also: BETA_DIFF harness clones paged's wo_output / rmsnorm_output /
residual_output before β-coop overwrites them, then logs the
delta. Gated on CUTE_DEBUG_FUSION=1, only fires in dual-fire mode
(skipped when paged is gated off). Verified BETA_DIFF=0 with
FIXED inputs — β-coop math byte-identical to paged.
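The diff harness logic, sketched with illustrative helper names (buffer names follow the commit): snapshot the paged kernel's outputs before β-coop overwrites the shared buffers, then report the absmax delta per buffer, where 0.0 means byte-identical math on fixed inputs.

```python
import torch

def snapshot(bufs: dict) -> dict:
    # Must run BEFORE the coop kernel: both paths write the same buffers.
    return {k: v.detach().clone() for k, v in bufs.items()}

def beta_diff(paged: dict, coop: dict) -> float:
    # 0.0 means byte-identical math between the two paths on fixed inputs
    return max((paged[k] - coop[k]).abs().max().item() for k in paged)

paged_outs = {"wo_output": torch.ones(4)}
saved = snapshot(paged_outs)            # clone before the overwrite
paged_outs["wo_output"].mul_(2.0)       # coop path clobbers the live buffer
assert beta_diff(saved, paged_outs) == 1.0
```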
Validation matrix (2026-04-26 EOD, ig1/Qwen3.5-27B-NVFP4):
- PIECEWISE + paged-only: COHERENT ✓
- PIECEWISE + dual-fire (paged + β-coop): COHERENT ✓ BETA_DIFF=0
- PIECEWISE + solo β-coop: GIBBERISH ✗ (remaining)
- EAGER + solo β-coop: COHERENT ✓
The remaining solo-β-coop gibberish under PIECEWISE is upstream of
β-coop entirely — layer 3 inputs (the first full-attn layer, after
3 untouched linear-attn layers) differ between dual-fire and solo
modes for the same prompt + seed. Captured CUDA graph layout / compile
artifact differs depending on whether paged is also in the captured
segment. Investigation paths in memory:project_beta_coop_residual_solo_bug.
Side-by-side dumps preserved at /tmp/nvllm-dumps-{dualfire,solo}
(80 files each) for next session.
Refs: memory:project_beta_coop_residual_solo_bug
memory:project_uber_kernel_migration
memory:feedback_no_silent_fallbacks
memory:feedback_keep_debug_harnesses
memory:feedback_layer_output_contract
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>