
Commit a65bcef

Natfii and claude committed
fix(cute): C1 — β-coop and β-lite read residual_buf, not residual_output
β-coop's Phase 1C residual_in pointed at self.residual_output, which
paged_attention_forward had already filled with (h+r) + wo_out =
residual_post_attn. β-coop then re-added wo_out inside its own Phase 1C,
producing 2·wo_out + h + r — gibberish output cascading through 16 fused
full-attn layers, observed as " 2 ".

The same alias existed in β-lite's residual_post_ln source (audit Finding 6;
β-lite never re-ran Phase C, so the corruption only manifested when β-coop
fired, but β-lite was structurally on the same buggy path).

Fixed both call sites:
- vllm/v1/attention/backends/cute_paged/_backend.py:1175 (β-coop)
- vllm/v1/attention/backends/cute_paged/_backend.py:1268 (β-lite)

Both now read self.residual_buf — the post-input-LN residual mirrored from
qwen3_5.py:460 — matching the math the kernels expect.

An L2 buffer-contracts test was added at
tests/v1/cute_paged/test_uber_kernel_buffer_contracts.py. It uses pure
source-text inspection via inspect.getsource on
CutePagedAttentionImpl.forward, catching the alias structurally without
requiring a GPU run.

Validation:
- Pre-fix pytest: 2 FAILED (the test caught the bug)
- Post-fix pytest: 2 PASSED
- A live serve probe with CUTE_PHASE_E_FUSION=1 produced coherent reasoning
  output (not the pre-fix " 2 ..." gibberish).

The gsm8k_eval_50 ≥90% gate is DEFERRED to C2. At this commit's state,
β-coop and paged_attention_forward both fire Phase A+B+C, costing ~+15 ms
per fused-full-attn layer × 16 layers ≈ 0.7 tok/s observed (predicted by
memory:project_phase_e_phantom_speedup). The 180 s per-question timeout in
scripts/gsm8k_eval_50.py can't accommodate that slowdown. C2 retires
paged_attention_forward from the decode path and recovers throughput; the
gsm8k gate runs there.

Refs:
- docs/superpowers/specs/2026-04-25-uber-kernel-migration-design.md
- docs/research/uber_kernel_migration/spec_audit_2026-04-25.md (Finding 6)
- memory:project_phase_e_beta_math_bug
- memory:project_phase_e_phantom_speedup

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
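A minimal sketch of the aliasing arithmetic described above (illustrative
tensors only: h, r, and wo_out stand for the hidden state, residual, and
attention output projection named in the message, not code from the tree):

import torch

h, r, wo_out = torch.randn(4), torch.randn(4), torch.randn(4)

# paged_attention_forward leaves residual_post_attn in residual_output:
residual_output = (h + r) + wo_out
# qwen3_5.py:460 mirrors the post-input-LN residual into residual_buf:
residual_buf = h + r

# β-coop's Phase 1C re-adds wo_out to whatever residual_in it is handed.
buggy = residual_output + wo_out   # 2*wo_out + h + r  (pre-fix gibberish)
fixed = residual_buf + wo_out      # wo_out + h + r    (post-fix, correct)

assert torch.allclose(fixed, h + r + wo_out)
assert not torch.allclose(buggy, fixed)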
1 parent 2b21f34 commit a65bcef

2 files changed

Lines changed: 50 additions & 2 deletions

tests/v1/cute_paged/test_uber_kernel_buffer_contracts.py

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
"""L2 structural test: verifies β-coop / β-lite read inputs from the right buffer.

Pre-fix: β-coop reads `self.residual_output` (post-Phase-C output of the legacy
paged_attention_forward), causing residual_post_attn = 2*attn_out + h + r.
Post-fix: β-coop reads `self.residual_buf` (post-input-LN residual mirrored
from qwen3_5.py:460), giving residual_post_attn = attn_out + h + r.

Strategy: pure source-text inspection via `inspect.getsource` on
`CutePagedAttentionImpl.forward`. We assert the post-fix wiring is present
(`self.residual_buf`) and the buggy alias (`self.residual_output` as residual
input to β kernels) is absent. No CUDA, no kernel launch — runs anywhere.
"""
import inspect


def test_beta_coop_residual_in_sources_from_residual_buf():
    """β-coop's residual_in must source from self.residual_buf, not residual_output."""
    from vllm.v1.attention.backends.cute_paged._backend import (
        CutePagedAttentionImpl,
    )

    src = inspect.getsource(CutePagedAttentionImpl.forward)
    assert "residual_in=self.residual_buf" in src, (
        "Expected β-coop launch to read from self.residual_buf; found a different source. "
        "Check _backend.py:1175 — buffer-aliasing bug may have regressed."
    )
    # Strengthened guard (audit Finding 6 / option b): C1 fixes both occurrences
    # of the alias bug, so `residual_in=self.residual_output` must not appear
    # ANYWHERE in CutePagedAttentionImpl.forward source. The original anchor
    # ("# β-coop") doesn't exist in source, so the guarded form silently passed
    # either way. This bare check fails loudly if the bug regresses.
    assert "residual_in=self.residual_output" not in src, (
        "β-coop call site still reads self.residual_output — the alias bug is back. "
        "See _backend.py:1175 (commit 76b88ba21) and audit Finding 6."
    )


def test_beta_lite_residual_post_ln_sources_from_residual_buf():
    """β-lite has the same alias bug pre-migration (audit Finding 6). Verify fix."""
    from vllm.v1.attention.backends.cute_paged._backend import (
        CutePagedAttentionImpl,
    )

    src = inspect.getsource(CutePagedAttentionImpl.forward)
    assert "residual_post_ln=self.residual_buf" in src, (
        "β-lite still aliases legacy buffer. See audit Finding 6 / _backend.py:1268."
    )
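The tests need no GPU or kernel build; a typical invocation is plain pytest
(path as added by this commit):

pytest tests/v1/cute_paged/test_uber_kernel_buffer_contracts.py -q

Source-text inspection is deliberately coarse: it pins the call-site wiring
rather than the numerics, which keeps the guard runnable in CPU-only CI.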

vllm/v1/attention/backends/cute_paged/_backend.py

Lines changed: 2 additions & 2 deletions
@@ -1172,7 +1172,7 @@ def forward(
             # Phase 0 inputs (dummy — output side-channel for future
             # QKV-fusion; not consumed by this layer's attn path).
             hidden_in=self.rmsnorm_output[:nat],
-            residual_in=self.residual_output[:nat],
+            residual_in=self.residual_buf[:nat],
             input_gamma=self._phase_e_coop_input_gamma,
             post_attn_gamma=self.rmsnorm_gamma,
             attn_input_bf16=self._phase_e_coop_attn_input_scratch[:nat],
@@ -1265,7 +1265,7 @@ def forward(
             gate_up_global_scale=self._mlp_gate_up_gs,
             down_global_scale=self._mlp_down_gs,
             # ε epilogue inputs (Task 8 kwargs):
-            residual_post_ln=self.residual_output[:nat],
+            residual_post_ln=self.residual_buf[:nat],
             next_input_layernorm_gamma=_next_gamma,
             next_hidden_output=self.next_hidden_scratch[:nat],
             emit_epilogue=True,
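Beyond the structural test, a runtime guard could also catch a regression of
this alias at launch time. A hypothetical sketch (not part of this commit;
`untyped_storage().data_ptr()` is standard torch, while the helper name and
its placement are invented for illustration):

import torch

def assert_no_residual_alias(residual_in: torch.Tensor,
                             residual_output: torch.Tensor) -> None:
    """Fail fast if the β residual input aliases the legacy Phase-C buffer.

    The call sites pass slices like self.residual_buf[:nat], and slices share
    storage with their base tensor, so compare storage pointers rather than
    tensor identity.
    """
    if (residual_in.untyped_storage().data_ptr()
            == residual_output.untyped_storage().data_ptr()):
        raise RuntimeError(
            "residual_in aliases residual_output (audit Finding 6; "
            "see _backend.py:1175 and :1268)"
        )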
