
Commit 90b06d5

docs(uber-kernel): consume-gate DCE + graph-capture findings (2026-04-26)
Diagnostic baseline for the C2 follow-up architectural pass. Documents:

1. The two coupled DCE / specialisation bugs in qwen3_5.py that made
   β-coop's outputs structurally unobservable to torch.compile under
   PIECEWISE compile (cute_residual_mirror DCE'd despite mutates_args;
   `if _fusion_active` consume gate specialised to else-branch at trace
   time).
2. The captured FX graph evidence at
   /root/.cache/vllm/torch_compile_cache/<hash>/.../computation_graph.py
   showing the legacy Python o_proj + post_attn_LN was always running.
3. Why dual-fire happened to produce coherent output anyway (paged
   populated `output` with Phase A; the Python pipeline reconstructed
   correctness) — and why solo broke (no Phase A populator).
4. The B-fix attempt in commit 514b88c (reverted in 3ffcf87):
   cute_attn_consume + cute_post_attn_ln_dispatch opaque ops,
   registry-lookup pattern (no .item() syncs). PROVEN correct under
   cudagraph_mode=NONE; STILL gibberish under cudagraph_mode=PIECEWISE
   for reasons not root-caused this session (likely warmup-vs-runtime
   state divergence + something deeper in the cooperative-launch β-coop
   kernel under CUDA graph replay).
5. Three architectural answers for the C2 redesign to pick from:
   - β-coop writes directly to the framework `output` (eliminate the
     Python pipeline + consume entirely)
   - In-graph torch.cond / torch.where on tensor signals (avoid .item()
     + Python-attr fragility)
   - Capture multiple graphs per (shape, fusion-active) variant and
     dispatch externally

Reverted on the debug branch because shipping a partial fix that fails
under the production graph mode would be a regression. The architectural
work belongs on feat/uber-kernel-migration, not a debug-branch bandaid
(memory:feedback_pace_pressure).

Refs:
- commit 514b88c (B-fix WIP, reverted)
- commit 5a0311c (C2 plumbing, shipped)
- memory:project_beta_coop_residual_solo_bug
- memory:project_uber_kernel_migration

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 3ffcf87 commit 90b06d5

1 file changed

Lines changed: 251 additions & 0 deletions

# Consume-gate DCE + graph-capture findings (2026-04-26)

Diagnostic baseline for the C2 follow-up architectural pass. References
the WIP commit `514b88c6f` (B-fix attempt, reverted in `3ffcf8740` on
`debug/beta-coop-residual-solo`) and its parent `5a0311ca3` (the
shippable C2 plumbing).
## TL;DR

The C2 migration's premise — β-coop replaces Python o_proj +
post_attention_layernorm — was structurally **unobservable to
torch.compile** under PIECEWISE compile. Two coupled DCE / specialisation
bugs in `vllm/nvllm/models/qwen3_5.py` caused the captured FX graph to
silently run the legacy Python pipeline and discard β-coop's outputs
(both failure shapes are sketched after this list):

1. `cute_residual_mirror` was DCE-dropped despite
   `mutates_args=["residual_buf"]`. Dynamo's DCE removes ops whose
   mutations have no observable downstream reader **in the captured
   graph**; `impl.residual_buf` is read inside opaque op bodies via
   Python-attribute access, invisible to dynamo's reachability analysis.
   `mutates_args` alone is **not sufficient**.

2. The `if getattr(impl, "_fusion_active", False)` consume gate at
   `qwen3_5.py:466-476` was specialised to the else-branch by dynamo at
   trace time (`_fusion_active = False` is the impl `__init__` default;
   the per-step mutation happens inside the `unified_attention` opaque
   op, where dynamo can't see it). Captured graph: the legacy Python
   o_proj + `post_attention_layernorm` ALWAYS ran; β-coop's
   `rmsnorm_output` / `residual_output` were never read.
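A minimal, hypothetical sketch of the failing shape (toy op and attribute
names, not the real `qwen3_5.py` / `torch.ops.vllm` code). It is not a
reproduction of the DCE itself, only of the two patterns that triggered
it: a mutated buffer with no in-graph reader, and a Python-bool gate that
dynamo resolves while tracing.

```python
import torch
from torch import Tensor

# Stand-in for impl.residual_buf: a side buffer that only other opaque op
# bodies ever read (via Python attribute access, outside the traced graph).
mirror_buf = torch.zeros(4, 8)

@torch.library.custom_op("toy::residual_mirror", mutates_args=("buf",))
def residual_mirror(x: Tensor, buf: Tensor) -> None:
    buf.copy_(x)  # mutation is declared via mutates_args

@residual_mirror.register_fake
def _(x: Tensor, buf: Tensor) -> None:
    return None

class Block(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self._fusion_active = False  # the impl __init__ default

    def forward(self, x: Tensor) -> Tensor:
        # Bug-1 shape: the mutated buffer has no reader in the traced graph,
        # so mutates_args alone gives DCE no reason to keep the call.
        residual_mirror(x, mirror_buf)
        # Bug-2 shape: dynamo evaluates this Python bool while tracing and
        # bakes a single branch into the captured graph; a flip performed
        # inside an opaque op body at runtime is invisible to it.
        if self._fusion_active:
            return x       # stand-in for "consume β-coop outputs"
        return x * 2       # stand-in for the legacy o_proj + post_attn_LN

out = torch.compile(Block())(torch.randn(4, 8))
```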
Dual-fire (paged + β-coop) happened to produce coherent output by
accident: paged populated `output` with Phase A attn (via the
framework op's `mutates_args`), Python o_proj computed `wo_out`, and
Python `post_attention_layernorm` reconstructed `residual_post_attn`.
β-coop's outputs were entirely wasted compute.

Solo (paged-skip) broke because nothing populated `output` with
Phase A in solo mode → Python o_proj operated on uninitialised
memory → gibberish.

The B-fix in `514b88c6f` proves both bugs are real and fixable under
`cudagraph_mode=NONE`. It does NOT yet survive `cudagraph_mode=PIECEWISE`
(production), suggesting at least one additional graph-capture issue
needs root-causing as part of the architectural pass.
## How the bug was confirmed

### Step 1 — captured FX graph inspection

```bash
docker exec nvllm find /root/.cache/vllm/torch_compile_cache \
    -name 'computation_graph.py' | head -1 \
    | xargs -I{} grep -oE 'torch\.ops\.vllm\.[a-z_]+' {} | sort -u
```

Output before the B-fix:

```
torch.ops.vllm.cute_phase_e_dispatch
torch.ops.vllm.gdn_attention_core
torch.ops.vllm.unified_attention_with_output
torch.ops.vllm.unified_kv_cache_update
```
`cute_residual_mirror` is **absent**, despite being called at
`qwen3_5.py:444` with `mutates_args=["residual_buf"]`. The same holds for
the `gate_buf` mirror at `qwen3_5.py:264`.
### Step 2 — captured FX layer 3 segment shows the Python pipeline

`/root/.cache/vllm/torch_compile_cache/<hash>/rank_0_0/backbone/computation_graph.py`,
layer 3 attention segment (submod_8), ran the legacy o_proj path:

```python
# qwen3_5.py:285 — applied even with _fusion_active=True at runtime
sigmoid: "bf16[s18, 6144]" = torch.sigmoid(gate_1)
mul: "bf16[s18, 6144]" = view * sigmoid
# scaled_fp4_quant.out + cutlass_scaled_fp4_mm = the o_proj
scaled_fp4_quant_out = torch.ops._C.scaled_fp4_quant.out(reshape, ...)
cutlass_scaled_fp4_mm = torch.ops._C.cutlass_scaled_fp4_mm(empty_1, empty, ...)
self_attention_output_3[slice(None, None, None)] = view_2

# qwen3_5.py:491-493 — fused-residual RMSNorm
add: "bf16[s18, 5120]" = self_attention_output_3 + x_32
# ... rsqrt, mul by gamma, .to(bf16)
to: "bf16[s18, 5120]" = mul_9.to(torch.bfloat16)

# Then cute_phase_e_dispatch consumes the post-LN output
cute_phase_e_dispatch = torch.ops.vllm.cute_phase_e_dispatch(
    to, empty_like, empty_like_1, add, 'language_model.model.layers.3.mlp')
```

Both the consume branch (`qwen3_5.py:466-476`) and the post_attn_LN gate
(`qwen3_5.py:490-496`) were dead-eliminated in favour of the Python pipeline.
### Step 3 — solo result-matrix verification

| Mode                                       | Behaviour                                                   |
|--------------------------------------------|-------------------------------------------------------------|
| EAGER + solo β-coop (no compile)           | COHERENT (no DCE, gates work)                               |
| PIECEWISE + dual-fire (paged + β-coop)     | COHERENT (Python pipeline reconstructs from paged Phase A)  |
| PIECEWISE + solo β-coop (paged gated off)  | GIBBERISH (nothing populates `output`)                      |
## What the B-fix attempted (`514b88c6f`)

Three opaque ops to make the consume + post_attn_LN dispatch survive
torch.compile's dead-code elimination (a sketch of the registry-lookup
pattern follows the list):

- **`cute_residual_mirror`** (existing) — preserved across DCE by
  passing `residual_buf` and `gate_buf` as **phantom inputs** to
  `cute_attn_consume`, giving the mutations observable downstream
  readers.

- **`cute_attn_consume`** (new) — replaces the dead-eliminated consume
  branch. Always runs in the captured graph; dispatches at runtime via
  a `_CUTE_ATTN_REGISTRY[layer_name]` lookup of `impl._fusion_bound`.

- **`cute_post_attn_ln_dispatch`** (new) — replaces the dead-eliminated
  post_attn_LN gate. Skips when fusion-bound (β-coop did Phase C);
  applies the fused-residual RMSNorm in-place when not.
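A minimal sketch of the registry-lookup dispatch the B-fix used. Names and
the signature are simplified and hypothetical; the real `cute_attn_consume`
also threads the phantom `residual_buf` / `gate_buf` inputs described above.

```python
from typing import Dict
import torch
from torch import Tensor

# Module-level registry: layer_name -> impl object. Populated once at model
# init (attach_fusion in the real code), so the op body never needs .item()
# or any per-step Python state that could diverge under graph capture.
_CUTE_ATTN_REGISTRY: Dict[str, object] = {}

@torch.library.custom_op("toy::cute_attn_consume", mutates_args=("attn_output",))
def cute_attn_consume(
    attn_output: Tensor,     # framework buffer the Python pipeline wrote
    rmsnorm_output: Tensor,  # β-coop's Phase C result
    layer_name: str,
) -> None:
    impl = _CUTE_ATTN_REGISTRY[layer_name]
    # _fusion_bound is set once (B-fix v3), so it reads the same at warmup
    # capture and at replay; no data-dependent host-device sync is needed.
    if getattr(impl, "_fusion_bound", False):
        attn_output.copy_(rmsnorm_output)

@cute_attn_consume.register_fake
def _(attn_output: Tensor, rmsnorm_output: Tensor, layer_name: str) -> None:
    return None
```

The key property is that the gate value is read from the registry on every
execution of the op body rather than baked into the graph, and nothing in
the body forces a host-device sync.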
Captured FX after the B-fix contained all four cute ops (alongside the
pre-existing framework ops):

```
torch.ops.vllm.cute_attn_consume
torch.ops.vllm.cute_phase_e_dispatch
torch.ops.vllm.cute_post_attn_ln_dispatch
torch.ops.vllm.cute_residual_mirror
torch.ops.vllm.gdn_attention_core
torch.ops.vllm.unified_attention_with_output
torch.ops.vllm.unified_kv_cache_update
```

`cute_residual_mirror` survived DCE thanks to the phantom dependency.
## What broke under `cudagraph_mode=PIECEWISE`

PIECEWISE compile + `cudagraph_mode=NONE` (B-fix v3): the probe
`"The capital of France is"` produced `' Paris. Paris is a city in
France, and it is also the capital of France...'` — coherent.

PIECEWISE compile + CUDA graphs (B-fix v3): the same probe produced
`' Paris这种现象这种现象这种现象...'` — first token correct (prefill),
then a single-token degenerate loop.

### Failed pivots in this session

- **v1**: tensor signal `_fusion_active_signal` + `int(signal.item())`
  inside the op body. Crashed at warmup with
  `cudaErrorStreamCaptureInvalidated`. **`.item()` causes a host-device
  sync that is incompatible with CUDA graph capture** (see the capture
  sketch at the end of this section).

- **v2**: registry lookup of `impl._phase_e_use_beta_coop` (a Python attr
  reset per step at the top of the impl forward). Survived capture,
  gibberish at decode.

- **v3**: registry lookup of `impl._fusion_bound` (set once at
  `attach_fusion`, stable across warmup + runtime). Same gibberish.

The graph-capture failure under `cudagraph_mode=PIECEWISE` was not
root-caused before the session ended.
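For reference, the v1 failure mode can be reproduced with a bare
`torch.cuda.CUDAGraph` capture (a generic sketch, not the vLLM capture
path): any host-device sync such as `.item()` inside the captured region
invalidates the capture.

```python
import torch

assert torch.cuda.is_available()
x = torch.ones(1, device="cuda")
g = torch.cuda.CUDAGraph()

with torch.cuda.graph(g):
    y = x * 2
    # flag = int(y.item())  # .item() forces a device->host sync; inside
    #                       # stream capture this invalidates the capture
    #                       # (surfacing as cudaErrorStreamCaptureInvalidated).

g.replay()                  # legal: the captured work re-runs, no syncs
torch.cuda.synchronize()
print(y)                    # reads happen outside capture / replay
```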
## Suspected causes (for the architectural pass to investigate)

1. **vLLM V1 captures decode segments at warmup with shapes/state that
   diverge from runtime.** Python-attr reads inside opaque op bodies
   don't reliably reflect runtime state — what's True at warmup capture
   isn't necessarily what runs at replay. Even gating on `_fusion_bound`
   (intended to be capture-stable) didn't help, suggesting the issue
   is deeper than just the gate value.

2. **β-coop's cooperative launch + atomic-counter spin-wait may have
   CUDA-graph replay quirks** independent of the consume gate. Captured
   cooperative kernels with stream-sync-aware barriers might not replay
   correctly across decode steps.

3. **PIECEWISE segment boundaries** — torch.compile may split the
   forward at op boundaries differently than expected. Each captured
   subgraph could have its own warmup-vs-runtime divergence.

4. **The Python o_proj path is still present in the captured graph
   alongside `cute_attn_consume`.** Even when consume copies β-coop's
   `rmsnorm_output` into `self_attention_output`, the Python o_proj
   already wrote a different value there earlier in the same forward.
   In solo, that earlier value is junk (no Phase A); but the order
   should be o_proj first, consume second, so consume should overwrite.
   Verify ordering at the kernel level under graph replay (a probe
   sketch follows this list).
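One graph-replay-safe way to check the ordering in point 4 (a hedged
sketch with hypothetical names, not the existing dump harness): have the
consume op copy its pre- and post-overwrite views of `attn_output` into
preallocated device buffers, and read them on the host only after the
decode step completes. Device-to-device copies are capturable, so this
works under replay where prints and `.item()` do not.

```python
import torch
from torch import Tensor

# Allocated once, outside capture, so the same storage is reused at replay.
# Shapes here are illustrative placeholders.
_dbg_before = torch.zeros(256, 5120, device="cuda", dtype=torch.bfloat16)
_dbg_after = torch.zeros_like(_dbg_before)

def consume_with_probe(attn_output: Tensor, rmsnorm_output: Tensor) -> None:
    n = attn_output.shape[0]
    _dbg_before[:n].copy_(attn_output)   # value written by Python o_proj
    attn_output.copy_(rmsnorm_output)    # the consume overwrite
    _dbg_after[:n].copy_(attn_output)    # value downstream ops will see

# After the decode step, outside replay:
#   torch.cuda.synchronize()
#   print((_dbg_after - _dbg_before).abs().max())  # nonzero => overwrite happened
```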
## Architectural answers (the C2 redesign should pick one)

- **Have β-coop write directly to the framework `output` parameter.**
  Removes the need for the Python pipeline entirely; consume becomes a
  no-op or is folded into the kernel. Requires β-coop to expose a
  bf16 attn-output buffer slot that the model framework can consume.

- **Use in-graph control flow** (`torch.cond` or `torch.where` on
  tensor signals) for the consume / post_attn_LN dispatch. Avoids
  `.item()` and Python-attr fragility entirely. Requires structuring
  the dispatch as data-dependent tensor ops rather than Python branches
  (a rough sketch follows this list).

- **Capture multiple graphs per shape and dispatch externally.** vLLM
  V1 already captures separate graphs for prefill vs decode shapes;
  extend this to capture separate fusion-active vs fusion-inactive
  variants. Heaviest engineering but cleanest semantics — each captured
  graph has stable behaviour at replay.
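A rough sketch of the in-graph control-flow option (hypothetical helpers,
not the shape the redesign has to take). `torch.where` keeps both
candidates in the graph and selects by an on-device signal; `torch.cond`
executes only the selected branch, but its exact constraints (functional
branches, operand handling) should be checked against the installed
PyTorch version.

```python
import torch
from torch import Tensor

def consume_dispatch(fusion_signal: Tensor,   # 0/1 scalar tensor, on device
                     beta_coop_out: Tensor,
                     legacy_out: Tensor) -> Tensor:
    # Data-dependent select: no .item(), no Python branch for dynamo to
    # specialise, and both producers stay observable (nothing to DCE).
    return torch.where(fusion_signal.bool(), beta_coop_out, legacy_out)

def consume_dispatch_cond(fusion_signal: Tensor,
                          beta_coop_out: Tensor,
                          legacy_out: Tensor) -> Tensor:
    # torch.cond runs only one branch, but branches must be side-effect
    # free (no in-place writes), so the consume copy has to be re-expressed
    # as a pure select-and-return.
    return torch.cond(fusion_signal.bool(),
                      lambda a, b: a,
                      lambda a, b: b,
                      (beta_coop_out, legacy_out))
```

The `torch.where` form computes both candidates, which is acceptable only
if the legacy path is cheap or must run anyway; otherwise the multi-graph
option below avoids the double compute.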
## What remains shippable on `feat/uber-kernel-migration`

Commit `5a0311ca3` (parent of the WIP) is correctness-positive:

- `cute_residual_mirror` opaque op (still DCE-dropped, but the call
  site is in place for the architectural fix to make observable).
- β-coop predicate hard-gate (no-silent-fallback).
- C2 attn-output-gate wiring through `phase_e_kernel.py`.
- Env-gated tensor dump harness for kernel-level diagnostics.

These don't change behaviour vs the prior dual-fire path (β-coop's
outputs are still discarded by dual-fire's reliance on Python o_proj +
post_attn_LN), but they're prerequisites for the architectural fix.
## How to reproduce the diagnostic

```bash
# 1. Launch with fusion enabled, PIECEWISE + CUDA graphs (default)
CUTE_PHASE_E_FUSION=1 ./scripts/serve-cute.sh

# 2. Wait for API_READY, send a probe
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","prompt":"The capital of France is","max_tokens":40,"temperature":0,"seed":42}'

# 3. Inspect the captured FX graph for vllm ops
docker exec nvllm find /root/.cache/vllm/torch_compile_cache \
  -name 'computation_graph.py' | head -1 \
  | xargs -I{} grep -oE 'torch\.ops\.vllm\.[a-z_]+' {} | sort -u

# 4. To reproduce the B-fix's PIECEWISE + cudagraph_mode=NONE coherence:
#    - Edit scripts/serve-cute.sh: cudagraph_mode "PIECEWISE" → "NONE"
#    - Apply `git show 514b88c6f` to the working tree
#    - Restart the container, send the probe — should be coherent
```
## References

- Commit `514b88c6f` — B-fix WIP (reverted in `3ffcf8740`).
- Commit `5a0311ca3` — shipping C2 plumbing (parent of this work).
- `memory:project_beta_coop_residual_solo_bug` — solo β-coop pickup notes.
- `memory:project_uber_kernel_migration` — C1, C1.5 status.
- `memory:feedback_pace_pressure` — don't let pace drive design;
  the architectural fix belongs on `feat/uber-kernel-migration`,
  not patched on a debug branch.
