# Consume-gate DCE + graph-capture findings (2026-04-26)

Diagnostic baseline for the C2 follow-up architectural pass. References
the WIP commit `514b88c6f` (B-fix attempt, reverted in `3ffcf8740` on
`debug/beta-coop-residual-solo`) and its parent `5a0311ca3` (the
shippable C2 plumbing).

## TL;DR

The C2 migration's premise — β-coop replaces Python o_proj +
post_attention_layernorm — was structurally **unobservable to
torch.compile** under PIECEWISE compile. Two coupled DCE / specialisation
bugs in `vllm/nvllm/models/qwen3_5.py` caused the captured FX graph to
silently run the legacy Python pipeline and discard β-coop's outputs:

1. `cute_residual_mirror` was DCE-dropped despite
   `mutates_args=["residual_buf"]`. Dynamo's DCE removes ops whose
   mutations have no observable downstream reader **in the captured
   graph**; `impl.residual_buf` is read inside opaque op bodies via
   Python-attribute access, invisible to dynamo's reachability analysis.
   `mutates_args` alone is **not sufficient**. (A minimal sketch of this
   pattern follows the list.)

2. The `if getattr(impl, "_fusion_active", False)` consume gate at
   `qwen3_5.py:466-476` was specialised to the else-branch by dynamo at
   trace time (`_fusion_active = False` is the impl `__init__` default;
   the per-step mutation happens inside the `unified_attention` opaque
   op where dynamo can't see it). Captured graph: legacy Python o_proj +
   `post_attention_layernorm` ALWAYS ran; β-coop's `rmsnorm_output` /
   `residual_output` were never read.

   Dual-fire (paged + β-coop) happened to produce coherent output by
   accident: paged populated `output` with Phase A attn (via the
   framework op's `mutates_args`), Python o_proj computed `wo_out`, and
   Python `post_attention_layernorm` reconstructed `residual_post_attn`.
   β-coop's outputs were entirely wasted compute.

   Solo (paged-skip) broke because nothing populated `output` with
   Phase A in solo mode → Python o_proj operated on uninitialised
   memory → gibberish.
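
A minimal sketch of the bug-1 shape, using hypothetical `demo::` op names
(whether a given torch.compile build actually drops the node depends on
its functionalization and DCE passes; this only illustrates why the
mutation has no in-graph reader):

```python
import torch

# Hypothetical stand-in for the real impl object and its buffer.
class Impl:
    def __init__(self):
        self.residual_buf = torch.zeros(4)

impl = Impl()

# A mutating custom op, analogous to cute_residual_mirror: it writes
# into residual_buf as a side effect and returns nothing.
@torch.library.custom_op("demo::residual_mirror", mutates_args=("residual_buf",))
def residual_mirror(residual_buf: torch.Tensor, src: torch.Tensor) -> None:
    residual_buf.copy_(src)

def forward(x: torch.Tensor) -> torch.Tensor:
    # The mutation's only consumer reads impl.residual_buf by Python
    # attribute access inside an opaque op body, so the captured graph
    # contains no node that reads the mutated buffer; from the graph's
    # point of view this call is side-effect-only and unread.
    torch.ops.demo.residual_mirror(impl.residual_buf, x)
    return x + 1
```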

The B-fix in `514b88c6f` proves both bugs are real and fixable under
`cudagraph_mode=NONE`. It does NOT yet survive `cudagraph_mode=PIECEWISE`
(production), suggesting at least one additional graph-capture issue
needs root-causing as part of the architectural pass.

## How the bug was confirmed

### Step 1 — captured FX graph inspection

```bash
docker exec nvllm find /root/.cache/vllm/torch_compile_cache \
  -name 'computation_graph.py' | head -1 \
  | xargs -I{} grep -oE 'torch\.ops\.vllm\.[a-z_]+' {} | sort -u
```

Output before B-fix:

```
torch.ops.vllm.cute_phase_e_dispatch
torch.ops.vllm.gdn_attention_core
torch.ops.vllm.unified_attention_with_output
torch.ops.vllm.unified_kv_cache_update
```

`cute_residual_mirror` is **absent**, despite being called at
`qwen3_5.py:444` with `mutates_args=["residual_buf"]`. The same holds
for the `gate_buf` mirror at `qwen3_5.py:264`.

### Step 2 — captured FX layer 3 segment shows Python pipeline

`/root/.cache/vllm/torch_compile_cache/<hash>/rank_0_0/backbone/computation_graph.py`,
layer 3 attention segment (submod_8) ran the legacy o_proj path:

```python
# qwen3_5.py:285 — applied even with _fusion_active=True at runtime
sigmoid: "bf16[s18, 6144]" = torch.sigmoid(gate_1)
mul: "bf16[s18, 6144]" = view * sigmoid
# scaled_fp4_quant.out + cutlass_scaled_fp4_mm = the o_proj
scaled_fp4_quant_out = torch.ops._C.scaled_fp4_quant.out(reshape, ...)
cutlass_scaled_fp4_mm = torch.ops._C.cutlass_scaled_fp4_mm(empty_1, empty, ...)
self_attention_output_3[slice(None, None, None)] = view_2

# qwen3_5.py:491-493 — fused-residual RMSNorm
add: "bf16[s18, 5120]" = self_attention_output_3 + x_32
# ... rsqrt, mul by gamma, .to(bf16)
to: "bf16[s18, 5120]" = mul_9.to(torch.bfloat16)

# Then cute_phase_e_dispatch consumes the post-LN output
cute_phase_e_dispatch = torch.ops.vllm.cute_phase_e_dispatch(
    to, empty_like, empty_like_1, add, 'language_model.model.layers.3.mlp')
```

Both the consume branch (`qwen3_5.py:466-476`) and the post_attn_LN gate
(`qwen3_5.py:490-496`) were specialised away in favour of the Python
pipeline.
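
A hedged sketch of that specialisation, with hypothetical names (the real
gate lives in the model forward): dynamo evaluates the Python conditional
at trace time with the attribute's then-current value, and a flip
performed inside an opaque op body mid-forward fires no guard, so the
stale branch keeps running.

```python
import torch

class AttnImpl:
    def __init__(self):
        self._fusion_active = False   # the __init__ default at trace time

impl = AttnImpl()

@torch.compile
def attn_block(x: torch.Tensor) -> torch.Tensor:
    # Dynamo specialises this Python branch at trace time: with
    # _fusion_active == False, only the else-path is captured.
    if getattr(impl, "_fusion_active", False):
        return x * 2.0    # stand-in for the β-coop consume path
    return x + 1.0        # stand-in for the legacy Python pipeline

attn_block(torch.ones(4))    # captures the else-branch

# Flipping the flag here, between calls, is visible to dynamo's guards
# and would trigger a recompile.  In the real bug the flip happens inside
# the unified_attention opaque op during the same forward, where no guard
# fires, so the captured else-branch keeps executing.
impl._fusion_active = True
```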

### Step 3 — solo result-matrix verification

| Mode                                       | Behaviour                                                  |
|--------------------------------------------|------------------------------------------------------------|
| EAGER + solo β-coop (no compile)           | COHERENT (no DCE, gates work)                              |
| PIECEWISE + dual-fire (paged + β-coop)     | COHERENT (Python pipeline reconstructs from paged Phase A) |
| PIECEWISE + solo β-coop (paged gated off)  | GIBBERISH (nothing populates `output`)                     |

## What B-fix attempted (`514b88c6f`)

Three opaque ops to make the consume + post_attn_LN dispatch survive
torch.compile dead-code elimination:

- **`cute_residual_mirror`** (existing) — preserved across DCE by
  passing `residual_buf` and `gate_buf` as **phantom inputs** to
  `cute_attn_consume`, giving the mutations observable downstream
  readers.

- **`cute_attn_consume`** (new) — replaces the dead-eliminated consume
  branch. Always runs in the captured graph; dispatches at runtime via
  `_CUTE_ATTN_REGISTRY[layer_name]` lookup of `impl._fusion_bound`.

- **`cute_post_attn_ln_dispatch`** (new) — replaces the dead-eliminated
  post_attn_LN gate. Skips when fusion-bound (β-coop did Phase C);
  applies fused-residual RMSNorm in-place when not.
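
A hedged sketch of the phantom-input pattern, with hypothetical `demo::`
names and a stubbed registry (the real signatures live in the B-fix
commit): the buffers mutated by the mirror op are threaded into the
consume op as explicit tensor arguments, so the captured graph records a
reader and DCE keeps the mirror alive.

```python
import torch

# Stub for the per-layer registry populated at attach_fusion.
_CUTE_ATTN_REGISTRY: dict = {}

@torch.library.custom_op("demo::attn_consume", mutates_args=("attn_output",))
def attn_consume(
    attn_output: torch.Tensor,
    residual_buf: torch.Tensor,   # phantom input: passed only so the FX
    gate_buf: torch.Tensor,       # graph contains a reader of the mirrors
    layer_name: str,
) -> None:
    impl = _CUTE_ATTN_REGISTRY[layer_name]
    if impl._fusion_bound:                       # runtime dispatch
        attn_output.copy_(impl.rmsnorm_output)   # β-coop did Phase C
    # else: leave attn_output as the legacy pipeline wrote it
```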

Captured FX after B-fix contained all four custom `cute_*` ops:

```
torch.ops.vllm.cute_attn_consume
torch.ops.vllm.cute_phase_e_dispatch
torch.ops.vllm.cute_post_attn_ln_dispatch
torch.ops.vllm.cute_residual_mirror
torch.ops.vllm.gdn_attention_core
torch.ops.vllm.unified_attention_with_output
torch.ops.vllm.unified_kv_cache_update
```

`cute_residual_mirror` survived DCE thanks to the phantom dependency.

## What broke under `cudagraph_mode=PIECEWISE`

PIECEWISE+NONE (B-fix v3): the probe `"The capital of France is"` produced
`' Paris. Paris is a city in France, and it is also the capital of
France...'` — coherent.

PIECEWISE+graphs (B-fix v3): the same probe produced `' Paris这种现象这种现象
这种现象...'` — first token correct (prefill), then a single-token
degenerate loop.

### Failed pivots in this session

- **v1**: tensor signal `_fusion_active_signal` + `int(signal.item())`
  inside the op body. Crashed at warmup with
  `cudaErrorStreamCaptureInvalidated`. **`.item()` causes a host-device
  sync that is incompatible with CUDA graph capture**.

- **v2**: registry-lookup of `impl._phase_e_use_beta_coop` (Python attr
  reset per-step at the top of the impl forward). Survived capture,
  gibberish at decode.

- **v3**: registry-lookup of `impl._fusion_bound` (set once at
  `attach_fusion`, stable across warmup + runtime). Same gibberish.
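
A minimal repro of the v1 failure mode (assumes a CUDA device; the exact
error string varies by PyTorch/driver version): `.item()` copies device
memory to the host and synchronises the stream, which is illegal while
that stream is capturing.

```python
import torch

signal = torch.ones(1, device="cuda")
buf = torch.zeros(8, device="cuda")

g = torch.cuda.CUDAGraph()
try:
    with torch.cuda.graph(g):
        # Host-side read of device memory mid-capture forces a stream
        # sync, which invalidates the capture.
        if int(signal.item()):
            buf.add_(1.0)
except RuntimeError as err:
    print("capture invalidated:", err)
```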

The graph-capture failure under `cudagraph_mode=PIECEWISE` was not
root-caused before the session ended.

## Suspected causes (for the architectural pass to investigate)

1. **vLLM V1 captures decode segments at warmup with shapes/state that
   diverge from runtime.** Python-attr reads inside opaque op bodies
   don't reliably reflect runtime state — what's True at warmup capture
   isn't necessarily what runs at replay. Even gating on `_fusion_bound`
   (intended to be capture-stable) didn't help, suggesting the issue
   is deeper than just the gate value. (A capture-freeze sketch follows
   this list.)

2. **β-coop's cooperative-launch + atomic-counter spin-wait may have
   CUDA-graph replay quirks** independent of the consume gate. Captured
   cooperative kernels with stream-sync-aware barriers might not replay
   correctly across decode steps.

3. **PIECEWISE segment boundaries** — torch.compile may split the
   forward at op boundaries differently than expected. Each captured
   subgraph could have its own warmup-vs-runtime divergence.

4. **The Python o_proj path is still present in the captured graph
   alongside `cute_attn_consume`.** Even when consume copies β-coop's
   `rmsnorm_output` into `self_attention_output`, the Python o_proj
   already wrote a different value there earlier in the same forward.
   In solo, that earlier value is junk (no Phase A), but the order
   should be o_proj first, consume second, so consume should overwrite.
   Verify ordering at the kernel level under graph replay.
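
The capture-freeze behaviour behind suspected cause 1, as a stand-alone
hedged sketch (assumes a CUDA device; the dict stands in for
`impl._fusion_bound`): Python executes once at capture time and replay
re-issues only the recorded kernels, so any Python-level gate is frozen
at its warmup value.

```python
import torch

state = {"fusion_active": False}   # stand-in for impl._fusion_bound
x = torch.zeros(4, device="cuda")

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    # This Python branch runs once, now, during capture.
    if state["fusion_active"]:
        x.add_(100.0)              # never recorded
    else:
        x.add_(1.0)                # recorded into the graph

state["fusion_active"] = True      # flipped after capture...
g.replay()
print(x)                           # ...still tensor([1., 1., 1., 1.])
```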

## Architectural answers (the C2 redesign should pick one)

- **Have β-coop write directly to the framework `output` parameter.**
  Removes the need for the Python pipeline entirely; consume becomes a
  no-op or is folded into the kernel. Requires β-coop to expose a
  bf16 attn-output buffer slot that the model framework can consume.

- **Use in-graph control flow** (`torch.cond` or `torch.where` on
  tensor signals) for the consume / post_attn_LN dispatch. Avoids
  `.item()` and Python-attr fragility entirely. Requires structuring
  the dispatch as data-dependent tensor ops rather than Python branches
  (a minimal `torch.where` sketch follows this list).

- **Capture multiple graphs per shape and dispatch externally.** vLLM
  V1 already captures separate graphs for prefill vs decode shapes;
  extend it to capture separate fusion-active vs fusion-inactive
  variants. Heaviest engineering but cleanest semantics — each captured
  graph has stable behaviour at replay.
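
A hedged sketch of the `torch.where` variant (function and parameter
names are hypothetical): the gate is a device-resident 0/1 tensor, so the
selection is a data-dependent tensor op that survives both dynamo
specialisation and CUDA-graph replay. The trade-off is that both operands
are always computed; `torch.cond` would evaluate only one branch but
carries its own tracing constraints.

```python
import torch

def post_attn_dispatch(
    fusion_active: torch.Tensor,   # 0./1. scalar tensor, lives on-device
    beta_coop_out: torch.Tensor,   # Phase C result written by the kernel
    legacy_out: torch.Tensor,      # Python o_proj + post_attn_LN result
) -> torch.Tensor:
    # No host sync and no Python branch: the same captured graph serves
    # both modes, selected per step by mutating fusion_active in place.
    return torch.where(fusion_active.bool(), beta_coop_out, legacy_out)
```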

## What remains shippable on `feat/uber-kernel-migration`

Commit `5a0311ca3` (parent of the WIP) is correctness-positive:

- `cute_residual_mirror` opaque op (still DCE-dropped, but the call
  site is in place for the architectural fix to make it observable).
- β-coop predicate hard-gate (no silent fallback).
- C2 attn-output-gate wiring through `phase_e_kernel.py`.
- Env-gated tensor-dump harness for kernel-level diagnostics.

These don't change behaviour vs the prior dual-fire path (β-coop's
outputs are still discarded by dual-fire's reliance on Python o_proj +
post_attn_LN), but they're prerequisites for the architectural fix.

## How to reproduce the diagnostic

```bash
# 1. Launch with fusion enabled, PIECEWISE+graphs (default)
CUTE_PHASE_E_FUSION=1 ./scripts/serve-cute.sh

# 2. Wait for API_READY, send a probe
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","prompt":"The capital of France is","max_tokens":40,"temperature":0,"seed":42}'

# 3. Inspect the captured FX graph for vllm ops
docker exec nvllm find /root/.cache/vllm/torch_compile_cache \
  -name 'computation_graph.py' | head -1 \
  | xargs -I{} grep -oE 'torch\.ops\.vllm\.[a-z_]+' {} | sort -u

# 4. To reproduce the B-fix's PIECEWISE+NONE coherence:
#    Edit scripts/serve-cute.sh: cudagraph_mode "PIECEWISE" → "NONE"
#    Apply `git show 514b88c6f` to the working tree
#    Restart container, send probe — should be coherent
```

## References

- Commit `514b88c6f` — B-fix WIP (reverted in `3ffcf8740`).
- Commit `5a0311ca3` — shipping C2 plumbing (parent of this work).
- `memory:project_beta_coop_residual_solo_bug` — solo β-coop pickup notes.
- `memory:project_uber_kernel_migration` — C1, C1.5 status.
- `memory:feedback_pace_pressure` — don't let pace drive design; the
  architectural fix belongs on `feat/uber-kernel-migration`, not patched
  on a debug branch.