Commit 514b88c
wip(cute): B-fix attempt — consume-gate DCE + post-attn-LN dispatch ops
WIP: partial fix for the C2 migration's consume-gate plumbing problem.
This commit will be reverted in the next commit; preserved here in git
history for the follow-up architectural pass on feat/uber-kernel-migration.
See docs/research/uber_kernel_migration/2026-04-26-consume-gate-dce-and-graph-capture.md
(landing in the next commit) for the full diagnostic baseline.
What was diagnosed in this session
==================================
The C2 migration's premise — β-coop replaces Python o_proj +
post_attention_layernorm — was structurally unobservable to torch.compile
under PIECEWISE compile. Inspecting the captured FX graph at
/root/.cache/vllm/torch_compile_cache/<hash>/rank_0_0/backbone/computation_graph.py
revealed:
1. `cute_residual_mirror` was DCE-dropped despite `mutates_args=["residual_buf"]`.
Dynamo's DCE removes ops whose mutations have no observable downstream
reader IN THE GRAPH; impl.residual_buf is read inside opaque op bodies
via Python-attribute access, invisible to dynamo's reachability analysis.
`mutates_args` alone is NOT sufficient — needs an explicit graph-input
downstream reader.
2. The `if getattr(impl, "_fusion_active", False)` consume gate at
qwen3_5.py:466-476 was specialised to "always-take else branch" by
dynamo at trace time (`_fusion_active = False` at __init__, mutated
inside the unified_attention opaque op where dynamo can't see).
Captured graph: legacy Python o_proj + post_attn_LN ALWAYS ran;
β-coop's rmsnorm_output / residual_output were never read.
3. Dual-fire happened to produce coherent output entirely by accident:
   the paged path populated `output` with Phase A attn (via the
   framework op's declared mutates_args), Python o_proj computed
   wo_out from it, and Python post_attn_LN reconstructed
   residual_post_attn. β-coop's outputs were wasted. Solo (paged-skip)
   broke because nothing populated `output` with Phase A in solo mode.
What this commit attempted
==========================
Three opaque ops to replace the dead-eliminated Python branches:
- `cute_residual_mirror` (existing) — preserved across DCE by
passing residual_buf as a phantom input to `cute_attn_consume`,
giving the mutation a downstream reader.
- `cute_attn_consume` (new) — replaces the dead-eliminated consume
branch. Always runs in the captured graph; dispatches at runtime
via registry lookup of impl._fusion_bound. When β-coop fired,
copies impl.rmsnorm_output → self_attention_output and
impl.residual_output → residual.
- `cute_post_attn_ln_dispatch` (new) — replaces the dead-eliminated
post_attn_LN gate. Skips when fusion-bound (β-coop did Phase C);
applies fused-residual RMSNorm in-place when not.
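The shape of the two new dispatch ops can be sketched without torch. A hedged pure-Python sketch (all names hypothetical) of the pattern: the op always appears in the captured graph, and the branch Dynamo would otherwise specialise away is taken at runtime *inside* the opaque op body, via a registry lookup keyed by layer name.

```python
# Module-level registry mapping layer name -> impl object, mirroring the
# registry-lookup dispatch described above (names are illustrative).
_IMPL_REGISTRY: dict[str, object] = {}

class AttnImpl:
    def __init__(self, name: str, fusion_bound: bool):
        self._fusion_bound = fusion_bound   # set once at attach time
        self.rmsnorm_output = None          # written by the fused kernel
        _IMPL_REGISTRY[name] = self

def attn_consume(layer_name: str, attn_output: list, legacy_fn) -> list:
    """Always present in the captured graph; decides at runtime which
    path produced the real output (the role of cute_attn_consume)."""
    impl = _IMPL_REGISTRY[layer_name]
    if impl._fusion_bound and impl.rmsnorm_output is not None:
        # Fused kernel already did o_proj + post-attn LN: consume it.
        return impl.rmsnorm_output
    # Otherwise fall back to the legacy Python path.
    return legacy_fn(attn_output)

fused = AttnImpl("layers.0", fusion_bound=True)
fused.rmsnorm_output = [1.0, 2.0]
legacy = AttnImpl("layers.1", fusion_bound=False)

assert attn_consume("layers.0", [0.0], lambda x: x) == [1.0, 2.0]
assert attn_consume("layers.1", [0.5], lambda x: [v * 2 for v in x]) == [1.0]
```

The design point is that the *graph* never branches: only the Python body of the opaque op does, which is exactly what makes it invisible to trace-time specialisation — and, as the result matrix below shows, also what makes it fragile under graph capture.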
Result matrix
=============
| Mode | Result |
|-----------------------------------------------|-----------------|
| PIECEWISE + cudagraph_mode=NONE + solo | COHERENT ✓ |
| PIECEWISE + cudagraph_mode=PIECEWISE + solo | GIBBERISH ✗ |
Under PIECEWISE+NONE, the B-fix is correct: solo β-coop produces
" Paris. Paris is a city in France..." for the standard probe.
Under PIECEWISE+graphs (production target), gibberish: the first token
" Paris" is correct (prefill works), then decode collapses into a
single-token loop ("这种现象", zh. "this phenomenon", repeated). The
captured graph contains
all 4 ops (cute_residual_mirror, cute_attn_consume,
cute_post_attn_ln_dispatch, cute_phase_e_dispatch) but the runtime
output is wrong.
Failed pivots in this session
=============================
- v1: tensor signal `_fusion_active_signal` + `int(signal.item())`
inside the op body. Crashed at warmup with
`cudaErrorStreamCaptureInvalidated` — `.item()` causes a host-device
sync that's incompatible with CUDA graph capture.
- v2: registry-lookup of `impl._phase_e_use_beta_coop` (Python attr,
per-step reset). Survived capture but produced gibberish.
- v3: registry-lookup of `impl._fusion_bound` (set once at
attach_fusion, stable across warmup + runtime). Same gibberish.
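Why v1 crashed is mechanical: `.item()` copies a device value to the host and synchronises the stream, which is illegal while a CUDA graph is being captured. A minimal CPU-side sketch of the distinction (the flag name is hypothetical):

```python
import torch

signal = torch.zeros((), dtype=torch.int32)  # device-resident flag in the real system

# v1-style gate: branch in Python on the flag's host value.
# `.item()` forces a device->host copy and a stream sync; under CUDA
# graph capture this is what raises cudaErrorStreamCaptureInvalidated.
host_gate = int(signal.item()) == 1          # host sync happens here

# Capture-safe alternative: keep the decision on-device as a bool tensor
# and let a later tensor-level select consume it (no host round-trip).
device_gate = signal.eq(1)                   # still a tensor, no sync

assert host_gate is False
assert device_gate.dtype == torch.bool
```

v2 and v3 avoided the sync by branching on Python attributes instead, which survives capture but — as observed — does not guarantee the captured graph replays the right path.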
The graph-capture failure under cudagraph_mode=PIECEWISE remains
unexplained at the end of this session. Suspected root causes for the
follow-up architectural pass:
- vLLM V1 captures decode segments at warmup with shapes/state that
diverge from runtime; Python-attr reads inside opaque op bodies
don't reliably reflect runtime state.
- β-coop's cooperative-launch + atomic-counter spin-wait may have
CUDA-graph replay quirks independent of the consume gate.
- Some interaction between PIECEWISE's segment boundaries and the
new opaque ops.
Why this is being reverted
==========================
The B-fix proves the consume-gate DCE is real and bounded — it works
under PIECEWISE+NONE. But shipping a partial fix that fails under the
production graph mode would be a regression. The architectural answer
(have β-coop write to the framework `output` directly so the Python
pipeline becomes unnecessary, OR use in-graph torch.cond/torch.where
on tensor
signals, OR capture multiple graphs and dispatch externally) belongs in
the C2 redesign on feat/uber-kernel-migration, not patched on a debug
branch.
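The tensor-signal alternative mentioned above can be sketched briefly (names hypothetical): instead of a Python `if` that Dynamo specialises at trace time, both candidate outputs stay in the graph and a device-side boolean selects between them, which is traceable and does not require a host sync.

```python
import torch

def consume_select(fusion_fired: torch.Tensor,
                   fused_out: torch.Tensor,
                   legacy_out: torch.Tensor) -> torch.Tensor:
    # fusion_fired: 0-dim bool tensor written on the kernel side.
    # torch.where broadcasts the scalar condition over both candidates,
    # so the branch decision lives entirely in the graph.
    return torch.where(fusion_fired, fused_out, legacy_out)

fused = torch.full((4,), 2.0)
legacy = torch.full((4,), 7.0)
assert torch.equal(consume_select(torch.tensor(True), fused, legacy), fused)
assert torch.equal(consume_select(torch.tensor(False), fused, legacy), legacy)
```

The trade-off: `torch.where` evaluates both candidates every step (both the fused and legacy paths must run), whereas `torch.cond` executes only the taken branch but imposes stricter tracing constraints. Which fits belongs to the C2 redesign, not this branch.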
The next commit reverts this. The findings doc lands separately so it
remains in HEAD for the follow-up session.
Refs: memory:project_beta_coop_residual_solo_bug
memory:project_uber_kernel_migration
memory:feedback_pace_pressure (don't let pace drive design)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Parent: 5a0311c
3 files changed: 294 additions & 14 deletions