Commit 7d429f1
diag(c2): β-coop-vs-legacy diagnostic harness (env-gated, halt-on-divergence)

Adds CUTE_C2_DIAG=1 probe that compares β-coop's outputs against the
legacy post-attn-LN outputs in dual-fire under PIECEWISE+graphs. New
module vllm/v1/attention/backends/cute_paged/_c2_diag.py with 17 unit
tests; call site env-gated in vllm/nvllm/models/qwen3_5.py; serve-cute.sh
plumbs env vars + /tmp/c2_diag mount across the EngineCore subprocess
boundary.
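The halt-on-divergence comparison can be sketched roughly as below. This is a minimal illustration, not the contents of _c2_diag.py: the names check_divergence and DivergenceError, the tolerances, and the use of plain Python sequences instead of tensors are all assumptions made for the example.

```python
import math


class DivergenceError(RuntimeError):
    """Hypothetical error type: raised when the two paths disagree."""


def check_divergence(candidate, reference, *, atol=1e-3, rtol=1e-3, label="c2"):
    """Compare two flat sequences of floats; raise on the first mismatch.

    Returns the largest absolute error seen when everything is in tolerance.
    The tolerances are placeholders, not values taken from the commit.
    """
    if len(candidate) != len(reference):
        raise DivergenceError(
            f"{label}: length mismatch {len(candidate)} vs {len(reference)}"
        )
    worst = 0.0
    for i, (c, r) in enumerate(zip(candidate, reference)):
        if math.isnan(c) or math.isnan(r):
            raise DivergenceError(f"{label}: NaN at index {i}")
        err = abs(c - r)
        # Same shape of test as an allclose-style check: atol + rtol * |ref|.
        if err > atol + rtol * abs(r):
            raise DivergenceError(
                f"{label}: index {i} differs by {err:.3e} (cand={c}, ref={r})"
            )
        worst = max(worst, err)
    return worst
```

Raising instead of logging is what makes the probe halt: the first out-of-tolerance element stops the run at the step where the two paths first disagree.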
Architectural limit found and documented: under PIECEWISE+graphs the
op's Python body executes only at capture (where it skips to avoid
cudaErrorStreamCaptureInvalidated) and never during decode replay, so
the diag cannot observe steady-state β-coop. See
docs/research/uber_kernel_migration/2026-04-27-c2-diagnostic-results.md
for full verdict + decision to proceed with CUTE_DUMP_TENSORS-based
forensics instead.
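The limit described above can be shown with a toy capture/replay model. This is a pure-Python analogy, not the CUDA graphs or vLLM API: ToyGraph, op, and python_body_calls are hypothetical names, and the "kernel" is just a lambda.

```python
class ToyGraph:
    """Toy stand-in for a captured graph: it records kernels and replays them."""

    def __init__(self):
        self.kernels = []

    def replay(self, x):
        # Replay runs only the recorded kernels; no Python op body is re-entered.
        for kernel in self.kernels:
            x = kernel(x)
        return x


python_body_calls = []  # stands in for "the diag probe ran"


def op(x, graph=None):
    """Python wrapper around a 'kernel'; the probe would live in this body."""
    python_body_calls.append("op")
    kernel = lambda v: v * 2
    if graph is not None:
        # Capturing: the kernel is recorded so later replays can run it.
        graph.kernels.append(kernel)
    return kernel(x)


graph = ToyGraph()
op(1, graph=graph)                          # capture: Python body runs once
outs = [graph.replay(i) for i in range(3)]  # replay: recorded kernel only
```

After three replays, python_body_calls still holds a single entry. Anything placed in the Python body, including a diagnostic probe, sees the capture pass and nothing afterwards.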
Plumbing wins kept (reusable for future fused-kernel diagnostics):
- vLLM EngineCore env stripping workaround (/tmp/c2_diag/ENV file)
- direct_register_custom_op pattern for fullgraph compatibility
- prefill-skip + capture-skip runtime guards in op impl
- os.getenv(name) or default idiom (avoids the set-but-empty trap, where os.getenv(name, default) returns an empty string)
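Two of the env-related items above can be sketched as follows. The helper names env_or_default and load_env_file are hypothetical, and the KEY=VALUE layout of the ENV file is an assumption; the commit does not state the actual format of /tmp/c2_diag/ENV.

```python
import os
from pathlib import Path


def env_or_default(name, default):
    """Return the variable's value, falling back when it is unset OR empty.

    os.getenv(name, default) only falls back when the variable is unset;
    a variable exported as an empty string comes back as "".
    """
    return os.getenv(name) or default


def load_env_file(path, environ=os.environ):
    """Read KEY=VALUE lines (assumed format) and fill in missing variables.

    Existing entries in `environ` are left alone. Returns the parsed pairs.
    """
    p = Path(path)
    if not p.is_file():
        return {}
    loaded = {}
    for raw in p.read_text().splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        loaded[key] = value.strip()
        environ.setdefault(key, loaded[key])
    return loaded
```

A subprocess that starts with a stripped environment could call load_env_file on the mounted file before reading any flag. Note that the or-default idiom keeps a value such as "0", since a non-empty string is truthy.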
Production behavior unchanged when CUTE_C2_DIAG is unset.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 788697b commit 7d429f1
6 files changed
Lines changed: 2233 additions & 6 deletions
File tree
- docs/research/uber_kernel_migration
- scripts
- tests/v1/cute_paged
- vllm
  - nvllm/models
  - v1/attention/backends/cute_paged