Commit 631ddcc

Natfii and claude committed
docs(release): nvllm-v0.3.0 — uber-kernel migration release notes
Pinned-permalink release notes for feat/uber-kernel-migration (merge base 76b88ba, branch tip 1f91013). Documents the 13 commits that bring β-coop into production as the actual decode path under PIECEWISE CUDA graphs, plus three rounds of perf polish: Phase 4 (β-coop fires) → Phase 5 (paged-skip + except-replay) → Phase 6a (Python diet) → Phase 6b (small-M NVFP4 GEMM dispatcher). Includes file:line refs pinned to the branch tip, evidence tables sourced from committed trace summaries, and the AGENTS.md §4 AI assistance disclosure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 1f91013 commit 631ddcc

1 file changed

Lines changed: 184 additions & 0 deletions

@@ -0,0 +1,184 @@
# Release — Uber-Kernel Migration

**Date:** 2026-04-29
**Branch:** `feat/uber-kernel-migration`
**Merge base (main):** [`76b88ba21`](https://github.com/Navi-AI-Lab/nvllm/commit/76b88ba2165d74d1665b60eaeeab933958f0fd18)
**Branch tip:** [`1f91013b8`](https://github.com/Navi-AI-Lab/nvllm/commit/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea)
**Diffstat:** 50 files changed, +10,714 / −589
**Hardware target:** NVIDIA DGX Spark (GB10, SM120 / 121), 128 GB unified
**Model under test:** `ig1/Qwen3.5-27B-NVFP4`

---

## What this release contains

The β-coop "uber" kernel ([`PhaseE_Beta_Kernel`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/phase_e_kernel.py)) — a single cooperative-launch CuTe DSL kernel that subsumes per-full-attention-layer decode work (Phase A attention + Phase B W_O + Phase C post-attention RMSNorm + Phase E MLP) — was already present on `main` at [`bc9037955`](https://github.com/Navi-AI-Lab/nvllm/commit/bc9037955) (Phase E ship, 2026-04-23). It compiled, captured under PIECEWISE CUDA graphs, and produced coherent output in the smoke harness, but the captured FX graph still ran the legacy split path because β-coop's outputs were structurally unobservable to graph capture (consume-gate DCE, [findings 2026-04-26](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/docs/research/uber_kernel_migration/2026-04-26-consume-gate-dce-and-graph-capture.md)).

This branch brings β-coop into production as the actual decode path for full-attention layers under PIECEWISE CUDA graphs, then adds three rounds of perf polish.

---

## Commits (oldest → newest)

| # | Hash | Subject |
|---|---|---|
| 1 | [`2b21f3450`](https://github.com/Navi-AI-Lab/nvllm/commit/2b21f3450) | chore(serve): bake flashinfer-autotune-off flag into serve-cute.sh |
| 2 | [`a65bcef31`](https://github.com/Navi-AI-Lab/nvllm/commit/a65bcef31) | fix(cute): C1 — β-coop and β-lite read residual_buf, not residual_output |
| 3 | [`54da780f3`](https://github.com/Navi-AI-Lab/nvllm/commit/54da780f3) | refactor(cute): C1.5 — delete Phase 4 + F.1 layer-LN bake plumbing |
| 4 | [`5a0311ca3`](https://github.com/Navi-AI-Lab/nvllm/commit/5a0311ca3) | fix(cute): C2 plumbing — residual/gate mirror op + β-coop predicate hard-gate |
| 5 | [`514b88c6f`](https://github.com/Navi-AI-Lab/nvllm/commit/514b88c6f) | wip(cute): B-fix attempt — consume-gate DCE + post-attn-LN dispatch ops *(reverted in #6, kept in history for the architectural-pass reference)* |
| 6 | [`3ffcf8740`](https://github.com/Navi-AI-Lab/nvllm/commit/3ffcf8740) | Revert "wip(cute): B-fix attempt" |
| 7 | [`90b06d5df`](https://github.com/Navi-AI-Lab/nvllm/commit/90b06d5df) | docs(uber-kernel): consume-gate DCE + graph-capture findings (2026-04-26) |
| 8 | [`788697bff`](https://github.com/Navi-AI-Lab/nvllm/commit/788697bff) | docs(uber-kernel): C2 diagnostic spec — β-coop vs legacy under PIECEWISE+graphs |
| 9 | [`7d429f1b7`](https://github.com/Navi-AI-Lab/nvllm/commit/7d429f1b7) | diag(c2): β-coop-vs-legacy diagnostic harness (env-gated, halt-on-divergence) |
| 10 | [`0185f84a0`](https://github.com/Navi-AI-Lab/nvllm/commit/0185f84a0) | feat(cute): β-coop under PIECEWISE+graphs — Phase 4 + KV-update DCE fix |
| 11 | [`e7c9c38e9`](https://github.com/Navi-AI-Lab/nvllm/commit/e7c9c38e9) | perf(cute): Phase 5 — restore paged-skip optimization with except-replay |
| 12 | [`722efc60b`](https://github.com/Navi-AI-Lab/nvllm/commit/722efc60b) | perf(cute): Phase 6a — β-coop hot-path Python diet (-4.0% per kernel call) |
| 13 | [`1f91013b8`](https://github.com/Navi-AI-Lab/nvllm/commit/1f91013b8) | perf(cutlass): Phase 6b — small-M NVFP4 GEMM dispatcher (-1.09% NVFP4 mass) |

---

## Code surfaces (line refs pinned to branch tip `1f91013b8`)

### Phase 4 — β-coop fires under PIECEWISE+graphs

| Surface | File | Lines |
|---|---|---|
| β-coop torch op + fake registration | [`vllm/v1/attention/backends/cute_paged/_beta_coop_op.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_beta_coop_op.py) | [L40](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_beta_coop_op.py#L40), [L112](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_beta_coop_op.py#L112), [L129](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_beta_coop_op.py#L129) |
| Model-side dispatch (`Qwen3_5Attention`) | [`vllm/nvllm/models/qwen3_5.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/nvllm/models/qwen3_5.py) | [L295-L348](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/nvllm/models/qwen3_5.py#L295-L348), [L582](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/nvllm/models/qwen3_5.py#L582) |
| `_use_beta_coop` predicate + framework-output bind | [`vllm/v1/attention/backends/cute_paged/_backend.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_backend.py) | [L1246](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_backend.py#L1246), [L1268](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_backend.py#L1268) |

### Phase 5 — paged-skip narrowed to `_use_beta_coop` with except-handler replay

| Surface | File | Lines |
|---|---|---|
| `_skip_paged = _use_beta_coop` | [`vllm/v1/attention/backends/cute_paged/_backend.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_backend.py#L1267-L1268) | [L1267-L1268](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_backend.py#L1267-L1268) |
| Skip-paged guard | [`vllm/v1/attention/backends/cute_paged/_backend.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_backend.py#L1326) | [L1326](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_backend.py#L1326) |
| Except-replay branch | [`vllm/v1/attention/backends/cute_paged/_backend.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_backend.py#L1605-L1622) | [L1605-L1622](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_backend.py#L1605-L1622) |
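
One way to read the skip-paged guard and the except-replay branch together is sketched below. This is an assumption about the general pattern, not the code at the pinned lines: when the β-coop predicate holds, the legacy paged-attention forward is skipped, and if the fused path raises, the skipped step is replayed so the layer still produces output. `decode_layer`, `run_beta_coop`, and `run_paged_legacy` are hypothetical names.

```python
from typing import Callable, List


def decode_layer(
    use_beta_coop: bool,
    run_beta_coop: Callable[[], str],     # hypothetical fused-kernel launcher
    run_paged_legacy: Callable[[], str],  # hypothetical legacy paged forward
    log: List[str],
) -> str:
    # Mirrors the `_skip_paged = _use_beta_coop` assignment named in the table.
    skip_paged = use_beta_coop

    if not skip_paged:
        log.append("paged")
        return run_paged_legacy()

    try:
        log.append("beta_coop")
        return run_beta_coop()
    except RuntimeError:
        # Replay the step that was skipped so the caller still gets output.
        log.append("paged_replay")
        return run_paged_legacy()
```

Catching only `RuntimeError` is a choice made for the sketch; these notes do not state which exception types the real branch handles.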

### Phase 6a — hot-path Python diet (module-level env caches)

| Surface | File | Lines |
|---|---|---|
| `_CUTE_DUMP_TENSORS`, `_VERIFY_FRAMEWORK_OUTPUTS`, `_PHASE_E_ENV` | [`vllm/v1/attention/backends/cute_paged/_backend.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_backend.py) | [L46](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_backend.py#L46), [L52](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_backend.py#L52), [L130](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_backend.py#L130) |
| `_BETA_COOP_COUNT_FIRES` flag | [`vllm/v1/attention/backends/cute_paged/_beta_coop_op.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_beta_coop_op.py#L36) | [L36](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_beta_coop_op.py#L36) |
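
The idea behind a module-level env cache can be illustrated with a minimal sketch, assuming the hot path previously consulted `os.environ` on every call. The variable `NVLLM_EXAMPLE_FLAG` and both function names are hypothetical; only the flag names in the table above come from the branch.

```python
import os

# Read once at import time; later calls see a plain module global.
# NVLLM_EXAMPLE_FLAG is a hypothetical variable used only for illustration.
_EXAMPLE_FLAG = os.environ.get("NVLLM_EXAMPLE_FLAG", "0") == "1"


def hot_path_uncached() -> bool:
    # Per-call cost: an environment mapping lookup plus a string compare.
    return os.environ.get("NVLLM_EXAMPLE_FLAG", "0") == "1"


def hot_path_cached() -> bool:
    # Per-call cost: a single global load.
    return _EXAMPLE_FLAG
```

The trade-off is that the cached variant no longer observes changes made to the environment after the module is imported.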

### Phase 6b — small-M NVFP4 GEMM dispatcher

| Surface | File | Lines |
|---|---|---|
| Winners table + `lookup_m_small_winner` | [`csrc/libtorch_stable/quantization/fp4/nvfp4_winners_table.hpp`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/csrc/libtorch_stable/quantization/fp4/nvfp4_winners_table.hpp) | [L28](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/csrc/libtorch_stable/quantization/fp4/nvfp4_winners_table.hpp#L28), [L33](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/csrc/libtorch_stable/quantization/fp4/nvfp4_winners_table.hpp#L33) |
| BF16 dispatch (small-M reorder) | [`csrc/libtorch_stable/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cu`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/csrc/libtorch_stable/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cu#L340-L380) | [L340-L380](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/csrc/libtorch_stable/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cu#L340-L380) |
| FP16 dispatch (small-M reorder) | [`csrc/libtorch_stable/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cu`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/csrc/libtorch_stable/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cu#L419-L460) | [L419-L460](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/csrc/libtorch_stable/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cu#L419-L460) |
| Codegen (incl. `SMALL_ONLY_SHAPES` for the GDN row) | [`docs/research/gemm_sweep/gen_winners_header.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/docs/research/gemm_sweep/gen_winners_header.py) | full file |
| Replay harness (`--m-band` + new label modes) | [`docs/research/gemm_sweep/replay_winners_table.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/docs/research/gemm_sweep/replay_winners_table.py) | full file |
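
A shape-keyed winners table can be pictured as a lookup that returns a tuned kernel configuration for a small-M call and falls through to the default otherwise. The sketch below is an assumption about that general pattern, written in Python for brevity. The real `lookup_m_small_winner` lives in the C++ header listed above and may differ. The M threshold, the config ids, and the reading of each shape pair as `(N, K)` are all invented for illustration; only the shape values themselves are copied from the replay table in the Evidence section.

```python
from typing import Dict, Optional, Tuple

SMALL_M_MAX = 16  # hypothetical upper bound of the small-M band

# (N, K) -> config id. Shapes come from the replay table; ids are made up.
WINNERS: Dict[Tuple[int, int], int] = {
    (8192, 5120): 1,    # qkv_proj
    (5120, 6144): 1,    # o_proj
    (34816, 5120): 2,   # gate_up_proj
    (5120, 17408): 2,   # down_proj
}


def lookup_small_m_winner(m: int, n: int, k: int) -> Optional[int]:
    """Return a tuned config id for a small-M shape, or None for the default."""
    if m > SMALL_M_MAX:
        return None
    return WINNERS.get((n, k))


def dispatch(m: int, n: int, k: int, default_config: int = 0) -> int:
    """Pick the table winner when one exists, else the default config."""
    winner = lookup_small_m_winner(m, n, k)
    return default_config if winner is None else winner
```

Returning `None` for a miss keeps the fallback decision in the caller, so shapes absent from the table behave exactly as they did before the table existed.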

### C2 diagnostic harness

| Surface | File | Lines |
|---|---|---|
| Halt-on-divergence comparator | [`vllm/v1/attention/backends/cute_paged/_c2_diag.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_c2_diag.py) | full file (308 lines) |
| Test coverage | [`tests/v1/cute_paged/test_c2_diag.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/tests/v1/cute_paged/test_c2_diag.py) | full file (235 lines) |
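
In its simplest form, an env-gated halt-on-divergence comparator checks two outputs element by element and raises at the first mismatch beyond a tolerance, but only when an environment switch is set. The following is a minimal sketch under those assumptions. `C2_DIAG_EXAMPLE`, `DivergenceError`, `compare_or_halt`, and the tolerance are hypothetical and are not taken from `_c2_diag.py`. Because the check returns early unless the switch is on, the default path does no comparison work.

```python
import os
from typing import Sequence


class DivergenceError(RuntimeError):
    """Raised when the two paths disagree beyond tolerance."""


def compare_or_halt(name: str, candidate: Sequence[float],
                    reference: Sequence[float], atol: float = 1e-3) -> bool:
    # Env-gated: C2_DIAG_EXAMPLE is a hypothetical switch for this sketch.
    if os.environ.get("C2_DIAG_EXAMPLE", "0") != "1":
        return False  # check skipped
    if len(candidate) != len(reference):
        raise DivergenceError(
            f"{name}: length {len(candidate)} != {len(reference)}")
    for i, (c, r) in enumerate(zip(candidate, reference)):
        if abs(c - r) > atol:
            # Halt at the first divergent element and report where it is.
            raise DivergenceError(f"{name}[{i}]: {c} vs {r} exceeds atol={atol}")
    return True  # check ran and passed
```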

---

## Evidence

All measurements were taken under identical workloads (5 timed × 64 max_tokens × concurrency=1, 15 warmup curls, PIECEWISE CUDA graphs, FP8 KV cache, FUSION=1). Per-kernel μs values come from the torch profiler via `--profiler-config` + `/start_profile` / `/stop_profile`, because nsys CUPTI cannot trace vLLM V1's spawned EngineCore.

### `PhaseE_Beta_Kernel` per-call (μs)

| Run | Commit | Calls | Mean μs | Δ vs Phase E baseline |
|---|---|---:|---:|---:|
| Phase E β-coop baseline (main) | [`bc9037955`](https://github.com/Navi-AI-Lab/nvllm/commit/bc9037955) | 5,040 | 42,933.771 | |
| Phase 6a (this branch) | [`722efc60b`](https://github.com/Navi-AI-Lab/nvllm/commit/722efc60b) | 5,040 | 41,217.510 | −1,716.261 (−4.00%) |
| Phase 6b (this branch tip) | [`1f91013b8`](https://github.com/Navi-AI-Lab/nvllm/commit/1f91013b8) | 5,040 | 40,893.101 | **−2,040.670 (−4.75%)** |

Sources: [phase_e summary](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/benchmarks/nvllm/traces/phase_e/2026-04-23-initial/summary.md), [phase_6a summary](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/benchmarks/nvllm/traces/phase_6a/2026-04-29-initial/summary.md), [phase_6b summary](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/benchmarks/nvllm/traces/gemm_winners_table_smallM/2026-04-29-qwen35-27b/summary.md).

### NVFP4 GEMM mass (Phase 6b small-M dispatcher)

| Run | Commit | Calls | Total ms | Mean μs/call | Δ vs Phase 6a |
|---|---|---:|---:|---:|---:|
| Phase 6a | [`722efc60b`](https://github.com/Navi-AI-Lab/nvllm/commit/722efc60b) | 36,080 | 11,724.2 | 324.97 | |
| Phase 6b build #1 (no GDN row) | (intermediate) | 36,080 | 11,624.1 | 322.18 | −100.1 ms (−0.85%) |
| Phase 6b build #2 (GDN row added) | [`1f91013b8`](https://github.com/Navi-AI-Lab/nvllm/commit/1f91013b8) | 36,080 | 11,596.8 | 321.43 | **−127.4 ms (−1.09%)** |

### Phase 6b dispatcher replay (per-shape × M, small-M band)

20-cell replay total against forced-Stream-K baseline (`NVLLM_FP4_GEMM_CONFIG_M256=4` vs no env var):

| Shape | Σ baseline μs | Σ table μs | Δ |
|---|---:|---:|---:|
| `qkv_proj` (8192, 5120) | 448.70 | 343.74 | **−23.4%** |
| `o_proj` (5120, 6144) | 332.28 | 288.77 | **−13.1%** |
| `gate_up_proj` (34816, 5120) | 2,120.30 | 2,135.84 | +0.7% |
| `down_proj` (5120, 17408) | 1,150.49 | 1,143.69 | −0.6% |
| **Total (20 cells)** | **4,051.77** | **3,912.04** | **−3.45%** |

Source: [phase_6b summary § Primary evidence](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/benchmarks/nvllm/traces/gemm_winners_table_smallM/2026-04-29-qwen35-27b/summary.md). Wins concentrate on shapes where the optimal tile differs (`128x256x128` vs Stream-K's `128x128x256`); near-zero where the tile shapes match (only the schedule differs).
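
The totals and percentages in the replay table follow directly from the per-shape sums. As a quick check, the snippet below recomputes them. Its only inputs are the four per-shape rows copied from the table, and the helper name `pct_delta` is introduced here for illustration.

```python
# (shape, sum of baseline us, sum of table us), copied from the replay table.
ROWS = [
    ("qkv_proj", 448.70, 343.74),
    ("o_proj", 332.28, 288.77),
    ("gate_up_proj", 2120.30, 2135.84),
    ("down_proj", 1150.49, 1143.69),
]


def pct_delta(baseline: float, table: float) -> float:
    """Relative change of the table time against the baseline, in percent."""
    return (table - baseline) / baseline * 100.0


# Round to the table's two-decimal precision to absorb float summation noise.
total_baseline = round(sum(b for _, b, _ in ROWS), 2)
total_table = round(sum(t for _, _, t in ROWS), 2)

for shape, b, t in ROWS:
    print(f"{shape:>13}: {pct_delta(b, t):+.1f}%")
print(f"        total: {total_baseline} -> {total_table} "
      f"({pct_delta(total_baseline, total_table):+.2f}%)")
```

Run as written, this reproduces the per-shape deltas and the −3.45% total reported in the table.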

### Phase 5 — paged-skip + except-replay

GSM8K sanity per-question latency dropped from 16 s/Q to 12 s/Q (~25%) relative to Phase 4 (`0185f84a0`), with 8/8 PASS. The legacy paged-attention forward calls do not appear in the kernel-time table for `_use_beta_coop` paths. Source: [phase_5_paged_skip summary](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/benchmarks/nvllm/traces/phase_5_paged_skip/2026-04-28-restored/summary.md).

### End-to-end wall (Phase 6a)

GSM8K-50, seed=42, max_tokens=512, thinking off, identical workload:

| Run | Commit | Correct | Wall (s) | Δ |
|---|---|---:|---:|---:|
| Phase 5 | [`e7c9c38e9`](https://github.com/Navi-AI-Lab/nvllm/commit/e7c9c38e9) | 30/50 | 7,030 | |
| Phase 6a | [`722efc60b`](https://github.com/Navi-AI-Lab/nvllm/commit/722efc60b) | 31/50 | 6,838 | **−192 s (−2.7%)** |

---

## Correctness gates

| Gate | Result | Reference |
|---|---|---|
| GSM8K 8/8 sanity at Phase 4 ship | 8/8 PASS | [`0185f84a0`](https://github.com/Navi-AI-Lab/nvllm/commit/0185f84a0) |
| GSM8K 8/8 sanity at Phase 5 ship | 8/8 PASS, 12 s/Q | [phase_5 summary](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/benchmarks/nvllm/traces/phase_5_paged_skip/2026-04-28-restored/summary.md) |
| GSM8K-50 (seed=42) at Phase 6a ship | 31/50 (62.0%); Phase 5 baseline 30/50 | [phase_6a summary](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/benchmarks/nvllm/traces/phase_6a/2026-04-29-initial/summary.md) |
| GSM8K 8/8 sanity at Phase 6b ship | 8/8 PASS (dispatcher refactor, no math change) | [phase_6b summary § Correctness gate](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/benchmarks/nvllm/traces/gemm_winners_table_smallM/2026-04-29-qwen35-27b/summary.md) |

Test files added on this branch:

- [`tests/v1/cute_paged/test_beta_coop_skeleton.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/tests/v1/cute_paged/test_beta_coop_skeleton.py) — 123 lines
- [`tests/v1/cute_paged/test_c2_diag.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/tests/v1/cute_paged/test_c2_diag.py) — 235 lines
- [`tests/v1/cute_paged/test_uber_kernel_buffer_contracts.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/tests/v1/cute_paged/test_uber_kernel_buffer_contracts.py) — 48 lines
- [`tests/v1/cute_paged/test_uber_kernel_multi_layer.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/tests/v1/cute_paged/test_uber_kernel_multi_layer.py) — 114 lines

---

## Artifact index (committed evidence)

| Path | Contents |
|---|---|
| `benchmarks/nvllm/traces/phase_e/2026-04-23-initial/` | Phase E baseline (kernel CSV, serve log, summary) |
| `benchmarks/nvllm/traces/phase_5_paged_skip/2026-04-28-restored/` | Phase 5 paged-skip restoration evidence |
| `benchmarks/nvllm/traces/phase_6a/2026-04-29-initial/` | Phase 6a Python-diet evidence + GSM8K-50 wall |
| `benchmarks/nvllm/traces/gemm_winners_table_smallM/2026-04-29-qwen35-27b/` | Phase 6b dispatcher replay + E2E + summary |
| `benchmarks/nvllm/traces/gemm_sweep_sm120_phase6b_gdn/2026-04-29/` | GDN `(14336, 5120)` supplemental microbench |

Per [AGENTS.md §4](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/AGENTS.md): raw `*.pt.trace.json.gz` files are gitignored (size > 30 MB each, reproducible from the capture scripts in `docs/research/phase_*_traces/`); per-kernel CSVs, serve logs, memory watchdog logs, and `summary.md` are committed.

---

## Configuration verified at branch tip

| Field | Value |
|---|---|
| Backend | `CUTE_PAGED` |
| KV cache dtype | `fp8_e4m3` |
| Compilation | PIECEWISE CUDA graphs |
| Attention path | β-coop (`_use_beta_coop=True`) for full-attention layers when `64 * num_seqs ≤ _resident_cap` (96 on GB10) |
| Attention path (fallback) | β-lite for `num_seqs > 1` where the cooperative-launch resident cap blocks β-coop |
| Linear-attention layers | unaffected (FLA GDN, 48 of 64 layers) |
| Image | `nvllm:gb10` SHA `7ea16c763044` (Phase 6b build #2) |
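
The β-coop gate in the table reduces to a single comparison. A minimal sketch follows, assuming the multiplier 64 from the `64 * num_seqs` expression and the cap of 96 quoted for GB10. The function and constant names are illustrative and are not the body of `_use_beta_coop`. With these values only `num_seqs = 1` passes (64 ≤ 96, while 128 > 96), which agrees with the fallback row sending `num_seqs > 1` to β-lite.

```python
UNITS_PER_SEQ = 64       # multiplier from the `64 * num_seqs` expression
RESIDENT_CAP_GB10 = 96   # cap quoted for GB10 in the table


def use_beta_coop(num_seqs: int, resident_cap: int = RESIDENT_CAP_GB10) -> bool:
    """True when the cooperative launch fits under the resident cap."""
    return UNITS_PER_SEQ * num_seqs <= resident_cap


def attention_path(num_seqs: int) -> str:
    """Name the decode path a full-attention layer would take."""
    return "beta-coop" if use_beta_coop(num_seqs) else "beta-lite"
```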

---

## AI assistance disclosure

Branch authored with AI assistance (Claude Opus 4.7, 1M context). Each commit lists `Co-Authored-By` in the trailer. The submitting human reviewed every changed line and ran the listed correctness gates. No upstream `vllm-project/vllm` PRs are produced by this branch.
