| 1 | +# Release — Uber-Kernel Migration |
| 2 | + |
| 3 | +**Date:** 2026-04-29 |
| 4 | +**Branch:** `feat/uber-kernel-migration` |
| 5 | +**Merge base (main):** [`76b88ba21`](https://github.com/Navi-AI-Lab/nvllm/commit/76b88ba2165d74d1665b60eaeeab933958f0fd18) |
| 6 | +**Branch tip:** [`1f91013b8`](https://github.com/Navi-AI-Lab/nvllm/commit/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea) |
| 7 | +**Diffstat:** 50 files changed, +10,714 / −589 |
| 8 | +**Hardware target:** NVIDIA DGX Spark (GB10, SM120 / 121), 128 GB unified memory |
| 9 | +**Model under test:** `ig1/Qwen3.5-27B-NVFP4` |
| 10 | + |
| 11 | +--- |
| 12 | + |
| 13 | +## What this release contains |
| 14 | + |
| 15 | +The β-coop "uber" kernel ([`PhaseE_Beta_Kernel`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/phase_e_kernel.py)) — a single cooperative-launch CuTe DSL kernel that subsumes per-full-attention-layer decode work (Phase A attention + Phase B W_O + Phase C post-attention RMSNorm + Phase E MLP) — was already present on `main` at [`bc9037955`](https://github.com/Navi-AI-Lab/nvllm/commit/bc9037955) (Phase E ship, 2026-04-23). It compiled, captured under PIECEWISE CUDA graphs, and produced coherent output in the smoke harness, but the captured FX graph still ran the legacy split path because β-coop's outputs were structurally unobservable to graph capture (consume-gate DCE, [findings 2026-04-26](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/docs/research/uber_kernel_migration/2026-04-26-consume-gate-dce-and-graph-capture.md)). |
| 16 | + |
| 17 | +This branch makes β-coop the actual decode path for full-attention layers under PIECEWISE CUDA graphs, then adds three rounds of performance tuning (Phases 5, 6a, and 6b). |
| 18 | + |
| 19 | +--- |
| 20 | + |
| 21 | +## Commits (oldest → newest) |
| 22 | + |
| 23 | +| # | Hash | Subject | |
| 24 | +|---|---|---| |
| 25 | +| 1 | [`2b21f3450`](https://github.com/Navi-AI-Lab/nvllm/commit/2b21f3450) | chore(serve): bake flashinfer-autotune-off flag into serve-cute.sh | |
| 26 | +| 2 | [`a65bcef31`](https://github.com/Navi-AI-Lab/nvllm/commit/a65bcef31) | fix(cute): C1 — β-coop and β-lite read residual_buf, not residual_output | |
| 27 | +| 3 | [`54da780f3`](https://github.com/Navi-AI-Lab/nvllm/commit/54da780f3) | refactor(cute): C1.5 — delete Phase 4 + F.1 layer-LN bake plumbing | |
| 28 | +| 4 | [`5a0311ca3`](https://github.com/Navi-AI-Lab/nvllm/commit/5a0311ca3) | fix(cute): C2 plumbing — residual/gate mirror op + β-coop predicate hard-gate | |
| 29 | +| 5 | [`514b88c6f`](https://github.com/Navi-AI-Lab/nvllm/commit/514b88c6f) | wip(cute): B-fix attempt — consume-gate DCE + post-attn-LN dispatch ops *(reverted in #6, kept in history for the architectural-pass reference)* | |
| 30 | +| 6 | [`3ffcf8740`](https://github.com/Navi-AI-Lab/nvllm/commit/3ffcf8740) | Revert "wip(cute): B-fix attempt" | |
| 31 | +| 7 | [`90b06d5df`](https://github.com/Navi-AI-Lab/nvllm/commit/90b06d5df) | docs(uber-kernel): consume-gate DCE + graph-capture findings (2026-04-26) | |
| 32 | +| 8 | [`788697bff`](https://github.com/Navi-AI-Lab/nvllm/commit/788697bff) | docs(uber-kernel): C2 diagnostic spec — β-coop vs legacy under PIECEWISE+graphs | |
| 33 | +| 9 | [`7d429f1b7`](https://github.com/Navi-AI-Lab/nvllm/commit/7d429f1b7) | diag(c2): β-coop-vs-legacy diagnostic harness (env-gated, halt-on-divergence) | |
| 34 | +| 10 | [`0185f84a0`](https://github.com/Navi-AI-Lab/nvllm/commit/0185f84a0) | feat(cute): β-coop under PIECEWISE+graphs — Phase 4 + KV-update DCE fix | |
| 35 | +| 11 | [`e7c9c38e9`](https://github.com/Navi-AI-Lab/nvllm/commit/e7c9c38e9) | perf(cute): Phase 5 — restore paged-skip optimization with except-replay | |
| 36 | +| 12 | [`722efc60b`](https://github.com/Navi-AI-Lab/nvllm/commit/722efc60b) | perf(cute): Phase 6a — β-coop hot-path Python diet (-4.0% per kernel call) | |
| 37 | +| 13 | [`1f91013b8`](https://github.com/Navi-AI-Lab/nvllm/commit/1f91013b8) | perf(cutlass): Phase 6b — small-M NVFP4 GEMM dispatcher (-1.09% NVFP4 mass) | |
| 38 | + |
| 39 | +--- |
| 40 | + |
| 41 | +## Code surfaces (line refs pinned to branch tip `1f91013b8`) |
| 42 | + |
| 43 | +### Phase 4 — β-coop fires under PIECEWISE+graphs |
| 44 | + |
| 45 | +| Surface | File | Lines | |
| 46 | +|---|---|---| |
| 47 | +| β-coop torch op + fake registration | [`vllm/v1/attention/backends/cute_paged/_beta_coop_op.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_beta_coop_op.py) | [L40](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_beta_coop_op.py#L40), [L112](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_beta_coop_op.py#L112), [L129](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_beta_coop_op.py#L129) | |
| 48 | +| Model-side dispatch (`Qwen3_5Attention`) | [`vllm/nvllm/models/qwen3_5.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/nvllm/models/qwen3_5.py) | [L295-L348](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/nvllm/models/qwen3_5.py#L295-L348), [L582](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/nvllm/models/qwen3_5.py#L582) | |
| 49 | +| `_use_beta_coop` predicate + framework-output bind | [`vllm/v1/attention/backends/cute_paged/_backend.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_backend.py) | [L1246](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_backend.py#L1246), [L1268](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_backend.py#L1268) | |
| 50 | + |
| 51 | +### Phase 5 — paged-skip narrowed to `_use_beta_coop` with except-handler replay |
| 52 | + |
| 53 | +| Surface | File | Lines | |
| 54 | +|---|---|---| |
| 55 | +| `_skip_paged = _use_beta_coop` | [`vllm/v1/attention/backends/cute_paged/_backend.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_backend.py#L1267-L1268) | [L1267-L1268](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_backend.py#L1267-L1268) | |
| 56 | +| Skip-paged guard | [`vllm/v1/attention/backends/cute_paged/_backend.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_backend.py#L1326) | [L1326](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_backend.py#L1326) | |
| 57 | +| Except-replay branch | [`vllm/v1/attention/backends/cute_paged/_backend.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_backend.py#L1605-L1622) | [L1605-L1622](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_backend.py#L1605-L1622) | |
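
One way to read the skip-plus-replay pattern named in the table above is sketched below. This is an illustration under assumptions, not the code in `_backend.py`: the function `decode_step`, its callable parameters, and the choice of `RuntimeError` as the caught exception are all hypothetical.

```python
from typing import Callable, Optional


def decode_step(
    use_beta_coop: bool,
    run_paged: Callable[[], str],      # hypothetical stand-in for the legacy paged forward
    run_beta_coop: Callable[[], str],  # hypothetical stand-in for the fused kernel launch
) -> Optional[str]:
    # The skip flag mirrors the predicate, as in `_skip_paged = _use_beta_coop`.
    skip_paged = use_beta_coop

    paged_out = None
    if not skip_paged:
        # Guard: the legacy paged forward only runs when it was not skipped.
        paged_out = run_paged()

    if use_beta_coop:
        try:
            return run_beta_coop()
        except RuntimeError:
            # Replay: the paged forward was skipped, so run it now to
            # produce a valid output for this step.
            return run_paged()

    return paged_out
```

In this sketch the paged forward runs at most once per step: either up front (no β-coop) or inside the handler (β-coop failed).
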
| 58 | + |
| 59 | +### Phase 6a — hot-path Python diet (module-level env caches) |
| 60 | + |
| 61 | +| Surface | File | Lines | |
| 62 | +|---|---|---| |
| 63 | +| `_CUTE_DUMP_TENSORS`, `_VERIFY_FRAMEWORK_OUTPUTS`, `_PHASE_E_ENV` | [`vllm/v1/attention/backends/cute_paged/_backend.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_backend.py) | [L46](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_backend.py#L46), [L52](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_backend.py#L52), [L130](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_backend.py#L130) | |
| 64 | +| `_BETA_COOP_COUNT_FIRES` flag | [`vllm/v1/attention/backends/cute_paged/_beta_coop_op.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_beta_coop_op.py#L36) | [L36](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_beta_coop_op.py#L36) | |
| 65 | + |
| 66 | +### Phase 6b — small-M NVFP4 GEMM dispatcher |
| 67 | + |
| 68 | +| Surface | File | Lines | |
| 69 | +|---|---|---| |
| 70 | +| Winners table + `lookup_m_small_winner` | [`csrc/libtorch_stable/quantization/fp4/nvfp4_winners_table.hpp`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/csrc/libtorch_stable/quantization/fp4/nvfp4_winners_table.hpp) | [L28](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/csrc/libtorch_stable/quantization/fp4/nvfp4_winners_table.hpp#L28), [L33](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/csrc/libtorch_stable/quantization/fp4/nvfp4_winners_table.hpp#L33) | |
| 71 | +| BF16 dispatch (small-M reorder) | [`csrc/libtorch_stable/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cu`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/csrc/libtorch_stable/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cu#L340-L380) | [L340-L380](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/csrc/libtorch_stable/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cu#L340-L380) | |
| 72 | +| FP16 dispatch (small-M reorder) | [`csrc/libtorch_stable/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cu`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/csrc/libtorch_stable/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cu#L419-L460) | [L419-L460](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/csrc/libtorch_stable/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cu#L419-L460) | |
| 73 | +| Codegen (incl. `SMALL_ONLY_SHAPES` for the GDN row) | [`docs/research/gemm_sweep/gen_winners_header.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/docs/research/gemm_sweep/gen_winners_header.py) | full file | |
| 74 | +| Replay harness (`--m-band` + new label modes) | [`docs/research/gemm_sweep/replay_winners_table.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/docs/research/gemm_sweep/replay_winners_table.py) | full file | |
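
A table-driven small-M dispatcher can be sketched as follows. The real table lives in a C++ header; this Python version only illustrates the lookup-then-fallback shape. The M bands, config IDs, `DEFAULT_CONFIG`, and the function names are hypothetical; only the `(N, K)` shapes are taken from this document.

```python
from typing import Dict, List, Optional, Tuple

# (N, K) -> list of (max_m_inclusive, config_id), sorted by max_m.
# Bands and config IDs are made up for illustration.
WINNERS: Dict[Tuple[int, int], List[Tuple[int, int]]] = {
    (8192, 5120): [(16, 2), (64, 3)],  # qkv_proj shape
    (5120, 6144): [(64, 2)],           # o_proj shape
}

# Hypothetical config used when the table has no entry.
DEFAULT_CONFIG = 4


def lookup_small_m_winner(m: int, n: int, k: int) -> Optional[int]:
    # Return the config for the first band that contains m, if any.
    for max_m, config_id in WINNERS.get((n, k), []):
        if m <= max_m:
            return config_id
    return None


def pick_config(m: int, n: int, k: int) -> int:
    # Fall back to the default when the shape or M band is not tabulated.
    winner = lookup_small_m_winner(m, n, k)
    return DEFAULT_CONFIG if winner is None else winner
```

Keeping the fallback outside the lookup means an untabulated shape behaves exactly as it did before the table existed.
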
| 75 | + |
| 76 | +### C2 diagnostic harness |
| 77 | + |
| 78 | +| Surface | File | Lines | |
| 79 | +|---|---|---| |
| 80 | +| Halt-on-divergence comparator | [`vllm/v1/attention/backends/cute_paged/_c2_diag.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/vllm/v1/attention/backends/cute_paged/_c2_diag.py) | full file (308 lines) | |
| 81 | +| Test coverage | [`tests/v1/cute_paged/test_c2_diag.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/tests/v1/cute_paged/test_c2_diag.py) | full file (235 lines) | |
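
A halt-on-divergence comparator can be sketched in a few lines. This is not the contents of `_c2_diag.py`: the class `DivergenceError`, both functions, the use of plain Python float sequences instead of tensors, and the default tolerance are assumptions made for illustration.

```python
from typing import Optional, Sequence


class DivergenceError(RuntimeError):
    """Raised at the first point where two paths disagree."""


def first_divergence(
    ref: Sequence[float], test: Sequence[float], atol: float = 1e-3
) -> Optional[int]:
    # Return the index of the first element that differs by more than atol.
    if len(ref) != len(test):
        raise ValueError("outputs have different lengths")
    for i, (r, t) in enumerate(zip(ref, test)):
        if abs(r - t) > atol:
            return i
    return None


def check_or_halt(
    layer: int, ref: Sequence[float], test: Sequence[float], atol: float = 1e-3
) -> None:
    # Stop immediately so later layers do not bury the first bad value.
    idx = first_divergence(ref, test, atol)
    if idx is not None:
        raise DivergenceError(
            f"layer {layer}: element {idx} differs (ref={ref[idx]}, test={test[idx]})"
        )
```

Halting at the first mismatch keeps the report pointed at the layer where the two paths first disagree, rather than at downstream error that has already compounded.
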
| 82 | + |
| 83 | +--- |
| 84 | + |
| 85 | +## Evidence |
| 86 | + |
| 87 | +All measurements were taken under identical workloads (5 timed × 64 max_tokens × concurrency=1, 15 warmup curls, PIECEWISE CUDA graphs, FP8 KV cache, FUSION=1). Per-kernel μs values come from the torch profiler (`--profiler-config` with `/start_profile` and `/stop_profile`), because nsys CUPTI cannot trace vLLM V1's spawned EngineCore. |
| 88 | + |
| 89 | +### `PhaseE_Beta_Kernel` per-call (μs) |
| 90 | + |
| 91 | +| Run | Commit | Calls | Mean μs | Δ vs Phase E baseline | |
| 92 | +|---|---|---:|---:|---:| |
| 93 | +| Phase E β-coop baseline (main) | [`bc9037955`](https://github.com/Navi-AI-Lab/nvllm/commit/bc9037955) | 5,040 | 42,933.771 | — | |
| 94 | +| Phase 6a (this branch) | [`722efc60b`](https://github.com/Navi-AI-Lab/nvllm/commit/722efc60b) | 5,040 | 41,217.510 | −1,716.261 (−4.00%) | |
| 95 | +| Phase 6b (this branch tip) | [`1f91013b8`](https://github.com/Navi-AI-Lab/nvllm/commit/1f91013b8) | 5,040 | 40,893.101 | **−2,040.670 (−4.75%)** | |
| 96 | + |
| 97 | +Sources: [phase_e summary](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/benchmarks/nvllm/traces/phase_e/2026-04-23-initial/summary.md), [phase_6a summary](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/benchmarks/nvllm/traces/phase_6a/2026-04-29-initial/summary.md), [phase_6b summary](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/benchmarks/nvllm/traces/gemm_winners_table_smallM/2026-04-29-qwen35-27b/summary.md). |
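
The Δ column in the per-call table can be reproduced from the mean values with a short calculation. The helper `delta_vs_baseline` is a hypothetical name; the inputs are the means listed in the table above.

```python
from typing import Tuple


def delta_vs_baseline(baseline_us: float, value_us: float) -> Tuple[float, float]:
    # Absolute change in microseconds and relative change in percent.
    diff = value_us - baseline_us
    return diff, 100.0 * diff / baseline_us


PHASE_E_BASELINE = 42933.771
PHASE_6A = 41217.510
PHASE_6B = 40893.101

for name, mean_us in (("Phase 6a", PHASE_6A), ("Phase 6b", PHASE_6B)):
    diff, pct = delta_vs_baseline(PHASE_E_BASELINE, mean_us)
    print(f"{name}: {diff:.3f} us ({pct:.2f}%)")
```
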
| 98 | + |
| 99 | +### NVFP4 GEMM mass (Phase 6b small-M dispatcher) |
| 100 | + |
| 101 | +| Run | Commit | Calls | Total ms | Mean μs/call | Δ vs Phase 6a | |
| 102 | +|---|---|---:|---:|---:|---:| |
| 103 | +| Phase 6a | [`722efc60b`](https://github.com/Navi-AI-Lab/nvllm/commit/722efc60b) | 36,080 | 11,724.2 | 324.97 | — | |
| 104 | +| Phase 6b build #1 (no GDN row) | (intermediate) | 36,080 | 11,624.1 | 322.18 | −100.1 ms (−0.85%) | |
| 105 | +| Phase 6b build #2 (GDN row added) | [`1f91013b8`](https://github.com/Navi-AI-Lab/nvllm/commit/1f91013b8) | 36,080 | 11,596.8 | 321.43 | **−127.4 ms (−1.09%)** | |
| 106 | + |
| 107 | +### Phase 6b dispatcher replay (per-shape × M, small-M band) |
| 108 | + |
| 109 | +20-cell replay totals, comparing a forced Stream-K baseline (`NVLLM_FP4_GEMM_CONFIG_M256=4`) against the winners table (no env var set): |
| 110 | + |
| 111 | +| Shape | Σ baseline μs | Σ table μs | Δ | |
| 112 | +|---|---:|---:|---:| |
| 113 | +| `qkv_proj` (8192, 5120) | 448.70 | 343.74 | **−23.4%** | |
| 114 | +| `o_proj` (5120, 6144) | 332.28 | 288.77 | **−13.1%** | |
| 115 | +| `gate_up_proj` (34816, 5120) | 2,120.30 | 2,135.84 | +0.7% | |
| 116 | +| `down_proj` (5120, 17408) | 1,150.49 | 1,143.69 | −0.6% | |
| 117 | +| **Total (20 cells)** | **4,051.77** | **3,912.04** | **−3.45%** | |
| 118 | + |
| 119 | +Source: [phase_6b summary § Primary evidence](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/benchmarks/nvllm/traces/gemm_winners_table_smallM/2026-04-29-qwen35-27b/summary.md). Wins concentrate on shapes where the optimal tile differs (`128x256x128` vs Stream-K's `128x128x256`); the delta is near zero where the tile shapes match and only the schedule differs. |
| 120 | + |
| 121 | +### Phase 5 — paged-skip + except-replay |
| 122 | + |
| 123 | +GSM8K sanity per-question latency dropped from 16 s/Q to 12 s/Q (~25%) vs Phase 4 (`0185f84a0`), with 8/8 PASS. The legacy paged-attention forward calls do not appear in the kernel-time table for `_use_beta_coop` paths. Source: [phase_5_paged_skip summary](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/benchmarks/nvllm/traces/phase_5_paged_skip/2026-04-28-restored/summary.md). |
| 124 | + |
| 125 | +### End-to-end wall (Phase 6a) |
| 126 | + |
| 127 | +GSM8K-50, seed=42, max_tokens=512, thinking off, identical workload: |
| 128 | + |
| 129 | +| Run | Commit | Correct | Wall (s) | Δ | |
| 130 | +|---|---|---:|---:|---:| |
| 131 | +| Phase 5 | [`e7c9c38e9`](https://github.com/Navi-AI-Lab/nvllm/commit/e7c9c38e9) | 30/50 | 7,030 | — | |
| 132 | +| Phase 6a | [`722efc60b`](https://github.com/Navi-AI-Lab/nvllm/commit/722efc60b) | 31/50 | 6,838 | **−192 s (−2.7%)** | |
| 133 | + |
| 134 | +--- |
| 135 | + |
| 136 | +## Correctness gates |
| 137 | + |
| 138 | +| Gate | Result | Reference | |
| 139 | +|---|---|---| |
| 140 | +| GSM8K 8/8 sanity at Phase 4 ship | 8/8 PASS | [`0185f84a0`](https://github.com/Navi-AI-Lab/nvllm/commit/0185f84a0) | |
| 141 | +| GSM8K 8/8 sanity at Phase 5 ship | 8/8 PASS, 12 s/Q | [phase_5 summary](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/benchmarks/nvllm/traces/phase_5_paged_skip/2026-04-28-restored/summary.md) | |
| 142 | +| GSM8K-50 (seed=42) at Phase 6a ship | 31/50 (62.0%); Phase 5 baseline 30/50 | [phase_6a summary](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/benchmarks/nvllm/traces/phase_6a/2026-04-29-initial/summary.md) | |
| 143 | +| GSM8K 8/8 sanity at Phase 6b ship | 8/8 PASS (dispatcher refactor, no math change) | [phase_6b summary § Correctness gate](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/benchmarks/nvllm/traces/gemm_winners_table_smallM/2026-04-29-qwen35-27b/summary.md) | |
| 144 | + |
| 145 | +Test files added on this branch: |
| 146 | + |
| 147 | +- [`tests/v1/cute_paged/test_beta_coop_skeleton.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/tests/v1/cute_paged/test_beta_coop_skeleton.py) — 123 lines |
| 148 | +- [`tests/v1/cute_paged/test_c2_diag.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/tests/v1/cute_paged/test_c2_diag.py) — 235 lines |
| 149 | +- [`tests/v1/cute_paged/test_uber_kernel_buffer_contracts.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/tests/v1/cute_paged/test_uber_kernel_buffer_contracts.py) — 48 lines |
| 150 | +- [`tests/v1/cute_paged/test_uber_kernel_multi_layer.py`](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/tests/v1/cute_paged/test_uber_kernel_multi_layer.py) — 114 lines |
| 151 | + |
| 152 | +--- |
| 153 | + |
| 154 | +## Artifact index (committed evidence) |
| 155 | + |
| 156 | +| Path | Contents | |
| 157 | +|---|---| |
| 158 | +| `benchmarks/nvllm/traces/phase_e/2026-04-23-initial/` | Phase E baseline (kernel CSV, serve log, summary) | |
| 159 | +| `benchmarks/nvllm/traces/phase_5_paged_skip/2026-04-28-restored/` | Phase 5 paged-skip restoration evidence | |
| 160 | +| `benchmarks/nvllm/traces/phase_6a/2026-04-29-initial/` | Phase 6a Python-diet evidence + GSM8K-50 wall | |
| 161 | +| `benchmarks/nvllm/traces/gemm_winners_table_smallM/2026-04-29-qwen35-27b/` | Phase 6b dispatcher replay + E2E + summary | |
| 162 | +| `benchmarks/nvllm/traces/gemm_sweep_sm120_phase6b_gdn/2026-04-29/` | GDN `(14336, 5120)` supplemental microbench | |
| 163 | + |
| 164 | +Per [AGENTS.md §4](https://github.com/Navi-AI-Lab/nvllm/blob/1f91013b8432f01d5bc3cddfbd401a2d4d1cf0ea/AGENTS.md): raw `*.pt.trace.json.gz` files are gitignored (size > 30 MB each, reproducible from the capture scripts in `docs/research/phase_*_traces/`); per-kernel CSVs, serve logs, memory watchdog logs, and `summary.md` are committed. |
| 165 | + |
| 166 | +--- |
| 167 | + |
| 168 | +## Configuration verified at branch tip |
| 169 | + |
| 170 | +| Field | Value | |
| 171 | +|---|---| |
| 172 | +| Backend | `CUTE_PAGED` | |
| 173 | +| KV cache dtype | `fp8_e4m3` | |
| 174 | +| Compilation | PIECEWISE CUDA graphs | |
| 175 | +| Attention path | β-coop (`_use_beta_coop=True`) for full-attention layers when `64 * num_seqs ≤ _resident_cap` (96 on GB10) | |
| 176 | +| Attention path (fallback) | β-lite for `num_seqs > 1` where the cooperative-launch resident cap blocks β-coop | |
| 177 | +| Linear-attention layers | unaffected (FLA GDN, 48 of 64 layers) | |
| 178 | +| Image | `nvllm:gb10` SHA `7ea16c763044` (Phase 6b build #2) | |
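
The resident-cap condition in the table above implies that, on GB10, β-coop is selected only for a single sequence. A minimal sketch of that one term follows; the real `_use_beta_coop` predicate may include other conditions not listed here, and the names `select_decode_path`, `PER_SEQ_FACTOR`, and `GB10_RESIDENT_CAP` are hypothetical.

```python
# Values taken from the configuration table; the names are hypothetical.
PER_SEQ_FACTOR = 64      # the "64" in `64 * num_seqs <= _resident_cap`
GB10_RESIDENT_CAP = 96   # `_resident_cap` reported for GB10


def select_decode_path(num_seqs: int, resident_cap: int = GB10_RESIDENT_CAP) -> str:
    # Models only the resident-cap term of the predicate.
    if PER_SEQ_FACTOR * num_seqs <= resident_cap:
        return "beta_coop"
    return "beta_lite"


for n in (1, 2, 4):
    print(n, select_decode_path(n))
```

With a cap of 96, `num_seqs = 1` gives 64 ≤ 96 and selects β-coop, while `num_seqs = 2` gives 128 > 96 and falls back to β-lite.
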
| 179 | + |
| 180 | +--- |
| 181 | + |
| 182 | +## AI assistance disclosure |
| 183 | + |
| 184 | +Branch authored with AI assistance (Claude Opus 4.7, 1M context). Each commit lists `Co-Authored-By` in the trailer. The submitting human reviewed every changed line and ran the listed correctness gates. No upstream `vllm-project/vllm` PRs are produced by this branch. |