Phase B + C: Qwen3.5 own-the-stack — fork-owned model class + layers #3
Merged
Conversation
- vllm/nvllm/layers/layernorm.py: Qwen3_5RMSNorm (copy of the GemmaRMSNorm body, registered as "qwen3_5_rms_norm" to avoid a CustomOp collision)
- vllm/nvllm/layers/mlp.py: Qwen3_5MLP (copy of the dense Qwen2MoeMLP body; drops the expert_gate + reduce_results kwargs; not used by the 27B dense model)
- vllm/nvllm/models/qwen3_5.py: 8-line import diff, 3 call-site renames, no logic restructured
- tools/pre_commit/mypy.py: add vllm/nvllm/layers to EXCLUDE
- notebooks/nvllm/layers_smoke_tests.py: 5 host-side Tier-1 tests (5/5 pass)

Ship gate: GSM8K 8/8 on natfii/Qwen3.5-27B-NVFP4-Opus-GB10 with PIECEWISE CUDA graphs, matching the Phase B baseline (commit 4110dc7).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
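The CustomOp rename above (registering the copied norm as "qwen3_5_rms_norm" so it does not clash with the upstream op's registration) follows a standard registry pattern. A minimal sketch, using a hypothetical dict registry rather than vLLM's actual CustomOp machinery:

```python
# Hypothetical registry; vLLM's real CustomOp class works differently,
# but the collision-avoidance idea is the same: a copied op body must
# register under a fresh key.
CUSTOM_OP_REGISTRY = {}

def register_custom_op(name):
    """Class decorator: register a class under `name`, refusing duplicate keys."""
    def decorator(cls):
        if name in CUSTOM_OP_REGISTRY:
            raise ValueError(f"CustomOp name collision: {name!r}")
        CUSTOM_OP_REGISTRY[name] = cls
        return cls
    return decorator

@register_custom_op("rms_norm")
class GemmaRMSNorm:  # stands in for the upstream op's registration
    pass

# The fork copy keeps the upstream body but takes a new registry key,
# so both classes can coexist in one process.
@register_custom_op("qwen3_5_rms_norm")
class Qwen3_5RMSNorm(GemmaRMSNorm):
    pass
```

Re-registering the copy under "rms_norm" would raise at import time, which is exactly the collision the rename sidesteps.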
- benchmarks/nvllm/traces/cute_fusion/2026-04-17-phase-c/:
  - summary.md: Tier-3 ship-gate evidence (GSM8K 8/8, commit cbfadb6, image nvllm:gb10-ots-phaseC)
  - decode_log.txt: 32-line trimmed CUTE_DEBUG_FUSION math (first phaseB+phaseC pair per fusion-active layer 3..63; matches the Phase B shape at 2026-04-17-own-the-stack/)
- .gitignore: add a benchmarks/nvllm/traces/**/*.full.txt rule so the unfiltered 1.4 MB per-decode dumps stay local-only
- notebooks/nvllm/layers_smoke_tests.py: add the missing nvllm fork SPDX line (code-review audit suggestion)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii added a commit that referenced this pull request on Apr 22, 2026
analyze.py carries the shortlisting logic (importable by both the notebooks and the B.2.2 code generator). gemm_microbench_analysis.ipynb renders per-shape heatmaps plus the top 3 per (shape × M-bucket) and exports shortlist.json. The notebook is committed with outputs pre-rendered so a future heuristic session can open it without re-running the sweep.

Shortlist: 12 unique configs across 4 shapes × 4 M-buckets (16/16 cells populated). Tile 128x128x256_*_Pers dominates gate_up_proj / down_proj and o_proj at mid M; 128x256x128 wins qkv_proj / o_proj at small M; 256x128x128_TmaWSCoop_Pers wins o_proj at M=192-256. The Pers tile scheduler wins nearly every bucket — SK appears only as a #3 fallback. smoke_M256 (the baseline) places in qkv_proj at large M, showing that the room for further tuning there is tight.

Next (B.2.2): register the shortlisted configs in the C++ dispatcher behind an NVLLM_FP4_GEMM_CONFIG_M256 env var, rebuild, and capture per-config E2E traces.

Refs: docs/superpowers/plans/2026-04-21-gemm-sweep.md Task B.2.1

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
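The per-cell shortlisting described above (top 3 configs per (shape × M-bucket) cell, deduplicated into shortlist.json) can be sketched as follows. The field names and sample rows are illustrative, not the real sweep schema:

```python
# Illustrative reimplementation of the shortlist step: group benchmark
# rows into (shape, M-bucket) cells, rank by latency, keep the top 3,
# then deduplicate configs across cells for the JSON export.
import json
from collections import defaultdict

def shortlist(rows, top_k=3):
    """rows: dicts with keys shape, m_bucket, config, latency_us (assumed schema)."""
    cells = defaultdict(list)
    for r in rows:
        cells[(r["shape"], r["m_bucket"])].append(r)
    best = {}
    for cell, entries in cells.items():
        entries.sort(key=lambda r: r["latency_us"])  # fastest first
        best[cell] = [e["config"] for e in entries[:top_k]]
    return best

rows = [  # made-up measurements for one cell
    {"shape": "o_proj", "m_bucket": 256, "config": "256x128x128_TmaWSCoop_Pers", "latency_us": 41.0},
    {"shape": "o_proj", "m_bucket": 256, "config": "128x128x256_Pers", "latency_us": 44.5},
    {"shape": "o_proj", "m_bucket": 256, "config": "smoke_M256", "latency_us": 47.2},
]
best = shortlist(rows)
unique_configs = sorted({c for cfgs in best.values() for c in cfgs})
print(json.dumps(unique_configs))  # deduplicated list, as in shortlist.json
```

The dedup step is what turns 16 populated cells into the 12 unique configs the commit reports.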
Natfii added a commit that referenced this pull request on Apr 25, 2026
Each of the 16 full_attention layers in Qwen3.5-27B attaches its own PhaseE_Beta_Kernel instance with its own `self._compiled_phase_coop_full = None`, so `cute.compile()` fires once per layer on the first request — 16 × ~23 s ≈ ~6 min of cold-start stall.

Fix: a module-level `_PHASE_E_COOP_FULL_COMPILE_CACHE` keyed by the tuple of all 22 `self.` constexprs read inside `_jit_launch_phase_0_to_4` (audited via grep; the key covers them all plus 12 safe-redundant derived fields). Instances with matching configs share one compiled kernel.

Evidence (`benchmarks/nvllm/traces/phase_e_1/2026-04-24-coop-compile-cache/`):
- 16 β-coop attachments → 1 compile event (was 16).
- Cold Q1 = 79.4 s (compile + decode); warm Q2-Q8 = 22.7-23.2 s each.
- Projected savings ≈ 310 s (~5 min) shaved off first-request latency.
- GSM8K sanity PASS 7/8 (Q2 is a regex-extractor artifact on '120/12', not a kernel regression — it reproduces on the baseline without this fix).

Unit tests (`tests/kernels/cute/test_phase_e_compile_cache.py`):
- 6 new tests covering dict existence, key equivalence for matching configs, key distinctness for different configs, 16-instance → 1-compile behaviour, distinct-config → N-compiles, and back-compat instance-attribute population.
- 33/33 existing Phase E tests still pass.

Next in Phase E.1: #3 record_function spans (this PR), #2 β-coop SMEM shrink + #4 matched-concurrency baseline bench (follow-up session), #5 cudaProfilerApi hook (infra).

Base: 7bc5773
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii added a commit that referenced this pull request on Apr 25, 2026
Torch profiler traces of the CuTe backend lumped all 16 full_attention
layers together because the β-coop and β-lite call sites in
`_backend.forward()` emitted no span markers. Per-layer attribution was
only inferrable from kernel names, not from the profiler row labels.
Wrap each call site in `torch.profiler.record_function`:
- β-coop (line ~1144): `PhaseE_Beta.coop.{_layer_name}`
- β-lite (line ~1219): `PhaseE_Beta.lite.{_layer_name}`
`record_function` is a no-op when no profiler is active, so there is
zero steady-state cost. In profile captures the spans give one row per
layer per path in chrome://tracing.
Unit tests (`tests/kernels/cute/test_phase_e_record_function_spans.py`):
- record_function is imported.
- β-coop branch wraps the run_beta_coop_full call with a span labelled
PhaseE_Beta.coop.{_layer_name}.
- β-lite branch wraps the _mlp_kernel call with a span labelled
PhaseE_Beta.lite.{_layer_name}.
- Span labels distinct between paths.
4/4 new tests pass, 43/43 total Phase E tests pass. Integration
verification of the spans' trace output is deferred to the next live
profile capture (the wrap syntax is covered by the unit tests; the
runtime behaviour of record_function is owned by torch).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Lands the Phase B + C own-the-stack refactor for Qwen3.5 on the fork:
Phase B (2b2d67c3b, already on main): the Qwen3_5Attention model class moved from upstream vllm/model_executor/models/qwen3_next.py to fork-owned vllm/nvllm/models/qwen3_5.py, with a 15-line from vllm.nvllm.models.qwen3_5 import * shim in the upstream-tracking file. Adds the attach_fusion(parent_layer) API on the CuTe paged backend.
Phase C (this PR, 2 commits):
- cbfadb6a9 refactor(nvllm): Phase C own-the-stack — layers into vllm/nvllm/layers/
- 6434802d6 chore(nvllm): Phase C — trace evidence + audit fixes

Moves Qwen3_5RMSNorm and Qwen3_5MLP out of upstream-tracking files into the fork-owned vllm/nvllm/layers/layernorm.py and vllm/nvllm/layers/mlp.py. Qwen3_5RMSNorm registers as CustomOp "qwen3_5_rms_norm"; Qwen3_5MLP is a dense-only copy of Qwen2MoeMLP that drops expert_gate/reduce_results.

Why
The fused-MLP kernel work (Phase D, separate PR off feat/unreal-kernel-phase-d) needs a place to live that's fork-owned — otherwise every upstream sync drops our fusion hooks. Own-the-stack gives:
- fork-owned layer files, decoupled from the upstream Qwen2MoeMLP/GemmaRMSNorm.
- from vllm.model_executor.models.qwen3_5 import * imports still work.

Test plan
(at landing time; recorded in memory:project_own_the_stack)
- Qwen3NextAttention is upstream-clean after the Phase B off-ramp (matches upstream commit 494636b29)
- CutePagedAttentionImpl.attach_fusion(parent_layer) works for Qwen3.5 (Phase B shipped + smoke-tested)
No model-forward behavior changes — pure code ownership movement with shim files
keeping all existing imports resolving.
AI assistance
Assembled with AI assistance (Claude Opus 4.7). Every changed line was reviewed by the submitter. AI-assisted commits carry Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> trailers.

Notes
- Targets main on Navi-AI-Lab/nvllm (this fork), NOT upstream vllm-project/vllm.
- Phase D work continues on feat/unreal-kernel-phase-d after this lands.