
Phase B + C: Qwen3.5 own-the-stack — fork-owned model class + layers#3

Merged
Natfii merged 2 commits into main from feat/own-the-stack-phase-c on Apr 20, 2026

Conversation


Natfii commented Apr 20, 2026

Summary

Lands the Phase B + C own-the-stack refactor for Qwen3.5 on the fork:

  • Phase B (2b2d67c3b, already on main): Qwen3_5Attention model class
    moved from upstream vllm/model_executor/models/qwen3_next.py to fork-owned
    vllm/nvllm/models/qwen3_5.py, with a 15-line from vllm.nvllm.models.qwen3_5 import *
    shim in the upstream-tracking file. Adds attach_fusion(parent_layer) API on
    the CuTe paged backend.

  • Phase C (this PR, 2 commits):

    • cbfadb6a9 refactor(nvllm): Phase C own-the-stack — layers into vllm/nvllm/layers/
    • 6434802d6 chore(nvllm): Phase C — trace evidence + audit fixes

    Moves Qwen3_5RMSNorm and Qwen3_5MLP out of upstream-tracking files into
    fork-owned vllm/nvllm/layers/layernorm.py and vllm/nvllm/layers/mlp.py.
    Qwen3_5RMSNorm registers as CustomOp "qwen3_5_rms_norm"; Qwen3_5MLP
    is a dense-only copy of Qwen2MoeMLP that drops expert_gate/reduce_results.

Why

The fused-MLP kernel work (Phase D, separate PR off feat/unreal-kernel-phase-d)
needs a place to live that's fork-owned — otherwise every upstream sync drops
our fusion hooks. Own-the-stack gives:

  1. Fork-owned model class — fusion binding evolves without editing upstream files.
  2. Fork-owned layer classes — uber-kernel can fuse in-class ops without touching
    upstream Qwen2MoeMLP / GemmaRMSNorm.
  3. Shim files in upstream paths — existing
    from vllm.model_executor.models.qwen3_5 import * imports still work.

Test plan

  • GSM8K 8/8 on Qwen3.5-27B-NVFP4 via CuTe paged backend (validated 2026-04-17
    at landing time; recorded in memory:project_own_the_stack)
  • Qwen3NextAttention upstream-clean after Phase B off-ramp (matches upstream
    commit 494636b29)
  • Fusion binding via CutePagedAttentionImpl.attach_fusion(parent_layer)
    works for Qwen3.5 (Phase B shipped + smoke-tested)

No model-forward behavior changes — this is pure code-ownership movement, with
shim files keeping all existing imports resolving.

AI assistance

Assembled with AI assistance (Claude Opus 4.7). Every changed line was reviewed
by the submitter. AI-assisted commits carry
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> trailers.

Notes

  • Base: main on Navi-AI-Lab/nvllm (this fork), NOT upstream vllm-project/vllm.
  • Phase D follows in a separate PR off feat/unreal-kernel-phase-d after this lands.

Natfii and others added 2 commits April 17, 2026 14:34
- vllm/nvllm/layers/layernorm.py: Qwen3_5RMSNorm (copy of GemmaRMSNorm body,
  registered as "qwen3_5_rms_norm" to avoid CustomOp collision)
- vllm/nvllm/layers/mlp.py: Qwen3_5MLP (copy of dense Qwen2MoeMLP body,
  drops expert_gate + reduce_results kwargs; not used by 27B dense)
- vllm/nvllm/models/qwen3_5.py: 8-line import diff, 3 call-site renames,
  0 logic restructured
- tools/pre_commit/mypy.py: add vllm/nvllm/layers to EXCLUDE
- notebooks/nvllm/layers_smoke_tests.py: 5 host-side Tier-1 tests (5/5 pass)

Ship gate: GSM8K 8/8 on natfii/Qwen3.5-27B-NVFP4-Opus-GB10 with
PIECEWISE CUDA graphs, matching Phase B baseline (commit 4110dc7).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- benchmarks/nvllm/traces/cute_fusion/2026-04-17-phase-c/:
  - summary.md: Tier-3 ship-gate evidence (GSM8K 8/8, commit cbfadb6,
    image nvllm:gb10-ots-phaseC)
  - decode_log.txt: 32-line trimmed CUTE_DEBUG_FUSION math (first
    phaseB+phaseC pair per fusion-active layer 3..63, matches Phase B
    shape at 2026-04-17-own-the-stack/)
- .gitignore: add benchmarks/nvllm/traces/**/*.full.txt rule so the
  unfiltered 1.4MB per-decode dumps stay local-only
- notebooks/nvllm/layers_smoke_tests.py: add missing nvllm fork SPDX
  line (code-review audit suggestion)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii merged commit b7b1e04 into main on Apr 20, 2026
Natfii added a commit that referenced this pull request Apr 22, 2026
analyze.py carries the shortlisting logic (importable by both notebooks
and the B.2.2 code generator). gemm_microbench_analysis.ipynb renders
per-shape heatmaps + top-3 per (shape x M-bucket) + exports shortlist.json.
Notebook committed with outputs pre-rendered so a future heuristic session
can open it without re-running the sweep.

Shortlist: 12 unique configs across 4 shapes x 4 M-buckets (16/16 cells
populated). Tile 128x128x256_*_Pers dominates gate_up_proj / down_proj
and o_proj mid-M; 128x256x128 wins qkv_proj / o_proj at small M;
256x128x128_TmaWSCoop_Pers wins o_proj at M=192-256. Pers tile scheduler
wins nearly every bucket — SK only appears as the #3 fallback. smoke_M256
(the baseline) still places in qkv_proj at large M, suggesting the remaining
tuning headroom there is tight.

Next (B.2.2): register shortlisted configs in the C++ dispatcher behind
NVLLM_FP4_GEMM_CONFIG_M256 env var, rebuild, per-config E2E traces.

Refs: docs/superpowers/plans/2026-04-21-gemm-sweep.md Task B.2.1

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii deleted the feat/own-the-stack-phase-c branch on April 22, 2026 at 08:38
Natfii added a commit that referenced this pull request Apr 25, 2026
Each of the 16 full_attention layers in Qwen3.5-27B attaches its own
PhaseE_Beta_Kernel instance with its own `self._compiled_phase_coop_full
= None`, so `cute.compile()` fires once per layer on first request —
16 × ~23 s ≈ ~6 min cold-start stall.

Fix: module-level `_PHASE_E_COOP_FULL_COMPILE_CACHE` keyed by the tuple
of all 22 `self.` constexprs read inside `_jit_launch_phase_0_to_4`
(audited via grep; key covers them all + 12 safe-redundant derived
fields). Instances with matching config share one compiled kernel.

Evidence (`benchmarks/nvllm/traces/phase_e_1/2026-04-24-coop-compile-cache/`):
- 16 β-coop attachments → 1 compile event (was 16).
- Cold Q1 = 79.4 s (compile + decode); warm Q2-Q8 = 22.7-23.2 s each.
- Projected savings ≈ 310 s (~5 min) shaved off first-request latency.
- GSM8K sanity PASS 7/8 (Q2 is a regex-extractor artifact on '120/12',
  not a kernel regression — reproduces on baseline without this fix).

Unit tests (`tests/kernels/cute/test_phase_e_compile_cache.py`):
- 6 new tests covering dict existence, key equivalence for matching
  configs, key distinctness for different configs, 16-instance → 1-compile
  behavior, distinct-config → N-compiles, and back-compat instance attr
  population.
- 33/33 existing Phase E tests still pass.

Next in Phase E.1: #3 record_function spans (this PR), #2 β-coop SMEM
shrink + #4 matched-concurrency baseline bench (follow-up session),
#5 cudaProfilerApi hook (infra).

Base: 7bc5773

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii added a commit that referenced this pull request Apr 25, 2026
Torch profiler traces of the CuTe backend lumped all 16 full_attention
layers together because the β-coop and β-lite call sites in
`_backend.forward()` emitted no span markers. Per-layer attribution was
only inferrable from kernel names, not from the profiler row labels.

Wrap each call site in `torch.profiler.record_function`:

- β-coop (line ~1144): `PhaseE_Beta.coop.{_layer_name}`
- β-lite (line ~1219): `PhaseE_Beta.lite.{_layer_name}`

`record_function` is a no-op when no profiler is active, so there is
zero steady-state cost. In profile captures the spans give one row per
layer per path in chrome://tracing.

Unit tests (`tests/kernels/cute/test_phase_e_record_function_spans.py`):
- record_function is imported.
- β-coop branch wraps the run_beta_coop_full call with a span labelled
  PhaseE_Beta.coop.{_layer_name}.
- β-lite branch wraps the _mlp_kernel call with a span labelled
  PhaseE_Beta.lite.{_layer_name}.
- Span labels distinct between paths.

4/4 new tests pass, 43/43 total Phase E tests pass. Integration verification
of the spans' trace output is deferred to the next live profile capture
(the wrap syntax is covered by the unit tests; the runtime behaviour of
record_function is owned by torch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>