
evidence(wo_split prod soak): full {1,2,4,8} sweep + writeup #9

Merged: Natfii merged 18 commits into main from evidence/wo-split-prod-soak on May 7, 2026

Natfii commented May 7, 2026

Summary

  • Production soak across CUTE_WO_SPLIT ∈ {1, 2, 4, 8} on Qwen3.5-27B-NVFP4. wo8 wall-time -3.3% vs wo1 baseline, p95 TPOT -25.5 ms, GSM8K parity within 1/50 across all arms.
  • All optimized arms verdicted "keep opt-in", not new default. Wall gains plateau after wo4; wo8's main advantage at production batch sizes is reduced wall-time variance (stddev 1.98 s vs wo1's 6.74 s).
  • Production decoder cross-check captured — wo8 supplementary region_timings.npy confirms phase1_wo_gemv running at 32 active CTAs in real serving (= 8× the 4-CTA baseline), median 2359.8 μs / p99 2468.9 μs. Optimized W_O path is engaged in real serving, not just in the harness.
  • Companion to PR #7 "evidence(W_O K-parallel): validation harness + 8.39x sweep + NCU memory-bound" (controlled harness, 8.39× kernel-level speedup with NCU memory-bound classification) and PR #8 "feat(wo_split=8): opt-in K-parallel W_O GEMV prototype" (feat/wo-split-8-prototype, the implementation under measurement).

Per-arm headline

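| arm | gsm8k | wall mean (s) | wall stddev (s) | tpot p50 (ms) | tpot p95 (ms) | longdecode p95 |
|-----|-------|---------------|-----------------|---------------|---------------|----------------|
| wo1 | 48/50 | 8104.75 | 6.74 | 467.98 | 510.73 | 518.54 |
| wo2 | 47/50 | 7910.47 | 3.39 | 450.43 | 493.07 | 500.75 |
| wo4 | 48/50 | 7829.37 | 4.22 | 443.63 | 486.66 | 494.26 |
| wo8 | 47/50 | 7833.98 | 1.98 | 441.94 | 485.21 | 491.69 |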
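
The headline deltas in the Summary can be re-derived from the table. A minimal sketch, with the wall-mean and TPOT p95 values copied from the rows above (the dict layout and the `deltas` helper are illustrative, not part of the soak tooling):

```python
# Per-arm values copied from the per-arm table: (wall mean in s, TPOT p95 in ms).
arms = {
    "wo1": (8104.75, 510.73),
    "wo2": (7910.47, 493.07),
    "wo4": (7829.37, 486.66),
    "wo8": (7833.98, 485.21),
}

base_wall, base_p95 = arms["wo1"]


def deltas(arm):
    """Return (wall change in %, TPOT p95 change in ms) relative to wo1."""
    wall, p95 = arms[arm]
    return 100.0 * (wall - base_wall) / base_wall, p95 - base_p95


for name in ("wo2", "wo4", "wo8"):
    wall_pct, p95_ms = deltas(name)
    print(f"{name}: wall {wall_pct:+.1f}%  tpot p95 {p95_ms:+.1f} ms")
```

The wo8 line comes out as -3.3% wall and -25.5 ms TPOT p95, which is where the Summary figures come from; wo4 lands within a tenth of a percent of wo8 on wall time, consistent with the plateau noted above.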

Scope

  • Run scripts under docs/research/2026-05-04-wo-split-prod-soak/:
    • runner.sh — primary (5 ShareGPT replays + 5 longdecode replays + 2-concurrent + GSM8K-50) and supplementary (bounded ShareGPT under profiler + region timing) per arm
    • _replay.py, coherence_check.py, longdecode_prompt.txt, sharegpt_slice.jsonl — replay tooling and inputs
    • wo1_region_pass.sh — failed-attempt repro for the wo1 baseline region npy capture (kept as the receipt for the limitation noted below)
  • Evidence under benchmarks/nvllm/traces/wo_split_prod_soak/2026-05-04-soak/:
    • summary.md — auto-generated per-arm aggregate
    • writeup.md — human-readable companion (header pin, headline bullets, harness-cited kernel claim, wo8 production npy region table, known-limitations section, repro commands)
    • wo{1,2,4,8}/primary/ — full primary evidence per arm with DONE markers
    • wo8/supplementary/sharegpt_region_timings.npy — the only region npy that survived (auto-dump fired only on the wo_split=8 code path)
    • wo{1,4}/supplementary*/ — partial artifacts from runs whose POST /stop_profile hung the SoC

Total new evidence: ~17 MB across 270 files.

Production decoder region breakdown (wo8)

Per-region median and p99 timings (μs), computed directly from wo8/supplementary/sharegpt_region_timings.npy:

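| region | active CTAs | median (μs) | p99 (μs) |
|--------|-------------|-------------|----------|
| phase1_wo_gemv | 32 | 2359.8 | 2468.9 |
| grid_barrier_wait | 64 | 1809.6 | 2868.5 |
| phase3_3a_fc1_silu | 64 | 566.8 | 615.9 |
| phase4_residual | 28 | 230.3 | 230.4 |
| phase1_attn_pre_wo | 4 | 229.9 | 230.0 |
| phaseE_post | 1 | 162.9 | 162.9 |
| phase3_3c_fc2_atomic | 64 | 116.5 | 176.2 |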

grid_barrier_wait, at 1809.6 μs median, is the second-largest region after phase1_wo_gemv (2359.8 μs); the cooperative-launch grid barrier is the natural next target if a future iteration aims to recover more time inside the β-coop kernel.
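
For readers who want to see the shape of such a reduction, here is a hedged sketch of a per-region reducer. It assumes each CTA records one (entry, exit) tick pair per region and that unused slots stay at zero; the real npy layout, the tick-to-μs conversion, and the percentile method are not stated in this PR, so `reduce_region` and the nearest-rank p99 are illustrative choices only.

```python
import math
import statistics


def reduce_region(samples):
    """Reduce one region's per-CTA samples to (active_ctas, median, p99).

    samples: list of (entry_tick, exit_tick), one pair per CTA slot.
    Slots a CTA never wrote are (0, 0) and are dropped before reducing.
    Results are in the same unit as the ticks (no conversion is applied).
    """
    durations = sorted(end - start for start, end in samples if start and end)
    if not durations:
        return 0, float("nan"), float("nan")
    # Nearest-rank percentile (an assumption): the smallest recorded value
    # with at least 99% of the samples at or below it.
    rank = max(1, math.ceil(0.99 * len(durations)))
    return len(durations), statistics.median(durations), durations[rank - 1]


# Synthetic example: 3 active CTAs out of 5 slots.
demo = [(100, 130), (0, 0), (100, 140), (0, 0), (100, 150)]
active, med, p99 = reduce_region(demo)
```

The "active CTAs" column in the table corresponds to the count of non-zero slots in this sketch.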

Known limitations & failed attempts

Documented in writeup.md § "Known limitations & failed attempts":

  1. No same-run wo_split=1 serving region npy. Auto-dump hook does not fire on the wo_split=1 code path under serving in this config. Patching the backend would change the code under measurement; we did not.
  2. Torch profiler stop_profile can hard-reboot the host. Reproduced on wo4 (2026-05-05) and wo1 (2026-05-07). Bounds (limit_requests=4, max_prompt_chars=5500) prevent crashes during replay but not during post-replay kineto flush.
  3. Region-timing extract tool error. extract_regions.py fails on empty trace dirs. Worked around inline via direct npy decode.

Test plan

  • Primary GSM8K-50 + 5×ShareGPT + 5×longdecode + 2-concurrent on all 4 arms (wo1/wo2/wo4/wo8) — DONE markers committed
  • Coherence-check on longdecode replays — committed under wo*/primary/run*/longdecode_coherence.json
  • Wo8 supplementary region npy captured + breakdown computed
  • Failed-attempt repro script committed (wo1_region_pass.sh)
  • writeup.md link audit (13 local md links resolve)
  • Reviewer reproduce primary: WO_SPLITS="1,2,4,8" bash docs/research/2026-05-04-wo-split-prod-soak/runner.sh
  • Reviewer reproduce region breakdown: see writeup.md § "How to reproduce"

Notes

🤖 Generated with Claude Code

Natfii and others added 18 commits May 4, 2026 06:59
Adds Phase_E_Beta_Kernel.wo_split (env CUTE_WO_SPLIT, default 1,
bounded by slice_ctas) and threads it through _coop_full_compile_key.
At wo_split=1 the kernel behavior is unchanged. Disk cache will
distinguish wo_split variants once subsequent tasks add a kernel
body change.

Verified: cache MISS observed on first launch with new key ("first
call for this config" log at phase_e_kernel.py:3170); smoke probe
returned coherent /v1/completions output post-warmup.

Task 1 of 12 (wo_split=8 production prototype, plan at
/home/natfii/.claude/plans/sorted-crafting-rainbow.md).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Unbundles total_ctas_per_seq_attn into two concepts:
  - total_ctas_per_seq_attn (= num_kv_heads = 4): R1 attn-producer mask
  - total_wo_slots (= num_kv_heads * wo_split = 4 at wo_split=1):
      drives wo_output stride, gather loop, election target, counter reset

Slot-index formula bx*num_kv_heads+by stays legacy in this task; Task 8
lifts to by*wo_split+bx alongside the K-parallel kernel body.

At wo_split=1, total_wo_slots == 4 == legacy total_ctas_per_seq_attn,
so the address math is bit-exact. Cache key picks up new function
fingerprint via the Int32 arg addition.
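
The bit-exact claim at wo_split=1 can be checked on paper. A small sketch of the two slot-index formulas quoted above (function names are illustrative; num_kv_heads = 4 is the value given in this message):

```python
NUM_KV_HEADS = 4  # value quoted in this commit message


def legacy_slot(bx, by):
    """Legacy slot index kept by this task."""
    return bx * NUM_KV_HEADS + by


def split_slot(bx, by, wo_split):
    """Slot index that Task 8 switches to."""
    return by * wo_split + bx


def total_wo_slots(wo_split):
    return NUM_KV_HEADS * wo_split


# At wo_split=1 only bx == 0 participates, so both formulas reduce to `by`.
same_at_1 = all(
    legacy_slot(0, by) == split_slot(0, by, 1) for by in range(NUM_KV_HEADS)
)

# At wo_split=8 the new formula visits every slot exactly once.
slots_at_8 = sorted(
    split_slot(bx, by, 8) for by in range(NUM_KV_HEADS) for bx in range(8)
)
```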

Task 2 of 12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the literal `4` in the β-coop _phase_e_coop_wo_output
allocation with self.num_kv_heads * self._phase_e_coop_kernel.wo_split.
At wo_split=1 the dim 1 is still 4 (no-op refactor); at wo_split=8
later, this expands to 32 slots.

β-lite wo_output allocation at _backend.py:399-403 unchanged
(β-lite uses a different code path, out of scope for this plan).

Reset op _wo_output_reset_op.py shape preconditions are already
generic (dim()==3); only the error-message string is updated to
reflect the new shape semantic.

Task 3 of 12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds R11 (phase1_pre_wo_wait) and R12 (phase1_gather_reduce) to the
region-timing taxonomy. Host- and kernel-side constants bump together
to avoid the buffer-stride mismatch that an unbundled change would
introduce (host expects 13×16=208 bytes per CTA stride, kernel writes
11×16=176 — out-of-bounds or under-read between tasks).
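
The stride arithmetic can be sketched as follows. The factor 2 × 8 comes from the `Int64(11*2*8)` sites listed below; reading it as "two Int64 samples per region" is an assumption, and the helper names are illustrative.

```python
TICK_BYTES = 8          # one Int64 sample
SAMPLES_PER_REGION = 2  # assumed: one entry tick and one exit tick


def per_cta_stride(num_regions):
    """Bytes of timing buffer reserved per CTA."""
    return num_regions * SAMPLES_PER_REGION * TICK_BYTES


host_stride = per_cta_stride(13)    # 208 bytes
kernel_stride = per_cta_stride(11)  # 176 bytes


def cta_offset(cta, stride):
    return cta * stride


# If only one side were bumped, offsets would agree for CTA 0 only and
# drift by a further 32 bytes for every subsequent CTA.
drift = [
    cta_offset(c, host_stride) - cta_offset(c, kernel_stride) for c in range(4)
]
```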

Region classification is split: PHASE1_REGIONS stays {1,2,3};
WAIT_NOT_WORK_REGIONS gains R11 (consumer wait); new
DYNAMIC_SINGLE_CTA_REGIONS = {12} for the elected single-CTA gather.
This avoids the reducer's first-match-wins if/elif misclassifying
R11 as parallel phase1 work.
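
A hedged sketch of the classification idea: with disjoint sets, each region id can match only one branch, so branch order no longer decides the outcome. Only the set members named in this message are shown; the label strings other than "dynamic_single" and the `classify` helper are hypothetical.

```python
PHASE1_REGIONS = {1, 2, 3}
# Only R11 is named in this commit message; other members are omitted here.
WAIT_NOT_WORK_REGIONS = {11}
DYNAMIC_SINGLE_CTA_REGIONS = {12}


def classify(region_id):
    """Return a coarse class for a region id (labels are illustrative)."""
    if region_id in WAIT_NOT_WORK_REGIONS:
        return "wait"
    elif region_id in DYNAMIC_SINGLE_CTA_REGIONS:
        return "dynamic_single"
    elif region_id in PHASE1_REGIONS:
        return "phase1"
    return "other"


# The sets are pairwise disjoint, so R11 and R12 cannot fall into phase1.
disjoint = (
    not (PHASE1_REGIONS & WAIT_NOT_WORK_REGIONS)
    and not (PHASE1_REGIONS & DYNAMIC_SINGLE_CTA_REGIONS)
    and not (WAIT_NOT_WORK_REGIONS & DYNAMIC_SINGLE_CTA_REGIONS)
)
```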

Files:
  region_timing.py  - REGION_NAMES (+2), region-class sets,
                      _phase1_wo_split_cta_ids helper, reducer branch
  _backend.py       - _REGION_TIMING_NUM_REGIONS = 13
  phase_e_kernel.py - _region_timing_num_regions = 13;
                      _REGION_TIMING_PER_CTA_STRIDE module constant;
                      22 Int64(11*2*8) sites → Int64(_REGION_TIMING_PER_CTA_STRIDE)
  extract_regions.py - --wo-split arg; dispatch to wo_split helper
                       for R2/R3/R11/R12 when wo_split>1

At wo_split=1: R11 mask never fires (bx>0 && bx<1 empty), R12 records
gather time, R0-R10 numerically identical to prior baseline. No
behavioral change at default.

Tasks 4+5 of 12 (combined for atomicity).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lays the pre-W_O barrier infrastructure for the K-parallel W_O GEMV
that lands in Task 8. At wo_split=1 the infrastructure is dormant:
the consumer mask `bx > 0 && bx < 1` is empty, so no CTA spins and
R11 buffer rows stay zero (host nonzero filter drops them).

Counter: _phase_e_coop_pre_wo_arrival_count, allocated in attach_mlp_fusion,
zeroed per-launch via host .zero_() inside run_beta_coop_full (mirrors
phase1_arrival_count reset pattern at line ~3120).

Producer (bx==0 && by<num_kv_heads): _threadfence + sync_threads
+ tid0 atomic_add 1 to pre_wo_arrival_count after attn output written.
Placed inside the existing bx==0 parent block, between R1 exit and
R2 entry timing.

Consumer (bx>0 && bx<wo_split && by<num_kv_heads): R11 entry sample,
spin-wait via _ld_volatile_u32 until counter == num_kv_heads,
_acquire_fence + sync_threads, R11 exit sample. Placed at kernel-
level (outside bx==0 parent) immediately before the existing R4
grid-barrier entry. Dead at wo_split=1.
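
To make the two masks concrete, the sketch below counts how many CTAs fall into each role. The 8×8 grid extents are an assumption chosen for illustration, not the real launch shape; the mask expressions are the ones quoted above.

```python
NUM_KV_HEADS = 4


def is_producer(bx, by):
    """Attn producers: signal the counter after writing attn output."""
    return bx == 0 and by < NUM_KV_HEADS


def is_consumer(bx, by, wo_split):
    """Extra W_O CTAs: spin until all producers have arrived."""
    return 0 < bx < wo_split and by < NUM_KV_HEADS


def count_roles(wo_split, grid_x=8, grid_y=8):
    # grid_x / grid_y are illustrative extents, not the real launch config.
    cells = [(bx, by) for bx in range(grid_x) for by in range(grid_y)]
    producers = sum(is_producer(bx, by) for bx, by in cells)
    consumers = sum(is_consumer(bx, by, wo_split) for bx, by in cells)
    return producers, consumers
```

Under these assumptions `count_roles(1)` gives 4 producers and 0 consumers (the dormant case), while `count_roles(8)` gives 4 producers and 28 consumers.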

Cooperative=True invariant preserved on β-coop launch (CLAUDE.md
rule 8: atomic-counter spin-wait barriers must run cooperative).

Cache MISS confirmed on first launch ("first call for this config"
log line) — function fingerprint shifted by the new pre_wo_arrival_ptr
arg and new R11 timing sites.

Task 6 of 12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wraps the existing is_last_cta gather block in _kernel_phase_0_to_4
with R12 entry/exit timing samples. Only the elected CTA writes a
tick; all other CTAs leave R12 slots at zero. Host reducer
(region_timing.py:208) drops zeros and reports the elected tick as
median/mean for R12.

R12 is in DYNAMIC_SINGLE_CTA_REGIONS (region_timing.py); reducer
classifies as "dynamic_single" with NaN frac_of_kernel (not parallel
work).

At wo_split=1 the gather sums num_kv_heads=4 partials (legacy);
post-Task-8 at wo_split=8 the gather sums 32 partials. R12 captures
the duration in either case.

Task 7 of 12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lifts the W_O block out of the legacy `bx == 0 && by < num_kv_heads`
attn-producer parent. New gate: `bx < wo_split && by < num_kv_heads`.
At wo_split=1 the new gate is equivalent to the old (bx<1 ≡ bx==0)
and behavior is bit-exact preserved against the pre-Task-8 baseline.

K-range slicing follows torch_reference.py:443-446 exactly:
    K_per_head     = K // num_kv_heads
    k_start_in_head = (K_per_head * bx) // wo_split
    k_end_in_head   = (K_per_head * (bx + 1)) // wo_split
    k_start = by * K_per_head + k_start_in_head
    k_end   = by * K_per_head + k_end_in_head

Slot index: slot_idx = by * wo_split + bx (matches torch_reference.py:438-439
slot_id // wo_split == by, slot_id % wo_split == bx).
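
The slicing above can be sanity-checked in plain Python. This sketch reuses the quoted formulas and confirms that the per-slot K ranges tile [0, K) with no gaps or overlap; K = 6144 and num_kv_heads = 4 are taken from this message, and the helper names are illustrative.

```python
def k_range(bx, by, K, num_kv_heads, wo_split):
    """Half-open K range [k_start, k_end) handled by CTA (bx, by)."""
    k_per_head = K // num_kv_heads
    k_start_in_head = (k_per_head * bx) // wo_split
    k_end_in_head = (k_per_head * (bx + 1)) // wo_split
    base = by * k_per_head
    return base + k_start_in_head, base + k_end_in_head


def covers_exactly(K, num_kv_heads, wo_split):
    """True if ranges, visited in slot order by*wo_split+bx, chain from 0 to K."""
    pos = 0
    for by in range(num_kv_heads):
        for bx in range(wo_split):
            start, end = k_range(bx, by, K, num_kv_heads, wo_split)
            if start != pos:
                return False
            pos = end
    return pos == K


checks = {w: covers_exactly(6144, 4, w) for w in (1, 2, 4, 8)}
last_slice = k_range(7, 3, 6144, 4, 8)
```

With wo_split=8 each CTA covers 192 of the 1536 K elements belonging to its head.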

`wo_split_const: cutlass.Constexpr[int]` is threaded through both the
@cute.jit host wrapper and the @cute.kernel body, sourced from
self.wo_split at trace time. Cache key already includes self.wo_split
(Task 1) so flipping the env spawns a fresh compile.

The legacy W_O+gather block is moved out of the attn-producer parent
and placed at kernel-level after R11 (pre_wo_wait) so all W_O CTAs
(bx ∈ [0, wo_split), by < num_kv_heads) execute it. The election
counter target stays at total_wo_slots-1 (= num_kv_heads*wo_split - 1)
which scales naturally with the new W_O CTA count.

Bit-exact algorithm gate against reference_split_order:
- Harness microkernel @ docs/research/2026-05-03-w-o-k-parallel-harness/
  reproduces this exact K-range/slot formula and reports
  max_abs == 0.000e+00 vs reference_split_order(wo_split=N) at both
  wo_split=1 and wo_split=8.

Production kernel verification (synthetic repro at /tmp/wo_split_repro.py):
- wo_split=1: run_beta_coop_full completes; wo_output[:,0,:] FINITE;
  identical stats min=-7.84e+06 max=8.22e+06 mean=2.60e+04.
- wo_split=8: run_beta_coop_full completes; wo_output[:,0,:] FINITE;
  identical stats; max_abs vs wo_split=1 = 7.0 (a few ULPs at FP32 8M
  magnitude, where adjacent values are 0.5 to 1.0 apart; expected
  K-parallel reorder noise on K=6144 mixed-sign random data, identical
  to harness wo_split=1-vs-wo_split=8 drift).

Serve smoke at both wo_split=1 (default) and CUTE_WO_SPLIT=8 produces
identical coherent /v1/completions output across the three test prompts:
- "What is 2+2?" → "2+2 equals 4..."
- "Capital of France?" → "<think>...What is the capital of France?..."
- "Write a haiku about coding." → "<think>...Topic: Coding. Format: Haiku..."

Cache MISS confirmed for the wo_split=8 + Task-8 config:
"Compiling PhaseE_Beta_Kernel β-coop full (first call for this config)…"

Task 8 of 12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The kernel's wo_split bounds are robust for arbitrary 1..slice_ctas,
but only the powers-of-2 subset {1, 2, 4, 8} has the bench/correctness
story this PR ships. reference_split_order at torch_reference.py only
validates these four values, and the harness sweep evidence likewise
covers only this set.

Don't expose unevidenced settings (3/5/6/7) accidentally — the assert
fails fast on init if a user sets CUTE_WO_SPLIT to a non-evidenced
value. Comment block at the field documents the intent so a future
contributor knows the kernel logic itself isn't the gating constraint.
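
A minimal sketch of such a fail-fast check, assuming the value is read from the CUTE_WO_SPLIT environment variable with a default of 1 and bounded by slice_ctas as described in Task 1. The function name, constant name, and error messages are hypothetical, not the code in phase_e_kernel.py.

```python
import os

# Values with bench/correctness evidence in this PR series.
EVIDENCED_WO_SPLITS = (1, 2, 4, 8)


def read_wo_split(slice_ctas, env=None):
    """Parse CUTE_WO_SPLIT and reject values without evidence."""
    env = os.environ if env is None else env
    value = int(env.get("CUTE_WO_SPLIT", "1"))
    assert 1 <= value <= slice_ctas, (
        f"CUTE_WO_SPLIT={value} is outside 1..{slice_ctas}"
    )
    # The kernel logic is not the constraint here; the evidence is.
    assert value in EVIDENCED_WO_SPLITS, (
        f"CUTE_WO_SPLIT={value} has no evidence; use one of {EVIDENCED_WO_SPLITS}"
    )
    return value
```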

Pre-Task-10 cleanup. Subsequent baseline (Task 10) and graduation
(Task 11) traces will be captured against this restricted set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Original Task 9 plan ("parameterize host Phase 1 mask helpers on
wo_split") was already subsumed by the Task 4+5 combined dispatch.
Repurposed to address the kernel-side cleanups flagged by the Task 8
spec+quality review:

#2 (Important): R11 timing/spin/exit gates now use wo_split_const
    instead of self.wo_split, matching the W_O block (Task 8). Both
    are bound from int(self.wo_split) in the same JIT compile call,
    but mixing the two in the same kernel body forced readers to
    verify equivalence. Now uniform across the kernel body.

#3 (Minor): Hoisted single pre_wo_consumer_active = (bx>0 &&
    bx<wo_split_const && by<num_kv_heads) above the R11 entry; reused
    at entry timing, spin gate, and exit timing. Removes the duplicate
    pre_wo_consumer_active2 copy-paste artifact.

#4 (Minor): Dropped "# NEW:" prefix from the wo_split cache-key inline
    comment — the marker would go stale at PR.

#5 (Real, fixed in same diff via the L253 comment block): bound-restriction
    comment now points to the committed path
    docs/research/2026-05-03-w-o-k-parallel-harness/torch_reference.py
    instead of the machine-local transient
    /tmp/wo_split_repro_workdir/torch_reference.py.

#6 (Minor): Added 3-line comment block before the new pre_wo_consumer_active
    declaration explaining bx==0 producers skip R11 because their
    attn_output reads are intra-CTA — the cross-CTA safety derivation
    that the spec reviewer pointed out was undocumented.

Deferred to merge-prep (per user direction):
- #1: total_ctas_per_seq_attn dead-arg cleanup (Task 12 PR-prep)
- #7: cutlass.const_expr gate on wo_split=1 producer fence/atomic
       (revisit if Task 10/11 evidence shows wo_split=1 overhead matters)

Pure refactor — bit-exact gate at wo_split=1 AND wo_split=8 still
passes with max_abs == 0.0 against reference_split_order. Cache MISS
on first launch (wo_split_const reference and mask hoist change the
PTX even though numerics are identical at runtime).

Task 9 of 12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vLLM's EngineCore subprocess (typically pid 146) strips most docker -e
env vars from its parent (per feedback_vllm_enginecore_env_strip in
project memory). Without this workaround, CUTE_WO_SPLIT=8 set on docker
run never reaches Phase_E_Beta_Kernel.__init__ and the kernel falls back
to the default wo_split=1.

Workaround mirrors the existing CUTE_C2_DIAG_* sentinel pattern:

1. scripts/serve-cute.sh writes CUTE_WO_SPLIT=${CUTE_WO_SPLIT:-1} to
   the /tmp/c2_diag/ENV file (already bind-mounted into the container).

2. vllm/nvllm/models/qwen3_5.py reads /tmp/c2_diag/ENV at module import
   and calls os.environ.setdefault for any line matching CUTE_C2_* OR
   CUTE_WO_SPLIT=. The setdefault skips when the var is already set, so
   real env wins.
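
The import-time half of the workaround can be sketched as below. This is an illustration of the described behavior (prefix match plus `os.environ.setdefault`), not the code in qwen3_5.py; `load_sentinel_env` and its return value are hypothetical.

```python
import os


def load_sentinel_env(path, environ=None):
    """Apply CUTE_C2_* and CUTE_WO_SPLIT lines from a KEY=VALUE file.

    Uses setdefault, so a variable already present in the environment wins.
    Returns the keys that were newly set.
    """
    environ = os.environ if environ is None else environ
    applied = []
    try:
        with open(path, encoding="utf-8") as fh:
            lines = fh.read().splitlines()
    except FileNotFoundError:
        return applied  # no sentinel file: nothing to do
    for line in lines:
        line = line.strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.startswith("CUTE_C2_") or key == "CUTE_WO_SPLIT":
            if key not in environ:
                applied.append(key)
            environ.setdefault(key, value)
    return applied
```

Passing a plain dict as `environ` makes the precedence rule easy to test without touching the real process environment.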

Verified end-to-end on the live container today: with CUTE_WO_SPLIT=8
set on the host shell, serve-cute.sh writes the sentinel, EngineCore
reads it, PhaseE_Beta_Kernel constructs with self.wo_split=8, region
timing buffer shows R2 active CTAs = 32 (was 4), R11 active CTAs = 28
(consumer mask fires for bx>0).

Pre-Task-12 cleanup. Required for the wo_split=8 graduation evidence
in Tasks 10/11 to be reproducible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Task 8 quality review flagged total_ctas_per_seq_attn as dead plumbing:
defined in run_beta_coop_full host wrapper, plumbed through 4 kernel
signature levels, but never consumed in the kernel body (the R1
attn-pre-W_O mask uses the literal `by < Int32(4)` directly, not the
arg).

This was acceptable when Task 8 shipped because removing it is a separate
refactor and the bit-exact gate already verified that the kernel produces
correct output regardless of the dead arg. Removed now as merge-prep
cleanup before Task 12 (PR open).

Removed 5 sites:
  - Host wrapper definition
  - all_args tuple pack
  - _jit_launch_phase_0_to_4 sig
  - _jit_launch_phase_0_to_4 forward to inner kernel call
  - _kernel_phase_0_to_4 sig

Comment block at the host wrapper now documents the literal `4` in the
R1 mask: it stays a literal because wo_split scales the W_O CTA count,
NOT the attn-producer count (which is always num_kv_heads = 4 for
Qwen3.5-27B).

Bit-exact gate against reference_split_order still passes at both
wo_split=1 (max_abs=0) and wo_split=8 (max_abs=0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plan for evidence/wo-split-prod-soak: 4-arm sweep (wo_split=1/2/4/8),
N=5 sequential workload replays per arm, four phases per arm — GSM8K-50
anchor under /no_think, ShareGPT mixed-length serve (deterministic ~30
multi-turn convs from anon8231489123/ShareGPT_Vicuna_unfiltered),
2048-token long decode probe, lightweight 2-concurrent probe.

Two measurement modes per arm: primary CUTE_BETA_REGION_TIMING=0 for
clean wall/TPOT, supplementary CUTE_BETA_REGION_TIMING=1 for region
breakdown across R2/R11/R12. nsys budget: 8 representative traces (4
arms x phases B+C), not on perf-measurement runs.

Quality gate: GSM8K floor >=30/50 AND pairwise within 2 questions of
wo_split=1 baseline. Default candidate requires >=5% wall improvement
vs baseline, no GSM8K regression, TPOT p95 not worse. Successor branch
feat/wo-split-N-prototype where N is the verdict-selected value (4 or
8); ties broken by TPOT p95.
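
One way to encode the gate is sketched below, applied to the per-arm numbers later reported in this PR's headline table. Reading "no GSM8K regression" as "score not below baseline" is an assumption, and the `verdict` helper and the "reject" label are hypothetical.

```python
BASELINE = "wo1"

# (GSM8K correct out of 50, wall mean in s, TPOT p95 in ms), copied from the
# PR's per-arm headline table.
RESULTS = {
    "wo1": (48, 8104.75, 510.73),
    "wo2": (47, 7910.47, 493.07),
    "wo4": (48, 7829.37, 486.66),
    "wo8": (47, 7833.98, 485.21),
}


def verdict(arm):
    gsm, wall, p95 = RESULTS[arm]
    b_gsm, b_wall, b_p95 = RESULTS[BASELINE]
    # Quality gate: floor of 30/50 and within 2 questions of baseline.
    if gsm < 30 or abs(gsm - b_gsm) > 2:
        return "reject"
    # Default candidate: >= 5% wall improvement, no GSM8K regression
    # (assumed to mean gsm >= baseline), and TPOT p95 not worse.
    wall_gain = (b_wall - wall) / b_wall
    if wall_gain >= 0.05 and gsm >= b_gsm and p95 <= b_p95:
        return "default candidate"
    return "keep opt-in"


verdicts = {arm: verdict(arm) for arm in ("wo2", "wo4", "wo8")}
```

Under this reading every optimized arm passes the quality gate but misses the 5% wall threshold, which matches the "keep opt-in" verdicts recorded in the evidence.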

Branch force-reset from feat/wo-split-8-prototype@69c530082; the 11
kernel-feature + sentinel + cleanup commits are kept; two superseded
single-arm evidence/docs commits dropped. PR ships the production W_O
K-parallel feature (opt-in via CUTE_WO_SPLIT) plus the soak evidence
proving whether it is safe to default; default flip remains a
conditional follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add ShareGPT slice generation and deterministic committed input, the long-decode prompt, warning-only coherence checks, and the soak results parser.

Co-authored-by: OpenAI Codex <codex@openai.com>
Update the soak plan so supplementary serving traces use VLLM_TORCH_PROFILER_DIR instead of nsys, noting that V1 EngineCore subprocess capture is unreliable for CUPTI injection.

Co-authored-by: OpenAI Codex <codex@openai.com>
Add the multi-arm soak orchestrator and streaming replay helper, plus an opt-in serve-cute profiler hook so supplementary passes can use vLLM torch profiler endpoints.

Co-authored-by: OpenAI Codex <codex@openai.com>
The first run wedged overnight: profiler+region timing turned a 5109-char
ShareGPT turn into a 138s response, the next 6787-char turn blew past the
per-request HTTP timeout, _replay.py raised, set -e killed the runner, and
the orphan container's profiler buffer eventually triggered a cgroup-OOM
cascade (visible as udev mass-kill at 23:30 + snapd restart loop).

Three fixes:

  * _replay.py: --http-timeout (primary name; --timeout kept as alias),
    --limit-requests, --max-prompt-chars. Length filter runs BEFORE the
    request count, so a pathological prompt cannot consume a budgeted slot.

  * runner.sh: supplementary sharegpt runs with --http-timeout 1800
    --limit-requests 4 --max-prompt-chars 5500. Replay call wrapped in
    set +e + PIPESTATUS; profiler_stop, region dump, and extraction run
    best-effort on failure without masking replay_rc.

  * runner.sh main: two loops instead of one — all primaries (decision-
    critical) across every arm first, then all supplementaries (diagnostic).
    Outer primary_DONE / supplementary_DONE gates skip completed arms on
    resume so wo1's already-finished primary is not redone.
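
The ordering point in the first fix is easy to get wrong, so here is a small sketch of "filter by length, then apply the request budget". The function name and the inclusive length comparison are assumptions; only the order of the two steps is taken from the description above.

```python
def select_requests(prompts, max_prompt_chars=None, limit_requests=None):
    """Drop over-long prompts first, then take the first N survivors."""
    kept = [
        p for p in prompts
        if max_prompt_chars is None or len(p) <= max_prompt_chars
    ]
    return kept if limit_requests is None else kept[:limit_requests]


prompts = ["a" * 10, "b" * 6000, "c" * 20, "d" * 30]

# Filter-then-limit keeps the full budget of two usable prompts.
good = select_requests(prompts, max_prompt_chars=5500, limit_requests=2)

# Limit-then-filter (the ordering to avoid) wastes a slot on the long prompt.
wasteful = [p for p in prompts[:2] if len(p) <= 5500]
```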

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New domain-specific guide for kernel, graph-captured op, and numerics-aware
work. Linked from AGENTS.md "Domain-Specific Guides" section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Full primary + supplementary evidence for the W_O K-parallel production
soak across CUTE_WO_SPLIT ∈ {1, 2, 4, 8}. Run commit 5b8fc39, image
sha256:9c0f1d31c92c. Bundles:

- writeup.md: human-readable companion to the auto-generated summary.md
  (per-arm table, harness-cited kernel claim, wo8 production region npy
  cross-check, known-limitations section for failed profiler attempts)
- wo{1,2,4,8}/primary/: GSM8K-50 + 5×ShareGPT + 5×longdecode + 2-concurrent
  per arm with DONE markers; verdicts "keep opt-in" for wo2/wo4/wo8
- wo8/supplementary/sharegpt_region_timings.npy: only region npy that
  survived (auto-dump fired only on wo_split=8 path); 32-active-CTA
  phase1_wo_gemv = optimized W_O path engaged in real serving
- wo{1,4}/supplementary*/: partial artifacts from runs whose
  POST /stop_profile hung the SoC (wo4 host crash 2026-05-05; wo1
  host crash 2026-05-07). Documented as profiling-tooling limitation
- docs/research/2026-05-04-wo-split-prod-soak/wo1_region_pass.sh: the
  failed-attempt repro script, kept as the receipt for the limitation

Force-add per evidence-dir convention; .gitignore filters .json/.log/.csv
across the repo but trace evidence overrides.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii merged commit a131443 into main on May 7, 2026
Natfii deleted the evidence/wo-split-prod-soak branch on May 7, 2026 12:16
Natfii added a commit that referenced this pull request May 16, 2026
- Swap serve.sh model row to ig1/Qwen3.5-27B-NVFP4 (default since 04/06)
- CuTe paged: "Prototype" -> "production decode path since v0.3.0"
- Document CUTE_WO_SPLIT={2,4,8} opt-in K-parallel W_O GEMV (PR #9)
- Drop --debug from serve-cute.sh launch line; PIECEWISE is default
- Veitner acknowledgment: "next direction" -> "applied in WO_SPLIT"
- Markdown lint fixups (blank lines after headings, table separators)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii added a commit that referenced this pull request May 16, 2026
#16)

* chore(cute-paged): retire warmup.py + closed-arc trace-capture scripts

Pure deletion of dead/retired code. No behavior change. No new feature.
No perf claim. Not part of either active thread (SSM correctness or beta
kernel cost work) — this is housekeeping that reduces noise floor for
both.

Deletions:
- vllm/v1/attention/backends/cute_paged/warmup.py (165 lines) — module
  was retired with the FULL+blessed AOT cache work; Dockerfile.gb10
  explicitly removed the build-time invocation; no live importers in
  vllm/ or scripts/ (only stale references in docs/superpowers/specs,
  which are historical design records).
- scripts/phase_d_trace_capture.sh
- scripts/phase_d2_trace_capture.sh
- scripts/phase_d2e_trace_capture.sh
- scripts/phase_d3a_sweep.sh
  - All four scripts belong to the closed D2.x cherry-pick bisection
    arc; only referenced from their own trace-summary receipts under
    benchmarks/nvllm/traces/cute_paged_mlp_fusion/ and from historical
    docs/superpowers/specs entries.
- scripts/fusion_phaseb_diff.py — Phase B harness; the equivalent
  diagnostic is now the CUTE_DEBUG_FUSION env-gated path in _backend.py.

Comment-only edit:
- _backend.py:71-73 — drop the stale file-path reference to warmup.py
  (which no longer exists) from a comment about the disk-cache hook.
  No code change.

NOT in this PR (per memory:feedback_comment_not_delete + post-tool-use
kernel-fusion-file hook): the 4 already-commented PHASE 3 ORIGINAL /
Phase 4 epsilon-locals blocks inside _backend.py. The audit identified
them as candidates but their comments self-justify retention for "Phase
B/C debug recovery" and re-enable scenarios. Leaving them in place.

Net: -936 lines / +3 lines across 7 files. Force-pushable rollback is
`git revert`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(readme): sync model default + CuTe paged status + WO_SPLIT opt-in

- Swap serve.sh model row to ig1/Qwen3.5-27B-NVFP4 (default since 04/06)
- CuTe paged: "Prototype" -> "production decode path since v0.3.0"
- Document CUTE_WO_SPLIT={2,4,8} opt-in K-parallel W_O GEMV (PR #9)
- Drop --debug from serve-cute.sh launch line; PIECEWISE is default
- Veitner acknowledgment: "next direction" -> "applied in WO_SPLIT"
- Markdown lint fixups (blank lines after headings, table separators)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Natfii <27841768+Natfii@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii added a commit that referenced this pull request May 17, 2026
…ec-decode fix + recipe + tokenizer patch (#20)

Scripts (model-first naming, backward-compat symlinks):
  serve.sh              -> serve-qwen35-triton.sh  (upstream triton_attn fallback)
  serve-cute.sh         -> serve-qwen35.sh         (CuTe = production decode default)
  serve-cute-full.sh    -> serve-qwen35-full.sh    (FULL_AND_PIECEWISE via blessed cache)
  serve-cute-mtp.sh     -> serve-qwen36.sh         (CuTe + qwen3_5_mtp + qwen3_xml parser)

serve-qwen36.sh:
  - Default HF_MODEL=natfii/Qwen3.6-27B-VLM-NVFP4-MTP (host-portable; HF override works)
  - QUANTIZATION='' default (vLLM auto-detect; modelopt-format hf_quant_config.json)
  - Auto-detect tokenizer_config TokenizersBackend wrapper + bind-mount the
    Qwen2Tokenizer patch from models/qwen36-tokenizer-patch (generalized from
    Huihui-specific path; transformers 4.57.6 image compat)
  - --tool-call-parser qwen3_xml (per froggeric discussion #9 recommended stack
    for Qwen3.6 chat/tool workloads)
  - Drop Huihui-specific defaults/hooks

Tokenizer patch dir rename:
  models/qwen3.6-mtp-tokenizer-patch/  ->  models/qwen36-tokenizer-patch/
  README rewritten as model-generic (Unsloth + Huihui + our quant all observed
  to ship the same TokenizersBackend wrapper).

vllm/v1/attention/backends/cute_paged/_backend.py — fusion_max_tokens fix:
  attach_fusion() now derives decode_query_len = 1 + num_speculative_tokens
  and allocates buffers + sizes the runtime fits_buffer gate (L1258) at
  fusion_max_tokens = max_num_seqs * decode_query_len. Without this, with
  MTP=N>=1 every step had num_actual_tokens=2 > _fusion_max_num_seqs=1,
  so _fusion_active silently flipped False and the entire fusion path
  was a no-op — including CUTE_ATTN_FUSION=1 and the Phase E gate. The
  field name `_fusion_max_num_seqs` is kept (mis-)named for minimal
  blast radius; semantically it is now "max fusion tokens".
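
  The sizing change can be illustrated with a few lines of arithmetic. This is a sketch of the described rule, not the code in _backend.py; the helper names are hypothetical and the gate is assumed to be a simple "tokens fit in capacity" comparison.

```python
def fusion_max_tokens(max_num_seqs, num_speculative_tokens):
    """Buffer capacity in tokens: each sequence decodes 1 + N tokens per step."""
    decode_query_len = 1 + num_speculative_tokens
    return max_num_seqs * decode_query_len


def fits_buffer(num_actual_tokens, capacity):
    """Assumed shape of the runtime gate: fusion runs only if the step fits."""
    return num_actual_tokens <= capacity


# MTP=1, max_num_seqs=1: every decode step carries 2 tokens.
old_capacity = 1                        # sized by max_num_seqs alone
new_capacity = fusion_max_tokens(1, 1)  # sized by tokens

fusion_before = fits_buffer(2, old_capacity)  # gate fails, fusion is a no-op
fusion_after = fits_buffer(2, new_capacity)   # gate passes
```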

  Empirics on Qwen3.6-27B-VLM-NVFP4-MTP, MTP=1, max_num_seqs=1, GSM8K-50:
    wo1               : 47/50  2487.7s  (baseline)
    wo8               : 47/50  2164.8s  (-12.98% vs wo1)
    wo8 + mlpfuse     : 47/50  2166.3s  (+0.07% vs wo8, noise)
    wo8 + attnfuse    : 47/50  2647.5s  (+22.30% vs wo8, regression)
  ATTN_FUSION stays off-default; with the patch the regression is now
  measurable (was previously masked as no-op). Kernel needs re-tuning for
  the num_q=2 MTP decode shape before re-enabling.

README:
  - Serve scripts table reflects new names + serve-qwen36.sh active
  - Roadmap adds "Now — Qwen3.6-27B bring-up" section
  - Backward-compat symlinks documented

Recipe + checkpoint published at https://huggingface.co/natfii/Qwen3.6-27B-VLM-NVFP4-MTP
  (internal testing artifact; lna-lab/GGUF-to-NVFP4-SM120 modelopt + MTP graft).

Quant scripts (training/quantize_qwen36_27b_vlm_mtp.{sh,py}) intentionally
left uncommitted — training/ is gitignored, matches the existing
quantize_qwen35_27b.{sh,py} pattern.

Co-authored-by: Natfii <27841768+Natfii@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>