evidence(wo_split prod soak): full {1,2,4,8} sweep + writeup by Natfii · Pull Request #9 · Navi-AI-Lab/nvllm

Natfii · 2026-05-07T12:15:42Z

Summary

- Production soak across CUTE_WO_SPLIT ∈ {1, 2, 4, 8} on Qwen3.5-27B-NVFP4. Relative to the wo1 baseline, wo8 cuts mean wall time by 3.3% and p95 TPOT by 25.5 ms; GSM8K stays within 1/50 of baseline on every arm.
- All optimized arms get the verdict "keep opt-in", not new default. Wall gains plateau after wo4; wo8's main advantage at production batch sizes is reduced wall-time variance (stddev 1.98 s vs wo1's 6.74 s).
- Production decoder cross-check captured: the wo8 supplementary region_timings.npy confirms phase1_wo_gemv running at 32 active CTAs in real serving (8× the 4-CTA baseline), median 2359.8 μs / p99 2468.9 μs. The optimized W_O path is engaged in real serving, not just in the harness.
- Companion to PR #7, "evidence(W_O K-parallel): validation harness + 8.39x sweep + NCU memory-bound" (controlled harness, 8.39× kernel-level speedup with NCU memory-bound classification), and PR #8, "feat(wo_split=8): opt-in K-parallel W_O GEMV prototype" (feat/wo-split-8-prototype, the implementation under measurement).

Per-arm headline

arm	gsm8k	wall mean (s)	wall stddev (s)	tpot p50 (ms)	tpot p95 (ms)	longdecode p95
wo1	48/50	8104.75	6.74	467.98	510.73	518.54
wo2	47/50	7910.47	3.39	450.43	493.07	500.75
wo4	48/50	7829.37	4.22	443.63	486.66	494.26
wo8	47/50	7833.98	1.98	441.94	485.21	491.69
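
The headline deltas in the Summary can be re-derived from this table. The snippet below is a minimal sketch, not part of the committed tooling: the dictionary layout and the function name `deltas_vs_baseline` are illustrative, and the numbers are copied by hand from the rows above.

```python
# Headline numbers copied from the per-arm table above.
ARMS = {
    "wo1": {"wall_mean_s": 8104.75, "tpot_p95_ms": 510.73},
    "wo2": {"wall_mean_s": 7910.47, "tpot_p95_ms": 493.07},
    "wo4": {"wall_mean_s": 7829.37, "tpot_p95_ms": 486.66},
    "wo8": {"wall_mean_s": 7833.98, "tpot_p95_ms": 485.21},
}


def deltas_vs_baseline(arms, baseline="wo1"):
    """Return {arm: (wall delta in %, p95 TPOT delta in ms)} relative to baseline."""
    base = arms[baseline]
    out = {}
    for name, row in arms.items():
        if name == baseline:
            continue
        wall_pct = 100.0 * (row["wall_mean_s"] - base["wall_mean_s"]) / base["wall_mean_s"]
        tpot_ms = row["tpot_p95_ms"] - base["tpot_p95_ms"]
        out[name] = (wall_pct, tpot_ms)
    return out


if __name__ == "__main__":
    for arm, (wall_pct, tpot_ms) in deltas_vs_baseline(ARMS).items():
        print(f"{arm}: wall {wall_pct:+.1f}%, p95 TPOT {tpot_ms:+.1f} ms")
```

On these inputs wo4 and wo8 land within a tenth of a percentage point of each other on wall time, which is the plateau the Summary describes.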

Scope

Run scripts under docs/research/2026-05-04-wo-split-prod-soak/:
- runner.sh — primary (5 ShareGPT replays + 5 longdecode replays + 2-concurrent + GSM8K-50) and supplementary (bounded ShareGPT under profiler + region timing) per arm
- _replay.py, coherence_check.py, longdecode_prompt.txt, sharegpt_slice.jsonl — replay tooling and inputs
- wo1_region_pass.sh — failed-attempt repro for the wo1 baseline region npy capture (kept as the receipt for the limitation noted below)
Evidence under benchmarks/nvllm/traces/wo_split_prod_soak/2026-05-04-soak/:
- summary.md — auto-generated per-arm aggregate
- writeup.md — human-readable companion (header pin, headline bullets, harness-cited kernel claim, wo8 production npy region table, known-limitations section, repro commands)
- wo{1,2,4,8}/primary/ — full primary evidence per arm with DONE markers
- wo8/supplementary/sharegpt_region_timings.npy — the only region npy that survived (auto-dump fired only on the wo_split=8 code path)
- wo{1,4}/supplementary*/ — partial artifacts from runs whose POST /stop_profile hung the SoC

Total new evidence: ~17 MB across 270 files.

Production decoder region breakdown (wo8)

Per-region medians (μs) computed directly from wo8/supplementary/sharegpt_region_timings.npy:

region	active CTAs	median μs	p99 μs
`phase1_wo_gemv`	32	2359.8	2468.9
`grid_barrier_wait`	64	1809.6	2868.5
`phase3_3a_fc1_silu`	64	566.8	615.9
`phase4_residual`	28	230.3	230.4
`phase1_attn_pre_wo`	4	229.9	230.0
`phaseE_post`	1	162.9	162.9
`phase3_3c_fc2_atomic`	64	116.5	176.2

grid_barrier_wait at 1809.6 μs median is the largest single component after the W_O reduction; the cooperative-launch grid barrier is the natural next target if a future iteration aims to recover more time inside the β-coop kernel.
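
As a quick cross-check of the ordering in the table, the sketch below ranks regions by median and computes a p99/median tail ratio. It works on the table values only; it does not decode the npy file, whose layout is not described here. The tuple layout and helper names are illustrative assumptions.

```python
# (region, active CTAs, median us, p99 us), copied from the wo8 table above.
REGIONS = [
    ("phase1_wo_gemv", 32, 2359.8, 2468.9),
    ("grid_barrier_wait", 64, 1809.6, 2868.5),
    ("phase3_3a_fc1_silu", 64, 566.8, 615.9),
    ("phase4_residual", 28, 230.3, 230.4),
    ("phase1_attn_pre_wo", 4, 229.9, 230.0),
    ("phaseE_post", 1, 162.9, 162.9),
    ("phase3_3c_fc2_atomic", 64, 116.5, 176.2),
]


def rank_by_median(rows):
    """Sort regions from largest to smallest median duration."""
    return sorted(rows, key=lambda r: r[2], reverse=True)


def tail_ratio(row):
    """p99 divided by median: a rough indicator of how heavy the tail is."""
    return row[3] / row[2]


if __name__ == "__main__":
    for name, ctas, median_us, p99_us in rank_by_median(REGIONS):
        ratio = tail_ratio((name, ctas, median_us, p99_us))
        print(f"{name:22s} median={median_us:8.1f} us  p99/median={ratio:.2f}")
```

Among these rows, grid_barrier_wait ranks second by median and has the widest p99/median spread.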

Known limitations & failed attempts

Documented in writeup.md § "Known limitations & failed attempts":

- No same-run wo_split=1 serving region npy. The auto-dump hook does not fire on the wo_split=1 code path under serving in this config. Patching the backend would change the code under measurement, so we did not.
- Torch profiler stop_profile can hard-reboot the host. Reproduced on wo4 (2026-05-05) and wo1 (2026-05-07). Bounds (limit_requests=4, max_prompt_chars=5500) prevent crashes during replay but not during the post-replay kineto flush.
- Region-timing extract tool error. extract_regions.py fails on empty trace dirs. Worked around inline via direct npy decode.

Test plan

- Primary GSM8K-50 + 5×ShareGPT + 5×longdecode + 2-concurrent on all 4 arms (wo1/wo2/wo4/wo8), DONE markers committed
- Coherence check on longdecode replays, committed under wo*/primary/run*/longdecode_coherence.json
- wo8 supplementary region npy captured + breakdown computed
- Failed-attempt repro script committed (wo1_region_pass.sh)
- writeup.md link audit (13 local md links resolve)
- Reviewer reproduce primary: WO_SPLITS="1,2,4,8" bash docs/research/2026-05-04-wo-split-prod-soak/runner.sh
- Reviewer reproduce region breakdown: see writeup.md § "How to reproduce"

Notes

- AI-assisted: soak orchestration, region npy decode, writeup, failed-attempt diagnosis, and PR description done with Claude Opus 4.7. Run-time decisions (kill stuck curl, redirect after host crashes, framing of slope-vs-absolute claim) made under direct user direction.
- Companion of PR #7, "evidence(W_O K-parallel): validation harness + 8.39x sweep + NCU memory-bound" (controlled NCU harness, kernel-level 8.39× speedup), and PR #8, "feat(wo_split=8): opt-in K-parallel W_O GEMV prototype" (feat/wo-split-8-prototype, the implementation). Together: the kernel claim is grounded in PR #7, the production impact is in this soak, and the code under measurement is in PR #8.
- No production code changed in this PR. Evidence + writeup + run scripts only.

🤖 Generated with Claude Code

Adds Phase_E_Beta_Kernel.wo_split (env CUTE_WO_SPLIT, default 1, bounded by slice_ctas) and threads it through _coop_full_compile_key. At wo_split=1 the kernel behavior is unchanged. Disk cache will distinguish wo_split variants once subsequent tasks add a kernel body change.
Verified: cache MISS observed on first launch with new key ("first call for this config" log at phase_e_kernel.py:3170); smoke probe returned coherent /v1/completions output post-warmup.
Task 1 of 12 (wo_split=8 production prototype, plan at /home/natfii/.claude/plans/sorted-crafting-rainbow.md).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Unbundles total_ctas_per_seq_attn into two concepts:
- total_ctas_per_seq_attn (= num_kv_heads = 4): R1 attn-producer mask
- total_wo_slots (= num_kv_heads * wo_split = 4 at wo_split=1): drives wo_output stride, gather loop, election target, counter reset
Slot-index formula bx*num_kv_heads+by stays legacy in this task; Task 8 lifts to by*wo_split+bx alongside the K-parallel kernel body.
At wo_split=1, total_wo_slots == 4 == legacy total_ctas_per_seq_attn, so the address math is bit-exact. Cache key picks up new function fingerprint via the Int32 arg addition.
Task 2 of 12.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replaces the literal `4` in the β-coop _phase_e_coop_wo_output allocation with self.num_kv_heads * self._phase_e_coop_kernel.wo_split. At wo_split=1 the dim 1 is still 4 (no-op refactor); at wo_split=8 later, this expands to 32 slots.
β-lite wo_output allocation at _backend.py:399-403 unchanged (β-lite uses a different code path, out of scope for this plan).
Reset op _wo_output_reset_op.py shape preconditions are already generic (dim()==3); only the error-message string is updated to reflect the new shape semantic.
Task 3 of 12.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds R11 (phase1_pre_wo_wait) and R12 (phase1_gather_reduce) to the region-timing taxonomy. Host- and kernel-side constants bump together to avoid the buffer-stride mismatch that an unbundled change would introduce (host expects 13×16=208 bytes per CTA stride, kernel writes 11×16=176: out-of-bounds or under-read between tasks).
Region classification is split: PHASE1_REGIONS stays {1,2,3}; WAIT_NOT_WORK_REGIONS gains R11 (consumer wait); new DYNAMIC_SINGLE_CTA_REGIONS = {12} for the elected single-CTA gather. This avoids the reducer's first-match-wins if/elif misclassifying R11 as parallel phase1 work.
Files:
- region_timing.py: REGION_NAMES (+2), region-class sets, _phase1_wo_split_cta_ids helper, reducer branch
- _backend.py: _REGION_TIMING_NUM_REGIONS = 13
- phase_e_kernel.py: _region_timing_num_regions = 13; _REGION_TIMING_PER_CTA_STRIDE module constant; 22 Int64(11*2*8) sites → Int64(_REGION_TIMING_PER_CTA_STRIDE)
- extract_regions.py: --wo-split arg; dispatch to wo_split helper for R2/R3/R11/R12 when wo_split>1
At wo_split=1: R11 mask never fires (bx>0 && bx<1 empty), R12 records gather time, R0-R10 numerically identical to prior baseline. No behavioral change at default.
Tasks 4+5 of 12 (combined for atomicity).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
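
The stride arithmetic in this commit can be illustrated as follows. This is a sketch under one assumption read off the `Int64(11*2*8)` expression: each region stores two 8-byte samples per CTA (taken here to be an entry tick and an exit tick). The helper names are hypothetical and do not come from the repository.

```python
SAMPLES_PER_REGION = 2   # assumed: one entry tick and one exit tick per region
BYTES_PER_SAMPLE = 8     # Int64


def per_cta_stride(num_regions):
    """Bytes reserved per CTA in the region-timing buffer."""
    return num_regions * SAMPLES_PER_REGION * BYTES_PER_SAMPLE


def cta_base_offset(cta_id, num_regions):
    """Byte offset at which a CTA's timing rows start."""
    return cta_id * per_cta_stride(num_regions)


def mismatched_ctas(num_ctas, host_regions, kernel_regions):
    """CTAs whose rows the host would look for at a different offset than the kernel wrote."""
    return [
        cta
        for cta in range(num_ctas)
        if cta_base_offset(cta, host_regions) != cta_base_offset(cta, kernel_regions)
    ]


if __name__ == "__main__":
    print(per_cta_stride(13), per_cta_stride(11))  # host vs kernel stride
    print(len(mismatched_ctas(64, host_regions=13, kernel_regions=11)))
```

With the host at 13 regions and the kernel at 11, every CTA except CTA 0 is read from the wrong offset, which is why both constants move in one commit.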

Lays the pre-W_O barrier infrastructure for the K-parallel W_O GEMV that lands in Task 8. At wo_split=1 the infrastructure is dormant: the consumer mask `bx > 0 && bx < 1` is empty, so no CTA spins and R11 buffer rows stay zero (host nonzero filter drops them).
Counter: _phase_e_coop_pre_wo_arrival_count, allocated in attach_mlp_fusion, zeroed per-launch via host .zero_() inside run_beta_coop_full (mirrors phase1_arrival_count reset pattern at line ~3120).
Producer (bx==0 && by<num_kv_heads): _threadfence + sync_threads + tid0 atomic_add 1 to pre_wo_arrival_count after attn output written. Placed inside the existing bx==0 parent block, between R1 exit and R2 entry timing.
Consumer (bx>0 && bx<wo_split && by<num_kv_heads): R11 entry sample, spin-wait via _ld_volatile_u32 until counter == num_kv_heads, _acquire_fence + sync_threads, R11 exit sample. Placed at kernel-level (outside bx==0 parent) immediately before the existing R4 grid-barrier entry. Dead at wo_split=1.
Cooperative=True invariant preserved on β-coop launch (CLAUDE.md rule 8: atomic-counter spin-wait barriers must run cooperative).
Cache MISS confirmed on first launch ("first call for this config" log line); function fingerprint shifted by the new pre_wo_arrival_ptr arg and new R11 timing sites.
Task 6 of 12.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
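
The producer and consumer masks above can be enumerated on the host to see which CTAs take part in the barrier. This is an illustrative sketch only: the grid shape (`grid_x=16`, `grid_y=4`) is an assumption chosen for the example, and `classify_ctas` is a hypothetical helper, not a function in the kernel.

```python
def classify_ctas(wo_split, num_kv_heads=4, grid_x=16, grid_y=4):
    """Enumerate (bx, by) pairs matching the producer and consumer masks.

    grid_x and grid_y are assumed values for illustration; the counts below
    only require grid_x >= wo_split and grid_y >= num_kv_heads.
    """
    producers, consumers = [], []
    for by in range(grid_y):
        for bx in range(grid_x):
            # Producer mask: bx == 0 && by < num_kv_heads
            if bx == 0 and by < num_kv_heads:
                producers.append((bx, by))
            # Consumer mask: bx > 0 && bx < wo_split && by < num_kv_heads
            if 0 < bx < wo_split and by < num_kv_heads:
                consumers.append((bx, by))
    return producers, consumers


if __name__ == "__main__":
    for split in (1, 2, 4, 8):
        producers, consumers = classify_ctas(split)
        print(f"wo_split={split}: {len(producers)} producers, {len(consumers)} consumers")
```

At wo_split=1 the consumer list is empty, so the barrier is dormant; at wo_split=8 it holds (8 - 1) × 4 = 28 CTAs, each waiting for the counter to reach num_kv_heads = 4.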

Wraps the existing is_last_cta gather block in _kernel_phase_0_to_4 with R12 entry/exit timing samples. Only the elected CTA writes a tick; all other CTAs leave R12 slots at zero. Host reducer (region_timing.py:208) drops zeros and reports the elected tick as median/mean for R12.
R12 is in DYNAMIC_SINGLE_CTA_REGIONS (region_timing.py); reducer classifies as "dynamic_single" with NaN frac_of_kernel (not parallel work).
At wo_split=1 the gather sums num_kv_heads=4 partials (legacy); post-Task-8 at wo_split=8 the gather sums 32 partials. R12 captures the duration in either case.
Task 7 of 12.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
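
A rough sketch of the reduction described above, assuming the host sees one R12 duration per CTA with zero meaning "this CTA was not elected". The function name and the returned dictionary keys other than "dynamic_single" and frac_of_kernel are illustrative; the real reducer in region_timing.py may be structured differently.

```python
import math
import statistics


def reduce_single_cta_region(durations):
    """Drop zero rows and summarize what is left for a single-elected-CTA region."""
    nonzero = [d for d in durations if d != 0]
    if not nonzero:
        return None  # no CTA recorded a tick for this region
    return {
        "class": "dynamic_single",
        "median": statistics.median(nonzero),
        "mean": statistics.fmean(nonzero),
        "frac_of_kernel": math.nan,  # not parallel work, so no fraction is reported
    }


if __name__ == "__main__":
    # 64 CTAs, only one of which (the elected CTA) wrote a duration.
    per_cta = [0] * 64
    per_cta[17] = 137
    print(reduce_single_cta_region(per_cta))
```

Because exactly one CTA contributes, median and mean coincide with the elected CTA's duration.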

Lifts the W_O block out of the legacy `bx == 0 && by < num_kv_heads` attn-producer parent. New gate: `bx < wo_split && by < num_kv_heads`. At wo_split=1 the new gate is equivalent to the old (bx<1 ≡ bx==0) and behavior is bit-exact preserved against the pre-Task-8 baseline.
K-range slicing follows torch_reference.py:443-446 exactly:
  K_per_head = K // num_kv_heads
  k_start_in_head = (K_per_head * bx) // wo_split
  k_end_in_head = (K_per_head * (bx + 1)) // wo_split
  k_start = by * K_per_head + k_start_in_head
  k_end = by * K_per_head + k_end_in_head
Slot index: slot_idx = by * wo_split + bx (matches torch_reference.py:438-439 slot_id // wo_split == by, slot_id % wo_split == bx).
`wo_split_const: cutlass.Constexpr[int]` is threaded through both the @cute.jit host wrapper and the @cute.kernel body, sourced from self.wo_split at trace time. Cache key already includes self.wo_split (Task 1) so flipping the env spawns a fresh compile.
The legacy W_O+gather block is moved out of the attn-producer parent and placed at kernel-level after R11 (pre_wo_wait) so all W_O CTAs (bx ∈ [0, wo_split), by < num_kv_heads) execute it. The election counter target stays at total_wo_slots-1 (= num_kv_heads*wo_split - 1) which scales naturally with the new W_O CTA count.
Bit-exact algorithm gate against reference_split_order:
- Harness microkernel @ docs/research/2026-05-03-w-o-k-parallel-harness/ reproduces this exact K-range/slot formula and reports max_abs == 0.000e+00 vs reference_split_order(wo_split=N) at both wo_split=1 and wo_split=8.
Production kernel verification (synthetic repro at /tmp/wo_split_repro.py):
- wo_split=1: run_beta_coop_full completes; wo_output[:,0,:] FINITE; identical stats min=-7.84e+06 max=8.22e+06 mean=2.60e+04.
- wo_split=8: run_beta_coop_full completes; wo_output[:,0,:] FINITE; identical stats; max_abs vs wo_split=1 = 7.0 (≈1 ULP at FP32 8M magnitude; expected K-parallel reorder noise on K=6144 mixed-sign random data; identical to harness wo_split=1-vs-wo_split=8 drift).
Serve smoke at both wo_split=1 (default) and CUTE_WO_SPLIT=8 produces identical coherent /v1/completions output across the three test prompts:
- "What is 2+2?" → "2+2 equals 4..."
- "Capital of France?" → "<think>...What is the capital of France?..."
- "Write a haiku about coding." → "<think>...Topic: Coding. Format: Haiku..."
Cache MISS confirmed for the wo_split=8 + Task-8 config: "Compiling PhaseE_Beta_Kernel β-coop full (first call for this config)…"
Task 8 of 12.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
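
The K-range and slot-index formulas quoted in this commit can be exercised in plain Python to check that the slices tile K without gaps or overlap. This is a host-side sketch, not the CuTe kernel and not the contents of torch_reference.py; the function names are illustrative. K=6144 and num_kv_heads=4 are the values mentioned in the commit text.

```python
def k_range(bx, by, K, num_kv_heads, wo_split):
    """Half-open K slice [k_start, k_end) owned by W_O CTA (bx, by)."""
    K_per_head = K // num_kv_heads
    k_start_in_head = (K_per_head * bx) // wo_split
    k_end_in_head = (K_per_head * (bx + 1)) // wo_split
    return by * K_per_head + k_start_in_head, by * K_per_head + k_end_in_head


def slot_index(bx, by, wo_split):
    """Output slot written by CTA (bx, by); inverse is (slot % wo_split, slot // wo_split)."""
    return by * wo_split + bx


def tiles(K, num_kv_heads, wo_split):
    """Map every slot index to its K slice for all CTAs passing the W_O gate."""
    out = {}
    for by in range(num_kv_heads):
        for bx in range(wo_split):
            out[slot_index(bx, by, wo_split)] = k_range(bx, by, K, num_kv_heads, wo_split)
    return out


if __name__ == "__main__":
    t = tiles(K=6144, num_kv_heads=4, wo_split=8)
    for slot in (0, 1, 31):
        print(slot, t[slot])
```

Ordering slots by `by * wo_split + bx` makes consecutive slots cover consecutive K slices, so a gather that walks the slots in order sums the partials in increasing-K order.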

The kernel's wo_split bounds are robust for arbitrary 1..slice_ctas, but only the powers-of-2 subset {1, 2, 4, 8} has the bench/correctness story this PR ships. reference_split_order at torch_reference.py only validates these four values, and the harness sweep evidence likewise covers only this set.
Don't expose unevidenced settings (3/5/6/7) accidentally: the assert fails fast on init if a user sets CUTE_WO_SPLIT to a non-evidenced value. Comment block at the field documents the intent so a future contributor knows the kernel logic itself isn't the gating constraint.
Pre-Task-10 cleanup. Subsequent baseline (Task 10) and graduation (Task 11) traces will be captured against this restricted set.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
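
A fail-fast check of this kind could look roughly like the sketch below. The names `EVIDENCED_WO_SPLITS` and `resolve_wo_split` are hypothetical and the real assert lives in the kernel class's init; only the env var name, the default of 1, the allowed set, and the slice_ctas bound are taken from the commit messages.

```python
import os

EVIDENCED_WO_SPLITS = (1, 2, 4, 8)  # the only values with bench/correctness evidence


def resolve_wo_split(slice_ctas, env=None):
    """Read CUTE_WO_SPLIT (default 1) and reject values outside the evidenced set."""
    env = os.environ if env is None else env
    wo_split = int(env.get("CUTE_WO_SPLIT", "1"))
    assert wo_split in EVIDENCED_WO_SPLITS, (
        f"CUTE_WO_SPLIT={wo_split} is not in the evidenced set {EVIDENCED_WO_SPLITS}"
    )
    assert wo_split <= slice_ctas, (
        f"CUTE_WO_SPLIT={wo_split} exceeds slice_ctas={slice_ctas}"
    )
    return wo_split


if __name__ == "__main__":
    print(resolve_wo_split(slice_ctas=16, env={"CUTE_WO_SPLIT": "8"}))
    print(resolve_wo_split(slice_ctas=16, env={}))
```

Passing the environment as a mapping keeps the check testable without mutating the process environment.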

Original Task 9 plan ("parameterize host Phase 1 mask helpers on wo_split") was already subsumed by the Task 4+5 combined dispatch. Repurposed to address the kernel-side cleanups flagged by the Task 8 spec+quality review:
#2 (Important): R11 timing/spin/exit gates now use wo_split_const instead of self.wo_split, matching the W_O block (Task 8). Both are bound from int(self.wo_split) in the same JIT compile call, but mixing the two in the same kernel body forced readers to verify equivalence. Now uniform across the kernel body.
#3 (Minor): Hoisted single pre_wo_consumer_active = (bx>0 && bx<wo_split_const && by<num_kv_heads) above the R11 entry; reused at entry timing, spin gate, and exit timing. Removes the duplicate pre_wo_consumer_active2 copy-paste artifact.
#4 (Minor): Dropped "# NEW:" prefix from the wo_split cache-key inline comment; the marker would go stale at PR.
#5 (Real, fixed in same diff via the L253 comment block): bound-restriction comment now points to docs/research/2026-05-03-w-o-k-parallel-harness/torch_reference.py (the committed path) instead of /tmp/wo_split_repro_workdir/torch_reference.py (machine-local transient).
#6 (Minor): Added 3-line comment block before the new pre_wo_consumer_active declaration explaining bx==0 producers skip R11 because their attn_output reads are intra-CTA (the cross-CTA safety derivation that the spec reviewer pointed out was undocumented).
Deferred to merge-prep (per user direction):
- #1: total_ctas_per_seq_attn dead-arg cleanup (Task 12 PR-prep)
- #7: cutlass.const_expr gate on wo_split=1 producer fence/atomic (revisit if Task 10/11 evidence shows wo_split=1 overhead matters)
Pure refactor: bit-exact gate at wo_split=1 AND wo_split=8 still passes with max_abs == 0.0 against reference_split_order. Cache MISS on first launch (wo_split_const reference and mask hoist change the PTX even though numerics are identical at runtime).
Task 9 of 12.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vLLM's EngineCore subprocess (typically pid 146) strips most docker -e env vars from its parent (per feedback_vllm_enginecore_env_strip in project memory). Without this workaround, CUTE_WO_SPLIT=8 set on docker run never reaches Phase_E_Beta_Kernel.__init__ and the kernel falls back to the default wo_split=1.
Workaround mirrors the existing CUTE_C2_DIAG_* sentinel pattern:
1. scripts/serve-cute.sh writes CUTE_WO_SPLIT=${CUTE_WO_SPLIT:-1} to the /tmp/c2_diag/ENV file (already bind-mounted into the container).
2. vllm/nvllm/models/qwen3_5.py reads /tmp/c2_diag/ENV at module import and calls os.environ.setdefault for any line matching CUTE_C2_* OR CUTE_WO_SPLIT=. The setdefault skips when the var is already set, so real env wins.
Verified end-to-end on the live container today: with CUTE_WO_SPLIT=8 set on the host shell, serve-cute.sh writes the sentinel, EngineCore reads it, PhaseE_Beta_Kernel constructs with self.wo_split=8, region timing buffer shows R2 active CTAs = 32 (was 4), R11 active CTAs = 28 (consumer mask fires for bx>0).
Pre-Task-12 cleanup. Required for the wo_split=8 graduation evidence in Tasks 10/11 to be reproducible.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
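
The sentinel-file read in step 2 might be sketched as below. This is an assumption-laden illustration, not the code in qwen3_5.py: the function names are hypothetical, and the file is assumed to hold one KEY=VALUE pair per line. Only the path, the key patterns, and the use of `os.environ.setdefault` come from the commit message.

```python
import os


def apply_sentinel_lines(lines, environ=None):
    """Apply KEY=VALUE lines for CUTE_C2_* and CUTE_WO_SPLIT without overriding real env."""
    environ = os.environ if environ is None else environ
    applied = {}
    for raw in lines:
        line = raw.strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.startswith("CUTE_C2_") or key == "CUTE_WO_SPLIT":
            if key not in environ:
                applied[key] = value
            # setdefault leaves an already-set variable untouched, so real env wins.
            environ.setdefault(key, value)
    return applied


def load_sentinel_file(path="/tmp/c2_diag/ENV", environ=None):
    """Read the sentinel file if it exists; a missing file is treated as empty."""
    try:
        with open(path, encoding="utf-8") as fh:
            return apply_sentinel_lines(fh, environ)
    except FileNotFoundError:
        return {}


if __name__ == "__main__":
    env = {"CUTE_WO_SPLIT": "4"}  # pretend the variable survived in the real environment
    print(apply_sentinel_lines(["CUTE_WO_SPLIT=8", "CUTE_C2_DIAG_X=1", "PATH=/bin"], env))
    print(env)
```

Unrelated keys such as `PATH` are ignored, which keeps the sentinel from becoming a general-purpose environment override.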

Task 8 quality review flagged total_ctas_per_seq_attn as dead plumbing: defined in run_beta_coop_full host wrapper, plumbed through 4 kernel signature levels, but never consumed in the kernel body (the R1 attn-pre-W_O mask uses the literal `by < Int32(4)` directly, not the arg). This was acceptable as Task 8 ships because removing it is a separate refactor and the bit-exact gate already verified the kernel produces correct output regardless of the dead arg. Removed now as merge-prep cleanup before Task 12 (PR open).
Removed 5 sites:
- Host wrapper definition
- all_args tuple pack
- _jit_launch_phase_0_to_4 sig
- _jit_launch_phase_0_to_4 forward to inner kernel call
- _kernel_phase_0_to_4 sig
Comment block at the host wrapper now documents the literal `4` in the R1 mask: it stays a literal because wo_split scales the W_O CTA count, NOT the attn-producer count (which is always num_kv_heads = 4 for Qwen3.5-27B).
Bit-exact gate against reference_split_order still passes at both wo_split=1 (max_abs=0) and wo_split=8 (max_abs=0).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Plan for evidence/wo-split-prod-soak: 4-arm sweep (wo_split=1/2/4/8), N=5 sequential workload replays per arm, four phases per arm:
- GSM8K-50 anchor under /no_think
- ShareGPT mixed-length serve (deterministic ~30 multi-turn convs from anon8231489123/ShareGPT_Vicuna_unfiltered)
- 2048-token long decode probe
- lightweight 2-concurrent probe
Two measurement modes per arm: primary CUTE_BETA_REGION_TIMING=0 for clean wall/TPOT, supplementary CUTE_BETA_REGION_TIMING=1 for region breakdown across R2/R11/R12. nsys budget: 8 representative traces (4 arms x phases B+C), not on perf-measurement runs.
Quality gate: GSM8K floor >=30/50 AND pairwise within 2 questions of wo_split=1 baseline. Default candidate requires >=5% wall improvement vs baseline, no GSM8K regression, TPOT p95 not worse. Successor branch feat/wo-split-N-prototype where N is the verdict-selected value (4 or 8); ties broken by TPOT p95.
Branch force-reset from feat/wo-split-8-prototype@69c530082; the 11 kernel-feature + sentinel + cleanup commits are kept; two superseded single-arm evidence/docs commits dropped. PR ships the production W_O K-parallel feature (opt-in via CUTE_WO_SPLIT) plus the soak evidence proving whether it is safe to default; default flip remains a conditional follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
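
The quality gate and default-candidate rule in this plan can be written as a small decision function. This is a sketch of one reading of the rule, not the soak results parser: the function name, the "reject" label, and folding "no GSM8K regression" into the within-2-questions gate are all assumptions made here. The example rows are the wo1 and wo8 headline numbers from the PR description.

```python
def verdict(arm, baseline, gsm8k_floor=30, gsm8k_tolerance=2, min_wall_gain_pct=5.0):
    """Classify an arm as 'reject', 'keep opt-in', or 'default candidate'."""
    # Quality gate: absolute floor AND within tolerance of the baseline score.
    quality_ok = (
        arm["gsm8k"] >= gsm8k_floor
        and abs(arm["gsm8k"] - baseline["gsm8k"]) <= gsm8k_tolerance
    )
    if not quality_ok:
        return "reject"  # label is an assumption; the plan does not name this outcome
    wall_gain_pct = 100.0 * (baseline["wall_mean_s"] - arm["wall_mean_s"]) / baseline["wall_mean_s"]
    tpot_ok = arm["tpot_p95_ms"] <= baseline["tpot_p95_ms"]
    if wall_gain_pct >= min_wall_gain_pct and tpot_ok:
        return "default candidate"
    return "keep opt-in"


if __name__ == "__main__":
    wo1 = {"gsm8k": 48, "wall_mean_s": 8104.75, "tpot_p95_ms": 510.73}
    wo8 = {"gsm8k": 47, "wall_mean_s": 7833.98, "tpot_p95_ms": 485.21}
    print(verdict(wo8, wo1))
```

With the reported numbers, wo8 passes the quality gate and improves p95 TPOT but falls short of the 5% wall threshold, which matches the "keep opt-in" verdict in the PR description.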

Add ShareGPT slice generation and deterministic committed input, the long-decode prompt, warning-only coherence checks, and the soak results parser.
Co-authored-by: OpenAI Codex <codex@openai.com>

Update the soak plan so supplementary serving traces use VLLM_TORCH_PROFILER_DIR instead of nsys, noting that V1 EngineCore subprocess capture is unreliable for CUPTI injection.
Co-authored-by: OpenAI Codex <codex@openai.com>

Add the multi-arm soak orchestrator and streaming replay helper, plus an opt-in serve-cute profiler hook so supplementary passes can use vLLM torch profiler endpoints.
Co-authored-by: OpenAI Codex <codex@openai.com>

The first run wedged overnight: profiler+region timing turned a 5109-char ShareGPT turn into a 138s response, the next 6787-char turn blew past the per-request HTTP timeout, _replay.py raised, set -e killed the runner, and the orphan container's profiler buffer eventually triggered a cgroup-OOM cascade (visible as udev mass-kill at 23:30 + snapd restart loop).
Three fixes:
* _replay.py: --http-timeout (primary name; --timeout kept as alias), --limit-requests, --max-prompt-chars. Length filter runs BEFORE the request count, so a pathological prompt cannot consume a budgeted slot.
* runner.sh: supplementary sharegpt runs with --http-timeout 1800 --limit-requests 4 --max-prompt-chars 5500. Replay call wrapped in set +e + PIPESTATUS; profiler_stop, region dump, and extraction run best-effort on failure without masking replay_rc.
* runner.sh main: two loops instead of one: all primaries (decision-critical) across every arm first, then all supplementaries (diagnostic). Outer primary_DONE / supplementary_DONE gates skip completed arms on resume so wo1's already-finished primary is not redone.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
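
The ordering constraint in the first fix (filter by length, then count) is easy to get backwards, so a minimal sketch may help. `select_requests` is a hypothetical helper, not the code in _replay.py; its parameters mirror the --max-prompt-chars and --limit-requests flags named above.

```python
def select_requests(prompts, max_prompt_chars=None, limit_requests=None):
    """Drop over-long prompts first, then take the first limit_requests survivors."""
    kept = [
        p for p in prompts
        if max_prompt_chars is None or len(p) <= max_prompt_chars
    ]
    return kept if limit_requests is None else kept[:limit_requests]


if __name__ == "__main__":
    # The second prompt is the pathological 6787-char turn from the incident.
    prompts = ["a" * 5109, "b" * 6787, "c" * 200, "d" * 300, "e" * 400]
    selected = select_requests(prompts, max_prompt_chars=5500, limit_requests=4)
    print([len(p) for p in selected])
```

If the count were applied first, the over-long prompt would occupy one of the four slots and only three requests would be replayed.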

New domain-specific guide for kernel, graph-captured op, and numerics-aware work. Linked from AGENTS.md "Domain-Specific Guides" section.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Full primary + supplementary evidence for the W_O K-parallel production soak across CUTE_WO_SPLIT ∈ {1, 2, 4, 8}. Run commit 5b8fc39, image sha256:9c0f1d31c92c.
Bundles:
- writeup.md: human-readable companion to the auto-generated summary.md (per-arm table, harness-cited kernel claim, wo8 production region npy cross-check, known-limitations section for failed profiler attempts)
- wo{1,2,4,8}/primary/: GSM8K-50 + 5×ShareGPT + 5×longdecode + 2-concurrent per arm with DONE markers; verdicts "keep opt-in" for wo2/wo4/wo8
- wo8/supplementary/sharegpt_region_timings.npy: only region npy that survived (auto-dump fired only on wo_split=8 path); 32-active-CTA phase1_wo_gemv = optimized W_O path engaged in real serving
- wo{1,4}/supplementary*/: partial artifacts from runs whose POST /stop_profile hung the SoC (wo4 host crash 2026-05-05; wo1 host crash 2026-05-07). Documented as profiling-tooling limitation
- docs/research/2026-05-04-wo-split-prod-soak/wo1_region_pass.sh: the failed-attempt repro script, kept as the receipt for the limitation
Force-add per evidence-dir convention; .gitignore filters .json/.log/.csv across the repo but trace evidence overrides.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Swap serve.sh model row to ig1/Qwen3.5-27B-NVFP4 (default since 04/06)
- CuTe paged: "Prototype" -> "production decode path since v0.3.0"
- Document CUTE_WO_SPLIT={2,4,8} opt-in K-parallel W_O GEMV (PR #9)
- Drop --debug from serve-cute.sh launch line; PIECEWISE is default
- Veitner acknowledgment: "next direction" -> "applied in WO_SPLIT"
- Markdown lint fixups (blank lines after headings, table separators)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

#16)
* chore(cute-paged): retire warmup.py + closed-arc trace-capture scripts
Pure deletion of dead/retired code. No behavior change. No new feature. No perf claim. Not part of either active thread (SSM correctness or beta kernel cost work); this is housekeeping that reduces noise floor for both.
Deletions:
- vllm/v1/attention/backends/cute_paged/warmup.py (165 lines): module was retired with the FULL+blessed AOT cache work; Dockerfile.gb10 explicitly removed the build-time invocation; no live importers in vllm/ or scripts/ (only stale references in docs/superpowers/specs, which are historical design records).
- scripts/phase_d_trace_capture.sh
- scripts/phase_d2_trace_capture.sh
- scripts/phase_d2e_trace_capture.sh
- scripts/phase_d3a_sweep.sh
  All four scripts belong to the closed D2.x cherry-pick bisection arc; only referenced from their own trace-summary receipts under benchmarks/nvllm/traces/cute_paged_mlp_fusion/ and from historical docs/superpowers/specs entries.
- scripts/fusion_phaseb_diff.py: Phase B harness; the equivalent diagnostic is now the CUTE_DEBUG_FUSION env-gated path in _backend.py.
Comment-only edit:
- _backend.py:71-73: drop the stale file-path reference to warmup.py (which no longer exists) from a comment about the disk-cache hook. No code change.
NOT in this PR (per memory:feedback_comment_not_delete + post-tool-use kernel-fusion-file hook): the 4 already-commented PHASE 3 ORIGINAL / Phase 4 epsilon-locals blocks inside _backend.py. The audit identified them as candidates but their comments self-justify retention for "Phase B/C debug recovery" and re-enable scenarios. Leaving them in place.
Net: -936 lines / +3 lines across 7 files. Force-pushable rollback is `git revert`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(readme): sync model default + CuTe paged status + WO_SPLIT opt-in
- Swap serve.sh model row to ig1/Qwen3.5-27B-NVFP4 (default since 04/06)
- CuTe paged: "Prototype" -> "production decode path since v0.3.0"
- Document CUTE_WO_SPLIT={2,4,8} opt-in K-parallel W_O GEMV (PR #9)
- Drop --debug from serve-cute.sh launch line; PIECEWISE is default
- Veitner acknowledgment: "next direction" -> "applied in WO_SPLIT"
- Markdown lint fixups (blank lines after headings, table separators)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Natfii <27841768+Natfii@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ec-decode fix + recipe + tokenizer patch (#20)
Scripts (model-first naming, backward-compat symlinks):
  serve.sh -> serve-qwen35-triton.sh (upstream triton_attn fallback)
  serve-cute.sh -> serve-qwen35.sh (CuTe = production decode default)
  serve-cute-full.sh -> serve-qwen35-full.sh (FULL_AND_PIECEWISE via blessed cache)
  serve-cute-mtp.sh -> serve-qwen36.sh (CuTe + qwen3_5_mtp + qwen3_xml parser)
serve-qwen36.sh:
- Default HF_MODEL=natfii/Qwen3.6-27B-VLM-NVFP4-MTP (host-portable; HF override works)
- QUANTIZATION='' default (vLLM auto-detect; modelopt-format hf_quant_config.json)
- Auto-detect tokenizer_config TokenizersBackend wrapper + bind-mount the Qwen2Tokenizer patch from models/qwen36-tokenizer-patch (generalized from Huihui-specific path; transformers 4.57.6 image compat)
- --tool-call-parser qwen3_xml (per froggeric discussion #9 recommended stack for Qwen3.6 chat/tool workloads)
- Drop Huihui-specific defaults/hooks
Tokenizer patch dir rename: models/qwen3.6-mtp-tokenizer-patch/ -> models/qwen36-tokenizer-patch/
README rewritten as model-generic (Unsloth + Huihui + our quant all observed to ship the same TokenizersBackend wrapper).
vllm/v1/attention/backends/cute_paged/_backend.py, fusion_max_tokens fix: attach_fusion() now derives decode_query_len = 1 + num_speculative_tokens and allocates buffers + sizes the runtime fits_buffer gate (L1258) at fusion_max_tokens = max_num_seqs * decode_query_len. Without this, with MTP=N>=1 every step had num_actual_tokens=2 > _fusion_max_num_seqs=1, so _fusion_active silently flipped False and the entire fusion path was a no-op, including CUTE_ATTN_FUSION=1 and the Phase E gate. The field name `_fusion_max_num_seqs` is kept (mis-)named for minimal blast radius; semantically it is now "max fusion tokens".
Empirics on Qwen3.6-27B-VLM-NVFP4-MTP, MTP=1, max_num_seqs=1, GSM8K-50:
  wo1            : 47/50 2487.7s (baseline)
  wo8            : 47/50 2164.8s (-12.98% vs wo1)
  wo8 + mlpfuse  : 47/50 2166.3s (+0.07% vs wo8, noise)
  wo8 + attnfuse : 47/50 2647.5s (+22.30% vs wo8, regression)
ATTN_FUSION stays off-default; with the patch the regression is now measurable (was previously masked as no-op). Kernel needs re-tuning for the num_q=2 MTP decode shape before re-enabling.
README:
- Serve scripts table reflects new names + serve-qwen36.sh active
- Roadmap adds "Now — Qwen3.6-27B bring-up" section
- Backward-compat symlinks documented
Recipe + checkpoint published at https://huggingface.co/natfii/Qwen3.6-27B-VLM-NVFP4-MTP (internal testing artifact; lna-lab/GGUF-to-NVFP4-SM120 modelopt + MTP graft).
Quant scripts (training/quantize_qwen36_27b_vlm_mtp.{sh,py}) intentionally left uncommitted: training/ is gitignored, matches the existing quantize_qwen35_27b.{sh,py} pattern.
Co-authored-by: Natfii <27841768+Natfii@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
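
The buffer-sizing fix described in this commit reduces to a few lines of arithmetic. The sketch below is illustrative only: `fusion_max_tokens` and `fits_buffer` are written here as free functions with assumed signatures, whereas the real logic lives inside attach_fusion() and the runtime gate in _backend.py.

```python
def fusion_max_tokens(max_num_seqs, num_speculative_tokens):
    """Capacity in tokens: each sequence decodes 1 token plus its speculative tokens."""
    decode_query_len = 1 + num_speculative_tokens
    return max_num_seqs * decode_query_len


def fits_buffer(num_actual_tokens, capacity):
    """Runtime gate: fusion stays active only if the step fits the allocated buffers."""
    return num_actual_tokens <= capacity


if __name__ == "__main__":
    max_num_seqs, mtp = 1, 1
    tokens_per_step = max_num_seqs * (1 + mtp)  # 2 tokens per decode step with MTP=1

    # Before the fix the capacity was max_num_seqs, so the gate failed every step.
    print(fits_buffer(tokens_per_step, capacity=max_num_seqs))
    # After the fix the capacity accounts for speculative tokens.
    print(fits_buffer(tokens_per_step, capacity=fusion_max_tokens(max_num_seqs, mtp)))
```

Sizing by tokens rather than sequences is what makes the fusion path, and therefore the attnfuse regression, observable at all under MTP.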

Natfii and others added 18 commits May 4, 2026 06:59

wo_split soak: input prep + offline analysis

17ea763

wo_split soak: use torch profiler for serving trace

955f8c2

wo_split soak: add serving runner

d6804d0

docs: add design-tenets guide + AGENTS.md pointer

829b32b

Natfii merged commit a131443 into main May 7, 2026

Natfii deleted the evidence/wo-split-prod-soak branch May 7, 2026 12:16

Natfii mentioned this pull request May 7, 2026

feat(wo_split=8): opt-in K-parallel W_O GEMV prototype #8

Closed

Natfii merged 18 commits into main from evidence/wo-split-prod-soak
