evidence(phaseE-tax): post-SSM re-run reproduces 2026-05-02 verdict (Phase 4 stays dead) #15
Merged
Fresh nvllm:gb10-ssm image off main HEAD (PR #13 squash, 6761983) with the SSM zero-on-realloc patch baked in. Re-ran the 2026-05-02 phaseE-tax 3-leg sweep (lower8 / phaseE-off / all-beta) to verify the Phase 4 verdict still holds post-SSM-fix.

Result: Phase 4 stays dead. Verdict reproduced within run-to-run noise (~1-2% per-kernel mean_us, identical ordering, GSM8K within +-2 questions).

Per-kernel A/B (2026-05-02 -> 2026-05-15, mean_us per call):
- DecodeKernel: 17.04-17.11 -> 17.13-17.27 (+0.2 to +1.3%)
- PhaseE_Beta_Kernel: 40.64-40.83 -> 41.31-41.46 (+1.6%)
- Phase_D_MLP_Kernel: 23.93 -> 24.15 (+0.9%)

Per-token aggregate (ms/tok, decode + mlp + beta):
- lower8: 320 -> 322 (+0.8%)
- phaseE-off: 656 -> 663 (+1.0%)
- all-beta: 369 -> 372 (+0.7%)

Ordering preserved: lower8 << all-beta << phaseE-off. SSM patch did NOT change the cost model that resolved against Phase 4; the patch fires at request-realloc boundaries outside the decode hot path.

GSM8K (50 questions, seed=42):
- lower8: 46/50 (was 47/50, 1 timeout on Q45 long-output boundary)
- phaseE-off: 4/50 (was 2/50, mostly 180s timeouts as expected)
- all-beta: 47/50 (was 47/50)

memory:feedback_phase4_dead needs NO update. Path to re-opening Phase 4 remains the same: make beta cheaper first (NVFP4 GEMV K-parallel reduction), then revisit fusion atop a cheaper beta kernel.

Also includes:
- 1-line ergonomic fix to docs/research/2026-05-02-phaseE-tax-3leg/run_3leg.sh: make OUT_ROOT env-overridable so re-runs land in a new evidence dir without disturbing the prior one.
- .gitignore rules for the new evidence dir's raw .pt.trace.json.gz (~346 MB total) and *_serve.log + profiler_out_*.txt, parallel to the existing 2026-05-02 rules.

Committed evidence (per AGENTS.md S4): summary.md, per-leg profile_kernels.csv, profile_metadata.json, gsm8k.json, gsm8k_metadata.json, perq.jsonl, mem_watchdog.log. Force-added per memory:feedback_evidence_force_add since benchmarks/**/*.json is gitignored.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Re-ran the 2026-05-02 phaseE-tax 3-leg sweep against a fresh nvllm:gb10-ssm image (built 2026-05-15 off main HEAD with the SSM zero-on-realloc patch from PR #13 baked in). Phase 4 stays dead: kernel costs reproduce within ~1–2% run-to-run noise; ordering preserved.

The memory:feedback_phase4_dead verdict needs no update. Cheaper β via NVFP4 GEMV K-parallel reduction remains the path to re-opening fusion.

Headline A/B
Per-kernel mean μs (2026-05-02 → 2026-05-15):
Per-token aggregate ms/tok:
GSM8K (50q, seed=42, /v1/completions):
All within ±2-question single-run noise.
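As a rough sanity check on the ±2-question figure, the sketch below treats each 50-question run as a binomial sample and prints the one-run standard deviation in questions next to the observed change. The independence assumption and the helper name `binomial_std_questions` are illustrative, not part of the benchmark harness, and the model is crude for phaseE-off, where most failures are timeouts rather than wrong answers.

```python
import math

def binomial_std_questions(n_questions, n_correct):
    """One-run standard deviation, in questions, under a binomial model (an assumption)."""
    p = n_correct / n_questions
    return math.sqrt(n_questions * p * (1.0 - p))

N_QUESTIONS = 50
# (2026-05-02 correct, 2026-05-15 correct), taken from the GSM8K results in this PR.
RESULTS = {"lower8": (47, 46), "phaseE-off": (2, 4), "all-beta": (47, 47)}

for leg, (prior, rerun) in RESULTS.items():
    sigma = binomial_std_questions(N_QUESTIONS, prior)
    print(f"{leg}: delta={rerun - prior:+d} questions, one-run sigma ~ {sigma:.2f}")
```

Under this model a 47/50 leg has a one-run sigma of about 1.7 questions, so a swing of one or two questions between single runs is unremarkable.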
Why this re-run was needed
Task vllm-project#38 in the strategy stack: verify the Phase 4 cost-model verdict holds post-SSM-fix. The SSM patch fires torch.index_fill_ at request-realloc boundaries (outside the decode hot path), so we expected the kernel-cost A/B to reproduce within noise, which it does.

Provenance
- 67619835b (main, PR #13 squash-merge)
- nvllm:gb10-ssm (sha256:b7ede5c…), fresh build 2026-05-15T21:37:25
- BENCH_RC=0
- benchmarks/nvllm/traces/cute_paged_attn/2026-05-15-phaseE-tax-3leg-post-ssm/

What's committed
- summary.md (full A/B with explanations + reproduce commands)
- profile_kernels.csv, profile_metadata.json, gsm8k.json, gsm8k_metadata.json, perq.jsonl (force-added per memory:feedback_evidence_force_add)
- mem_watchdog.log (host-side memory snapshot every 30s, ~200 KB)
- run_3leg.sh: make OUT_ROOT env-overridable so re-runs don't disturb the prior evidence dir.
- .gitignore rules for the new evidence dir's raw .pt.trace.json.gz (~346 MB total), *_serve.log, and profiler_out_*.txt, parallel to the existing 2026-05-02 rules.

What's NOT committed (intentionally)
- profile.pt.trace.json.gz traces (~346 MB total): gitignored per the convention from the 2026-05-02 evidence dir; reproducible via the committed runner.
- serve.log files (~100-250 KB each): gitignored.
- profiler_out_0.txt (~20 KB): gitignored.
- *_DONE runner-state marker files.

Test plan
- ok:true with BENCH_RC=0.
- mean_us A/B vs 2026-05-02 within ±2% across all 6 kernel rows (verified in summary.md).
- n_calls per kernel matches 2026-05-02 exactly (35700, 5100, 40800, 12240, 4080), confirming A/B comparability.
- summary.md cites the 2026-05-02 prior summary directly for the comparison.

🤖 Generated with Claude Code
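
The two A/B checks in the test plan (identical n_calls, mean_us within ±2%) could be automated along these lines. This is a minimal sketch, not the committed tooling. The column name `kernel` is an assumption about the profile_kernels.csv layout, and the pairing of n_calls values with kernels in the sample rows is illustrative only; the mean_us values are the ones quoted in this PR.

```python
import csv
import io

def load_rows(csv_text):
    """Parse CSV text into {kernel: (n_calls, mean_us)}. Column names are assumed."""
    rows = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows[row["kernel"]] = (int(row["n_calls"]), float(row["mean_us"]))
    return rows

def compare(before, after, tol_pct=2.0):
    """Return a list of problems; an empty list means the A/B is comparable and in tolerance."""
    problems = []
    for kernel, (n_before, us_before) in before.items():
        if kernel not in after:
            problems.append(f"{kernel}: missing from re-run")
            continue
        n_after, us_after = after[kernel]
        if n_before != n_after:
            problems.append(f"{kernel}: n_calls {n_before} != {n_after}")
        delta_pct = 100.0 * (us_after - us_before) / us_before
        if abs(delta_pct) > tol_pct:
            problems.append(f"{kernel}: mean_us moved {delta_pct:+.2f}%")
    return problems

# Sample rows: mean_us from this PR; the n_calls-to-kernel pairing is hypothetical.
BEFORE = """kernel,n_calls,mean_us
Phase_D_MLP_Kernel,4080,23.93
PhaseE_Beta_Kernel,12240,40.64
"""
AFTER = """kernel,n_calls,mean_us
Phase_D_MLP_Kernel,4080,24.15
PhaseE_Beta_Kernel,12240,41.31
"""

problems = compare(load_rows(BEFORE), load_rows(AFTER))
print(problems or "A/B comparable and within tolerance")
```

Checking n_calls before comparing mean_us matters because a per-call mean is only comparable when both runs launched each kernel the same number of times.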