evidence(phaseE-tax): post-SSM re-run reproduces 2026-05-02 verdict (Phase 4 stays dead) #15

Merged. Natfii merged 1 commit into main from bench/phaseE-tax-post-ssm on May 16, 2026.

@Natfii Natfii commented May 16, 2026

Summary

Re-ran the 2026-05-02 phaseE-tax 3-leg sweep against a fresh nvllm:gb10-ssm image (built 2026-05-15 off main HEAD with the SSM zero-on-realloc patch from PR #13 baked in). Phase 4 stays dead: per-kernel costs reproduce within ~1–2% run-to-run noise, and the leg ordering (lower8 << all-beta << phaseE-off) is preserved.

The memory:feedback_phase4_dead verdict needs no update. Cheaper β via NVFP4 GEMV K-parallel reduction remains the path to re-opening fusion.

Headline A/B

Per-kernel mean μs (2026-05-02 → 2026-05-15):

| Kernel | Leg | 2026-05-02 | 2026-05-15 | Δ% |
| --- | --- | --- | --- | --- |
| DecodeKernel | lower8 | 17088.5 | 17128.2 | +0.2% |
| PhaseE_Beta_Kernel | lower8 | 40635.6 | 41311.3 | +1.7% |
| DecodeKernel | phaseE-off | 17040.1 | 17268.3 | +1.3% |
| Phase_D_MLP_Kernel | phaseE-off | 23931.4 | 24150.1 | +0.9% |
| DecodeKernel | all-beta | 17106.3 | 17136.0 | +0.2% |
| PhaseE_Beta_Kernel | all-beta | 40829.3 | 41461.4 | +1.5% |
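
The Δ% column is the relative change of the 2026-05-15 mean against the 2026-05-02 mean. A minimal Python sketch that recomputes it from the rows above; `ROWS` and `pct_delta` are illustrative names and not part of the committed tooling:

```python
# Per-kernel means copied from the table above:
# (kernel, leg, mean on 2026-05-02, mean on 2026-05-15).
ROWS = [
    ("DecodeKernel", "lower8", 17088.5, 17128.2),
    ("PhaseE_Beta_Kernel", "lower8", 40635.6, 41311.3),
    ("DecodeKernel", "phaseE-off", 17040.1, 17268.3),
    ("Phase_D_MLP_Kernel", "phaseE-off", 23931.4, 24150.1),
    ("DecodeKernel", "all-beta", 17106.3, 17136.0),
    ("PhaseE_Beta_Kernel", "all-beta", 40829.3, 41461.4),
]


def pct_delta(before: float, after: float) -> float:
    """Relative change of `after` versus `before`, in percent."""
    return (after - before) / before * 100.0


deltas = {(kernel, leg): pct_delta(b, a) for kernel, leg, b, a in ROWS}
for (kernel, leg), delta in deltas.items():
    print(f"{kernel:20s} {leg:11s} {delta:+.1f}%")

# Acceptance criterion from the test plan: every row within ±2%.
assert all(abs(delta) <= 2.0 for delta in deltas.values())
```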

Per-token aggregate ms/tok:

| Leg | 2026-05-02 | 2026-05-15 | Δ% |
| --- | --- | --- | --- |
| lower8 | 320 | 322 | +0.8% |
| phaseE-off | 656 | 663 | +1.0% |
| all-beta | 369 | 372 | +0.7% |
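
The verdict rests on the leg ordering staying the same between runs, not only on the small deltas. A short sketch of that check, using the rounded ms/tok values from the table above; `MS_PER_TOK` and `ranking` are hypothetical names chosen for illustration:

```python
# ms/tok per leg, copied from the table above: leg -> (2026-05-02, 2026-05-15).
MS_PER_TOK = {
    "lower8": (320, 322),
    "phaseE-off": (656, 663),
    "all-beta": (369, 372),
}


def ranking(column: int) -> list[str]:
    """Legs sorted from cheapest to most expensive for one run."""
    return sorted(MS_PER_TOK, key=lambda leg: MS_PER_TOK[leg][column])


before, after = ranking(0), ranking(1)
print("2026-05-02:", before)
print("2026-05-15:", after)

# Ordering must be identical across runs for the cost-model verdict to carry over.
assert before == after
```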

GSM8K (50q, seed=42, /v1/completions):

| Leg | 2026-05-15 | 2026-05-02 |
| --- | --- | --- |
| lower8 | 46/50 (1 timeout on Q45) | 47/50 |
| phaseE-off | 4/50 (46 timeouts; legacy fallback ≫180s/q) | 2/50 |
| all-beta | 47/50 | 47/50 |

All within ±2-question single-run noise.
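
The ±2-question tolerance can be verified directly from the scores in the table. A minimal sketch, assuming the tolerance is applied per leg to the raw correct-answer counts; the `GSM8K` dictionary and `TOLERANCE` constant are illustrative:

```python
# Correct answers out of 50 per leg: leg -> (2026-05-02, 2026-05-15).
GSM8K = {
    "lower8": (47, 46),
    "phaseE-off": (2, 4),
    "all-beta": (47, 47),
}
TOLERANCE = 2  # questions; the single-run noise band quoted above

diffs = {leg: new - old for leg, (old, new) in GSM8K.items()}
for leg, diff in diffs.items():
    status = "ok" if abs(diff) <= TOLERANCE else "OUT OF BAND"
    print(f"{leg:11s} {diff:+d} {status}")

assert all(abs(diff) <= TOLERANCE for diff in diffs.values())
```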

Why this re-run was needed

Task vllm-project#38 in the strategy stack: verify the Phase 4 cost-model verdict holds post-SSM-fix. The SSM patch fires torch.Tensor.index_fill_ at request-realloc boundaries (outside the decode hot path), so we expected the kernel-cost A/B to reproduce within noise, which it does.

Provenance

| Field | Value |
| --- | --- |
| Commit | 67619835b (main, PR #13 squash-merge) |
| Image | nvllm:gb10-ssm (sha256:b7ede5c…), fresh build 2026-05-15T21:37:25 |
| Hardware | NVIDIA DGX Spark (GB10, SM120/SM121), 128 GB unified LPDDR5x |
| Host driver | 590.48.01 |
| Suite wall | 2026-05-15 21:37 → 2026-05-16 04:53 EDT (~7h 16min) |
| Suite exit | BENCH_RC=0 |
| Out dir | benchmarks/nvllm/traces/cute_paged_attn/2026-05-15-phaseE-tax-3leg-post-ssm/ |

What's committed

  • summary.md (full A/B with explanations + reproduce commands)
  • Per-leg: profile_kernels.csv, profile_metadata.json, gsm8k.json, gsm8k_metadata.json, perq.jsonl (force-added per memory:feedback_evidence_force_add)
  • mem_watchdog.log (host-side memory snapshot every 30s, ~200 KB)
  • 1-line ergonomic fix to run_3leg.sh: make OUT_ROOT env-overridable so re-runs don't disturb the prior evidence dir.
  • .gitignore rules for the new evidence dir's raw .pt.trace.json.gz (~346 MB total), *_serve.log, and profiler_out_*.txt, parallel to the existing 2026-05-02 rules.

What's NOT committed (intentionally)

  • Raw profile.pt.trace.json.gz traces (~346 MB total) — gitignored per the convention from the 2026-05-02 evidence dir; reproducible via the committed runner.
  • Per-leg serve.log files (~100-250 KB each) — gitignored.
  • profiler_out_0.txt (~20 KB) — gitignored.
  • Empty *_DONE runner-state marker files.

Test plan

  • All three legs reached ok:true with BENCH_RC=0.
  • Per-kernel mean_us A/B vs 2026-05-02 within ±2% across all 6 kernel rows (verified in summary.md).
  • GSM8K within ±2 questions per leg vs 2026-05-02 (verified per leg).
  • n_calls per kernel matches 2026-05-02 exactly (35700, 5100, 40800, 12240, 4080), confirming A/B comparability.
  • summary.md cites the 2026-05-02 prior summary directly for the comparison.
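
The n_calls comparability check could be scripted against the two runs' profile_kernels.csv files. The sketch below assumes hypothetical column names `kernel` and `n_calls` (the actual CSV schema is not shown here) and uses synthetic placeholder rows, not measured data:

```python
import csv
import io


def n_calls_by_kernel(csv_text: str) -> dict[str, int]:
    """Map kernel name -> call count for one profile_kernels.csv payload.

    Assumes hypothetical column names `kernel` and `n_calls`.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["kernel"]: int(row["n_calls"]) for row in reader}


def mismatched_kernels(old_csv: str, new_csv: str) -> dict[str, tuple]:
    """Kernels whose call counts differ (or that exist in only one run)."""
    old, new = n_calls_by_kernel(old_csv), n_calls_by_kernel(new_csv)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(old.keys() | new.keys())
        if old.get(name) != new.get(name)
    }


# Synthetic placeholder rows for illustration only.
OLD = "kernel,n_calls\nkernel_a,100\nkernel_b,20\n"
NEW_DIFF = "kernel,n_calls\nkernel_a,100\nkernel_b,21\n"

print(mismatched_kernels(OLD, OLD))       # identical runs: nothing reported
print(mismatched_kernels(OLD, NEW_DIFF))  # kernel_b differs
```

An empty result means the two runs issued the same number of calls per kernel, which is what makes the per-call means directly comparable.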

🤖 Generated with Claude Code

Fresh nvllm:gb10-ssm image off main HEAD (PR #13 squash, 6761983)
with the SSM zero-on-realloc patch baked in. Re-ran the 2026-05-02
phaseE-tax 3-leg sweep (lower8 / phaseE-off / all-beta) to verify the
Phase 4 verdict still holds post-SSM-fix.

Result: Phase 4 stays dead. Verdict reproduced within run-to-run noise
(~1-2% per-kernel mean_us, identical ordering, GSM8K within +-2 questions).

Per-kernel A/B (2026-05-02 -> 2026-05-15, mean_us per call):
- DecodeKernel:           17.04-17.11 -> 17.13-17.27  (+0.2 to +1.3%)
- PhaseE_Beta_Kernel:     40.64-40.83 -> 41.31-41.46  (+1.6%)
- Phase_D_MLP_Kernel:     23.93 -> 24.15              (+0.9%)

Per-token aggregate (ms/tok, decode + mlp + beta):
- lower8:     320 -> 322  (+0.8%)
- phaseE-off: 656 -> 663  (+1.0%)
- all-beta:   369 -> 372  (+0.7%)

Ordering preserved: lower8 << all-beta << phaseE-off. SSM patch did
NOT change the cost model that resolved against Phase 4; the patch
fires at request-realloc boundaries outside the decode hot path.

GSM8K (50 questions, seed=42):
- lower8:     46/50 (was 47/50, 1 timeout on Q45 long-output boundary)
- phaseE-off:  4/50 (was 2/50, mostly 180s timeouts as expected)
- all-beta:   47/50 (was 47/50)

memory:feedback_phase4_dead needs NO update. Path to re-opening Phase 4
remains the same: make beta cheaper first (NVFP4 GEMV K-parallel
reduction), then revisit fusion atop a cheaper beta kernel.

Also includes:
- 1-line ergonomic fix to docs/research/2026-05-02-phaseE-tax-3leg/
  run_3leg.sh: make OUT_ROOT env-overridable so re-runs land in a new
  evidence dir without disturbing the prior one.
- .gitignore rules for the new evidence dir's raw .pt.trace.json.gz
  (~346 MB total) and *_serve.log + profiler_out_*.txt, parallel to
  the existing 2026-05-02 rules.

Committed evidence (per AGENTS.md S4): summary.md, per-leg
profile_kernels.csv, profile_metadata.json, gsm8k.json, gsm8k_metadata.json,
perq.jsonl, mem_watchdog.log. Force-added per memory:feedback_evidence_force_add
since benchmarks/**/*.json is gitignored.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Natfii Natfii merged commit 168313f into main May 16, 2026
@Natfii Natfii deleted the bench/phaseE-tax-post-ssm branch May 16, 2026 12:13