evidence(phaseE-tax): post-SSM re-run reproduces 2026-05-02 verdict (Phase 4 stays dead) by Natfii · Pull Request #15 · Navi-AI-Lab/nvllm

Natfii · 2026-05-16T08:55:08Z

Summary

Re-ran the 2026-05-02 phaseE-tax 3-leg sweep against a fresh nvllm:gb10-ssm image (built today off main HEAD with the SSM zero-on-realloc patch from PR #13 baked in). Phase 4 stays dead — kernel costs reproduce within ~1–2% run-to-run noise; ordering preserved.

The memory:feedback_phase4_dead verdict needs no update. Cheaper β via NVFP4 GEMV K-parallel reduction remains the path to re-opening fusion.

Headline A/B

Per-kernel mean time per call, in μs (2026-05-02 → 2026-05-15):

Kernel	Leg	2026-05-02	2026-05-15	Δ%
DecodeKernel	lower8	17088.5	17128.2	+0.2%
PhaseE_Beta_Kernel	lower8	40635.6	41311.3	+1.7%
DecodeKernel	phaseE-off	17040.1	17268.3	+1.3%
Phase_D_MLP_Kernel	phaseE-off	23931.4	24150.1	+0.9%
DecodeKernel	all-beta	17106.3	17136.0	+0.2%
PhaseE_Beta_Kernel	all-beta	40829.3	41461.4	+1.5%
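
For anyone who wants to re-derive the Δ% column by hand, here is a minimal sketch. The numbers are copied from the table above; the helper names (`delta_pct`, `within_noise`) and the constant `NOISE_BAND_PCT` are made up for this illustration and are not part of the committed runner.

```python
# Per-kernel mean time per call in microseconds, copied from the table above.
# Key: (kernel, leg) -> (2026-05-02 value, 2026-05-15 value)
KERNEL_MEAN_US = {
    ("DecodeKernel", "lower8"): (17088.5, 17128.2),
    ("PhaseE_Beta_Kernel", "lower8"): (40635.6, 41311.3),
    ("DecodeKernel", "phaseE-off"): (17040.1, 17268.3),
    ("Phase_D_MLP_Kernel", "phaseE-off"): (23931.4, 24150.1),
    ("DecodeKernel", "all-beta"): (17106.3, 17136.0),
    ("PhaseE_Beta_Kernel", "all-beta"): (40829.3, 41461.4),
}

NOISE_BAND_PCT = 2.0  # the ±2% band quoted in the test plan (name is hypothetical)


def delta_pct(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


def within_noise(rows, band_pct=NOISE_BAND_PCT):
    """Return ({key: delta%}, True if every |delta| is inside the band)."""
    deltas = {key: delta_pct(b, a) for key, (b, a) in rows.items()}
    return deltas, all(abs(d) <= band_pct for d in deltas.values())


deltas, ok = within_noise(KERNEL_MEAN_US)
for (kernel, leg), d in deltas.items():
    print(f"{kernel:20s} {leg:11s} {d:+.1f}%")
print("all within band:", ok)
```

Rounded to one decimal, the computed deltas match the Δ% column, and the largest one (PhaseE_Beta_Kernel on lower8) stays under the 2% band.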

Per-token aggregate ms/tok:

Leg	2026-05-02	2026-05-15	Δ%
lower8	320	322	+0.8%
phaseE-off	656	663	+1.0%
all-beta	369	372	+0.7%
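
The "ordering preserved" claim can be checked mechanically from the rounded ms/tok figures above. This is a sketch only: `cost_ordering` is a hypothetical helper, and because the inputs are the rounded table values, the slowdown ratios are approximate.

```python
# Per-token aggregate cost in ms/tok, copied from the table above.
MS_PER_TOK = {
    "2026-05-02": {"lower8": 320, "phaseE-off": 656, "all-beta": 369},
    "2026-05-15": {"lower8": 322, "phaseE-off": 663, "all-beta": 372},
}


def cost_ordering(legs):
    """Leg names sorted from cheapest to most expensive per token."""
    return sorted(legs, key=legs.get)


orderings = {run: cost_ordering(legs) for run, legs in MS_PER_TOK.items()}
ordering_preserved = orderings["2026-05-02"] == orderings["2026-05-15"]

for run, legs in MS_PER_TOK.items():
    base = legs["lower8"]
    # Slowdown of each leg relative to lower8 within the same run.
    ratios = {leg: round(ms / base, 2) for leg, ms in legs.items()}
    print(run, orderings[run], ratios)
print("ordering preserved:", ordering_preserved)
```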

GSM8K (50q, seed=42, /v1/completions):

Leg	2026-05-02	2026-05-15
lower8	47/50	46/50 (1 timeout on Q45)
phaseE-off	2/50	4/50 (46 timeouts; legacy fallback ≫180s/q)
all-beta	47/50	47/50

All within ±2-question single-run noise.
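
The same kind of check for the GSM8K scores, as a short sketch. The scores are the ones reported above; `MAX_SHIFT` and the variable names are illustrative.

```python
# Correct answers out of 50 per leg: (2026-05-02, 2026-05-15), from the table above.
GSM8K_CORRECT = {
    "lower8": (47, 46),
    "phaseE-off": (2, 4),
    "all-beta": (47, 47),
}

MAX_SHIFT = 2  # the ±2-question single-run noise allowance (name is hypothetical)

# Signed change in correct answers from the prior run to the re-run.
shifts = {leg: new - old for leg, (old, new) in GSM8K_CORRECT.items()}
gsm8k_ok = all(abs(s) <= MAX_SHIFT for s in shifts.values())

print(shifts, "within noise:", gsm8k_ok)
```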

Why this re-run was needed

Task vllm-project#38 in the strategy stack: verify the Phase 4 cost-model verdict holds post-SSM-fix. The SSM patch fires torch.index_fill_ at request-realloc boundaries (outside the decode hot path), so we expected the kernel-cost A/B to reproduce within noise — which it does.

Provenance

Field	Value
Commit	`67619835b` (main, PR #13 squash-merge)
Image	`nvllm:gb10-ssm` (sha256:b7ede5c…) — fresh build 2026-05-15T21:37:25
Hardware	NVIDIA DGX Spark (GB10, SM120/SM121), 128 GB unified LPDDR5x
Host driver	590.48.01
Suite wall	2026-05-15 21:37 → 2026-05-16 04:53 EDT (~7h 16min)
Suite exit	`BENCH_RC=0`
Out dir	`benchmarks/nvllm/traces/cute_paged_attn/2026-05-15-phaseE-tax-3leg-post-ssm/`

What's committed

summary.md (full A/B with explanations + reproduce commands)
Per-leg: profile_kernels.csv, profile_metadata.json, gsm8k.json, gsm8k_metadata.json, perq.jsonl (force-added per memory:feedback_evidence_force_add)
mem_watchdog.log (host-side memory snapshot every 30s, ~200 KB)
1-line ergonomic fix to run_3leg.sh: make OUT_ROOT env-overridable so re-runs don't disturb the prior evidence dir.
.gitignore rules for the new evidence dir's raw .pt.trace.json.gz (~346 MB total), *_serve.log, and profiler_out_*.txt, parallel to the existing 2026-05-02 rules.

What's NOT committed (intentionally)

Raw profile.pt.trace.json.gz traces (~346 MB total) — gitignored per the convention from the 2026-05-02 evidence dir; reproducible via the committed runner.
Per-leg serve.log files (~100-250 KB each) — gitignored.
profiler_out_0.txt (~20 KB) — gitignored.
Empty *_DONE runner-state marker files.

Test plan

All three legs reached ok:true with BENCH_RC=0.
Per-kernel mean_us A/B vs 2026-05-02 within ±2% across all 6 kernel rows (verified in summary.md).
GSM8K within ±2 questions per leg vs 2026-05-02 (verified per leg).
n_calls per kernel matches 2026-05-02 exactly (35700, 5100, 40800, 12240, 4080), confirming A/B comparability.
summary.md cites the 2026-05-02 prior summary directly for the comparison.
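
A rough sketch of how the mean_us and n_calls checks above could be automated against two profile_kernels.csv files. Assumptions, loudly: the column name `kernel` is a guess (only `mean_us` and `n_calls` are named in this PR), the n_calls values in the sample rows are placeholders rather than the real counts, and `load_profile` / `compare_profiles` are hypothetical helpers, not functions in the repo.

```python
import csv
import io


def load_profile(csv_text):
    """Map kernel name -> (mean_us, n_calls) from profile CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {r["kernel"]: (float(r["mean_us"]), int(r["n_calls"])) for r in reader}


def compare_profiles(prior, rerun, band_pct=2.0):
    """Return a list of failure messages; an empty list means the runs are comparable."""
    failures = []
    if prior.keys() != rerun.keys():
        failures.append(f"kernel sets differ: {sorted(prior)} vs {sorted(rerun)}")
    for kernel in sorted(prior.keys() & rerun.keys()):
        (old_us, old_n), (new_us, new_n) = prior[kernel], rerun[kernel]
        # Call counts must match exactly, otherwise the means are not comparable.
        if old_n != new_n:
            failures.append(f"{kernel}: n_calls {old_n} != {new_n}")
        delta = (new_us - old_us) / old_us * 100.0
        if abs(delta) > band_pct:
            failures.append(f"{kernel}: mean_us moved {delta:+.1f}% (band ±{band_pct}%)")
    return failures


# lower8 means from the Headline A/B table; n_calls are placeholder values.
PRIOR_CSV = """kernel,mean_us,n_calls
DecodeKernel,17088.5,1000
PhaseE_Beta_Kernel,40635.6,100
"""
RERUN_CSV = """kernel,mean_us,n_calls
DecodeKernel,17128.2,1000
PhaseE_Beta_Kernel,41311.3,100
"""

prior = load_profile(PRIOR_CSV)
rerun = load_profile(RERUN_CSV)
print(compare_profiles(prior, rerun))  # no failures for these rows

# A re-run with a different call count is flagged as not comparable.
tampered = dict(rerun)
tampered["DecodeKernel"] = (rerun["DecodeKernel"][0], rerun["DecodeKernel"][1] + 1)
print(compare_profiles(prior, tampered))
```

Checking n_calls before trusting the mean comparison matters because a changed call count would mean the two runs exercised different workloads, which makes a per-call mean delta hard to interpret.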

🤖 Generated with Claude Code

Fresh nvllm:gb10-ssm image off main HEAD (PR #13 squash, 6761983) with the SSM zero-on-realloc patch baked in. Re-ran the 2026-05-02 phaseE-tax 3-leg sweep (lower8 / phaseE-off / all-beta) to verify the Phase 4 verdict still holds post-SSM-fix.

Result: Phase 4 stays dead. Verdict reproduced within run-to-run noise (~1-2% per-kernel mean_us, identical ordering, GSM8K within +-2 questions).

Per-kernel A/B (2026-05-02 -> 2026-05-15, mean per call in ms, i.e. mean_us / 1000):
- DecodeKernel: 17.04-17.11 -> 17.13-17.27 (+0.2 to +1.3%)
- PhaseE_Beta_Kernel: 40.64-40.83 -> 41.31-41.46 (+1.6%)
- Phase_D_MLP_Kernel: 23.93 -> 24.15 (+0.9%)

Per-token aggregate (ms/tok, decode + mlp + beta):
- lower8: 320 -> 322 (+0.8%)
- phaseE-off: 656 -> 663 (+1.0%)
- all-beta: 369 -> 372 (+0.7%)

Ordering preserved: lower8 << all-beta << phaseE-off. The SSM patch did NOT change the cost model that resolved against Phase 4; the patch fires at request-realloc boundaries, outside the decode hot path.

GSM8K (50 questions, seed=42):
- lower8: 46/50 (was 47/50, 1 timeout on Q45 long-output boundary)
- phaseE-off: 4/50 (was 2/50, mostly 180s timeouts as expected)
- all-beta: 47/50 (was 47/50)

memory:feedback_phase4_dead needs NO update. The path to re-opening Phase 4 remains the same: make beta cheaper first (NVFP4 GEMV K-parallel reduction), then revisit fusion atop a cheaper beta kernel.

Also includes:
- 1-line ergonomic fix to docs/research/2026-05-02-phaseE-tax-3leg/run_3leg.sh: make OUT_ROOT env-overridable so re-runs land in a new evidence dir without disturbing the prior one.
- .gitignore rules for the new evidence dir's raw .pt.trace.json.gz (~346 MB total) and *_serve.log + profiler_out_*.txt, parallel to the existing 2026-05-02 rules.

Committed evidence (per AGENTS.md S4): summary.md, per-leg profile_kernels.csv, profile_metadata.json, gsm8k.json, gsm8k_metadata.json, perq.jsonl, mem_watchdog.log. Force-added per memory:feedback_evidence_force_add since benchmarks/**/*.json is gitignored.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Natfii merged commit 168313f into main May 16, 2026

Natfii deleted the bench/phaseE-tax-post-ssm branch May 16, 2026 12:13

Natfii merged 1 commit into main from bench/phaseE-tax-post-ssm
