feat: SSM zero-on-realloc guard + sentinel ablation harness by Natfii · Pull Request #13 · Navi-AI-Lab/nvllm

Natfii · 2026-05-16T00:32:59Z

Summary

Three-commit series adding an SSM zero-on-realloc guard alongside the existing full-attention KV zero-on-realloc path, plus a reproducible sentinel-gated 4-arm ablation harness and the 2026-05-15 harness-validation evidence.

Commit 1 — production patch (633a96669): MambaBlockZeroer sister to KVBlockZeroer. Walks the same block-ID list at request-free / block-realloc time, zeroing recycled conv_state / ssm_state rows via torch.index_fill_ before the next prefill writes them. Always-on; rollback is git revert (EngineCore strips env vars, so an env kill switch is unreliable).
Commit 2 — ablation harness (655fc8c32): scripts/ablation/ (runner + compare + overlay patch + prep helper), docs/research/2026-05-15-ssm-zero-on-realloc/README.md, plus backwards-compatible instrumentation in scripts/gsm8k_eval_50.py (--run-index, --metrics-url, per-Q JSONL, /metrics snapshots).
Commit 3 — evidence (8f16a9eb1): benchmarks/nvllm/traces/ssm_zero_on_realloc/2026-05-15-sentinel-ablation/ with summary.md per AGENTS.md §4, ANALYSIS.md, runner_manifest.json, comparison.json, four per-arm verdicts. Also fixes a residual IMAGE_DIGEST emit bug in the runner (cmd || echo X -> cmd || true + ${VAR:-X}).
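The `cmd || echo X` pitfall named in Commit 3 can be reproduced in isolation. The sketch below is illustrative only and is not the runner's code. The helper `sh`, the variable names, and the stand-in `{ echo; false; }` command (which prints a blank line and then fails, assumed from the "\nno-digest" artifact) are all hypothetical. It relies on the documented behavior that `grep -c` prints `0` and exits with status 1 when nothing matches, so the fallback runs as well.

```python
import subprocess


def sh(script: str) -> str:
    """Run a bash snippet and return its raw stdout (hypothetical helper)."""
    return subprocess.run(
        ["bash", "-c", script], capture_output=True, text=True, check=False
    ).stdout


# A pipeline whose grep finds nothing: grep -c prints "0" and exits 1.
NO_MATCH = 'printf "a\\nb\\n" | grep -c zzz'

# Buggy pattern: grep's "0" and the fallback's "0" are both captured.
buggy_count = sh('count=$(' + NO_MATCH + ' || echo 0); printf "%s" "$count"')

# Fixed pattern: swallow the exit status, default only when the value is empty.
fixed_count = sh('count=$(' + NO_MATCH + ' || true); printf "%s" "${count:-0}"')

# Assumed stand-in for a digest command that prints a blank line, then fails.
FAILING_CMD = "{ echo; false; }"

buggy_digest = sh('d=$( ' + FAILING_CMD + ' || echo no-digest ); printf "%s" "$d"')
fixed_digest = sh('d=$( ' + FAILING_CMD + ' || true ); printf "%s" "${d:-no-digest}"')

print(repr(buggy_count), repr(fixed_count))
print(repr(buggy_digest), repr(fixed_digest))
```

In the fixed form the default comes from parameter expansion rather than from a second command's stdout, so partial output from the failing command cannot be concatenated with the fallback value.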

Why this is shipped (and what is NOT claimed)

The existing KVBlockZeroer (upstream PR vllm-project#35219) clears full-attn KV blocks but skips Mamba layers because the conv / ssm page sizes differ from the full-attn page size and cannot share the Triton kernel's uniform PAGE_SIZE_EL. MambaBlockZeroer covers the remaining state. This addresses the mamba-state lifecycle gap identified during the β-coop sustained-load collapse diagnosis arc.

Honest framing — what this PR does and does not prove:

The β-coop sustained-load collapse was not reproducing on the 2026-05-15 host. The 4-arm sentinel ablation in this PR cannot prove the SSM patch fixes the collapse — it proves the sentinel-gating mechanism works and that the patch is correctness/perf neutral under non-collapse load. The original fix-validation evidence (patched soak: 48/48/48/48/48 vs D2.7 baseline 47/47/47/11/35; decode ~2.5 → ~9.4 tok/s) lives in the local /tmp/ssm_zero_on_free_soak/ directory and is referenced in memory:project_beta_coop_sustained_collapse. The committed harness is for the next collapse discrimination window.
No perf claim. No nsys trace was captured. Median decode is flat across all 4 arms (within 0.03 tok/s).
KV new_block_ids channel relax is NOT shipped to default. The 4-arm sweep showed a deterministic -1 question on the kv_only arm (47/50 × 5 vs 48/50 × 5 on the other three arms). The relax is kept in the harness overlay patch only.

4-arm sentinel ablation result (2026-05-15)

| Arm | SSM sentinel | KV sentinel | runs (correct/50) | first_fire (ssm, kv) | gate_pass | harness_pass |
|---|---|---|---|---|---|---|
| `both` | 1 | 1 | 48,48,48,48,48 | (1, 1) | true | true |
| `neither` | 0 | 0 | 48,48,48,48,48 | (0, 0) | true | true |
| `ssm_only` | 1 | 0 | 48,48,48,48,48 | (1, 0) | true | true |
| `kv_only` | 0 | 1 | 47,47,47,47,47 | (0, 1) | true | true |

harness_pass=true for all four arms means: when SSM_sentinel=1 the SSM gate fired (and not when 0); same for KV. The env-strip confound that made a prior env-gated attempt a null A/B is eliminated. See benchmarks/nvllm/traces/ssm_zero_on_realloc/2026-05-15-sentinel-ablation/summary.md for the full host/image manifest and the per-Q breakdown.
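The harness_pass rule described above can be written as a small predicate. This is a minimal sketch under stated assumptions, not the logic of the committed runner or compare script. The `Arm` dataclass and `harness_pass` function are hypothetical names, and the data is copied from the table. It assumes an enabled gate must log a first_fire count of at least 1 and a disabled gate must log 0.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Arm:
    """One ablation arm: sentinel settings plus observed first_fire counts."""
    name: str
    ssm_sentinel: int
    kv_sentinel: int
    first_fire: tuple[int, int]  # (ssm, kv)


def gate_ok(enabled: int, fired: int) -> bool:
    """Enabled gates must have fired at least once; disabled gates never."""
    return fired >= 1 if enabled else fired == 0


def harness_pass(arm: Arm) -> bool:
    """True when both gates behaved as their sentinels requested."""
    ssm_fired, kv_fired = arm.first_fire
    return gate_ok(arm.ssm_sentinel, ssm_fired) and gate_ok(arm.kv_sentinel, kv_fired)


# Values copied from the 2026-05-15 table.
ARMS = [
    Arm("both", 1, 1, (1, 1)),
    Arm("neither", 0, 0, (0, 0)),
    Arm("ssm_only", 1, 0, (1, 0)),
    Arm("kv_only", 0, 1, (0, 1)),
]

results = {arm.name: harness_pass(arm) for arm in ARMS}
print(results)
```

A null A/B of the kind attributed to the earlier env-gated attempt would show up here as an enabled arm with a first_fire count of 0, which this predicate rejects.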

Test plan

Production patch diff inspected; no sentinel / env / debug residue (grep -c "sentinel\|SENTINEL\|/run/nvllm" vllm/v1/worker/utils.py vllm/v1/worker/gpu_model_runner.py → 0).
Sentinel overlay patch roundtrips: git apply scripts/ablation/ssm_sentinel_overlay.patch then git checkout -- restores HEAD cleanly; markers _SSM_ZERO_SENTINEL and _KV_ZERO_SENTINEL land where expected.
Runner script bash-syntax checked (bash -n); compare script Python-syntax checked (ast.parse).
scripts/gsm8k_eval_50.py backwards-compatible: --run-index defaults to 0, --metrics-url defaults to None; existing callers are unaffected.
All four per-arm verdict.json files parse and report gate_pass=true + harness_pass=true.
Future: re-run the harness against a collapsing host state to discriminate which half of the original /tmp/ssm_zero_on_free_soak patch was load-bearing (working hypothesis: SSM-only).
Future: phaseE-tax bench re-run post-SSM-fix to verify the Phase 4 verdict still holds (task Support various block sizes vllm-project/vllm#38, separate).

Constraints (from work-session policy)

AI assistance was used for the patch authorship, the harness, and the commit-message drafting. Human authored the design constraints (drop KV channel relax from production, no "fixes collapse" claim, no perf claim without nsys), reviewed every changed line, and ran the ablation suite end-to-end.
This PR targets Navi-AI-Lab/nvllm main, not upstream vllm-project/vllm, per memory:feedback_never_touch_upstream_vllm.

🤖 Generated with Claude Code

Adds an SSM zero-on-realloc guard alongside the existing KV zero-on-realloc path. KVBlockZeroer.zero_block_ids now also walks a sister MambaBlockZeroer on the same block-ID list, zeroing recycled conv_state / ssm_state rows before the next prefill writes into them.

Rationale: the existing KVBlockZeroer (upstream PR vllm-project#35219) clears full-attn KV blocks at request-free / block-realloc time but skips Mamba layers because the conv / ssm page sizes differ from the full-attn page size and cannot share the Triton kernel's uniform PAGE_SIZE_EL. MambaBlockZeroer covers the remaining state via per-tensor torch.index_fill_, which is simple, idempotent, and runs only outside the hot decode path.

Under the 2026-05-15 non-reproducing host state, the patch is correctness and perf neutral; the companion sentinel-gated ablation harness (added in follow-up commits) proves the toggle path executes (4-arm sweep, 20 runs, both/neither/ssm_only all 48/50, kv_only deterministic 47/50). No perf claim is made: no nsys trace was captured.

Rollback is `git revert` since EngineCore strips env vars on subprocess spawn, making an env kill switch unreliable.

Files:
- vllm/v1/worker/utils.py: add MambaBlockZeroer class, add fire counter, wire it into KVBlockZeroer.{init_meta, zero_block_ids}.
- vllm/v1/worker/gpu_model_runner.py: materialize the attn-groups iterator into a list so KVBlockZeroer can walk it twice.

Evidence: benchmarks/nvllm/traces/ssm_zero_on_realloc/2026-05-15-sentinel-ablation/

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
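The gpu_model_runner.py change in the commit above (materializing the attn-groups iterator into a list) reflects a general Python property: a generator is exhausted after one pass. The sketch below is illustrative only. `iter_groups` and `walk_twice` are hypothetical stand-ins and do not reproduce the vLLM code.

```python
def iter_groups():
    """Hypothetical stand-in for a lazily produced sequence of attention groups."""
    yield from ("full_attn", "mamba")


def walk_twice(groups):
    """Walk the same object twice, as a consumer with two passes would."""
    first_pass = [g for g in groups]
    second_pass = [g for g in groups]
    return first_pass, second_pass


# A bare generator yields nothing on the second pass.
lazy_first, lazy_second = walk_twice(iter_groups())

# Materializing into a list first makes both passes see every group.
list_first, list_second = walk_twice(list(iter_groups()))

print(lazy_first, lazy_second)
print(list_first, list_second)
```

With the generator, the second walk silently sees an empty sequence rather than raising, which is why this kind of bug is easy to miss.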

Adds a reproducible 4-arm ablation harness for the SSM zero-on-realloc patch (preceding commit) plus the KV new_block_ids channel relax that was investigated alongside it but explicitly NOT shipped to default.

Rationale: the patch's correctness rationale is the SSM lifecycle gap identified in the mamba-state audit; under the 2026-05-15 host state it is perf/correctness neutral. The β-coop sustained-load collapse was not reproducing during this work, so the harness exists for future discrimination windows, not to validate the current commit.

Layout:
- scripts/ablation/run_ssm_ablation_suite.sh: 4-arm runner (both / neither / ssm_only / kv_only). Per-arm bind-mounts a per-arm sentinel dir at /run/nvllm :ro. Captures per-arm verdict.json with harness_validation block (enabled arms must log first_fire>=1; disabled arms must not). Counter helper uses awk to avoid the `grep -c ... || echo 0` bug that emitted "0\n0" when grep found zero matches.
- scripts/ablation/ssm_ablation_compare.py: post-suite analyzer; reads per-arm verdict.json + perq.jsonl, emits ANALYSIS.md with verdict table, Run-4 (collapse window) per-Q breakdown, steady-state stats, friend's interpretation thresholds, and KV-drained invariant.
- scripts/ablation/ssm_sentinel_overlay.patch: unified diff that overlays the sentinel-gated debug version on top of the production code from the preceding commit. The runner expects the patched files at $PATCHED_REPO/vllm/v1/{worker,core}/...
- scripts/ablation/prepare_sentinel_overlay.sh: clones HEAD into a scratch dir and applies the overlay; the runner just bind-mounts the result.
- docs/research/2026-05-15-ssm-zero-on-realloc/README.md: design rationale, why sentinels not env vars (EngineCore strips env), what the commit series does and does not claim, reproduce commands.
- scripts/gsm8k_eval_50.py: backwards-compatible instrumentation (--run-index, --metrics-url, per-Q JSONL, /metrics snapshots). Both new flags default to no-op; existing callers are unaffected.

Sentinel paths:
- /run/nvllm/zero_ssm_on_realloc.enabled -> SSM gate fires
- /run/nvllm/kv_zero_for_mamba_ids.enabled -> KV new_block_ids relax fires

The KV relax is in the overlay for completeness but NOT in the production patch: the 2026-05-15 4-arm sweep showed a deterministic -1 question on kv_only (47/50 x 5 vs 48/50 x 5 on the other three arms).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
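Sentinel-file gating of the kind described in the commit above can be sketched as a file-existence check, which does not depend on environment variables surviving a subprocess spawn. This is a minimal illustration under stated assumptions, not the overlay patch itself. The functions `read_gates` and `make_arm_dir` are hypothetical, and a temporary directory stands in for the per-arm directory that the runner bind-mounts at /run/nvllm. Only the two sentinel file names come from the commit message.

```python
import os
import tempfile

# File names taken from the commit message; everything else is illustrative.
SSM_SENTINEL = "zero_ssm_on_realloc.enabled"
KV_SENTINEL = "kv_zero_for_mamba_ids.enabled"


def read_gates(sentinel_dir: str = "/run/nvllm") -> dict:
    """Return which gates are enabled, based only on sentinel file presence."""
    return {
        "ssm": os.path.exists(os.path.join(sentinel_dir, SSM_SENTINEL)),
        "kv": os.path.exists(os.path.join(sentinel_dir, KV_SENTINEL)),
    }


def make_arm_dir(root: str, name: str, ssm: bool, kv: bool) -> str:
    """Create a per-arm sentinel directory (hypothetical helper)."""
    arm_dir = os.path.join(root, name)
    os.makedirs(arm_dir)
    for enabled, filename in ((ssm, SSM_SENTINEL), (kv, KV_SENTINEL)):
        if enabled:
            # An empty file is enough; only its existence is checked.
            open(os.path.join(arm_dir, filename), "w").close()
    return arm_dir


ARMS = {
    "both": (True, True),
    "neither": (False, False),
    "ssm_only": (True, False),
    "kv_only": (False, True),
}

observed = {}
with tempfile.TemporaryDirectory() as root:
    for name, (ssm, kv) in ARMS.items():
        observed[name] = read_gates(make_arm_dir(root, name, ssm, kv))

print(observed)
```

Because the gate state lives in the filesystem, a read-only bind mount of the arm directory fixes the configuration for the lifetime of the container.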

Commits the harness-validation evidence for the SSM zero-on-realloc patch + ablation harness (preceding two commits in this branch).

Honest framing: This is harness validation only. The β-coop sustained-load collapse was not reproducing on the 2026-05-15 host, so the 4-arm sweep cannot prove the SSM patch fixes the collapse. What it does prove:
- Sentinel-file gating works through vLLM EngineCore (env-stripped) where env-var gating did not.
- Production SSM patch is correctness-neutral under non-collapse load (both / neither / ssm_only all 48/50, identical to baseline).
- The KV new_block_ids channel relax (kv_only arm) is NOT correctness-neutral: a deterministic -1 question (47/50 x 5). This is the basis for not shipping the KV relax to default.

No nsys trace was captured; no perf claim is made. Median decode is flat across arms (within 0.03 tok/s).

Layout: benchmarks/nvllm/traces/ssm_zero_on_realloc/2026-05-15-sentinel-ablation/
- summary.md - per AGENTS.md S4: host/image manifest, per-arm verdict table, what shows/doesn't, reproduce commands
- ANALYSIS.md - per-Q Run-4 breakdown, steady-state stats, drained-KV invariant table
- runner_manifest.json - host_driver, image_id, prompt_set_hash, git_sha (manifest's image_digest "\nno-digest" artifact cleaned in-place at copy time)
- comparison.json - aggregate per-arm verdict pointer
- verdict-{arm}.json - four per-arm verdicts (force-added per memory:feedback_evidence_force_add since benchmarks/**/*.json is gitignored)

Also fixes the IMAGE_DIGEST emit bug in scripts/ablation/run_ssm_ablation_suite.sh (the same "cmd 2>/dev/null || echo X" pattern that bit the harness counter; uses "cmd 2>/dev/null || true" + "${VAR:-X}" instead).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…14)

All three memtester runs returned rc=0 with 0 FAILURE lines:
- sanity_4G: 436s, rc=0, FAILURE count=0
- band_32G: 3391s, rc=0, FAILURE count=0
- band_64G: 6764s, rc=0, FAILURE count=0

This gate ruled out RAM in the 4G / 32G / 64G bands as a hardware contributor to the β-coop sustained-load collapse arc, prior to running D2.7. D2.7 was subsequently skipped because the SSM zero-on-realloc fix (PR #13) closed the diagnosis arc. The 32-64 GB band is the suspect one per NVIDIA forum reports of reproducible memtester failures on some Spark units. This unit passed.

Context for why this gate exists (Spark fleet caveats):
- LPDDR5x on Spark has no ECC (NVIDIA-confirmed); silent bit-flip corruption is possible.
- Memtester 32-64 GB failures reported by other Spark owners; unresolved upstream as of 2026-05-14.
- Thermal-sensor errata on at least one unit means nvidia-smi throttle-reasons is not a reliable HW-health gate.

See benchmarks/nvllm/traces/hw_gates/2026-05-14-memtester-32G-64G-clean/summary.md for the full host/manifest, what this does and does not prove, and reproduction commands. Sources linked there.

Co-authored-by: Natfii <27841768+Natfii@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

)

Fresh nvllm:gb10-ssm image off main HEAD (PR #13 squash, 6761983) with the SSM zero-on-realloc patch baked in. Re-ran the 2026-05-02 phaseE-tax 3-leg sweep (lower8 / phaseE-off / all-beta) to verify the Phase 4 verdict still holds post-SSM-fix.

Result: Phase 4 stays dead. Verdict reproduced within run-to-run noise (~1-2% per-kernel mean_us, identical ordering, GSM8K within +-2 questions).

Per-kernel A/B (2026-05-02 -> 2026-05-15, mean_us per call):
- DecodeKernel: 17.04-17.11 -> 17.13-17.27 (+0.2 to +1.3%)
- PhaseE_Beta_Kernel: 40.64-40.83 -> 41.31-41.46 (+1.6%)
- Phase_D_MLP_Kernel: 23.93 -> 24.15 (+0.9%)

Per-token aggregate (ms/tok, decode + mlp + beta):
- lower8: 320 -> 322 (+0.8%)
- phaseE-off: 656 -> 663 (+1.0%)
- all-beta: 369 -> 372 (+0.7%)

Ordering preserved: lower8 << all-beta << phaseE-off. SSM patch did NOT change the cost model that resolved against Phase 4; the patch fires at request-realloc boundaries outside the decode hot path.

GSM8K (50 questions, seed=42):
- lower8: 46/50 (was 47/50, 1 timeout on Q45 long-output boundary)
- phaseE-off: 4/50 (was 2/50, mostly 180s timeouts as expected)
- all-beta: 47/50 (was 47/50)

memory:feedback_phase4_dead needs NO update. Path to re-opening Phase 4 remains the same: make beta cheaper first (NVFP4 GEMV K-parallel reduction), then revisit fusion atop a cheaper beta kernel.

Also includes:
- 1-line ergonomic fix to docs/research/2026-05-02-phaseE-tax-3leg/run_3leg.sh: make OUT_ROOT env-overridable so re-runs land in a new evidence dir without disturbing the prior one.
- .gitignore rules for the new evidence dir's raw .pt.trace.json.gz (~346 MB total) and *_serve.log + profiler_out_*.txt, parallel to the existing 2026-05-02 rules.

Committed evidence (per AGENTS.md S4): summary.md, per-leg profile_kernels.csv, profile_metadata.json, gsm8k.json, gsm8k_metadata.json, perq.jsonl, mem_watchdog.log. Force-added per memory:feedback_evidence_force_add since benchmarks/**/*.json is gitignored.

Co-authored-by: Natfii <27841768+Natfii@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Natfii and others added 3 commits May 15, 2026 20:21

Natfii mentioned this pull request May 16, 2026

evidence + diagnosis: β-coop layer-sweep + 12L_3_47 survival-soak gate failure #12

Closed

8 tasks

Natfii merged commit 6761983 into main May 16, 2026

Natfii deleted the feat/ssm-zero-on-realloc branch May 16, 2026 00:37

This was referenced May 16, 2026

evidence(hw_gates): memtester 4G/32G/64G CLEAN on Spark (2026-05-14) #14

Merged

evidence(phaseE-tax): post-SSM re-run reproduces 2026-05-02 verdict (Phase 4 stays dead) #15

Merged

Natfii merged 3 commits into main from feat/ssm-zero-on-realloc
