feat: SSM zero-on-realloc guard + sentinel ablation harness #13
Merged
Adds an SSM zero-on-realloc guard alongside the existing KV zero-on-realloc path. KVBlockZeroer.zero_block_ids now also walks a sister MambaBlockZeroer on the same block-ID list, zeroing recycled conv_state / ssm_state rows before the next prefill writes into them.
Rationale: the existing KVBlockZeroer (upstream PR vllm-project#35219) clears full-attn KV blocks at request-free / block-realloc time but skips Mamba layers because the conv / ssm page sizes differ from the full-attn page size and cannot share the Triton kernel's uniform PAGE_SIZE_EL. MambaBlockZeroer covers the remaining state via per-tensor torch.index_fill_, which is simple, idempotent, and runs only outside the hot decode path.
Under the 2026-05-15 non-reproducing host state, the patch is correctness and perf neutral; the companion sentinel-gated ablation harness (added in follow-up commits) proves the toggle path executes (4-arm sweep, 20 runs, both/neither/ssm_only all 48/50, kv_only deterministic 47/50). No perf claim is made: no nsys trace was captured.
Rollback is `git revert` since EngineCore strips env vars on subprocess spawn, making an env kill switch unreliable.
Files:
- vllm/v1/worker/utils.py: add MambaBlockZeroer class, add fire counter, wire it into KVBlockZeroer.{init_meta, zero_block_ids}.
- vllm/v1/worker/gpu_model_runner.py: materialize the attn-groups iterator into a list so KVBlockZeroer can walk it twice.
Evidence: benchmarks/nvllm/traces/ssm_zero_on_realloc/2026-05-15-sentinel-ablation/
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
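As a rough illustration of the zero-on-realloc idea in this commit message, the sketch below models per-layer state as plain Python lists instead of CUDA tensors. `StateZeroerSketch`, its `fire_count` attribute, and the list-of-rows layout are hypothetical stand-ins. The actual patch is described as using `torch.index_fill_` on the real `conv_state` / `ssm_state` tensors, which this sketch does not reproduce.

```python
from typing import Dict, Iterable, List


class StateZeroerSketch:
    """Hypothetical stand-in for a block zeroer; not the vLLM class."""

    def __init__(self, states: Dict[str, List[List[float]]]) -> None:
        # Each entry maps a state name to rows indexed by block ID.
        self.states = states
        self.fire_count = 0  # number of calls that actually zeroed rows

    def zero_block_ids(self, block_ids: Iterable[int]) -> None:
        # Materialize first: the IDs are walked once per state, and a
        # one-shot iterator would be empty after the first walk.
        ids = list(block_ids)
        if not ids:
            return
        for rows in self.states.values():
            for block_id in ids:
                rows[block_id] = [0.0] * len(rows[block_id])
        self.fire_count += 1


states = {
    "conv_state": [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    "ssm_state": [[7.0], [8.0], [9.0]],
}
zeroer = StateZeroerSketch(states)
zeroer.zero_block_ids(iter([0, 2]))  # a one-shot iterator still works
zeroer.zero_block_ids([0, 2])        # idempotent: same end state
```

Rows 0 and 2 end up zeroed in both states while row 1 is untouched, and repeating the call changes nothing further. The `list(...)` call is loosely analogous to the gpu_model_runner.py change that materializes the attn-groups iterator so it can be walked twice.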
Adds a reproducible 4-arm ablation harness for the SSM zero-on-realloc
patch (preceding commit) plus the KV new_block_ids channel relax that was
investigated alongside it but explicitly NOT shipped to default.
Rationale: the patch's correctness rationale is the SSM lifecycle gap
identified in the mamba-state audit; under the 2026-05-15 host state it is
perf/correctness neutral. The β-coop sustained-load collapse was not
reproducing during this work, so the harness exists for future discrimination
windows, not to validate the current commit.
Layout:
- scripts/ablation/run_ssm_ablation_suite.sh: 4-arm runner (both / neither /
ssm_only / kv_only). Per-arm bind-mounts a per-arm sentinel dir at
/run/nvllm :ro. Captures per-arm verdict.json with harness_validation
block (enabled arms must log first_fire>=1; disabled arms must not).
Counter helper uses awk to avoid the `grep -c ... || echo 0` bug that
emitted "0\n0" when grep found zero matches.
- scripts/ablation/ssm_ablation_compare.py: post-suite analyzer; reads
per-arm verdict.json + perq.jsonl, emits ANALYSIS.md with verdict table,
Run-4 (collapse window) per-Q breakdown, steady-state stats, friend's
interpretation thresholds, and KV-drained invariant.
- scripts/ablation/ssm_sentinel_overlay.patch: unified diff that overlays
the sentinel-gated debug version on top of the production code from the
preceding commit. The runner expects the patched files at
$PATCHED_REPO/vllm/v1/{worker,core}/...
- scripts/ablation/prepare_sentinel_overlay.sh: clones HEAD into a scratch
dir and applies the overlay; the runner just bind-mounts the result.
- docs/research/2026-05-15-ssm-zero-on-realloc/README.md: design rationale,
why sentinels not env vars (EngineCore strips env), what the commit
series does and does not claim, reproduce commands.
- scripts/gsm8k_eval_50.py: backwards-compatible instrumentation
(--run-index, --metrics-url, per-Q JSONL, /metrics snapshots). Both new
flags default to no-op; existing callers are unaffected.
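The harness_validation rule in the list above (enabled arms must log at least one fire, disabled arms must log none) can be sketched as follows. The arm names come from the commit message; the arm-to-sentinel mapping, the function names, and the log line format are assumptions made for illustration and are not taken from the runner script.

```python
from typing import Dict, Tuple

# Which gates each arm enables, as (ssm, kv). This mapping is inferred
# from the arm names and is an assumption.
ARMS: Dict[str, Tuple[bool, bool]] = {
    "both": (True, True),
    "neither": (False, False),
    "ssm_only": (True, False),
    "kv_only": (False, True),
}


def count_fires(log_text: str, marker: str) -> int:
    """Count log lines containing marker; always returns one integer."""
    return sum(1 for line in log_text.splitlines() if marker in line)


def harness_pass(arm: str, ssm_fires: int, kv_fires: int) -> bool:
    """Enabled gates must fire at least once; disabled gates never."""
    ssm_on, kv_on = ARMS[arm]
    ssm_ok = (ssm_fires >= 1) if ssm_on else (ssm_fires == 0)
    kv_ok = (kv_fires >= 1) if kv_on else (kv_fires == 0)
    return ssm_ok and kv_ok


# Made-up log lines; the real log format is not shown in this PR.
log = "ssm_zero first_fire=1\nstep\nssm_zero fire=2\n"
ssm_fires = count_fires(log, "ssm_zero")
kv_fires = count_fires(log, "kv_zero")
ssm_only_ok = harness_pass("ssm_only", ssm_fires, kv_fires)
both_ok = harness_pass("both", ssm_fires, kv_fires)
```

With this sample log, the `ssm_only` arm passes and the `both` arm fails because the KV gate never fired. Counting in one place and returning a single integer avoids the double-output problem described for the shell counter.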
Sentinel paths:
- /run/nvllm/zero_ssm_on_realloc.enabled -> SSM gate fires
- /run/nvllm/kv_zero_for_mamba_ids.enabled -> KV new_block_ids relax fires
The KV relax is in the overlay for completeness but NOT in the production
patch: the 2026-05-15 4-arm sweep showed a deterministic -1 question on
kv_only (47/50 x 5 vs 48/50 x 5 on the other three arms).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
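A minimal sketch of sentinel-file gating, assuming the gate is simply "on" when the file exists. The two file names are the ones listed under Sentinel paths. The `gate_enabled` helper and the parameterized directory are hypothetical, introduced so the example can run outside the container, where the directory would be /run/nvllm.

```python
import os
import tempfile

# File names from the commit message; the directory is a parameter here
# (an assumption) rather than the hard-coded /run/nvllm mount.
SSM_SENTINEL = "zero_ssm_on_realloc.enabled"
KV_SENTINEL = "kv_zero_for_mamba_ids.enabled"


def gate_enabled(sentinel_dir: str, name: str) -> bool:
    """A gate is on if and only if its sentinel file exists."""
    return os.path.exists(os.path.join(sentinel_dir, name))


with tempfile.TemporaryDirectory() as sentinel_dir:
    # Simulate the ssm_only arm: only the SSM sentinel is present.
    open(os.path.join(sentinel_dir, SSM_SENTINEL), "w").close()
    ssm_on = gate_enabled(sentinel_dir, SSM_SENTINEL)
    kv_on = gate_enabled(sentinel_dir, KV_SENTINEL)
```

Because the check reads the filesystem and not the process environment, it is unaffected by EngineCore stripping env vars on subprocess spawn, which is the reason given for preferring sentinels.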
Commits the harness-validation evidence for the SSM zero-on-realloc
patch + ablation harness (preceding two commits in this branch).
Honest framing:
This is harness validation only. The β-coop sustained-load collapse
was not reproducing on the 2026-05-15 host, so the 4-arm sweep cannot
prove the SSM patch fixes the collapse. What it does prove:
- Sentinel-file gating works through vLLM EngineCore (env-stripped)
where env-var gating did not.
- Production SSM patch is correctness-neutral under non-collapse load
(both / neither / ssm_only all 48/50, identical to baseline).
- The KV new_block_ids channel relax (kv_only arm) is NOT
correctness-neutral: a deterministic -1 question (47/50 x 5). This
is the basis for not shipping the KV relax to default.
No nsys trace was captured; no perf claim is made. Median decode is
flat across arms (within 0.03 tok/s).
Layout:
benchmarks/nvllm/traces/ssm_zero_on_realloc/2026-05-15-sentinel-ablation/
- summary.md: per AGENTS.md S4 (host/image manifest, per-arm verdict table, what it shows and doesn't, reproduce commands)
- ANALYSIS.md: per-Q Run-4 breakdown, steady-state stats, drained-KV invariant table
- runner_manifest.json: host_driver, image_id, prompt_set_hash, git_sha (manifest's image_digest "\nno-digest" artifact cleaned in-place at copy time)
- comparison.json: aggregate per-arm verdict pointer
- verdict-{arm}.json: four per-arm verdicts (force-added per memory:feedback_evidence_force_add since benchmarks/**/*.json is gitignored)
Also fixes the IMAGE_DIGEST emit bug in
scripts/ablation/run_ssm_ablation_suite.sh (the same "cmd 2>/dev/null
|| echo X" pattern that bit the harness counter; uses "cmd 2>/dev/null
|| true" + "${VAR:-X}" instead).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
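The `cmd || echo X` failure mode mentioned in this commit can be reproduced from Python by shelling out to `sh`. This is a sketch that assumes POSIX `sh` and `grep` are on the PATH. The `run_sh` helper and the file names are hypothetical. The fixed variant applies the `|| true` + `${VAR:-X}` pattern to `grep -c` purely for demonstration; the commit messages say the harness counter itself was fixed with awk.

```python
import os
import subprocess
import tempfile


def run_sh(script: str, path: str) -> str:
    """Run a POSIX sh snippet with path as $1 and return its stdout."""
    result = subprocess.run(
        ["sh", "-c", script, "sh", path],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


# grep -c prints "0" AND exits non-zero when nothing matches, so the
# fallback echo also runs and the output contains two lines.
BUGGY = 'grep -c needle "$1" 2>/dev/null || echo 0'
# Swallow the exit status, then default only if the capture is empty.
FIXED = 'n=$(grep -c needle "$1" 2>/dev/null || true); echo "${n:-0}"'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "log.txt")
    with open(path, "w") as fh:
        fh.write("hay\n")  # no line matches "needle"
    buggy_out = run_sh(BUGGY, path)
    fixed_out = run_sh(FIXED, path)
    # A missing file produces empty stdout, so the default kicks in.
    missing_out = run_sh(FIXED, os.path.join(tmp, "absent.txt"))
```

The buggy form yields two lines for a zero-match file, while the fixed form yields a single value whether the command printed a count or printed nothing at all.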
This was referenced May 16, 2026
Natfii added a commit that referenced this pull request on May 16, 2026
…14)
All three memtester runs returned rc=0 with 0 FAILURE lines:
- sanity_4G: 436s, rc=0, FAILURE count=0
- band_32G: 3391s, rc=0, FAILURE count=0
- band_64G: 6764s, rc=0, FAILURE count=0
This gate ruled out RAM in the 4G / 32G / 64G bands as a hardware contributor to the β-coop sustained-load collapse arc, prior to running D2.7. D2.7 was subsequently skipped because the SSM zero-on-realloc fix (PR #13) closed the diagnosis arc.
The 32-64 GB band is the suspect one per NVIDIA forum reports of reproducible memtester failures on some Spark units. This unit passed.
Context for why this gate exists (Spark fleet caveats):
- LPDDR5x on Spark has no ECC (NVIDIA-confirmed); silent bit-flip corruption is possible.
- Memtester 32-64 GB failures reported by other Spark owners; unresolved upstream as of 2026-05-14.
- Thermal-sensor errata on at least one unit means nvidia-smi throttle-reasons is not a reliable HW-health gate.
See benchmarks/nvllm/traces/hw_gates/2026-05-14-memtester-32G-64G-clean/summary.md for the full host/manifest, what this does and does not prove, and reproduction commands. Sources linked there.
Co-authored-by: Natfii <27841768+Natfii@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii added a commit that referenced this pull request on May 16, 2026
)
Fresh nvllm:gb10-ssm image off main HEAD (PR #13 squash, 6761983) with the SSM zero-on-realloc patch baked in. Re-ran the 2026-05-02 phaseE-tax 3-leg sweep (lower8 / phaseE-off / all-beta) to verify the Phase 4 verdict still holds post-SSM-fix.
Result: Phase 4 stays dead. Verdict reproduced within run-to-run noise (~1-2% per-kernel mean_us, identical ordering, GSM8K within +-2 questions).
Per-kernel A/B (2026-05-02 -> 2026-05-15, mean_us per call):
- DecodeKernel: 17.04-17.11 -> 17.13-17.27 (+0.2 to +1.3%)
- PhaseE_Beta_Kernel: 40.64-40.83 -> 41.31-41.46 (+1.6%)
- Phase_D_MLP_Kernel: 23.93 -> 24.15 (+0.9%)
Per-token aggregate (ms/tok, decode + mlp + beta):
- lower8: 320 -> 322 (+0.8%)
- phaseE-off: 656 -> 663 (+1.0%)
- all-beta: 369 -> 372 (+0.7%)
Ordering preserved: lower8 << all-beta << phaseE-off. SSM patch did NOT change the cost model that resolved against Phase 4; the patch fires at request-realloc boundaries outside the decode hot path.
GSM8K (50 questions, seed=42):
- lower8: 46/50 (was 47/50, 1 timeout on Q45 long-output boundary)
- phaseE-off: 4/50 (was 2/50, mostly 180s timeouts as expected)
- all-beta: 47/50 (was 47/50)
memory:feedback_phase4_dead needs NO update. Path to re-opening Phase 4 remains the same: make beta cheaper first (NVFP4 GEMV K-parallel reduction), then revisit fusion atop a cheaper beta kernel.
Also includes:
- 1-line ergonomic fix to docs/research/2026-05-02-phaseE-tax-3leg/run_3leg.sh: make OUT_ROOT env-overridable so re-runs land in a new evidence dir without disturbing the prior one.
- .gitignore rules for the new evidence dir's raw .pt.trace.json.gz (~346 MB total) and *_serve.log + profiler_out_*.txt, parallel to the existing 2026-05-02 rules.
Committed evidence (per AGENTS.md S4): summary.md, per-leg profile_kernels.csv, profile_metadata.json, gsm8k.json, gsm8k_metadata.json, perq.jsonl, mem_watchdog.log. Force-added per memory:feedback_evidence_force_add since benchmarks/**/*.json is gitignored.
Co-authored-by: Natfii <27841768+Natfii@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Three-commit series adding an SSM zero-on-realloc guard alongside the existing full-attention KV zero-on-realloc path, plus a reproducible sentinel-gated 4-arm ablation harness and the 2026-05-15 harness-validation evidence.
- Commit 633a96669: `MambaBlockZeroer`, a sister to `KVBlockZeroer`. Walks the same block-ID list at request-free / block-realloc time, zeroing recycled `conv_state` / `ssm_state` rows via `torch.index_fill_` before the next prefill writes them. Always-on; rollback is `git revert` (EngineCore strips env vars, so an env kill switch is unreliable).
- Commit 655fc8c32: `scripts/ablation/` (runner + compare + overlay patch + prep helper), `docs/research/2026-05-15-ssm-zero-on-realloc/README.md`, plus backwards-compatible instrumentation in `scripts/gsm8k_eval_50.py` (`--run-index`, `--metrics-url`, per-Q JSONL, /metrics snapshots).
- Commit 8f16a9eb1: `benchmarks/nvllm/traces/ssm_zero_on_realloc/2026-05-15-sentinel-ablation/` with `summary.md` per AGENTS.md §4, ANALYSIS.md, runner_manifest.json, comparison.json, four per-arm verdicts. Also fixes a residual `IMAGE_DIGEST` emit bug in the runner (`cmd || echo X` -> `cmd || true` + `${VAR:-X}`).
Why this is shipped (and what is NOT claimed)
The existing `KVBlockZeroer` (upstream PR vllm-project#35219) clears full-attn KV blocks but skips Mamba layers because the conv / ssm page sizes differ from the full-attn page size and cannot share the Triton kernel's uniform `PAGE_SIZE_EL`. `MambaBlockZeroer` covers the remaining state. This addresses the mamba-state lifecycle gap identified during the β-coop sustained-load collapse diagnosis arc.
Honest framing (what this PR does and does not prove):
- … 48/48/48/48/48 vs D2.7 baseline 47/47/47/11/35; decode ~2.5 → ~9.4 tok/s) lives in the local `/tmp/ssm_zero_on_free_soak/` directory and is referenced in `memory:project_beta_coop_sustained_collapse`. The committed harness is for the next collapse discrimination window.
- The KV `new_block_ids` channel relax is NOT shipped to default. The 4-arm sweep showed a deterministic -1 question on the `kv_only` arm (47/50 × 5 vs 48/50 × 5 on the other three arms). The relax is kept in the harness overlay patch only.
4-arm sentinel ablation result (2026-05-15)
Arms: `both`, `neither`, `ssm_only`, `kv_only`.
`harness_pass=true` for all four arms means: when `SSM_sentinel=1` the SSM gate fired (and not when 0); same for KV. The env-strip confound that made a prior env-gated attempt a null A/B is eliminated. See `benchmarks/nvllm/traces/ssm_zero_on_realloc/2026-05-15-sentinel-ablation/summary.md` for the full host/image manifest and the per-Q breakdown.
Test plan
- … `grep -c "sentinel\|SENTINEL\|/run/nvllm" vllm/v1/worker/utils.py vllm/v1/worker/gpu_model_runner.py` → 0).
- `git apply scripts/ablation/ssm_sentinel_overlay.patch` then `git checkout --` restores HEAD cleanly; markers `_SSM_ZERO_SENTINEL` and `_KV_ZERO_SENTINEL` land where expected.
- … `bash -n`); compare script Python-syntax checked (`ast.parse`).
- `scripts/gsm8k_eval_50.py` backwards-compatible: `--run-index` defaults to 0, `--metrics-url` defaults to `None`; existing callers are unaffected.
- … `verdict.json` files parse and report `gate_pass=true` + `harness_pass=true`.
- … `/tmp/ssm_zero_on_free_soak` patch was load-bearing (working hypothesis: SSM-only).
Constraints (from work-session policy)
- … `Navi-AI-Lab/nvllm` `main`, not upstream `vllm-project/vllm`, per `memory:feedback_never_touch_upstream_vllm`.
🤖 Generated with Claude Code