feat: SSM zero-on-realloc guard + sentinel ablation harness #13

Merged
Natfii merged 3 commits into main from feat/ssm-zero-on-realloc on May 16, 2026
Natfii commented May 16, 2026

Summary

Three-commit series adding an SSM zero-on-realloc guard alongside the existing full-attention KV zero-on-realloc path, plus a reproducible sentinel-gated 4-arm ablation harness and the 2026-05-15 harness-validation evidence.

  • Commit 1 — production patch (633a96669): MambaBlockZeroer sister to KVBlockZeroer. Walks the same block-ID list at request-free / block-realloc time, zeroing recycled conv_state / ssm_state rows via torch.index_fill_ before the next prefill writes them. Always-on; rollback is git revert (EngineCore strips env vars, so an env kill switch is unreliable).
  • Commit 2 — ablation harness (655fc8c32): scripts/ablation/ (runner + compare + overlay patch + prep helper), docs/research/2026-05-15-ssm-zero-on-realloc/README.md, plus backwards-compatible instrumentation in scripts/gsm8k_eval_50.py (--run-index, --metrics-url, per-Q JSONL, /metrics snapshots).
  • Commit 3 — evidence (8f16a9eb1): benchmarks/nvllm/traces/ssm_zero_on_realloc/2026-05-15-sentinel-ablation/ with summary.md per AGENTS.md §4, ANALYSIS.md, runner_manifest.json, comparison.json, four per-arm verdicts. Also fixes a residual IMAGE_DIGEST emit bug in the runner (cmd || echo X -> cmd || true + ${VAR:-X}).

Why this is shipped (and what is NOT claimed)

The existing KVBlockZeroer (upstream PR vllm-project#35219) clears full-attn KV blocks but skips Mamba layers because the conv / ssm page sizes differ from the full-attn page size and cannot share the Triton kernel's uniform PAGE_SIZE_EL. MambaBlockZeroer covers the remaining state. This addresses the mamba-state lifecycle gap identified during the β-coop sustained-load collapse diagnosis arc.
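The per-tensor zeroing described above can be sketched as follows. This is a minimal illustration, not the code in vllm/v1/worker/utils.py: the function name `zero_recycled_rows`, the tensor shapes, and the assumption that block IDs index dimension 0 of each state tensor are all hypothetical. Only `torch.index_fill_` comes from the PR text.

```python
import torch


def zero_recycled_rows(state: torch.Tensor, block_ids: list[int]) -> None:
    """Zero, in place, the rows of `state` (along dim 0) named by `block_ids`."""
    if not block_ids:
        return  # nothing was recycled on this step
    index = torch.tensor(block_ids, dtype=torch.long, device=state.device)
    # index_fill_ does not care about the per-row shape, so pools with
    # different page sizes can share one code path.
    state.index_fill_(0, index, 0)


# Hypothetical state pools: 4 blocks each, with different per-block shapes.
conv_state = torch.ones(4, 3, 2)
ssm_state = torch.ones(4, 5)

recycled = [1, 3]  # block IDs being handed to a new request
for pool in (conv_state, ssm_state):
    zero_recycled_rows(pool, recycled)
    zero_recycled_rows(pool, recycled)  # idempotent: a second call changes nothing
```

Because the fill works per tensor, differing conv and ssm row shapes are not a problem here, which is the property the uniform-page-size Triton kernel lacks.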

Honest framing — what this PR does and does not prove:

  • The β-coop sustained-load collapse was not reproducing on the 2026-05-15 host. The 4-arm sentinel ablation in this PR cannot prove the SSM patch fixes the collapse — it proves the sentinel-gating mechanism works and that the patch is correctness/perf neutral under non-collapse load. The original fix-validation evidence (patched soak: 48/48/48/48/48 vs D2.7 baseline 47/47/47/11/35; decode ~2.5 → ~9.4 tok/s) lives in the local /tmp/ssm_zero_on_free_soak/ directory and is referenced in memory:project_beta_coop_sustained_collapse. The committed harness is for the next collapse discrimination window.
  • No perf claim. No nsys trace was captured. Median decode is flat across all 4 arms (within 0.03 tok/s).
  • KV new_block_ids channel relax is NOT shipped to default. The 4-arm sweep showed a deterministic -1 question on the kv_only arm (47/50 × 5 vs 48/50 × 5 on the other three arms). The relax is kept in the harness overlay patch only.

4-arm sentinel ablation result (2026-05-15)

| Arm | SSM sentinel | KV sentinel | runs (correct/50) | first_fire (ssm, kv) | gate_pass | harness_pass |
|---|---|---|---|---|---|---|
| both | 1 | 1 | 48,48,48,48,48 | (1, 1) | true | true |
| neither | 0 | 0 | 48,48,48,48,48 | (0, 0) | true | true |
| ssm_only | 1 | 0 | 48,48,48,48,48 | (1, 0) | true | true |
| kv_only | 0 | 1 | 47,47,47,47,47 | (0, 1) | true | true |

harness_pass=true for all four arms means that the SSM gate fired in every arm whose SSM sentinel was 1 and did not fire in any arm whose SSM sentinel was 0, and likewise for the KV gate. This eliminates the env-strip confound that turned a prior env-gated attempt into a null A/B. See benchmarks/nvllm/traces/ssm_zero_on_realloc/2026-05-15-sentinel-ablation/summary.md for the full host/image manifest and the per-Q breakdown.
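The harness_pass rule can be written down as a small check. The sketch below is an assumption about the logic, not the code in scripts/ablation/: the `Arm` dataclass and its field names are hypothetical, and the four rows are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arm:
    name: str
    ssm_sentinel: bool   # was the SSM sentinel file present for this arm?
    kv_sentinel: bool    # was the KV sentinel file present for this arm?
    first_fire_ssm: int  # count of SSM first-fire log lines
    first_fire_kv: int   # count of KV first-fire log lines


def harness_pass(arm: Arm) -> bool:
    """A gate must fire if and only if its sentinel was enabled."""
    ssm_ok = (arm.first_fire_ssm >= 1) == arm.ssm_sentinel
    kv_ok = (arm.first_fire_kv >= 1) == arm.kv_sentinel
    return ssm_ok and kv_ok


ARMS = [
    Arm("both", True, True, 1, 1),
    Arm("neither", False, False, 0, 0),
    Arm("ssm_only", True, False, 1, 0),
    Arm("kv_only", False, True, 0, 1),
]
results = {arm.name: harness_pass(arm) for arm in ARMS}
```

An arm that enabled a sentinel but logged no fire, such as `Arm("broken", True, False, 0, 0)`, fails this check. That is the signature of the env-strip confound.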

Test plan

  • Production patch diff inspected; no sentinel / env / debug residue (grep -c "sentinel\|SENTINEL\|/run/nvllm" vllm/v1/worker/utils.py vllm/v1/worker/gpu_model_runner.py → 0).
  • Sentinel overlay patch roundtrips: git apply scripts/ablation/ssm_sentinel_overlay.patch then git checkout -- restores HEAD cleanly; markers _SSM_ZERO_SENTINEL and _KV_ZERO_SENTINEL land where expected.
  • Runner script bash-syntax checked (bash -n); compare script Python-syntax checked (ast.parse).
  • scripts/gsm8k_eval_50.py backwards-compatible: --run-index defaults to 0, --metrics-url defaults to None; existing callers are unaffected.
  • All four per-arm verdict.json files parse and report gate_pass=true + harness_pass=true.
  • Future: re-run the harness against a collapsing host state to discriminate which half of the original /tmp/ssm_zero_on_free_soak patch was load-bearing (working hypothesis: SSM-only).
  • Future: phaseE-tax bench re-run post-SSM-fix to verify the Phase 4 verdict still holds (task #38, tracked separately).
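The backwards-compatibility claim for scripts/gsm8k_eval_50.py in the test plan above rests on both new flags having no-op defaults. A minimal argparse sketch of that pattern follows. Only the flag names and their defaults come from the PR; the parser description, help strings, and example URL are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Illustrative subset of flags; the real script defines more options.
    parser = argparse.ArgumentParser(description="GSM8K eval (sketch)")
    parser.add_argument("--run-index", type=int, default=0,
                        help="index of this run within a multi-run arm")
    parser.add_argument("--metrics-url", default=None,
                        help="if set, snapshot this /metrics endpoint")
    return parser


# An existing caller that passes neither flag sees the old behaviour.
legacy = build_parser().parse_args([])

# The ablation runner opts in explicitly (hypothetical URL).
instrumented = build_parser().parse_args(
    ["--run-index", "3", "--metrics-url", "http://localhost:8000/metrics"]
)
```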

Constraints (from work-session policy)

  • AI assistance was used for the patch authorship, the harness, and the commit-message drafting. Human authored the design constraints (drop KV channel relax from production, no "fixes collapse" claim, no perf claim without nsys), reviewed every changed line, and ran the ablation suite end-to-end.
  • This PR targets Navi-AI-Lab/nvllm main, not upstream vllm-project/vllm, per memory:feedback_never_touch_upstream_vllm.

🤖 Generated with Claude Code

Natfii and others added 3 commits May 15, 2026 20:21

Commit 1 (633a96669), production patch:
Adds an SSM zero-on-realloc guard alongside the existing KV zero-on-realloc
path. KVBlockZeroer.zero_block_ids now also walks a sister MambaBlockZeroer
on the same block-ID list, zeroing recycled conv_state / ssm_state rows
before the next prefill writes into them.

Rationale: the existing KVBlockZeroer (upstream PR vllm-project#35219) clears full-attn
KV blocks at request-free / block-realloc time but skips Mamba layers
because the conv / ssm page sizes differ from the full-attn page size and
cannot share the Triton kernel's uniform PAGE_SIZE_EL. MambaBlockZeroer
covers the remaining state via per-tensor torch.index_fill_, which is
simple, idempotent, and runs only outside the hot decode path.

Under the 2026-05-15 non-reproducing host state, the patch is correctness
and perf neutral; the companion sentinel-gated ablation harness (added in
follow-up commits) proves the toggle path executes (4-arm sweep, 20 runs,
both/neither/ssm_only all 48/50, kv_only deterministic 47/50).

No perf claim is made: no nsys trace was captured. Rollback is `git revert`
since EngineCore strips env vars on subprocess spawn, making an env kill
switch unreliable.

Files:
- vllm/v1/worker/utils.py: add MambaBlockZeroer class, add fire counter,
  wire it into KVBlockZeroer.{init_meta, zero_block_ids}.
- vllm/v1/worker/gpu_model_runner.py: materialize the attn-groups iterator
  into a list so KVBlockZeroer can walk it twice.
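The gpu_model_runner.py change guards against a general Python pitfall: an iterator can be consumed only once. The sketch below shows the pitfall and the fix with a hypothetical `attn_groups` generator. The real iterator and its element type are not shown in this PR.

```python
def attn_groups():
    # Hypothetical stand-in for the attn-groups iterator.
    yield from ("group0", "group1")


# Pitfall: a second walk over the same iterator sees nothing.
lazy = attn_groups()
first_walk = list(lazy)
second_walk = list(lazy)  # already exhausted, so this is empty

# Fix: materialize once, then walk as many times as needed.
groups = list(attn_groups())
kv_pass = [g for g in groups]     # e.g. the KV metadata pass
mamba_pass = [g for g in groups]  # e.g. the Mamba metadata pass
```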

Evidence: benchmarks/nvllm/traces/ssm_zero_on_realloc/2026-05-15-sentinel-ablation/

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Commit 2 (655fc8c32), ablation harness:
Adds a reproducible 4-arm ablation harness for the SSM zero-on-realloc
patch (preceding commit) plus the KV new_block_ids channel relax that was
investigated alongside it but explicitly NOT shipped to default.

Rationale: the patch's correctness rationale is the SSM lifecycle gap
identified in the mamba-state audit; under the 2026-05-15 host state it is
perf/correctness neutral. The β-coop sustained-load collapse was not
reproducing during this work, so the harness exists for future discrimination
windows, not to validate the current commit.

Layout:
- scripts/ablation/run_ssm_ablation_suite.sh: 4-arm runner (both / neither /
  ssm_only / kv_only). Per-arm bind-mounts a per-arm sentinel dir at
  /run/nvllm :ro. Captures per-arm verdict.json with harness_validation
  block (enabled arms must log first_fire>=1; disabled arms must not).
  Counter helper uses awk to avoid the `grep -c ... || echo 0` bug that
  emitted "0\n0" when grep found zero matches.
- scripts/ablation/ssm_ablation_compare.py: post-suite analyzer; reads
  per-arm verdict.json + perq.jsonl, emits ANALYSIS.md with verdict table,
  Run-4 (collapse window) per-Q breakdown, steady-state stats, friend's
  interpretation thresholds, and KV-drained invariant.
- scripts/ablation/ssm_sentinel_overlay.patch: unified diff that overlays
  the sentinel-gated debug version on top of the production code from the
  preceding commit. The runner expects the patched files at
  $PATCHED_REPO/vllm/v1/{worker,core}/...
- scripts/ablation/prepare_sentinel_overlay.sh: clones HEAD into a scratch
  dir and applies the overlay; the runner just bind-mounts the result.
- docs/research/2026-05-15-ssm-zero-on-realloc/README.md: design rationale,
  why sentinels not env vars (EngineCore strips env), what the commit
  series does and does not claim, reproduce commands.
- scripts/gsm8k_eval_50.py: backwards-compatible instrumentation
  (--run-index, --metrics-url, per-Q JSONL, /metrics snapshots). Both new
  flags default to no-op; existing callers are unaffected.

Sentinel paths:
- /run/nvllm/zero_ssm_on_realloc.enabled -> SSM gate fires
- /run/nvllm/kv_zero_for_mamba_ids.enabled -> KV new_block_ids relax fires
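Sentinel-file gating of this kind can be sketched as a plain file-existence check, which survives subprocess spawning because it does not depend on the environment. The helper name `gate_enabled` is hypothetical, and a temporary directory stands in for the bind-mounted /run/nvllm. Only the two sentinel file names are taken from the list above.

```python
import tempfile
from pathlib import Path

SSM_SENTINEL = "zero_ssm_on_realloc.enabled"
KV_SENTINEL = "kv_zero_for_mamba_ids.enabled"


def gate_enabled(sentinel_dir: Path, name: str) -> bool:
    """A gate is on if and only if its sentinel file exists in the directory."""
    return (sentinel_dir / name).is_file()


# Emulate the ssm_only arm: only the SSM sentinel is present.
with tempfile.TemporaryDirectory() as tmp:
    arm_dir = Path(tmp)  # stands in for the per-arm /run/nvllm mount
    (arm_dir / SSM_SENTINEL).touch()
    ssm_on = gate_enabled(arm_dir, SSM_SENTINEL)
    kv_on = gate_enabled(arm_dir, KV_SENTINEL)
```

Mounting the directory read-only, as the runner does, means the process under test cannot flip its own gate mid-run.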

The KV relax is in the overlay for completeness but NOT in the production
patch: the 2026-05-15 4-arm sweep showed a deterministic -1 question on
kv_only (47/50 x 5 vs 48/50 x 5 on the other three arms).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Commit 3 (8f16a9eb1), evidence:
Commits the harness-validation evidence for the SSM zero-on-realloc
patch + ablation harness (preceding two commits in this branch).

Honest framing:
  This is harness validation only. The β-coop sustained-load collapse
  was not reproducing on the 2026-05-15 host, so the 4-arm sweep cannot
  prove the SSM patch fixes the collapse. What it does prove:
  - Sentinel-file gating works through vLLM EngineCore (env-stripped)
    where env-var gating did not.
  - Production SSM patch is correctness-neutral under non-collapse load
    (both / neither / ssm_only all 48/50, identical to baseline).
  - The KV new_block_ids channel relax (kv_only arm) is NOT
    correctness-neutral: a deterministic -1 question (47/50 x 5). This
    is the basis for not shipping the KV relax to default.

No nsys trace was captured; no perf claim is made. Median decode is
flat across arms (within 0.03 tok/s).

Layout:
  benchmarks/nvllm/traces/ssm_zero_on_realloc/2026-05-15-sentinel-ablation/
    summary.md            - per AGENTS.md S4: host/image manifest,
                            per-arm verdict table, what shows/doesn't,
                            reproduce commands
    ANALYSIS.md           - per-Q Run-4 breakdown, steady-state stats,
                            drained-KV invariant table
    runner_manifest.json  - host_driver, image_id, prompt_set_hash,
                            git_sha (manifest's image_digest "\nno-digest"
                            artifact cleaned in-place at copy time)
    comparison.json       - aggregate per-arm verdict pointer
    verdict-{arm}.json    - four per-arm verdicts (force-added per
                            memory:feedback_evidence_force_add since
                            benchmarks/**/*.json is gitignored)

Also fixes the IMAGE_DIGEST emit bug in
scripts/ablation/run_ssm_ablation_suite.sh (the same "cmd 2>/dev/null
|| echo X" pattern that bit the harness counter; uses "cmd 2>/dev/null
|| true" + "${VAR:-X}" instead).
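The `cmd || echo X` failure mode is easy to reproduce. The sketch below drives bash from Python and uses `grep -c` on an empty input as the failing command, matching the counter bug described in the harness commit. The pattern string and the `capture` helper are illustrative, and bash and grep are assumed to be on PATH.

```python
import subprocess


def capture(script: str) -> str:
    """Run a bash snippet and return exactly what it wrote to stdout."""
    done = subprocess.run(["bash", "-c", script],
                          capture_output=True, text=True, check=True)
    return done.stdout


# grep -c prints "0" AND exits non-zero when nothing matches, so the
# fallback echo appends a second "0" to the captured value.
buggy = capture('n=$(grep -c needle /dev/null || echo 0); printf "%s" "$n"')

# Fix: swallow the exit status, then default only if the value is empty.
fixed = capture('n=$(grep -c needle /dev/null || true); printf "%s" "${n:-0}"')

print(repr(buggy))  # '0\n0'
print(repr(fixed))  # '0'
```

The general point is that `|| echo X` is safe only when the failing command is guaranteed to print nothing. `${VAR:-X}` defaults on emptiness rather than on exit status.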

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii merged commit 6761983 into main on May 16, 2026
Natfii deleted the feat/ssm-zero-on-realloc branch on May 16, 2026 00:37
Natfii added a commit that referenced this pull request May 16, 2026
…14)

All three memtester runs returned rc=0 with 0 FAILURE lines:
- sanity_4G:  436s,  rc=0, FAILURE count=0
- band_32G:   3391s, rc=0, FAILURE count=0
- band_64G:   6764s, rc=0, FAILURE count=0

This gate ruled out RAM in the 4G / 32G / 64G bands as a hardware
contributor to the β-coop sustained-load collapse arc, prior to running
D2.7. D2.7 was subsequently skipped because the SSM zero-on-realloc fix
(PR #13) closed the diagnosis arc.

The 32-64 GB band is the suspect one per NVIDIA forum reports of
reproducible memtester failures on some Spark units. This unit passed.

Context for why this gate exists (Spark fleet caveats):
- LPDDR5x on Spark has no ECC (NVIDIA-confirmed); silent bit-flip
  corruption is possible.
- Memtester 32-64 GB failures reported by other Spark owners; unresolved
  upstream as of 2026-05-14.
- Thermal-sensor errata on at least one unit means nvidia-smi
  throttle-reasons is not a reliable HW-health gate.

See benchmarks/nvllm/traces/hw_gates/2026-05-14-memtester-32G-64G-clean/
summary.md for the full host/manifest, what this does and does not
prove, and reproduction commands. Sources linked there.

Co-authored-by: Natfii <27841768+Natfii@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii added a commit that referenced this pull request May 16, 2026
)

Fresh nvllm:gb10-ssm image off main HEAD (PR #13 squash, 6761983)
with the SSM zero-on-realloc patch baked in. Re-ran the 2026-05-02
phaseE-tax 3-leg sweep (lower8 / phaseE-off / all-beta) to verify the
Phase 4 verdict still holds post-SSM-fix.

Result: Phase 4 stays dead. Verdict reproduced within run-to-run noise
(~1-2% per-kernel mean_us, identical ordering, GSM8K within +-2 questions).

Per-kernel A/B (2026-05-02 -> 2026-05-15, mean_us per call):
- DecodeKernel:           17.04-17.11 -> 17.13-17.27  (+0.2 to +1.3%)
- PhaseE_Beta_Kernel:     40.64-40.83 -> 41.31-41.46  (+1.6%)
- Phase_D_MLP_Kernel:     23.93 -> 24.15              (+0.9%)

Per-token aggregate (ms/tok, decode + mlp + beta):
- lower8:     320 -> 322  (+0.8%)
- phaseE-off: 656 -> 663  (+1.0%)
- all-beta:   369 -> 372  (+0.7%)

Ordering preserved: lower8 << all-beta << phaseE-off. SSM patch did
NOT change the cost model that resolved against Phase 4; the patch
fires at request-realloc boundaries outside the decode hot path.

GSM8K (50 questions, seed=42):
- lower8:     46/50 (was 47/50, 1 timeout on Q45 long-output boundary)
- phaseE-off:  4/50 (was 2/50, mostly 180s timeouts as expected)
- all-beta:   47/50 (was 47/50)

memory:feedback_phase4_dead needs NO update. Path to re-opening Phase 4
remains the same: make beta cheaper first (NVFP4 GEMV K-parallel
reduction), then revisit fusion atop a cheaper beta kernel.

Also includes:
- 1-line ergonomic fix to docs/research/2026-05-02-phaseE-tax-3leg/
  run_3leg.sh: make OUT_ROOT env-overridable so re-runs land in a new
  evidence dir without disturbing the prior one.
- .gitignore rules for the new evidence dir's raw .pt.trace.json.gz
  (~346 MB total) and *_serve.log + profiler_out_*.txt, parallel to
  the existing 2026-05-02 rules.

Committed evidence (per AGENTS.md S4): summary.md, per-leg
profile_kernels.csv, profile_metadata.json, gsm8k.json, gsm8k_metadata.json,
perq.jsonl, mem_watchdog.log. Force-added per memory:feedback_evidence_force_add
since benchmarks/**/*.json is gitignored.

Co-authored-by: Natfii <27841768+Natfii@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>