evidence + diagnosis: β-coop layer-sweep + 12L_3_47 survival-soak gate failure #12
Closed
Natfii wants to merge 2 commits into
Stage 2b 5-run survival soak at the 12L_3_47 beta-coop arm
(CUTE_PHASE_E_LAYERS=3,7,11,15,19,23,27,31,35,39,43,47, CUTE_WO_SPLIT=8)
fails Gate 2b:
Run 1: 48/50 pass
Run 2: 48/50 pass
Run 3: 48/50 pass
Run 4: 11/50 fail (collapse at Q12, persistent through end of run)
Run 5: 37/50 fail (inherited collapse Q0-Q13, sharp recovery at Q14)
Container alive at end, 0 errors logged, 0 docker-log corruption hits.
Silent quality collapse under sustained load, not a crash. The Q13->Q14
recovery is sharp (261s broken wall -> 64.3s clean wall) — categorical
state flip, not analog drift.
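The verdict above can be re-derived from the five run scores. The following is a minimal sketch, not the project's actual gate script: the threshold `ASSUMED_MIN_CORRECT` and the "every run must pass" rule are assumptions (the source only shows that 48/50 passes and 37/50 fails), and the names `RunResult` and `gate_2b_pass` are hypothetical.

```python
from dataclasses import dataclass

# ASSUMPTION: the real Gate 2b threshold is not stated in the source.
# 48 is the lowest score shown passing; 37 is the highest score shown failing.
ASSUMED_MIN_CORRECT = 48


@dataclass(frozen=True)
class RunResult:
    run: int
    correct: int
    total: int = 50

    def passed(self, min_correct: int = ASSUMED_MIN_CORRECT) -> bool:
        return self.correct >= min_correct


def gate_2b_pass(runs, min_correct=ASSUMED_MIN_CORRECT):
    """Assumed rule: the soak gate passes only if every run passes."""
    return all(r.passed(min_correct) for r in runs)


# Scores taken from the soak table above.
SOAK = [
    RunResult(1, 48),
    RunResult(2, 48),
    RunResult(3, 48),
    RunResult(4, 11),
    RunResult(5, 37),
]

if __name__ == "__main__":
    for r in SOAK:
        status = "pass" if r.passed() else "fail"
        print(f"Run {r.run}: {r.correct}/{r.total} {status}")
    print("gate_2b_pass =", gate_2b_pass(SOAK))
```

Under these assumptions Runs 4 and 5 fail individually, so the gate fails as a whole, which matches gate_2b_pass=false in verdict.json.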
Bundles in the Task 0a/1a infrastructure the sweep depended on:
- scripts/serve-cute.sh: extends the /tmp/c2_diag/ENV sentinel file to
pass CUTE_PHASE_E_* + CUTE_BETA_REGION_TIMING through the EngineCore
env-strip (per feedback_vllm_enginecore_env_strip). Also adds
NVLLM_BIND_MOUNT_QWEN35=1 bind-mount option and a CAUTION header
note documenting why 2L_3_7 stays the safe default until the soak
failure has a root cause.
- vllm/nvllm/models/qwen3_5.py: extends the sentinel-file accept-list
to CUTE_PHASE_E_* and CUTE_BETA_REGION_TIMING= keys.
- vllm/v1/attention/backends/cute_paged/_backend.py: adds the
[PHASE_E_DISPATCH] audit log (CUTE_PHASE_E_DISPATCH_LOG=1) that
extract_dispatch_log.py parses for the per-leg dispatch gate.
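The sentinel accept-list described above can be sketched as follows. This is an illustration only, assuming the sentinel file holds one KEY=VALUE pair per line; the real file format, the pre-existing accepted keys, and the function name `parse_sentinel` are not shown in the source and are hypothetical here.

```python
# ASSUMPTION: only the keys this commit adds are modelled; the real
# accept-list in qwen3_5.py also contains whatever keys it had before.
ACCEPTED_PREFIXES = ("CUTE_PHASE_E_",)
ACCEPTED_KEYS = ("CUTE_BETA_REGION_TIMING",)


def parse_sentinel(text):
    """Return only the accept-listed KEY=VALUE pairs from sentinel text."""
    accepted = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key.startswith(ACCEPTED_PREFIXES) or key in ACCEPTED_KEYS:
            accepted[key] = value.strip()
    return accepted


if __name__ == "__main__":
    sample = (
        "CUTE_PHASE_E_LAYERS=3,7\n"
        "CUTE_BETA_REGION_TIMING=1\n"
        "UNRELATED_VAR=ignored\n"
    )
    print(parse_sentinel(sample))
```

An accept-list (rather than a deny-list) keeps unrelated variables in the sentinel file from leaking into the worker process.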
Includes the Stage 0a baseline (per-call beta median, region breakdown),
the 5-arm sweep summary (2L through 16L), and the full Stage 2b
survival soak evidence (5 runs, verdict.json, docker.log,
dispatch_audit, summary.md).
See docs/research/2026-05-09-beta-coop-layer-sweep-wo8/soak/summary.md
for the per-question miss table.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-Stage-2b diagnosis plan documenting:
- Hypothesis space verdicts: per-run aging and monotonic degradation
ruled out; process-level self-recoverable corruption confirmed via
Run 5's sharp recovery at Q14 (261s broken wall -> 64.3s clean wall).
- Strongest code-level suspect (per friend code-review 2026-05-11):
persistent beta-coop workspace/reset lifecycle. wo_output,
mlp_partial_fp32, counters, and barriers live in persistent impl
attributes because host zeroing hung CUDA graph capture. Resets
fire at multiple sites (captured memset op, pre-launch counter
zero_(), inside-kernel MLP-partial reset). A missed reset ->
persistent garbage -> silent quality collapse across many requests
until a natural overwrite clears it. Matches the observed
sustained-load + sharp-recovery shape better than generic
"approximation drift."
- 6-leg bisection plan (D2.0 through D2.5): wo1 vs wo8, lite path,
phaseE=0, layer-count, 2L survival control. Prefix-caching and
eager-mode legs demoted (hybrid models default-off prefix cache
per vllm/config/model.py:1791; eager broken on SM120 per CLAUDE.md).
- D3 per-request instrumentation design: request_id, question index,
layer set, per-layer path, CUTE_WO_SPLIT, nat, data pointers,
reset call counts, finish reason, generated tokens, output hash.
Drops the once-per-pair dedup in CUTE_WO_RESET_LOG under the
diagnostic flag so per-request reset invocations are observable.
- Side hardening: fail-closed CUTE_PHASE_E_LAYERS parsing
(_backend.py:139 currently silently turns malformed CSV into "all
layers beta-coop"; production-safe behavior is to refuse to start).
- Decision: 2L_3_7 stays the serve-cute default. The header CAUTION
block added to scripts/serve-cute.sh in the previous commit
memorializes the policy.
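The fail-closed parsing proposed in the side-hardening item could look roughly like this. It is a sketch under stated assumptions, not the code at _backend.py:139: the function name `parse_phase_e_layers`, the `num_layers` bound, and the choice to reject duplicates are all hypothetical.

```python
from __future__ import annotations


def parse_phase_e_layers(raw: str, num_layers: int) -> tuple[int, ...]:
    """Parse a CSV of layer indices, raising instead of guessing.

    Any malformed token aborts startup rather than silently widening
    the selection to "all layers".
    """
    layers: list[int] = []
    for tok in (t.strip() for t in raw.split(",")):
        # Rejects empty tokens, signs, and any non-digit characters.
        if not (tok.isascii() and tok.isdigit()):
            raise ValueError(f"CUTE_PHASE_E_LAYERS: bad token {tok!r} in {raw!r}")
        idx = int(tok)
        if idx >= num_layers:
            raise ValueError(f"CUTE_PHASE_E_LAYERS: layer {idx} out of range")
        if idx in layers:
            raise ValueError(f"CUTE_PHASE_E_LAYERS: duplicate layer {idx}")
        layers.append(idx)
    return tuple(layers)


if __name__ == "__main__":
    # 48 is an assumed layer count, chosen only so that index 47 is valid.
    print(parse_phase_e_layers("3,7", 48))
    try:
        parse_phase_e_layers("3,,7", 48)
    except ValueError as exc:
        print("refused to start:", exc)
```

Raising at parse time means a typo in the environment variable surfaces as a startup failure instead of an unintended all-layers beta-coop configuration.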
Refs: docs/research/2026-05-09-beta-coop-layer-sweep-wo8/soak/summary.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author
Closing without merging. The β-coop sustained-load collapse that this PR's evidence was diagnosing was resolved 2026-05-15 via the SSM zero-on-realloc patch (PR #13). The cherry-pick bisection arc this PR documented (Stage 2b + D2.1–D2.6) turned out to be a dead end relative to the actual substrate: stale mamba recurrent state on slot recycle, not anything in the PR #10 tier-1 cherry-picks. The bisection legs perturbed the shape but never fixed the magnitude, which is the canonical "substrate not in the bisect frame" signal (see
The diagnosis-arc receipts are preserved in:
No code in this branch needs to land on main; the resolution path was different.
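The failure mode named in this comment, stale recurrent state surviving a slot recycle, can be illustrated with a toy model. This is not the PR #13 patch: the `SlotPool` class, the scalar recurrence, and the `zero_on_alloc` flag are all hypothetical stand-ins chosen to show why zeroing on reallocation removes the cross-request contamination.

```python
class SlotPool:
    """Toy pool of per-request recurrent-state slots (hypothetical)."""

    def __init__(self, num_slots, state_dim, zero_on_alloc):
        self.state = [[0.0] * state_dim for _ in range(num_slots)]
        self.free = list(range(num_slots))
        self.zero_on_alloc = zero_on_alloc

    def alloc(self):
        slot = self.free.pop()
        if self.zero_on_alloc:
            # The fix being illustrated: clear state when a slot is reused.
            self.state[slot] = [0.0] * len(self.state[slot])
        return slot

    def release(self, slot):
        # State is deliberately left in place, as in the buggy scenario.
        self.free.append(slot)


def run_request(pool, inputs, decay=0.5):
    """Run a toy recurrence h = decay * h + x and return the final state."""
    slot = pool.alloc()
    h = pool.state[slot]
    for x in inputs:
        for i in range(len(h)):
            h[i] = decay * h[i] + x
    out = list(h)
    pool.release(slot)
    return out


if __name__ == "__main__":
    for flag in (False, True):
        pool = SlotPool(num_slots=1, state_dim=2, zero_on_alloc=flag)
        first = run_request(pool, [1.0])
        second = run_request(pool, [1.0])
        print(f"zero_on_alloc={flag}: first={first} second={second}")
```

Without zeroing, the second identical request inherits the first request's state and produces a different result; with zeroing, both requests match. No error is raised in either case, which is consistent with a silent quality collapse rather than a crash.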
Summary
- Stage 2b 5-run survival soak at the 12L_3_47 beta-coop arm (CUTE_PHASE_E_LAYERS=3,7,11,15,19,23,27,31,35,39,43,47, CUTE_WO_SPLIT=8) fails Gate 2b: silent quality collapse under sustained load (Runs 4-5 collapse mid-run; container stays alive; 0 errors logged; no docker-log corruption hits).
- Bundles in the Task 0a/1a infrastructure the sweep depended on (sentinel pass-through for CUTE_PHASE_E_* + CUTE_BETA_REGION_TIMING, [PHASE_E_DISPATCH] audit log, NVLLM_BIND_MOUNT_QWEN35=1) plus the full Stage 0a + 5-arm sweep + Stage 2b soak evidence.
- Adds docs/research/2026-05-09-beta-coop-layer-sweep-wo8/DIAGNOSIS.md opening a diagnosis arc: 6-leg bisection plan, leading hypothesis (persistent β-coop workspace/reset lifecycle), D3 per-request instrumentation design, side-hardening note (fail-closed CUTE_PHASE_E_LAYERS parsing).
- Keeps 2L_3_7 as the serve-cute.sh default via a header CAUTION block in scripts/serve-cute.sh.

Soak verdict
gate_2b_pass=false, container_alive_at_end=true, docker_log_corruption_hits=0. The Q13→Q14 recovery (261s broken wall → 64.3s clean wall) is sharp and categorical, not analog drift: a strong signal for a state flip (eviction / rotation / reset) rather than gradual numerical degradation.

Leading hypothesis (per friend code-review)
β-coop reads its workspace buffers from persistent impl attributes (wo_output, mlp_partial_fp32, counters, barriers) because host-side zeroing hung CUDA graph capture. Resets fire at multiple sites (captured memset op, pre-launch zero_(), inside-kernel MLP-partial reset). A missed reset → persistent garbage → silent quality collapse across many requests until a natural overwrite clears it. Matches the observed sustained-load + sharp-recovery shape better than generic "approximation drift."

See DIAGNOSIS.md for the full bisection table (D2.0–D2.5: wo1, lite, phaseE=0, layer-count, 2L survival control), the demotions (prefix caching off because hybrid models default-off already; eager mode broken on SM120), and the D3 instrumentation field list.

Test plan
- stage0a/verdict.md
- runner.sh, arms.csv, extract_dispatch_log.py, dispatch-audit gate
- sweep/summary.md
- soak/summary.md
- CUTE_PHASE_E_LAYERS parsing (follow-up)

AI assistance disclosure
Claude Opus 4.7 was used for runner scripting, sweep analysis, summary aggregation, and diagnosis-plan authoring. All runtime evidence (5 soak runs, 5 sweep arms, Stage 0a baseline, dispatch audits) was produced on the human submitter's DGX Spark. Code changes were reviewed line-by-line by the submitter and an external friend whose code-review notes are credited inline in DIAGNOSIS.md (load-bearing citations verified live against the working tree before incorporation).

🤖 Generated with Claude Code