evidence + diagnosis: β-coop layer-sweep + 12L_3_47 survival-soak gate failure #12

Closed
Natfii wants to merge 2 commits into main from work/beta-layer-sweep-wo8

Natfii commented May 11, 2026

Summary

  • Stage 2b 5-run survival soak at the 12L_3_47 β-coop arm (CUTE_PHASE_E_LAYERS=3,7,11,15,19,23,27,31,35,39,43,47, CUTE_WO_SPLIT=8) fails Gate 2b: silent quality collapse under sustained load (Run 4 collapses mid-run at Q12; Run 5 starts collapsed and recovers at Q14; container stays alive; 0 errors logged; no docker-log corruption hits).
  • Bundles the Task 0a/1a infrastructure the sweep depended on (sentinel-file workaround for CUTE_PHASE_E_* + CUTE_BETA_REGION_TIMING, [PHASE_E_DISPATCH] audit log, NVLLM_BIND_MOUNT_QWEN35=1) plus the full Stage 0a + 5-arm sweep + Stage 2b soak evidence.
  • Adds docs/research/2026-05-09-beta-coop-layer-sweep-wo8/DIAGNOSIS.md opening a diagnosis arc: 6-leg bisection plan, leading hypothesis (persistent β-coop workspace/reset lifecycle), D3 per-request instrumentation design, side-hardening note (fail-closed CUTE_PHASE_E_LAYERS parsing).
  • Memorializes that 2L_3_7 stays the serve-cute.sh default, via a header CAUTION block in scripts/serve-cute.sh.
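
The sentinel-file workaround in the second bullet can be pictured roughly as follows. This is a minimal sketch, not the code in qwen3_5.py: the KEY=VALUE line format, the function name `load_sentinel_env`, and the exact accept-list tuple are assumptions made for illustration. Only the variable names CUTE_PHASE_E_* and CUTE_BETA_REGION_TIMING come from this PR.

```python
from pathlib import Path

# Hypothetical accept-list: only keys starting with these prefixes are honored.
ACCEPTED_PREFIXES = ("CUTE_PHASE_E_", "CUTE_BETA_REGION_TIMING")


def load_sentinel_env(path: str) -> dict[str, str]:
    """Read KEY=VALUE lines from a sentinel file, keeping accepted keys only."""
    accepted: dict[str, str] = {}
    sentinel = Path(path)
    if not sentinel.is_file():
        return accepted  # no sentinel file: nothing to forward
    for raw in sentinel.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and lines without a key
        key, value = line.split("=", 1)
        key = key.strip()
        if key.startswith(ACCEPTED_PREFIXES):
            accepted[key] = value.strip()
    return accepted
```

Keys outside the accept-list are dropped silently in this sketch; a stricter variant could log them so a typo in a variable name is visible.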

Soak verdict

| Run | Correct | Errors | Pass |
|-----|---------|--------|------|
| 1 | 48/50 | 0 | yes |
| 2 | 48/50 | 0 | yes |
| 3 | 48/50 | 0 | yes |
| 4 | 11/50 | 0 | no (collapse at Q12, persistent through end of run) |
| 5 | 37/50 | 0 | no (inherited collapse Q0-Q13, sharp recovery at Q14) |

gate_2b_pass=false, container_alive_at_end=true, docker_log_corruption_hits=0. The Q13→Q14 recovery in Run 5 (261s broken wall → 64.3s clean wall) is sharp and categorical, not analog drift, which is a strong signal for a state flip (eviction / rotation / reset) rather than gradual numerical degradation.
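
For reference, the gate verdict can be derived from the per-run numbers along these lines. This is a sketch only: `MIN_CORRECT` is a hypothetical threshold chosen to be consistent with the table (48/50 passes, 37/50 fails), and `SoakRun` is an illustrative name. The real Gate 2b criterion is not stated in this description.

```python
from dataclasses import dataclass

MIN_CORRECT = 45  # hypothetical threshold; the actual gate value is not given here


@dataclass(frozen=True)
class SoakRun:
    run: int
    correct: int
    total: int
    errors: int

    @property
    def passed(self) -> bool:
        # A run passes only if it is error-free AND accurate enough.
        return self.errors == 0 and self.correct >= MIN_CORRECT


# Per-run results copied from the soak verdict.
RUNS = [
    SoakRun(1, 48, 50, 0),
    SoakRun(2, 48, 50, 0),
    SoakRun(3, 48, 50, 0),
    SoakRun(4, 11, 50, 0),
    SoakRun(5, 37, 50, 0),
]


def gate_2b_pass(runs: list[SoakRun]) -> bool:
    """The gate passes only if every run in the soak passes."""
    return all(r.passed for r in runs)


failed = [r.run for r in RUNS if not r.passed]
print(f"gate_2b_pass={gate_2b_pass(RUNS)} failed_runs={failed}")
```

Every run reports 0 errors, so a gate keyed on logged errors alone would have passed; the accuracy check is the only thing that catches the collapse.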

Leading hypothesis (per friend code-review)

β-coop reads its workspace buffers from persistent impl attributes (wo_output, mlp_partial_fp32, counters, barriers) because host-side zeroing hung CUDA graph capture. Resets fire at multiple sites (captured memset op, pre-launch zero_(), inside-kernel MLP-partial reset). A missed reset → persistent garbage → silent quality collapse across many requests until a natural overwrite clears it. Matches the observed sustained-load + sharp-recovery shape better than generic "approximation drift."

See DIAGNOSIS.md for the full bisection table (D2.0–D2.5: wo1, lite, phaseE=0, layer-count, 2L survival control), the demoted legs (prefix caching off, because hybrid models already default it off; eager mode, because it is broken on SM120), and the D3 instrumentation field list.
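
The failure shape the hypothesis describes can be reproduced in a toy model. The sketch below is plain Python, not the β-coop kernel: `PersistentWorkspace`, `run_request`, and the `do_reset` switch are hypothetical names, and a four-slot list stands in for buffers such as mlp_partial_fp32. It only illustrates why accumulating into a persistent buffer without a reset yields wrong results with no error raised.

```python
class PersistentWorkspace:
    """Toy stand-in for a buffer that outlives a single request."""

    def __init__(self, size: int) -> None:
        self.partial = [0.0] * size

    def reset(self) -> None:
        # Zero every slot in place, keeping the same underlying buffer.
        for i in range(len(self.partial)):
            self.partial[i] = 0.0


def run_request(ws: PersistentWorkspace, contributions: list[float],
                do_reset: bool = True) -> float:
    """Accumulate split partial sums into the workspace and return the total."""
    if do_reset:
        ws.reset()
    for i, c in enumerate(contributions):
        ws.partial[i] += c  # accumulation assumes the slot starts at zero
    return sum(ws.partial)


ws = PersistentWorkspace(4)
inputs = [1.0, 2.0, 3.0, 4.0]

clean = run_request(ws, inputs)                  # reset ran: correct total
stale = run_request(ws, inputs, do_reset=False)  # missed reset: leftovers added in
healed = run_request(ws, inputs)                 # next reset clears the garbage

print(clean, stale, healed)  # 10.0 20.0 10.0
```

The wrong value appears and disappears in a single step, which is the categorical on/off behavior seen at Q13→Q14 rather than a gradual drift.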

Test plan

  • Stage 0a baseline (2L+wo8 GSM8K-50 + per-call β median): see stage0a/verdict.md
  • Stage 1a runner harness: runner.sh, arms.csv, extract_dispatch_log.py, dispatch-audit gate
  • Stage 1b-c 5-arm sweep (2L, 4L, 8L, 12L, 16L) with per-arm GSM8K-50 + dispatch audit: see sweep/summary.md
  • Stage 2a dev baseline pick (12L_3_47)
  • Stage 2b 5-run survival soak at 12L_3_47 (this PR's headline artifact): see soak/summary.md
  • D2.1–D2.5 bisection legs (follow-up — each is a 10h-wall soak)
  • D3 per-request instrumentation patch (follow-up)
  • Side hardening: fail-closed CUTE_PHASE_E_LAYERS parsing (follow-up)
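
A sketch of what fail-closed CUTE_PHASE_E_LAYERS parsing could look like is given below. It is an illustration under stated assumptions, not the planned patch: the function name `parse_phase_e_layers`, the `num_layers` bound, and the value 48 used in the example are hypothetical (48 is simply the smallest count that admits layer index 47).

```python
def parse_phase_e_layers(raw: str, num_layers: int) -> list[int]:
    """Parse a comma-separated layer list, raising instead of guessing."""
    tokens = [t.strip() for t in raw.split(",")]
    if any(not t for t in tokens):
        raise ValueError(f"empty field in CUTE_PHASE_E_LAYERS: {raw!r}")
    layers: list[int] = []
    for t in tokens:
        # Accept plain ASCII digits only; signs, floats, and words are rejected.
        if not (t.isascii() and t.isdecimal()):
            raise ValueError(f"non-integer layer {t!r} in CUTE_PHASE_E_LAYERS")
        idx = int(t)
        if idx >= num_layers:
            raise ValueError(f"layer {idx} out of range (num_layers={num_layers})")
        layers.append(idx)
    if len(set(layers)) != len(layers):
        raise ValueError(f"duplicate layer in CUTE_PHASE_E_LAYERS: {raw!r}")
    return layers


# The 12L_3_47 arm from this PR; num_layers=48 is an assumed bound.
arm_12l = parse_phase_e_layers("3,7,11,15,19,23,27,31,35,39,43,47", 48)
print(len(arm_12l))  # 12
```

Raising on malformed input makes the server refuse to start instead of silently falling back to a different layer set.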

AI assistance disclosure

Claude Opus 4.7 was used for runner scripting, sweep analysis, summary aggregation, and diagnosis-plan authoring. All runtime evidence (5 soak runs, 5 sweep arms, Stage 0a baseline, dispatch audits) was produced on the human submitter's DGX Spark. Code changes were reviewed line-by-line by the submitter and an external friend whose code-review notes are credited inline in DIAGNOSIS.md (load-bearing citations verified live against the working tree before incorporation).

🤖 Generated with Claude Code

Natfii and others added 2 commits May 11, 2026 07:27
Stage 2b 5-run survival soak at the 12L_3_47 beta-coop arm
(CUTE_PHASE_E_LAYERS=3,7,11,15,19,23,27,31,35,39,43,47, CUTE_WO_SPLIT=8)
fails Gate 2b:

  Run 1: 48/50 pass
  Run 2: 48/50 pass
  Run 3: 48/50 pass
  Run 4: 11/50 fail (collapse at Q12, persistent through end of run)
  Run 5: 37/50 fail (inherited collapse Q0-Q13, sharp recovery at Q14)

Container alive at end, 0 errors logged, 0 docker-log corruption hits.
Silent quality collapse under sustained load, not a crash. The Q13->Q14
recovery is sharp (261s broken wall -> 64.3s clean wall) — categorical
state flip, not analog drift.

Bundles in the Task 0a/1a infrastructure the sweep depended on:

  - scripts/serve-cute.sh: extends the /tmp/c2_diag/ENV sentinel file to
    pass CUTE_PHASE_E_* + CUTE_BETA_REGION_TIMING through the EngineCore
    env-strip (per feedback_vllm_enginecore_env_strip). Also adds
    NVLLM_BIND_MOUNT_QWEN35=1 bind-mount option and a CAUTION header
    note documenting why 2L_3_7 stays the safe default until the soak
    failure has a root cause.

  - vllm/nvllm/models/qwen3_5.py: extends the sentinel-file accept-list
    to CUTE_PHASE_E_* and CUTE_BETA_REGION_TIMING= keys.

  - vllm/v1/attention/backends/cute_paged/_backend.py: adds the
    [PHASE_E_DISPATCH] audit log (CUTE_PHASE_E_DISPATCH_LOG=1) that
    extract_dispatch_log.py parses for the per-leg dispatch gate.

Includes the Stage 0a baseline (per-call beta median, region breakdown),
the 5-arm sweep summary (2L through 16L), and the full Stage 2b
survival soak evidence (5 runs, verdict.json, docker.log,
dispatch_audit, summary.md).

See docs/research/2026-05-09-beta-coop-layer-sweep-wo8/soak/summary.md
for the per-question miss table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Post-Stage-2b diagnosis plan documenting:

  - Hypothesis space verdicts: per-run aging and monotonic degradation
    ruled out; process-level self-recoverable corruption confirmed via
    Run 5's sharp recovery at Q14 (261s broken wall -> 64.3s clean wall).

  - Strongest code-level suspect (per friend code-review 2026-05-11):
    persistent beta-coop workspace/reset lifecycle. wo_output,
    mlp_partial_fp32, counters, and barriers live in persistent impl
    attributes because host zeroing hung CUDA graph capture. Resets
    fire at multiple sites (captured memset op, pre-launch counter
    zero_(), inside-kernel MLP-partial reset). A missed reset ->
    persistent garbage -> silent quality collapse across many requests
    until a natural overwrite clears it. Matches the observed
    sustained-load + sharp-recovery shape better than generic
    "approximation drift."

  - 6-leg bisection plan (D2.0 through D2.5): wo1 vs wo8, lite path,
    phaseE=0, layer-count, 2L survival control. Prefix-caching and
    eager-mode legs demoted (hybrid models default-off prefix cache
    per vllm/config/model.py:1791; eager broken on SM120 per CLAUDE.md).

  - D3 per-request instrumentation design: request_id, question index,
    layer set, per-layer path, CUTE_WO_SPLIT, nat, data pointers,
    reset call counts, finish reason, generated tokens, output hash.
    Drops the once-per-pair dedup in CUTE_WO_RESET_LOG under the
    diagnostic flag so per-request reset invocations are observable.

  - Side hardening: fail-closed CUTE_PHASE_E_LAYERS parsing
    (_backend.py:139 currently silently turns malformed CSV into "all
    layers beta-coop"; production-safe behavior is to refuse to start).

  - Decision: 2L_3_7 stays the serve-cute default. The header CAUTION
    block added to scripts/serve-cute.sh in the previous commit
    memorializes the policy.

Refs: docs/research/2026-05-09-beta-coop-layer-sweep-wo8/soak/summary.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Natfii commented May 16, 2026

Closing without merging.

The β-coop sustained-load collapse that this PR's evidence was diagnosing was resolved on 2026-05-15 by the SSM zero-on-realloc patch (PR #13). The cherry-pick bisection arc documented here (Stage 2b + D2.1–D2.6) turned out to be a dead end: the actual substrate was stale mamba recurrent state on slot recycle, not anything in the PR #10 tier-1 cherry-picks. The bisection legs perturbed the shape of the collapse but never fixed its magnitude, which is the canonical "substrate not in the bisect frame" signal (see memory:feedback_substrate_not_cherry_pick).

The diagnosis-arc receipts are preserved in:

  • memory:project_beta_coop_sustained_collapse (full historical record of D2.x bisection + falsified hypotheses)
  • memory:feedback_substrate_not_cherry_pick (the methodology lesson)
  • PR #13 (feat: SSM zero-on-realloc guard + sentinel ablation harness) commit messages and benchmarks/nvllm/traces/ssm_zero_on_realloc/2026-05-15-sentinel-ablation/summary.md (the resolution + harness-validation evidence)

No code in this branch needs to land on main; the resolution path was different.

Natfii closed this May 16, 2026
Natfii deleted the work/beta-layer-sweep-wo8 branch May 16, 2026 00:36