
feat(bless): production-capable opt-in FULL+blessed serving #5

Merged
Natfii merged 42 commits into main from
feat/cute-full-cache-bless
May 2, 2026
Conversation


Natfii commented May 2, 2026

Purpose

Land the bless cache workflow + opt-in serve-cute-full.sh path
that makes FULL+blessed CuTe serving production-capable on lower-8 layers
at concurrency=1, plus the apples-to-apples bench evidence backing it.

Background. The earlier persistent-buffers / workspace-reset direction
(Path B) was closed out (5d92deff3) once Z1 evidence (b2677abd3) showed
the FULL-graph instability was torch.compile/inductor non-determinism, not
a workspace-reset bug. This PR ships the alternative fix: pin compile
artifacts via a blessed AOT cache mounted read-only into the container,
so every cold start replays the same captured graphs.

Default behavior is unchanged. PIECEWISE remains the repo default.
FULL+blessed is opt-in via serve-cute-full.sh only. The bench result
was performance-neutral within noise (-0.27% aggregate kernel time,
+0.39% decode tok/s) — explicitly NOT a clear win, so no default flip.

This PR contains 42 commits sequenced as:

  1. C2 spike infra (efb6fdb29) — --json-out + --evidence-dir args
  2. Common helpers (9ead479aa..dbf40bce3) — compute_blessed_config_hash,
    resolve_blessed_manifest, verify_blessed_cache, HF revision resolver
  3. Persistent-buffers / Path B campaign + closeout
    (fcbdef8da..5d92deff3) — work that is now superseded; closeout
    doc is in this PR's history for traceability
  4. Bless orchestrator + launcher (88e28a002..8bf779825)
  5. serve-cute-full.sh integration (6ee3db2ab)
  6. Bless docs + post-ship fixes (fdcf92d46..f3a4ed0e6)
  7. Evidence:
    • ce26aaaa0 — bless-v1 first blessed cache + production trace + GSM8K +
      negative tests
    • 42793fc04 — apples-to-apples bench vs PIECEWISE (this PR's
      numerical headline)

Test Plan

# 1. Bless cache (one-time per (image_id × model × config) tuple)
bash scripts/bless_cache.sh

# 2. Opt-in serving with the blessed cache
bash scripts/serve-cute-full.sh   # uses lower-8 + n=1 + blessed mount

# 3. Reproduce the apples-to-apples bench (~1.5-2 hr wall clock)
bash docs/research/cute_full_blessed_traces/capture_full_vs_piecewise.sh

# 4. GSM8K sanity check (per CLAUDE.md post-quant gate)
.venv/bin/python scripts/gsm8k_eval_50.py --base-url http://localhost:8000/v1

Test Result

Apples-to-apples bench (commit 42793fc04)

                                           PIECEWISE     FULL+blessed          Δ
TTFT (single 256-tok streaming)            593.5 ms      592.1 ms              -1.4 ms
Decode tok/s                               2.343         2.352                 +0.39%
Aggregate kernel time (77 common kernels)  85,205.0 ms   84,976.5 ms           -228.6 ms (-0.27%)
DecodeKernel mean μs                       17,244.2      17,128.7              -0.7%/call
PhaseE_Beta_Kernel mean μs                 40,967.9      40,799.3              -0.4%/call
Host watchdog peak                         82 GiB        80 GiB (of 119 GiB)

Full evidence:
benchmarks/nvllm/traces/cute_full_blessed/2026-05-01-vs-piecewise/summary.md

bless-v1 evidence (commit ce26aaaa0)

GSM8K parity with PIECEWISE baseline, deterministic across cold starts,
negative tests confirm fail-closed behavior on bad cache.

Caveats

  • Lower-8 layers only (CUTE_PHASE_E_LAYERS=0..7); all-32-layer bless is
    future work.
  • Concurrency=1 / max_num_seqs=1 only; n>1 is future work.
  • *.pt.trace.json.gz raw torch profiler dumps are gitignored
    (37 MB total); reproducible via the harness.
  • One harness-write hook caveat: benchmarks/ and traces/ are
    evidence-only (write-blocked); future bench runs must stage outside
    then cp in.

AI assistance

This work was done with Claude Code assistance (Opus 4.7 1M-context).
The submitting human reviewed every changed line and ran the bench
end-to-end on-device.

Not duplicating

This is a fork-internal feature for the DGX Spark target; no upstream
vLLM PR addresses opt-in FULL+blessed serving for SM120.


Essential Elements Checklist
  • Purpose: opt-in FULL+blessed serving + bench evidence
  • Test plan: bless + serve + bench commands above
  • Test results: per-kernel + streaming numbers in table
  • Documentation: docs/blessed-caches/README.md, evidence
    summary, harness scripts under docs/research/

Natfii and others added 30 commits April 30, 2026 15:25
…ttributes

Pre-existing per-call torch.zeros allocations inside run_beta_coop_full
were unsafe under FULL_AND_PIECEWISE CUDA graph capture: the graph baked
the original allocation addresses, but each replay's torch.zeros call
returned a DIFFERENT graph-pool address. The kernel kept reading and
writing the originally-captured (now-stale) addresses, producing
"first 8 chars stable, decode tokens 2+ diverge" — verbatim symptom of
vLLM upstream vllm-project#35175 in CuTe form.

Hoisted five buffers to persistent attributes on
CutePagedAttentionImpl, allocated in attach_mlp_fusion inside the
existing CUTE_PHASE_E_FUSION try-block:

  self._phase_e_coop_wo_output            [max_num_seqs, 4, hidden]    f32
  self._phase_e_coop_mlp_partial_fp32     [max_num_seqs, slc, hidden]  f32
  self._phase_e_coop_mlp_arrival_count    [max_num_seqs, num_k_tiles]  u32
  self._phase_e_coop_grid_barrier_i32     [max_num_seqs]               i32
  self._phase_e_coop_phase1_arrival_count [max_num_seqs]               i32

run_beta_coop_full now requires them as keyword-only kwargs (after a
`*,` separator, since Python forbids non-default parameters after
defaulted ones). The existing counter .zero_() calls at phase_e_kernel.py:
3036-3038 stay as captured memsets — now hitting stable addresses.
mlp_partial_fp32 continues to rely on the v6 in-kernel CTA-local reset
at phase_e_kernel.py:~4191. wo_output gets no reset in this v1 patch;
v2 (out of scope) adds a separate captured reset op iff C2 fails.

External callers (precompile script, kernel test) updated to allocate
their own dummy buffers and pass them in. Epsilon-epilogue test also
cleaned up: lite kernel now invoked with emit_next_layernorm=False to
match the deleted Assert 3 (next_hidden); next_gamma allocation removed
because the kernel signature gates next_input_layernorm_gamma on
emit_next_layernorm. lite_next_hidden buffer kept because
Phase_D_MLP_Kernel asserts it non-None whenever emit_epilogue=True
(mlp_kernel.py:497-498).
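The hazard and the fix above can be sketched in plain Python. This is an illustrative stand-in, not the real CuTe code: bytearrays play the role of CUDA tensors, and `FakeAttentionImpl` / the buffer names are modeled on the commit message. The point it demonstrates is that a buffer allocated once in `attach_mlp_fusion` keeps a stable address across every call, while the keyword-only `*` separator forces callers to pass the buffers explicitly.

```python
class FakeAttentionImpl:
    """Stand-in for CutePagedAttentionImpl (sketch, not the real class)."""

    def attach_mlp_fusion(self, max_num_seqs: int, hidden: int) -> None:
        # Allocated exactly once, so the address is stable for the process
        # lifetime; a captured graph that baked this pointer stays valid.
        self._phase_e_coop_wo_output = bytearray(max_num_seqs * 4 * hidden * 4)
        self._phase_e_coop_grid_barrier_i32 = bytearray(max_num_seqs * 4)


def run_beta_coop_full(x, *, wo_output, grid_barrier):
    # Keyword-only (after `*`): a caller cannot forget a buffer or pass
    # them positionally in the wrong order.
    wo_output[: len(x)] = bytes(len(x))  # stand-in for the captured memset
    return id(wo_output)                 # "address" observed by each call


impl = FakeAttentionImpl()
impl.attach_mlp_fusion(max_num_seqs=1, hidden=8)

# Every "replay" sees the same buffer object, hence the same address:
addrs = {
    run_beta_coop_full(b"tok",
                       wo_output=impl._phase_e_coop_wo_output,
                       grid_barrier=impl._phase_e_coop_grid_barrier_i32)
    for _ in range(5)
}
assert len(addrs) == 1
```

A per-call `bytearray(...)` inside `run_beta_coop_full` would be the analog of the unsafe per-call `torch.zeros`: nothing pins its address across calls.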

Three new structural tests guard the contract going forward:
- test_run_beta_coop_full_has_persistent_buffer_kwargs (sig presence)
- test_run_beta_coop_full_no_internal_workspace_zeros (no per-call alloc)
- test_attach_mlp_fusion_allocates_persistent_beta_coop_buffers

Spec:  docs/superpowers/specs/2026-04-30-beta-coop-persistent-buffers-design.md
Plan:  docs/superpowers/plans/2026-04-30-beta-coop-persistent-buffers-plan.md
Bug:   vllm-project#35175 (analog)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Records postpatch validation for β-coop persistent workspace buffers v1:
C0 PIECEWISE GSM8K PASS, PIECEWISE replay coherence PASS, and
FULL lower-8 replay coherence FAIL.

The result lands v1 evidence and closeout only. Remaining suspected
blocker is stale wo_output content at a stable address; v2 should add
an explicit captured wo_output reset.

Code:  1cc51ab
Spec:  docs/superpowers/specs/2026-04-30-beta-coop-persistent-buffers-design.md
Plan:  docs/superpowers/plans/2026-04-30-beta-coop-persistent-buffers-plan.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
C1 + multi-token C2 investigation that culminated in the 2026-04-30
pre-patch closeout (evidence/2026-04-30-1430-closeout/CLOSEOUT.md):
FULL+β-coop is BLOCKED, cumulative-drift across layers, vLLM vllm-project#35175
analog. Production = PIECEWISE+β-coop (already shipping in v0.3.0).

Adds:
- vllm/v1/worker/gpu_model_runner.py: CUTE_FULL_GRAPH_PROBE v2
  (module-level booleans, two-trigger). Self-mutation variant from
  the morning hung capture 20+ min; module-level state is graph-safe
  (feedback_no_self_mut_in_cudagraph_dispatch).
- docs/research/2026-04-29-full-graph-spike/c2_full_layer_bisect.sh:
  layer-CSV-driven harness for C2 across full-attn layer subsets.
- docs/research/2026-04-29-full-graph-spike/c2_piecewise_betacoop.sh:
  PIECEWISE control wrapper.
- 16 evidence dirs (2026-04-30-0752 → 2026-04-30-1430-closeout)
  capturing single-layer / lower-8 / upper-8 / all-16 bisects,
  capture-flake hangs, and the 2026-04-30-1430-closeout writeup.

README + _sync_host_edits.sh + c1_replay_proof.sh updated to
match the layer-bisect harness flow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a `direct_register_custom_op` that zeroes
`_phase_e_coop_wo_output[:nat]` via cudaMemsetAsync (libcudart) on the
current torch CUDA stream, captured as a memset graph node ordered
before each `run_beta_coop_full` launch. Targets the v1 closeout's
"stale content at stable address" failure under FULL_AND_PIECEWISE.

- New file: _wo_output_reset_op.py (lazy libcudart bind, precondition
  asserts, env-gated CUTE_WO_RESET_LOG=1 capture-side probe).
- Side-effect import at qwen3_5.py:42 (mirrors existing _beta_coop_op
  registration site).
- Callsite at _backend.py:1540 (eager body of cute_beta_coop_run; not
  a splitting boundary; FX topology unchanged from v1).
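The byte-count arithmetic behind such a reset can be sketched as follows (a hypothetical helper, not the real `_wo_output_reset_op.py` API): zeroing `wo_output[:nat]` of a contiguous row-major `[max_num_seqs, 4, hidden]` float32 tensor touches exactly the first `nat * 4 * hidden * 4` bytes and leaves the tail rows intact.

```python
def leading_slice_nbytes(shape, itemsize, nat):
    """Bytes covered by tensor[:nat] for a contiguous row-major tensor.

    nat == 0 is a no-op (0 bytes to memset), matching the op's contract.
    """
    if not 0 <= nat <= shape[0]:
        raise ValueError("nat out of range for leading dimension")
    row_elems = 1
    for dim in shape[1:]:
        row_elems *= dim
    return nat * row_elems * itemsize


# [max_num_seqs=8, 4, hidden=2048] float32 (itemsize 4), reset first 3 rows:
assert leading_slice_nbytes((8, 4, 2048), 4, 3) == 3 * 4 * 2048 * 4
assert leading_slice_nbytes((8, 4, 2048), 4, 0) == 0
```

The real op would hand this byte count, plus the tensor's base pointer, to cudaMemsetAsync on the current stream.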

Spec: docs/superpowers/specs/2026-04-30-beta-coop-persistent-buffers-v2-design.md
v1 closeout: docs/research/2026-04-29-full-graph-spike/evidence/2026-04-30-1552-postpatch-v1/CLOSEOUT.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends _sync_host_edits.sh with docker-cp + sentinel checks for the
two new sync targets in the v2 patch:
- _wo_output_reset_op.py (new file)
- nvllm/models/qwen3_5.py (modified import line)

Also adds nvllm/models/__pycache__ to the stale-pyc cleanup so
qwen3_5.pyc doesn't shadow the new import line.

Forwards CUTE_WO_RESET_LOG through c2_full_layer_bisect.sh so the Gate 1
capture-side reset probe is actually enabled inside the container.

No rebuild needed (feedback_rebuild_guard); next gate run picks up
both files via docker cp.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two follow-ups from the Phase 1 code review of fcbdef8:

- M2: explicit `int(...)` cast on `torch.cuda.current_stream().cuda_stream`
  before binding into the cudaMemsetAsync ctypes call. Aligns with the
  sibling pattern used by 8 callsites in phase_e_kernel.py and kernel.py,
  and hardens the error-path f-string `stream={stream_handle:#x}` against
  a future torch version returning a non-int handle (which would mask the
  real CUDA error with a TypeError mid-format).

- M1: reworded module docstring to drop the "doubly preventing DCE" claim
  (which conflated DCE-safety with stream-capture-safety) and replaced
  it with the accurate explanation: the call runs inside the eager body
  of the cute_beta_coop_run splitting boundary, so Dynamo/FX never sees
  it and DCE simply doesn't apply. mutates_args is now framed as schema
  hygiene for any future trace site, not as a DCE guard.

Behavior unchanged at this commit; both edits are local to the new
_wo_output_reset_op.py module.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
[:nat] slice only, leaves tail intact, nat==0 is no-op.

Spec §6.2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec §6.3.1. Confirms the new captured-reset op runs eager between
PIECEWISE pieces without breaking the production decode path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec §6.3.2. Confirms the new captured-reset op preserves PIECEWISE
deterministic replay coherence (the v1 production path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec §6.4 + §6.5 probe (1). c2_replay_coherence + capture-side
runtime log (CUTE_WO_RESET_LOG=1) confirming the reset op fires
on every attached layer.

Result: FAIL — see c2_replay_coherence.md for unique count.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec §6.5 probe (3) + AGENTS.md §4. Captured a full nsys trace
under the v2 lower-8 (3,7,11,15,19,23,27,31) FULL_AND_PIECEWISE
config. Trace contains 166 MEMSET + 2946 KERNEL activity rows.

Honest limitation noted in summary.md: nsys followed only the
API server PID; the EngineCore subprocess (where β-coop and the
captured wo_output cudaMemsetAsync actually run) was NOT crossed,
so the trace shows 0 graph-captured (graphNodeId IS NOT NULL)
events. The β-coop kernels are absent from this PID's stream;
visible kernels are sampling/elementwise. Promoting this to a
captured-graph-node-ordering proof needs a follow-up arc using
either nsys child-tree-follow or vLLM's torch profiler API hooks
(per project memory feedback_vllm_profiling).

Trace + GPU activity ordering committed for forensic inspection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captured cudaMemsetAsync reset for _phase_e_coop_wo_output is wired
correctly and fires once per attached layer at FULL-graph capture
(8 unique stable data_ptrs in the [CUTE_WO_RESET] runtime log), but
the FULL+β-coop lower-8 replay-divergence bug class is unfixed:
Gate 1 c2_replay_coherence reports unique=4, cross-prompt dependent.

The "stale content at stable address" hypothesis from the v1 closeout
is insufficient. v2 patch ships as a no-op-for-FULL but
PIECEWISE-clean infrastructure layer:
- C0 PIECEWISE+β-coop GSM8K-sanity 8/8 PASS
- C2 PIECEWISE+β-coop replay coherence unique=1 cross-indep PASS
- Functional CUDA smoke (op zeros [:nat], leaves tail intact) PASS

PIECEWISE production path is intact. v3 candidates:
A) host-captured reset for mlp_partial_fp32 (replace in-kernel
   CTA-local reset; the v1 closeout's named escalation candidate),
B) re-evaluate the workspace-residue diagnosis altogether, or
C) defer to upstream vllm-project#40969 (open, same hardware/cudagraph_mode).

Per AFK instruction: do NOT auto-escalate to v3. Closeout stops here
for human review.

Spec: docs/superpowers/specs/2026-04-30-beta-coop-persistent-buffers-v2-design.md
Closeout: docs/research/2026-04-29-full-graph-spike/evidence/2026-04-30-1822-postpatch-v2/CLOSEOUT.md
Memory: ~/.claude/projects/-home-natfii-docker-nvllm/memory/project_full_graph_blocked.md (updated, gitignored)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…m-project#40969

Per user code review of the prior closeout commit (8c7111a):

1. Soften "captured graph node ordered before β-coop" claim. The
   nsys trace did NOT independently prove per-replay graph-node
   ordering (EngineCore subprocess not captured). Reframed as
   "issued during FULL-graph capture on the current stream;
   expected to be captured as a graph memset node, but per-replay
   ordering not independently proven." Applied at the verdict
   paragraph and the "Why v2 failed" section.

2. Replace "no-op-for-FULL" with "not sufficient to enable FULL;
   PIECEWISE-clean infrastructure/refactor remains shippable."
   The patch DOES affect FULL behavior (it just doesn't make FULL
   correct); the prior phrasing was misleading. Applied at the
   verdict paragraph and the Status block.

3. Resolve evidence-path ambiguity. Sibling evidence dirs use
   relative `../2026-04-30-XXXX/` paths (unambiguous from the
   closeout's own location); the trace dir outside `evidence/`
   uses `<repo-root>/benchmarks/...` notation.

4. "reuses v2's `_wo_output_reset_op.py` shape" → "...op pattern"
   in the v3 escalation candidate. mlp_partial_fp32's buffer
   shape differs from wo_output's; only the op pattern carries
   over, not the byte-count math.

5. Date-stamp upstream vllm-project#40969 references with the recheck
   timestamp (OPEN as of 2026-04-30, last upstream activity
   2026-04-28T11:06:08Z) so future readers can see how stale the
   claim is. Applied to all three references in the doc.

No factual claims changed; only framing and traceability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds CUTE_DISPATCH_AUDIT=1 probe at gpu_model_runner.py post-DP
re-dispatch site. Logs the FINAL returned mode + descriptor for the
first 100 dispatch calls, with raw vs descriptor fields side-by-side
so prefill (raw_tokens > 1) can be distinguished from steady-decode
(raw_tokens == raw_reqs == 1) rows.

Module-level int counter only (no setattr on self, per
feedback_no_self_mut_in_cudagraph_dispatch).

Path B Step 1 of the v2 β-coop diagnosis re-evaluation: confirm
whether the lower-8 FULL run is truly FULL replay during steady
decode, or silently hybrid/PIECEWISE per step. The existing
first-any / first-FULL booleans only prove FULL was reached at
least once; this gives steady-state visibility.

Sync infra:
- c2_full_layer_bisect.sh:73 forwards CUTE_DISPATCH_AUDIT to the
  EngineCore subprocess (per feedback_vllm_enginecore_env_strip).
- _sync_host_edits.sh adds a sentinel grep so a stale image copy
  doesn't silently invalidate the audit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-call CUTE_DISPATCH_AUDIT probe captured 100 dispatches across the
c2_replay_coherence pattern under FULL_AND_PIECEWISE on lower-8 β-coop
layers. Findings:

- 89 / 100 rows: FULL mode with uniform_decode=True, desc_uniform=True
  — every steady-decode dispatch was FULL.
- 2 / 100 rows: PIECEWISE at idx=2 / idx=6 during FULL graph CAPTURE
  (uniform_decode=False, capture phase, before first-FULL probe fired)
  — capture-time variant for non-uniform shapes, not steady decode.
- 9 / 100 rows: NONE — 7 force_eager warmup + 3 prefill of 12-token
  prompt (eager prefill is normal vLLM behavior).

Verdict: hybrid-dispatch hypothesis is RULED OUT for the steady decode
path. Proceed to Step 2 (inspect _beta_coop_op.py for capture-time
freeze risks).

Sidebar: this run produced unique=1 PASS (vs Gate 1 unique=4 FAIL,
same code path mod the new probe). v2 reset still fires (8 unique
[CUTE_WO_RESET] data_ptrs). Either the audit probe perturbed something
or the bug is genuinely stochastic — needs an audit-OFF re-run to
disambiguate. Single c2 trial cannot characterize this regime; this
reinforces the v2 closeout's "controlled multi-seed test we did not
run" framing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each trial uses CUTE_WO_RESET_LOG=1 (matches failing Gate 1) +
CUTE_DISPATCH_AUDIT=0 (audit OFF) + CUTE_FULL_GRAPH_PROBE=1
(bisect-script default). Tear down between trials. No code changes
under test from Step 1 audit run (HEAD d36abf7).

Per-trial evidence: trial.md with git SHA, env contract, c2 unique
count, same/cross/overall pass flags, first-any/first-FULL probe
presence, wo_reset unique data_ptrs.

Result: MIXED (2/3 PASS, 1/3 FAIL). Trial 1 unique=1 PASS, trial 2
unique=3 FAIL, trial 3 unique=1 PASS. Summary applies the Step X
verdict logic: mixed branch -> run two more trials and treat as
statistical, NOT a fix declaration. The original Gate 1 unique=4
FAIL appears to belong to the same statistical distribution as
v2 audit-OFF, not a deterministic bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
X.4+X.5 use identical protocol to X.1-X.3 (see commit 68c6ab9).
CUTE_WO_RESET_LOG=1, CUTE_DISPATCH_AUDIT=0, CUTE_FULL_GRAPH_PROBE=1.
Tear down between trials. No code changes under test from HEAD.

Per-trial trial.md format identical to X.1-X.3.

5-trial summary at evidence/<ts>-pathb-x-summary-5trial/summary.md
applies the user's statistical verdict logic:
- 0/5 or 1/5 FAIL: baseline too unstable to justify a patch.
- 2/5 or 3/5 FAIL: stochastic bug; v3 needs statistical acceptance.
- 4/5 or 5/5 FAIL: focused Z patch target.
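The three verdict branches above can be written as a small decision function (a sketch; the actual summary.md applies this logic by hand):

```python
def verdict(fails, trials=5):
    """Map a 5-trial FAIL count to the user's statistical verdict."""
    if trials != 5 or not 0 <= fails <= trials:
        raise ValueError("protocol is fixed at 5 trials")
    if fails <= 1:
        return "baseline too unstable to justify a patch"
    if fails <= 3:
        return "stochastic bug; v3 needs statistical acceptance"
    return "focused Z patch target"
```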

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds env-driven mount of /root/.cache/vllm to a host directory.
When PATHB_Z1_VLLM_CACHE_HOST_DIR is unset, the script behaves
identically to before (no mount, per-container scratch cache).

For Path B Z1 controlled causality test: lock torch.compile AOT
cache across fresh containers to test whether artifact-size /
compile-path is causal for the X-trial PASS/FAIL discriminator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per user code review of the Step 1 dispatch-audit run: bullet-form
of the user-specified evidence fields (dispatch_hypothesis,
steady_decode_rows, non_FULL_rows, coherence_result, interpretation)
plus a probe-design lesson section noting the c2 PASS-vs-FAIL flip
on the audit run reinforces "minimize hot-path state mutation."

Added inline before the audit-OFF reproducibility experiment (X)
was dispatched — this is the rationale the X experiment then
tested.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ausal

Locked torch.compile AOT cache to a host-mounted dir to test whether
the X-trial PASS/FAIL discriminator (62 MB log-reported / 70 MB disk
vs 73 MB log-reported / 81 MB disk artifact size, perfect 6/6
correlation) is causal vs merely correlative.

Two artifacts produced by inductor at the same cache key
(9a5549f23a178e35a9a3e9b4bed7adf1d137d22f3fc06ef8048d589e5d625721)
but distinct sha256s — confirming inductor non-determinism for the
same input graph:
- GOOD: sha256 651e00bd5997bacd9a062da66e6c9a078ed3c4469c27c715d8b025041a2a8264
- BAD:  sha256 af68c498c6ee45b60165d584a870f2f072068153a7c76d9592fc0097efe63c80

Trial results (each cache_reused on all 5, sha256 unchanged before
vs after each trial):
- Locked GOOD cache: 5/5 PASS
- Locked BAD cache: 1/5 PASS, 4/5 FAIL

Compared to baseline X-trials (cold cache, mixed artifact): 3/5 PASS.

The 5x ratio between cache labels (100% PASS vs 20% PASS) makes the
artifact identity the dominant load-bearing variable. The user's
verdict-framework branch "5/5 PASS + cache_reused on all 5 →
causality basically closed" is satisfied.

The bad.5 PASS (with sha256 verified unchanged) suggests a smaller
secondary source of non-determinism — not enough to undermine the
artifact-identity finding, but worth noting as residual uncertainty.

Implication for Z: production fix is "persist a known-good
torch.compile AOT cache across container starts," NOT a β-coop
workspace patch. Path B is closed.

Files:
- evidence-preserve: X.1-X.5 compile metrics (path/size/key/verdict);
  sha256 of X-trial artifacts NOT preserved (containers were torn
  down before this experiment).
- 5 good-trial dirs + 5 bad-trial dirs (each has trial.md, JSON, MD,
  full docker logs, probe slices, wo_reset log).
- summary at evidence/<ts>-pathb-z1-summary/summary.md.

Bisect-script env hook (PATHB_Z1_VLLM_CACHE_HOST_DIR) was committed
separately as f002ee4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Updates the v2 closeout and Z1 summary with Path B + Z1 findings.
v2's earlier "stochastic FAIL inside FULL+β-coop replay/kernel
behavior" framing is largely superseded: Z1 showed cache artifact/
directory identity is the dominant load-bearing variable explaining
9/10 outcomes. v3 mlp_partial_fp32 host-reset is **explicitly
orphaned** (Y showed in-kernel reset is mechanically correct; Z1
showed FAIL is upstream of any workspace reset).

Per code-review feedback (2026-05-01):

1. Softened causality wording in both docs from "CAUSES the
   verdict in 9 of 10 trials" / "fully closed root cause" to
   "dominant load-bearing variable explains 9/10 outcomes" /
   "dominant evidence for upstream torch.compile/inductor
   non-determinism, not a fully closed RCA." The bad.5 PASS keeps
   the door open for a smaller secondary non-determinism source.

2. Added a cache-directory manifest section to the Z1 summary
   showing only the AOT model file differed between snapshots
   (modelinfos/, computation_graph.py, cache_key_factors.json all
   byte-identical). Reframes the discriminator as "cache
   artifact/directory identity" since the mount pinned all of
   /root/.cache/vllm.

3. Production fix recommendations now require fail-closed cache
   handling: verify expected sha256 before launch, refuse to start
   on missing/empty/mismatched, mount read-only after bootstrap.
   Bare RW mount is NOT sufficient.

4. Probe-off validation gate added as REQUIRED before declaring a
   cache production-ready. All Z1 trials ran with
   CUTE_FULL_GRAPH_PROBE=1 and CUTE_WO_RESET_LOG=1; production
   blessing must validate locked-good with all CUTE_* probes off.

Closeout updates:
- New "Path B Update — 2026-05-01" section at end summarizing
  Step 1 (dispatch audit), Y/Y2 (code inspection), X (audit-OFF
  reproducibility), A (PASS/FAIL log diff), Z1 (cache-pin causality).
- "Recommendation" section paths A and B marked superseded; A
  (mlp_partial reset) explicitly orphaned.
- "Followup investigation candidates" item 1 (mlp_partial reset)
  marked orphaned; new item 4 added for upstream torch.compile/
  inductor non-determinism tracking.

Memory project_full_graph_blocked.md updated separately (gitignored).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bless orchestrator (Phase 2 cache workaround) needs a stable JSON
output path per validation trial. Default evidence-dir behavior is
unchanged when neither arg is passed.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 0)
Phase 2 cache workaround Task 1.1. Deterministic sha256 over a canonical
JSON of 18 cache-affecting inputs. Argument order is part of the contract;
changing it invalidates all manifests.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 1.1)
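The hashing idea can be sketched in Python (the real helper is a shell function built on jq; the function name and the concrete inputs here are assumptions). The inputs are hashed as an ordered list, so reordering the arguments changes the hash and invalidates existing manifests, exactly the contract stated above.

```python
import hashlib
import json


def compute_blessed_config_hash(inputs):
    """sha256 over a canonical (compact, ordered) JSON array of inputs."""
    canonical = json.dumps(list(inputs), separators=(",", ":"),
                           ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Order is part of the contract: swapping two inputs yields a new hash.
a = compute_blessed_config_hash(["image:abc", "model:qwen3", "layers:0..7"])
b = compute_blessed_config_hash(["model:qwen3", "image:abc", "layers:0..7"])
assert a != b and len(a) == 64
```

An "anchor" test in this style pins one fixture's hash to a hard-coded hex string, so any refactor that changes canonicalization fails loudly.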
Address code quality review of Task 1.1:
- Add test_compute_hash_anchor_value: pins canonical JSON output for a
  fixed 18-input fixture to a hard-coded sha256, so a refactor of the
  jq filter body or a misordered args call flips the hash and fails
  loudly instead of silently producing a wrong-but-still-64-hex result.
- Add test_compute_hash_rejects_wrong_arg_count: exercises the
  assert_exit_code harness primitive (previously dead code) and the
  helper's wrong-arg-count guard.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 1.1 follow-up)
Phase 2 cache workaround Task 1.2. Manifest lookup is by config_hash
field (not filename); duplicate hashes are exit 2 (hard corruption).
Verify checks size + sha256 of every files[] entry.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 1.2)
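The fail-closed verification described above can be sketched like this (a Python model of the shell helper; the manifest entry layout is an assumption based on the commit message). Every `files[]` entry must exist with the recorded size and sha256; any missing file or drift refuses rather than serving a suspect cache.

```python
import hashlib
import os


def verify_blessed_cache(cache_dir, files):
    """Return True only if every manifest entry matches on size + sha256."""
    for entry in files:
        path = os.path.join(cache_dir, entry["path"])
        if not os.path.isfile(path):
            return False                          # missing file: refuse
        if os.path.getsize(path) != entry["size"]:
            return False                          # size drift: refuse early
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        if h.hexdigest() != entry["sha256"]:
            return False                          # content drift: refuse
    return True
```

The size check is a cheap pre-filter; the sha256 is the actual integrity gate.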
Address code quality review of Task 1.2:
- I-1: nvllm_resolve_blessed_manifest now stderr-warns on a manifest that
  parses but lacks .config_hash (was silent skip — masked a real
  packaging bug class).
- I-2: add coverage for resolve helper's default-dir fallback path.
- M-3: add tests for 0-byte file rejection and empty files[] in
  nvllm_verify_blessed_cache (both paths existed but were uncovered).
- M-5: use -ne (numeric) instead of != (string) for size comparison in
  verify helper, matching the surrounding -eq usage.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 1.2 follow-up)
Phase 2 cache workaround Task 1.3. Refusal helpers print structured
remediation messages and return 1 (callers chain || exit 1 for hard
exit). HF revision resolver delegates to huggingface_hub.model_info and
validates the 40-char sha format. Container-exists guard prevents
clobbering operator-owned containers (CLAUDE.md docker rule).

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 1.3)
Address code quality review of Task 1.3:
- Important: model_id was interpolated directly into Python source,
  breaking parse if the id contained quotes/backslashes/newlines.
  Pass via MODEL_ID env var with single-quoted Python source.
- Minor: docstring listed a 'huggingface-cli fallback' that was never
  implemented. Strike to match the actual single-path behavior.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 1.3 follow-up)
Phase 2 cache workaround Task 2.1. CLI surface, BlessConfig, TrialResult.
Phase 1/2/3 wiring lands in subsequent tasks.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 2.1)
Phase 2 cache workaround Task 2.2. Phase 1 builds the RW container,
polls /v1/models, sends one fixed completion to force prefill+decode +
AOT artifact write, gracefully stops, resolves the 4 expected files
via globs, returns aot_sha/size for Phase 2 reuse signal.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 2.2)
Natfii and others added 12 commits May 1, 2026 06:47
Address code quality review of Task 2.2:
- I-1: phase1_bootstrap and _docker_stop force-removed containers,
  conflicting with the detect-and-refuse pattern established by
  nvllm_refuse_if_container_exists (Task 1.3). Replace with
  docker stop -t N + docker rm (no -f). A leftover orchestrator
  container stops cleanly; an unexpected operator-owned container
  surfaces as a loud failure rather than silent destruction.
- I-2: _poll_models 'attempt_max=3' kwarg was multiplied by 50
  internally → up to 150 retries, drifting from spec §7.3's 'max 3
  retries' contract. Drop the multiplier and rename to
  max_transient_retries for clarity. Also moves sleep(2) out of the
  except clause so non-200 polling waits between attempts.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 2.2 follow-up)
Phase 2 cache workaround Task 2.3. classify_cache_reuse implements the
3 Z1-derived signals (AOT load marker, no 'saved AOT' lines, post-trial
sha unchanged). parse_c2_json gates on same+cross+unique=1.

Deviation from plan: phase2_validate's per-trial defensive cleanup uses
non-force `docker stop -t 10 + docker rm` instead of the plan's
`docker rm -f`. Same reasoning as ea0046d applied to Task 2.2: the
launcher's nvllm_refuse_if_container_exists already refuses
operator-owned containers; the orchestrator should not silently destroy
unexpected state, only its own leftovers, which stop+rm handles cleanly.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 2.3)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 cache workaround Task 2.4. accept() builds the manifest dict,
archives prior on --rebless (manifest + blessed dir, with timestamp +
old artifact sha8 in name), atomic mv staging->blessed, writes JSON.
reject() preserves staging as evidence with failure summary, no manifest.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 2.4)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 cache workaround Task 2.5. main() resolves paths from env
(NVLLM_BLESSED_CACHE_ROOT, HF_MODEL), runs phase 1, runs phase 2,
decides accept/reject. Refuses early if blessed dir exists without
--rebless. Exit codes: 0 PASS, 1 refuse-no-rebless, 2 phase-1-fail,
3 phase-2-fail.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 2.5)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…trator

Phase 2 cache workaround Task 3. Bash preflight: image, GPU memory, no
running container, jq+flock available, HF revision resolved. Derives
config_hash. Acquires flock per config_hash. Refuses early if manifest
exists and no --rebless. Then exec .venv/bin/python orchestrator.

shellcheck clean.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 3)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…erify

Phase 2 cache workaround Task 4. Defaults flipped:
  - CUTE_FULL_GRAPH_PROBE=0 (was hardcoded 1)
  - CUTE_WO_RESET_LOG=0
  - CUTE_DISPATCH_AUDIT=0
  - CUTE_PHASE_E_LAYERS=0,1,2,3,4,5,6,7

Added pre-docker-run verify-and-mount block that derives config_hash,
resolves manifest, verifies cache, refuses on no-match / drift / unsafe-dev.
--debug (eager) bypasses verification entirely.

Cleanup: removed 4 dead variables (KV_CACHE, ATTN_BACKEND, MAX_MODEL_LEN,
MAX_NUM_SEQS) shadowed by the new *_VAL set; shellcheck-clean.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 4)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ts table

Phase 2 cache workaround Task 5. README documents what the directory is,
links Z1 evidence, lists active manifests (table auto-regenerated by
orchestrator on each accept()), summarizes the bless protocol, and
explains the three refusal modes.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 5)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Empirical: first bless attempt failed in 8s with ECONNREFUSED. vLLM
doesn't bind port 8000 until model load (~5 min for 27B) finishes;
during that window every poll returns ECONNREFUSED. Counting those
exhausts max_transient_retries=3 in seconds and aborts the bless before
the model can possibly be ready.

Fix: ConnectionRefusedError is the EXPECTED boot signal — sleep through
it without counting. timeout_s (default 600s) is the correct wall-clock
bound for boot. Other URLErrors (DNS, TLS, broken pipe) still count as
true transients with the 3-retry cap.
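
The corrected poll loop can be sketched like this (poll is injected so
the behavior is testable; names are illustrative, not the orchestrator's
actual API):

```python
import time

def wait_for_server(poll, timeout_s=600, max_transient_retries=3,
                    sleep_s=1.0, clock=time.monotonic, sleep=time.sleep):
    """Sketch of the fixed health poll: ConnectionRefusedError is the
    expected boot signal, bounded only by wall clock; other transient
    OS-level errors still count against max_transient_retries."""
    deadline = clock() + timeout_s
    transients = 0
    while clock() < deadline:
        try:
            return poll()  # e.g. a 200 from /health
        # Must come before OSError: ConnectionRefusedError subclasses it.
        except ConnectionRefusedError:
            pass  # expected while the model loads; do NOT count
        except OSError:
            transients += 1  # DNS, TLS, broken pipe, ...
            if transients > max_transient_retries:
                raise
        sleep(sleep_s)
    raise TimeoutError("server did not come up within timeout_s")
```

The except-clause ordering is the whole fix: refusals fall through to a
plain sleep, so only timeout_s bounds the boot window.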

Two regression tests:
- ECONNREFUSED looped 20× still succeeds when followed by 200
- DNS-style URLError still trips after 3 retries

This corrects the post-review fix in ea0046d, which assumed "real boot
does NOT flap repeatedly". It does: with ECONNREFUSED, the flap spans
the entire boot window.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 6 hardening)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Empirical: bless attempt #2 reached the AOT-flush step but Phase 1 failed
sha256 with [Errno 13] Permission denied on the AOT model file. vLLM
runs as root in the container; torch writes the AOT artifact mode 0600
into the host-mounted staging dir, leaving it root-owned and unreadable
by the orchestrator (host user natfii).

Add _chmod_staging_for_host: docker exec chmod -R a+rX before container
stop. Capital X adds execute only to directories (and already-executable
files), so dirs stay traversable while plain files become world-readable
without gaining execute; write bits are untouched. Phase 2 :ro mounts and
the host-side verify in serve-cute-full.sh both read as unprivileged users.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 6 hardening)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes uncovered by bless attempts #2-#4:

1. Post-shutdown perms normalization (replaces the racy pre-stop chmod
   from cdd59d0). torch/inductor can finalize the AOT artifact during
   graceful shutdown, so chmod-then-stop leaves files written between
   chmod and exit still root-owned mode 0600. New helper:
   _normalize_staging_permissions_for_host runs a short-lived RW helper
   container against the staging mount AFTER docker stop, executing
   chown -R <host uid>:<host gid> + chmod -R u+rwX,go+rX. Phase 1 calls
   it twice: before deleting stale staging (so root-owned leftovers can
   be rmtree'd) and after stopping the bootstrap container.

2. Pre-create dummy_cache/ in staging. vLLM caching.py:466-467 always
   calls os.makedirs(<vllm_root>/dummy_cache, exist_ok=True) on the
   AOT-load path, even when disable_cache=True. With our :ro Phase 2
   mount this raised EROFS, the AOT load aborted, and torch.compile
   silently recompiled — every Phase 2 trial reported cache miss.
   Pre-creating dummy_cache/ in staging makes the makedirs a no-op
   under :ro (verified: makedirs(exist_ok=True) on existing dir under
   read-only parent returns success).
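
The regression claim reduces to a small, checkable property (sketch
with a chmod-read-only temp dir standing in for the :ro mount; the
exist_ok success path is what matters and holds either way):

```python
import os

def demo_dummy_cache_precreate(vllm_root: str) -> bool:
    """Pre-create dummy_cache/, flip the parent read-only, and show
    that os.makedirs(..., exist_ok=True) on the existing dir still
    succeeds -- which is why pre-creating it defuses the EROFS abort."""
    dummy = os.path.join(vllm_root, "dummy_cache")
    os.makedirs(dummy)                     # the pre-create step
    os.chmod(vllm_root, 0o555)             # parent now read-only
    try:
        # This mirrors vLLM's AOT-load makedirs call: the kernel reports
        # EEXIST before any write to the parent is attempted, so
        # exist_ok=True turns it into a no-op even under :ro.
        os.makedirs(dummy, exist_ok=True)
        return True
    finally:
        os.chmod(vllm_root, 0o755)         # restore for cleanup
```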

Empirical: bless attempt #4 ran the full flow (5 trials, all c2 PASS),
but every trial recompiled — see trial_1_container.log line 133:
"Compiling model again due to a load failure ... reason: [Errno 30]
Read-only file system: '/root/.cache/vllm/dummy_cache'".

Two new tests:
- _normalize_staging_permissions_uses_helper_container shape check
- makedirs_dummy_cache_no_ops_under_readonly_parent (regression)

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 6 hardening)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…negative tests

Phase 2 cache workaround Task 6 (integration). First production manifest
committed: config_hash e6d32b41c46842c97f877339e86c79d6cc11004a238bef32f2cd3fdb73ce28db,
AOT artifact sha d97e88db. Closes the bless-then-mount-RO workaround for the
Z1 inductor non-determinism problem.

Validation evidence:
  - K=5 / 5 trials PASS, c2_replay_coherence n=8 each
    (same-prompt unique=1, cross-prompt independent)
  - cache_reused=true on every trial (aot_load log marker present,
    zero "saved AOT compiled function" lines, post-trial AOT sha unchanged)
  - production serve confirms "Directly load AOT compilation from path"
    (decorators.py:305) on the canonical RO mount
  - GSM8K-50 (seed=42) on the production serve: 47 / 50 = 94.0%, well
    above the kernel-change "no regression vs prior phase" gate (β-coop
    baseline ~30-31/50)

Negative tests (both refused, no container started):
  - probe-on (CUTE_FULL_GRAPH_PROBE=1): config_hash differs -> "No matching
    manifest" -> refusal
  - drift (corrupted AOT byte in an isolated copy of the cache, NOT the
    canonical artifact): "DRIFT DETECTED" with size mismatch diagnostic ->
    refusal. Canonical cache verified intact after the test, so the
    workaround does not leave production broken when its safety nets fire.

nsys trace scope (benchmarks/nvllm/traces/cute_full_blessed/2026-05-01-bless-v1):
  The 1.5 MB changed.nsys-rep covers a 90 s capture window with one
  in-flight 256-token completion at ~2.4 tok/s decode. The trace supports
  the FULL_AND_PIECEWISE launch structure claim (166 cudaGraphLaunch_v10000
  + 7976 cudaLaunchKernel + 4548 cuLaunchKernelEx) but does NOT support
  per-kernel performance claims yet — cuda_gpu_kern_sum reports no GPU
  kernel data because per-kernel CUPTI attribution into FULL-graph nodes
  did not surface in this run despite --cuda-graph-trace=node. Per-kernel
  µs breakdown and any speedup numbers require a follow-up trace using
  torch profiler or a different CUPTI configuration; do not cite speedups
  from this trace.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 6)
Spec: docs/superpowers/specs/2026-05-01-cute-full-cache-production-workaround-design.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ral within noise

Two-leg trace bundle for FULL+blessed (lower-8 + n=1) vs PIECEWISE on
matched config (only cudagraph_mode + bless mount differ). Demonstrates
that the opt-in serve-cute-full.sh path is production-capable on a
trace-backed basis without the prior torch.compile/inductor non-determinism
(per bless-v1 in ce26aaa).

Headline (single 256-tok streaming request, unprofiled):
- TTFT 593→592 ms, decode 2.343→2.352 tok/s — identical within noise

Per-kernel aggregate (77 common kernels):
- -228.6 ms (-0.27%) in favor of FULL+blessed
- DecodeKernel -0.7%/call (-321 ms total), PhaseE_Beta_Kernel -0.4%/call
- FP4 GEMM cutlass::device_kernel +1.2%/call (+107 ms — counter-shift)
- Small triton/elementwise kernels 16-30% faster under FULL graph

Host watchdog peak: 80-82 GiB / 119 GiB. Well below danger; the prior OOM
was a max_iterations bug, not a budget problem (see Caveats §2 in summary).

Verdict: not a clear win — performance-neutral. User gates respected:
no all-32 bless attempted, no n>1 attempted, default remains PIECEWISE
pending a clear win.

Bundle:
- benchmarks/nvllm/traces/cute_full_blessed/2026-05-01-vs-piecewise/
  · summary.md (AGENTS.md §4 deliverable)
  · piecewise.nsys-rep, full.nsys-rep (system-wide, 60s window each)
  · {piecewise,full}_kernels.csv (per-kernel μs)
  · {piecewise,full}_streaming.json, _meta.json, _serve.log, _mem.log
  · comparison.md, comparison.json (per-kernel diff)
  · raw .pt.trace.json.gz files are gitignored (reproducible from harness)
- docs/research/cute_full_blessed_traces/
  · capture_full_vs_piecewise.sh (harness, ~1.5-2 hr wall)
  · render_comparison.py (CSV → markdown + json)
  · streaming_ttft.py (single streaming request helper)
- .gitignore: carve-out for cute_full_blessed area

Caveat: profiler bounded to first ~200 worker iterations per leg via
max_iterations=200 (active_iterations alone is dead code without
wait/warmup_iterations — verified vs vllm/profiler/wrapper.py:104-116,
205-220). Wall workload still runs the full 30 × 256 tokens.
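
The per-kernel aggregate above is the kind of diff render_comparison.py
emits; a minimal sketch (assumed CSV shape name,total_us — the real
column names may differ) looks like:

```python
import csv
import io

def kernel_diff(piecewise_csv: str, full_csv: str):
    """Diff only the kernels common to both legs; return per-kernel
    (name, delta_us, delta_pct) rows plus the aggregate percent delta,
    negative meaning FULL+blessed spent less kernel time."""
    def load(text):
        return {row["name"]: float(row["total_us"])
                for row in csv.DictReader(io.StringIO(text))}
    a, b = load(piecewise_csv), load(full_csv)
    common = sorted(set(a) & set(b))
    rows = [(k, b[k] - a[k], 100.0 * (b[k] - a[k]) / a[k]) for k in common]
    total_a = sum(a[k] for k in common)
    total_b = sum(b[k] for k in common)
    return rows, 100.0 * (total_b - total_a) / total_a
```

Restricting to the common-kernel intersection is what makes the
aggregate apples-to-apples: kernels present in only one leg would
otherwise skew the total.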

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Natfii Natfii merged commit 10aa787 into main May 2, 2026
@Natfii Natfii deleted the feat/cute-full-cache-bless branch May 2, 2026 12:20