
feat(bless): production-capable opt-in FULL+blessed serving #5

Merged
Natfii merged 42 commits into main from
feat/cute-full-cache-bless
May 2, 2026
Conversation


Natfii commented May 2, 2026

Purpose

Land the bless cache workflow + opt-in serve-cute-full.sh path
that makes FULL+blessed CuTe serving production-capable on lower-8 layers
at concurrency=1, plus the apples-to-apples bench evidence backing it.

Background. The earlier persistent-buffers / workspace-reset direction
(Path B) was closed out (5d92deff3) once Z1 evidence (b2677abd3) showed
the FULL-graph instability was torch.compile/inductor non-determinism, not
a workspace-reset bug. This PR ships the alternative fix: pin compile
artifacts via a blessed AOT cache mounted read-only into the container,
so every cold start replays the same captured graphs.

Default behavior is unchanged. PIECEWISE remains the repo default.
FULL+blessed is opt-in via serve-cute-full.sh only. The bench result
was performance-neutral within noise (-0.27% aggregate kernel time,
+0.39% decode tok/s) — explicitly NOT a clear win, so no default flip.

This PR contains 42 commits sequenced as:

  1. C2 spike infra (efb6fdb29) — --json-out + --evidence-dir args
  2. Common helpers (9ead479aa..dbf40bce3) — compute_blessed_config_hash,
    resolve_blessed_manifest, verify_blessed_cache, HF revision resolver
  3. Persistent-buffers / Path B campaign + closeout
    (fcbdef8da..5d92deff3) — work that is now superseded; closeout
    doc is in this PR's history for traceability
  4. Bless orchestrator + launcher (88e28a002..8bf779825)
  5. serve-cute-full.sh integration (6ee3db2ab)
  6. Bless docs + post-ship fixes (fdcf92d46..f3a4ed0e6)
  7. Evidence:
    • ce26aaaa0 — bless-v1 first blessed cache + production trace + GSM8K +
      negative tests
    • 42793fc04 — apples-to-apples bench vs PIECEWISE (this PR's
      numerical headline)

Test Plan

# 1. Bless cache (one-time per (image_id × model × config) tuple)
bash scripts/bless_cache.sh

# 2. Opt-in serving with the blessed cache
bash scripts/serve-cute-full.sh   # uses lower-8 + n=1 + blessed mount

# 3. Reproduce the apples-to-apples bench (~1.5-2 hr wall clock)
bash docs/research/cute_full_blessed_traces/capture_full_vs_piecewise.sh

# 4. GSM8K sanity check (per CLAUDE.md post-quant gate)
.venv/bin/python scripts/gsm8k_eval_50.py --base-url http://localhost:8000/v1

Test Result

Apples-to-apples bench (commit 42793fc04)

                                           PIECEWISE     FULL+blessed          Δ
TTFT (single 256-tok streaming)            593.5 ms      592.1 ms              -1.4 ms
Decode tok/s                               2.343         2.352                 +0.39%
Aggregate kernel time (77 common kernels)  85,205.0 ms   84,976.5 ms           -228.6 ms (-0.27%)
DecodeKernel mean μs                       17,244.2      17,128.7              -0.7%/call
PhaseE_Beta_Kernel mean μs                 40,967.9      40,799.3              -0.4%/call
Host watchdog peak                         82 GiB        80 GiB (of 119 GiB)

Full evidence:
benchmarks/nvllm/traces/cute_full_blessed/2026-05-01-vs-piecewise/summary.md

bless-v1 evidence (commit ce26aaaa0)

GSM8K parity with PIECEWISE baseline, deterministic across cold starts,
negative tests confirm fail-closed behavior on bad cache.

Caveats

  • Lower-8 layers only (CUTE_PHASE_E_LAYERS=0..7); all-32-layer bless is
    future work.
  • Concurrency=1 / max_num_seqs=1 only; n>1 is future work.
  • *.pt.trace.json.gz raw torch profiler dumps are gitignored
    (37 MB total); reproducible via the harness.
  • One harness-write hook caveat: benchmarks/ and traces/ are
    evidence-only (write-blocked); future bench runs must stage outside
    then cp in.

AI assistance

This work was done with Claude Code assistance (Opus 4.7 1M-context).
The submitting human reviewed every changed line and ran the bench
end-to-end on-device.

Not duplicating

This is a fork-internal feature for the DGX Spark target; no upstream
vLLM PR addresses opt-in FULL+blessed serving for SM120.


Essential Elements Checklist
  • Purpose: opt-in FULL+blessed serving + bench evidence
  • Test plan: bless + serve + bench commands above
  • Test results: per-kernel + streaming numbers in table
  • Documentation: docs/blessed-caches/README.md, evidence
    summary, harness scripts under docs/research/

Natfii and others added 30 commits April 30, 2026 15:25
…ttributes

Pre-existing per-call torch.zeros allocations inside run_beta_coop_full
were unsafe under FULL_AND_PIECEWISE CUDA graph capture: the graph baked
the original allocation addresses, but each replay's torch.zeros call
returned a DIFFERENT graph-pool address. The kernel kept reading and
writing the originally-captured (now-stale) addresses, producing
"first 8 chars stable, decode tokens 2+ diverge" — verbatim symptom of
vLLM upstream vllm-project#35175 in CuTe form.

Hoisted five buffers to persistent attributes on
CutePagedAttentionImpl, allocated in attach_mlp_fusion inside the
existing CUTE_PHASE_E_FUSION try-block:

  self._phase_e_coop_wo_output            [max_num_seqs, 4, hidden]    f32
  self._phase_e_coop_mlp_partial_fp32     [max_num_seqs, slc, hidden]  f32
  self._phase_e_coop_mlp_arrival_count    [max_num_seqs, num_k_tiles]  u32
  self._phase_e_coop_grid_barrier_i32     [max_num_seqs]               i32
  self._phase_e_coop_phase1_arrival_count [max_num_seqs]               i32

run_beta_coop_full now requires them as keyword-only kwargs (after a
`*,` separator, since Python forbids non-default parameters after
defaulted ones). The existing counter .zero_() calls at phase_e_kernel.py:
3036-3038 stay as captured memsets — now hitting stable addresses.
mlp_partial_fp32 continues to rely on the v6 in-kernel CTA-local reset
at phase_e_kernel.py:~4191. wo_output gets no reset in this v1 patch;
v2 (out of scope) adds a separate captured reset op iff C2 fails.

External callers (precompile script, kernel test) updated to allocate
their own dummy buffers and pass them in. Epsilon-epilogue test also
cleaned up: lite kernel now invoked with emit_next_layernorm=False to
match the deleted Assert 3 (next_hidden); next_gamma allocation removed
because the kernel signature gates next_input_layernorm_gamma on
emit_next_layernorm. lite_next_hidden buffer kept because
Phase_D_MLP_Kernel asserts it non-None whenever emit_epilogue=True
(mlp_kernel.py:497-498).
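The hazard and the fix above can be sketched in plain Python. This is an illustrative stand-in, not the real CuTe code: bytearrays play the role of CUDA tensors, and `FakeAttentionImpl` / the buffer names are modeled on the commit message. The point it demonstrates is that a buffer allocated once in `attach_mlp_fusion` keeps a stable address across every call, while the keyword-only `*` separator forces callers to pass the buffers explicitly.

```python
class FakeAttentionImpl:
    """Stand-in for CutePagedAttentionImpl (sketch, not the real class)."""

    def attach_mlp_fusion(self, max_num_seqs: int, hidden: int) -> None:
        # Allocated exactly once, so the address is stable for the process
        # lifetime; a captured graph that baked this pointer stays valid.
        self._phase_e_coop_wo_output = bytearray(max_num_seqs * 4 * hidden * 4)
        self._phase_e_coop_grid_barrier_i32 = bytearray(max_num_seqs * 4)


def run_beta_coop_full(x, *, wo_output, grid_barrier):
    # Keyword-only (after `*`): a caller cannot forget a buffer or pass
    # them positionally in the wrong order.
    wo_output[: len(x)] = bytes(len(x))  # stand-in for the captured memset
    return id(wo_output)                 # "address" observed by each call


impl = FakeAttentionImpl()
impl.attach_mlp_fusion(max_num_seqs=1, hidden=8)

# Every "replay" sees the same buffer object, hence the same address:
addrs = {
    run_beta_coop_full(b"tok",
                       wo_output=impl._phase_e_coop_wo_output,
                       grid_barrier=impl._phase_e_coop_grid_barrier_i32)
    for _ in range(5)
}
assert len(addrs) == 1
```

A per-call `bytearray(...)` inside `run_beta_coop_full` would be the analog of the unsafe per-call `torch.zeros`: nothing pins its address across calls.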

Three new structural tests guard the contract going forward:
- test_run_beta_coop_full_has_persistent_buffer_kwargs (sig presence)
- test_run_beta_coop_full_no_internal_workspace_zeros (no per-call alloc)
- test_attach_mlp_fusion_allocates_persistent_beta_coop_buffers

Spec:  docs/superpowers/specs/2026-04-30-beta-coop-persistent-buffers-design.md
Plan:  docs/superpowers/plans/2026-04-30-beta-coop-persistent-buffers-plan.md
Bug:   vllm-project#35175 (analog)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Records postpatch validation for β-coop persistent workspace buffers v1:
C0 PIECEWISE GSM8K PASS, PIECEWISE replay coherence PASS, and
FULL lower-8 replay coherence FAIL.

The result lands v1 evidence and closeout only. Remaining suspected
blocker is stale wo_output content at a stable address; v2 should add
an explicit captured wo_output reset.

Code:  1cc51ab
Spec:  docs/superpowers/specs/2026-04-30-beta-coop-persistent-buffers-design.md
Plan:  docs/superpowers/plans/2026-04-30-beta-coop-persistent-buffers-plan.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
C1 + multi-token C2 investigation that culminated in the 2026-04-30
pre-patch closeout (evidence/2026-04-30-1430-closeout/CLOSEOUT.md):
FULL+β-coop is BLOCKED, cumulative-drift across layers, vLLM vllm-project#35175
analog. Production = PIECEWISE+β-coop (already shipping in v0.3.0).

Adds:
- vllm/v1/worker/gpu_model_runner.py: CUTE_FULL_GRAPH_PROBE v2
  (module-level booleans, two-trigger). Self-mutation variant from
  the morning hung capture 20+ min; module-level state is graph-safe
  (feedback_no_self_mut_in_cudagraph_dispatch).
- docs/research/2026-04-29-full-graph-spike/c2_full_layer_bisect.sh:
  layer-CSV-driven harness for C2 across full-attn layer subsets.
- docs/research/2026-04-29-full-graph-spike/c2_piecewise_betacoop.sh:
  PIECEWISE control wrapper.
- 16 evidence dirs (2026-04-30-0752 → 2026-04-30-1430-closeout)
  capturing single-layer / lower-8 / upper-8 / all-16 bisects,
  capture-flake hangs, and the 2026-04-30-1430-closeout writeup.

README + _sync_host_edits.sh + c1_replay_proof.sh updated to
match the layer-bisect harness flow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a `direct_register_custom_op` that zeroes
`_phase_e_coop_wo_output[:nat]` via cudaMemsetAsync (libcudart) on the
current torch CUDA stream, captured as a memset graph node ordered
before each `run_beta_coop_full` launch. Targets the v1 closeout's
"stale content at stable address" failure under FULL_AND_PIECEWISE.

- New file: _wo_output_reset_op.py (lazy libcudart bind, precondition
  asserts, env-gated CUTE_WO_RESET_LOG=1 capture-side probe).
- Side-effect import at qwen3_5.py:42 (mirrors existing _beta_coop_op
  registration site).
- Callsite at _backend.py:1540 (eager body of cute_beta_coop_run; not
  a splitting boundary; FX topology unchanged from v1).
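The byte-count arithmetic behind such a reset can be sketched as follows (a hypothetical helper, not the real `_wo_output_reset_op.py` API): zeroing `wo_output[:nat]` of a contiguous row-major `[max_num_seqs, 4, hidden]` float32 tensor touches exactly the first `nat * 4 * hidden * 4` bytes and leaves the tail rows intact.

```python
def leading_slice_nbytes(shape, itemsize, nat):
    """Bytes covered by tensor[:nat] for a contiguous row-major tensor.

    nat == 0 is a no-op (0 bytes to memset), matching the op's contract.
    """
    if not 0 <= nat <= shape[0]:
        raise ValueError("nat out of range for leading dimension")
    row_elems = 1
    for dim in shape[1:]:
        row_elems *= dim
    return nat * row_elems * itemsize


# [max_num_seqs=8, 4, hidden=2048] float32 (itemsize 4), reset first 3 rows:
assert leading_slice_nbytes((8, 4, 2048), 4, 3) == 3 * 4 * 2048 * 4
assert leading_slice_nbytes((8, 4, 2048), 4, 0) == 0
```

The real op would hand this byte count, plus the tensor's base pointer, to cudaMemsetAsync on the current stream.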

Spec: docs/superpowers/specs/2026-04-30-beta-coop-persistent-buffers-v2-design.md
v1 closeout: docs/research/2026-04-29-full-graph-spike/evidence/2026-04-30-1552-postpatch-v1/CLOSEOUT.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends _sync_host_edits.sh with docker-cp + sentinel checks for the
two new sync targets in the v2 patch:
- _wo_output_reset_op.py (new file)
- nvllm/models/qwen3_5.py (modified import line)

Also adds nvllm/models/__pycache__ to the stale-pyc cleanup so
qwen3_5.pyc doesn't shadow the new import line.

Forwards CUTE_WO_RESET_LOG through c2_full_layer_bisect.sh so the Gate 1
capture-side reset probe is actually enabled inside the container.

No rebuild needed (feedback_rebuild_guard); next gate run picks up
both files via docker cp.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two follow-ups from the Phase 1 code review of fcbdef8:

- M2: explicit `int(...)` cast on `torch.cuda.current_stream().cuda_stream`
  before binding into the cudaMemsetAsync ctypes call. Aligns with the
  sibling pattern used by 8 callsites in phase_e_kernel.py and kernel.py,
  and hardens the error-path f-string `stream={stream_handle:#x}` against
  a future torch version returning a non-int handle (which would mask the
  real CUDA error with a TypeError mid-format).

- M1: reworded module docstring to drop the "doubly preventing DCE" claim
  (which conflated DCE-safety with stream-capture-safety) and replaced
  it with the accurate explanation: the call runs inside the eager body
  of the cute_beta_coop_run splitting boundary, so Dynamo/FX never sees
  it and DCE simply doesn't apply. mutates_args is now framed as schema
  hygiene for any future trace site, not as a DCE guard.

Behavior unchanged at this commit; both edits are local to the new
_wo_output_reset_op.py module.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
[:nat] slice only, leaves tail intact, nat==0 is no-op.

Spec §6.2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec §6.3.1. Confirms the new captured-reset op runs eager between
PIECEWISE pieces without breaking the production decode path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec §6.3.2. Confirms the new captured-reset op preserves PIECEWISE
deterministic replay coherence (the v1 production path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec §6.4 + §6.5 probe (1). c2_replay_coherence + capture-side
runtime log (CUTE_WO_RESET_LOG=1) confirming the reset op fires
on every attached layer.

Result: FAIL — see c2_replay_coherence.md for unique count.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec §6.5 probe (3) + AGENTS.md §4. Captured a full nsys trace
under the v2 lower-8 (3,7,11,15,19,23,27,31) FULL_AND_PIECEWISE
config. Trace contains 166 MEMSET + 2946 KERNEL activity rows.

Honest limitation noted in summary.md: nsys followed only the
API server PID; the EngineCore subprocess (where β-coop and the
captured wo_output cudaMemsetAsync actually run) was NOT crossed,
so the trace shows 0 graph-captured (graphNodeId IS NOT NULL)
events. The β-coop kernels are absent from this PID's stream;
visible kernels are sampling/elementwise. Promoting this to a
captured-graph-node-ordering proof needs a follow-up arc using
either nsys child-tree-follow or vLLM's torch profiler API hooks
(per project memory feedback_vllm_profiling).

Trace + GPU activity ordering committed for forensic inspection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captured cudaMemsetAsync reset for _phase_e_coop_wo_output is wired
correctly and fires once per attached layer at FULL-graph capture
(8 unique stable data_ptrs in the [CUTE_WO_RESET] runtime log), but
the FULL+β-coop lower-8 replay-divergence bug class is unfixed:
Gate 1 c2_replay_coherence reports unique=4, cross-prompt dependent.

The "stale content at stable address" hypothesis from the v1 closeout
is insufficient. v2 patch ships as a no-op-for-FULL but
PIECEWISE-clean infrastructure layer:
- C0 PIECEWISE+β-coop GSM8K-sanity 8/8 PASS
- C2 PIECEWISE+β-coop replay coherence unique=1 cross-indep PASS
- Functional CUDA smoke (op zeros [:nat], leaves tail intact) PASS

PIECEWISE production path is intact. v3 candidates:
A) host-captured reset for mlp_partial_fp32 (replace in-kernel
   CTA-local reset; the v1 closeout's named escalation candidate),
B) re-evaluate the workspace-residue diagnosis altogether, or
C) defer to upstream vllm-project#40969 (open, same hardware/cudagraph_mode).

Per AFK instruction: do NOT auto-escalate to v3. Closeout stops here
for human review.

Spec: docs/superpowers/specs/2026-04-30-beta-coop-persistent-buffers-v2-design.md
Closeout: docs/research/2026-04-29-full-graph-spike/evidence/2026-04-30-1822-postpatch-v2/CLOSEOUT.md
Memory: ~/.claude/projects/-home-natfii-docker-nvllm/memory/project_full_graph_blocked.md (updated, gitignored)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…m-project#40969

Per user code review of the prior closeout commit (8c7111a):

1. Soften "captured graph node ordered before β-coop" claim. The
   nsys trace did NOT independently prove per-replay graph-node
   ordering (EngineCore subprocess not captured). Reframed as
   "issued during FULL-graph capture on the current stream;
   expected to be captured as a graph memset node, but per-replay
   ordering not independently proven." Applied at the verdict
   paragraph and the "Why v2 failed" section.

2. Replace "no-op-for-FULL" with "not sufficient to enable FULL;
   PIECEWISE-clean infrastructure/refactor remains shippable."
   The patch DOES affect FULL behavior (it just doesn't make FULL
   correct); the prior phrasing was misleading. Applied at the
   verdict paragraph and the Status block.

3. Resolve evidence-path ambiguity. Sibling evidence dirs use
   relative `../2026-04-30-XXXX/` paths (unambiguous from the
   closeout's own location); the trace dir outside `evidence/`
   uses `<repo-root>/benchmarks/...` notation.

4. "reuses v2's `_wo_output_reset_op.py` shape" → "...op pattern"
   in the v3 escalation candidate. mlp_partial_fp32's buffer
   shape differs from wo_output's; only the op pattern carries
   over, not the byte-count math.

5. Date-stamp upstream vllm-project#40969 references with the recheck
   timestamp (OPEN as of 2026-04-30, last upstream activity
   2026-04-28T11:06:08Z) so future readers can see how stale the
   claim is. Applied to all three references in the doc.

No factual claims changed; only framing and traceability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds CUTE_DISPATCH_AUDIT=1 probe at gpu_model_runner.py post-DP
re-dispatch site. Logs the FINAL returned mode + descriptor for the
first 100 dispatch calls, with raw vs descriptor fields side-by-side
so prefill (raw_tokens > 1) can be distinguished from steady-decode
(raw_tokens == raw_reqs == 1) rows.

Module-level int counter only (no setattr on self, per
feedback_no_self_mut_in_cudagraph_dispatch).

Path B Step 1 of the v2 β-coop diagnosis re-evaluation: confirm
whether the lower-8 FULL run is truly FULL replay during steady
decode, or silently hybrid/PIECEWISE per step. The existing
first-any / first-FULL booleans only prove FULL was reached at
least once; this gives steady-state visibility.

Sync infra:
- c2_full_layer_bisect.sh:73 forwards CUTE_DISPATCH_AUDIT to the
  EngineCore subprocess (per feedback_vllm_enginecore_env_strip).
- _sync_host_edits.sh adds a sentinel grep so a stale image copy
  doesn't silently invalidate the audit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-call CUTE_DISPATCH_AUDIT probe captured 100 dispatches across the
c2_replay_coherence pattern under FULL_AND_PIECEWISE on lower-8 β-coop
layers. Findings:

- 89 / 100 rows: FULL mode with uniform_decode=True, desc_uniform=True
  — every steady-decode dispatch was FULL.
- 2 / 100 rows: PIECEWISE at idx=2 / idx=6 during FULL graph CAPTURE
  (uniform_decode=False, capture phase, before first-FULL probe fired)
  — capture-time variant for non-uniform shapes, not steady decode.
- 9 / 100 rows: NONE — 7 force_eager warmup + 3 prefill of 12-token
  prompt (eager prefill is normal vLLM behavior).

Verdict: hybrid-dispatch hypothesis is RULED OUT for the steady decode
path. Proceed to Step 2 (inspect _beta_coop_op.py for capture-time
freeze risks).

Sidebar: this run produced unique=1 PASS (vs Gate 1 unique=4 FAIL,
same code path mod the new probe). v2 reset still fires (8 unique
[CUTE_WO_RESET] data_ptrs). Either the audit probe perturbed something
or the bug is genuinely stochastic — needs an audit-OFF re-run to
disambiguate. Single c2 trial cannot characterize this regime; this
reinforces the v2 closeout's "controlled multi-seed test we did not
run" framing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each trial uses CUTE_WO_RESET_LOG=1 (matches failing Gate 1) +
CUTE_DISPATCH_AUDIT=0 (audit OFF) + CUTE_FULL_GRAPH_PROBE=1
(bisect-script default). Tear down between trials. No code changes
under test from Step 1 audit run (HEAD d36abf7).

Per-trial evidence: trial.md with git SHA, env contract, c2 unique
count, same/cross/overall pass flags, first-any/first-FULL probe
presence, wo_reset unique data_ptrs.

Result: MIXED (2/3 PASS, 1/3 FAIL). Trial 1 unique=1 PASS, trial 2
unique=3 FAIL, trial 3 unique=1 PASS. Summary applies the Step X
verdict logic: mixed branch -> run two more trials and treat as
statistical, NOT a fix declaration. The original Gate 1 unique=4
FAIL appears to belong to the same statistical distribution as
v2 audit-OFF, not a deterministic bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
X.4+X.5 use identical protocol to X.1-X.3 (see commit 68c6ab9).
CUTE_WO_RESET_LOG=1, CUTE_DISPATCH_AUDIT=0, CUTE_FULL_GRAPH_PROBE=1.
Tear down between trials. No code changes under test from HEAD.

Per-trial trial.md format identical to X.1-X.3.

5-trial summary at evidence/<ts>-pathb-x-summary-5trial/summary.md
applies the user's statistical verdict logic:
- 0/5 or 1/5 FAIL: baseline too unstable to justify a patch.
- 2/5 or 3/5 FAIL: stochastic bug; v3 needs statistical acceptance.
- 4/5 or 5/5 FAIL: focused Z patch target.
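The three verdict branches above can be written as a small decision function (a sketch; the actual summary.md applies this logic by hand):

```python
def verdict(fails, trials=5):
    """Map a 5-trial FAIL count to the user's statistical verdict."""
    if trials != 5 or not 0 <= fails <= trials:
        raise ValueError("protocol is fixed at 5 trials")
    if fails <= 1:
        return "baseline too unstable to justify a patch"
    if fails <= 3:
        return "stochastic bug; v3 needs statistical acceptance"
    return "focused Z patch target"
```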

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds env-driven mount of /root/.cache/vllm to a host directory.
When PATHB_Z1_VLLM_CACHE_HOST_DIR is unset, the script behaves
identically to before (no mount, per-container scratch cache).

For Path B Z1 controlled causality test: lock torch.compile AOT
cache across fresh containers to test whether artifact-size /
compile-path is causal for the X-trial PASS/FAIL discriminator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per user code review of the Step 1 dispatch-audit run: bullet-form
of the user-specified evidence fields (dispatch_hypothesis,
steady_decode_rows, non_FULL_rows, coherence_result, interpretation)
plus a probe-design lesson section noting the c2 PASS-vs-FAIL flip
on the audit run reinforces "minimize hot-path state mutation."

Added inline before the audit-OFF reproducibility experiment (X)
was dispatched — this is the rationale the X experiment then
tested.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ausal

Locked torch.compile AOT cache to a host-mounted dir to test whether
the X-trial PASS/FAIL discriminator (62 MB log-reported / 70 MB disk
vs 73 MB log-reported / 81 MB disk artifact size, perfect 6/6
correlation) is causal vs merely correlative.

Two artifacts produced by inductor at the same cache key
(9a5549f23a178e35a9a3e9b4bed7adf1d137d22f3fc06ef8048d589e5d625721)
but distinct sha256s — confirming inductor non-determinism for the
same input graph:
- GOOD: sha256 651e00bd5997bacd9a062da66e6c9a078ed3c4469c27c715d8b025041a2a8264
- BAD:  sha256 af68c498c6ee45b60165d584a870f2f072068153a7c76d9592fc0097efe63c80

Trial results (each cache_reused on all 5, sha256 unchanged before
vs after each trial):
- Locked GOOD cache: 5/5 PASS
- Locked BAD cache: 1/5 PASS, 4/5 FAIL

Compared to baseline X-trials (cold cache, mixed artifact): 3/5 PASS.

The 5x ratio between cache labels (100% PASS vs 20% PASS) makes the
artifact identity the dominant load-bearing variable. The user's
verdict-framework branch "5/5 PASS + cache_reused on all 5 →
causality basically closed" is satisfied.

The bad.5 PASS (with sha256 verified unchanged) suggests a smaller
secondary source of non-determinism — not enough to undermine the
artifact-identity finding, but worth noting as residual uncertainty.

Implication for Z: production fix is "persist a known-good
torch.compile AOT cache across container starts," NOT a β-coop
workspace patch. Path B is closed.

Files:
- evidence-preserve: X.1-X.5 compile metrics (path/size/key/verdict);
  sha256 of X-trial artifacts NOT preserved (containers were torn
  down before this experiment).
- 5 good-trial dirs + 5 bad-trial dirs (each has trial.md, JSON, MD,
  full docker logs, probe slices, wo_reset log).
- summary at evidence/<ts>-pathb-z1-summary/summary.md.

Bisect-script env hook (PATHB_Z1_VLLM_CACHE_HOST_DIR) was committed
separately as f002ee4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Updates the v2 closeout and Z1 summary with Path B + Z1 findings.
v2's earlier "stochastic FAIL inside FULL+β-coop replay/kernel
behavior" framing is largely superseded: Z1 showed cache artifact/
directory identity is the dominant load-bearing variable explaining
9/10 outcomes. v3 mlp_partial_fp32 host-reset is **explicitly
orphaned** (Y showed in-kernel reset is mechanically correct; Z1
showed FAIL is upstream of any workspace reset).

Per code-review feedback (2026-05-01):

1. Softened causality wording in both docs from "CAUSES the
   verdict in 9 of 10 trials" / "fully closed root cause" to
   "dominant load-bearing variable explains 9/10 outcomes" /
   "dominant evidence for upstream torch.compile/inductor
   non-determinism, not a fully closed RCA." The bad.5 PASS keeps
   the door open for a smaller secondary non-determinism source.

2. Added a cache-directory manifest section to the Z1 summary
   showing only the AOT model file differed between snapshots
   (modelinfos/, computation_graph.py, cache_key_factors.json all
   byte-identical). Reframes the discriminator as "cache
   artifact/directory identity" since the mount pinned all of
   /root/.cache/vllm.

3. Production fix recommendations now require fail-closed cache
   handling: verify expected sha256 before launch, refuse to start
   on missing/empty/mismatched, mount read-only after bootstrap.
   Bare RW mount is NOT sufficient.

4. Probe-off validation gate added as REQUIRED before declaring a
   cache production-ready. All Z1 trials ran with
   CUTE_FULL_GRAPH_PROBE=1 and CUTE_WO_RESET_LOG=1; production
   blessing must validate locked-good with all CUTE_* probes off.

Closeout updates:
- New "Path B Update — 2026-05-01" section at end summarizing
  Step 1 (dispatch audit), Y/Y2 (code inspection), X (audit-OFF
  reproducibility), A (PASS/FAIL log diff), Z1 (cache-pin causality).
- "Recommendation" section paths A and B marked superseded; A
  (mlp_partial reset) explicitly orphaned.
- "Followup investigation candidates" item 1 (mlp_partial reset)
  marked orphaned; new item 4 added for upstream torch.compile/
  inductor non-determinism tracking.

Memory project_full_graph_blocked.md updated separately (gitignored).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bless orchestrator (Phase 2 cache workaround) needs a stable JSON
output path per validation trial. Default evidence-dir behavior is
unchanged when neither arg is passed.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 0)
Phase 2 cache workaround Task 1.1. Deterministic sha256 over a canonical
JSON of 18 cache-affecting inputs. Argument order is part of the contract;
changing it invalidates all manifests.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 1.1)
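The hashing idea can be sketched in Python (the real helper is a shell function built on jq; the function name and the concrete inputs here are assumptions). The inputs are hashed as an ordered list, so reordering the arguments changes the hash and invalidates existing manifests, exactly the contract stated above.

```python
import hashlib
import json


def compute_blessed_config_hash(inputs):
    """sha256 over a canonical (compact, ordered) JSON array of inputs."""
    canonical = json.dumps(list(inputs), separators=(",", ":"),
                           ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Order is part of the contract: swapping two inputs yields a new hash.
a = compute_blessed_config_hash(["image:abc", "model:qwen3", "layers:0..7"])
b = compute_blessed_config_hash(["model:qwen3", "image:abc", "layers:0..7"])
assert a != b and len(a) == 64
```

An "anchor" test in this style pins one fixture's hash to a hard-coded hex string, so any refactor that changes canonicalization fails loudly.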
Address code quality review of Task 1.1:
- Add test_compute_hash_anchor_value: pins canonical JSON output for a
  fixed 18-input fixture to a hard-coded sha256, so a refactor of the
  jq filter body or a misordered args call flips the hash and fails
  loudly instead of silently producing a wrong-but-still-64-hex result.
- Add test_compute_hash_rejects_wrong_arg_count: exercises the
  assert_exit_code harness primitive (previously dead code) and the
  helper's wrong-arg-count guard.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 1.1 follow-up)
Phase 2 cache workaround Task 1.2. Manifest lookup is by config_hash
field (not filename); duplicate hashes are exit 2 (hard corruption).
Verify checks size + sha256 of every files[] entry.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 1.2)
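The fail-closed verification described above can be sketched like this (a Python model of the shell helper; the manifest entry layout is an assumption based on the commit message). Every `files[]` entry must exist with the recorded size and sha256; any missing file or drift refuses rather than serving a suspect cache.

```python
import hashlib
import os


def verify_blessed_cache(cache_dir, files):
    """Return True only if every manifest entry matches on size + sha256."""
    for entry in files:
        path = os.path.join(cache_dir, entry["path"])
        if not os.path.isfile(path):
            return False                          # missing file: refuse
        if os.path.getsize(path) != entry["size"]:
            return False                          # size drift: refuse early
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        if h.hexdigest() != entry["sha256"]:
            return False                          # content drift: refuse
    return True
```

The size check is a cheap pre-filter; the sha256 is the actual integrity gate.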
Address code quality review of Task 1.2:
- I-1: nvllm_resolve_blessed_manifest now stderr-warns on a manifest that
  parses but lacks .config_hash (was silent skip — masked a real
  packaging bug class).
- I-2: add coverage for resolve helper's default-dir fallback path.
- M-3: add tests for 0-byte file rejection and empty files[] in
  nvllm_verify_blessed_cache (both paths existed but were uncovered).
- M-5: use -ne (numeric) instead of != (string) for size comparison in
  verify helper, matching the surrounding -eq usage.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 1.2 follow-up)
Phase 2 cache workaround Task 1.3. Refusal helpers print structured
remediation messages and return 1 (callers chain || exit 1 for hard
exit). HF revision resolver delegates to huggingface_hub.model_info and
validates the 40-char sha format. Container-exists guard prevents
clobbering operator-owned containers (CLAUDE.md docker rule).

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 1.3)
Address code quality review of Task 1.3:
- Important: model_id was interpolated directly into Python source,
  breaking parse if the id contained quotes/backslashes/newlines.
  Pass via MODEL_ID env var with single-quoted Python source.
- Minor: docstring listed a 'huggingface-cli fallback' that was never
  implemented. Strike to match the actual single-path behavior.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 1.3 follow-up)
Phase 2 cache workaround Task 2.1. CLI surface, BlessConfig, TrialResult.
Phase 1/2/3 wiring lands in subsequent tasks.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 2.1)
Phase 2 cache workaround Task 2.2. Phase 1 builds the RW container,
polls /v1/models, sends one fixed completion to force prefill+decode +
AOT artifact write, gracefully stops, resolves the 4 expected files
via globs, returns aot_sha/size for Phase 2 reuse signal.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 2.2)
Natfii and others added 12 commits May 1, 2026 06:47
Address code quality review of Task 2.2:
- I-1: phase1_bootstrap and _docker_stop force-removed containers,
  conflicting with the detect-and-refuse pattern established by
  nvllm_refuse_if_container_exists (Task 1.3). Replace with
  docker stop -t N + docker rm (no -f). A leftover orchestrator
  container stops cleanly; an unexpected operator-owned container
  surfaces as a loud failure rather than silent destruction.
- I-2: _poll_models 'attempt_max=3' kwarg was multiplied by 50
  internally → up to 150 retries, drifting from spec §7.3's 'max 3
  retries' contract. Drop the multiplier and rename to
  max_transient_retries for clarity. Also moves sleep(2) out of the
  except clause so non-200 polling waits between attempts.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 2.2 follow-up)
Phase 2 cache workaround Task 2.3. classify_cache_reuse implements the
3 Z1-derived signals (AOT load marker, no 'saved AOT' lines, post-trial
sha unchanged). parse_c2_json gates on same+cross+unique=1.

Deviation from plan: phase2_validate's per-trial defensive cleanup uses
non-force `docker stop -t 10 + docker rm` instead of the plan's
`docker rm -f`. Same reasoning as ea0046d applied to Task 2.2: the
launcher's nvllm_refuse_if_container_exists already refuses
operator-owned containers; the orchestrator should not silently destroy
unexpected state, only its own leftovers, which stop+rm handles cleanly.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 2.3)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 cache workaround Task 2.4. accept() builds the manifest dict,
archives prior on --rebless (manifest + blessed dir, with timestamp +
old artifact sha8 in name), atomic mv staging->blessed, writes JSON.
reject() preserves staging as evidence with failure summary, no manifest.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 2.4)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 cache workaround Task 2.5. main() resolves paths from env
(NVLLM_BLESSED_CACHE_ROOT, HF_MODEL), runs phase 1, runs phase 2,
decides accept/reject. Refuses early if blessed dir exists without
--rebless. Exit codes: 0 PASS, 1 refuse-no-rebless, 2 phase-1-fail,
3 phase-2-fail.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 2.5)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…trator

Phase 2 cache workaround Task 3. Bash preflight: image, GPU memory, no
running container, jq+flock available, HF revision resolved. Derives
config_hash. Acquires flock per config_hash. Refuses early if manifest
exists and no --rebless. Then exec .venv/bin/python orchestrator.

shellcheck clean.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 3)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…erify

Phase 2 cache workaround Task 4. Defaults flipped:
  - CUTE_FULL_GRAPH_PROBE=0 (was hardcoded 1)
  - CUTE_WO_RESET_LOG=0
  - CUTE_DISPATCH_AUDIT=0
  - CUTE_PHASE_E_LAYERS=0,1,2,3,4,5,6,7

Added pre-docker-run verify-and-mount block that derives config_hash,
resolves manifest, verifies cache, refuses on no-match / drift / unsafe-dev.
--debug (eager) bypasses verification entirely.

Cleanup: removed 4 dead variables (KV_CACHE, ATTN_BACKEND, MAX_MODEL_LEN,
MAX_NUM_SEQS) shadowed by the new *_VAL set; shellcheck-clean.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 4)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ts table

Phase 2 cache workaround Task 5. README documents what the directory is,
links Z1 evidence, lists active manifests (table auto-regenerated by
orchestrator on each accept()), summarizes the bless protocol, and
explains the three refusal modes.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 5)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Empirical: first bless attempt failed in 8s with ECONNREFUSED. vLLM
doesn't bind port 8000 until model load (~5 min for 27B) finishes;
during that window every poll returns ECONNREFUSED. Counting those
exhausts max_transient_retries=3 in seconds and aborts the bless before
the model can possibly be ready.

Fix: ConnectionRefusedError is the EXPECTED boot signal — sleep through
it without counting. timeout_s (default 600s) is the correct wall-clock
bound for boot. Other URLErrors (DNS, TLS, broken pipe) still count as
true transients with the 3-retry cap.
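
The corrected poll loop can be sketched like this (poll is injected so
the behavior is testable; names are illustrative, not the orchestrator's
actual API):

```python
import time

def wait_for_server(poll, timeout_s=600, max_transient_retries=3,
                    sleep_s=1.0, clock=time.monotonic, sleep=time.sleep):
    """Sketch of the fixed health poll: ConnectionRefusedError is the
    expected boot signal, bounded only by wall clock; other transient
    OS-level errors still count against max_transient_retries."""
    deadline = clock() + timeout_s
    transients = 0
    while clock() < deadline:
        try:
            return poll()  # e.g. a 200 from /health
        # Must come before OSError: ConnectionRefusedError subclasses it.
        except ConnectionRefusedError:
            pass  # expected while the model loads; do NOT count
        except OSError:
            transients += 1  # DNS, TLS, broken pipe, ...
            if transients > max_transient_retries:
                raise
        sleep(sleep_s)
    raise TimeoutError("server did not come up within timeout_s")
```

The except-clause ordering is the whole fix: refusals fall through to a
plain sleep, so only timeout_s bounds the boot window.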

Two regression tests:
- ECONNREFUSED looped 20× still succeeds when followed by 200
- DNS-style URLError still trips after 3 retries

This corrects the post-review fix in ea0046d, which assumed "real boot
does NOT flap repeatedly". It does: with ECONNREFUSED, the flap spans
the entire boot window.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 6 hardening)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Empirical: bless attempt #2 reached the AOT-flush step but Phase 1 failed
sha256 with [Errno 13] Permission denied on the AOT model file. vLLM
runs as root in the container; torch writes the AOT artifact mode 0600
into the host-mounted staging dir, leaving it root-owned and unreadable
by the orchestrator (host user natfii).

Add _chmod_staging_for_host: docker exec chmod -R a+rX before container
stop. Capital X adds execute only to directories (and already-executable
files), so dirs stay traversable while plain files become world-readable
without gaining execute; write bits are untouched. Phase 2 :ro mounts and
the host-side verify in serve-cute-full.sh both read as unprivileged users.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 6 hardening)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes uncovered by bless attempts #2-#4:

1. Post-shutdown perms normalization (replaces the racy pre-stop chmod
   from cdd59d0). torch/inductor can finalize the AOT artifact during
   graceful shutdown, so chmod-then-stop leaves files written between
   chmod and exit still root-owned mode 0600. New helper:
   _normalize_staging_permissions_for_host runs a short-lived RW helper
   container against the staging mount AFTER docker stop, executing
   chown -R <host uid>:<host gid> + chmod -R u+rwX,go+rX. Phase 1 calls
   it twice: before deleting stale staging (so root-owned leftovers can
   be rmtree'd) and after stopping the bootstrap container.

2. Pre-create dummy_cache/ in staging. vLLM caching.py:466-467 always
   calls os.makedirs(<vllm_root>/dummy_cache, exist_ok=True) on the
   AOT-load path, even when disable_cache=True. With our :ro Phase 2
   mount this raised EROFS, the AOT load aborted, and torch.compile
   silently recompiled — every Phase 2 trial reported cache miss.
   Pre-creating dummy_cache/ in staging makes the makedirs a no-op
   under :ro (verified: makedirs(exist_ok=True) on existing dir under
   read-only parent returns success).
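
The regression claim reduces to a small, checkable property (sketch
with a chmod-read-only temp dir standing in for the :ro mount; the
exist_ok success path is what matters and holds either way):

```python
import os

def demo_dummy_cache_precreate(vllm_root: str) -> bool:
    """Pre-create dummy_cache/, flip the parent read-only, and show
    that os.makedirs(..., exist_ok=True) on the existing dir still
    succeeds -- which is why pre-creating it defuses the EROFS abort."""
    dummy = os.path.join(vllm_root, "dummy_cache")
    os.makedirs(dummy)                     # the pre-create step
    os.chmod(vllm_root, 0o555)             # parent now read-only
    try:
        # This mirrors vLLM's AOT-load makedirs call: the kernel reports
        # EEXIST before any write to the parent is attempted, so
        # exist_ok=True turns it into a no-op even under :ro.
        os.makedirs(dummy, exist_ok=True)
        return True
    finally:
        os.chmod(vllm_root, 0o755)         # restore for cleanup
```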

Empirical: bless attempt #4 ran the full flow (5 trials, all c2 PASS),
but every trial recompiled — see trial_1_container.log line 133:
"Compiling model again due to a load failure ... reason: [Errno 30]
Read-only file system: '/root/.cache/vllm/dummy_cache'".

Two new tests:
- _normalize_staging_permissions_uses_helper_container shape check
- makedirs_dummy_cache_no_ops_under_readonly_parent (regression)

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 6 hardening)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…negative tests

Phase 2 cache workaround Task 6 (integration). First production manifest
committed: config_hash e6d32b41c46842c97f877339e86c79d6cc11004a238bef32f2cd3fdb73ce28db,
AOT artifact sha d97e88db. Closes the bless-then-mount-RO workaround for the
Z1 inductor non-determinism problem.

Validation evidence:
  - K=5 / 5 trials PASS, c2_replay_coherence n=8 each
    (same-prompt unique=1, cross-prompt independent)
  - cache_reused=true on every trial (aot_load log marker present,
    zero "saved AOT compiled function" lines, post-trial AOT sha unchanged)
  - production serve confirms "Directly load AOT compilation from path"
    (decorators.py:305) on the canonical RO mount
  - GSM8K-50 (seed=42) on the production serve: 47 / 50 = 94.0%, well
    above the kernel-change "no regression vs prior phase" gate (β-coop
    baseline ~30-31/50)

Negative tests (both refused, no container started):
  - probe-on (CUTE_FULL_GRAPH_PROBE=1): config_hash differs -> "No matching
    manifest" -> refusal
  - drift (corrupted AOT byte in an isolated copy of the cache, NOT the
    canonical artifact): "DRIFT DETECTED" with size mismatch diagnostic ->
    refusal. Canonical cache verified intact after the test, so the
    workaround does not leave production broken when its safety nets fire.

nsys trace scope (benchmarks/nvllm/traces/cute_full_blessed/2026-05-01-bless-v1):
  The 1.5 MB changed.nsys-rep covers a 90 s capture window with one
  in-flight 256-token completion at ~2.4 tok/s decode. The trace supports
  the FULL_AND_PIECEWISE launch structure claim (166 cudaGraphLaunch_v10000
  + 7976 cudaLaunchKernel + 4548 cuLaunchKernelEx) but does NOT support
  per-kernel performance claims yet — cuda_gpu_kern_sum reports no GPU
  kernel data because per-kernel CUPTI attribution into FULL-graph nodes
  did not surface in this run despite --cuda-graph-trace=node. Per-kernel
  µs breakdown and any speedup numbers require a follow-up trace using
  torch profiler or a different CUPTI configuration; do not cite speedups
  from this trace.

Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 6)
Spec: docs/superpowers/specs/2026-05-01-cute-full-cache-production-workaround-design.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ral within noise

Two-leg trace bundle for FULL+blessed (lower-8 + n=1) vs PIECEWISE on
matched config (only cudagraph_mode + bless mount differ). Demonstrates
that the opt-in serve-cute-full.sh path is production-capable on a
trace-backed basis without the prior torch.compile/inductor non-determinism
(per bless-v1 in ce26aaa).

Headline (single 256-tok streaming request, unprofiled):
- TTFT 593→592 ms, decode 2.343→2.352 tok/s — identical within noise

Per-kernel aggregate (77 common kernels):
- -228.6 ms (-0.27%) in favor of FULL+blessed
- DecodeKernel -0.7%/call (-321 ms total), PhaseE_Beta_Kernel -0.4%/call
- FP4 GEMM cutlass::device_kernel +1.2%/call (+107 ms — counter-shift)
- Small triton/elementwise kernels 16-30% faster under FULL graph

Host watchdog peak: 80-82 GiB / 119 GiB. Well below danger; the prior OOM
was a max_iterations bug, not a budget problem (see Caveats §2 in summary).

Verdict: not a clear win — performance-neutral. User gates respected:
no all-32 bless attempted, no n>1 attempted, default remains PIECEWISE
pending a clear win.

Bundle:
- benchmarks/nvllm/traces/cute_full_blessed/2026-05-01-vs-piecewise/
  · summary.md (AGENTS.md §4 deliverable)
  · piecewise.nsys-rep, full.nsys-rep (system-wide, 60s window each)
  · {piecewise,full}_kernels.csv (per-kernel μs)
  · {piecewise,full}_streaming.json, _meta.json, _serve.log, _mem.log
  · comparison.md, comparison.json (per-kernel diff)
  · raw .pt.trace.json.gz files are gitignored (reproducible from harness)
- docs/research/cute_full_blessed_traces/
  · capture_full_vs_piecewise.sh (harness, ~1.5-2 hr wall)
  · render_comparison.py (CSV → markdown + json)
  · streaming_ttft.py (single streaming request helper)
- .gitignore: carve-out for cute_full_blessed area

Caveat: profiler bounded to first ~200 worker iterations per leg via
max_iterations=200 (active_iterations alone is dead code without
wait/warmup_iterations — verified vs vllm/profiler/wrapper.py:104-116,
205-220). Wall workload still runs the full 30 × 256 tokens.
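
The per-kernel aggregate above is the kind of diff render_comparison.py
emits; a minimal sketch (assumed CSV shape name,total_us — the real
column names may differ) looks like:

```python
import csv
import io

def kernel_diff(piecewise_csv: str, full_csv: str):
    """Diff only the kernels common to both legs; return per-kernel
    (name, delta_us, delta_pct) rows plus the aggregate percent delta,
    negative meaning FULL+blessed spent less kernel time."""
    def load(text):
        return {row["name"]: float(row["total_us"])
                for row in csv.DictReader(io.StringIO(text))}
    a, b = load(piecewise_csv), load(full_csv)
    common = sorted(set(a) & set(b))
    rows = [(k, b[k] - a[k], 100.0 * (b[k] - a[k]) / a[k]) for k in common]
    total_a = sum(a[k] for k in common)
    total_b = sum(b[k] for k in common)
    return rows, 100.0 * (total_b - total_a) / total_a
```

Restricting to the common-kernel intersection is what makes the
aggregate apples-to-apples: kernels present in only one leg would
otherwise skew the total.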

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Natfii Natfii merged commit 10aa787 into main May 2, 2026
@Natfii Natfii deleted the feat/cute-full-cache-bless branch May 2, 2026 12:20