feat(bless): production-capable opt-in FULL+blessed serving #5
Merged
Conversation
…ttributes

Pre-existing per-call torch.zeros allocations inside run_beta_coop_full were unsafe under FULL_AND_PIECEWISE CUDA graph capture: the graph baked in the original allocation addresses, but each replay's torch.zeros call returned a DIFFERENT graph-pool address. The kernel kept reading and writing the originally-captured (now-stale) addresses, producing "first 8 chars stable, decode tokens 2+ diverge" — the verbatim symptom of upstream vllm-project#35175, in CuTe form.

Hoisted five buffers to persistent attributes on CutePagedAttentionImpl, allocated in attach_mlp_fusion inside the existing CUTE_PHASE_E_FUSION try-block:

- self._phase_e_coop_wo_output [max_num_seqs, 4, hidden] f32
- self._phase_e_coop_mlp_partial_fp32 [max_num_seqs, slc, hidden] f32
- self._phase_e_coop_mlp_arrival_count [max_num_seqs, num_k_tiles] u32
- self._phase_e_coop_grid_barrier_i32 [max_num_seqs] i32
- self._phase_e_coop_phase1_arrival_count [max_num_seqs] i32

run_beta_coop_full now requires them as keyword-only kwargs (after a bare `*` separator, since Python forbids non-default parameters after defaulted ones). The existing counter .zero_() calls at phase_e_kernel.py:3036-3038 stay as captured memsets — now hitting stable addresses. mlp_partial_fp32 continues to rely on the v6 in-kernel CTA-local reset at phase_e_kernel.py:~4191. wo_output gets no reset in this v1 patch; v2 (out of scope) adds a separate captured reset op iff C2 fails.

External callers (precompile script, kernel test) updated to allocate their own dummy buffers and pass them in.

Epsilon-epilogue test also cleaned up: the lite kernel is now invoked with emit_next_layernorm=False to match the deleted Assert 3 (next_hidden); the next_gamma allocation is removed because the kernel signature gates next_input_layernorm_gamma on emit_next_layernorm. The lite_next_hidden buffer is kept because Phase_D_MLP_Kernel asserts it non-None whenever emit_epilogue=True (mlp_kernel.py:497-498).

Three new structural tests guard the contract going forward:
- test_run_beta_coop_full_has_persistent_buffer_kwargs (sig presence)
- test_run_beta_coop_full_no_internal_workspace_zeros (no per-call alloc)
- test_attach_mlp_fusion_allocates_persistent_beta_coop_buffers

Spec: docs/superpowers/specs/2026-04-30-beta-coop-persistent-buffers-design.md
Plan: docs/superpowers/plans/2026-04-30-beta-coop-persistent-buffers-plan.md
Bug: vllm-project#35175 (analog)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
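For readers outside the CuTe backend, a minimal sketch of the failure mode and the fix, assuming illustrative signatures and shapes (only the class, method, and buffer names come from the commit):

```python
import torch

class CutePagedAttentionImpl:
    # Sketch only; the real class lives in the CuTe attention backend.

    def attach_mlp_fusion(self, max_num_seqs: int, hidden: int, device="cuda"):
        # Allocate ONCE: the address baked into the captured CUDA graph
        # stays valid on every replay.
        self._phase_e_coop_wo_output = torch.zeros(
            max_num_seqs, 4, hidden, dtype=torch.float32, device=device)
        self._phase_e_coop_grid_barrier_i32 = torch.zeros(
            max_num_seqs, dtype=torch.int32, device=device)

    # Buffers are keyword-only (after the bare `*`), since Python forbids
    # non-default parameters after defaulted ones.
    def run_beta_coop_full(self, hidden_states, scale=1.0, *,
                           wo_output, grid_barrier_i32):
        # OLD, unsafe pattern: `wo_output = torch.zeros(...)` HERE. Capture
        # records that allocation's address; each replay's torch.zeros
        # returns a different graph-pool address, so the kernel keeps
        # using the stale captured one.
        grid_barrier_i32.zero_()  # captured memset, now at a stable address
        ...  # kernel launch reads/writes the persistent addresses
```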
Records post-patch validation for β-coop persistent workspace buffers v1: C0 PIECEWISE GSM8K PASS, PIECEWISE replay coherence PASS, and FULL lower-8 replay coherence FAIL. This commit lands v1 evidence and closeout only. The remaining suspected blocker is stale wo_output content at a stable address; v2 should add an explicit captured wo_output reset. Code: 1cc51ab Spec: docs/superpowers/specs/2026-04-30-beta-coop-persistent-buffers-design.md Plan: docs/superpowers/plans/2026-04-30-beta-coop-persistent-buffers-plan.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
C1 + multi-token C2 investigation that culminated in the 2026-04-30 pre-patch closeout (evidence/2026-04-30-1430-closeout/CLOSEOUT.md): FULL+β-coop is BLOCKED, cumulative-drift across layers, vLLM vllm-project#35175 analog. Production = PIECEWISE+β-coop (already shipping in v0.3.0). Adds: - vllm/v1/worker/gpu_model_runner.py: CUTE_FULL_GRAPH_PROBE v2 (module-level booleans, two-trigger). Self-mutation variant from the morning hung capture 20+ min; module-level state is graph-safe (feedback_no_self_mut_in_cudagraph_dispatch). - docs/research/2026-04-29-full-graph-spike/c2_full_layer_bisect.sh: layer-CSV-driven harness for C2 across full-attn layer subsets. - docs/research/2026-04-29-full-graph-spike/c2_piecewise_betacoop.sh: PIECEWISE control wrapper. - 16 evidence dirs (2026-04-30-0752 → 2026-04-30-1430-closeout) capturing single-layer / lower-8 / upper-8 / all-16 bisects, capture-flake hangs, and the 2026-04-30-1430-closeout writeup. README + _sync_host_edits.sh + c1_replay_proof.sh updated to match the layer-bisect harness flow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a `direct_register_custom_op` that zeroes `_phase_e_coop_wo_output[:nat]` via cudaMemsetAsync (libcudart) on the current torch CUDA stream, captured as a memset graph node ordered before each `run_beta_coop_full` launch. Targets the v1 closeout's "stale content at stable address" failure under FULL_AND_PIECEWISE. - New file: _wo_output_reset_op.py (lazy libcudart bind, precondition asserts, env-gated CUTE_WO_RESET_LOG=1 capture-side probe). - Side-effect import at qwen3_5.py:42 (mirrors existing _beta_coop_op registration site). - Callsite at _backend.py:1540 (eager body of cute_beta_coop_run; not a splitting boundary; FX topology unchanged from v1). Spec: docs/superpowers/specs/2026-04-30-beta-coop-persistent-buffers-v2-design.md v1 closeout: docs/research/2026-04-29-full-graph-spike/evidence/2026-04-30-1552-postpatch-v1/CLOSEOUT.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
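A minimal sketch of the reset op's mechanics, assuming a contiguous buffer; the function name and byte-count math are illustrative, and the library name may need a versioned suffix (e.g. libcudart.so.12) on some installs. The explicit int(...) cast anticipates the M2 review follow-up two commits down.

```python
import ctypes
import torch

_libcudart = None

def _cudart():
    # Lazy bind, mirroring the commit: import must not require CUDA.
    global _libcudart
    if _libcudart is None:
        _libcudart = ctypes.CDLL("libcudart.so")
    return _libcudart

def reset_wo_output(buf: torch.Tensor, nat: int) -> None:
    """Zero buf[:nat] via cudaMemsetAsync on the current torch stream.

    Issued during FULL-graph capture, this is expected to be recorded as a
    memset graph node ordered before the next kernel launch on the stream.
    """
    if nat == 0:
        return  # no-op per the op's contract
    assert buf.is_contiguous()  # byte math below assumes contiguity
    count = nat * buf.stride(0) * buf.element_size()
    stream = int(torch.cuda.current_stream().cuda_stream)
    err = _cudart().cudaMemsetAsync(
        ctypes.c_void_p(buf.data_ptr()),
        ctypes.c_int(0),
        ctypes.c_size_t(count),
        ctypes.c_void_p(stream),
    )
    assert err == 0, f"cudaMemsetAsync failed: err={err} stream={stream:#x}"
```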
Extends _sync_host_edits.sh with docker-cp + sentinel checks for the two new sync targets in the v2 patch: - _wo_output_reset_op.py (new file) - nvllm/models/qwen3_5.py (modified import line) Also adds nvllm/models/__pycache__ to the stale-pyc cleanup so qwen3_5.pyc doesn't shadow the new import line. Forwards CUTE_WO_RESET_LOG through c2_full_layer_bisect.sh so the Gate 1 capture-side reset probe is actually enabled inside the container. No rebuild needed (feedback_rebuild_guard); next gate run picks up both files via docker cp. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two follow-ups from the Phase 1 code review of fcbdef8: - M2: explicit `int(...)` cast on `torch.cuda.current_stream().cuda_stream` before binding into the cudaMemsetAsync ctypes call. Aligns with the sibling pattern used by 8 callsites in phase_e_kernel.py and kernel.py, and hardens the error-path f-string `stream={stream_handle:#x}` against a future torch version returning a non-int handle (which would mask the real CUDA error with a TypeError mid-format). - M1: reworded module docstring to drop the "doubly preventing DCE" claim (which conflated DCE-safety with stream-capture-safety) and replaced it with the accurate explanation: the call runs inside the eager body of the cute_beta_coop_run splitting boundary, so Dynamo/FX never sees it and DCE simply doesn't apply. mutates_args is now framed as schema hygiene for any future trace site, not as a DCE guard. Behavior unchanged at this commit; both edits are local to the new _wo_output_reset_op.py module. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
[:nat] slice only, leaves tail intact, nat==0 is no-op. Spec §6.2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
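A functional smoke check of that contract reduces to a few lines; the shape and the plain .zero_() stand-in are illustrative (the real op goes through cudaMemsetAsync):

```python
import torch

def test_reset_zeros_prefix_only():
    buf = torch.ones(8, 4, 16, device="cuda")
    nat = 3
    buf[:nat].zero_()               # stand-in for the captured reset op
    assert buf[:nat].eq(0).all()    # [:nat] slice zeroed
    assert buf[nat:].eq(1).all()    # tail left intact; nat==0 would be a no-op
```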
Spec §6.3.1. Confirms the new captured-reset op runs eager between PIECEWISE pieces without breaking the production decode path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec §6.3.2. Confirms the new captured-reset op preserves PIECEWISE deterministic replay coherence (the v1 production path). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec §6.4 + §6.5 probe (1). c2_replay_coherence + capture-side runtime log (CUTE_WO_RESET_LOG=1) confirming the reset op fires on every attached layer. Result: FAIL — see c2_replay_coherence.md for unique count. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec §6.5 probe (3) + AGENTS.md §4. Captured a full nsys trace under the v2 lower-8 (3,7,11,15,19,23,27,31) FULL_AND_PIECEWISE config. Trace contains 166 MEMSET + 2946 KERNEL activity rows. Honest limitation noted in summary.md: nsys followed only the API server PID; the EngineCore subprocess (where β-coop and the captured wo_output cudaMemsetAsync actually run) was NOT crossed, so the trace shows 0 graph-captured (graphNodeId IS NOT NULL) events. The β-coop kernels are absent from this PID's stream; visible kernels are sampling/elementwise. Promoting this to a captured-graph-node-ordering proof needs a follow-up arc using either nsys child-tree-follow or vLLM's torch profiler API hooks (per project memory feedback_vllm_profiling). Trace + GPU activity ordering committed for forensic inspection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captured cudaMemsetAsync reset for _phase_e_coop_wo_output is wired correctly and fires once per attached layer at FULL-graph capture (8 unique stable data_ptrs in the [CUTE_WO_RESET] runtime log), but the FULL+β-coop lower-8 replay-divergence bug class is unfixed: Gate 1 c2_replay_coherence reports unique=4, cross-prompt dependent. The "stale content at stable address" hypothesis from the v1 closeout is insufficient. v2 patch ships as a no-op-for-FULL but PIECEWISE-clean infrastructure layer: - C0 PIECEWISE+β-coop GSM8K-sanity 8/8 PASS - C2 PIECEWISE+β-coop replay coherence unique=1 cross-indep PASS - Functional CUDA smoke (op zeros [:nat], leaves tail intact) PASS PIECEWISE production path is intact. v3 candidates: A) host-captured reset for mlp_partial_fp32 (replace in-kernel CTA-local reset; the v1 closeout's named escalation candidate), B) re-evaluate the workspace-residue diagnosis altogether, or C) defer to upstream vllm-project#40969 (open, same hardware/cudagraph_mode). Per AFK instruction: do NOT auto-escalate to v3. Closeout stops here for human review. Spec: docs/superpowers/specs/2026-04-30-beta-coop-persistent-buffers-v2-design.md Closeout: docs/research/2026-04-29-full-graph-spike/evidence/2026-04-30-1822-postpatch-v2/CLOSEOUT.md Memory: ~/.claude/projects/-home-natfii-docker-nvllm/memory/project_full_graph_blocked.md (updated, gitignored) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…m-project#40969 Per user code review of the prior closeout commit (8c7111a): 1. Soften "captured graph node ordered before β-coop" claim. The nsys trace did NOT independently prove per-replay graph-node ordering (EngineCore subprocess not captured). Reframed as "issued during FULL-graph capture on the current stream; expected to be captured as a graph memset node, but per-replay ordering not independently proven." Applied at the verdict paragraph and the "Why v2 failed" section. 2. Replace "no-op-for-FULL" with "not sufficient to enable FULL; PIECEWISE-clean infrastructure/refactor remains shippable." The patch DOES affect FULL behavior (it just doesn't make FULL correct); the prior phrasing was misleading. Applied at the verdict paragraph and the Status block. 3. Resolve evidence-path ambiguity. Sibling evidence dirs use relative `../2026-04-30-XXXX/` paths (unambiguous from the closeout's own location); the trace dir outside `evidence/` uses `<repo-root>/benchmarks/...` notation. 4. "reuses v2's `_wo_output_reset_op.py` shape" → "...op pattern" in the v3 escalation candidate. mlp_partial_fp32's buffer shape differs from wo_output's; only the op pattern carries over, not the byte-count math. 5. Date-stamp upstream vllm-project#40969 references with the recheck timestamp (OPEN as of 2026-04-30, last upstream activity 2026-04-28T11:06:08Z) so future readers can see how stale the claim is. Applied to all three references in the doc. No factual claims changed; only framing and traceability. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds CUTE_DISPATCH_AUDIT=1 probe at gpu_model_runner.py post-DP re-dispatch site. Logs the FINAL returned mode + descriptor for the first 100 dispatch calls, with raw vs descriptor fields side-by-side so prefill (raw_tokens > 1) can be distinguished from steady-decode (raw_tokens == raw_reqs == 1) rows. Module-level int counter only (no setattr on self, per feedback_no_self_mut_in_cudagraph_dispatch). Path B Step 1 of the v2 β-coop diagnosis re-evaluation: confirm whether the lower-8 FULL run is truly FULL replay during steady decode, or silently hybrid/PIECEWISE per step. The existing first-any / first-FULL booleans only prove FULL was reached at least once; this gives steady-state visibility. Sync infra: - c2_full_layer_bisect.sh:73 forwards CUTE_DISPATCH_AUDIT to the EngineCore subprocess (per feedback_vllm_enginecore_env_strip). - _sync_host_edits.sh adds a sentinel grep so a stale image copy doesn't silently invalidate the audit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
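The graph-safe probe pattern, sketched; the env var name comes from the commit, everything else is hypothetical:

```python
import os

_AUDIT_CALLS = 0  # module-level state: never setattr on the runner itself

def audit_dispatch(mode, descriptor, raw_tokens: int, raw_reqs: int) -> None:
    """Log the FINAL returned mode + descriptor for the first 100 dispatches."""
    global _AUDIT_CALLS
    if os.environ.get("CUTE_DISPATCH_AUDIT") != "1" or _AUDIT_CALLS >= 100:
        return
    _AUDIT_CALLS += 1
    # raw vs descriptor fields side by side: raw_tokens > 1 marks prefill,
    # raw_tokens == raw_reqs == 1 marks steady decode.
    phase = "prefill" if raw_tokens > 1 else "steady-decode"
    print(f"[CUTE_DISPATCH_AUDIT] idx={_AUDIT_CALLS} mode={mode} "
          f"desc={descriptor} raw_tokens={raw_tokens} raw_reqs={raw_reqs} "
          f"({phase})")
```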
Per-call CUTE_DISPATCH_AUDIT probe captured 100 dispatches across the c2_replay_coherence pattern under FULL_AND_PIECEWISE on lower-8 β-coop layers. Findings: - 89 / 100 rows: FULL mode with uniform_decode=True, desc_uniform=True — every steady-decode dispatch was FULL. - 2 / 100 rows: PIECEWISE at idx=2 / idx=6 during FULL graph CAPTURE (uniform_decode=False, capture phase, before first-FULL probe fired) — capture-time variant for non-uniform shapes, not steady decode. - 9 / 100 rows: NONE — 7 force_eager warmup + 3 prefill of 12-token prompt (eager prefill is normal vLLM behavior). Verdict: hybrid-dispatch hypothesis is RULED OUT for the steady decode path. Proceed to Step 2 (inspect _beta_coop_op.py for capture-time freeze risks). Sidebar: this run produced unique=1 PASS (vs Gate 1 unique=4 FAIL, same code path mod the new probe). v2 reset still fires (8 unique [CUTE_WO_RESET] data_ptrs). Either the audit probe perturbed something or the bug is genuinely stochastic — needs an audit-OFF re-run to disambiguate. Single c2 trial cannot characterize this regime; this reinforces the v2 closeout's "controlled multi-seed test we did not run" framing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each trial uses CUTE_WO_RESET_LOG=1 (matches failing Gate 1) + CUTE_DISPATCH_AUDIT=0 (audit OFF) + CUTE_FULL_GRAPH_PROBE=1 (bisect-script default). Tear down between trials. No code changes under test from Step 1 audit run (HEAD d36abf7). Per-trial evidence: trial.md with git SHA, env contract, c2 unique count, same/cross/overall pass flags, first-any/first-FULL probe presence, wo_reset unique data_ptrs. Result: MIXED (2/3 PASS, 1/3 FAIL). Trial 1 unique=1 PASS, trial 2 unique=3 FAIL, trial 3 unique=1 PASS. Summary applies the Step X verdict logic: mixed branch -> run two more trials and treat as statistical, NOT a fix declaration. The original Gate 1 unique=4 FAIL appears to belong to the same statistical distribution as v2 audit-OFF, not a deterministic bug. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
X.4+X.5 use identical protocol to X.1-X.3 (see commit 68c6ab9). CUTE_WO_RESET_LOG=1, CUTE_DISPATCH_AUDIT=0, CUTE_FULL_GRAPH_PROBE=1. Tear down between trials. No code changes under test from HEAD. Per-trial trial.md format identical to X.1-X.3. 5-trial summary at evidence/<ts>-pathb-x-summary-5trial/summary.md applies the user's statistical verdict logic: - 0/5 or 1/5 FAIL: baseline too unstable to justify a patch. - 2/5 or 3/5 FAIL: stochastic bug; v3 needs statistical acceptance. - 4/5 or 5/5 FAIL: focused Z patch target. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
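The verdict branches transcribe directly into code; a sketch with a hypothetical function name:

```python
def x_trial_verdict(fails: int, trials: int = 5) -> str:
    # Direct transcription of the 5-trial statistical verdict logic above.
    assert trials == 5
    if fails <= 1:
        return "baseline too unstable to justify a patch"
    if fails <= 3:
        return "stochastic bug; v3 needs statistical acceptance"
    return "focused Z patch target"
```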
Adds env-driven mount of /root/.cache/vllm to a host directory. When PATHB_Z1_VLLM_CACHE_HOST_DIR is unset, the script behaves identically to before (no mount, per-container scratch cache). For Path B Z1 controlled causality test: lock torch.compile AOT cache across fresh containers to test whether artifact-size / compile-path is causal for the X-trial PASS/FAIL discriminator. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per user code review of the Step 1 dispatch-audit run: bullet-form of the user-specified evidence fields (dispatch_hypothesis, steady_decode_rows, non_FULL_rows, coherence_result, interpretation) plus a probe-design lesson section noting the c2 PASS-vs-FAIL flip on the audit run reinforces "minimize hot-path state mutation." Added inline before the audit-OFF reproducibility experiment (X) was dispatched — this is the rationale the X experiment then tested. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ausal Locked torch.compile AOT cache to a host-mounted dir to test whether the X-trial PASS/FAIL discriminator (62 MB log-reported / 70 MB disk vs 73 MB log-reported / 81 MB disk artifact size, perfect 6/6 correlation) is causal vs merely correlative. Two artifacts produced by inductor at the same cache key (9a5549f23a178e35a9a3e9b4bed7adf1d137d22f3fc06ef8048d589e5d625721) but distinct sha256s — confirming inductor non-determinism for the same input graph: - GOOD: sha256 651e00bd5997bacd9a062da66e6c9a078ed3c4469c27c715d8b025041a2a8264 - BAD: sha256 af68c498c6ee45b60165d584a870f2f072068153a7c76d9592fc0097efe63c80 Trial results (each cache_reused on all 5, sha256 unchanged before vs after each trial): - Locked GOOD cache: 5/5 PASS - Locked BAD cache: 1/5 PASS, 4/5 FAIL Compared to baseline X-trials (cold cache, mixed artifact): 3/5 PASS. The 5x ratio between cache labels (100% PASS vs 20% PASS) makes the artifact identity the dominant load-bearing variable. The user's verdict-framework branch "5/5 PASS + cache_reused on all 5 → causality basically closed" is satisfied. The bad.5 PASS (with sha256 verified unchanged) suggests a smaller secondary source of non-determinism — not enough to undermine the artifact-identity finding, but worth noting as residual uncertainty. Implication for Z: production fix is "persist a known-good torch.compile AOT cache across container starts," NOT a β-coop workspace patch. Path B is closed. Files: - evidence-preserve: X.1-X.5 compile metrics (path/size/key/verdict); sha256 of X-trial artifacts NOT preserved (containers were torn down before this experiment). - 5 good-trial dirs + 5 bad-trial dirs (each has trial.md, JSON, MD, full docker logs, probe slices, wo_reset log). - summary at evidence/<ts>-pathb-z1-summary/summary.md. Bisect-script env hook (PATHB_Z1_VLLM_CACHE_HOST_DIR) was committed separately as f002ee4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Updates the v2 closeout and Z1 summary with Path B + Z1 findings. v2's earlier "stochastic FAIL inside FULL+β-coop replay/kernel behavior" framing is largely superseded: Z1 showed cache artifact/ directory identity is the dominant load-bearing variable explaining 9/10 outcomes. v3 mlp_partial_fp32 host-reset is **explicitly orphaned** (Y showed in-kernel reset is mechanically correct; Z1 showed FAIL is upstream of any workspace reset). Per code-review feedback (2026-05-01): 1. Softened causality wording in both docs from "CAUSES the verdict in 9 of 10 trials" / "fully closed root cause" to "dominant load-bearing variable explains 9/10 outcomes" / "dominant evidence for upstream torch.compile/inductor non-determinism, not a fully closed RCA." The bad.5 PASS keeps the door open for a smaller secondary non-determinism source. 2. Added a cache-directory manifest section to the Z1 summary showing only the AOT model file differed between snapshots (modelinfos/, computation_graph.py, cache_key_factors.json all byte-identical). Reframes the discriminator as "cache artifact/directory identity" since the mount pinned all of /root/.cache/vllm. 3. Production fix recommendations now require fail-closed cache handling: verify expected sha256 before launch, refuse to start on missing/empty/mismatched, mount read-only after bootstrap. Bare RW mount is NOT sufficient. 4. Probe-off validation gate added as REQUIRED before declaring a cache production-ready. All Z1 trials ran with CUTE_FULL_GRAPH_PROBE=1 and CUTE_WO_RESET_LOG=1; production blessing must validate locked-good with all CUTE_* probes off. Closeout updates: - New "Path B Update — 2026-05-01" section at end summarizing Step 1 (dispatch audit), Y/Y2 (code inspection), X (audit-OFF reproducibility), A (PASS/FAIL log diff), Z1 (cache-pin causality). - "Recommendation" section paths A and B marked superseded; A (mlp_partial reset) explicitly orphaned. - "Followup investigation candidates" item 1 (mlp_partial reset) marked orphaned; new item 4 added for upstream torch.compile/ inductor non-determinism tracking. Memory project_full_graph_blocked.md updated separately (gitignored). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bless orchestrator (Phase 2 cache workaround) needs a stable JSON output path per validation trial. Default evidence-dir behavior is unchanged when neither arg is passed. Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 0)
Phase 2 cache workaround Task 1.1. Deterministic sha256 over a canonical JSON of 18 cache-affecting inputs. Argument order is part of the contract; changing it invalidates all manifests. Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 1.1)
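The real helper is shell-side (the follow-up review mentions its jq filter body); a Python sketch of the same contract, making the 18-input arity and order-sensitivity explicit:

```python
import hashlib
import json

def compute_blessed_config_hash(*inputs: str) -> str:
    # Deterministic sha256 over a canonical JSON array of the 18
    # cache-affecting inputs. Argument ORDER is part of the contract:
    # swapping two inputs yields a different hash and invalidates manifests.
    assert len(inputs) == 18, "wrong arg count is a hard error"
    canonical = json.dumps(list(inputs), separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()
```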
Address code quality review of Task 1.1: - Add test_compute_hash_anchor_value: pins canonical JSON output for a fixed 18-input fixture to a hard-coded sha256, so a refactor of the jq filter body or a misordered args call flips the hash and fails loudly instead of silently producing a wrong-but-still-64-hex result. - Add test_compute_hash_rejects_wrong_arg_count: exercises the assert_exit_code harness primitive (previously dead code) and the helper's wrong-arg-count guard. Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 1.1 follow-up)
Phase 2 cache workaround Task 1.2. Manifest lookup is by config_hash field (not filename); duplicate hashes are exit 2 (hard corruption). Verify checks size + sha256 of every files[] entry. Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 1.2)
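A sketch of the lookup and verify semantics; the manifest field names path, size, and sha256 inside files[] are assumptions beyond what the commit states:

```python
import hashlib
import json
from pathlib import Path

def resolve_blessed_manifest(manifest_dir: Path, config_hash: str):
    # Lookup is by the config_hash FIELD, never by filename.
    manifests = [json.loads(p.read_text()) for p in manifest_dir.glob("*.json")]
    hits = [m for m in manifests if m.get("config_hash") == config_hash]
    if len(hits) > 1:
        raise SystemExit(2)  # duplicate hashes are hard corruption
    return hits[0] if hits else None

def verify_blessed_cache(cache_root: Path, manifest: dict) -> bool:
    # Every files[] entry must match in both size and sha256.
    for entry in manifest["files"]:
        p = cache_root / entry["path"]
        if not p.is_file() or p.stat().st_size != entry["size"]:
            return False
        if hashlib.sha256(p.read_bytes()).hexdigest() != entry["sha256"]:
            return False
    return True
```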
Address code quality review of Task 1.2: - I-1: nvllm_resolve_blessed_manifest now stderr-warns on a manifest that parses but lacks .config_hash (was silent skip — masked a real packaging bug class). - I-2: add coverage for resolve helper's default-dir fallback path. - M-3: add tests for 0-byte file rejection and empty files[] in nvllm_verify_blessed_cache (both paths existed but were uncovered). - M-5: use -ne (numeric) instead of != (string) for size comparison in verify helper, matching the surrounding -eq usage. Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 1.2 follow-up)
Phase 2 cache workaround Task 1.3. Refusal helpers print structured remediation messages and return 1 (callers chain || exit 1 for hard exit). HF revision resolver delegates to huggingface_hub.model_info and validates the 40-char sha format. Container-exists guard prevents clobbering operator-owned containers (CLAUDE.md docker rule). Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 1.3)
Address code quality review of Task 1.3: - Important: model_id was interpolated directly into Python source, breaking parse if the id contained quotes/backslashes/newlines. Pass via MODEL_ID env var with single-quoted Python source. - Minor: docstring listed a 'huggingface-cli fallback' that was never implemented. Strike to match the actual single-path behavior. Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 1.3 follow-up)
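The fixed invocation pattern, sketched: the model id travels via the environment instead of being interpolated into Python source. MODEL_ID is the env var named in the commit; huggingface_hub.model_info and its .sha field are the actual API the resolver delegates to.

```python
import os
import re
from huggingface_hub import model_info

# Safe even if the id contains quotes/backslashes/newlines: the id never
# appears inside the Python source text itself.
sha = model_info(os.environ["MODEL_ID"]).sha
assert sha and re.fullmatch(r"[0-9a-f]{40}", sha), f"unexpected revision: {sha}"
print(sha)
```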
Phase 2 cache workaround Task 2.1. CLI surface, BlessConfig, TrialResult. Phase 1/2/3 wiring lands in subsequent tasks. Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 2.1)
Phase 2 cache workaround Task 2.2. Phase 1 builds the RW container, polls /v1/models, sends one fixed completion to force prefill+decode + AOT artifact write, gracefully stops, resolves the 4 expected files via globs, returns aot_sha/size for Phase 2 reuse signal. Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 2.2)
Address code quality review of Task 2.2: - I-1: phase1_bootstrap and _docker_stop force-removed containers, conflicting with the detect-and-refuse pattern established by nvllm_refuse_if_container_exists (Task 1.3). Replace with docker stop -t N + docker rm (no -f). A leftover orchestrator container stops cleanly; an unexpected operator-owned container surfaces as a loud failure rather than silent destruction. - I-2: _poll_models 'attempt_max=3' kwarg was multiplied by 50 internally → up to 150 retries, drifting from spec §7.3's 'max 3 retries' contract. Drop the multiplier and rename to max_transient_retries for clarity. Also moves sleep(2) out of the except clause so non-200 polling waits between attempts. Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 2.2 follow-up)
Phase 2 cache workaround Task 2.3. classify_cache_reuse implements the 3 Z1-derived signals (AOT load marker, no 'saved AOT' lines, post-trial sha unchanged). parse_c2_json gates on same+cross+unique=1. Deviation from plan: phase2_validate's per-trial defensive cleanup uses non-force `docker stop -t 10 + docker rm` instead of the plan's `docker rm -f`. Same reasoning as ea0046d applied to Task 2.2: the launcher's nvllm_refuse_if_container_exists already refuses operator-owned containers; the orchestrator should not silently destroy unexpected state, only its own leftovers, which stop+rm handles cleanly. Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 2.3) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
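The three reuse signals compose into a single conjunction; a sketch where the log markers echo strings quoted elsewhere in this PR and the substring matching is illustrative:

```python
def classify_cache_reuse(log_text: str, sha_before: str, sha_after: str) -> bool:
    loaded = "Directly load AOT compilation" in log_text      # signal 1: AOT load marker
    no_saves = "saved AOT compiled function" not in log_text  # signal 2: nothing rewritten
    sha_stable = sha_before == sha_after                      # signal 3: post-trial sha unchanged
    return loaded and no_saves and sha_stable
```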
Phase 2 cache workaround Task 2.4. accept() builds the manifest dict, archives prior on --rebless (manifest + blessed dir, with timestamp + old artifact sha8 in name), atomic mv staging->blessed, writes JSON. reject() preserves staging as evidence with failure summary, no manifest. Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 2.4) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 cache workaround Task 2.5. main() resolves paths from env (NVLLM_BLESSED_CACHE_ROOT, HF_MODEL), runs phase 1, runs phase 2, decides accept/reject. Refuses early if blessed dir exists without --rebless. Exit codes: 0 PASS, 1 refuse-no-rebless, 2 phase-1-fail, 3 phase-2-fail. Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 2.5) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…trator Phase 2 cache workaround Task 3. Bash preflight: image, GPU memory, no running container, jq+flock available, HF revision resolved. Derives config_hash. Acquires flock per config_hash. Refuses early if manifest exists and no --rebless. Then exec .venv/bin/python orchestrator. shellcheck clean. Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 3) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…erify Phase 2 cache workaround Task 4. Defaults flipped: - CUTE_FULL_GRAPH_PROBE=0 (was hardcoded 1) - CUTE_WO_RESET_LOG=0 - CUTE_DISPATCH_AUDIT=0 - CUTE_PHASE_E_LAYERS=0,1,2,3,4,5,6,7 Added pre-docker-run verify-and-mount block that derives config_hash, resolves manifest, verifies cache, refuses on no-match / drift / unsafe-dev. --debug (eager) bypasses verification entirely. Cleanup: removed 4 dead variables (KV_CACHE, ATTN_BACKEND, MAX_MODEL_LEN, MAX_NUM_SEQS) shadowed by the new *_VAL set; shellcheck-clean. Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 4) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ts table Phase 2 cache workaround Task 5. README documents what the directory is, links Z1 evidence, lists active manifests (table auto-regenerated by orchestrator on each accept()), summarizes the bless protocol, and explains the three refusal modes. Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 5) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Empirical: first bless attempt failed in 8s with ECONNREFUSED. vLLM doesn't bind port 8000 until model load (~5 min for 27B) finishes; during that window every poll returns ECONNREFUSED. Counting those exhausts max_transient_retries=3 in seconds and aborts the bless before the model can possibly be ready. Fix: ConnectionRefusedError is the EXPECTED boot signal — sleep through it without counting. timeout_s (default 600s) is the correct wall-clock bound for boot. Other URLErrors (DNS, TLS, broken pipe) still count as true transients with the 3-retry cap. Two regression tests: - ECONNREFUSED looped 20× still succeeds when followed by 200 - DNS-style URLError still trips after 3 retries This corrects the post-review fix in ea0046d, which assumed "real boot does NOT flap repeatedly". It does — with ECONNREFUSED, the entire boot. Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 6 hardening) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
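A sketch of the corrected polling contract; names follow the commit, while urllib-level details such as unwrapping URLError.reason belong to this sketch, not necessarily to the orchestrator:

```python
import time
import urllib.error
import urllib.request

def poll_models(url: str, timeout_s: float = 600,
                max_transient_retries: int = 3) -> None:
    # timeout_s is the wall-clock bound for boot. ECONNREFUSED during that
    # window is the EXPECTED boot signal and never counts against the cap.
    deadline = time.monotonic() + timeout_s
    transients = 0
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return
        except urllib.error.URLError as e:
            if not isinstance(e.reason, ConnectionRefusedError):
                transients += 1  # DNS/TLS/broken pipe: true transients
                if transients > max_transient_retries:
                    raise
        time.sleep(2)  # sleep outside the except: non-200 also waits
    raise TimeoutError(f"server not ready within {timeout_s}s")
```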
Empirical: bless attempt #2 reached the AOT-flush step but Phase 1 failed sha256 with [Errno 13] Permission denied on the AOT model file. vllm runs as root in the container; torch writes the AOT artifact mode 0600 into the host-mounted staging dir, leaving it root-owned and unreadable by the orchestrator (host user natfii). Add _chmod_staging_for_host: docker exec chmod -R a+rX before container stop. Capital X = traverse-only on dirs (preserves directory write semantics), files become world-readable. Phase 2 :ro mounts and the host-side verify in serve-cute-full.sh both read as unprivileged users. Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 6 hardening) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes uncovered by bless attempts #2-#4: 1. Post-shutdown perms normalization (replaces the racy pre-stop chmod from cdd59d0). torch/inductor can finalize the AOT artifact during graceful shutdown, so chmod-then-stop leaves files written between chmod and exit still root-owned mode 0600. New helper: _normalize_staging_permissions_for_host runs a short-lived RW helper container against the staging mount AFTER docker stop, executing chown -R <host uid>:<host gid> + chmod -R u+rwX,go+rX. Phase 1 calls it twice: before deleting stale staging (so root-owned leftovers can be rmtree'd) and after stopping the bootstrap container. 2. Pre-create dummy_cache/ in staging. vLLM caching.py:466-467 always calls os.makedirs(<vllm_root>/dummy_cache, exist_ok=True) on the AOT-load path, even when disable_cache=True. With our :ro Phase 2 mount this raised EROFS, the AOT load aborted, and torch.compile silently recompiled — every Phase 2 trial reported cache miss. Pre-creating dummy_cache/ in staging makes the makedirs a no-op under :ro (verified: makedirs(exist_ok=True) on existing dir under read-only parent returns success). Empirical: bless attempt #4 ran the full flow (5 trials, all c2 PASS), but every trial recompiled — see trial_1_container.log line 133: "Compiling model again due to a load failure ... reason: [Errno 30] Read-only file system: '/root/.cache/vllm/dummy_cache'". Two new tests: - _normalize_staging_permissions_uses_helper_container shape check - makedirs_dummy_cache_no_ops_under_readonly_parent (regression) Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 6 hardening) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
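The regression test for fix 2 reduces to a few lines. A sketch using chmod to simulate the read-only parent (the real case is an :ro bind mount, but the existing-directory lookup wins over the permission check either way):

```python
import os
import tempfile

def test_makedirs_dummy_cache_no_ops_under_readonly_parent():
    root = tempfile.mkdtemp()
    os.makedirs(os.path.join(root, "dummy_cache"))  # pre-create in staging
    os.chmod(root, 0o555)                           # simulate the :ro mount
    try:
        # Succeeds: the target already exists, so makedirs(exist_ok=True)
        # never needs to write into the read-only parent.
        os.makedirs(os.path.join(root, "dummy_cache"), exist_ok=True)
    finally:
        os.chmod(root, 0o755)  # restore so cleanup can delete the tree
```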
…negative tests
Phase 2 cache workaround Task 6 (integration). First production manifest
committed: config_hash e6d32b41c46842c97f877339e86c79d6cc11004a238bef32f2cd3fdb73ce28db,
AOT artifact sha d97e88db. Closes the bless-then-mount-RO workaround for the
Z1 inductor non-determinism problem.
Validation evidence:
- K=5 / 5 trials PASS, c2_replay_coherence n=8 each
(same-prompt unique=1, cross-prompt independent)
- cache_reused=true on every trial (aot_load log marker present,
zero "saved AOT compiled function" lines, post-trial AOT sha unchanged)
- production serve confirms "Directly load AOT compilation from path"
  (decorators.py:305) on the canonical RO mount
- GSM8K-50 (seed=42) on the production serve: 47 / 50 = 94.0%, well
above the kernel-change "no regression vs prior phase" gate (β-coop
baseline ~30-31/50)
Negative tests (both refused, no container started):
- probe-on (CUTE_FULL_GRAPH_PROBE=1): config_hash differs -> "No matching
manifest" -> refusal
- drift (corrupted AOT byte in an isolated copy of the cache, NOT the
canonical artifact): "DRIFT DETECTED" with size mismatch diagnostic ->
refusal. Canonical cache verified intact after the test, so the
workaround does not leave production broken when its safety nets fire.
nsys trace scope (benchmarks/nvllm/traces/cute_full_blessed/2026-05-01-bless-v1):
The 1.5 MB changed.nsys-rep covers a 90 s capture window with one
in-flight 256-token completion at ~2.4 tok/s decode. The trace supports
the FULL_AND_PIECEWISE launch structure claim (166 cudaGraphLaunch_v10000
+ 7976 cudaLaunchKernel + 4548 cuLaunchKernelEx) but does NOT support
per-kernel performance claims yet — cuda_gpu_kern_sum reports no GPU
kernel data because per-kernel CUPTI attribution into FULL-graph nodes
did not surface in this run despite --cuda-graph-trace=node. Per-kernel
µs breakdown and any speedup numbers require a follow-up trace using
torch profiler or a different CUPTI configuration; do not cite speedups
from this trace.
Plan: docs/superpowers/plans/2026-05-01-cute-full-cache-production-workaround.md (Task 6)
Spec: docs/superpowers/specs/2026-05-01-cute-full-cache-production-workaround-design.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ral within noise Two-leg trace bundle for FULL+blessed (lower-8 + n=1) vs PIECEWISE on matched config (only cudagraph_mode + bless mount differ). Demonstrates that the opt-in serve-cute-full.sh path is production-capable on a trace-backed basis without the prior torch.compile/inductor non-determinism (per bless-v1 in ce26aaa). Headline (single 256-tok streaming request, unprofiled): - TTFT 593→592 ms, decode 2.343→2.352 tok/s — identical within noise Per-kernel aggregate (77 common kernels): - -228.6 ms (-0.27%) in favor of FULL+blessed - DecodeKernel -0.7%/call (-321 ms total), PhaseE_Beta_Kernel -0.4%/call - FP4 GEMM cutlass::device_kernel +1.2%/call (+107 ms — counter-shift) - Small triton/elementwise kernels 16-30% faster under FULL graph Host watchdog peak: 80-82 GiB / 119 GiB. Well below danger; the prior OOM was a max_iterations bug, not a budget problem (see Caveats §2 in summary). Verdict: not a clear win — performance-neutral. User gates respected: no all-32 bless attempted, no n>1 attempted, default remains PIECEWISE pending a clear win. Bundle: - benchmarks/nvllm/traces/cute_full_blessed/2026-05-01-vs-piecewise/ · summary.md (AGENTS.md §4 deliverable) · piecewise.nsys-rep, full.nsys-rep (system-wide, 60s window each) · {piecewise,full}_kernels.csv (per-kernel μs) · {piecewise,full}_streaming.json, _meta.json, _serve.log, _mem.log · comparison.md, comparison.json (per-kernel diff) · raw .pt.trace.json.gz files are gitignored (reproducible from harness) - docs/research/cute_full_blessed_traces/ · capture_full_vs_piecewise.sh (harness, ~1.5-2 hr wall) · render_comparison.py (CSV → markdown + json) · streaming_ttft.py (single streaming request helper) - .gitignore: carve-out for cute_full_blessed area Caveat: profiler bounded to first ~200 worker iterations per leg via max_iterations=200 (active_iterations alone is dead code without wait/warmup_iterations — verified vs vllm/profiler/wrapper.py:104-116, 205-220). Wall workload still runs the full 30 × 256 tokens. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Purpose
Land the bless cache workflow + opt-in `serve-cute-full.sh` path that makes FULL+blessed CuTe serving production-capable on lower-8 layers at concurrency=1, plus the apples-to-apples bench evidence backing it.
Background. The earlier persistent-buffers / workspace-reset direction (Path B) was closed out (5d92deff3) once Z1 evidence (b2677abd3) showed the FULL-graph instability was torch.compile/inductor non-determinism, not a workspace-reset bug. This PR ships the alternative fix: pin compile artifacts via a blessed AOT cache mounted read-only into the container, so every cold start replays the same captured graphs.
Default behavior is unchanged. PIECEWISE remains the repo default. FULL+blessed is opt-in via `serve-cute-full.sh` only. The bench result was performance-neutral within noise (-0.27% aggregate kernel time, +0.39% decode tok/s) — explicitly NOT a clear win, so no default flip.
This PR contains 42 commits sequenced as:

- (efb6fdb29) — `--json-out` + `--evidence-dir` args
- (9ead479aa..dbf40bce3) — `compute_blessed_config_hash`, `resolve_blessed_manifest`, `verify_blessed_cache`, HF revision resolver
- (fcbdef8da..5d92deff3) — work that is now superseded; the closeout doc is in this PR's history for traceability
- (88e28a002..8bf779825)
- `serve-cute-full.sh` integration (6ee3db2ab)
- (fdcf92d46..f3a4ed0e6)
- ce26aaaa0 — bless-v1 first blessed cache + production trace + GSM8K
- 42793fc04 — apples-to-apples bench vs PIECEWISE (this PR's numerical headline)
Test Plan
Test Result
Apples-to-apples bench (commit 42793fc04). Full evidence: `benchmarks/nvllm/traces/cute_full_blessed/2026-05-01-vs-piecewise/summary.md`

bless-v1 evidence (commit ce26aaaa0). GSM8K parity with PIECEWISE baseline, deterministic across cold starts; negative tests confirm fail-closed behavior on bad cache.
Caveats
- Lower-8 layers only (`CUTE_PHASE_E_LAYERS=0..7`); all-32-layer bless is future work.
- `max_num_seqs=1` only; n>1 is future work.
- `*.pt.trace.json.gz` raw torch profiler dumps are gitignored (37 MB total); reproducible via the harness.
- `benchmarks/` and `traces/` are evidence-only (write-blocked); future bench runs must stage outside, then `cp` in.

AI assistance
This work was done with Claude Code assistance (Opus 4.7, 1M context). The submitting human reviewed every changed line and ran the bench end-to-end on-device.
Not duplicating
This is a fork-internal feature for the DGX Spark target; no upstream
vLLM PR addresses opt-in FULL+blessed serving for SM120.
Essential Elements Checklist
`docs/blessed-caches/README.md`, evidence summary, harness scripts under `docs/research/`