evidence(β-coop region breakdown): 36% K-reducible, W_O is the bottleneck #6
Merged
Adds three @dsl_user_op helpers for region-timing instrumentation:
- _read_globaltimer_u64: %globaltimer mov, cross-SM synchronized ns clock
- _read_clock64_u64: %clock64 mov, per-SM cycle counter (fallback)
- _st_global_u64: st.global.b64 for u64 tick scratch writes

Single-line brace pattern matches the proven _ld_global_b16_to_f32 /
_st_global_bf16_from_f32 wrappers above. No callers yet; gated rollout
in subsequent tasks behind CUTE_BETA_REGION_TIMING=1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Validates that the Task 1 helpers compile and emit monotonic u64 ticks
without a docker rebuild. Probes globaltimer first (preferred: cross-SM
synchronized), then clock64 (fallback: per-SM cycles).

Per project memory feedback_kernel_repro_before_rebuild: the standalone
repro runs inside the existing nvllm:gb10 image via bind mount, saving
the ~60 min rebuild cycle if the PTX inline-asm has constraint bugs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
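The acceptance criterion above ("monotonic u64 ticks") can be checked on the host once the scratch buffer is read back. A minimal sketch of that check — `ticks_monotonic` is a hypothetical name, not the repro's actual helper; it assumes each active CTA wrote a nonzero (entry, exit) pair:

```python
def ticks_monotonic(pairs):
    """Hypothetical host-side check: every (entry_tick, exit_tick) u64
    pair read back from the scratch buffer must be nonzero and
    non-decreasing (exit >= entry > 0) for the probe to pass."""
    return all(0 < entry <= exit for entry, exit in pairs)
```

A zero entry tick indicates the CTA never wrote its slot, which the check treats as a failure rather than a trivially monotonic pair.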
Adds an optional region_timing_buf param to run_beta_coop_full (default
None) and includes the env-gated _region_timing_enabled flag in the
β-coop compile cache key, so the disk cache (feat/cute-full-compile-cache)
correctly distinguishes timing-on vs timing-off compile artifacts.

No instrumentation writes yet — Task 5+ adds the actual clock64 reads
behind this gate. The production path is unchanged: env unset → flag
False → key same as before → cache hits the existing artifact → no
kernel arg, no overhead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
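The "key same as before → cache hits the existing artifact" property implies the flag extends the key only when timing is on. A hedged sketch of that scheme — `compile_cache_key` and the `"region_timing"` suffix are illustrative, not the actual cache-key code:

```python
import os

def region_timing_enabled() -> bool:
    # Mirrors the CUTE_BETA_REGION_TIMING=1 gate described above.
    return os.environ.get("CUTE_BETA_REGION_TIMING", "0") == "1"

def compile_cache_key(base_key: tuple, timing_enabled: bool) -> tuple:
    # Extend the key only when timing is ON, so the timing-off key is
    # byte-identical to the pre-instrumentation key and still hits the
    # existing disk-cache artifact.
    return (base_key + ("region_timing",)) if timing_enabled else base_key
```

Appending `False` unconditionally would also disambiguate the two artifacts, but would invalidate every previously cached timing-off artifact, which the commit explicitly avoids.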
Adds CUTE_BETA_REGION_TIMING=1 env-gated allocation of a persistent
(num_ctas, 11, 2) u64 scratch tensor on CutePagedAttentionImpl,
alongside the existing _phase_e_coop_{wo_output,mlp_partial_fp32,...}
workspace buffers. Plumbs to run_beta_coop_full as region_timing_buf
kwarg.
Production behavior unchanged: env unset → buffer is None → kwarg
defaults to None → kernel sees timing-off compile path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
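The env-gated allocation described above can be sketched host-side, using numpy as a stand-in for the torch tensor — `alloc_region_timing_buf` is a hypothetical name for illustration:

```python
import numpy as np

def alloc_region_timing_buf(num_ctas: int, enabled: bool):
    """Sketch of the env-gated persistent scratch allocation:
    (num_ctas, 11 regions, 2 ticks: entry/exit), u64, zero-initialised.
    numpy stands in for the actual torch/device tensor."""
    if not enabled:
        # Production path: buffer stays None and the run_beta_coop_full
        # kwarg keeps its default, selecting the timing-off compile path.
        return None
    return np.zeros((num_ctas, 11, 2), dtype=np.uint64)
```

Zero-initialisation matters for the later reducer: a slot a CTA never wrote stays (0, 0) and can be masked out rather than misread as a zero-length region.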
De-risking step before instrumenting all 11 regions: writes Phase 0
entry/exit u64 ticks per active CTA (bx==0, by<4) into region_timing_buf
when CUTE_BETA_REGION_TIMING=1. Wrapped in
cutlass.const_expr(region_timing_enabled) so the timing-off path
generates no PTX.

Runtime overhead measurement deferred to Task 12 (single rebuild +
capture for all instrumentation). This commit is code-only; structural
pytest (test_phase_e_region_timing.py) confirms 3 PASS, AST parse OK.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a /tmp/.dump_region_timings sentinel check at the end of
CutePagedAttentionImpl.forward(); when present, dumps
_phase_e_coop_region_timing to /root/.cache/vllm/region_timings.npy,
then deletes the sentinel.

scripts/trigger_region_timing_dump.sh writes the sentinel inside the
container and pulls the .npy out via docker cp — that is the canonical
extraction path used by the run_breakdown.sh orchestrator (no separate
Python wrapper). The pattern follows feedback_vllm_enginecore_env_strip
— env vars don't reach EngineCore reliably, so we use a runtime
sentinel instead.

Runtime end-to-end verification (sentinel triggers a dump, /tmp file
produced, region 0 nonzero only on cta_id 0) deferred to Task 12 along
with all other instrumentation runtime checks. This commit is
code-only; structural pytest 3 PASS, AST parse OK.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
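The sentinel-then-dump pattern can be sketched in isolation — `maybe_dump_region_timings` is a hypothetical name, and the sentinel path is parameterised here purely so the sketch is testable:

```python
import os
import numpy as np

def maybe_dump_region_timings(buf, out_path,
                              sentinel="/tmp/.dump_region_timings"):
    """Sketch of the runtime-sentinel dump: a file on disk, not an env
    var, triggers the dump (env vars don't reliably reach EngineCore).
    Consuming the sentinel prevents every subsequent forward() from
    re-dumping. Returns True iff a dump was written."""
    if buf is None or not os.path.exists(sentinel):
        return False
    np.save(out_path, buf)   # .npy written for docker cp extraction
    os.remove(sentinel)      # consume the sentinel: one trigger, one dump
    return True
```

The `buf is None` guard keeps the production path (timing off, buffer never allocated) inert even if a stale sentinel file exists.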
Adds globaltimer entry+exit writes for:
- Region 1: Phase 1 entry → W_O start
- Region 4: grid barrier wait (all CTAs; wait-time NOT work-time)
- Region 5: Phase 3 entry → load_x sync

Combined with Phase 0 from the prior commit, this gives the four outer
phase totals. Sub-region instrumentation (Phase 1 W_O sub-splits and
Phase 3 stages) follows in the next commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Splits Phase 1 measurement into:
- Region 2: W_O GEMV body (K-reduction candidate site #1)
- Region 3: W_O end → barrier-arrive (cleanup)

Regions 1+2+3 should sum to the Phase 1 total measured indirectly via
the gap between Region 0 exit and Region 4 entry. Sanity check
documented in the host reducer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Splits Phase 3 measurement at the existing safe stage boundaries (per
friend review — splitting inside the FC1 inner loop at
phase_e_kernel.py:4249 was rejected as a perturbation):
- Region 6: partial_reset
- Region 7: stage 3a FC1 gate/up + reductions + SiLU (largest expected)
- Region 8: stage 3b quant
- Region 9: stage 3c FC2 + atomicAdd (K-reduction candidate site #3)
- Region 10: stage 3.4 arrival/gather

All 11 regions are now instrumented. Active-CTA mask verified per
cta_id = by*slice_ctas + bx packing:
- region 0 (Phase 0): cta_id 0 only (bx==0 && by==0, 1 CTA/seq)
- regions 1-3 (Phase 1): cta_ids {0, 8, 16, 24} (bx==0 && by<4, 4 CTAs/seq)
- regions 4-10: all 64 CTAs/seq

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
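The active-CTA mask above is pure index arithmetic, so it can be reproduced host-side. A sketch assuming the stated cta_id = by*slice_ctas + bx packing with slice_ctas=8 and 8 k-tiles (64 CTAs/seq); `active_cta_ids` is an illustrative name:

```python
def active_cta_ids(region, slice_ctas=8, num_k_tiles=8):
    """Which cta_ids write ticks for a given region, per the masks:
    region 0: bx==0 and by==0; regions 1-3: bx==0 and by<4;
    regions 4-10: every CTA. cta_id = by*slice_ctas + bx."""
    ids = set()
    for by in range(num_k_tiles):
        for bx in range(slice_ctas):
            cta_id = by * slice_ctas + bx
            if region == 0 and bx == 0 and by == 0:
                ids.add(cta_id)
            elif region in (1, 2, 3) and bx == 0 and by < 4:
                ids.add(cta_id)
            elif region >= 4:
                ids.add(cta_id)
    return ids
```

With these defaults the function reproduces the verified masks: {0} for Phase 0, {0, 8, 16, 24} for Phase 1, and all 64 ids for regions 4-10.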
Task 8 placed the R9 exit BEFORE the per-k-tile mlp_arrival atomic_add
and the R10 entry AFTER it, leaving the atomic_add in a measurement
gap. Region 9 is named "FC2 + atomicAdd" — the cost should be
attributed to R9, not vanish into a gap.

Reorders the timing block so:
- the atomic_add fires
- the R9 exit is recorded (atomicAdd cost lands in R9's wall-time window)
- the R10 entry is recorded (Stage 3.4 wait + gather starts)

No behavior change to the kernel math; only the timing-buffer write
ordering. Production path unchanged (constexpr-gated).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
reduce_region_timings(buf, *, slice_ctas, num_k_tiles, num_seqs,
tick_source, nsys_total_us=None)
takes the (num_ctas, 11, 2) tick scratch + grid shape and produces a
per-region DataFrame with mean/median/p99 ticks, n_active_ctas,
cta_class, and (when calibrated) median_us and frac_of_kernel.
Per friend review of the prior draft:
- Active masks derived from (slice_ctas, num_k_tiles, num_seqs)
via cta_id = by*slice_ctas + bx, NOT by slicing the first N rows.
Phase 0 = 1 CTA/seq (cta_id 0); Phase 1 = 4 CTAs/seq
(cta_ids {0, slice_ctas, 2*slice_ctas, 3*slice_ctas}); Phase 3
= all 64.
- Fields renamed *_ns → *_ticks; added tick_source column. median_us
is NaN unless tick_source=globaltimer AND nsys_total_us is given.
With clock64, cycle→μs conversion is unreliable under dynamic
clocks (per memory feedback_verify_env_var_consumer applied to
props.clockRate).
- Region 4 (grid_barrier_wait) is labelled barrier_wait and
excluded from the work-fraction denominator.
- Per-CTA median is the representative wall-time contribution
(concurrent execution within a region — sum would be wrong).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
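The two reducer decisions above — per-CTA median as the representative contribution, and barrier wait excluded from the work denominator — can be sketched with numpy. Function and argument names here are illustrative, not reduce_region_timings' actual signature:

```python
import numpy as np

def reduce_region_medians(buf, active_masks):
    """buf: (num_ctas, 11, 2) u64 ticks; active_masks: region -> boolean
    array over CTA rows (derived from the grid shape, NOT by slicing the
    first N rows). Per-CTA duration = exit - entry; the per-CTA *median*
    is the representative wall-time contribution, because active CTAs
    execute a region concurrently — summing them would be wrong."""
    medians = {}
    for region in range(buf.shape[1]):
        mask = active_masks[region]
        dur = (buf[mask, region, 1].astype(np.int64)
               - buf[mask, region, 0].astype(np.int64))
        medians[region] = float(np.median(dur)) if mask.any() else float("nan")
    return medians

def work_fractions(medians, barrier_region=4):
    """Region 4 is barrier wait-time, not work-time: exclude it from the
    work-fraction denominator."""
    work = {r: m for r, m in medians.items() if r != barrier_region}
    total = sum(work.values())
    return {r: m / total for r, m in work.items()}
```

The int64 cast before subtraction avoids u64 wraparound surprises if an exit tick ever reads below its entry tick.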
Captures memory throughput, achieved occupancy, L1/L2 hit rates, and a
compute-vs-memory-bound classification for PhaseE_Beta_Kernel. Adjunct
to the per-region timing breakdown — it answers "is the kernel
memory-bound at all?", which constrains how much K-parallel reduction
can help (the Veitner pattern is a memory-bound load-amortization
play).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two-boot capture: profile boot (timing on, torch profiler) + sanity
boot (timing off, GSM8K-50). The extractor calibrates per-CTA medians
against the nsys mean μs and prints the K-reducible fraction verdict.

Decision rule applied automatically in extract_regions.py, on the
summed fraction of regions 2+7+9 (the K-reducible work):
- ≥50%: STRONG GO
- 40-50%: PROCEED
- 25-40%: CONDITIONAL on memory-bound class
- <25%: NO-GO for K-parallel alone

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
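The decision rule is a simple threshold ladder; a sketch of how extract_regions.py's verdict step could look (the function name is illustrative, the thresholds are the ones stated above):

```python
def k_reducible_verdict(frac):
    """frac: summed kernel-time fraction of regions 2 + 7 + 9
    (the K-reducible work), as a value in [0, 1]."""
    if frac >= 0.50:
        return "STRONG GO"
    if frac >= 0.40:
        return "PROCEED"
    if frac >= 0.25:
        return "CONDITIONAL"  # conditional on the memory-bound class
    return "NO-GO"
```

Note that the 36% result reported in this PR lands in the CONDITIONAL band on a strict reading; the final PROCEED verdict rests on the additional 4-CTA W_O concentration argument, not on this rule alone.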
Two pre-flight fixes discovered while staging the Task 12 runtime:

1. scripts/serve-cute.sh did NOT pass CUTE_BETA_REGION_TIMING via -e.
   The EngineCore worker imports _backend.py inside the container, and
   the env-gated _REGION_TIMING_ENABLED constant is read at
   module-import time — without -e, the worker never sees the flag and
   the timing buffer is never allocated. Added passthrough (default 0)
   alongside the existing CUTE_PHASE_E_* env passthroughs.

2. docs/research/2026-05-02-beta-region-breakdown/run_breakdown.sh
   called gsm8k_eval_50.py with --base-url/--out/--model "$HF_MODEL",
   but the actual CLI is --api/--save/--model "default"
   (--served-model-name in serve-cute.sh is "default", not the HF id).
   Also replaced bare `python` with .venv/bin/python per AGENTS.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first Boot 1 attempt produced an empty region_timings.npy because:

1. The orchestrator only set CUTE_BETA_REGION_TIMING=1 — it never set
   CUTE_PHASE_E_FUSION=1 or CUTE_PHASE_E_LAYERS=0..7. Without those,
   _phase_e_coop_kernel is None, the env-gated allocation block in
   _backend.py never fires, _phase_e_coop_region_timing stays None, the
   sentinel-file dump produces nothing, and docker cp fails.
2. /start_profile and /stop_profile returned 404 because vLLM gates
   those endpoints on VLLM_TORCH_PROFILER_DIR being set.
3. Completion curls used model="$HF_MODEL", but serve-cute.sh sets
   --served-model-name "default" — the burst would have 404'd.

Fixes:
- The profile boot now exports CUTE_PHASE_E_FUSION=1 +
  CUTE_PHASE_E_LAYERS=0..7 (lower8 production config — β fires on
  layers 3, 7) + VLLM_TORCH_PROFILER_DIR=/root/.cache/vllm/profiler.
- The sanity boot also exports the fusion env so GSM8K tests the
  production β-coop path (not a different code path).
- Completion curls now use model="default".
- scripts/serve-cute.sh propagates VLLM_TORCH_PROFILER_DIR via -e.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug: Boot 1 silently produced no region_timings.npy. The root cause was
a shape mismatch (256 vs 64 CTAs); the β-coop kernel hit an
AssertionError but fell back to β-lite without surfacing it, so the
orchestrator only saw the downstream `docker cp` failure of a
never-written file.

Friend's 7-part fix:

1. Slice region_timing_buf at the call site to match the per-call nat
   (mirrors wo_output[:nat] etc). The persistent buffer is sized for
   max_num_seqs=4 → 256 rows; the per-call kernel sees nat=1 → 64 rows.
2. Track _phase_e_coop_region_timing_last_ctas after each launch and
   slice the sentinel-file dump to match — otherwise the dump writes
   256 rows but the extractor expects 64.
3. CUTE_PHASE_E_FALLBACK_RAISE=1 in the profile boot — fail fast on
   β-coop failure instead of silently falling back to β-lite.
4. CUTE_PHASE_E_LAYERS=0,1,2,3,4,5,6,7 (was `0..7`); the env parser is
   CSV-int only, and malformed values fall back to None = ALL layers
   (which is why the prior run fired β-coop on layers 3,7,11,15...63).
5. Use model="default" in trigger_region_timing_dump.sh and run_ncu.sh
   (was the HF id; serve-cute.sh sets --served-model-name "default").
6. The sentinel dump now gates on _phase_e_use_beta_coop AND
   last_ctas>0, not just buffer-not-None — so non-fusion layers and the
   β-lite fallback don't trigger a dump.
7. scripts/serve-cute.sh: an NVLLM_BIND_MOUNT_CUTE_PAGED=1 flag
   overlays the host vllm/v1/attention/backends/cute_paged/ onto the
   in-image /app/nvllm/... path. Pure-Python dir (no .so), so safe —
   avoids the ABI mismatch Task 2 hit when bind-mounting the whole vllm
   tree. Lets us iterate this Python-only fix without a 60-min rebuild.

7 structural pytest tests still PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
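Fix 4's failure mode — a malformed layer list silently meaning "all layers" — is easy to reproduce with a sketch of the described CSV-int-only parser. This is an illustration of the behavior, not vLLM's actual parsing code; `parse_layers` is a hypothetical name:

```python
def parse_layers(value):
    """CSV-int-only layer-list parser, per fix 4. Malformed values
    (e.g. the range-style "0..7") fall back to None, which downstream
    code treats as ALL layers — the silent fallback that made the prior
    run fire β-coop on layers 3,7,11,15,...,63."""
    if value is None:
        return None
    try:
        return [int(tok) for tok in value.split(",")]
    except ValueError:
        return None  # malformed → None → ALL layers, with no warning
```

A stricter parser would raise on malformed input instead of returning None; the commit works around the permissive behavior by supplying the exact CSV form.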
…gates

Boot 1 (with the 7-part fix) crashed at first β-coop launch with:

    DSLRuntimeError: The function const_expr(?) received a dynamic
    expression (non compile-time constant). If your expression depends
    on dynamic values: Remove const_expr()

Misuse of the CuTe DSL API: cutlass.const_expr(x) is an *assertion*
that x IS a compile-time constant — not a wrapper that makes a branch
compile-time. Compile-time branching on a Constexpr[bool] parameter
happens automatically when the kernel-side parameter is declared as
Constexpr[bool]; just write `if region_timing_enabled:` directly.

Replaces 22 instances of `if cutlass.const_expr(region_timing_enabled):`
with `if region_timing_enabled:` across all 11 instrumented regions
(0/1/2/3/4/5/6/7/8/9/10) in _kernel_phase_0_to_4. Same compile-time
branching semantics (the Constexpr[bool] param at the kernel signature
is what matters). The production path still emits no instrumentation
PTX when region_timing_enabled is False.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…neck

Per-region wall-clock breakdown of the β-coop kernel under the lower-8
production config (8 layers fused, 64 CTAs/call, %globaltimer ticks).

Result: regions 2 + 7 + 9 (the K-reducible work) sum to 36% of kernel
time — a strict reading of the priority memo's gate is CONDITIONAL —
but the cost is concentrated in the W_O GEMV at only 4 active CTAs
(region 2 alone = 34.3%), and the resulting 4-of-48-SM bottleneck
creates the ~37% barrier wait that the other 60 CTAs spin through.
Verdict: PROCEED with a W_O K-parallel prototype before FC1.

Sanity: GSM8K-50 with timing-on instrumentation = 47/50 (94%); meets
the gate exactly, and ~16 above the prior β-coop kernel-change
baseline.

Caveats called out for next iteration:
- NCU adjunct failed: the kernel-name regex needs `regex:phase_0_to_4`
  to match the mangled symbol.
- /start_profile returned 404; VLLM_TORCH_PROFILER_DIR doesn't reach
  EngineCore. Calibrated against the prior phaseE-tax β-coop mean_us
  (40,635.6 μs, n_calls=5100) instead.
- GSM8K is a timing-on correctness sanity; a timing-off rerun for
  production-path equivalence is recorded as follow-up.

The branch is 17 commits ahead of main with the full instrumentation
chain (PTX helpers + Constexpr-gated kernel param + sentinel-file dump
+ reducer + tests + orchestrator). This commit ships the evidence the
chain produced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
project_strategy_priorities.md candidate (1) — W_O is the prototype
site, not FC1 (FC1 only looks tiny because it's already 64-CTA
parallel; W_O is serial).

Test plan
- region_timing_buf Constexpr-gated param
- region_timings.npy dumped (shape (64, 11, 2), last-launch only)
- regex:phase_0_to_4 (next iteration)

Caveats called out for next iteration
- The NCU kernel-name regex needs phase_0_to_4 to match the mangled
  symbol (currently PhaseE_Beta_Kernel|cute_kernel, which doesn't).
- /start_profile returned 404 — VLLM_TORCH_PROFILER_DIR doesn't reach
  EngineCore (env-stripping). Calibrated against the prior phaseE-tax
  β-coop mean_us = 40,635.6 μs (n_calls=5100) instead.

Evidence
- benchmarks/nvllm/traces/cute_paged_attn/2026-05-02-beta-region-breakdown/summary.md
- benchmarks/nvllm/traces/cute_paged_attn/2026-05-02-beta-region-breakdown/region_breakdown.csv
- benchmarks/nvllm/traces/cute_paged_attn/2026-05-02-beta-region-breakdown/sanity_gsm8k.json

🤖 Generated with Claude Code