
evidence(β-coop region breakdown): 36% K-reducible, W_O is the bottleneck #6

Merged
Natfii merged 18 commits into main from feat/beta-coop-region-timing on May 3, 2026
Conversation


Natfii commented May 3, 2026

Summary

  • Per-region wall-clock breakdown of the β-coop kernel under lower-8 production config (8 layers fused, 64 CTAs/call, %globaltimer ticks)
  • Result: regions 2 + 7 + 9 (K-reducible) sum to 36% of kernel time — strictly the CONDITIONAL bracket — but the cost is concentrated in the W_O GEMV, which runs on only 4 active CTAs (R2 alone = 34.3%), creating the ~37% barrier wait that the other 60 CTAs spin through
  • Verdict: PROCEED with a W_O K-parallel prototype before FC1. Refines project_strategy_priorities.md candidate (1) — W_O is the prototype site, not FC1 (FC1 only looks tiny because it's already 64-CTA parallel; W_O is serial)
  • Sanity: GSM8K-50 with timing-on instrumentation = 47/50 (94%), meets gate exactly

Test plan

  • All 7 structural pytests pass (3 region timing + 4 reducer)
  • β-coop kernel compiles with region_timing_buf Constexpr-gated param
  • region_timings.npy dumped (shape (64, 11, 2), last-launch only)
  • Reduction extractor produces verdict naming W_O (data-driven via dominant K-reducible region by ticks)
  • GSM8K-50 sanity: 47/50 timing-on
  • Timing-OFF rerun for production-path equivalence (recorded as follow-up)
  • NCU rerun with regex:phase_0_to_4 (next iteration)

Caveats called out for next iteration

  • NCU adjunct failed: the kernel-name regex needs `phase_0_to_4` to match the mangled symbol (currently `PhaseE_Beta_Kernel|cute_kernel`, which doesn't match)
  • /start_profile returned 404 — VLLM_TORCH_PROFILER_DIR doesn't reach EngineCore (env-stripping). Calibrated against prior phaseE-tax β-coop mean_us=40,635.6 μs (n_calls=5100) instead

Evidence

  • benchmarks/nvllm/traces/cute_paged_attn/2026-05-02-beta-region-breakdown/summary.md
  • benchmarks/nvllm/traces/cute_paged_attn/2026-05-02-beta-region-breakdown/region_breakdown.csv
  • benchmarks/nvllm/traces/cute_paged_attn/2026-05-02-beta-region-breakdown/sanity_gsm8k.json

🤖 Generated with Claude Code

Natfii and others added 18 commits on May 2, 2026
Adds three @dsl_user_op helpers for region-timing instrumentation:
- _read_globaltimer_u64: %globaltimer mov, cross-SM synchronized ns clock
- _read_clock64_u64:     %clock64 mov, per-SM cycle counter (fallback)
- _st_global_u64:        st.global.b64 for u64 tick scratch writes

Single-line brace pattern matches the proven _ld_global_b16_to_f32 /
_st_global_bf16_from_f32 wrappers above. No callers yet; gated rollout
in subsequent tasks behind CUTE_BETA_REGION_TIMING=1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Validates the Task 1 helpers compile and emit monotonic u64 ticks
without a docker rebuild. Probes globaltimer first (preferred:
cross-SM synchronized) then clock64 (fallback: per-SM cycles).

Per project memory feedback_kernel_repro_before_rebuild: standalone
repro runs inside the existing nvllm:gb10 image via bind mount,
saves the ~60min rebuild cycle if PTX inline-asm has constraint bugs.
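The probe's selection logic can be sketched in plain Python. Here `probe_tick_source` and its reader arguments are hypothetical stand-ins for the PTX helpers (the real readers issue `mov` instructions on-GPU); `time.monotonic_ns` plays the role of %globaltimer so the preference-and-monotonicity check itself runs anywhere:

```python
import time

def probe_tick_source(read_globaltimer=None, read_clock64=None):
    """Pick the preferred tick source: globaltimer first (cross-SM
    synchronized ns), then clock64 (per-SM cycles). A source is usable
    only if repeated reads are in u64 range and non-decreasing."""
    for name, reader in (("globaltimer", read_globaltimer),
                         ("clock64", read_clock64)):
        if reader is None:
            continue
        ticks = [reader() for _ in range(8)]
        in_range = all(0 <= t < 2**64 for t in ticks)
        monotonic = all(b >= a for a, b in zip(ticks, ticks[1:]))
        if in_range and monotonic:
            return name
    return None

# CPU stand-in: the monotonic ns clock acts as %globaltimer.
source = probe_tick_source(read_globaltimer=time.monotonic_ns)
```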

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds optional region_timing_buf param to run_beta_coop_full (default
None) and includes the env-gated _region_timing_enabled flag in the
β-coop compile cache key so the disk cache (feat/cute-full-compile-cache)
correctly distinguishes timing-on vs timing-off compile artifacts.

No instrumentation writes yet — Task 5+ adds the actual clock64
reads behind this gate. Production path is unchanged: env unset →
flag False → key same as before → cache hits the existing artifact →
no kernel arg, no overhead.
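The cache-key discipline described above can be sketched as follows; `beta_coop_cache_key` and its parameters are hypothetical names for illustration, the point being that the env-gated flag is folded into the key so timing-on and timing-off artifacts never collide:

```python
import hashlib
import os

def beta_coop_cache_key(kernel_name, config, region_timing_enabled):
    """Hypothetical sketch: deterministic disk-cache key that includes
    the region-timing flag alongside the usual compile config."""
    parts = [kernel_name]
    parts += [f"{k}={v}" for k, v in sorted(config.items())]
    parts.append(f"region_timing={int(region_timing_enabled)}")
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

# Env unset -> flag False -> key identical to the pre-existing artifact.
flag = os.environ.get("CUTE_BETA_REGION_TIMING") == "1"
key = beta_coop_cache_key("phase_0_to_4", {"layers": 8, "ctas": 64}, flag)
```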

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds CUTE_BETA_REGION_TIMING=1 env-gated allocation of a persistent
(num_ctas, 11, 2) u64 scratch tensor on CutePagedAttentionImpl,
alongside the existing _phase_e_coop_{wo_output,mlp_partial_fp32,...}
workspace buffers. Plumbs to run_beta_coop_full as region_timing_buf
kwarg.

Production behavior unchanged: env unset → buffer is None → kwarg
defaults to None → kernel sees timing-off compile path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
De-risking step before instrumenting all 11 regions: writes Phase 0
entry/exit u64 ticks per active CTA (bx==0, by<4) into region_timing_buf
when CUTE_BETA_REGION_TIMING=1.

Wrapped in cutlass.const_expr(region_timing_enabled) so timing-off path
generates no PTX. Runtime overhead measurement deferred to Task 12 (single
rebuild + capture for all instrumentation). This commit is code-only;
structural pytest (test_phase_e_region_timing.py) confirms 3 PASS,
AST parse OK.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a /tmp/.dump_region_timings sentinel check at the end of
CutePagedAttentionImpl.forward(); when present, dumps
_phase_e_coop_region_timing to /root/.cache/vllm/region_timings.npy
then deletes the sentinel.

scripts/trigger_region_timing_dump.sh writes the sentinel inside the
container and pulls the .npy out via docker cp — that is the
canonical extraction path used by the run_breakdown.sh orchestrator
(no separate Python wrapper).

Pattern follows feedback_vllm_enginecore_env_strip — env vars don't
reach EngineCore reliably, so we use a runtime sentinel instead.

Runtime end-to-end verification (sentinel triggers a dump, /tmp file
produced, region 0 nonzero only on cta_id 0) deferred to Task 12 along
with all other instrumentation runtime checks. This commit is code-only;
structural pytest 3 PASS, AST parse OK.
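The sentinel mechanism can be sketched like this (function and argument names are hypothetical; the dump slices to the last launch's CTA rows and consumes the sentinel so each trigger yields exactly one dump):

```python
import os
import numpy as np

def maybe_dump_region_timings(buf, last_ctas, sentinel, out_path):
    """Hypothetical sketch of the end-of-forward() check: dump the tick
    buffer only when the sentinel file exists, then delete the sentinel."""
    if buf is None or not os.path.exists(sentinel):
        return False
    np.save(out_path, buf[:last_ctas])  # last-launch rows only
    os.remove(sentinel)                 # one dump per sentinel write
    return True
```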

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds globaltimer entry+exit writes for:
  - Region 1: Phase 1 entry → W_O start
  - Region 4: grid barrier wait (all CTAs; wait-time NOT work-time)
  - Region 5: Phase 3 entry → load_x sync

Combined with Phase 0 from prior commit, this gives the four outer
phase totals. Sub-region instrumentation (Phase 1 W_O sub-splits and
Phase 3 stages) follows in next commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Splits Phase 1 measurement into:
  - Region 2: W_O GEMV body (K-reduction candidate site #1)
  - Region 3: W_O end → barrier-arrive (cleanup)

Region 1+2+3 should sum to the Phase 1 total measured indirectly via
the gap between Region 0 exit and Region 4 entry. Sanity check
documented in the host reducer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Splits Phase 3 measurement at the existing safe stage boundaries
(per friend review — splitting inside the FC1 inner loop at
phase_e_kernel.py:4249 was rejected as a perturbation):
  - Region 6: partial_reset
  - Region 7: stage 3a FC1 gate/up + reductions + SiLU (largest expected)
  - Region 8: stage 3b quant
  - Region 9: stage 3c FC2 + atomicAdd (K-reduction candidate site #3)
  - Region 10: stage 3.4 arrival/gather

All 11 regions now instrumented. Active-CTA mask verified per
cta_id = by*slice_ctas + bx packing:
  - region 0  (Phase 0): cta_id 0 only (bx==0 && by==0, 1 CTA/seq)
  - regions 1-3 (Phase 1): cta_ids {0, 8, 16, 24} (bx==0 && by<4, 4 CTAs/seq)
  - regions 4-10: all 64 CTAs/seq
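The mask derivation can be sketched for one sequence; `active_cta_ids` is a hypothetical helper, with defaults slice_ctas=8, num_k_tiles=8 chosen to reproduce the 64-CTA grid and the {0, 8, 16, 24} Phase 1 set listed above:

```python
def active_cta_ids(region, slice_ctas=8, num_k_tiles=8):
    """Hypothetical reconstruction of the per-region active-CTA set for
    one sequence, under the packing cta_id = by * slice_ctas + bx."""
    if region == 0:                        # Phase 0: bx==0 && by==0
        return {0}
    if region in (1, 2, 3):                # Phase 1: bx==0 && by<4
        return {by * slice_ctas for by in range(4)}
    return {by * slice_ctas + bx           # regions 4-10: every CTA
            for by in range(num_k_tiles)
            for bx in range(slice_ctas)}
```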

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Task 8 placed R9 exit BEFORE the per-k-tile mlp_arrival atomic_add and
R10 entry AFTER it, leaving the atomic_add in a measurement gap. Region
9 is named "FC2 + atomicAdd" — the cost should be attributed to R9,
not vanish into a gap.

Reorders the timing block so:
  - atomic_add fires
  - R9 exit recorded (atomicAdd cost in R9 wall-time window)
  - R10 entry recorded (Stage 3.4 wait + gather starts)

No behavior change to the kernel math; only the timing-buffer write
ordering. Production path unchanged (constexpr-gated).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
reduce_region_timings(buf, *, slice_ctas, num_k_tiles, num_seqs,
                      tick_source, nsys_total_us=None)
takes the (num_ctas, 11, 2) tick scratch + grid shape and produces a
per-region DataFrame with mean/median/p99 ticks, n_active_ctas,
cta_class, and (when calibrated) median_us and frac_of_kernel.

Per friend review of the prior draft:
  - Active masks derived from (slice_ctas, num_k_tiles, num_seqs)
    via cta_id = by*slice_ctas + bx, NOT by slicing the first N rows.
    Phase 0 = 1 CTA/seq (cta_id 0); Phase 1 = 4 CTAs/seq
    (cta_ids {0, slice_ctas, 2*slice_ctas, 3*slice_ctas}); Phase 3
    = all 64.
  - Fields renamed *_ns → *_ticks; added tick_source column. median_us
    is NaN unless tick_source=globaltimer AND nsys_total_us is given.
    With clock64, cycle→μs conversion is unreliable under dynamic
    clocks (per memory feedback_verify_env_var_consumer applied to
    props.clockRate).
  - Region 4 (grid_barrier_wait) is labelled barrier_wait and
    excluded from the work-fraction denominator.
  - Per-CTA median is the representative wall-time contribution
    (concurrent execution within a region — sum would be wrong).
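The reducer core can be sketched in a few lines (function names are hypothetical simplifications of `reduce_region_timings`; the per-CTA median and the barrier-excluded denominator are the two rules from the friend review above):

```python
import numpy as np

def region_medians(buf, active_ids):
    """Per-region median (exit - entry) ticks over the active CTAs only.
    buf: (num_ctas, num_regions, 2) u64 ticks; active_ids: region -> cta_ids.
    CTAs in a region run concurrently, so the median per-CTA duration is
    the representative wall-time contribution; summing would be wrong."""
    medians = {}
    for r in range(buf.shape[1]):
        ids = sorted(active_ids[r])
        dur = (buf[ids, r, 1].astype(np.int64)
               - buf[ids, r, 0].astype(np.int64))
        medians[r] = float(np.median(dur))
    return medians

def work_fractions(medians, barrier_region=4):
    """Fraction of kernel work per region, with the barrier-wait region
    excluded from the denominator (wait-time, not work-time)."""
    total = sum(v for r, v in medians.items() if r != barrier_region)
    return {r: (v / total if r != barrier_region else float("nan"))
            for r, v in medians.items()}
```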

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures memory throughput, achieved occupancy, L1/L2 hit rates,
and compute-vs-memory-bound classification for PhaseE_Beta_Kernel.
Adjunct to the per-region timing breakdown — answers 'is the kernel
memory-bound at all?' which constrains how much K-parallel reduction
can help (Veitner pattern is memory-bound load-amortization play).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two-boot capture: profile boot (timing on, torch profiler) +
sanity boot (timing off, GSM8K-50). Extractor calibrates per-CTA
medians against nsys mean μs and prints the K-reducible fraction
verdict.

Decision rule applied automatically in extract_regions.py:
  ≥50% K-reducible (regions 2+7+9): STRONG GO
  40-50%:                            PROCEED
  25-40%:                            CONDITIONAL on memory-bound class
  <25%:                              NO-GO for K-parallel alone
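The thresholds above translate directly into a verdict function; this is a sketch of the rule as stated, not the exact extract_regions.py code:

```python
def k_reducible_verdict(frac):
    """Map the K-reducible fraction of kernel work (regions 2+7+9)
    to the memo's go/no-go bracket."""
    if frac >= 0.50:
        return "STRONG GO"
    if frac >= 0.40:
        return "PROCEED"
    if frac >= 0.25:
        return "CONDITIONAL on memory-bound class"
    return "NO-GO for K-parallel alone"
```

At this run's measured 36%, the strict bracket is CONDITIONAL.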

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two pre-flight fixes discovered while staging Task 12 runtime:

1. scripts/serve-cute.sh did NOT pass CUTE_BETA_REGION_TIMING via -e.
   The EngineCore worker imports _backend.py inside the container, and
   the env-gated _REGION_TIMING_ENABLED constant is read at module-import
   time — without -e, the worker never sees the flag and the timing
   buffer is never allocated. Added passthrough (default 0) alongside
   the existing CUTE_PHASE_E_* env passthroughs.

2. docs/research/2026-05-02-beta-region-breakdown/run_breakdown.sh
   called gsm8k_eval_50.py with --base-url/--out/--model "$HF_MODEL" but
   the actual CLI is --api/--save/--model "default" (--served-model-name
   in serve-cute.sh is "default", not the HF id). Also bare `python`
   replaced with .venv/bin/python per AGENTS.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First Boot 1 attempt produced empty region_timings.npy because:
1. Orchestrator only set CUTE_BETA_REGION_TIMING=1 — never set
   CUTE_PHASE_E_FUSION=1 or CUTE_PHASE_E_LAYERS=0..7. Without those,
   _phase_e_coop_kernel is None, the env-gated allocation block in
   _backend.py never fires, _phase_e_coop_region_timing stays None,
   sentinel-file dump produces nothing, docker cp fails.
2. /start_profile and /stop_profile returned 404 because vLLM gates
   those endpoints on VLLM_TORCH_PROFILER_DIR being set.
3. Completion curls used model="$HF_MODEL" but serve-cute.sh sets
   --served-model-name "default" — the burst would have 404'd.

Fixes:
- Profile boot now exports CUTE_PHASE_E_FUSION=1 + CUTE_PHASE_E_LAYERS=0..7
  (lower8 production config — β fires on layers 3, 7) +
  VLLM_TORCH_PROFILER_DIR=/root/.cache/vllm/profiler.
- Sanity boot also exports the fusion env so GSM8K tests the production
  β-coop path (not a different code path).
- Completion curls now use model="default".
- scripts/serve-cute.sh propagates VLLM_TORCH_PROFILER_DIR via -e.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug: Boot 1 silently produced no region_timings.npy. Root cause was a
shape mismatch (256 vs 64 CTAs); β-coop kernel hit AssertionError but
fell back to β-lite without surfacing, so the orchestrator only saw
the downstream `docker cp` failure of a never-written file.

Friend's 7-part fix:

1. Slice region_timing_buf at the call site to match per-call nat
   (mirrors wo_output[:nat] etc). Persistent buffer is sized for
   max_num_seqs=4 → 256 rows; per-call kernel sees nat=1 → 64 rows.
2. Track _phase_e_coop_region_timing_last_ctas after each launch and
   slice the sentinel-file dump to match — otherwise dump writes 256
   rows but extractor expects 64.
3. CUTE_PHASE_E_FALLBACK_RAISE=1 in profile boot — fail-fast on β-coop
   failure instead of silent β-lite fallback.
4. CUTE_PHASE_E_LAYERS=0,1,2,3,4,5,6,7 (was `0..7`); the env parser is
   CSV-int only, malformed values fall back to None = ALL layers
   (which is why prior run fired β-coop on layers 3,7,11,15...63).
5. Use model="default" in trigger_region_timing_dump.sh and run_ncu.sh
   (was the HF id; serve-cute.sh sets --served-model-name "default").
6. Sentinel dump now gates on _phase_e_use_beta_coop AND last_ctas>0,
   not just buffer-not-None — so non-fusion layers and β-lite
   fallback don't trigger a dump.
7. scripts/serve-cute.sh: NVLLM_BIND_MOUNT_CUTE_PAGED=1 flag overlays
   host vllm/v1/attention/backends/cute_paged/ onto the in-image
   /app/nvllm/... path. Pure-Python dir (no .so), so safe — avoids
   the ABI mismatch Task 2 hit when bind-mounting the whole vllm tree.
   Lets us iterate this Python-only fix without a 60-min rebuild.

7 structural pytest tests still PASS.
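Fix (4)'s parser behavior can be sketched as follows; `parse_phase_e_layers` is a hypothetical name, but the semantics match the description: strict CSV-int, with any malformed value (such as `0..7`) falling back to None, meaning ALL layers:

```python
def parse_phase_e_layers(value):
    """Strict CSV-int parse of CUTE_PHASE_E_LAYERS; malformed or empty
    input falls back to None (= all layers), which is the silent-footgun
    behavior the orchestrator fix guards against."""
    if not value:
        return None
    try:
        return [int(tok) for tok in value.split(",")]
    except ValueError:
        return None
```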

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gates

Boot 1 (with the 7-part fix) crashed at first β-coop launch with:

  DSLRuntimeError: The function const_expr(?) received a dynamic
  expression (non compile-time constant).
  If your expression depends on dynamic values: Remove const_expr()

Misuse of CuTe DSL API: cutlass.const_expr(x) is an *assertion* that x
IS a compile-time constant — not a wrapper to make a branch
compile-time. Compile-time branching on a Constexpr[bool] parameter
happens automatically when the kernel-side parameter is declared as
Constexpr[bool]; just write `if region_timing_enabled:` directly.

Replaces 22 instances of `if cutlass.const_expr(region_timing_enabled):`
with `if region_timing_enabled:` across all 11 instrumented regions
(0/1/2/3/4/5/6/7/8/9/10) in _kernel_phase_0_to_4. Same compile-time
branching semantics (Constexpr[bool] param at the kernel signature is
what matters). Production path still emits no instrumentation PTX
when region_timing_enabled is False.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…neck

Per-region wall-clock breakdown of the β-coop kernel under lower-8
production config (8 layers fused, 64 CTAs/call, %globaltimer ticks).

Result: regions 2 + 7 + 9 (the K-reducible work) sum to 36% of kernel
time — strict reading of the priority memo's gate is CONDITIONAL — but
the cost is concentrated in W_O GEMV at only 4 active CTAs (region 2
alone = 34.3%), and the resulting 4-of-48 SM bottleneck creates the
~37% barrier wait that the other 60 CTAs spin through. Verdict:
PROCEED with a W_O K-parallel prototype before FC1.

Sanity: GSM8K-50 with timing-on instrumentation = 47/50 (94%); meets
gate exactly, and ~16 above the prior β-coop kernel-change baseline.

Caveats called out for next iteration:
- NCU adjunct failed: kernel-name regex needs `regex:phase_0_to_4`
  to match the mangled symbol.
- /start_profile returned 404; VLLM_TORCH_PROFILER_DIR doesn't reach
  EngineCore. Calibrated against prior phaseE-tax β-coop mean_us
  (40,635.6 μs, n_calls=5100) instead.
- GSM8K is timing-on correctness sanity; a timing-off rerun for
  production-path equivalence is recorded as follow-up.

Branch is 17 commits ahead of main with the full instrumentation
chain (PTX helpers + Constexpr-gated kernel param + sentinel-file
dump + reducer + tests + orchestrator). This commit ships the
evidence the chain produced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Natfii Natfii merged commit 46ad9bb into main May 3, 2026
