
evidence(β-coop region breakdown): 36% K-reducible, W_O is the bottleneck #6

Merged
Natfii merged 18 commits into main from feat/beta-coop-region-timing on May 3, 2026
Conversation


Natfii commented May 3, 2026

Summary

  • Per-region wall-clock breakdown of the β-coop kernel under lower-8 production config (8 layers fused, 64 CTAs/call, %globaltimer ticks)
  • Result: regions 2 + 7 + 9 (K-reducible) sum to 36% of kernel time — strictly the CONDITIONAL bracket — but the cost is concentrated in the W_O GEMV, which runs on only 4 active CTAs (R2 alone = 34.3%), creating the ~37% barrier wait that the other 60 CTAs spin through
  • Verdict: PROCEED with a W_O K-parallel prototype before FC1. Refines project_strategy_priorities.md candidate (1) — W_O is the prototype site, not FC1 (FC1 only looks tiny because it's already 64-CTA parallel; W_O is serial)
  • Sanity: GSM8K-50 with timing-on instrumentation = 47/50 (94%), meets gate exactly

Test plan

  • All 7 structural pytests pass (3 region timing + 4 reducer)
  • β-coop kernel compiles with region_timing_buf Constexpr-gated param
  • region_timings.npy dumped (shape (64, 11, 2), last-launch only)
  • Reduction extractor produces verdict naming W_O (data-driven via dominant K-reducible region by ticks)
  • GSM8K-50 sanity: 47/50 timing-on
  • Timing-OFF rerun for production-path equivalence (recorded as follow-up)
  • NCU rerun with regex:phase_0_to_4 (next iteration)

Caveats called out for next iteration

  • NCU adjunct failed: the kernel-name regex needs `phase_0_to_4` to match the mangled symbol (currently `PhaseE_Beta_Kernel|cute_kernel`, which doesn't match)
  • /start_profile returned 404 — VLLM_TORCH_PROFILER_DIR doesn't reach EngineCore (env-stripping). Calibrated against prior phaseE-tax β-coop mean_us=40,635.6 μs (n_calls=5100) instead

Evidence

  • benchmarks/nvllm/traces/cute_paged_attn/2026-05-02-beta-region-breakdown/summary.md
  • benchmarks/nvllm/traces/cute_paged_attn/2026-05-02-beta-region-breakdown/region_breakdown.csv
  • benchmarks/nvllm/traces/cute_paged_attn/2026-05-02-beta-region-breakdown/sanity_gsm8k.json

🤖 Generated with Claude Code

Natfii and others added 18 commits on May 2, 2026
Adds three @dsl_user_op helpers for region-timing instrumentation:
- _read_globaltimer_u64: %globaltimer mov, cross-SM synchronized ns clock
- _read_clock64_u64:     %clock64 mov, per-SM cycle counter (fallback)
- _st_global_u64:        st.global.b64 for u64 tick scratch writes

Single-line brace pattern matches the proven _ld_global_b16_to_f32 /
_st_global_bf16_from_f32 wrappers above. No callers yet; gated rollout
in subsequent tasks behind CUTE_BETA_REGION_TIMING=1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Validates the Task 1 helpers compile and emit monotonic u64 ticks
without a docker rebuild. Probes globaltimer first (preferred:
cross-SM synchronized) then clock64 (fallback: per-SM cycles).

Per project memory feedback_kernel_repro_before_rebuild: standalone
repro runs inside the existing nvllm:gb10 image via bind mount,
saves the ~60min rebuild cycle if PTX inline-asm has constraint bugs.
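The probe's selection logic can be sketched in plain Python. Here `probe_tick_source` and its reader arguments are hypothetical stand-ins for the PTX helpers (the real readers issue `mov` instructions on-GPU); `time.monotonic_ns` plays the role of %globaltimer so the preference-and-monotonicity check itself runs anywhere:

```python
import time

def probe_tick_source(read_globaltimer=None, read_clock64=None):
    """Pick the preferred tick source: globaltimer first (cross-SM
    synchronized ns), then clock64 (per-SM cycles). A source is usable
    only if repeated reads are in u64 range and non-decreasing."""
    for name, reader in (("globaltimer", read_globaltimer),
                         ("clock64", read_clock64)):
        if reader is None:
            continue
        ticks = [reader() for _ in range(8)]
        in_range = all(0 <= t < 2**64 for t in ticks)
        monotonic = all(b >= a for a, b in zip(ticks, ticks[1:]))
        if in_range and monotonic:
            return name
    return None

# CPU stand-in: the monotonic ns clock acts as %globaltimer.
source = probe_tick_source(read_globaltimer=time.monotonic_ns)
```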

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds optional region_timing_buf param to run_beta_coop_full (default
None) and includes the env-gated _region_timing_enabled flag in the
β-coop compile cache key so the disk cache (feat/cute-full-compile-cache)
correctly distinguishes timing-on vs timing-off compile artifacts.

No instrumentation writes yet — Task 5+ adds the actual clock64
reads behind this gate. Production path is unchanged: env unset →
flag False → key same as before → cache hits the existing artifact →
no kernel arg, no overhead.
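The cache-key discipline described above can be sketched as follows; `beta_coop_cache_key` and its parameters are hypothetical names for illustration, the point being that the env-gated flag is folded into the key so timing-on and timing-off artifacts never collide:

```python
import hashlib
import os

def beta_coop_cache_key(kernel_name, config, region_timing_enabled):
    """Hypothetical sketch: deterministic disk-cache key that includes
    the region-timing flag alongside the usual compile config."""
    parts = [kernel_name]
    parts += [f"{k}={v}" for k, v in sorted(config.items())]
    parts.append(f"region_timing={int(region_timing_enabled)}")
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

# Env unset -> flag False -> key identical to the pre-existing artifact.
flag = os.environ.get("CUTE_BETA_REGION_TIMING") == "1"
key = beta_coop_cache_key("phase_0_to_4", {"layers": 8, "ctas": 64}, flag)
```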

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds CUTE_BETA_REGION_TIMING=1 env-gated allocation of a persistent
(num_ctas, 11, 2) u64 scratch tensor on CutePagedAttentionImpl,
alongside the existing _phase_e_coop_{wo_output,mlp_partial_fp32,...}
workspace buffers. Plumbs to run_beta_coop_full as region_timing_buf
kwarg.

Production behavior unchanged: env unset → buffer is None → kwarg
defaults to None → kernel sees timing-off compile path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
De-risking step before instrumenting all 11 regions: writes Phase 0
entry/exit u64 ticks per active CTA (bx==0, by<4) into region_timing_buf
when CUTE_BETA_REGION_TIMING=1.

Wrapped in cutlass.const_expr(region_timing_enabled) so timing-off path
generates no PTX. Runtime overhead measurement deferred to Task 12 (single
rebuild + capture for all instrumentation). This commit is code-only;
structural pytest (test_phase_e_region_timing.py) confirms 3 PASS,
AST parse OK.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a /tmp/.dump_region_timings sentinel check at the end of
CutePagedAttentionImpl.forward(); when present, dumps
_phase_e_coop_region_timing to /root/.cache/vllm/region_timings.npy
then deletes the sentinel.

scripts/trigger_region_timing_dump.sh writes the sentinel inside the
container and pulls the .npy out via docker cp — that is the
canonical extraction path used by the run_breakdown.sh orchestrator
(no separate Python wrapper).

Pattern follows feedback_vllm_enginecore_env_strip — env vars don't
reach EngineCore reliably, so we use a runtime sentinel instead.

Runtime end-to-end verification (sentinel triggers a dump, /tmp file
produced, region 0 nonzero only on cta_id 0) deferred to Task 12 along
with all other instrumentation runtime checks. This commit is code-only;
structural pytest 3 PASS, AST parse OK.
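The sentinel mechanism can be sketched like this (function and argument names are hypothetical; the dump slices to the last launch's CTA rows and consumes the sentinel so each trigger yields exactly one dump):

```python
import os
import numpy as np

def maybe_dump_region_timings(buf, last_ctas, sentinel, out_path):
    """Hypothetical sketch of the end-of-forward() check: dump the tick
    buffer only when the sentinel file exists, then delete the sentinel."""
    if buf is None or not os.path.exists(sentinel):
        return False
    np.save(out_path, buf[:last_ctas])  # last-launch rows only
    os.remove(sentinel)                 # one dump per sentinel write
    return True
```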

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds globaltimer entry+exit writes for:
  - Region 1: Phase 1 entry → W_O start
  - Region 4: grid barrier wait (all CTAs; wait-time NOT work-time)
  - Region 5: Phase 3 entry → load_x sync

Combined with Phase 0 from prior commit, this gives the four outer
phase totals. Sub-region instrumentation (Phase 1 W_O sub-splits and
Phase 3 stages) follows in next commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Splits Phase 1 measurement into:
  - Region 2: W_O GEMV body (K-reduction candidate site #1)
  - Region 3: W_O end → barrier-arrive (cleanup)

Region 1+2+3 should sum to the Phase 1 total measured indirectly via
the gap between Region 0 exit and Region 4 entry. Sanity check
documented in the host reducer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Splits Phase 3 measurement at the existing safe stage boundaries
(per friend review — splitting inside the FC1 inner loop at
phase_e_kernel.py:4249 was rejected as a perturbation):
  - Region 6: partial_reset
  - Region 7: stage 3a FC1 gate/up + reductions + SiLU (largest expected)
  - Region 8: stage 3b quant
  - Region 9: stage 3c FC2 + atomicAdd (K-reduction candidate site #3)
  - Region 10: stage 3.4 arrival/gather

All 11 regions now instrumented. Active-CTA mask verified per
cta_id = by*slice_ctas + bx packing:
  - region 0  (Phase 0): cta_id 0 only (bx==0 && by==0, 1 CTA/seq)
  - regions 1-3 (Phase 1): cta_ids {0, 8, 16, 24} (bx==0 && by<4, 4 CTAs/seq)
  - regions 4-10: all 64 CTAs/seq
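The mask derivation can be sketched for one sequence; `active_cta_ids` is a hypothetical helper, with defaults slice_ctas=8, num_k_tiles=8 chosen to reproduce the 64-CTA grid and the {0, 8, 16, 24} Phase 1 set listed above:

```python
def active_cta_ids(region, slice_ctas=8, num_k_tiles=8):
    """Hypothetical reconstruction of the per-region active-CTA set for
    one sequence, under the packing cta_id = by * slice_ctas + bx."""
    if region == 0:                        # Phase 0: bx==0 && by==0
        return {0}
    if region in (1, 2, 3):                # Phase 1: bx==0 && by<4
        return {by * slice_ctas for by in range(4)}
    return {by * slice_ctas + bx           # regions 4-10: every CTA
            for by in range(num_k_tiles)
            for bx in range(slice_ctas)}
```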

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Task 8 placed R9 exit BEFORE the per-k-tile mlp_arrival atomic_add and
R10 entry AFTER it, leaving the atomic_add in a measurement gap. Region
9 is named "FC2 + atomicAdd" — the cost should be attributed to R9,
not vanish into a gap.

Reorders the timing block so:
  - atomic_add fires
  - R9 exit recorded (atomicAdd cost in R9 wall-time window)
  - R10 entry recorded (Stage 3.4 wait + gather starts)

No behavior change to the kernel math; only the timing-buffer write
ordering. Production path unchanged (constexpr-gated).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
reduce_region_timings(buf, *, slice_ctas, num_k_tiles, num_seqs,
                      tick_source, nsys_total_us=None)
takes the (num_ctas, 11, 2) tick scratch + grid shape and produces a
per-region DataFrame with mean/median/p99 ticks, n_active_ctas,
cta_class, and (when calibrated) median_us and frac_of_kernel.

Per friend review of the prior draft:
  - Active masks derived from (slice_ctas, num_k_tiles, num_seqs)
    via cta_id = by*slice_ctas + bx, NOT by slicing the first N rows.
    Phase 0 = 1 CTA/seq (cta_id 0); Phase 1 = 4 CTAs/seq
    (cta_ids {0, slice_ctas, 2*slice_ctas, 3*slice_ctas}); Phase 3
    = all 64.
  - Fields renamed *_ns → *_ticks; added tick_source column. median_us
    is NaN unless tick_source=globaltimer AND nsys_total_us is given.
    With clock64, cycle→μs conversion is unreliable under dynamic
    clocks (per memory feedback_verify_env_var_consumer applied to
    props.clockRate).
  - Region 4 (grid_barrier_wait) is labelled barrier_wait and
    excluded from the work-fraction denominator.
  - Per-CTA median is the representative wall-time contribution
    (concurrent execution within a region — sum would be wrong).
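The reducer core can be sketched in a few lines (function names are hypothetical simplifications of `reduce_region_timings`; the per-CTA median and the barrier-excluded denominator are the two rules from the friend review above):

```python
import numpy as np

def region_medians(buf, active_ids):
    """Per-region median (exit - entry) ticks over the active CTAs only.
    buf: (num_ctas, num_regions, 2) u64 ticks; active_ids: region -> cta_ids.
    CTAs in a region run concurrently, so the median per-CTA duration is
    the representative wall-time contribution; summing would be wrong."""
    medians = {}
    for r in range(buf.shape[1]):
        ids = sorted(active_ids[r])
        dur = (buf[ids, r, 1].astype(np.int64)
               - buf[ids, r, 0].astype(np.int64))
        medians[r] = float(np.median(dur))
    return medians

def work_fractions(medians, barrier_region=4):
    """Fraction of kernel work per region, with the barrier-wait region
    excluded from the denominator (wait-time, not work-time)."""
    total = sum(v for r, v in medians.items() if r != barrier_region)
    return {r: (v / total if r != barrier_region else float("nan"))
            for r, v in medians.items()}
```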

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures memory throughput, achieved occupancy, L1/L2 hit rates,
and compute-vs-memory-bound classification for PhaseE_Beta_Kernel.
Adjunct to the per-region timing breakdown — answers 'is the kernel
memory-bound at all?' which constrains how much K-parallel reduction
can help (Veitner pattern is memory-bound load-amortization play).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two-boot capture: profile boot (timing on, torch profiler) +
sanity boot (timing off, GSM8K-50). Extractor calibrates per-CTA
medians against nsys mean μs and prints the K-reducible fraction
verdict.

Decision rule applied automatically in extract_regions.py:
  ≥50% K-reducible (regions 2+7+9): STRONG GO
  40-50%:                            PROCEED
  25-40%:                            CONDITIONAL on memory-bound class
  <25%:                              NO-GO for K-parallel alone
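The thresholds above translate directly into a verdict function; this is a sketch of the rule as stated, not the exact extract_regions.py code:

```python
def k_reducible_verdict(frac):
    """Map the K-reducible fraction of kernel work (regions 2+7+9)
    to the memo's go/no-go bracket."""
    if frac >= 0.50:
        return "STRONG GO"
    if frac >= 0.40:
        return "PROCEED"
    if frac >= 0.25:
        return "CONDITIONAL on memory-bound class"
    return "NO-GO for K-parallel alone"
```

At this run's measured 36%, the strict bracket is CONDITIONAL.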

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two pre-flight fixes discovered while staging Task 12 runtime:

1. scripts/serve-cute.sh did NOT pass CUTE_BETA_REGION_TIMING via -e.
   The EngineCore worker imports _backend.py inside the container, and
   the env-gated _REGION_TIMING_ENABLED constant is read at module-import
   time — without -e, the worker never sees the flag and the timing
   buffer is never allocated. Added passthrough (default 0) alongside
   the existing CUTE_PHASE_E_* env passthroughs.

2. docs/research/2026-05-02-beta-region-breakdown/run_breakdown.sh
   called gsm8k_eval_50.py with --base-url/--out/--model "$HF_MODEL" but
   the actual CLI is --api/--save/--model "default" (--served-model-name
   in serve-cute.sh is "default", not the HF id). Also bare `python`
   replaced with .venv/bin/python per AGENTS.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First Boot 1 attempt produced empty region_timings.npy because:
1. Orchestrator only set CUTE_BETA_REGION_TIMING=1 — never set
   CUTE_PHASE_E_FUSION=1 or CUTE_PHASE_E_LAYERS=0..7. Without those,
   _phase_e_coop_kernel is None, the env-gated allocation block in
   _backend.py never fires, _phase_e_coop_region_timing stays None,
   sentinel-file dump produces nothing, docker cp fails.
2. /start_profile and /stop_profile returned 404 because vLLM gates
   those endpoints on VLLM_TORCH_PROFILER_DIR being set.
3. Completion curls used model="$HF_MODEL" but serve-cute.sh sets
   --served-model-name "default" — the burst would have 404'd.

Fixes:
- Profile boot now exports CUTE_PHASE_E_FUSION=1 + CUTE_PHASE_E_LAYERS=0..7
  (lower8 production config — β fires on layers 3, 7) +
  VLLM_TORCH_PROFILER_DIR=/root/.cache/vllm/profiler.
- Sanity boot also exports the fusion env so GSM8K tests the production
  β-coop path (not a different code path).
- Completion curls now use model="default".
- scripts/serve-cute.sh propagates VLLM_TORCH_PROFILER_DIR via -e.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug: Boot 1 silently produced no region_timings.npy. Root cause was a
shape mismatch (256 vs 64 CTAs); β-coop kernel hit AssertionError but
fell back to β-lite without surfacing, so the orchestrator only saw
the downstream `docker cp` failure of a never-written file.

Friend's 7-part fix:

1. Slice region_timing_buf at the call site to match per-call nat
   (mirrors wo_output[:nat] etc). Persistent buffer is sized for
   max_num_seqs=4 → 256 rows; per-call kernel sees nat=1 → 64 rows.
2. Track _phase_e_coop_region_timing_last_ctas after each launch and
   slice the sentinel-file dump to match — otherwise dump writes 256
   rows but extractor expects 64.
3. CUTE_PHASE_E_FALLBACK_RAISE=1 in profile boot — fail-fast on β-coop
   failure instead of silent β-lite fallback.
4. CUTE_PHASE_E_LAYERS=0,1,2,3,4,5,6,7 (was `0..7`); the env parser is
   CSV-int only, malformed values fall back to None = ALL layers
   (which is why prior run fired β-coop on layers 3,7,11,15...63).
5. Use model="default" in trigger_region_timing_dump.sh and run_ncu.sh
   (was the HF id; serve-cute.sh sets --served-model-name "default").
6. Sentinel dump now gates on _phase_e_use_beta_coop AND last_ctas>0,
   not just buffer-not-None — so non-fusion layers and β-lite
   fallback don't trigger a dump.
7. scripts/serve-cute.sh: NVLLM_BIND_MOUNT_CUTE_PAGED=1 flag overlays
   host vllm/v1/attention/backends/cute_paged/ onto the in-image
   /app/nvllm/... path. Pure-Python dir (no .so), so safe — avoids
   the ABI mismatch Task 2 hit when bind-mounting the whole vllm tree.
   Lets us iterate this Python-only fix without a 60-min rebuild.

7 structural pytest tests still PASS.
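Fix (4)'s parser behavior can be sketched as follows; `parse_phase_e_layers` is a hypothetical name, but the semantics match the description: strict CSV-int, with any malformed value (such as `0..7`) falling back to None, meaning ALL layers:

```python
def parse_phase_e_layers(value):
    """Strict CSV-int parse of CUTE_PHASE_E_LAYERS; malformed or empty
    input falls back to None (= all layers), which is the silent-footgun
    behavior the orchestrator fix guards against."""
    if not value:
        return None
    try:
        return [int(tok) for tok in value.split(",")]
    except ValueError:
        return None
```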

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gates

Boot 1 (with the 7-part fix) crashed at first β-coop launch with:

  DSLRuntimeError: The function const_expr(?) received a dynamic
  expression (non compile-time constant).
  If your expression depends on dynamic values: Remove const_expr()

Misuse of CuTe DSL API: cutlass.const_expr(x) is an *assertion* that x
IS a compile-time constant — not a wrapper to make a branch
compile-time. Compile-time branching on a Constexpr[bool] parameter
happens automatically when the kernel-side parameter is declared as
Constexpr[bool]; just write `if region_timing_enabled:` directly.

Replaces 22 instances of `if cutlass.const_expr(region_timing_enabled):`
with `if region_timing_enabled:` across all 11 instrumented regions
(0/1/2/3/4/5/6/7/8/9/10) in _kernel_phase_0_to_4. Same compile-time
branching semantics (Constexpr[bool] param at the kernel signature is
what matters). Production path still emits no instrumentation PTX
when region_timing_enabled is False.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…neck

Per-region wall-clock breakdown of the β-coop kernel under lower-8
production config (8 layers fused, 64 CTAs/call, %globaltimer ticks).

Result: regions 2 + 7 + 9 (the K-reducible work) sum to 36% of kernel
time — strict reading of the priority memo's gate is CONDITIONAL — but
the cost is concentrated in W_O GEMV at only 4 active CTAs (region 2
alone = 34.3%), and the resulting 4-of-48 SM bottleneck creates the
~37% barrier wait that the other 60 CTAs spin through. Verdict:
PROCEED with a W_O K-parallel prototype before FC1.

Sanity: GSM8K-50 with timing-on instrumentation = 47/50 (94%); meets
gate exactly, and ~16 above the prior β-coop kernel-change baseline.

Caveats called out for next iteration:
- NCU adjunct failed: kernel-name regex needs `regex:phase_0_to_4`
  to match the mangled symbol.
- /start_profile returned 404; VLLM_TORCH_PROFILER_DIR doesn't reach
  EngineCore. Calibrated against prior phaseE-tax β-coop mean_us
  (40,635.6 μs, n_calls=5100) instead.
- GSM8K is timing-on correctness sanity; a timing-off rerun for
  production-path equivalence is recorded as follow-up.

Branch is 17 commits ahead of main with the full instrumentation
chain (PTX helpers + Constexpr-gated kernel param + sentinel-file
dump + reducer + tests + orchestrator). This commit ships the
evidence the chain produced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Natfii Natfii merged commit 46ad9bb into main May 3, 2026
