diff --git a/README.md b/README.md
index 9c98818c3429..12b90ba4501f 100644
--- a/README.md
+++ b/README.md
@@ -48,7 +48,7 @@ Also on Docker Hub: `docker.io/naviailab/nvllm:latest`
 - `~/.cache/flashinfer` — FlashInfer JIT kernels
 - `~/.cache/vllm_compile` → `/root/.cache/vllm/torch_compile_cache` — CUDA graph cache
 
-**For gated models** (e.g., Gemma 4): pass `-e HF_TOKEN=hf_...` or mount a token file.
+**For gated models** (e.g., Gemma 4): set the `HF_TOKEN` environment variable or mount a credentials file.
 
 ### Prerequisites
 - NVIDIA DGX Spark (GB10) or GH200
@@ -88,8 +88,8 @@ curl http://localhost:8000/v1/chat/completions \
 
 | Script | Model | Status | Context |
 |--------|-------|--------|---------|
-| `serve.sh` | [Qwen3.5-27B-NVFP4-Opus-GB10](https://huggingface.co/natfii/Qwen3.5-27B-NVFP4-Opus-GB10) | Active (default) | 64K |
-| `serve-cute.sh` | [Qwen3.5-27B with CuTe Paged Attention](https://huggingface.co/ig1/Qwen3.5-35B-A3B-NVFP4) | Active (kernel dev) | 64K |
+| `serve.sh` | [Qwen3.5-27B-NVFP4](https://huggingface.co/ig1/Qwen3.5-27B-NVFP4) | Active (default) | 64K |
+| `serve-cute.sh` | [Qwen3.5-27B-NVFP4](https://huggingface.co/ig1/Qwen3.5-27B-NVFP4) (CuTe Paged Attention; override the model via the `HF_MODEL` env var) | Active (kernel dev) | 64K |
 | `serve-nemotron.sh` | Nemotron-3-Super-120B-A12B-NVFP4 | Not Ready | 128K |
 | `serve-gemma4.sh` | Gemma 4 31B IT NVFP4 | Degraded (see script) | 32K |
 
@@ -104,6 +104,7 @@ curl http://localhost:8000/v1/chat/completions \
 
 **Now — Qwen3.5-27B kernel work**
 - CuTe DSL paged attention uber-kernel (fused attention + W_O GEMV + RMSNorm)
+- `CUTE_WO_SPLIT=8` K-parallel W_O GEMV prototype (off by default; opt in via env on `serve-cute.sh`). The region cluster (R2+R4+R11+R12) shrinks 6.49× under synthetic load, with a ~5% per-token speedup on the GSM8K workload. See [evidence summary](benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/summary.md).
 - CUDA graph support (FULL_AND_PIECEWISE mode)
 - End-to-end fusion validation through Qwen3NextAttention
 
@@ -128,15 +129,17 @@ Benchmarked on Qwen3.5-27B-NVFP4 (rate=8, max-num-seqs=4):
 | TPOT p50 | 89.2 ms | 80.0 ms | **-10.2%** |
 | TPOT p99 | 91.7 ms | 82.7 ms | **-9.8%** |
 
-> **Warning:** Large models (>75 GB) that leave minimal memory headroom on the GB10's 128 GB unified memory may crash during CUDA graph capture with the stream-K kernel. Use `--debug` (eager mode) to test first, or use a smaller model.
+[Trace](benchmarks/nvllm/traces/gemm_stream_k_cudagraph/2026-04-21/) — committed `streamk_graphs.pt.trace.json.gz` + per-kernel CSVs.
+
+> **Warning:** Large models (>75 GB) that leave minimal memory headroom on the GB10's 128 GB unified memory may crash during CUDA graph capture with the stream-K kernel. Test with a smaller model first.
 
 ### CuTe Paged Attention Backend (Prototype)
 
 Custom paged attention backend using CuTe Python DSL, targeting SM120/SM121 FP8 MMA instructions. Registered as `CUTE_PAGED` in vLLM's attention backend registry.
 
-**Status:** Backend interface validated end-to-end. PyTorch prototype serves live inference. CuTe DSL kernel replacement in progress.
+**Status:** Experimental CuTe DSL backend; production decode path since v0.3.0. The β-coop fused kernel (attention + W_O + RMSNorm + MLP) is the default. The opt-in `CUTE_WO_SPLIT=8` K-parallel W_O GEMV prototype reduces the W_O bottleneck (5.99× on R2, 8.68× on the R4 grid-barrier; see [evidence summary](benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/summary.md)).
 
-Launch with: `./scripts/serve-cute.sh --debug`
+Launch with: `./scripts/serve-cute.sh` (default PIECEWISE CUDA graphs). Avoid `--debug` (eager mode) on SM120 — it produces gibberish unrelated to kernel correctness; PIECEWISE is the validation path.
 
 ## Acknowledgments
 
@@ -147,5 +150,5 @@ Launch with: `./scripts/serve-cute.sh --debug`
 - **[CUTLASS PR #3030](https://github.com/NVIDIA/cutlass/pull/3030)** by blake-snc (Second Nature Computing) — SM120 Flash Attention v2 reference for fused multi-head attention on Blackwell.
   - [`docs/kernel-insights/2026-04-10-cutlass-pr3030-sm120-fmha.md`](docs/kernel-insights/2026-04-10-cutlass-pr3030-sm120-fmha.md) — SM120 FMHA patterns and tile configs
 - **[CUTLASS](https://github.com/NVIDIA/cutlass)** by NVIDIA — CuTe Python DSL for SM120 kernel development. The FP4 decode GEMM kernel with stream-K scheduling is adapted from CUTLASS test kernels.
-- **[Simon Veitner's CuTe DSL / NVFP4 blog](https://veitner.bearblog.dev/blog/)** — Reference reading for NVFP4 GEMV K-parallel reduction patterns identified as the next β-coop optimization direction (see [phaseE-tax bench](benchmarks/nvllm/traces/cute_paged_attn/2026-05-02-phaseE-tax-3leg/summary.md)).
+- **[Simon Veitner's CuTe DSL / NVFP4 blog](https://veitner.bearblog.dev/blog/)** — Reference reading for NVFP4 GEMV K-parallel reduction patterns. Applied to W_O GEMV in the `CUTE_WO_SPLIT=8` opt-in prototype (see [wo_split=8 evidence summary](benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/summary.md)).
 - **[vLLM](https://github.com/vllm-project/vllm)** — The upstream project this fork is based on.
diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline.nsys-rep b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline.nsys-rep
new file mode 100644
index 000000000000..316cfb72ac97
Binary files /dev/null and b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline.nsys-rep differ
diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline.sqlite b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline.sqlite
new file mode 100644
index 000000000000..f8268d45d770
Binary files /dev/null and b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline.sqlite differ
diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_gsm8k.json b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_gsm8k.json
new file mode 100644
index 000000000000..ef39e44bedc6
--- /dev/null
+++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_gsm8k.json
@@ -0,0 +1,463 @@
+{
+  "label": "task10_wo_split_1_baseline",
+  "model": "default",
+  "api": "http://localhost:8000/v1",
+  "n": 50,
+  "seed": 42,
+  "correct": 4,
+  "errors": 46,
+  "accuracy": "4/50 (8.0%)",
+  "total_seconds": 8781.8,
+  "results": [
+    {
+      "i": 0,
+      "expected": "2280",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "The girls are trying to raise money for a carnival. Kim raises $320 more than Al..."
+    },
+    {
+      "i": 1,
+      "expected": "1",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Kalinda is working on a 360 piece puzzle with her mom. Kalinda can normally add ..."
+    },
+    {
+      "i": 2,
+      "expected": "5",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Tom's ship can travel at 10 miles per hour.  He is sailing from 1 to 4 PM.  He t..."
+    },
+    {
+      "i": 3,
+      "expected": "12",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "James decides to buy birthday candles for his 2 sons.  One of them is 12 and the..."
+    },
+    {
+      "i": 4,
+      "expected": "273",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Mariah\u2019s grandma was teaching her to knit. Mariah used 1/4 of a skein of yarn. H..."
+    },
+    {
+      "i": 5,
+      "expected": "45",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "While playing with her friends in their school playground, Katelyn saw 50 fairie..."
+    },
+    {
+      "i": 6,
+      "expected": "21",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "If Ann is 9 years old and her brother is twice her age, how old will her brother..."
+    },
+    {
+      "i": 7,
+      "expected": "145",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Twenty dozen cups cost $1200 less than the total cost of half a dozen plates sol..."
+    },
+    {
+      "i": 8,
+      "expected": "60",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Last year there were 50 students enrolled in a calligraphy class. This year, the..."
+    },
+    {
+      "i": 9,
+      "expected": "122",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Rani has ten more crabs than Monic, who has 4 fewer crabs than Bo. If Bo has 40 ..."
+    },
+    {
+      "i": 10,
+      "expected": "29",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Brian's basement was damp and musty, so he bought a dehumidifier to remove moist..."
+    },
+    {
+      "i": 11,
+      "expected": "80",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "George, a grade six teacher, ordered 600 burritos for the sixth-grade class picn..."
+    },
+    {
+      "i": 12,
+      "expected": "36",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Jim spends 2 hours watching TV and then decides to go to bed and reads for half ..."
+    },
+    {
+      "i": 13,
+      "expected": "1430",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Janet buys a brooch for her daughter.  She pays $500 for the material to make it..."
+    },
+    {
+      "i": 14,
+      "expected": "5",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Ducks need to eat 3.5 pounds of insects each week to survive. If there is a floc..."
+    },
+    {
+      "i": 15,
+      "expected": "5",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Joey has 214 points before his turn in Scrabble. He scores 26 points. Then Marcy..."
+    },
+    {
+      "i": 16,
+      "expected": "5",
+      "got": "5",
+      "status": "OK",
+      "elapsed": 176.7,
+      "raw_tail": "y books each child received:**\n    Sarah divides these books equally among her 4 kids.\n    $$20 \\text{ books} / 4 \\text{ kids} = 5 \\text{ books per kid}$$\n\n**Answer:** Each child received **5** books.",
+      "question": "Sarah went to buy books from the store and spent $300 on the books. If each book..."
+    },
+    {
+      "i": 17,
+      "expected": "66",
+      "got": "66",
+      "status": "OK",
+      "elapsed": 75.8,
+      "raw_tail": "ook = $48\n  3 pencils \u00d7 $6/pencil = $1...\n</think>\n\n$18\n\nTed spends **$48** on books ($16 \u00d7 3) and **$18** on pencils ($6 \u00d7 3).\n\nAdding these together gives the total amount spent:\n$48 + $18 = **$66**",
+      "question": "Ted starts with $200. He buys 3 books for 16 dollars each and 3 pencils for 6 do..."
+    },
+    {
+      "i": 18,
+      "expected": "15",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Eve had 20 pieces of pomelos. After giving her friend some pomelos, Eve is left ..."
+    },
+    {
+      "i": 19,
+      "expected": "40",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "The Doubtfire sisters are driving home with 7 kittens adopted from the local ani..."
+    },
+    {
+      "i": 20,
+      "expected": "93",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Jen got 3 fish.  They each need $1 worth of food a day.  How much does she spend..."
+    },
+    {
+      "i": 21,
+      "expected": "2000",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Elliott is trying to walk 10,000 steps a day. He finished half of his steps on h..."
+    },
+    {
+      "i": 22,
+      "expected": "1520",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "While working at the restaurant, each of the forty customers who came into the r..."
+    },
+    {
+      "i": 23,
+      "expected": "11050",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "On a certain day, the total cost of filling up 20 helium balloons was $900. Two ..."
+    },
+    {
+      "i": 24,
+      "expected": "90",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "To raise money for their class fund, each of the 30 students from one class sold..."
+    },
+    {
+      "i": 25,
+      "expected": "40000",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "John decides to get a loan by mortgaging his home.  His house is worth $250,000...."
+    },
+    {
+      "i": 26,
+      "expected": "21",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Liam and Mitchell own competing lemonade stands across the street from one anoth..."
+    },
+    {
+      "i": 27,
+      "expected": "18",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Melanie is a door-to-door saleswoman. She sold a third of her vacuum cleaners at..."
+    },
+    {
+      "i": 28,
+      "expected": "14",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Some people got on a bus at the terminal. At the first bus stop, 5 more people g..."
+    },
+    {
+      "i": 29,
+      "expected": "23",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "A bumper car rink has 12 red cars. They have 2 fewer green cars than they have r..."
+    },
+    {
+      "i": 30,
+      "expected": "145",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Students in class 3B are collecting school points for behavior. If they get enou..."
+    },
+    {
+      "i": 31,
+      "expected": "123",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "If a classroom has 3 times as many girls as they do boys, and 1/10 as many nonge..."
+    },
+    {
+      "i": 32,
+      "expected": "98",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "The area of Billie's rectangular bedroom is 360 square feet. If the length of hi..."
+    },
+    {
+      "i": 33,
+      "expected": "7",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Manolo bought five lollipops and four candies that cost $3.20. If each lollipop ..."
+    },
+    {
+      "i": 34,
+      "expected": "34",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Three friends spent $20.25 on 3 tickets to the fair. They also spent $4.50 less ..."
+    },
+    {
+      "i": 35,
+      "expected": "38",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "At the local Pick Your Own fruit orchard, you could pick your own peaches for $2..."
+    },
+    {
+      "i": 36,
+      "expected": "320",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Elaine initially had 20 Pokemon cards. After a month, she collected three times ..."
+    },
+    {
+      "i": 37,
+      "expected": "50",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "In a student council election, candidate A got 20% of the votes while candidate ..."
+    },
+    {
+      "i": 38,
+      "expected": "50",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Gerald and Julia divided $100 in the ratio 3:2. If Gerald spent $10 on a book, h..."
+    },
+    {
+      "i": 39,
+      "expected": "84",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Madeline ate 6 grapes. Her brother wanted to make grape juice and used up 5 time..."
+    },
+    {
+      "i": 40,
+      "expected": "50",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Elvis has a monthly saving target of $1125. In April, he wants to save twice as ..."
+    },
+    {
+      "i": 41,
+      "expected": "8000",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Marilyn's first record sold 10 times as many copies as Harald's. If they sold 88..."
+    },
+    {
+      "i": 42,
+      "expected": "280",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "For the children's day celebration, children are to perform mass drills in the p..."
+    },
+    {
+      "i": 43,
+      "expected": "30",
+      "got": "30",
+      "status": "OK",
+      "elapsed": 92.0,
+      "raw_tail": " Age**: Martha is 4 years younger than Mike.\n   $$24 - 4 = 20$$\n   So, Martha is **20** years old.\n3. **Jeff's Age**: Jeff is 10 years older than Martha.\n   $$20 + 10 = 30$$\n\nJeff is **30** years old.",
+      "question": "Jeff is 10 years older than his younger sister, Martha. Martha, on the other han..."
+    },
+    {
+      "i": 44,
+      "expected": "192",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Ten stalls have 20 cows each. Mr. Sylas buys 40 cows and divides them equally, p..."
+    },
+    {
+      "i": 45,
+      "expected": "276",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Henry needs to assemble some toys, specifically 57 cars and 73 motorcycles. Henr..."
+    },
+    {
+      "i": 46,
+      "expected": "32",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Anakin and Locsin went to the beach today. Anakin caught 10 starfish, 6 sea hors..."
+    },
+    {
+      "i": 47,
+      "expected": "25",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Nik has 200 crayons. He wants to separate them into groups of 8 and put them int..."
+    },
+    {
+      "i": 48,
+      "expected": "10",
+      "got": "10",
+      "status": "OK",
+      "elapsed": 154.6,
+      "raw_tail": " $8 + (0.50 \\times 8) = 12$). However, because this is too heavy, he switches to a weight that is two pounds lighter than this new amount ($12 - 2$).\n\nTherefore, Jamaal now uses a **10**-pound weight.",
+      "question": "Jamaal is at the gym. He has been using an 8-pound weight. He increases the weig..."
+    },
+    {
+      "i": 49,
+      "expected": "84",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "In a family, there are 2 brothers and 3 sisters. All sisters are the same age, w..."
+    }
+  ]
+}
\ No newline at end of file
diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_gsm8k_fullthink.json b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_gsm8k_fullthink.json
new file mode 100644
index 000000000000..5109d06c72b0
--- /dev/null
+++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_gsm8k_fullthink.json
@@ -0,0 +1,463 @@
+{
+  "label": "task10_wo_split_1_baseline_fullthink",
+  "model": "default",
+  "api": "http://localhost:8000/v1",
+  "n": 50,
+  "seed": 42,
+  "correct": 48,
+  "errors": 0,
+  "accuracy": "48/50 (96.0%)",
+  "total_seconds": 3760.3,
+  "results": [
+    {
+      "i": 0,
+      "expected": "2280",
+      "got": "2180",
+      "status": "WRONG",
+      "elapsed": 119.2,
+      "raw_tail": "Sarah raised $300\n- Maryam raised $400 more than Sarah, so Maryam raised $300 + $400 = $700\n\nNow I need to add up all the amounts:\n$430 + $750 + $300 + $700 = $2180\n\nThe total amount raised is $2180.\n",
+      "question": "The girls are trying to raise money for a carnival. Kim raises $320 more than Al..."
+    },
+    {
+      "i": 1,
+      "expected": "1",
+      "got": "1",
+      "status": "OK",
+      "elapsed": 50.6,
+      "raw_tail": "2 = 2 pieces per minute.\n3. Combined rate: 4 + 2 = 6 pieces per minute.\n4. Total pieces: 360.\n5. Time in minutes: 360 / 6 = 60 minutes.\n6. Convert to hours: 60 minutes = 1 hour.\nThe answer is 1 hour.\n",
+      "question": "Kalinda is working on a 360 piece puzzle with her mom. Kalinda can normally add ..."
+    },
+    {
+      "i": 2,
+      "expected": "5",
+      "got": "5",
+      "status": "OK",
+      "elapsed": 51.8,
+      "raw_tail": " 3 = 30 miles.\n2. Now I need to find out how long it takes to travel back 30 miles at 6 mph. Time = Distance / Speed = 30 / 6 = 5 hours.\n3. So it takes Tom 5 hours to get back.\nThe answer is 5 hours.\n",
+      "question": "Tom's ship can travel at 10 miles per hour.  He is sailing from 1 to 4 PM.  He t..."
+    },
+    {
+      "i": 3,
+      "expected": "12",
+      "got": "12",
+      "status": "OK",
+      "elapsed": 85.1,
+      "raw_tail": ". Since each pack contains 5 candles, he needs 20 / 5 = 4 packs.\nFinally, I need to calculate the total cost. Each pack costs $3, so 4 packs cost 4 * $3 = $12.\nTherefore, James spends $12 on candles.\n",
+      "question": "James decides to buy birthday candles for his 2 sons.  One of them is 12 and the..."
+    },
+    {
+      "i": 4,
+      "expected": "273",
+      "got": "273",
+      "status": "OK",
+      "elapsed": 76.6,
+      "raw_tail": " I need to find the total yards of yarn they used altogether. I add Mariah's usage and her grandma's usage: 91 + 182 = 273 yards.\nTherefore, they used 273 yards of yarn altogether.\nThe answer is 273.\n",
+      "question": "Mariah\u2019s grandma was teaching her to knit. Mariah used 1/4 of a skein of yarn. H..."
+    },
+    {
+      "i": 5,
+      "expected": "45",
+      "got": "45",
+      "status": "OK",
+      "elapsed": 53.8,
+      "raw_tail": ". Half of 50 is 25.\nSo, 25 fairies joined the original 50 fairies.\n50 + 25 = 75 fairies.\nThen, 30 fairies flew away.\n75 - 30 = 45 fairies.\nTherefore, there are 45 fairies remaining.\nThe answer is 45.\n",
+      "question": "While playing with her friends in their school playground, Katelyn saw 50 fairie..."
+    },
+    {
+      "i": 6,
+      "expected": "21",
+      "got": "21",
+      "status": "OK",
+      "elapsed": 46.4,
+      "raw_tail": "her is twice her age, so he is 9 * 2 = 18 years old.\nNow, I need to find out how old he will be in 3 years.\n18 + 3 = 21 years old.\nSo, Ann's brother will be 21 years old in 3 years.\nThe answer is 21.\n",
+      "question": "If Ann is 9 years old and her brother is twice her age, how old will her brother..."
+    },
+    {
+      "i": 7,
+      "expected": "145",
+      "got": "145",
+      "status": "OK",
+      "elapsed": 77.2,
+      "raw_tail": "otal cost of the plates, so $36,000 - $1200 = $34,800.\nNow, I need to find the cost per cup. Twenty dozen cups is 20 \u00d7 12 = 240 cups.\nSo, the cost per cup is $34,800 \u00f7 240 = $145.\nThe answer is $145.\n",
+      "question": "Twenty dozen cups cost $1200 less than the total cost of half a dozen plates sol..."
+    },
+    {
+      "i": 8,
+      "expected": "60",
+      "got": "60",
+      "status": "OK",
+      "elapsed": 40.5,
+      "raw_tail": "age: 20%\n3. Calculate the increase: 50 \u00d7 0.20 = 10 students\n4. Add the increase to last year's enrollment: 50 + 10 = 60 students\n5. Therefore, this year's enrollment is 60 students.\nThe answer is 60.\n",
+      "question": "Last year there were 50 students enrolled in a calligraphy class. This year, the..."
+    },
+    {
+      "i": 9,
+      "expected": "122",
+      "got": "122",
+      "status": "OK",
+      "elapsed": 60.3,
+      "raw_tail": " has 40 - 4 = 36 crabs\n- Rani has 10 more crabs than Monic, so Rani has 36 + 10 = 46 crabs\n\nNow I need to calculate the total:\nTotal = Bo + Monic + Rani = 40 + 36 + 46 = 122 crabs\n\nThe answer is 122.\n",
+      "question": "Rani has ten more crabs than Monic, who has 4 fewer crabs than Bo. If Bo has 40 ..."
+    },
+    {
+      "i": 10,
+      "expected": "29",
+      "got": "29",
+      "status": "OK",
+      "elapsed": 187.5,
+      "raw_tail": "**\n    *   Sum of all periods: $3 \\text{ liters} + 6 \\text{ liters} + 20 \\text{ liters}$.\n    *   Calculation: $3 + 6 = 9$; $9 + 20 = 29$.\n\nThe total amount of water removed is 29 liters.\n\n29 liters.\n",
+      "question": "Brian's basement was damp and musty, so he bought a dehumidifier to remove moist..."
+    },
+    {
+      "i": 11,
+      "expected": "80",
+      "got": "80",
+      "status": "OK",
+      "elapsed": 92.3,
+      "raw_tail": "ver burritos.\nLeftover burritos = Total burritos ordered - Total burritos consumed\nLeftover burritos = 600 - 520 = 80 burritos.\nTherefore, the total number of leftover burritos from the picnic is 80.\n",
+      "question": "George, a grade six teacher, ordered 600 burritos for the sixth-grade class picn..."
+    },
+    {
+      "i": 12,
+      "expected": "36",
+      "got": "36",
+      "status": "OK",
+      "elapsed": 52.6,
+      "raw_tail": "1 hour.\n3. Total time per session (TV + reading) = 2 + 1 = 3 hours.\n4. He does this 3 times a week, so weekly time = 3 * 3 = 9 hours.\n5. Over 4 weeks, total time = 9 * 4 = 36 hours.\nThe answer is 36.\n",
+      "question": "Jim spends 2 hours watching TV and then decides to go to bed and reads for half ..."
+    },
+    {
+      "i": 13,
+      "expected": "1430",
+      "got": "1430",
+      "status": "OK",
+      "elapsed": 74.4,
+      "raw_tail": "0% of $1300 = 0.10 \u00d7 $1300 = $130\n\nFinally, I need to calculate the total amount Janet paid.\nTotal amount paid: $1300 + $130 = $1430\n\nTherefore, Janet paid $1430 for the brooch.\n\nThe answer is: $1430\n",
+      "question": "Janet buys a brooch for her daughter.  She pays $500 for the material to make it..."
+    },
+    {
+      "i": 14,
+      "expected": "5",
+      "got": "5",
+      "status": "OK",
+      "elapsed": 63.3,
+      "raw_tail": "t this to a daily amount.\n5. There are 7 days in a week, so I divide the weekly amount by 7.\n6. 35 \u00f7 7 = 5 pounds per day.\n7. Therefore, the flock needs 5 pounds of insects per day.\n\nThe answer is 5.\n",
+      "question": "Ducks need to eat 3.5 pounds of insects each week to survive. If there is a floc..."
+    },
+    {
+      "i": 15,
+      "expected": "5",
+      "got": "5",
+      "status": "OK",
+      "elapsed": 44.0,
+      "raw_tail": "nts.\nNext, I need to calculate Marcy's new score: 225 + 10 = 235 points.\nNow I need to find the difference between their scores: 240 - 235 = 5 points.\nSo Joey is winning by 5 points.\nThe answer is 5.\n",
+      "question": "Joey has 214 points before his turn in Scrabble. He scores 26 points. Then Marcy..."
+    },
+    {
+      "i": 16,
+      "expected": "5",
+      "got": "5",
+      "status": "OK",
+      "elapsed": 45.7,
+      "raw_tail": " costs $15.\nSo, total books = $300 / $15 = 20 books.\nNext, she gave an equal number of books to her 4 kids.\nSo, books per child = 20 / 4 = 5 books.\nTherefore, each child got 5 books.\nThe answer is 5.\n",
+      "question": "Sarah went to buy books from the store and spent $300 on the books. If each book..."
+    },
+    {
+      "i": 17,
+      "expected": "66",
+      "got": "66",
+      "status": "OK",
+      "elapsed": 41.8,
+      "raw_tail": " books is 3 * 16 = $48.\nHe also buys 3 pencils at $6 each, so the cost for pencils is 3 * 6 = $18.\nThe total amount spent is the sum of the cost of books and pencils: 48 + 18 = $66.\nThe answer is 66.\n",
+      "question": "Ted starts with $200. He buys 3 books for 16 dollars each and 3 pencils for 6 do..."
+    },
+    {
+      "i": 18,
+      "expected": "15",
+      "got": "15",
+      "status": "OK",
+      "elapsed": 42.7,
+      "raw_tail": "4 of the original amount left.\n3. Calculate how many pomelos she has left: 20 \u00d7 1/4 = 5 pomelos.\n4. Calculate how many pomelos she gave away: 20 - 5 = 15 pomelos.\nTherefore, Eve gave away 15 pomelos.\n",
+      "question": "Eve had 20 pieces of pomelos. After giving her friend some pomelos, Eve is left ..."
+    },
+    {
+      "i": 19,
+      "expected": "40",
+      "got": "40",
+      "status": "OK",
+      "elapsed": 39.1,
+      "raw_tail": "opted kittens.\n2. Patchy has had thrice the number of adopted kittens, which is 3 * 7 = 21 kittens.\n3. Trixie has had 12 kittens.\n4. The total number of kittens is 7 + 21 + 12 = 40.\nThe answer is 40.\n",
+      "question": "The Doubtfire sisters are driving home with 7 kittens adopted from the local ani..."
+    },
+    {
+      "i": 20,
+      "expected": "93",
+      "got": "93",
+      "status": "OK",
+      "elapsed": 51.6,
+      "raw_tail": "$3 per day\n2. Next, I need to find out how many days are in May.\n   - May has 31 days\n3. Finally, I need to calculate the total cost for the month.\n   - $3 per day \u00d7 31 days = $93\n\nThe answer is $93.\n",
+      "question": "Jen got 3 fish.  They each need $1 worth of food a day.  How much does she spend..."
+    },
+    {
+      "i": 21,
+      "expected": "2000",
+      "got": "2000",
+      "status": "OK",
+      "elapsed": 101.4,
+      "raw_tail": "ol and 1,000 steps with his friend, which is 5,000 + 1,000 = 6,000 steps.\n7. The remaining steps must be from his jog: 8,000 - 6,000 = 2,000 steps.\nTherefore, Elliott took 2,000 steps during his jog.\n",
+      "question": "Elliott is trying to walk 10,000 steps a day. He finished half of his steps on h..."
+    },
+    {
+      "i": 22,
+      "expected": "1520",
+      "got": "1520",
+      "status": "OK",
+      "elapsed": 38.9,
+      "raw_tail": "Rafaela's tips: 40 customers \u00d7 $20 = $800\nJulieta's tips: $800 - (10% of $800) = $800 - $80 = $720\nTotal tips: $800 + $720 = $1520\nThe answer is 1520.\n",
+      "question": "While working at the restaurant, each of the forty customers who came into the r..."
+    },
+    {
+      "i": 23,
+      "expected": "11050",
+      "got": "11050",
+      "status": "OK",
+      "elapsed": 62.3,
+      "raw_tail": ".\nNew cost per balloon = $45 + $20 = $65\n\nBentley filled 170 balloons at the new price.\nTotal cost = 170 \u00d7 $65 = $11,050\n\nTherefore, Bentley paid $11,050 to fill 170 balloons.\n\nThe answer is: $11,050\n",
+      "question": "On a certain day, the total cost of filling up 20 helium balloons was $900. Two ..."
+    },
+    {
+      "i": 24,
+      "expected": "90",
+      "got": "90",
+      "status": "OK",
+      "elapsed": 74.3,
+      "raw_tail": "he total is 30 * 10 = 300 lollipops.\nFinally, I need to find the total profit. The profit per lollipop is $0.3 and they sold 300 lollipops, so the total profit is 300 * $0.3 = $90.\nThe answer is $90.\n",
+      "question": "To raise money for their class fund, each of the 30 students from one class sold..."
+    },
+    {
+      "i": 25,
+      "expected": "40000",
+      "got": "40000",
+      "status": "OK",
+      "elapsed": 68.1,
+      "raw_tail": "d to calculate how much he used to pay off debts: 60% of $100,000 = 0.60 \u00d7 $100,000 = $60,000\n3. Finally, I need to calculate the leftover amount: $100,000 - $60,000 = $40,000\n\nThe answer is $40,000.\n",
+      "question": "John decides to get a loan by mortgaging his home.  His house is worth $250,000...."
+    },
+    {
+      "i": 26,
+      "expected": "21",
+      "got": "21",
+      "status": "OK",
+      "elapsed": 49.9,
+      "raw_tail": "eekend.\nTo find out how many more dollars Mitchell made than Liam, I need to subtract Liam's earnings from Mitchell's earnings.\n$84 - $63 = $21.\nSo Mitchell made $21 more than Liam.\nThe answer is 21.\n",
+      "question": "Liam and Mitchell own competing lemonade stands across the street from one anoth..."
+    },
+    {
+      "i": 27,
+      "expected": "18",
+      "got": "18",
+      "status": "OK",
+      "elapsed": 134.3,
+      "raw_tail": "1/3 at green house: 18 * 1/3 = 6 sold, 12 left.\n- Sold 2 to red house: 12 - 2 = 10 left.\n- Sold half at orange house: 10 * 1/2 = 5 sold, 5 left.\n- Final count: 5 vacuum cleaners. \u2713\n\nThe answer is 18.\n",
+      "question": "Melanie is a door-to-door saleswoman. She sold a third of her vacuum cleaners at..."
+    },
+    {
+      "i": 28,
+      "expected": "14",
+      "got": "14",
+      "status": "OK",
+      "elapsed": 69.6,
+      "raw_tail": "+ 5) - 7 + 8 = x + 6.\n4. We know that at the third stop, there were 20 people, so x + 6 = 20.\n5. Solving for x: x = 20 - 6 = 14.\nTherefore, 14 people got on the bus at the terminal.\nThe answer is 14.\n",
+      "question": "Some people got on a bus at the terminal. At the first bus stop, 5 more people g..."
+    },
+    {
+      "i": 29,
+      "expected": "23",
+      "got": "23",
+      "status": "OK",
+      "elapsed": 41.7,
+      "raw_tail": "1. Red cars: 12\n2. Green cars: 12 - 2 = 10\n3. Blue cars: 3 * 10 = 30\n4. Total cars: 75\n5. Yellow cars: 75 - (12 + 10 + 30) = 75 - 52 = 23\n\nThe answer is 23.\n",
+      "question": "A bumper car rink has 12 red cars. They have 2 fewer green cars than they have r..."
+    },
+    {
+      "i": 30,
+      "expected": "145",
+      "got": "145",
+      "status": "OK",
+      "elapsed": 78.9,
+      "raw_tail": " Marta has 3 * 35 = 105 points.\nTotal points collected by the class = 50 + 65 + 35 + 105 = 255 points.\nThe minimum threshold is 400 points.\nPoints missing = 400 - 255 = 145 points.\nThe answer is 145.\n",
+      "question": "Students in class 3B are collecting school points for behavior. If they get enou..."
+    },
+    {
+      "i": 31,
+      "expected": "123",
+      "got": "123",
+      "status": "OK",
+      "elapsed": 96.0,
+      "raw_tail": " girls, and nongendered children together.\n    *   Total = Boys + Girls + Nongendered\n    *   Total = $30 + 90 + 3$\n    *   Total = $123$.\n\nThe classroom has a total of 123 children.\n\n**Answer:** 123\n",
+      "question": "If a classroom has 3 times as many girls as they do boys, and 1/10 as many nonge..."
+    },
+    {
+      "i": 32,
+      "expected": "98",
+      "got": "98",
+      "status": "OK",
+      "elapsed": 78.8,
+      "raw_tail": "idth = 360 \u00f7 9 = 40 feet\n\nNow I can find the perimeter:\nPerimeter = 2 \u00d7 (Length + Width)\nPerimeter = 2 \u00d7 (9 + 40)\nPerimeter = 2 \u00d7 49\nPerimeter = 98 feet\n\nThe perimeter of Billie's bedroom is 98 feet.\n",
+      "question": "The area of Billie's rectangular bedroom is 360 square feet. If the length of hi..."
+    },
+    {
+      "i": 33,
+      "expected": "7",
+      "got": "7",
+      "status": "OK",
+      "elapsed": 97.7,
+      "raw_tail": "ow I need to find the cost of 10 lollipops and 10 candies.\nCost of 10 lollipops = 10 * $0.40 = $4.00.\nCost of 10 candies = 10 * $0.30 = $3.00.\nTotal cost = $4.00 + $3.00 = $7.00.\nThe answer is $7.00.\n",
+      "question": "Manolo bought five lollipops and four candies that cost $3.20. If each lollipop ..."
+    },
+    {
+      "i": 34,
+      "expected": "34",
+      "got": "34",
+      "status": "OK",
+      "elapsed": 127.0,
+      "raw_tail": ". Since they agreed to split all the costs evenly, I need to divide the total cost by the number of friends. So each friend paid $102 / 3 = $34.\n\nTherefore, each of them paid $34.\n\nThe answer is: $34\n",
+      "question": "Three friends spent $20.25 on 3 tickets to the fair. They also spent $4.50 less ..."
+    },
+    {
+      "i": 35,
+      "expected": "38",
+      "got": "38",
+      "status": "OK",
+      "elapsed": 49.7,
+      "raw_tail": "aches: 6 pounds \u00d7 $2.00 per pound = $12.00\n- Plums: 8 pounds \u00d7 $1.00 per pound = $8.00\n- Apricots: 6 pounds \u00d7 $3.00 per pound = $18.00\n- Total: $12.00 + $8.00 + $18.00 = $38.00\n\nThe answer is $38.00.\n",
+      "question": "At the local Pick Your Own fruit orchard, you could pick your own peaches for $2..."
+    },
+    {
+      "i": 36,
+      "expected": "320",
+      "got": "320",
+      "status": "OK",
+      "elapsed": 75.3,
+      "raw_tail": "(60 + 40) = 2 * 100 = 200 cards\n\nNow let me calculate the total:\nInitial + First month + Second month + Third month\n= 20 + 60 + 40 + 200\n= 320 cards\n\nTherefore, Elaine has 320 Pokemon cards in total.\n",
+      "question": "Elaine initially had 20 Pokemon cards. After a month, she collected three times ..."
+    },
+    {
+      "i": 37,
+      "expected": "50",
+      "got": "50",
+      "status": "OK",
+      "elapsed": 90.4,
+      "raw_tail": "al number of votes is 100, and we know that candidate A got 20 votes and candidate B got 30 votes. So, candidate C got 100 - 20 - 30 = 50 votes.\nTherefore, candidate C got 50 votes.\nThe answer is 50.\n",
+      "question": "In a student council election, candidate A got 20% of the votes while candidate ..."
+    },
+    {
+      "i": 38,
+      "expected": "50",
+      "got": "50",
+      "status": "OK",
+      "elapsed": 73.6,
+      "raw_tail": "'s share = 3 parts \u00d7 $20 = $60.\nJulia's share = 2 parts \u00d7 $20 = $40.\nNow, Gerald spent $10 on a book.\nSo, Gerald's remaining money = $60 - $10 = $50.\nTherefore, Gerald had $50 left.\nThe answer is 50.\n",
+      "question": "Gerald and Julia divided $100 in the ratio 3:2. If Gerald spent $10 on a book, h..."
+    },
+    {
+      "i": 39,
+      "expected": "84",
+      "got": "84",
+      "status": "OK",
+      "elapsed": 50.0,
+      "raw_tail": "3. Their mother made 4 pies, and each pie requires 12 grapes: 4 * 12 = 48 grapes.\n4. Total grapes = Madeline's grapes + Brother's grapes + Mother's grapes = 6 + 30 + 48 = 84 grapes.\nThe answer is 84.\n",
+      "question": "Madeline ate 6 grapes. Her brother wanted to make grape juice and used up 5 time..."
+    },
+    {
+      "i": 40,
+      "expected": "50",
+      "got": "50",
+      "status": "OK",
+      "elapsed": 115.4,
+      "raw_tail": "o he saves $25 per day in the first half, and $2x = $50 per day in the second half.\n8. The question asks for the amount he needs to save each day in the second half, which is $50.\n\nThe answer is $50.\n",
+      "question": "Elvis has a monthly saving target of $1125. In April, he wants to save twice as ..."
+    },
+    {
+      "i": 41,
+      "expected": "8000",
+      "got": "8000",
+      "status": "OK",
+      "elapsed": 68.0,
+      "raw_tail": "lso know that M + H = 88,000 (combined sales).\nSubstituting M = 10H into the second equation:\n10H + H = 88,000\n11H = 88,000\nH = 88,000 / 11\nH = 8,000\nSo Harald sold 8,000 copies.\nThe answer is 8,000.\n",
+      "question": "Marilyn's first record sold 10 times as many copies as Harald's. If they sold 88..."
+    },
+    {
+      "i": 42,
+      "expected": "280",
+      "got": "280",
+      "status": "OK",
+      "elapsed": 50.7,
+      "raw_tail": "ows \u00d7 8 children per row = 56 children per school\n3. Calculate the total number of children:\n   - 5 schools \u00d7 56 children per school = 280 children\n\nTherefore, there can be 280 children in the drill.\n",
+      "question": "For the children's day celebration, children are to perform mass drills in the p..."
+    },
+    {
+      "i": 43,
+      "expected": "30",
+      "got": "30",
+      "status": "OK",
+      "elapsed": 36.4,
+      "raw_tail": "is 24 years old.\n2. Martha is 4 years younger than Mike, so Martha is 24 - 4 = 20 years old.\n3. Jeff is 10 years older than Martha, so Jeff is 20 + 10 = 30 years old.\nTherefore, Jeff is 30 years old.\n",
+      "question": "Jeff is 10 years older than his younger sister, Martha. Martha, on the other han..."
+    },
+    {
+      "i": 44,
+      "expected": "192",
+      "got": "1",
+      "status": "WRONG",
+      "elapsed": 225.9,
+      "raw_tail": "er: 176 cows in 8 stalls.\n\nLet me double-check:\n- 20 stalls \u00d7 20 cows = 400 initial cows\n- 40 new cows \u00f7 20 stalls = 2 new cows per stall\n- Each stall now has 20 + 2 = 22 cows\n- 8 stalls \u00d7 22 cows = 1",
+      "question": "Ten stalls have 20 cows each. Mr. Sylas buys 40 cows and divides them equally, p..."
+    },
+    {
+      "i": 45,
+      "expected": "276",
+      "got": "276",
+      "status": "OK",
+      "elapsed": 94.0,
+      "raw_tail": "ut how many wheels will be left, I need to subtract the total wheels needed from the total wheels available. So, 650 - 374 = 276 wheels.\nTherefore, Henry will have 276 wheels left.\nThe answer is 276.\n",
+      "question": "Henry needs to assemble some toys, specifically 57 cars and 73 motorcycles. Henr..."
+    },
+    {
+      "i": 46,
+      "expected": "32",
+      "got": "32",
+      "status": "OK",
+      "elapsed": 58.3,
+      "raw_tail": "r Anakin = 10 + 6 + 3 = 19\n\nLocsin caught:\n- Starfish: 10 - 5 = 5\n- Sea horses: 6 - 3 = 3\n- Clownfish: 3 + 2 = 5\nTotal for Locsin = 5 + 3 + 5 = 13\n\nTotal fish caught = 19 + 13 = 32\n\nThe answer is 32.\n",
+      "question": "Anakin and Locsin went to the beach today. Anakin caught 10 starfish, 6 sea hors..."
+    },
+    {
+      "i": 47,
+      "expected": "25",
+      "got": "25",
+      "status": "OK",
+      "elapsed": 118.3,
+      "raw_tail": "I need to convert the total weight to pounds.\nThere are 16 ounces in 1 pound.\nTotal weight in pounds = 400 \u00f7 16 = 25 pounds.\n\nThe total weight of the crayons and boxes is 25 pounds.\nThe answer is 25.\n",
+      "question": "Nik has 200 crayons. He wants to separate them into groups of 8 and put them int..."
+    },
+    {
+      "i": 48,
+      "expected": "10",
+      "got": "10",
+      "status": "OK",
+      "elapsed": 54.7,
+      "raw_tail": "nds.\n3. So the increased weight is: 8 + 4 = 12 pounds.\n4. This weight is too heavy, so he uses a weight 2 pounds lighter than 12 pounds.\n5. The final weight is: 12 - 2 = 10 pounds.\n\nThe answer is 10.\n",
+      "question": "Jamaal is at the gym. He has been using an 8-pound weight. He increases the weig..."
+    },
+    {
+      "i": 49,
+      "expected": "84",
+      "got": "84",
+      "status": "OK",
+      "elapsed": 83.9,
+      "raw_tail": "s old, so their total age is 12 + 24 = 36 years.\nFinally, let's add the total age of the sisters and brothers together: 48 + 36 = 84 years.\nTherefore, the total age of all these siblings is 84 years.\n",
+      "question": "In a family, there are 2 brothers and 3 sisters. All sisters are the same age, w..."
+    }
+  ]
+}
\ No newline at end of file
diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/config.json b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/config.json
new file mode 100644
index 000000000000..f9f4c880e61e
--- /dev/null
+++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/config.json
@@ -0,0 +1,37 @@
+{
+  "git_sha": "unknown",
+  "wo_split": 1,
+  "total_wo_ctas": 4,
+  "slice_ctas": 8,
+  "gather_ctas": 32,
+  "total_grid_ctas_per_seq": 32,
+  "active_wo_ctas": 4,
+  "hidden_size": 5120,
+  "num_kv_heads": 4,
+  "num_q_heads": 24,
+  "head_dim": 256,
+  "K": 6144,
+  "num_k_groups": 384,
+  "num_k_tiles": 96,
+  "num_active_tokens": 1,
+  "seed": 4242,
+  "launches": 50,
+  "cooperative": true,
+  "ncu": false,
+  "warmup_launches": 0,
+  "dtypes": {
+    "attn_output": "bfloat16",
+    "wo_weight": "uint8",
+    "wo_scales": "uint8",
+    "wo_gs": "float32",
+    "wo_output": "float32",
+    "final_out": "float32"
+  },
+  "cache_key": "35fee3f003016249",
+  "diagnostic_no_cooperative": false,
+  "image_id": null,
+  "torch_version": "2.12.0.dev20260402+cu132",
+  "python_version": "3.12.3 (main, Mar  3 2026, 12:15:18) [GCC 13.3.0]",
+  "effective_bytes_formula": "PAYLOAD = B*K*2 + H*K//2 + nmt*nkt*32*4*4 + 4 + B*H*4; SCRATCH = B*total_wo_ctas*H*4 + GATHER_CTAS*total_wo_ctas*B*H*4; EFFECTIVE = PAYLOAD + SCRATCH; GATHER_CTAS = slice_ctas*num_kv_heads (this run: 32)",
+  "cache_status_first_event": "HIT"
+}
diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/correctness_gate_split_order.json b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/correctness_gate_split_order.json
new file mode 100644
index 000000000000..1287ebc647fc
--- /dev/null
+++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/correctness_gate_split_order.json
@@ -0,0 +1,10 @@
+{
+  "passes": true,
+  "max_abs": 0.0,
+  "max_rel": 0.0,
+  "rtol": 0.001,
+  "atol": 0.0001,
+  "ref_function": "reference_split_order(wo_split=1)",
+  "wo_split": 1,
+  "kind": "AUTHORITATIVE"
+}
diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/correctness_vs_chained.json b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/correctness_vs_chained.json
new file mode 100644
index 000000000000..c098b65eb5b3
--- /dev/null
+++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/correctness_vs_chained.json
@@ -0,0 +1,6 @@
+{
+  "max_abs": 0.0,
+  "max_rel": 0.0,
+  "ref_function": "reference_chained_fma",
+  "kind": "DIAGNOSTIC"
+}
diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/correctness_vs_matmul.json b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/correctness_vs_matmul.json
new file mode 100644
index 000000000000..f019c09b03da
--- /dev/null
+++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/correctness_vs_matmul.json
@@ -0,0 +1,6 @@
+{
+  "max_abs": 0.06640625,
+  "max_rel": 0.006844626739621162,
+  "ref_function": "reference_matmul",
+  "kind": "DIAGNOSTIC"
+}
diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/timing.csv b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/timing.csv
new file mode 100644
index 000000000000..ddb577f1ef64
--- /dev/null
+++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/timing.csv
@@ -0,0 +1,51 @@
+launch_idx,elapsed_us,payload_bytes,scratch_bytes,effective_bytes,effective_gbps,is_warmup
+0,14802.335739,17727492,2703360,20430852,1.380245,0
+1,13730.976105,17727492,2703360,20430852,1.487939,0
+2,13796.223640,17727492,2703360,20430852,1.480902,0
+3,13867.839813,17727492,2703360,20430852,1.473254,0
+4,13700.991631,17727492,2703360,20430852,1.491195,0
+5,13705.504417,17727492,2703360,20430852,1.490704,0
+6,13693.375587,17727492,2703360,20430852,1.492025,0
+7,13770.688057,17727492,2703360,20430852,1.483648,0
+8,13704.992294,17727492,2703360,20430852,1.490760,0
+9,13722.335815,17727492,2703360,20430852,1.488876,0
+10,13803.551674,17727492,2703360,20430852,1.480116,0
+11,13716.256142,17727492,2703360,20430852,1.489536,0
+12,13741.279602,17727492,2703360,20430852,1.486823,0
+13,13814.111710,17727492,2703360,20430852,1.478984,0
+14,13842.975616,17727492,2703360,20430852,1.475900,0
+15,13775.135994,17727492,2703360,20430852,1.483169,0
+16,13746.368408,17727492,2703360,20430852,1.486273,0
+17,14579.999924,17727492,2703360,20430852,1.401293,0
+18,13786.591530,17727492,2703360,20430852,1.481936,0
+19,13732.192039,17727492,2703360,20430852,1.487807,0
+20,13801.471710,17727492,2703360,20430852,1.480339,0
+21,13725.376129,17727492,2703360,20430852,1.488546,0
+22,13752.511978,17727492,2703360,20430852,1.485609,0
+23,13766.400337,17727492,2703360,20430852,1.484110,0
+24,13796.799660,17727492,2703360,20430852,1.480840,0
+25,13753.600121,17727492,2703360,20430852,1.485491,0
+26,13727.775574,17727492,2703360,20430852,1.488286,0
+27,16090.143204,17727492,2703360,20430852,1.269774,0
+28,28432.863235,17727492,2703360,20430852,0.718565,0
+29,14688.256264,17727492,2703360,20430852,1.390965,0
+30,15311.231613,17727492,2703360,20430852,1.334370,0
+31,14410.176277,17727492,2703360,20430852,1.417807,0
+32,14028.096199,17727492,2703360,20430852,1.456424,0
+33,13780.639648,17727492,2703360,20430852,1.482576,0
+34,13766.016006,17727492,2703360,20430852,1.484151,0
+35,13752.032280,17727492,2703360,20430852,1.485661,0
+36,13723.872185,17727492,2703360,20430852,1.488709,0
+37,13732.128143,17727492,2703360,20430852,1.487814,0
+38,13781.951904,17727492,2703360,20430852,1.482435,0
+39,13744.159698,17727492,2703360,20430852,1.486512,0
+40,13721.792221,17727492,2703360,20430852,1.488935,0
+41,13742.303848,17727492,2703360,20430852,1.486712,0
+42,13776.415825,17727492,2703360,20430852,1.483031,0
+43,13775.712013,17727492,2703360,20430852,1.483107,0
+44,13731.167793,17727492,2703360,20430852,1.487918,0
+45,13713.727951,17727492,2703360,20430852,1.489810,0
+46,13810.175896,17727492,2703360,20430852,1.479406,0
+47,13793.248177,17727492,2703360,20430852,1.481221,0
+48,13752.256393,17727492,2703360,20430852,1.485636,0
+49,13806.719780,17727492,2703360,20430852,1.479776,0
diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_region_breakdown.csv b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_region_breakdown.csv
new file mode 100644
index 000000000000..9cc11c31e64a
--- /dev/null
+++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_region_breakdown.csv
@@ -0,0 +1,14 @@
+region_id,region,n_active_ctas,cta_class,tick_source,mean_ticks,median_ticks,p99_ticks,raw_total_ticks,median_us,frac_of_kernel
+0,phase0_pre_attn,1,phase0,globaltimer,12512.0,12512.0,12512.0,12512,12.512,0.0003128
+1,phase1_attn_pre_wo,4,phase1,globaltimer,245264.0,248320.0,248320.0,981056,248.32,0.006208
+2,phase1_wo_gemv,4,phase1,globaltimer,14152008.0,14121472.0,14858986.24,56608032,14121.472,0.3530368
+3,phase1_wo_post,1,phase1,globaltimer,128.0,128.0,128.0,128,0.128,3.2e-06
+4,grid_barrier_wait,64,barrier_wait,globaltimer,14310038.0,15211456.0,15211520.0,915842432,15211.456,
+5,phase3_load_x,64,phase3,globaltimer,1948.5,1952.0,1984.0,124704,1.952,4.88e-05
+6,phase3_partial_reset,64,phase3,globaltimer,102.5,96.0,160.0,6560,0.096,2.4e-06
+7,phase3_3a_fc1_silu,64,phase3,globaltimer,564574.5,564448.0,610369.6,36132768,564.448,0.0141112
+8,phase3_3b_quant,64,phase3,globaltimer,546.5,544.0,576.0,34976,0.544,1.36e-05
+9,phase3_3c_fc2_atomic,64,phase3,globaltimer,114624.5,113328.0,174219.84,7335968,113.328,0.0028332
+10,phase3_3d_arrival,64,phase3,globaltimer,314.5,96.0,2603.84,20128,0.096,2.4e-06
+11,phase1_pre_wo_wait,0,barrier_wait,globaltimer,0.0,0.0,0.0,0,0.0,
+12,phase1_gather_reduce,1,dynamic_single,globaltimer,73312.0,73312.0,73312.0,73312,73.312,
diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_region_timings.npy b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_region_timings.npy
new file mode 100644
index 000000000000..9f84718c2bd1
Binary files /dev/null and b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_region_timings.npy differ
diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed.nsys-rep b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed.nsys-rep
new file mode 100644
index 000000000000..91cb212d5803
Binary files /dev/null and b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed.nsys-rep differ
diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed.sqlite b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed.sqlite
new file mode 100644
index 000000000000..ceb1960b7eee
Binary files /dev/null and b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed.sqlite differ
diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_gsm8k.json b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_gsm8k.json
new file mode 100644
index 000000000000..62151fcede40
--- /dev/null
+++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_gsm8k.json
@@ -0,0 +1,463 @@
+{
+  "label": "task11_wo_split_8",
+  "model": "default",
+  "api": "http://localhost:8000/v1",
+  "n": 50,
+  "seed": 42,
+  "correct": 4,
+  "errors": 46,
+  "accuracy": "4/50 (8.0%)",
+  "total_seconds": 8782.5,
+  "results": [
+    {
+      "i": 0,
+      "expected": "2280",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "The girls are trying to raise money for a carnival. Kim raises $320 more than Al..."
+    },
+    {
+      "i": 1,
+      "expected": "1",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Kalinda is working on a 360 piece puzzle with her mom. Kalinda can normally add ..."
+    },
+    {
+      "i": 2,
+      "expected": "5",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Tom's ship can travel at 10 miles per hour.  He is sailing from 1 to 4 PM.  He t..."
+    },
+    {
+      "i": 3,
+      "expected": "12",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "James decides to buy birthday candles for his 2 sons.  One of them is 12 and the..."
+    },
+    {
+      "i": 4,
+      "expected": "273",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Mariah\u2019s grandma was teaching her to knit. Mariah used 1/4 of a skein of yarn. H..."
+    },
+    {
+      "i": 5,
+      "expected": "45",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "While playing with her friends in their school playground, Katelyn saw 50 fairie..."
+    },
+    {
+      "i": 6,
+      "expected": "21",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "If Ann is 9 years old and her brother is twice her age, how old will her brother..."
+    },
+    {
+      "i": 7,
+      "expected": "145",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Twenty dozen cups cost $1200 less than the total cost of half a dozen plates sol..."
+    },
+    {
+      "i": 8,
+      "expected": "60",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Last year there were 50 students enrolled in a calligraphy class. This year, the..."
+    },
+    {
+      "i": 9,
+      "expected": "122",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Rani has ten more crabs than Monic, who has 4 fewer crabs than Bo. If Bo has 40 ..."
+    },
+    {
+      "i": 10,
+      "expected": "29",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Brian's basement was damp and musty, so he bought a dehumidifier to remove moist..."
+    },
+    {
+      "i": 11,
+      "expected": "80",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "George, a grade six teacher, ordered 600 burritos for the sixth-grade class picn..."
+    },
+    {
+      "i": 12,
+      "expected": "36",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Jim spends 2 hours watching TV and then decides to go to bed and reads for half ..."
+    },
+    {
+      "i": 13,
+      "expected": "1430",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Janet buys a brooch for her daughter.  She pays $500 for the material to make it..."
+    },
+    {
+      "i": 14,
+      "expected": "5",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Ducks need to eat 3.5 pounds of insects each week to survive. If there is a floc..."
+    },
+    {
+      "i": 15,
+      "expected": "5",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Joey has 214 points before his turn in Scrabble. He scores 26 points. Then Marcy..."
+    },
+    {
+      "i": 16,
+      "expected": "5",
+      "got": "5",
+      "status": "OK",
+      "elapsed": 177.4,
+      "raw_tail": "y books each child received:**\n    Sarah divides these books equally among her 4 kids.\n    $$20 \\text{ books} / 4 \\text{ kids} = 5 \\text{ books per kid}$$\n\n**Answer:** Each child received **5** books.",
+      "question": "Sarah went to buy books from the store and spent $300 on the books. If each book..."
+    },
+    {
+      "i": 17,
+      "expected": "66",
+      "got": "66",
+      "status": "OK",
+      "elapsed": 75.8,
+      "raw_tail": "ook = $48\n  3 pencils \u00d7 $6/pencil = $1...\n</think>\n\n$18\n\nTed spends **$48** on books ($16 \u00d7 3) and **$18** on pencils ($6 \u00d7 3).\n\nAdding these together gives the total amount spent:\n$48 + $18 = **$66**",
+      "question": "Ted starts with $200. He buys 3 books for 16 dollars each and 3 pencils for 6 do..."
+    },
+    {
+      "i": 18,
+      "expected": "15",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Eve had 20 pieces of pomelos. After giving her friend some pomelos, Eve is left ..."
+    },
+    {
+      "i": 19,
+      "expected": "40",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "The Doubtfire sisters are driving home with 7 kittens adopted from the local ani..."
+    },
+    {
+      "i": 20,
+      "expected": "93",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Jen got 3 fish.  They each need $1 worth of food a day.  How much does she spend..."
+    },
+    {
+      "i": 21,
+      "expected": "2000",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Elliott is trying to walk 10,000 steps a day. He finished half of his steps on h..."
+    },
+    {
+      "i": 22,
+      "expected": "1520",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "While working at the restaurant, each of the forty customers who came into the r..."
+    },
+    {
+      "i": 23,
+      "expected": "11050",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "On a certain day, the total cost of filling up 20 helium balloons was $900. Two ..."
+    },
+    {
+      "i": 24,
+      "expected": "90",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "To raise money for their class fund, each of the 30 students from one class sold..."
+    },
+    {
+      "i": 25,
+      "expected": "40000",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "John decides to get a loan by mortgaging his home.  His house is worth $250,000...."
+    },
+    {
+      "i": 26,
+      "expected": "21",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Liam and Mitchell own competing lemonade stands across the street from one anoth..."
+    },
+    {
+      "i": 27,
+      "expected": "18",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Melanie is a door-to-door saleswoman. She sold a third of her vacuum cleaners at..."
+    },
+    {
+      "i": 28,
+      "expected": "14",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Some people got on a bus at the terminal. At the first bus stop, 5 more people g..."
+    },
+    {
+      "i": 29,
+      "expected": "23",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "A bumper car rink has 12 red cars. They have 2 fewer green cars than they have r..."
+    },
+    {
+      "i": 30,
+      "expected": "145",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Students in class 3B are collecting school points for behavior. If they get enou..."
+    },
+    {
+      "i": 31,
+      "expected": "123",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "If a classroom has 3 times as many girls as they do boys, and 1/10 as many nonge..."
+    },
+    {
+      "i": 32,
+      "expected": "98",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "The area of Billie's rectangular bedroom is 360 square feet. If the length of hi..."
+    },
+    {
+      "i": 33,
+      "expected": "7",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Manolo bought five lollipops and four candies that cost $3.20. If each lollipop ..."
+    },
+    {
+      "i": 34,
+      "expected": "34",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Three friends spent $20.25 on 3 tickets to the fair. They also spent $4.50 less ..."
+    },
+    {
+      "i": 35,
+      "expected": "38",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "At the local Pick Your Own fruit orchard, you could pick your own peaches for $2..."
+    },
+    {
+      "i": 36,
+      "expected": "320",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Elaine initially had 20 Pokemon cards. After a month, she collected three times ..."
+    },
+    {
+      "i": 37,
+      "expected": "50",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "In a student council election, candidate A got 20% of the votes while candidate ..."
+    },
+    {
+      "i": 38,
+      "expected": "50",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Gerald and Julia divided $100 in the ratio 3:2. If Gerald spent $10 on a book, h..."
+    },
+    {
+      "i": 39,
+      "expected": "84",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Madeline ate 6 grapes. Her brother wanted to make grape juice and used up 5 time..."
+    },
+    {
+      "i": 40,
+      "expected": "50",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Elvis has a monthly saving target of $1125. In April, he wants to save twice as ..."
+    },
+    {
+      "i": 41,
+      "expected": "8000",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Marilyn's first record sold 10 times as many copies as Harald's. If they sold 88..."
+    },
+    {
+      "i": 42,
+      "expected": "280",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "For the children's day celebration, children are to perform mass drills in the p..."
+    },
+    {
+      "i": 43,
+      "expected": "30",
+      "got": "30",
+      "status": "OK",
+      "elapsed": 92.2,
+      "raw_tail": " Age**: Martha is 4 years younger than Mike.\n   $$24 - 4 = 20$$\n   So, Martha is **20** years old.\n3. **Jeff's Age**: Jeff is 10 years older than Martha.\n   $$20 + 10 = 30$$\n\nJeff is **30** years old.",
+      "question": "Jeff is 10 years older than his younger sister, Martha. Martha, on the other han..."
+    },
+    {
+      "i": 44,
+      "expected": "192",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.0,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Ten stalls have 20 cows each. Mr. Sylas buys 40 cows and divides them equally, p..."
+    },
+    {
+      "i": 45,
+      "expected": "276",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Henry needs to assemble some toys, specifically 57 cars and 73 motorcycles. Henr..."
+    },
+    {
+      "i": 46,
+      "expected": "32",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Anakin and Locsin went to the beach today. Anakin caught 10 starfish, 6 sea hors..."
+    },
+    {
+      "i": 47,
+      "expected": "25",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "Nik has 200 crayons. He wants to separate them into groups of 8 and put them int..."
+    },
+    {
+      "i": 48,
+      "expected": "10",
+      "got": "10",
+      "status": "OK",
+      "elapsed": 154.3,
+      "raw_tail": " $8 + (0.50 \\times 8) = 12$). However, because this is too heavy, he switches to a weight that is two pounds lighter than this new amount ($12 - 2$).\n\nTherefore, Jamaal now uses a **10**-pound weight.",
+      "question": "Jamaal is at the gym. He has been using an 8-pound weight. He increases the weig..."
+    },
+    {
+      "i": 49,
+      "expected": "84",
+      "got": "",
+      "status": "ERROR",
+      "elapsed": 180.1,
+      "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)",
+      "question": "In a family, there are 2 brothers and 3 sisters. All sisters are the same age, w..."
+    }
+  ]
+}
\ No newline at end of file
diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_gsm8k_fullthink.json b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_gsm8k_fullthink.json
new file mode 100644
index 000000000000..0dcd13c4266c
--- /dev/null
+++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_gsm8k_fullthink.json
@@ -0,0 +1,463 @@
+{
+  "label": "task11_wo_split_8_fullthink",
+  "model": "default",
+  "api": "http://localhost:8000/v1",
+  "n": 50,
+  "seed": 42,
+  "correct": 47,
+  "errors": 0,
+  "accuracy": "47/50 (94.0%)",
+  "total_seconds": 3663.7,
+  "results": [
+    {
+      "i": 0,
+      "expected": "2280",
+      "got": "2180",
+      "status": "WRONG",
+      "elapsed": 118.0,
+      "raw_tail": "sed $300\n- Maryam raised $400 more than Sarah, so Maryam raised $300 + $400 = $700\n\nNow I need to add up all the amounts:\n$430 + $750 + $300 + $700 = $2180\n\nTherefore, they all raised $2180 in total.\n",
+      "question": "The girls are trying to raise money for a carnival. Kim raises $320 more than Al..."
+    },
+    {
+      "i": 1,
+      "expected": "1",
+      "got": "1",
+      "status": "OK",
+      "elapsed": 44.3,
+      "raw_tail": "e = 4/2 = 2 pieces per minute\n3. Combined rate: 4 + 2 = 6 pieces per minute\n4. Total pieces: 360\n5. Time in minutes: 360/6 = 60 minutes\n6. Convert to hours: 60 minutes = 1 hour\n\nThe answer is 1 hour.\n",
+      "question": "Kalinda is working on a 360 piece puzzle with her mom. Kalinda can normally add ..."
+    },
+    {
+      "i": 2,
+      "expected": "5",
+      "got": "5",
+      "status": "OK",
+      "elapsed": 56.0,
+      "raw_tail": " 30 miles.\n4. Now I need to find out how long it takes to travel back 30 miles at 6 mph.\n5. Time = Distance / Speed = 30 / 6 = 5 hours.\n6. So it takes Tom 5 hours to get back.\n\nThe answer is 5 hours.\n",
+      "question": "Tom's ship can travel at 10 miles per hour.  He is sailing from 1 to 4 PM.  He t..."
+    },
+    {
+      "i": 3,
+      "expected": "12",
+      "got": "12",
+      "status": "OK",
+      "elapsed": 80.0,
+      "raw_tail": ". Since each pack contains 5 candles, he needs 20 / 5 = 4 packs.\nFinally, I need to calculate the total cost. Each pack costs $3, so 4 packs cost 4 * $3 = $12.\nTherefore, James spends $12 on candles.\n",
+      "question": "James decides to buy birthday candles for his 2 sons.  One of them is 12 and the..."
+    },
+    {
+      "i": 4,
+      "expected": "273",
+      "got": "273",
+      "status": "OK",
+      "elapsed": 62.3,
+      "raw_tail": "1/2 of a skein.\nSo grandma used: 1/2 \u00d7 364 = 182 yards.\nNow I need to add these together to find the total:\n91 + 182 = 273 yards.\nTherefore, they used 273 yards of yarn altogether.\nThe answer is 273.\n",
+      "question": "Mariah\u2019s grandma was teaching her to knit. Mariah used 1/4 of a skein of yarn. H..."
+    },
+    {
+      "i": 5,
+      "expected": "45",
+      "got": "45",
+      "status": "OK",
+      "elapsed": 55.9,
+      "raw_tail": "ed the original 50 fairies, so the total became 50 + 25 = 75 fairies.\nThen, 30 fairies flew away, so we subtract 30 from 75.\n75 - 30 = 45.\nTherefore, there are 45 fairies remaining.\nThe answer is 45.\n",
+      "question": "While playing with her friends in their school playground, Katelyn saw 50 fairie..."
+    },
+    {
+      "i": 6,
+      "expected": "21",
+      "got": "21",
+      "status": "OK",
+      "elapsed": 42.8,
+      "raw_tail": "other is twice her age, so he is 9 * 2 = 18 years old.\nNow I need to find out how old he will be in 3 years.\n18 + 3 = 21 years old.\nSo Ann's brother will be 21 years old in 3 years.\nThe answer is 21.\n",
+      "question": "If Ann is 9 years old and her brother is twice her age, how old will her brother..."
+    },
+    {
+      "i": 7,
+      "expected": "145",
+      "got": "145",
+      "status": "OK",
+      "elapsed": 76.2,
+      "raw_tail": "o the total cost of cups is $36000 - $1200 = $34800.\nNow, I need to find the cost of each cup. Twenty dozen cups is 20 * 12 = 240 cups.\nSo the cost per cup is $34800 / 240 = $145.\nThe answer is $145.\n",
+      "question": "Twenty dozen cups cost $1200 less than the total cost of half a dozen plates sol..."
+    },
+    {
+      "i": 8,
+      "expected": "60",
+      "got": "60",
+      "status": "OK",
+      "elapsed": 64.5,
+      "raw_tail": "nts.\n4.  **Calculate the total enrollment for this year:** Add the number of new students to the initial number.\n    *   $50 + 10 = 60$ students.\n\nTherefore, there are 60 students enrolled this year.\n",
+      "question": "Last year there were 50 students enrolled in a calligraphy class. This year, the..."
+    },
+    {
+      "i": 9,
+      "expected": "122",
+      "got": "122",
+      "status": "OK",
+      "elapsed": 66.1,
+      "raw_tail": "crabs\n\nNow I need to calculate the total number of crabs:\nTotal = Bo's crabs + Monic's crabs + Rani's crabs\nTotal = 40 + 36 + 46 = 122 crabs\n\nThe total number of crabs the three have together is 122.\n",
+      "question": "Rani has ten more crabs than Monic, who has 4 fewer crabs than Bo. If Bo has 40 ..."
+    },
+    {
+      "i": 10,
+      "expected": "29",
+      "got": "29",
+      "status": "OK",
+      "elapsed": 177.4,
+      "raw_tail": "**\n    *   Sum of all periods: $3 \\text{ liters} + 6 \\text{ liters} + 20 \\text{ liters}$.\n    *   Calculation: $3 + 6 = 9$; $9 + 20 = 29$.\n\nThe total amount of water removed is 29 liters.\n\n29 liters.\n",
+      "question": "Brian's basement was damp and musty, so he bought a dehumidifier to remove moist..."
+    },
+    {
+      "i": 11,
+      "expected": "80",
+      "got": "80",
+      "status": "OK",
+      "elapsed": 78.6,
+      "raw_tail": ". George) = 520 burritos.\nFinally, I need to find the number of leftover burritos.\nTotal ordered = 600 burritos.\nLeftover = Total ordered - Total consumed = 600 - 520 = 80 burritos.\nThe answer is 80.\n",
+      "question": "George, a grade six teacher, ordered 600 burritos for the sixth-grade class picn..."
+    },
+    {
+      "i": 12,
+      "expected": "36",
+      "got": "36",
+      "status": "OK",
+      "elapsed": 100.1,
+      "raw_tail": "   *   Duration: 4 weeks.\n    *   Total time: $9 \\text{ hours/week} \\times 4 \\text{ weeks} = 36 \\text{ hours}$.\n\nThe total time Jim spends on TV and reading in 4 weeks is 36 hours.\n\nAnswer: 36 hours.\n",
+      "question": "Jim spends 2 hours watching TV and then decides to go to bed and reads for half ..."
+    },
+    {
+      "i": 13,
+      "expected": "1430",
+      "got": "1430",
+      "status": "OK",
+      "elapsed": 55.7,
+      "raw_tail": "hich is 10% of the total:\n10% of $1300 = 0.10 \u00d7 $1300 = $130\n\nFinally, I need to add the insurance cost to the total:\n$1300 + $130 = $1430\n\nTherefore, Janet paid $1430 in total.\n\nThe answer is: $1430\n",
+      "question": "Janet buys a brooch for her daughter.  She pays $500 for the material to make it..."
+    },
+    {
+      "i": 14,
+      "expected": "5",
+      "got": "5",
+      "status": "OK",
+      "elapsed": 53.1,
+      "raw_tail": "ds\n\n2. Next, I need to convert this to a daily amount.\n   - There are 7 days in a week\n   - Daily amount = 35 \u00f7 7 = 5 pounds\n\nTherefore, the flock needs 5 pounds of insects per day.\n\nThe answer is: 5\n",
+      "question": "Ducks need to eat 3.5 pounds of insects each week to survive. If there is a floc..."
+    },
+    {
+      "i": 15,
+      "expected": "5",
+      "got": "5",
+      "status": "OK",
+      "elapsed": 56.8,
+      "raw_tail": "'s new score = 225 + 10 = 235 points.\nNow I need to find the difference between Joey's score and Marcy's score.\nDifference = 240 - 235 = 5 points.\nSo Joey is now winning by 5 points.\nThe answer is 5.\n",
+      "question": "Joey has 214 points before his turn in Scrabble. He scores 26 points. Then Marcy..."
+    },
+    {
+      "i": 16,
+      "expected": "5",
+      "got": "5",
+      "status": "OK",
+      "elapsed": 47.1,
+      "raw_tail": "ks.\nNext, I need to find out how many books each child got.\nShe has 4 kids and she gave an equal number of books to each.\nSo, books per child = 20 books / 4 kids = 5 books per child.\nThe answer is 5.\n",
+      "question": "Sarah went to buy books from the store and spent $300 on the books. If each book..."
+    },
+    {
+      "i": 17,
+      "expected": "66",
+      "got": "66",
+      "status": "OK",
+      "elapsed": 36.5,
+      "raw_tail": "each, so the cost for books is 3 * 16 = 48 dollars.\nHe also buys 3 pencils at $6 each, so the cost for pencils is 3 * 6 = 18 dollars.\nThe total amount spent is 48 + 18 = 66 dollars.\nThe answer is 66.\n",
+      "question": "Ted starts with $200. He buys 3 books for 16 dollars each and 3 pencils for 6 do..."
+    },
+    {
+      "i": 18,
+      "expected": "15",
+      "got": "15",
+      "status": "OK",
+      "elapsed": 40.2,
+      "raw_tail": "4 of the original amount left.\n3. Calculate how many pomelos she has left: 20 \u00d7 1/4 = 5 pomelos.\n4. Calculate how many pomelos she gave away: 20 - 5 = 15 pomelos.\nTherefore, Eve gave away 15 pomelos.\n",
+      "question": "Eve had 20 pieces of pomelos. After giving her friend some pomelos, Eve is left ..."
+    },
+    {
+      "i": 19,
+      "expected": "40",
+      "got": "40",
+      "status": "OK",
+      "elapsed": 36.8,
+      "raw_tail": "opted kittens.\n2. Patchy has had thrice the number of adopted kittens, which is 3 * 7 = 21 kittens.\n3. Trixie has had 12 kittens.\n4. The total number of kittens is 7 + 21 + 12 = 40.\nThe answer is 40.\n",
+      "question": "The Doubtfire sisters are driving home with 7 kittens adopted from the local ani..."
+    },
+    {
+      "i": 20,
+      "expected": "93",
+      "got": "93",
+      "status": "OK",
+      "elapsed": 48.6,
+      "raw_tail": "$3 per day\n2. Next, I need to find out how many days are in May.\n   - May has 31 days\n3. Finally, I need to calculate the total cost for the month.\n   - $3 per day \u00d7 31 days = $93\n\nThe answer is $93.\n",
+      "question": "Jen got 3 fish.  They each need $1 worth of food a day.  How much does she spend..."
+    },
+    {
+      "i": 21,
+      "expected": "2000",
+      "got": "5000",
+      "status": "WRONG",
+      "elapsed": 212.8,
+      "raw_tail": "= 8,000 - 6,000 = 2,000 steps.\n\nThe answer is 2,000 steps.\n\nLet me double-check:\n- School walks: 5,000 steps\n- Walk with friend: 1,000 steps\n- Jog: 2,000 steps\n- Remaining: 2,000 steps\nTotal: 5,000 + ",
+      "question": "Elliott is trying to walk 10,000 steps a day. He finished half of his steps on h..."
+    },
+    {
+      "i": 22,
+      "expected": "1520",
+      "got": "1520",
+      "status": "OK",
+      "elapsed": 35.8,
+      "raw_tail": "Rafaela's tips: 40 customers \u00d7 $20 = $800\nJulieta's tips: $800 - (10% of $800) = $800 - $80 = $720\nTotal tips: $800 + $720 = $1520\nThe answer is 1520.\n",
+      "question": "While working at the restaurant, each of the forty customers who came into the r..."
+    },
+    {
+      "i": 23,
+      "expected": "11050",
+      "got": "11050",
+      "status": "OK",
+      "elapsed": 53.3,
+      "raw_tail": "days later, the cost per balloon increased by $20, so the new cost per balloon is $45 + $20 = $65.\nBentley filled 170 balloons at the new price, so he paid 170 * $65 = $11,050.\nThe answer is $11,050.\n",
+      "question": "On a certain day, the total cost of filling up 20 helium balloons was $900. Two ..."
+    },
+    {
+      "i": 24,
+      "expected": "90",
+      "got": "90",
+      "status": "OK",
+      "elapsed": 73.5,
+      "raw_tail": "ofit per lollipop is $0.8 - $0.5 = $0.3.\nFinally, I need to find the total profit. Since they sold 300 lollipops and made $0.3 profit on each, the total profit is 300 * $0.3 = $90.\nThe answer is $90.\n",
+      "question": "To raise money for their class fund, each of the 30 students from one class sold..."
+    },
+    {
+      "i": 25,
+      "expected": "40000",
+      "got": "40000",
+      "status": "OK",
+      "elapsed": 63.9,
+      "raw_tail": "d to calculate how much he used to pay off debts: 60% of $100,000 = 0.60 \u00d7 $100,000 = $60,000\n3. Finally, I need to calculate the leftover amount: $100,000 - $60,000 = $40,000\n\nThe answer is $40,000.\n",
+      "question": "John decides to get a loan by mortgaging his home.  His house is worth $250,000...."
+    },
+    {
+      "i": 26,
+      "expected": "21",
+      "got": "21",
+      "status": "OK",
+      "elapsed": 36.3,
+      "raw_tail": "d 21 lemonades at $4 each, so:\n21 \u00d7 $4 = $84\n\nNow I need to find the difference between what Mitchell made and what Liam made:\n$84 - $63 = $21\n\nSo Mitchell made $21 more than Liam.\n\nThe answer is 21.\n",
+      "question": "Liam and Mitchell own competing lemonade stands across the street from one anoth..."
+    },
+    {
+      "i": 27,
+      "expected": "18",
+      "got": "18",
+      "status": "OK",
+      "elapsed": 97.4,
+      "raw_tail": "ify:\n- Start with 18\n- Sold 1/3 (6) at green house, remaining: 12\n- Sold 2 at red house, remaining: 10\n- Sold half of remaining (5) at orange house, remaining: 5\n- Final count: 5 \u2713\n\nThe answer is 18.\n",
+      "question": "Melanie is a door-to-door saleswoman. She sold a third of her vacuum cleaners at..."
+    },
+    {
+      "i": 28,
+      "expected": "14",
+      "got": "14",
+      "status": "OK",
+      "elapsed": 61.9,
+      "raw_tail": "e total became (x + 5) - 7 + 8 = x + 6.\n4. We know that at the third stop, there were 20 people, so x + 6 = 20.\n5. Solving for x: x = 20 - 6 = 14.\nTherefore, 14 people got on the bus at the terminal.\n",
+      "question": "Some people got on a bus at the terminal. At the first bus stop, 5 more people g..."
+    },
+    {
+      "i": 29,
+      "expected": "23",
+      "got": "23",
+      "status": "OK",
+      "elapsed": 39.9,
+      "raw_tail": "1. Red cars: 12\n2. Green cars: 12 - 2 = 10\n3. Blue cars: 3 * 10 = 30\n4. Total of red, green, and blue cars: 12 + 10 + 30 = 52\n5. Yellow cars: 75 - 52 = 23\nThe answer is 23.\n",
+      "question": "A bumper car rink has 12 red cars. They have 2 fewer green cars than they have r..."
+    },
+    {
+      "i": 30,
+      "expected": "145",
+      "got": "145",
+      "status": "OK",
+      "elapsed": 70.2,
+      "raw_tail": " Marta has 3 * 35 = 105 points.\nTotal points collected by the class = 50 + 65 + 35 + 105 = 255 points.\nThe minimum threshold is 400 points.\nPoints missing = 400 - 255 = 145 points.\nThe answer is 145.\n",
+      "question": "Students in class 3B are collecting school points for behavior. If they get enou..."
+    },
+    {
+      "i": 31,
+      "expected": "123",
+      "got": "123",
+      "status": "OK",
+      "elapsed": 86.7,
+      "raw_tail": "e number of boys, girls, and nongendered children together.\n    *   Total = Boys + Girls + Nongendered\n    *   Total = $30 + 90 + 3$\n    *   Total = $123$.\n\nThe classroom has a total of 123 children.\n",
+      "question": "If a classroom has 3 times as many girls as they do boys, and 1/10 as many nonge..."
+    },
+    {
+      "i": 32,
+      "expected": "98",
+      "got": "98",
+      "status": "OK",
+      "elapsed": 66.9,
+      "raw_tail": "Area = Length \u00d7 Width\nSo, 360 = 9 \u00d7 Width\nWidth = 360 \u00f7 9 = 40 feet.\nNow I can find the perimeter.\nPerimeter = 2 \u00d7 (Length + Width)\nPerimeter = 2 \u00d7 (9 + 40) = 2 \u00d7 49 = 98 feet.\nThe answer is 98 feet.\n",
+      "question": "The area of Billie's rectangular bedroom is 360 square feet. If the length of hi..."
+    },
+    {
+      "i": 33,
+      "expected": "7",
+      "got": "7",
+      "status": "OK",
+      "elapsed": 89.9,
+      "raw_tail": "ow I need to calculate the cost of 10 lollipops and 10 candies.\n10 lollipops cost 10 \u00d7 $0.40 = $4.00.\n10 candies cost 10 \u00d7 $0.30 = $3.00.\nThe total cost is $4.00 + $3.00 = $7.00.\nThe answer is $7.00.\n",
+      "question": "Manolo bought five lollipops and four candies that cost $3.20. If each lollipop ..."
+    },
+    {
+      "i": 34,
+      "expected": "34",
+      "got": "34",
+      "status": "OK",
+      "elapsed": 84.4,
+      "raw_tail": "up all expenses:\nTickets: $20.25\nFood: $15.75\nRides: $66\nTotal = $20.25 + $15.75 + $66 = $102.\nSince there are 3 friends splitting the cost evenly, each person pays $102 / 3 = $34.\nThe answer is $34.\n",
+      "question": "Three friends spent $20.25 on 3 tickets to the fair. They also spent $4.50 less ..."
+    },
+    {
+      "i": 35,
+      "expected": "38",
+      "got": "38",
+      "status": "OK",
+      "elapsed": 46.6,
+      "raw_tail": "- Peaches: 6 pounds * $2.00/pound = $12.00\n- Plums: 8 pounds * $1.00/pound = $8.00\n- Apricots: 6 pounds * $3.00/pound = $18.00\n- Total: $12.00 + $8.00 + $18.00 = $38.00\nThe answer is $38.00.\n",
+      "question": "At the local Pick Your Own fruit orchard, you could pick your own peaches for $2..."
+    },
+    {
+      "i": 36,
+      "expected": "320",
+      "got": "320",
+      "status": "OK",
+      "elapsed": 83.3,
+      "raw_tail": "+ 40) = 2 * 100 = 200 cards.\nNow, I need to add up all the cards she has:\nInitial cards: 20\nFirst month: 60\nSecond month: 40\nThird month: 200\nTotal: 20 + 60 + 40 + 200 = 320 cards.\nThe answer is 320.\n",
+      "question": "Elaine initially had 20 Pokemon cards. After a month, she collected three times ..."
+    },
+    {
+      "i": 37,
+      "expected": "50",
+      "got": "50",
+      "status": "OK",
+      "elapsed": 115.5,
+      "raw_tail": "andidates A and B from the total number of voters:\nTotal votes = 100\nVotes for A = 20\nVotes for B = 30\nVotes for C = 100 - 20 - 30 = 50 votes.\n\nTherefore, candidate C got 50 votes.\n\nThe answer is 50.\n",
+      "question": "In a student council election, candidate A got 20% of the votes while candidate ..."
+    },
+    {
+      "i": 38,
+      "expected": "50",
+      "got": "50",
+      "status": "OK",
+      "elapsed": 67.3,
+      "raw_tail": "'s share = 3 parts \u00d7 $20 = $60.\nJulia's share = 2 parts \u00d7 $20 = $40.\nNow, Gerald spent $10 on a book.\nSo, Gerald's remaining money = $60 - $10 = $50.\nTherefore, Gerald had $50 left.\nThe answer is 50.\n",
+      "question": "Gerald and Julia divided $100 in the ratio 3:2. If Gerald spent $10 on a book, h..."
+    },
+    {
+      "i": 39,
+      "expected": "84",
+      "got": "84",
+      "status": "OK",
+      "elapsed": 46.8,
+      "raw_tail": "3. Their mother made 4 pies, and each pie requires 12 grapes: 4 * 12 = 48 grapes.\n4. Total grapes = Madeline's grapes + Brother's grapes + Mother's grapes = 6 + 30 + 48 = 84 grapes.\nThe answer is 84.\n",
+      "question": "Madeline ate 6 grapes. Her brother wanted to make grape juice and used up 5 time..."
+    },
+    {
+      "i": 40,
+      "expected": "50",
+      "got": "50",
+      "status": "OK",
+      "elapsed": 92.9,
+      "raw_tail": "gs target is $1125, so 45x = 1125\n6. Solving for x: x = 1125 \u00f7 45 = 25\n7. So Elvis saves $25 per day in the first half.\n8. In the second half, he saves $2x = 2 \u00d7 25 = $50 per day.\n\nThe answer is $50.\n",
+      "question": "Elvis has a monthly saving target of $1125. In April, he wants to save twice as ..."
+    },
+    {
+      "i": 41,
+      "expected": "8000",
+      "got": "8000",
+      "status": "OK",
+      "elapsed": 90.6,
+      "raw_tail": "H = 88,000\nH = 88,000 / 11\nH = 8,000\n\nSo Harald sold 8,000 copies.\nLet me verify:\n- Harald: 8,000 copies\n- Marilyn: 10 \u00d7 8,000 = 80,000 copies\n- Total: 8,000 + 80,000 = 88,000 \u2713\n\nThe answer is 8,000.\n",
+      "question": "Marilyn's first record sold 10 times as many copies as Harald's. If they sold 88..."
+    },
+    {
+      "i": 42,
+      "expected": "280",
+      "got": "280",
+      "status": "OK",
+      "elapsed": 48.3,
+      "raw_tail": "dren per row \u00d7 7 rows = 56 children per school\n\n3. Calculate total children for all schools:\n   - 56 children per school \u00d7 5 schools = 280 children\n\nTherefore, there can be 280 children in the drill.\n",
+      "question": "For the children's day celebration, children are to perform mass drills in the p..."
+    },
+    {
+      "i": 43,
+      "expected": "30",
+      "got": "30",
+      "status": "OK",
+      "elapsed": 34.1,
+      "raw_tail": "is 24 years old.\n2. Martha is 4 years younger than Mike, so Martha is 24 - 4 = 20 years old.\n3. Jeff is 10 years older than Martha, so Jeff is 20 + 10 = 30 years old.\nTherefore, Jeff is 30 years old.\n",
+      "question": "Jeff is 10 years older than his younger sister, Martha. Martha, on the other han..."
+    },
+    {
+      "i": 44,
+      "expected": "192",
+      "got": "8",
+      "status": "WRONG",
+      "elapsed": 211.7,
+      "raw_tail": "uous - which 8 stalls? Let me assume it means any 8 stalls, or perhaps 8 of the stalls that originally had cows.\n\nIf we take 8 stalls that originally had 20 cows:\n- Each now has 22 cows\n- Total = 8 \u00d7 ",
+      "question": "Ten stalls have 20 cows each. Mr. Sylas buys 40 cows and divides them equally, p..."
+    },
+    {
+      "i": 45,
+      "expected": "276",
+      "got": "276",
+      "status": "OK",
+      "elapsed": 63.5,
+      "raw_tail": "els for cars + 146 wheels for motorcycles = 374 wheels total.\nFinally, I need to find out how many wheels will be left.\n650 wheels in the box - 374 wheels needed = 276 wheels left.\nThe answer is 276.\n",
+      "question": "Henry needs to assemble some toys, specifically 57 cars and 73 motorcycles. Henr..."
+    },
+    {
+      "i": 46,
+      "expected": "32",
+      "got": "32",
+      "status": "OK",
+      "elapsed": 73.8,
+      "raw_tail": "fish than Anakin, so Locsin caught 3 + 2 = 5 clownfish.\nNow, let's add up all the fish they caught:\nAnakin: 10 + 6 + 3 = 19 fish\nLocsin: 5 + 3 + 5 = 13 fish\nTotal: 19 + 13 = 32 fish\nThe answer is 32.\n",
+      "question": "Anakin and Locsin went to the beach today. Anakin caught 10 starfish, 6 sea hors..."
+    },
+    {
+      "i": 47,
+      "expected": "25",
+      "got": "25",
+      "status": "OK",
+      "elapsed": 104.7,
+      "raw_tail": "Total weight = 200 + 200 = 400 ounces.\n\nFinally, I need to convert the total weight to pounds.\nThere are 16 ounces in 1 pound.\nTotal weight in pounds = 400 \u00f7 16 = 25 pounds.\n\nThe answer is 25 pounds.\n",
+      "question": "Nik has 200 crayons. He wants to separate them into groups of 8 and put them int..."
+    },
+    {
+      "i": 48,
+      "expected": "10",
+      "got": "10",
+      "status": "OK",
+      "elapsed": 55.8,
+      "raw_tail": " 8 + 4 = 12 pounds.\n4. This weight is too heavy, so he uses a weight that is 2 pounds lighter than 12 pounds.\n5. 12 - 2 = 10 pounds.\n6. Therefore, Jamaal now uses a 10-pound weight.\nThe answer is 10.\n",
+      "question": "Jamaal is at the gym. He has been using an 8-pound weight. He increases the weig..."
+    },
+    {
+      "i": 49,
+      "expected": "84",
+      "got": "84",
+      "status": "OK",
+      "elapsed": 58.6,
+      "raw_tail": "lculate the total age of all siblings:\n- 3 sisters, each 16 years old: 3 * 16 = 48 years\n- 1 brother is 12 years old\n- 1 brother is 24 years old\nTotal age = 48 + 12 + 24 = 84 years.\nThe answer is 84.\n",
+      "question": "In a family, there are 2 brothers and 3 sisters. All sisters are the same age, w..."
+    }
+  ]
+}
\ No newline at end of file
diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/config.json b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/config.json
new file mode 100644
index 000000000000..5b6403b2bf27
--- /dev/null
+++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/config.json
@@ -0,0 +1,37 @@
+{
+  "git_sha": "3300f7776eb2b2c875097a98cec90c913f34aacf",
+  "wo_split": 8,
+  "total_wo_ctas": 32,
+  "slice_ctas": 8,
+  "gather_ctas": 32,
+  "total_grid_ctas_per_seq": 32,
+  "active_wo_ctas": 32,
+  "hidden_size": 5120,
+  "num_kv_heads": 4,
+  "num_q_heads": 24,
+  "head_dim": 256,
+  "K": 6144,
+  "num_k_groups": 384,
+  "num_k_tiles": 96,
+  "num_active_tokens": 1,
+  "seed": 4242,
+  "launches": 50,
+  "cooperative": true,
+  "ncu": false,
+  "warmup_launches": 0,
+  "dtypes": {
+    "attn_output": "bfloat16",
+    "wo_weight": "uint8",
+    "wo_scales": "uint8",
+    "wo_gs": "float32",
+    "wo_output": "float32",
+    "final_out": "float32"
+  },
+  "cache_key": "a0950af2b637ba65",
+  "diagnostic_no_cooperative": false,
+  "image_id": "sha256:9c0f1d31c92c29488f66a2c136183950cea787035d735ff95dd6af193740f530",
+  "torch_version": "2.12.0.dev20260402+cu132",
+  "python_version": "3.12.3 (main, Mar  3 2026, 12:15:18) [GCC 13.3.0]",
+  "effective_bytes_formula": "PAYLOAD = B*K*2 + H*K//2 + nmt*nkt*32*4*4 + 4 + B*H*4; SCRATCH = B*total_wo_ctas*H*4 + GATHER_CTAS*total_wo_ctas*B*H*4; EFFECTIVE = PAYLOAD + SCRATCH; GATHER_CTAS = slice_ctas*num_kv_heads (this run: 32)",
+  "cache_status_first_event": "MISS"
+}
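The `effective_bytes_formula` string in this config can be checked by hand. The sketch below evaluates it in Python, assuming `B` is `num_active_tokens`, `H` is `hidden_size` and `nkt` is `num_k_tiles`. The config does not define `nmt`; the value 40 used here is an assumption, chosen because it reproduces the `payload_bytes` column of the sibling `timing.csv`. The function name is hypothetical and is not part of the harness.

```python
def effective_bytes(B, K, H, nmt, nkt, total_wo_ctas, gather_ctas):
    """Evaluate the PAYLOAD / SCRATCH / EFFECTIVE formula from config.json."""
    payload = B * K * 2 + H * K // 2 + nmt * nkt * 32 * 4 * 4 + 4 + B * H * 4
    scratch = B * total_wo_ctas * H * 4 + gather_ctas * total_wo_ctas * B * H * 4
    return payload, scratch, payload + scratch


slice_ctas, num_kv_heads = 8, 4
payload, scratch, effective = effective_bytes(
    B=1,              # num_active_tokens
    K=6144,
    H=5120,           # hidden_size
    nmt=40,           # ASSUMPTION: not in config.json; reproduces payload_bytes
    nkt=96,           # num_k_tiles
    total_wo_ctas=32,
    gather_ctas=slice_ctas * num_kv_heads,  # 32 for this run
)
print(payload, scratch, effective)
```

With these inputs the three values come out as 17727492, 21626880 and 39354372, matching the `payload_bytes`, `scratch_bytes` and `effective_bytes` columns recorded for this run.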
diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/correctness_gate_split_order.json b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/correctness_gate_split_order.json
new file mode 100644
index 000000000000..590a15e7a5b3
--- /dev/null
+++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/correctness_gate_split_order.json
@@ -0,0 +1,10 @@
+{
+  "passes": true,
+  "max_abs": 0.0,
+  "max_rel": 0.0,
+  "rtol": 0.001,
+  "atol": 0.0001,
+  "ref_function": "reference_split_order(wo_split=8)",
+  "wo_split": 8,
+  "kind": "AUTHORITATIVE"
+}
diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/correctness_vs_chained.json b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/correctness_vs_chained.json
new file mode 100644
index 000000000000..ddd2759cab51
--- /dev/null
+++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/correctness_vs_chained.json
@@ -0,0 +1,6 @@
+{
+  "max_abs": 0.05859375,
+  "max_rel": 0.0013783597387373447,
+  "ref_function": "reference_chained_fma",
+  "kind": "DIAGNOSTIC"
+}
diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/correctness_vs_matmul.json b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/correctness_vs_matmul.json
new file mode 100644
index 000000000000..7ee237cca16a
--- /dev/null
+++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/correctness_vs_matmul.json
@@ -0,0 +1,6 @@
+{
+  "max_abs": 0.02734375,
+  "max_rel": 0.00547570176422596,
+  "ref_function": "reference_matmul",
+  "kind": "DIAGNOSTIC"
+}
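The three correctness files above differ in kind: the split-order reference is the bit-exact gate, while the chained-FMA and matmul references are diagnostic and show small nonzero errors. One plausible reading is that float32 addition is not associative, so only a reference that reduces in the same order as the kernel can match bit for bit. The sketch below illustrates that idea with NumPy; `chained_sum` and `split_order_sum` are hypothetical helpers, not the functions in `torch_reference.py`.

```python
import numpy as np


def chained_sum(x):
    """Accumulate left to right in float32, one element at a time."""
    acc = np.float32(0.0)
    for v in x:
        acc = np.float32(acc + v)
    return acc


def split_order_sum(x, wo_split):
    """Sum each of wo_split equal K-slices, then reduce the partials in order."""
    partials = [chained_sum(chunk) for chunk in np.split(x, wo_split)]
    return chained_sum(np.array(partials, dtype=np.float32))


rng = np.random.default_rng(4242)
x = rng.standard_normal(6144).astype(np.float32)  # K = 6144, as in config.json

chained = chained_sum(x)
split8 = split_order_sum(x, 8)
exact = float(np.sum(x.astype(np.float64)))

print("chained    :", float(chained))
print("split by 8 :", float(split8))
print("float64 ref:", exact)
print("bit-identical across orders:", bool(chained == split8))
```

Repeating the same reduction order always gives the same bits, whereas two different orders are only guaranteed to agree up to rounding error. That is why a tolerance-free gate needs an order-matched reference.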
diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/timing.csv b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/timing.csv
new file mode 100644
index 000000000000..92e4a556f7bf
--- /dev/null
+++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/timing.csv
@@ -0,0 +1,51 @@
+launch_idx,elapsed_us,payload_bytes,scratch_bytes,effective_bytes,effective_gbps,is_warmup
+0,2687.776089,17727492,21626880,39354372,14.641983,0
+1,1689.120054,17727492,21626880,39354372,23.298742,0
+2,1650.464058,17727492,21626880,39354372,23.844428,0
+3,1642.079949,17727492,21626880,39354372,23.966173,0
+4,1636.736035,17727492,21626880,39354372,24.044422,0
+5,1640.416026,17727492,21626880,39354372,23.990483,0
+6,1634.464025,17727492,21626880,39354372,24.077845,0
+7,1647.168040,17727492,21626880,39354372,23.892142,0
+8,1645.344019,17727492,21626880,39354372,23.918628,0
+9,1642.431974,17727492,21626880,39354372,23.961036,0
+10,1648.895979,17727492,21626880,39354372,23.867104,0
+11,1767.647982,17727492,21626880,39354372,22.263693,0
+12,1647.168040,17727492,21626880,39354372,23.892142,0
+13,1642.560005,17727492,21626880,39354372,23.959169,0
+14,1640.064001,17727492,21626880,39354372,23.995632,0
+15,1638.656020,17727492,21626880,39354372,24.016250,0
+16,1644.032001,17727492,21626880,39354372,23.937717,0
+17,1645.984054,17727492,21626880,39354372,23.909328,0
+18,1646.816015,17727492,21626880,39354372,23.897249,0
+19,1639.968038,17727492,21626880,39354372,23.997036,0
+20,1645.087957,17727492,21626880,39354372,23.922351,0
+21,1645.280004,17727492,21626880,39354372,23.919559,0
+22,1645.087957,17727492,21626880,39354372,23.922351,0
+23,1645.663977,17727492,21626880,39354372,23.913978,0
+24,1643.167973,17727492,21626880,39354372,23.950304,0
+25,1644.991994,17727492,21626880,39354372,23.923747,0
+26,1711.967945,17727492,21626880,39354372,22.987797,0
+27,1640.992045,17727492,21626880,39354372,23.982061,0
+28,1641.855955,17727492,21626880,39354372,23.969443,0
+29,1638.592005,17727492,21626880,39354372,24.017188,0
+30,1639.263988,17727492,21626880,39354372,24.007343,0
+31,1641.792059,17727492,21626880,39354372,23.970375,0
+32,1642.207980,17727492,21626880,39354372,23.964304,0
+33,1639.616013,17727492,21626880,39354372,24.002188,0
+34,1641.407967,17727492,21626880,39354372,23.975985,0
+35,1641.183972,17727492,21626880,39354372,23.979257,0
+36,1643.488050,17727492,21626880,39354372,23.945639,0
+37,1640.416026,17727492,21626880,39354372,23.990483,0
+38,1639.680028,17727492,21626880,39354372,24.001251,0
+39,1641.919971,17727492,21626880,39354372,23.968508,0
+40,1641.280055,17727492,21626880,39354372,23.977853,0
+41,1722.527981,17727492,21626880,39354372,22.846870,0
+42,1637.760043,17727492,21626880,39354372,24.029388,0
+43,1643.776059,17727492,21626880,39354372,23.941444,0
+44,1650.431991,17727492,21626880,39354372,23.844892,0
+45,1639.168024,17727492,21626880,39354372,24.008748,0
+46,1640.735984,17727492,21626880,39354372,23.985804,0
+47,1638.623953,17727492,21626880,39354372,24.016720,0
+48,1643.808007,17727492,21626880,39354372,23.940978,0
+49,1640.159965,17727492,21626880,39354372,23.994228,0
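The `effective_gbps` column appears to be `effective_bytes` divided by `elapsed_us`, expressed in GB/s (10^9 bytes per second). A minimal sketch that recomputes it for the first three rows under that assumption; the helper name and the reduced column set are illustrative only.

```python
import csv
import io

# First three data rows of timing.csv, reduced to the columns used here.
ROWS = """launch_idx,elapsed_us,effective_bytes,effective_gbps
0,2687.776089,39354372,14.641983
1,1689.120054,39354372,23.298742
2,1650.464058,39354372,23.844428
"""


def gbps(effective_bytes, elapsed_us):
    # bytes per microsecond equals MB/s (1e6 B/s); divide by 1e3 to get GB/s.
    return effective_bytes / elapsed_us / 1e3


recomputed = []
for row in csv.DictReader(io.StringIO(ROWS)):
    value = gbps(int(row["effective_bytes"]), float(row["elapsed_us"]))
    recomputed.append((int(row["launch_idx"]), value, float(row["effective_gbps"])))
    print(f"launch {row['launch_idx']}: recomputed {value:.6f} GB/s, "
          f"recorded {row['effective_gbps']} GB/s")
```

Launch 0 is markedly slower than the rest even though `is_warmup` is 0 for every row, so any steady-state statistic should either use the median or skip that launch explicitly.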
diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_region_breakdown.csv b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_region_breakdown.csv
new file mode 100644
index 000000000000..a0ddf16a35cd
--- /dev/null
+++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_region_breakdown.csv
@@ -0,0 +1,14 @@
+region_id,region,n_active_ctas,cta_class,tick_source,mean_ticks,median_ticks,p99_ticks,raw_total_ticks,median_us,frac_of_kernel
+0,phase0_pre_attn,1,phase0,globaltimer,12288.0,12288.0,12288.0,12288,12.288,0.0003072
+1,phase1_attn_pre_wo,4,phase1,globaltimer,247904.0,250912.0,250912.0,991616,250.912,0.0062728
+2,phase1_wo_gemv,4,phase1,globaltimer,14376232.0,14341536.0,15108648.0,57504928,14341.536,0.3585384
+3,phase1_wo_post,2,phase1,globaltimer,32.0,32.0,32.0,64,0.032,8e-07
+4,grid_barrier_wait,64,barrier_wait,globaltimer,14538419.5,15453952.0,15454016.0,930458848,15453.952,
+5,phase3_load_x,64,phase3,globaltimer,1938.0,1952.0,1984.0,124032,1.952,4.88e-05
+6,phase3_partial_reset,64,phase3,globaltimer,119.5,112.0,256.0,7648,0.112,2.8e-06
+7,phase3_3a_fc1_silu,64,phase3,globaltimer,572357.0,573072.0,618823.04,36630848,573.072,0.0143268
+8,phase3_3b_quant,64,phase3,globaltimer,546.0,544.0,576.0,34944,0.544,1.36e-05
+9,phase3_3c_fc2_atomic,64,phase3,globaltimer,115795.0,115440.0,176603.2,7410880,115.44,0.002886
+10,phase3_3d_arrival,64,phase3,globaltimer,318.5,96.0,2592.0,20384,0.096,2.4e-06
+11,phase1_pre_wo_wait,0,barrier_wait,globaltimer,0.0,0.0,0.0,0,0.0,
+12,phase1_gather_reduce,1,dynamic_single,globaltimer,74368.0,74368.0,74368.0,74368,74.368,
diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_region_timings.npy b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_region_timings.npy
new file mode 100644
index 000000000000..3354af9fb875
Binary files /dev/null and b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_region_timings.npy differ
diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_v2_region_breakdown.csv b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_v2_region_breakdown.csv
new file mode 100644
index 000000000000..4a17e3cb3177
--- /dev/null
+++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_v2_region_breakdown.csv
@@ -0,0 +1,14 @@
+region_id,region,n_active_ctas,cta_class,tick_source,mean_ticks,median_ticks,p99_ticks,raw_total_ticks,median_us,frac_of_kernel
+0,phase0_pre_attn,1,phase0,globaltimer,12448.0,12448.0,12448.0,12448,12.448,0.0003112
+1,phase1_attn_pre_wo,4,phase1,globaltimer,246624.0,249664.0,249664.0,986496,249.664,0.0062416
+2,phase1_wo_gemv,32,phase1,globaltimer,2266876.0,2359712.0,2409204.16,72540032,2359.712,0.0589928
+3,phase1_wo_post,17,phase1,globaltimer,69.6470588235294,32.0,213.76,1184,0.032,8e-07
+4,grid_barrier_wait,64,barrier_wait,globaltimer,1561391.0,1752528.0,2822528.0,99929024,1752.528,
+5,phase3_load_x,64,phase3,globaltimer,1937.0,1952.0,2016.0,123968,1.952,4.88e-05
+6,phase3_partial_reset,64,phase3,globaltimer,93.5,64.0,160.0,5984,0.064,1.6e-06
+7,phase3_3a_fc1_silu,64,phase3,globaltimer,569275.0,567936.0,617433.92,36433600,567.936,0.0141984
+8,phase3_3b_quant,64,phase3,globaltimer,549.5,544.0,576.0,35168,0.544,1.36e-05
+9,phase3_3c_fc2_atomic,64,phase3,globaltimer,116767.0,114608.0,177275.2,7473088,114.608,0.0028652
+10,phase3_3d_arrival,64,phase3,globaltimer,317.5,96.0,2571.84,20320,0.096,2.4e-06
+11,phase1_pre_wo_wait,28,barrier_wait,globaltimer,249904.0,249888.0,250007.36,6997312,249.888,
+12,phase1_gather_reduce,3,dynamic_single,globaltimer,166368.0,166720.0,173431.04,499104,166.72,
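The cluster figure quoted for wo_split=8 is the sum of the median times of regions 2, 4, 11 and 12. The sketch below recomputes that sum from the rows above, keeping only the columns it needs; it is an illustration and not the project's `extract_regions.py`.

```python
import csv
import io

# Rows for the four cluster regions, copied from the CSV above
# (columns reduced to region_id, region, n_active_ctas, median_us).
ROWS = """region_id,region,n_active_ctas,median_us
2,phase1_wo_gemv,32,2359.712
4,grid_barrier_wait,64,1752.528
11,phase1_pre_wo_wait,28,249.888
12,phase1_gather_reduce,3,166.72
"""

CLUSTER_REGION_IDS = {2, 4, 11, 12}

cluster_us = 0.0
for row in csv.DictReader(io.StringIO(ROWS)):
    if int(row["region_id"]) in CLUSTER_REGION_IDS:
        cluster_us += float(row["median_us"])

print(f"cluster median sum: {cluster_us:.3f} us")
```

The unrounded sum is about 4529 us. Rounding each region to whole microseconds before summing gives 4530 us, which is the cluster figure reported in `summary.md`.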
diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_v2_region_timings.npy b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_v2_region_timings.npy
new file mode 100644
index 000000000000..b0a01f29d84e
Binary files /dev/null and b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_v2_region_timings.npy differ
diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/nsys_summary.md b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/nsys_summary.md
new file mode 100644
index 000000000000..27ad1a9f8517
--- /dev/null
+++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/nsys_summary.md
@@ -0,0 +1,200 @@
+# wo_k_parallel_audit / 2026-05-03-wo-split-8-prod / nsys_summary
+
+NSYS evidence for the W_O K-parallel total-kernel performance claim.
+
+## Status
+
+DONE -- Plan B (harness microkernel under nsys) at production-equivalent
+launch shape. Plan A (vLLM V1 server under nsys) was not attempted as a
+production-trace path because vLLM V1 spawns the EngineCore as a separate
+process whose CUPTI activity is not captured by an nsys profile attached
+to the API server (per project memory `feedback_vllm_profiling`). The
+`--target-processes=all` flag still requires the spawned subprocess to
+inherit the CUPTI injection env; vLLM strips most env across the
+multiprocessing spawn (per `feedback_vllm_enginecore_env_strip`), and the
+sentinel-file workaround does not propagate CUPTI injection.
+
+The harness traces below capture the W_O microkernel only, at the same
+launch shape produced inside the production beta-coop kernel
+(slice_ctas=8 -> 32 cooperative-grid CTAs per seq, num_kv_heads=4),
+which is the substantive kernel work the W_O K-parallel optimization
+modifies.
+
+## Approach used
+
+Plan B: harness microkernel.
+
+- Driver: `docs/research/2026-05-03-w-o-k-parallel-harness/run_harness.py`
+- nsys: `/opt/nvidia/nsight-systems/2025.6.3` (host bind-mount, --privileged)
+- Trace flags: `-t cuda,nvtx`
+- Capture: full process duration (no `--duration`, no `--capture-range`)
+- Cache: warm `/tmp/cute_harness_cache_v3` (HIT for changed; MISS->stored
+  for baseline first run, then HIT on rerun)
+- Both runs use 50 timed launches, B=1 active token, seed=4242,
+  cooperative=True (hardwired in microkernel.py)
+
+## Provenance
+
+- Branch: `evidence/wo-k-parallel-harness`
+- HEAD at run time: `3300f7776eb2b2c875097a98cec90c913f34aacf`
+- Image id: `sha256:9c0f1d31c92c29488f66a2c136183950cea787035d735ff95dd6af193740f530`
+- Image tag: `nvllm:gb10`
+- Hardware: NVIDIA GB10 (DGX Spark, SM120, 48 SMs)
+- nsys version: 2025.6.3.541-256337736014v0
+
+## Configs
+
+| | wo_split | slice_ctas | total_grid_ctas | active_W_O_ctas | gather_ctas | cache_key |
+|---|---|---|---|---|---|---|
+| baseline | 1 | 8 | 32 | 4 | 32 | `35fee3f003016249` |
+| changed  | 8 | 8 | 32 | 32 | 32 | `a0950af2b637ba65` |
+
+Both configs share the same total cooperative-grid size (32 CTAs per
+seq, matching production beta-coop). The only axis varied is the W_O
+K-parallel split (active W_O CTAs go from 4 to 32).
+
+## Correctness gate (AUTHORITATIVE)
+
+Both runs are bit-exact against `reference_split_order(wo_split=N)`
+(the kernel and the reference share the same reduction tree).
+
+| | passes | max_abs | max_rel |
+|---|---|---|---|
+| baseline (wo_split=1) | true | 0.0 | 0.0 |
+| changed  (wo_split=8) | true | 0.0 | 0.0 |
+
+## W_O kernel timings (nsys cuda_gpu_kern_sum, exact)
+
+Kernel symbol: `kernel_cutlass__wo_kernel_body_________________0`
+Instances per run: 50 timed launches.
+
+| Stat | Baseline (wo_split=1) | Changed (wo_split=8) | Ratio |
+|---|---:|---:|---:|
+| Total (ns)  | 706,025,504 | 79,899,712 | 8.84x |
+| Avg (ns)    | 14,120,510.1 | 1,597,994.2 | 8.84x |
+| Med (ns)    | 13,715,248.0 |  1,598,064.0 | 8.58x |
+| Min (ns)    | 13,643,232 |  1,585,248 | 8.61x |
+| Max (ns)    | 28,391,072 |  1,625,888 | 17.46x |
+| StdDev (ns) |  2,101,617.2 |      5,677.2 | n/a |
+
+In microseconds (median, rounded to the nearest microsecond):
+
+- Baseline median: 13,715 us
+- Changed  median: 1,598 us
+- Delta:          -12,117 us  (-88.3% / 8.58x)
+
+The baseline max (28.39 ms) is the launch-0 outlier (first-launch warmup
+artifact; cold-cache JIT/driver). All other 49 launches are within
+13.6-13.8 ms. The changed run is steady-state (StdDev ~5.7 us).
+
+These per-launch numbers correspond to the harness CUDA-event timings
+(timing.csv) of:
+- Baseline: 13,776-13,925 us at launches 1-49 (host-side CUDA event)
+- Changed:  1,634-1,768 us at launches 1-49 (host-side CUDA event; 1,634-1,690 us excluding launches 11, 26 and 41)
+
+Host CUDA-event timings include cooperative-launch overhead and any
+launch-edge sync; nsys CUPTI kernel timings are device-time-only. The
+two methods yield similar speedup ratios (roughly 8.4-8.5x by CUDA events, 8.58x by nsys).
+
+## Methodology caveats
+
+1. **Harness microkernel ONLY -- not the full beta-coop kernel.**
+   The harness exercises only the W_O+gather portion of the beta-coop
+   kernel (the section the K-parallel optimization modifies). It does
+   not include Phase 0 (input LN), Phase 1 (attention RMS+QK+SDPA),
+   Phase 3 (MLP) or Phase 4 (post-attn LN), all of which run in the
+   production beta-coop kernel. These traces cannot be used to argue
+   end-to-end per-call cost for vLLM serving.
+
+2. **Production grid layout matched, not full kernel composition.**
+   The harness reproduces the 32-CTA cooperative grid, num_kv_heads=4,
+   num_q_heads=24, head_dim=256, K=6144, hidden_size=5120,
+   NUM_THREADS=128, tile_s, tile_k, FP4 NVFP4 weight layout. What it
+   does NOT match is the constexpr fan-in of inputs from upstream phases
+   (those are present inside the prod beta-coop kernel but not exercised
+   here, since the harness feeds synthetic attn_output directly).
+
+3. **No first-launch outlier rejection.** Baseline max (28.4 ms) is the
+   launch-0 outlier (cache-MISS first call on this process). The
+   50-sample median (13.72 ms) is the canonical number. Reported
+   total/avg are inflated by ~2% by this single launch.
+
+4. **Plan A (vLLM V1 nsys) blocked by architecture.** vLLM V1 EngineCore
+   is a spawned subprocess. Per project memory `feedback_vllm_profiling`,
+   nsys does not capture EngineCore CUPTI activity. The
+   `--target-processes=all` flag is necessary but not sufficient because
+   the EngineCore subprocess does not inherit CUPTI injection env vars
+   from the API server's nsys-instrumented context. Project policy is to
+   use vLLM's torch profiler API (`/start_profile`/`/stop_profile` via
+   `VLLM_TORCH_PROFILER_DIR`) for V1 evidence -- but that produces a
+   torch-profiler trace, not an nsys trace. AGENTS.md sec.4 specifies
+   nsys traces, so the correct authoritative measurement at the
+   production GRID SHAPE is the harness microkernel under nsys (this
+   run). For the end-to-end serving cost we publish region-timing CSVs
+   and GSM8K evals (sibling files in this dir).
+
+5. **First baseline capture had to be re-run.** The first attempt
+   (`baseline.nsys-rep` size 1.33 MB, 11 kernel instances total, 0
+   `wo_kernel` instances) had only the setup-phase kernels. The cause was
+   not determined, but it coincided with a cache-MISS first launch. The
+   re-run (cache-HIT) captured all 50 launches as expected. Both
+   traces below are the re-run captures.
+
+## Files produced
+
+- `benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline.nsys-rep`
+- `benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed.nsys-rep`
+- `benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/{config,correctness_gate_split_order,correctness_vs_chained,correctness_vs_matmul,timing}.{json,csv}`
+- `benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/{config,correctness_gate_split_order,correctness_vs_chained,correctness_vs_matmul,timing}.{json,csv}`
+- `benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/nsys_summary.md` (this file)
+
+## Reproduction
+
+```bash
+REPO=/home/natfii/docker/nvllm
+cd "$REPO"
+mkdir -p /tmp/cute_harness_cache_v3
+DST="$REPO/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod"
+
+for WS in 1 8; do
+  if [ "$WS" = "1" ]; then NAME=baseline; else NAME=changed; fi
+  docker run --rm --gpus all --privileged \
+    -v /opt/nvidia/nsight-systems/2025.6.3:/opt/nsys \
+    -v "$REPO:/work" \
+    -v "$REPO:/app/nvllm" \
+    -v "/tmp/cute_harness_cache_v3:/tmp/cute_harness_cache_v3" \
+    --entrypoint /opt/nsys/bin/nsys \
+    nvllm:gb10 \
+    profile -t cuda,nvtx \
+    -o "/work/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/${NAME}.nsys-rep" \
+    --force-overwrite=true \
+    /opt/venv/bin/python /work/docs/research/2026-05-03-w-o-k-parallel-harness/run_harness.py \
+    --wo-split "$WS" \
+    --slice-ctas 8 \
+    --launches 50 \
+    --out "/work/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/${NAME}_nsys_run"
+done
+
+NSYS=/opt/nvidia/nsight-systems/2025.6.3/bin/nsys
+$NSYS stats --report cuda_gpu_kern_sum:mangled "$DST/baseline.nsys-rep" | grep "wo_kernel_body"
+$NSYS stats --report cuda_gpu_kern_sum:mangled "$DST/changed.nsys-rep" | grep "wo_kernel_body"
+```
+
+## Cross-references
+
+- Parity-gap audit (slice_ctas axis study):
+  `benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-parity-gap/README.md`
+  config B (wo_split=1, slice_ctas=8) and config C (wo_split=8,
+  slice_ctas=8) match the baseline/changed shapes here. The parity-gap
+  traces include --slice-ctas axis and audit kernel-level microbenchmarks;
+  this dir's traces are the focused two-point evidence at the production
+  grid.
+- Region-timing CSVs in this same dir capture full-beta-coop kernel
+  region breakdown (not just W_O).
+- GSM8K eval JSONs in this same dir confirm correctness end-to-end at
+  both wo_split values under live vLLM serving.
+- Harness README:
+  `docs/research/2026-05-03-w-o-k-parallel-harness/README.md`
+- Phase-E kernel: `vllm/v1/attention/backends/cute_paged/phase_e_kernel.py`
+  - W_O slot: line 3306 (`self._kernel_phase_0_to_4(...)`)
+  - wo_split env read: line 262 (`os.environ.get("CUTE_WO_SPLIT", "1")`)
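The ratio column and the "~2%" inflation estimate in `nsys_summary.md` above can be re-derived from the raw nanosecond figures in its kernel-timing table. A minimal sketch of that arithmetic; the variable names are illustrative, and it assumes, as the summary states, that the baseline maximum is the single launch-0 outlier.

```python
# Kernel times in nanoseconds, copied from the cuda_gpu_kern_sum table.
BASELINE_NS = {"total": 706_025_504, "avg": 14_120_510.1,
               "med": 13_715_248.0, "min": 13_643_232, "max": 28_391_072}
CHANGED_NS = {"total": 79_899_712, "avg": 1_597_994.2,
              "med": 1_598_064.0, "min": 1_585_248, "max": 1_625_888}
LAUNCHES = 50

# Speedup ratio per statistic (baseline / changed).
ratios = {k: BASELINE_NS[k] / CHANGED_NS[k] for k in BASELINE_NS}

# Median delta in microseconds and as a percentage of the baseline median.
median_delta_us = (CHANGED_NS["med"] - BASELINE_NS["med"]) / 1e3
median_delta_pct = 100.0 * (CHANGED_NS["med"] / BASELINE_NS["med"] - 1.0)

# Baseline mean with the slowest launch removed (ASSUMPTION: that is launch 0).
trimmed_avg_ns = (BASELINE_NS["total"] - BASELINE_NS["max"]) / (LAUNCHES - 1)
inflation_pct = 100.0 * (BASELINE_NS["avg"] / trimmed_avg_ns - 1.0)

for name, ratio in ratios.items():
    print(f"{name:>5}: {ratio:.2f}x")
print(f"median delta: {median_delta_us:.0f} us ({median_delta_pct:.1f}%)")
print(f"baseline avg inflated by outlier: {inflation_pct:.1f}%")
```

Because the median is insensitive to the single outlier, it is the more defensible headline number; the total and mean ratios are slightly flattered by the slow first baseline launch.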
diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/summary.md b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/summary.md
new file mode 100644
index 000000000000..35dfc3b0dc51
--- /dev/null
+++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/summary.md
@@ -0,0 +1,201 @@
+# wo_split=8 K-parallel W_O GEMV — production prototype evidence
+
+**Commit:** `b3f75721d` on branch `evidence/wo-k-parallel-harness`
+**Date captured:** 2026-05-04
+**Model:** `ig1/Qwen3.5-27B-NVFP4` (non-distilled, official llm-compressor VL recipe)
+
+## Serve config (both runs)
+
+```
+serve --model ig1/Qwen3.5-27B-NVFP4 --served-model-name default
+      --kv-cache-dtype fp8_e4m3
+      --attention-backend CUTE_PAGED
+      --max-model-len 65536 --max-num-seqs 4
+      --gpu-memory-utilization 0.70
+      --kernel-config '{"enable_flashinfer_autotune":false}'
+      --compilation-config '{"cudagraph_mode":"PIECEWISE"}'
+      --language-model-only
+      --trust-remote-code
+```
+
+Container `nvllm:gb10` with β-coop fusion enabled (`CUTE_PHASE_E_FUSION=1`, layers 0-7).
+Bind-mount: `vllm/v1/attention/backends/cute_paged` for live-update.
+
+The two runs differ ONLY in `CUTE_WO_SPLIT` (1 = baseline, 8 = changed). The env var
+propagates to the EngineCore subprocess via the `/tmp/c2_diag/ENV` sentinel-file
+workaround (commit `ef9f68028`).
+
+## Quality (GSM8K-50 full-think, seed=42, max_tokens=512, timeout=600s)
+
+| Run | Accuracy | Errors / timeouts | 50-question wall | Per-question OK median |
+|---|---:|---:|---:|---:|
+| **wo_split=1 baseline** | **48/50 (96.0%)** | 0 | 3760 s | 65.7 s |
+| **wo_split=8 changed** | **47/50 (94.0%)** | 0 | 3664 s | 62.3 s |
+| Δ | −1 question | identical | **−96 s (−2.6%)** | **−3.4 s (−5.2%)** |
+
+Quality parity: the one-question gap (2 percentage points at n=50) is within run-to-run noise, with 0 errors on both sides.
+
+Artifacts: `baseline_gsm8k_fullthink.json`, `changed_gsm8k_fullthink.json`.
+
+## Region timing (5-completion synthetic load)
+
+`CUTE_BETA_REGION_TIMING=1`, dumped via `scripts/trigger_region_timing_dump.sh`,
+reduced via `docs/research/2026-05-02-beta-region-breakdown/extract_regions.py
+--wo-split N`.
+
+| Region | wo_split=1 | wo_split=8 | Δ | Notes |
+|---|---:|---:|---:|---|
+| R2 `phase1_wo_gemv` | 14121 μs (4 active) | **2360 μs (32 active)** | **−11761 μs (5.99×)** | K-parallel split |
+| R4 `grid_barrier_wait` | 15211 μs (64) | **1753 μs (64)** | **−13458 μs (8.68×)** | shrinks because R2 finishes faster |
+| R11 `phase1_pre_wo_wait` | 0 μs (mask empty) | 250 μs (28 active) | +250 μs | new: bx>0 consumers spin-wait |
+| R12 `phase1_gather_reduce` | 73 μs (1 elected) | 167 μs (3 elected, one per seq) | +94 μs | gather of 32 partials vs 4 |
+| **Cluster (R2+R4+R11+R12)** | **29405 μs** | **4530 μs** | **−24875 μs (6.49× / −84.6%)** | |
+
+Other regions unchanged (R0, R1, R3, R5-R10).
+
+R11 active CTA count = 28 = 32 W_O total − 4 attn producers (bx==0 producers skip
+R11; intra-CTA ordering means their attn_output reads need no acquire fence).
+
+R12 is a dynamic-single-CTA region (only the elected CTA writes a tick); host
+reducer uses nonzero filtering to drop the zero rows (63 of 64 when one seq is in flight).
+
+Artifacts:
+- `baseline_region_timings.npy`, `baseline_region_breakdown.csv` — wo_split=1
+- `changed_v2_region_timings.npy`, `changed_v2_region_breakdown.csv` — wo_split=8
+
+## nsys total-kernel comparison
+
+Captured via the harness microkernel at production grid shape (slice_ctas=8,
+32-CTA cooperative grid, num_kv_heads=4, hidden=5120, K=6144, NUM_THREADS=128,
+NVFP4 weights). The harness microkernel reproduces the production W_O+gather
+math bit-exactly (verified against `reference_split_order(wo_split=N)` —
+`max_abs=0.0` at both wo_splits). vLLM V1 nsys against the EngineCore
+subprocess is blocked by CUPTI injection inheritance through the multiprocess
+spawn (per `feedback_vllm_profiling`); harness microkernel + production grid
+is the authoritative nsys path for the W_O+gather portion of the kernel.
+
+| Metric | wo_split=1 | wo_split=8 | Delta |
+|---|---:|---:|---:|
+| Symbol | `kernel_cutlass__wo_kernel_body_________________0` | same | -- |
+| 50-launch median | **13715.248 us** | **1598.064 us** | **-12117 us (-88.3% / 8.58x)** |
+| 50-launch mean | 14120.510 us | 1597.994 us | -12523 us (8.84x) |
+| 50-launch stddev | 2101.617 us | **5.677 us** | collapsed; high stability at wo_split=8 |
+| Min / Max | 13643 / 28391 us | 1585 / 1626 us | wo_split=1 had 28391 us first-call cache-MISS outlier |
+| GPU time (50 launches) | 706 ms | 80 ms | -- |
+| Time fraction in trace | 92.7% | 58.8% | -- |
+
+`kernel_cutlass__wo_kernel_body` is the CuTe DSL emitted symbol for the W_O
+microkernel body. The same kernel-body code path ships in production beta-coop
+fusion (`_kernel_phase_0_to_4` in `phase_e_kernel.py`), but production emits a
+different mangled symbol for the full fused kernel. The harness isolates the
+W_O+gather portion using the same K-range slicing and slot-index formulas,
+so the harness speedup carries over to production R2 in direction but not in
+full magnitude: region timing shows R2 = 14121 -> 2360 us (5.99x) against 8.58x
+in the harness. The gap is attributed to ignore-eos warm-cache vs serving cold-launch variance.
+
+Files:
+- `baseline.nsys-rep` -- wo_split=1 trace, 1.84 MB
+- `changed.nsys-rep` -- wo_split=8 trace, 1.96 MB
+- `nsys_summary.md` -- full subagent capture report
+
+Reproduction:
+```bash
+/opt/nvidia/nsight-systems/2025.6.3/bin/nsys stats \
+    --report cuda_gpu_kern_sum:mangled baseline.nsys-rep \
+    | grep wo_kernel_body
+```
+
+## Bit-exact correctness gate
+
+The K-parallel W_O kernel reproduces `reference_split_order(wo_split=N)` from
+`docs/research/2026-05-03-w-o-k-parallel-harness/torch_reference.py` bit-exactly
+at both wo_split=1 and wo_split=8 (`max_abs == 0.0`).
+
+Methodology: V=constant trick — set FP8 V-cache to `+1.0` (0x38) so Phase 1
+attention output is deterministically `attn_output = ones(NAT, K)`. With known
+input, `wo_output[seq, 0, :]` (post-gather, written at `phase_e_kernel.py:4471`
+and read by RMSNorm Pass 1 at `:4490`) equals
+`reference_split_order(attn=ones, weighted, wo_split=N)`.
+
+Repro at `/tmp/wo_split_repro.py` (transient, not committed). Re-runs bit-exact
+at any time on a warm container.
+
+## Exact reproduction commands
+
+### Build
+```bash
+cd /home/natfii/docker/nvllm
+git checkout evidence/wo-k-parallel-harness
+git rev-parse HEAD  # expect b3f75721d (or descendant)
+docker images nvllm:gb10
+```
+
+### wo_split=1 baseline GSM8K
+```bash
+docker stop nvllm; docker rm nvllm
+NVLLM_BIND_MOUNT_CUTE_PAGED=1 \
+CUTE_PHASE_E_FUSION=1 \
+CUTE_PHASE_E_LAYERS=0,1,2,3,4,5,6,7 \
+CUTE_PHASE_E_FALLBACK_RAISE=1 \
+    bash scripts/serve-cute.sh
+until curl -s -f -m 3 http://localhost:8000/v1/models 2>/dev/null | grep -q '"data"'; do sleep 10; done
+.venv/bin/python scripts/gsm8k_eval_50.py \
+    --api http://localhost:8000/v1 --model default \
+    --n 50 --seed 42 --max-tokens 512 --timeout 600 \
+    --save baseline_gsm8k_fullthink.json --label task10_wo_split_1_baseline
+```
+
+### Region timing (wo_split=1)
+```bash
+docker stop nvllm; docker rm nvllm
+NVLLM_BIND_MOUNT_CUTE_PAGED=1 \
+CUTE_PHASE_E_FUSION=1 \
+CUTE_PHASE_E_LAYERS=0,1,2,3,4,5,6,7 \
+CUTE_PHASE_E_FALLBACK_RAISE=1 \
+CUTE_BETA_REGION_TIMING=1 \
+VLLM_TORCH_PROFILER_DIR=/root/.cache/vllm/profiler \
+    bash scripts/serve-cute.sh
+for i in 1 2 3 4 5; do
+    curl -s -X POST http://localhost:8000/v1/completions \
+        -H 'Content-Type: application/json' \
+        -d '{"model":"default","prompt":"capital of france is","max_tokens":50,"temperature":0,"ignore_eos":true}' \
+        > /dev/null
+done
+bash scripts/trigger_region_timing_dump.sh baseline_region_timings.npy
+.venv/bin/python docs/research/2026-05-02-beta-region-breakdown/extract_regions.py \
+    --buf baseline_region_timings.npy --kernel-mean-us 40000 \
+    --slice-ctas 8 --num-k-tiles 8 --num-seqs 1 \
+    --tick-source globaltimer --wo-split 1 --num-kv-heads 4 \
+    --out baseline_region_breakdown.csv
+```
+
+### wo_split=8 changed
+Add `CUTE_WO_SPLIT=8 \` as an extra line in the env block above `bash scripts/serve-cute.sh`. The
+sentinel-file workaround (commit `ef9f68028`) propagates the env var to
+EngineCore. For the region-timing capture at wo_split=8, pass `--wo-split 8` to
+`extract_regions.py`.
+
+### Bit-exact gate
+```bash
+.venv/bin/python /tmp/wo_split_repro.py --wo-split 1 --seed 4242  # PASS, max_abs=0
+.venv/bin/python /tmp/wo_split_repro.py --wo-split 8 --seed 4242  # PASS, max_abs=0
+```
+
+## Caveats
+
+- **Region timing captured at synthetic 5-completion ignore_eos load**, not GSM8K
+  workload. Per-region speedups at GSM8K may differ (workload sensitivity).
+- **R11 active CTA count (28) reflects bx>0 consumers only**; the host reducer's
+  `_phase1_wo_split_cta_ids` mask spans all 32 W_O CTAs, but the kernel writes
+  R11 ticks only for bx>0, so the nonzero filter drops the all-zero bx==0 rows.
+- **R12 active CTA count varies (1-3 across runs)** depending on how many
+  concurrent decodes are in flight at the dump moment; each seq elects its own
+  last CTA.
+- **`CUTE_WO_SPLIT` accepted values restricted to `{1, 2, 4, 8}`** (commit
+  `4331362e2`). The kernel logic works for arbitrary 1..slice_ctas but only the
+  powers-of-2 subset has reference-validation evidence.
+- **wo_split=8 stays opt-in** via env var; default is 1 (no behavioral change
+  for production callers who don't set the var).
+- **R11 / R12 are timing-instrumentation regions**, NOT production gates. The
+  underlying mechanisms (consumer wait, single-CTA gather) are production
+  behavior; the timing samples are debug-gated by `CUTE_BETA_REGION_TIMING=1`.
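The CTA counts quoted in the caveats (28 R11 consumers, 32 W_O CTAs) follow from the two masks this PR describes. The sketch below is a minimal illustration, assuming the grid from the repro commands (`--slice-ctas 8 --num-k-tiles 8`, 4 KV heads). The helper `wo_cta_masks` is hypothetical and not part of the repository.

```python
def wo_cta_masks(slice_ctas, num_k_tiles, num_kv_heads, wo_split):
    """Enumerate (bx, by) pairs for one sequence under the masks quoted in this PR.

    W_O mask:      bx < wo_split and by < num_kv_heads
    Consumer mask: bx > 0 and bx < wo_split and by < num_kv_heads
    """
    wo_ctas = [
        (bx, by)
        for by in range(num_k_tiles)
        for bx in range(slice_ctas)
        if bx < wo_split and by < num_kv_heads
    ]
    # Consumers are the W_O CTAs that must wait for the bx == 0 producers.
    consumers = [(bx, by) for (bx, by) in wo_ctas if bx > 0]
    return wo_ctas, consumers


wo8, consumers8 = wo_cta_masks(8, 8, 4, wo_split=8)
print(len(wo8), len(consumers8))  # 32 28

wo1, consumers1 = wo_cta_masks(8, 8, 4, wo_split=1)
print(len(wo1), len(consumers1))  # 4 0
```

With `wo_split=1` the consumer list is empty, which matches the statement that the pre-W_O counter is dormant in the default configuration.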
diff --git a/docs/research/2026-05-02-beta-region-breakdown/extract_regions.py b/docs/research/2026-05-02-beta-region-breakdown/extract_regions.py
index e504c4e1b760..2e82b0fc6acf 100644
--- a/docs/research/2026-05-02-beta-region-breakdown/extract_regions.py
+++ b/docs/research/2026-05-02-beta-region-breakdown/extract_regions.py
@@ -2,7 +2,7 @@
 per-region CSV.
 
 Inputs:
-  - region_timings.npy: (num_ctas, 11, 2) int64 raw buffer.
+  - region_timings.npy: (num_ctas, 13, 2) int64 raw buffer.
   - profile_kernels.csv: per-kernel μs from
     docs/research/gemm_sweep/extract_e2e_kernels.py. Column name is
     `kernel_symbol` (NOT `Kernel Name`); rows include CuTe kernel name
@@ -44,6 +44,14 @@ def main() -> None:
     p.add_argument("--kernel-symbol-regex", default="PhaseE_Beta_Kernel",
                    help="Substring used to find the β-coop row in "
                         "profile_kernels.csv.kernel_symbol")
+    p.add_argument("--wo-split", type=int, default=1,
+                   help="K-parallel split factor for W_O (regions "
+                        "{2,3,11,12}). Default 1 keeps legacy mask "
+                        "(_phase1_cta_ids); >1 dispatches "
+                        "_phase1_wo_split_cta_ids.")
+    p.add_argument("--num-kv-heads", type=int, default=4,
+                   help="KV-head count for the W_O mask when "
+                        "--wo-split>1. Qwen3.5-27B = 4.")
     args = p.parse_args()
 
     buf = np.load(args.buf)
@@ -86,6 +94,8 @@ def main() -> None:
         num_seqs=args.num_seqs,
         tick_source=args.tick_source,
         nsys_total_us=nsys_total_us,
+        wo_split=args.wo_split,
+        num_kv_heads=args.num_kv_heads,
     )
     df.to_csv(args.out, index=False)
 
diff --git a/scripts/serve-cute.sh b/scripts/serve-cute.sh
index 16beaa5b97e2..b3177116df89 100755
--- a/scripts/serve-cute.sh
+++ b/scripts/serve-cute.sh
@@ -74,6 +74,11 @@ mkdir -p /tmp/c2_diag
   echo "CUTE_C2_DIAG_DUMP_DIR=${CUTE_C2_DIAG_DUMP_DIR:-}"
   echo "CUTE_C2_DIAG_TOL_ATOL=${CUTE_C2_DIAG_TOL_ATOL:-}"
   echo "CUTE_C2_DIAG_TOL_RTOL=${CUTE_C2_DIAG_TOL_RTOL:-}"
+  # vLLM's EngineCore subprocess strips most -e env vars (per
+  # feedback_vllm_enginecore_env_strip). Sentinel-file workaround:
+  # qwen3_5.py reads /tmp/c2_diag/ENV at module import and calls
+  # os.environ.setdefault(). Same pattern as CUTE_C2_* above.
+  echo "CUTE_WO_SPLIT=${CUTE_WO_SPLIT:-1}"
 } > /tmp/c2_diag/ENV
 
 # Optional bind-mount of the cute_paged subdir for Python-only iteration
@@ -116,6 +121,7 @@ docker run -d \
   -e CUTE_PHASE_E_LAYERS="${CUTE_PHASE_E_LAYERS:-}" \
   -e CUTE_PHASE_E_FALLBACK_RAISE="${CUTE_PHASE_E_FALLBACK_RAISE:-0}" \
   -e CUTE_BETA_REGION_TIMING="${CUTE_BETA_REGION_TIMING:-0}" \
+  -e CUTE_WO_SPLIT="${CUTE_WO_SPLIT:-1}" \
   -e VLLM_TORCH_PROFILER_DIR="${VLLM_TORCH_PROFILER_DIR:-}" \
   "$NVLLM_IMAGE" \
   serve \
diff --git a/vllm/nvllm/models/qwen3_5.py b/vllm/nvllm/models/qwen3_5.py
index fcc28a13dc83..8c0788d0cfc5 100644
--- a/vllm/nvllm/models/qwen3_5.py
+++ b/vllm/nvllm/models/qwen3_5.py
@@ -49,7 +49,10 @@
 if os.path.isfile(_C2_ENV_FILE):
     with open(_C2_ENV_FILE) as _c2_f:
         for _c2_ln in _c2_f:
-            if "=" in _c2_ln and _c2_ln.startswith("CUTE_C2_"):
+            if "=" in _c2_ln and (
+                _c2_ln.startswith("CUTE_C2_")
+                or _c2_ln.startswith("CUTE_WO_SPLIT=")
+            ):
                 _c2_k, _c2_v = _c2_ln.strip().split("=", 1)
                 if _c2_v:  # skip empty values so we don't shadow real env
                     os.environ.setdefault(_c2_k, _c2_v)
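The sentinel-file pattern in the hunk above can be exercised in isolation. This is a sketch under stated assumptions: `load_sentinel_env` and `ALLOWED_PREFIXES` are illustrative names, and the function writes into a plain dict rather than `os.environ` so it is safe to run anywhere. It keeps the two behaviors visible in the diff: empty values are skipped, and existing keys win (the `setdefault` semantics).

```python
import os
import tempfile

# Prefixes accepted by the import-time loader in this PR.
ALLOWED_PREFIXES = ("CUTE_C2_", "CUTE_WO_SPLIT=")


def load_sentinel_env(path, environ):
    """Copy allowed KEY=VALUE lines from `path` into `environ` if not already set."""
    applied = {}
    if not os.path.isfile(path):
        return applied
    with open(path) as f:
        for line in f:
            if "=" in line and line.startswith(ALLOWED_PREFIXES):
                key, value = line.strip().split("=", 1)
                # Skip empty values and never shadow a real env entry.
                if value and key not in environ:
                    environ[key] = value
                    applied[key] = value
    return applied


with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as tmp:
    tmp.write("CUTE_C2_DIAG_DUMP_DIR=\n")      # empty value: skipped
    tmp.write("CUTE_C2_DIAG_TOL_ATOL=0.1\n")   # already set below: kept as-is
    tmp.write("CUTE_WO_SPLIT=8\n")             # applied
    tmp.write("UNRELATED=1\n")                 # prefix not allowed: ignored
    sentinel_path = tmp.name

fake_env = {"CUTE_C2_DIAG_TOL_ATOL": "0.5"}
applied = load_sentinel_env(sentinel_path, fake_env)
os.unlink(sentinel_path)
print(applied)  # {'CUTE_WO_SPLIT': '8'}
```

Matching on `CUTE_WO_SPLIT=` (with the trailing `=`) rather than the bare prefix keeps the filter from accepting any future variable that merely starts with the same name.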
diff --git a/vllm/v1/attention/backends/cute_paged/_backend.py b/vllm/v1/attention/backends/cute_paged/_backend.py
index bf07ed00e7a2..b536979634ef 100644
--- a/vllm/v1/attention/backends/cute_paged/_backend.py
+++ b/vllm/v1/attention/backends/cute_paged/_backend.py
@@ -60,7 +60,7 @@
 _REGION_TIMING_ENABLED = (
     os.environ.get("CUTE_BETA_REGION_TIMING", "0") == "1"
 )
-_REGION_TIMING_NUM_REGIONS = 11
+_REGION_TIMING_NUM_REGIONS = 13
 
 # CuTe DSL disk cache — runtime hookup. Without this call, the env vars
 # B12X_CUTE_COMPILE_DISK_CACHE and B12X_CUTE_COMPILE_CACHE_DIR are inert
@@ -911,7 +911,9 @@ def attach_mlp_fusion(
                 # inside this `try:` so an OOM trips the except handler
                 # that nulls _phase_e_coop_kernel.
                 self._phase_e_coop_wo_output = torch.zeros(
-                    max_num_seqs, 4, hidden_dim,
+                    max_num_seqs,
+                    self.num_kv_heads * self._phase_e_coop_kernel.wo_split,
+                    hidden_dim,
                     dtype=torch.float32, device="cuda",
                 )
                 self._phase_e_coop_mlp_partial_fp32 = torch.zeros(
@@ -928,6 +930,19 @@ def attach_mlp_fusion(
                 self._phase_e_coop_phase1_arrival_count = torch.zeros(
                     max_num_seqs, dtype=torch.int32, device="cuda",
                 )
+                # Task 6: pre-W_O arrival counter — producers (bx==0 attn
+                # CTAs) atomic_add 1 after attn_output is written;
+                # consumers (bx>0 W_O CTAs, only at wo_split>1) spin-wait
+                # until the counter reaches num_kv_heads. At wo_split=1 the
+                # consumer mask is empty and no CTA reads this counter, so
+                # the increment to num_kv_heads is harmless. Reset by host
+                # zero_() before each launch (Task 6 chose host-zero_
+                # approach over kernel atomic-subtract for symmetry with
+                # mlp_arrival_count.zero_() that already runs at every
+                # launch).
+                self._phase_e_coop_pre_wo_arrival_count = torch.zeros(
+                    max_num_seqs, dtype=torch.int32, device="cuda",
+                )
                 if _REGION_TIMING_ENABLED:
                     # Per-CTA region timing scratch. Layout:
                     #   (num_ctas, num_regions, 2) u64 — entry+exit ticks.
@@ -1640,6 +1655,11 @@ def _run_paged() -> torch.Tensor:
                         mlp_arrival_count=self._phase_e_coop_mlp_arrival_count[:nat],
                         grid_barrier_i32=self._phase_e_coop_grid_barrier_i32[:nat],
                         phase1_arrival_count=self._phase_e_coop_phase1_arrival_count[:nat],
+                        # Task 6: pre-W_O arrival counter (dormant at
+                        # wo_split=1 — consumer mask `bx>0 && bx<wo_split`
+                        # is empty so no CTA spins, R11 buffer rows stay
+                        # zero, host nonzero filter drops them).
+                        pre_wo_arrival_count=self._phase_e_coop_pre_wo_arrival_count[:nat],
                         # Task 4 plumb: env-gated region-timing scratch (or
                         # None when CUTE_BETA_REGION_TIMING is unset; see
                         # _phase_e_coop_region_timing init above).
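The `(num_ctas, num_regions, 2)` timing layout implies a fixed byte offset for every tick. The sketch below works through that arithmetic, assuming 13 regions, 8-byte ticks, and the CTA linearization used in the kernel's timing stores (`bz * slice_ctas * num_k_tiles + by * slice_ctas + bx`). The function names are illustrative only.

```python
NUM_REGIONS = 13       # matches _REGION_TIMING_NUM_REGIONS in this PR
SLOTS_PER_REGION = 2   # slot 0 = entry tick, slot 1 = exit tick
BYTES_PER_TICK = 8     # one 64-bit tick per slot
PER_CTA_STRIDE = NUM_REGIONS * SLOTS_PER_REGION * BYTES_PER_TICK  # 208 bytes


def linear_cta_id(bx, by, bz, slice_ctas, num_k_tiles):
    """Flatten (bx, by, bz) the same way the timing stores do."""
    return bz * slice_ctas * num_k_tiles + by * slice_ctas + bx


def tick_byte_offset(cta_id, region, slot):
    """Byte offset of one tick inside the flat timing buffer."""
    return (
        cta_id * PER_CTA_STRIDE
        + region * SLOTS_PER_REGION * BYTES_PER_TICK
        + slot * BYTES_PER_TICK
    )


cta = linear_cta_id(bx=1, by=0, bz=0, slice_ctas=8, num_k_tiles=8)
offset = tick_byte_offset(cta, region=11, slot=1)  # R11 exit tick
print(cta, offset)  # 1 392

# The same location expressed as an index into a (num_ctas, 13, 2) array.
flat_index = offset // BYTES_PER_TICK
print(divmod(flat_index, NUM_REGIONS * SLOTS_PER_REGION))  # (1, 23)
```

The second print shows CTA 1 and in-CTA position 23, which is `11 * 2 + 1`, so the byte arithmetic and the array view agree.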
diff --git a/vllm/v1/attention/backends/cute_paged/_wo_output_reset_op.py b/vllm/v1/attention/backends/cute_paged/_wo_output_reset_op.py
index a14620eb895f..81bd515c3d44 100644
--- a/vllm/v1/attention/backends/cute_paged/_wo_output_reset_op.py
+++ b/vllm/v1/attention/backends/cute_paged/_wo_output_reset_op.py
@@ -93,7 +93,7 @@ def cute_paged_reset_wo_output(
         f"wo_output dtype must be float32, got {wo_output.dtype}"
     )
     assert wo_output.dim() == 3, (
-        f"wo_output must be 3D [max_num_seqs, 4, hidden], "
+        f"wo_output must be 3D [max_num_seqs, num_kv_heads*wo_split, hidden], "
         f"got {tuple(wo_output.shape)}"
     )
     assert wo_output.is_contiguous(), (
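The middle dimension `num_kv_heads*wo_split` is indexed by the slot formula in the kernel comments of this PR (`slot = by*wo_split+bx`), and each slot owns a K sub-range. The sketch below restates both formulas in plain Python. The value `K = 1024` is a made-up size chosen for illustration, not the model's real dimension, and the helper names are hypothetical.

```python
def wo_slot(by, bx, wo_split):
    """Slot in wo_output[seq, slot, :] written by W_O CTA (bx, by)."""
    return by * wo_split + bx


def k_range(K, num_kv_heads, wo_split, by, bx):
    """Half-open K range [k_start, k_end) owned by W_O CTA (bx, by)."""
    k_per_head = K // num_kv_heads
    k_start_in_head = (k_per_head * bx) // wo_split
    k_end_in_head = (k_per_head * (bx + 1)) // wo_split
    return by * k_per_head + k_start_in_head, by * k_per_head + k_end_in_head


K, NUM_KV_HEADS, WO_SPLIT = 1024, 4, 8  # K is illustrative only

print(wo_slot(by=1, bx=3, wo_split=WO_SPLIT))          # 11
print(k_range(K, NUM_KV_HEADS, WO_SPLIT, by=1, bx=3))  # (352, 384)

# Walking the slots in order should tile [0, K) with no gaps or overlaps.
ranges = [
    k_range(K, NUM_KV_HEADS, WO_SPLIT, by, bx)
    for by in range(NUM_KV_HEADS)
    for bx in range(WO_SPLIT)
]
covered = [k for start, end in ranges for k in range(start, end)]
print(covered == list(range(K)))  # True
```

Because the start and end are both computed with floor division on `k_per_head * bx`, adjacent slices share a boundary even when `k_per_head` is not a multiple of `wo_split`, so the partial sums always cover each head's K range exactly once.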
diff --git a/vllm/v1/attention/backends/cute_paged/phase_e_kernel.py b/vllm/v1/attention/backends/cute_paged/phase_e_kernel.py
index c8708c04473e..b3db78e3068c 100644
--- a/vllm/v1/attention/backends/cute_paged/phase_e_kernel.py
+++ b/vllm/v1/attention/backends/cute_paged/phase_e_kernel.py
@@ -33,6 +33,12 @@
 FP4_BLOCK_SIZE = 16
 LOG2_E = 1.4426950408889634
 
+# Per-CTA stride for the region-timing buffer:
+# regions × 2 slots (entry/exit) × 8 bytes (u64).
+# Must match _REGION_TIMING_NUM_REGIONS in _backend.py and
+# REGION_NAMES len in region_timing.py.
+_REGION_TIMING_PER_CTA_STRIDE = 13 * 2 * 8  # 208 bytes
+
 
 # --- CuTe DSL import guard (mirrors kernel.py / mlp_kernel.py) --------------
 _CUTE_AVAILABLE = False
@@ -244,6 +250,21 @@ def __init__(
         )
         # Each CTA owns a contiguous chunk of slices (mirrors Phase_D).
         self.slices_per_cta = (self.num_slices + slice_ctas - 1) // slice_ctas
+        # wo_split: K-parallel split factor for W_O GEMV.
+        # Default 1 = current behavior (4 active W_O CTAs).
+        # Restricted to the evidenced set {1, 2, 4, 8}: the kernel's
+        # bounds are robust for arbitrary 1..slice_ctas, but only the
+        # powers-of-2 subset has the bench/correctness story this PR
+        # ships (reference_split_order at docs/research/2026-05-03-w-o-
+        # k-parallel-harness/torch_reference.py only validates these).
+        # wo_split must also be <= slice_ctas (wo_split > slice_ctas
+        # requires grid changes).
+        self.wo_split = int(os.environ.get("CUTE_WO_SPLIT", "1"))
+        assert self.wo_split in (1, 2, 4, 8) \
+                and self.wo_split <= self.slice_ctas, (
+            f"wo_split={self.wo_split} must be in {{1, 2, 4, 8}} "
+            f"and <= slice_ctas={self.slice_ctas}"
+        )
         # FC2 thread mapping — same two-path choice as Phase_D (see
         # mlp_kernel.py:374).
         if tile_k >= self.num_threads:
@@ -334,7 +355,7 @@ def __init__(
         )
         # Number of measured regions; matches the host reducer in
         # vllm/v1/attention/backends/cute_paged/region_timing.py.
-        self._region_timing_num_regions = 11
+        self._region_timing_num_regions = 13
 
     # -----------------------------------------------------------------
     # Python-level debug entry point (phase-0-only).
@@ -2796,6 +2817,12 @@ def run_beta_coop_full(
         mlp_arrival_count: torch.Tensor,    # [nat, num_k_tiles]          u32
         grid_barrier_i32: torch.Tensor,     # [nat]                       i32
         phase1_arrival_count: torch.Tensor, # [nat]                       i32
+        # Task 6: pre-W_O arrival counter (i32 [nat]). Producers (bx==0
+        # attn CTAs, by<num_kv_heads) atomic_add 1 after attn_output is
+        # written; consumers (bx>0 W_O CTAs at wo_split>1) spin until
+        # counter == num_kv_heads. Dormant at wo_split=1 (consumer mask
+        # `bx>0 && bx<wo_split` is empty).
+        pre_wo_arrival_count: torch.Tensor, # [nat]                       i32
         # When True, prime the disk cache for this config and return
         # immediately after _compile_coop_full. The actual launch
         # (self._compiled_phase_coop_full) is skipped. Used by
@@ -2858,10 +2885,15 @@ def run_beta_coop_full(
         # Spec 2026-04-30 §4.3: hoisted to CutePagedAttentionImpl in
         # attach_mlp_fusion. Per-call alloc here was unsafe under FULL
         # graph capture (vLLM #35175 analog).
-        total_ctas_per_seq_attn = 4  # bx==0 && by<4
-        assert wo_output.shape == (nat, total_ctas_per_seq_attn, hidden), (
+        # Phase 1 W_O slot count — drives wo_output shape, gather, election, reset.
+        # The R1 attn-pre-W_O mask uses the literal `by < Int32(4)` in the kernel
+        # body (= num_kv_heads producers); kept as a literal because it doesn't
+        # change with wo_split (wo_split scales the W_O CTA count, not the
+        # attn-producer count).
+        total_wo_slots = self.num_kv_heads * self.wo_split  # = 4 at wo_split=1
+        assert wo_output.shape == (nat, total_wo_slots, hidden), (
             f"wo_output shape {wo_output.shape} != "
-            f"({nat}, {total_ctas_per_seq_attn}, {hidden})"
+            f"({nat}, {total_wo_slots}, {hidden})"
         )
         assert wo_output.dtype == torch.float32
         assert wo_output.is_contiguous()
@@ -2940,6 +2972,11 @@ def run_beta_coop_full(
         wo_output_ptr = Int64(wo_output.data_ptr())
         wo_gs_ptr = Int64(wo_global_scale.data_ptr())
         phase1_arrival_ptr = Int64(phase1_arrival_count.data_ptr())
+        # Task 6: pre-W_O arrival counter pointer. Producers (bx==0 attn
+        # CTAs, by<num_kv_heads) atomic_add 1 after attn writes; consumers
+        # (bx>0 W_O CTAs at wo_split>1) spin until counter == num_kv_heads.
+        # Dormant at wo_split=1 (consumer mask `bx>0 && bx<wo_split` empty).
+        pre_wo_arrival_ptr = Int64(pre_wo_arrival_count.data_ptr())
 
         gate_fp4_ptr = Int64(gate_w_fp4.data_ptr())
         gate_sc_ptr = Int64(gate_w_scale.data_ptr())
@@ -3006,12 +3043,13 @@ def run_beta_coop_full(
             wo_output_ptr,
             wo_gs_ptr,
             phase1_arrival_ptr,
+            pre_wo_arrival_ptr,  # Task 6: pre-W_O arrival counter
             Int32(self.num_attn_heads),
             Int32(self.num_kv_heads),
             kv_page_stride,
             wo_nkt,
             wo_row_stride,
-            Int32(total_ctas_per_seq_attn),
+            Int32(total_wo_slots),
             Int32(hidden),
             Float32(float(scale)),
             Float32(float(k_scale)),
@@ -3056,6 +3094,8 @@ def run_beta_coop_full(
         all_args = all_args + (
             timing_ptr_i64,
             self._region_timing_enabled,
+            # Task 8: K-parallel W_O split factor (Constexpr at trace).
+            int(self.wo_split),
         )
 
         self._compile_coop_full(*all_args)
@@ -3089,6 +3129,7 @@ def run_beta_coop_full(
         # this function).
         # Reverted to counters-only — partial fix, capture-safe.
         phase1_arrival_count.zero_()
+        pre_wo_arrival_count.zero_()  # Task 6: pre-W_O arrival counter reset
         grid_barrier_i32.zero_()
         mlp_arrival_count.zero_()
         # mlp_partial_fp32.zero_()  # large FP32 — hangs capture
@@ -3119,6 +3160,7 @@ def _coop_full_compile_key(self) -> tuple:
             self.slice_ctas,
             self.num_slices,
             self.num_k_tiles,
+            self.wo_split,  # K-parallel W_O split factor
             self.slices_per_cta,
             self._rows_per_thread,
             self._threads_per_row,
@@ -3195,12 +3237,18 @@ def _jit_launch_phase_0_to_4(
             wo_output_ptr: Int64,
             wo_gs_ptr: Int64,
             phase1_arrival_ptr: Int64,
+            # Task 6: pre-W_O arrival counter (i32 [nat]). Producers (bx==0
+            # attn CTAs, by<num_kv_heads) atomic_add 1 after attn_output
+            # is published; consumers (bx>0 W_O CTAs at wo_split>1) spin
+            # until counter == num_kv_heads. Dormant at wo_split=1
+            # (consumer mask `bx>0 && bx<wo_split` empty).
+            pre_wo_arrival_ptr: Int64,
             num_q_heads: Int32,
             num_kv_heads: Int32,
             kv_page_stride: Int32,
             wo_num_k_tiles: Int32,
             wo_weight_row_stride: Int32,
-            total_ctas_per_seq_attn: Int32,
+            total_wo_slots: Int32,
             hidden_dim: Int32,
             scale: Float32,
             k_scale: Float32,
@@ -3240,6 +3288,12 @@ def _jit_launch_phase_0_to_4(
             # passing 0 is safe.
             region_timing_ptr: Int64,
             region_timing_enabled: cutlass.Constexpr[bool],
+            # Task 8: K-parallel W_O split factor (Constexpr so
+            # range_constexpr can iterate over slices). At wo_split=1
+            # the new W_O gate `bx < wo_split && by < num_kv_heads` is
+            # equivalent to the legacy `bx == 0 && by < 4`, so behavior is
+            # preserved bit-exactly.
+            wo_split_const: cutlass.Constexpr[int],
         ):
             """JIT host wrapper for the unified β-coop launch.
 
@@ -3267,12 +3321,13 @@ def _jit_launch_phase_0_to_4(
                 wo_output_ptr,
                 wo_gs_ptr,
                 phase1_arrival_ptr,
+                pre_wo_arrival_ptr,  # Task 6: pre-W_O arrival counter
                 num_q_heads,
                 num_kv_heads,
                 kv_page_stride,
                 wo_num_k_tiles,
                 wo_weight_row_stride,
-                total_ctas_per_seq_attn,
+                total_wo_slots,
                 hidden_dim,
                 scale,
                 k_scale,
@@ -3298,6 +3353,8 @@ def _jit_launch_phase_0_to_4(
                 # Task 3: forward region-timing scratch + constexpr gate.
                 region_timing_ptr,
                 region_timing_enabled,
+                # Task 8: forward Constexpr wo_split.
+                wo_split_const,
             ).launch(
                 grid=[self.slice_ctas, self.num_k_tiles, nat],
                 block=[self.num_threads, 1, 1],
@@ -3327,12 +3384,18 @@ def _kernel_phase_0_to_4(
             wo_output_ptr: Int64,
             wo_gs_ptr: Int64,
             phase1_arrival_ptr: Int64,
+            # Task 6: pre-W_O arrival counter (i32 [nat]). Producers
+            # (bx==0 attn CTAs, by<num_kv_heads) atomic_add 1 after
+            # attn_output is published; consumers (bx>0 W_O CTAs at
+            # wo_split>1) spin until counter == num_kv_heads. Dormant
+            # at wo_split=1 (consumer mask `bx>0 && bx<wo_split` empty).
+            pre_wo_arrival_ptr: Int64,
             num_q_heads: Int32,
             num_kv_heads: Int32,
             kv_page_stride: Int32,
             wo_num_k_tiles: Int32,
             wo_weight_row_stride: Int32,
-            total_ctas_per_seq_attn: Int32,
+            total_wo_slots: Int32,
             hidden_dim: Int32,
             scale: Float32,
             k_scale: Float32,
@@ -3370,6 +3433,13 @@ def _kernel_phase_0_to_4(
             # production path pays only the two scalar arg slots.
             region_timing_ptr: Int64,
             region_timing_enabled: cutlass.Constexpr[bool],
+            # Task 8: K-parallel W_O split factor (Constexpr so
+            # range_constexpr can iterate over slices). The new W_O
+            # gate is `bx < wo_split_const && by < num_kv_heads`. At
+            # wo_split_const=1 it is equivalent to `bx == 0 && by < 4`,
+            # so the legacy single-CTA-per-KV-head behavior is preserved
+            # bit-exactly.
+            wo_split_const: cutlass.Constexpr[int],
         ):
             """β-coop unified kernel — Phase 0 → 1 → grid barrier → 3."""
             bx, by, bz = cute.arch.block_idx()
@@ -3417,7 +3487,7 @@ def _kernel_phase_0_to_4(
                     t_entry = _read_globaltimer_u64()
                     _st_global_u64(
                         region_timing_ptr
-                        + Int64(cta_id) * Int64(11 * 2 * 8)
+                        + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE)
                         + Int64(0 * 2 * 8)              # region 0
                         + Int64(0 * 8),                  # slot 0 = entry
                         t_entry,
@@ -3507,7 +3577,7 @@ def _kernel_phase_0_to_4(
                     t_exit = _read_globaltimer_u64()
                     _st_global_u64(
                         region_timing_ptr
-                        + Int64(cta_id) * Int64(11 * 2 * 8)
+                        + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE)
                         + Int64(0 * 2 * 8)              # region 0
                         + Int64(1 * 8),                  # slot 1 = exit
                         t_exit,
@@ -3528,7 +3598,7 @@ def _kernel_phase_0_to_4(
                     t_entry = _read_globaltimer_u64()
                     _st_global_u64(
                         region_timing_ptr
-                        + Int64(cta_id) * Int64(11 * 2 * 8)
+                        + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE)
                         + Int64(1 * 2 * 8)              # region 1
                         + Int64(0 * 8),                  # slot 0 = entry
                         t_entry,
@@ -3995,176 +4065,364 @@ def _kernel_phase_0_to_4(
                             t_exit = _read_globaltimer_u64()
                             _st_global_u64(
                                 region_timing_ptr
-                                + Int64(cta_id) * Int64(11 * 2 * 8)
+                                + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE)
                                 + Int64(1 * 2 * 8)              # region 1
                                 + Int64(1 * 8),                  # slot 1 = exit
                                 t_exit,
                             )
 
-                    # === Phase B: Fused W_O GEMV ===
-                    # Region 2 entry: Phase 1 W_O GEMV body. We are inside
-                    # the bx==0 && by<4 block, so just gate on tid==0.
-                    # K-reduction candidate site #1.
-                    if region_timing_enabled:
-                        if tid == Int32(0):
-                            cta_id = (
-                                bz * Int32(self.slice_ctas * self.num_k_tiles)
-                                + by * Int32(self.slice_ctas)
-                                + bx
-                            )
-                            t_entry = _read_globaltimer_u64()
-                            _st_global_u64(
-                                region_timing_ptr
-                                + Int64(cta_id) * Int64(11 * 2 * 8)
-                                + Int64(2 * 2 * 8)              # region 2
-                                + Int64(0 * 8),                  # slot 0 = entry
-                                t_entry,
-                            )
-
+                    # ========================================================
+                    # Task 6: pre-W_O producer fence + arrival signal.
+                    # Runs INSIDE the bx==0 && by<num_kv_heads parent (this
+                    # block). Each of the 4 attn CTAs (bx==0, by∈[0,4)) does
+                    # one atomic_add 1 after attn_output writes are flushed
+                    # to global. At wo_split=1 no consumer reads the counter
+                    # so the signal is harmless overhead. At wo_split>1 the
+                    # bx>0 W_O CTAs spin-acquire on the counter at kernel-
+                    # level R11 (placed between this parent and the new
+                    # W_O block).
+                    # ========================================================
                     _threadfence()
                     cute.arch.sync_threads()
+                    if tid == Int32(0):
+                        _atomic_add_u32(
+                            pre_wo_arrival_ptr
+                            + Int64(seq_idx * Int32(4)),
+                            Int32(1))
+                    # End of attn-producer parent block. Task 8 lifted W_O
+                    # body out to kernel-level (after R11 wait).
 
-                    attn_base = seq_idx * num_q_heads * hd \
-                        + q_head_start * hd
-                    hd_wo = Int32(self.hidden_size)
-                    n_per_thr_wo = Int32(
-                        self.hidden_size // self.num_threads)
-                    my_row_base = tid * n_per_thr_wo
-
-                    wo_gs = _ld_global_f32(wo_gs_ptr)
+            # ===================================================================
+            # Task 6: R11 = pre_wo_wait. Consumer-side spin-wait + bracketing
+            # timing. Runs at kernel-level (OUTSIDE the bx==0 parent above).
+            #
+            # At wo_split=1 the consumer mask `bx>0 && bx<wo_split=1 && by<4`
+            # is empty — no CTA executes the body, R11 buffer rows stay zero
+            # and the host nonzero filter drops them. Producer atomic_add at
+            # the end of the parent block (4 attn CTAs) brings the counter to
+            # num_kv_heads each launch; reset to 0 by host .zero_() before
+            # the next launch, so no in-kernel decrement is required (mirrors
+            # mlp_arrival_count.zero_() pattern).
+            #
+            # At wo_split>1 (Task 8), the K-parallel W_O CTAs (bx∈[1, wo_split),
+            # by<num_kv_heads) sit here waiting for the 4 producers, then
+            # acquire-fence and proceed to their per-CTA W_O GEMV slice. The
+            # mask is `bx > 0` (NOT `bx >= 0`) — bx==0 is the producer path
+            # and must not enter the consumer wait or it deadlocks.
+            # ===================================================================
+
+            # bx==0 producer CTAs skip the R11 wait (consumer mask FALSE) and
+            # re-enter the W_O gate below: their attn_output reads are intra-CTA,
+            # no acquire-fence needed (only bx>0 consumers acquire below).
+            pre_wo_consumer_active = (
+                (bx > Int32(0))
+                and (bx < Int32(wo_split_const))
+                and (by < Int32(self.num_kv_heads))
+            )
 
-                    for _out_group in cutlass.range_constexpr(
-                        self.hidden_size // self.num_threads // 8
-                    ):
-                        out_base_wo = my_row_base \
-                            + Int32(_out_group * 8)
+            # Region 11 entry: pre_wo_wait (consumer-only mask).
+            if region_timing_enabled:
+                if pre_wo_consumer_active and tid == Int32(0):
+                    cta_id = (
+                        bz * Int32(self.slice_ctas * self.num_k_tiles)
+                        + by * Int32(self.slice_ctas)
+                        + bx
+                    )
+                    t_entry = _read_globaltimer_u64()
+                    _st_global_u64(
+                        region_timing_ptr
+                        + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE)
+                        + Int64(11 * 2 * 8)              # region 11
+                        + Int64(0 * 8),                  # slot 0 = entry
+                        t_entry,
+                    )
 
-                        a0 = Float32(0.0)
-                        a1 = Float32(0.0)
-                        a2 = Float32(0.0)
-                        a3 = Float32(0.0)
-                        a4 = Float32(0.0)
-                        a5 = Float32(0.0)
-                        a6 = Float32(0.0)
-                        a7 = Float32(0.0)
+            # Spin-wait for producers (consumer-only mask). At wo_split=1 the
+            # mask is empty so no CTA enters the loop. At wo_split>1 each
+            # consumer CTA (bx>0) loops on a volatile load until the counter
+            # reaches num_kv_heads (= number of producers).
+            if pre_wo_consumer_active:
+                pre_wo_arrived = Int32(0)
+                while pre_wo_arrived < num_kv_heads:
+                    pre_wo_arrived = _ld_volatile_u32(
+                        pre_wo_arrival_ptr
+                        + Int64(seq_idx * Int32(4))
+                    )
+                _acquire_fence()
+                cute.arch.sync_threads()
 
-                        k_dim = group_size_p1 * hd
-                        k_idx = Int32(0)
-                        while k_idx < k_dim:
-                            attn_val = _ld_global_b16_to_f32(
-                                attn_output_ptr
-                                + Int64((attn_base + k_idx) * Int32(2)))
-                            abs_k = (kv_head_idx * group_size_p1 * hd
-                                     + k_idx)
-                            k_byte = abs_k >> Int32(1)
-                            k_is_hi = abs_k & Int32(1)
-                            k_grp = abs_k >> Int32(4)
+            # Region 11 exit: pre_wo_wait (consumer-only mask).
+            if region_timing_enabled:
+                if pre_wo_consumer_active and tid == Int32(0):
+                    cta_id = (
+                        bz * Int32(self.slice_ctas * self.num_k_tiles)
+                        + by * Int32(self.slice_ctas)
+                        + bx
+                    )
+                    t_exit = _read_globaltimer_u64()
+                    _st_global_u64(
+                        region_timing_ptr
+                        + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE)
+                        + Int64(11 * 2 * 8)              # region 11
+                        + Int64(1 * 8),                  # slot 1 = exit
+                        t_exit,
+                    )
 
-                            for _oi in cutlass.range_constexpr(8):
-                                out_row = out_base_wo + Int32(_oi)
-                                if out_row < hd_wo:
-                                    w_addr = wo_weight_ptr + Int64(
-                                        out_row * wo_weight_row_stride
-                                        + k_byte)
-                                    aligned = w_addr & Int64(
-                                        0xFFFFFFFFFFFFFFFC)
-                                    raw = _ld_global_b32(aligned)
-                                    bpos = Int32(w_addr & Int64(3))
-                                    the_byte = _extract_byte_from_b32(
-                                        raw, bpos)
-                                    nib_shift = k_is_hi << Int32(2)
-                                    nib = (the_byte >> nib_shift) \
-                                        & Int32(0x0F)
-                                    w_f32 = _fp4_nibble_to_f32(nib)
-                                    sf = _ld_swizzled_scale(
-                                        wo_scale_ptr, out_row, k_grp,
-                                        wo_num_k_tiles)
-                                    w_dequant = w_f32 * sf * wo_gs
+            # ===================================================================
+            # Task 8: K-parallel W_O GEMV body (lifted out of attn-producer
+            # parent). Mask: `bx < wo_split && by < num_kv_heads`.
+            #
+            # At wo_split_const=1, mask `bx<1 && by<4` ≡ `bx==0 && by<4`, so
+            # the legacy single-CTA-per-KV-head behavior is preserved bit-exactly.
+            #
+            # At wo_split_const>1, mask spans `bx ∈ [0, wo_split)` × `by ∈ [0,
+            # num_kv_heads)`, giving total_wo_slots = num_kv_heads * wo_split
+            # K-parallel partial-GEMV CTAs. Each CTA owns a 1/wo_split slice
+            # of one KV-head's K range, accumulates a chained-FMA partial,
+            # and writes to wo_output[seq, slot, :] where slot = by*wo_split+bx
+            # (matches torch_reference.py:438-439, microkernel.py:225-227).
+            #
+            # K-range slicing follows torch_reference.py:443-446 EXACTLY:
+            #     K_per_head      = K // num_kv_heads
+            #     k_start_in_head = (K_per_head * bx) // wo_split
+            #     k_end_in_head   = (K_per_head * (bx + 1)) // wo_split
+            #     k_start = by * K_per_head + k_start_in_head
+            #     k_end   = by * K_per_head + k_end_in_head
+            # ===================================================================
+            if (bx < Int32(wo_split_const)) and (by < Int32(self.num_kv_heads)):
+                # Region 2 entry: Phase 1 W_O GEMV body (now at kernel-level).
+                if region_timing_enabled:
+                    if tid == Int32(0):
+                        cta_id = (
+                            bz * Int32(self.slice_ctas * self.num_k_tiles)
+                            + by * Int32(self.slice_ctas)
+                            + bx
+                        )
+                        t_entry = _read_globaltimer_u64()
+                        _st_global_u64(
+                            region_timing_ptr
+                            + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE)
+                            + Int64(2 * 2 * 8)              # region 2
+                            + Int64(0 * 8),                  # slot 0 = entry
+                            t_entry,
+                        )
 
-                                    if _oi == 0:
-                                        a0 = a0 + w_dequant * attn_val
-                                    if _oi == 1:
-                                        a1 = a1 + w_dequant * attn_val
-                                    if _oi == 2:
-                                        a2 = a2 + w_dequant * attn_val
-                                    if _oi == 3:
-                                        a3 = a3 + w_dequant * attn_val
-                                    if _oi == 4:
-                                        a4 = a4 + w_dequant * attn_val
-                                    if _oi == 5:
-                                        a5 = a5 + w_dequant * attn_val
-                                    if _oi == 6:
-                                        a6 = a6 + w_dequant * attn_val
-                                    if _oi == 7:
-                                        a7 = a7 + w_dequant * attn_val
+                _threadfence()
+                cute.arch.sync_threads()
 
-                            k_idx = k_idx + Int32(1)
+                # --- W_O body locals (recomputed; no longer inherited from
+                # attn-producer parent).
+                kv_head_idx_wo = by
+                group_size_p1_wo = num_q_heads // num_kv_heads
+                q_head_start_wo = kv_head_idx_wo * group_size_p1_wo
+                hd_wo_dim = Int32(self.head_dim)
+                attn_base_wo = (seq_idx * num_q_heads * hd_wo_dim
+                                + q_head_start_wo * hd_wo_dim)
+
+                # K_per_head = num_q_heads * head_dim // num_kv_heads. Computed
+                # here at runtime as `group_size_p1_wo * hd_wo_dim`, which
+                # matches the constexpr form when num_kv_heads divides
+                # num_q_heads; this honors the kernel's dynamic shape contract.
+                k_per_head = group_size_p1_wo * hd_wo_dim
+                # K-range slicing. Python `//` floors and signed i32 division
+                # truncates; the two agree for non-negative operands.
+                k_start_in_head = (k_per_head * bx) // Int32(wo_split_const)
+                k_end_in_head = (k_per_head * (bx + Int32(1))) // Int32(wo_split_const)
+
+                hd_wo = Int32(self.hidden_size)
+                n_per_thr_wo = Int32(
+                    self.hidden_size // self.num_threads)
+                my_row_base = tid * n_per_thr_wo
+
+                wo_gs = _ld_global_f32(wo_gs_ptr)
+
+                for _out_group in cutlass.range_constexpr(
+                    self.hidden_size // self.num_threads // 8
+                ):
+                    out_base_wo = my_row_base \
+                        + Int32(_out_group * 8)
+
+                    a0 = Float32(0.0)
+                    a1 = Float32(0.0)
+                    a2 = Float32(0.0)
+                    a3 = Float32(0.0)
+                    a4 = Float32(0.0)
+                    a5 = Float32(0.0)
+                    a6 = Float32(0.0)
+                    a7 = Float32(0.0)
+
+                    # K-loop iterates over [k_start_in_head, k_end_in_head)
+                    # local-in-head. abs_k = kv_head_idx*K_per_head + k_idx
+                    # (matches the attn_output layout: K dim is
+                    # num_kv_heads-major → K_per_head-minor).
+                    k_idx = k_start_in_head
+                    while k_idx < k_end_in_head:
+                        attn_val = _ld_global_b16_to_f32(
+                            attn_output_ptr
+                            + Int64((attn_base_wo + k_idx) * Int32(2)))
+                        abs_k = (kv_head_idx_wo * k_per_head
+                                 + k_idx)
+                        k_byte = abs_k >> Int32(1)
+                        k_is_hi = abs_k & Int32(1)
+                        k_grp = abs_k >> Int32(4)
 
-                        cta_idx = bx * num_kv_heads + by
-                        wo_slot_base = wo_output_ptr + Int64(
-                            (seq_idx * total_ctas_per_seq_attn + cta_idx)
-                            * hd_wo * Int32(4))
                         for _oi in cutlass.range_constexpr(8):
                             out_row = out_base_wo + Int32(_oi)
                             if out_row < hd_wo:
+                                w_addr = wo_weight_ptr + Int64(
+                                    out_row * wo_weight_row_stride
+                                    + k_byte)
+                                aligned = w_addr & Int64(
+                                    0xFFFFFFFFFFFFFFFC)
+                                raw = _ld_global_b32(aligned)
+                                bpos = Int32(w_addr & Int64(3))
+                                the_byte = _extract_byte_from_b32(
+                                    raw, bpos)
+                                nib_shift = k_is_hi << Int32(2)
+                                nib = (the_byte >> nib_shift) \
+                                    & Int32(0x0F)
+                                w_f32 = _fp4_nibble_to_f32(nib)
+                                sf = _ld_swizzled_scale(
+                                    wo_scale_ptr, out_row, k_grp,
+                                    wo_num_k_tiles)
+                                w_dequant = w_f32 * sf * wo_gs
+
                                 if _oi == 0:
-                                    _st_global_f32(
-                                        wo_slot_base + Int64(
-                                            out_row * Int32(4)), a0)
+                                    a0 = a0 + w_dequant * attn_val
                                 if _oi == 1:
-                                    _st_global_f32(
-                                        wo_slot_base + Int64(
-                                            out_row * Int32(4)), a1)
+                                    a1 = a1 + w_dequant * attn_val
                                 if _oi == 2:
-                                    _st_global_f32(
-                                        wo_slot_base + Int64(
-                                            out_row * Int32(4)), a2)
+                                    a2 = a2 + w_dequant * attn_val
                                 if _oi == 3:
-                                    _st_global_f32(
-                                        wo_slot_base + Int64(
-                                            out_row * Int32(4)), a3)
+                                    a3 = a3 + w_dequant * attn_val
                                 if _oi == 4:
-                                    _st_global_f32(
-                                        wo_slot_base + Int64(
-                                            out_row * Int32(4)), a4)
+                                    a4 = a4 + w_dequant * attn_val
                                 if _oi == 5:
-                                    _st_global_f32(
-                                        wo_slot_base + Int64(
-                                            out_row * Int32(4)), a5)
+                                    a5 = a5 + w_dequant * attn_val
                                 if _oi == 6:
-                                    _st_global_f32(
-                                        wo_slot_base + Int64(
-                                            out_row * Int32(4)), a6)
+                                    a6 = a6 + w_dequant * attn_val
                                 if _oi == 7:
-                                    _st_global_f32(
-                                        wo_slot_base + Int64(
-                                            out_row * Int32(4)), a7)
+                                    a7 = a7 + w_dequant * attn_val
+
+                        k_idx = k_idx + Int32(1)
+
+                    # Slot index: slot = by * wo_split + bx (matches
+                    # torch_reference.py:438-439 — slot_id // wo_split == by,
+                    # slot_id %  wo_split == bx).
+                    slot_idx = by * Int32(wo_split_const) + bx
+                    wo_slot_base = wo_output_ptr + Int64(
+                        (seq_idx * total_wo_slots + slot_idx)
+                        * hd_wo * Int32(4))
+                    for _oi in cutlass.range_constexpr(8):
+                        out_row = out_base_wo + Int32(_oi)
+                        if out_row < hd_wo:
+                            if _oi == 0:
+                                _st_global_f32(
+                                    wo_slot_base + Int64(
+                                        out_row * Int32(4)), a0)
+                            if _oi == 1:
+                                _st_global_f32(
+                                    wo_slot_base + Int64(
+                                        out_row * Int32(4)), a1)
+                            if _oi == 2:
+                                _st_global_f32(
+                                    wo_slot_base + Int64(
+                                        out_row * Int32(4)), a2)
+                            if _oi == 3:
+                                _st_global_f32(
+                                    wo_slot_base + Int64(
+                                        out_row * Int32(4)), a3)
+                            if _oi == 4:
+                                _st_global_f32(
+                                    wo_slot_base + Int64(
+                                        out_row * Int32(4)), a4)
+                            if _oi == 5:
+                                _st_global_f32(
+                                    wo_slot_base + Int64(
+                                        out_row * Int32(4)), a5)
+                            if _oi == 6:
+                                _st_global_f32(
+                                    wo_slot_base + Int64(
+                                        out_row * Int32(4)), a6)
+                            if _oi == 7:
+                                _st_global_f32(
+                                    wo_slot_base + Int64(
+                                        out_row * Int32(4)), a7)
 
-                    # === Phase B.5 + C: last-CTA gather + RMSNorm ===
-                    _threadfence()
+                # === Phase B.5 + C: last-CTA gather + RMSNorm ===
+                _threadfence()
 
-                    # Region 2 exit: Phase 1 W_O GEMV body. Recorded just
-                    # after the W_O writes are published by _threadfence.
-                    # We are inside the bx==0 && by<4 block, gate on tid==0.
-                    if region_timing_enabled:
-                        if tid == Int32(0):
-                            cta_id = (
-                                bz * Int32(self.slice_ctas * self.num_k_tiles)
-                                + by * Int32(self.slice_ctas)
-                                + bx
-                            )
-                            t_exit = _read_globaltimer_u64()
-                            _st_global_u64(
-                                region_timing_ptr
-                                + Int64(cta_id) * Int64(11 * 2 * 8)
-                                + Int64(2 * 2 * 8)              # region 2
-                                + Int64(1 * 8),                  # slot 1 = exit
-                                t_exit,
-                            )
+                # Region 2 exit: Phase 1 W_O GEMV body. Recorded just after
+                # the W_O writes are published by _threadfence.
+                if region_timing_enabled:
+                    if tid == Int32(0):
+                        cta_id = (
+                            bz * Int32(self.slice_ctas * self.num_k_tiles)
+                            + by * Int32(self.slice_ctas)
+                            + bx
+                        )
+                        t_exit = _read_globaltimer_u64()
+                        _st_global_u64(
+                            region_timing_ptr
+                            + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE)
+                            + Int64(2 * 2 * 8)              # region 2
+                            + Int64(1 * 8),                  # slot 1 = exit
+                            t_exit,
+                        )
+
+                # Region 3 entry: W_O end → barrier-arrive (cleanup).
+                if region_timing_enabled:
+                    if tid == Int32(0):
+                        cta_id = (
+                            bz * Int32(self.slice_ctas * self.num_k_tiles)
+                            + by * Int32(self.slice_ctas)
+                            + bx
+                        )
+                        t_entry = _read_globaltimer_u64()
+                        _st_global_u64(
+                            region_timing_ptr
+                            + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE)
+                            + Int64(3 * 2 * 8)              # region 3
+                            + Int64(0 * 8),                  # slot 0 = entry
+                            t_entry,
+                        )
+
+                # Region 3 exit: just before _atomic_add_u32 to
+                # phase1_arrival_count (the per-CTA barrier-arrive signal).
+                if region_timing_enabled:
+                    if tid == Int32(0):
+                        cta_id = (
+                            bz * Int32(self.slice_ctas * self.num_k_tiles)
+                            + by * Int32(self.slice_ctas)
+                            + bx
+                        )
+                        t_exit = _read_globaltimer_u64()
+                        _st_global_u64(
+                            region_timing_ptr
+                            + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE)
+                            + Int64(3 * 2 * 8)              # region 3
+                            + Int64(1 * 8),                  # slot 1 = exit
+                            t_exit,
+                        )
+
+                # Election: tid==0 of each W_O CTA atomic-adds 1; the CTA that
+                # reads back old_count == total_wo_slots-1 arrived last, wins,
+                # and broadcasts is_last_cta=1 via SMEM. Other CTAs broadcast 0.
+                if tid == Int32(0):
+                    old_count = _atomic_add_u32(
+                        phase1_arrival_ptr
+                        + Int64(seq_idx * Int32(4)),
+                        Int32(1))
+                    if old_count == total_wo_slots - Int32(1):
+                        _st_shared_f32(sync_md, Float32(1.0))
+                    else:
+                        _st_shared_f32(sync_md, Float32(0.0))
+                cute.arch.sync_threads()
 
-                    # Region 3 entry: W_O end → barrier-arrive (cleanup).
-                    # Same site as region 2 exit (W_O sync), but recorded
-                    # as a separate block to keep the buffer boundary clean.
+                is_last_cta = _ld_shared_f32(sync_md)
+
+                if is_last_cta > Float32(0.5):
+                    # Region 12 entry: gather_reduce (elected single-CTA only).
                     if region_timing_enabled:
                         if tid == Int32(0):
                             cta_id = (
@@ -4175,15 +4433,130 @@ def _kernel_phase_0_to_4(
                             t_entry = _read_globaltimer_u64()
                             _st_global_u64(
                                 region_timing_ptr
-                                + Int64(cta_id) * Int64(11 * 2 * 8)
-                                + Int64(3 * 2 * 8)              # region 3
+                                + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE)
+                                + Int64(12 * 2 * 8)              # region 12
                                 + Int64(0 * 8),                  # slot 0 = entry
                                 t_entry,
                             )
 
-                    # Region 3 exit: just before the _atomic_add_u32 to
-                    # phase1_arrival_count (the per-CTA barrier-arrive
-                    # signal). We are inside bx==0 && by<4, gate on tid==0.
+                    hd_c = hidden_dim
+                    n_per_thr_c = hd_c // Int32(128)
+
+                    res_base_c = residual_in_ptr + Int64(
+                        seq_idx * hd_c * Int32(2))
+                    wo_base_c = wo_output_ptr + Int64(
+                        seq_idx * total_wo_slots
+                        * hd_c * Int32(4))
+                    gamma_base_c = post_attn_gamma_ptr
+                    out_base_c = attn_output_ptr + Int64(
+                        seq_idx * hd_c * Int32(2))
+                    resout_base_c = residual_output_ptr + Int64(
+                        seq_idx * hd_c * Int32(2))
+
+                    my_start_c = tid * n_per_thr_c
+
+                    # Phase B.5: sum all total_wo_slots partials into slot 0.
+                    for _grp in cutlass.range_constexpr(
+                        self.hidden_size // self.num_threads // 8
+                    ):
+                        for _ei in cutlass.range_constexpr(8):
+                            idx_c = my_start_c + Int32(_grp * 8 + _ei)
+                            gather_acc = Float32(0.0)
+                            cta_i = Int32(0)
+                            while cta_i < total_wo_slots:
+                                slot_addr = wo_output_ptr + Int64(
+                                    (seq_idx * total_wo_slots
+                                     + cta_i)
+                                    * hd_c * Int32(4)
+                                    + idx_c * Int32(4))
+                                gather_acc = gather_acc \
+                                    + _ld_global_f32(slot_addr)
+                                cta_i = cta_i + Int32(1)
+                            _st_global_f32(
+                                wo_base_c
+                                + Int64(idx_c * Int32(4)),
+                                gather_acc,
+                            )
+                    _threadfence()
+                    cute.arch.sync_threads()
+
+                    # Pass 1: residual add + sum-of-squares
+                    ss = Float32(0.0)
+                    for _grp in cutlass.range_constexpr(
+                        self.hidden_size // self.num_threads // 8
+                    ):
+                        base_idx = my_start_c + Int32(_grp * 8)
+                        for _ei in cutlass.range_constexpr(8):
+                            idx_c = base_idx + Int32(_ei)
+                            res_f32 = _ld_global_b16_to_f32(
+                                res_base_c
+                                + Int64(idx_c * Int32(2)))
+                            wo_f32 = _ld_global_f32(
+                                wo_base_c
+                                + Int64(idx_c * Int32(4)))
+                            nr = res_f32 + wo_f32
+                            ss = ss + nr * nr
+
+                    ss = ss + shfl_xor_sync(ss, Int32(1))
+                    ss = ss + shfl_xor_sync(ss, Int32(2))
+                    ss = ss + shfl_xor_sync(ss, Int32(4))
+                    ss = ss + shfl_xor_sync(ss, Int32(8))
+                    ss = ss + shfl_xor_sync(ss, Int32(16))
+
+                    if lane == Int32(0):
+                        _st_shared_f32(
+                            sync_md + Int64(warp * Int32(4)), ss)
+                    cute.arch.sync_threads()
+
+                    if warp == Int32(0):
+                        if lane == Int32(0):
+                            total_ss = _ld_shared_f32(sync_md)
+                            total_ss = total_ss + _ld_shared_f32(
+                                sync_md + Int64(4))
+                            total_ss = total_ss + _ld_shared_f32(
+                                sync_md + Int64(8))
+                            total_ss = total_ss + _ld_shared_f32(
+                                sync_md + Int64(12))
+                            variance = total_ss / Float32(hd_c)
+                            inv_rms = _rsqrt_approx_f32(
+                                variance + rms_eps)
+                            _st_shared_f32(sync_md, inv_rms)
+                    cute.arch.sync_threads()
+
+                    inv_rms_val = _ld_shared_f32(sync_md)
+
+                    # Pass 3: re-read, scale, write BF16 output
+                    for _grp in cutlass.range_constexpr(
+                        self.hidden_size // self.num_threads // 8
+                    ):
+                        base_idx = my_start_c + Int32(_grp * 8)
+                        for _oi in cutlass.range_constexpr(8):
+                            idx_c = base_idx + Int32(_oi)
+                            res_f32 = _ld_global_b16_to_f32(
+                                res_base_c
+                                + Int64(idx_c * Int32(2)))
+                            wo_f32 = _ld_global_f32(
+                                wo_base_c
+                                + Int64(idx_c * Int32(4)))
+                            new_res = res_f32 + wo_f32
+
+                            gamma_f32 = _ld_global_b16_to_f32(
+                                gamma_base_c
+                                + Int64(idx_c * Int32(2)))
+                            # Qwen3_5RMSNorm uses x * (1 + γ) — see vllm/nvllm/layers/layernorm.py:78
+                            hidden_val = new_res * inv_rms_val \
+                                * (Float32(1.0) + gamma_f32)
+
+                            _st_global_bf16_from_f32(
+                                out_base_c
+                                + Int64(idx_c * Int32(2)),
+                                hidden_val)
+                            _st_global_bf16_from_f32(
+                                resout_base_c
+                                + Int64(idx_c * Int32(2)),
+                                new_res)
+
+                    # Region 12 exit: gather_reduce.
                     if region_timing_enabled:
                         if tid == Int32(0):
                             cta_id = (
@@ -4194,149 +4567,18 @@ def _kernel_phase_0_to_4(
                             t_exit = _read_globaltimer_u64()
                             _st_global_u64(
                                 region_timing_ptr
-                                + Int64(cta_id) * Int64(11 * 2 * 8)
-                                + Int64(3 * 2 * 8)              # region 3
+                                + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE)
+                                + Int64(12 * 2 * 8)              # region 12
                                 + Int64(1 * 8),                  # slot 1 = exit
                                 t_exit,
                             )
 
+                    # Reset arrival counter for next call.
                     if tid == Int32(0):
-                        old_count = _atomic_add_u32(
+                        _atomic_add_u32(
                             phase1_arrival_ptr
                             + Int64(seq_idx * Int32(4)),
-                            Int32(1))
-                        if old_count == total_ctas_per_seq_attn - Int32(1):
-                            _st_shared_f32(sync_md, Float32(1.0))
-                        else:
-                            _st_shared_f32(sync_md, Float32(0.0))
-                    cute.arch.sync_threads()
-
-                    is_last_cta = _ld_shared_f32(sync_md)
-
-                    if is_last_cta > Float32(0.5):
-                        hd_c = hidden_dim
-                        n_per_thr_c = hd_c // Int32(128)
-
-                        res_base_c = residual_in_ptr + Int64(
-                            seq_idx * hd_c * Int32(2))
-                        wo_base_c = wo_output_ptr + Int64(
-                            seq_idx * total_ctas_per_seq_attn
-                            * hd_c * Int32(4))
-                        gamma_base_c = post_attn_gamma_ptr
-                        out_base_c = attn_output_ptr + Int64(
-                            seq_idx * hd_c * Int32(2))
-                        resout_base_c = residual_output_ptr + Int64(
-                            seq_idx * hd_c * Int32(2))
-
-                        my_start_c = tid * n_per_thr_c
-
-                        # Phase B.5: gather per-CTA slots into slot 0.
-                        for _grp in cutlass.range_constexpr(
-                            self.hidden_size // self.num_threads // 8
-                        ):
-                            for _ei in cutlass.range_constexpr(8):
-                                idx_c = my_start_c + Int32(_grp * 8 + _ei)
-                                gather_acc = Float32(0.0)
-                                cta_i = Int32(0)
-                                while cta_i < total_ctas_per_seq_attn:
-                                    slot_addr = wo_output_ptr + Int64(
-                                        (seq_idx * total_ctas_per_seq_attn
-                                         + cta_i)
-                                        * hd_c * Int32(4)
-                                        + idx_c * Int32(4))
-                                    gather_acc = gather_acc \
-                                        + _ld_global_f32(slot_addr)
-                                    cta_i = cta_i + Int32(1)
-                                _st_global_f32(
-                                    wo_base_c
-                                    + Int64(idx_c * Int32(4)),
-                                    gather_acc,
-                                )
-                        _threadfence()
-                        cute.arch.sync_threads()
-
-                        # Pass 1: residual add + sum-of-squares
-                        ss = Float32(0.0)
-                        for _grp in cutlass.range_constexpr(
-                            self.hidden_size // self.num_threads // 8
-                        ):
-                            base_idx = my_start_c + Int32(_grp * 8)
-                            for _ei in cutlass.range_constexpr(8):
-                                idx_c = base_idx + Int32(_ei)
-                                res_f32 = _ld_global_b16_to_f32(
-                                    res_base_c
-                                    + Int64(idx_c * Int32(2)))
-                                wo_f32 = _ld_global_f32(
-                                    wo_base_c
-                                    + Int64(idx_c * Int32(4)))
-                                nr = res_f32 + wo_f32
-                                ss = ss + nr * nr
-
-                        ss = ss + shfl_xor_sync(ss, Int32(1))
-                        ss = ss + shfl_xor_sync(ss, Int32(2))
-                        ss = ss + shfl_xor_sync(ss, Int32(4))
-                        ss = ss + shfl_xor_sync(ss, Int32(8))
-                        ss = ss + shfl_xor_sync(ss, Int32(16))
-
-                        if lane == Int32(0):
-                            _st_shared_f32(
-                                sync_md + Int64(warp * Int32(4)), ss)
-                        cute.arch.sync_threads()
-
-                        if warp == Int32(0):
-                            if lane == Int32(0):
-                                total_ss = _ld_shared_f32(sync_md)
-                                total_ss = total_ss + _ld_shared_f32(
-                                    sync_md + Int64(4))
-                                total_ss = total_ss + _ld_shared_f32(
-                                    sync_md + Int64(8))
-                                total_ss = total_ss + _ld_shared_f32(
-                                    sync_md + Int64(12))
-                                variance = total_ss / Float32(hd_c)
-                                inv_rms = _rsqrt_approx_f32(
-                                    variance + rms_eps)
-                                _st_shared_f32(sync_md, inv_rms)
-                        cute.arch.sync_threads()
-
-                        inv_rms_val = _ld_shared_f32(sync_md)
-
-                        # Pass 3: re-read, scale, write BF16 output
-                        for _grp in cutlass.range_constexpr(
-                            self.hidden_size // self.num_threads // 8
-                        ):
-                            base_idx = my_start_c + Int32(_grp * 8)
-                            for _oi in cutlass.range_constexpr(8):
-                                idx_c = base_idx + Int32(_oi)
-                                res_f32 = _ld_global_b16_to_f32(
-                                    res_base_c
-                                    + Int64(idx_c * Int32(2)))
-                                wo_f32 = _ld_global_f32(
-                                    wo_base_c
-                                    + Int64(idx_c * Int32(4)))
-                                new_res = res_f32 + wo_f32
-
-                                gamma_f32 = _ld_global_b16_to_f32(
-                                    gamma_base_c
-                                    + Int64(idx_c * Int32(2)))
-                                # Qwen3_5RMSNorm uses x * (1 + γ) — see vllm/nvllm/layers/layernorm.py:78
-                                hidden_val = new_res * inv_rms_val \
-                                    * (Float32(1.0) + gamma_f32)
-
-                                _st_global_bf16_from_f32(
-                                    out_base_c
-                                    + Int64(idx_c * Int32(2)),
-                                    hidden_val)
-                                _st_global_bf16_from_f32(
-                                    resout_base_c
-                                    + Int64(idx_c * Int32(2)),
-                                    new_res)
-
-                        # Reset arrival counter for next call.
-                        if tid == Int32(0):
-                            _atomic_add_u32(
-                                phase1_arrival_ptr
-                                + Int64(seq_idx * Int32(4)),
-                                Int32(0) - total_ctas_per_seq_attn)
+                            Int32(0) - total_wo_slots)
 
             # Region 4 entry: grid barrier wait (all 64 CTAs participate).
             # Entry tick recorded at the moment a CTA arrives at the
@@ -4353,7 +4595,7 @@ def _kernel_phase_0_to_4(
                     t_entry = _read_globaltimer_u64()
                     _st_global_u64(
                         region_timing_ptr
-                        + Int64(cta_id) * Int64(11 * 2 * 8)
+                        + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE)
                         + Int64(4 * 2 * 8)              # region 4
                         + Int64(0 * 8),                  # slot 0 = entry
                         t_entry,
@@ -4395,7 +4637,7 @@ def _kernel_phase_0_to_4(
                     t_exit = _read_globaltimer_u64()
                     _st_global_u64(
                         region_timing_ptr
-                        + Int64(cta_id) * Int64(11 * 2 * 8)
+                        + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE)
                         + Int64(4 * 2 * 8)              # region 4
                         + Int64(1 * 8),                  # slot 1 = exit
                         t_exit,
@@ -4417,7 +4659,7 @@ def _kernel_phase_0_to_4(
                     t_entry = _read_globaltimer_u64()
                     _st_global_u64(
                         region_timing_ptr
-                        + Int64(cta_id) * Int64(11 * 2 * 8)
+                        + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE)
                         + Int64(5 * 2 * 8)              # region 5
                         + Int64(0 * 8),                  # slot 0 = entry
                         t_entry,
@@ -4492,7 +4734,7 @@ def _kernel_phase_0_to_4(
                     t_exit = _read_globaltimer_u64()
                     _st_global_u64(
                         region_timing_ptr
-                        + Int64(cta_id) * Int64(11 * 2 * 8)
+                        + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE)
                         + Int64(5 * 2 * 8)              # region 5
                         + Int64(1 * 8),                  # slot 1 = exit
                         t_exit,
@@ -4511,7 +4753,7 @@ def _kernel_phase_0_to_4(
                     t_entry = _read_globaltimer_u64()
                     _st_global_u64(
                         region_timing_ptr
-                        + Int64(cta_id) * Int64(11 * 2 * 8)
+                        + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE)
                         + Int64(6 * 2 * 8)              # region 6
                         + Int64(0 * 8),                  # slot 0 = entry
                         t_entry,
@@ -4580,7 +4822,7 @@ def _kernel_phase_0_to_4(
                     t_exit = _read_globaltimer_u64()
                     _st_global_u64(
                         region_timing_ptr
-                        + Int64(cta_id) * Int64(11 * 2 * 8)
+                        + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE)
                         + Int64(6 * 2 * 8)              # region 6
                         + Int64(1 * 8),                  # slot 1 = exit
                         t_exit,
@@ -4602,7 +4844,7 @@ def _kernel_phase_0_to_4(
                         t_entry = _read_globaltimer_u64()
                         _st_global_u64(
                             region_timing_ptr
-                            + Int64(cta_id) * Int64(11 * 2 * 8)
+                            + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE)
                             + Int64(7 * 2 * 8)              # region 7
                             + Int64(0 * 8),                  # slot 0 = entry
                             t_entry,
@@ -4741,7 +4983,7 @@ def _kernel_phase_0_to_4(
                         t_exit = _read_globaltimer_u64()
                         _st_global_u64(
                             region_timing_ptr
-                            + Int64(cta_id) * Int64(11 * 2 * 8)
+                            + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE)
                             + Int64(7 * 2 * 8)              # region 7
                             + Int64(1 * 8),                  # slot 1 = exit
                             t_exit,
@@ -4760,7 +5002,7 @@ def _kernel_phase_0_to_4(
                         t_entry = _read_globaltimer_u64()
                         _st_global_u64(
                             region_timing_ptr
-                            + Int64(cta_id) * Int64(11 * 2 * 8)
+                            + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE)
                             + Int64(8 * 2 * 8)              # region 8
                             + Int64(0 * 8),                  # slot 0 = entry
                             t_entry,
@@ -4873,7 +5115,7 @@ def _kernel_phase_0_to_4(
                         t_exit = _read_globaltimer_u64()
                         _st_global_u64(
                             region_timing_ptr
-                            + Int64(cta_id) * Int64(11 * 2 * 8)
+                            + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE)
                             + Int64(8 * 2 * 8)              # region 8
                             + Int64(1 * 8),                  # slot 1 = exit
                             t_exit,
@@ -4892,7 +5134,7 @@ def _kernel_phase_0_to_4(
                         t_entry = _read_globaltimer_u64()
                         _st_global_u64(
                             region_timing_ptr
-                            + Int64(cta_id) * Int64(11 * 2 * 8)
+                            + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE)
                             + Int64(9 * 2 * 8)              # region 9
                             + Int64(0 * 8),                  # slot 0 = entry
                             t_entry,
@@ -5092,7 +5334,7 @@ def _kernel_phase_0_to_4(
                     t_exit = _read_globaltimer_u64()
                     _st_global_u64(
                         region_timing_ptr
-                        + Int64(cta_id) * Int64(11 * 2 * 8)
+                        + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE)
                         + Int64(9 * 2 * 8)              # region 9
                         + Int64(1 * 8),                  # slot 1 = exit
                         t_exit,
@@ -5109,7 +5351,7 @@ def _kernel_phase_0_to_4(
                     t_entry = _read_globaltimer_u64()
                     _st_global_u64(
                         region_timing_ptr
-                        + Int64(cta_id) * Int64(11 * 2 * 8)
+                        + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE)
                         + Int64(10 * 2 * 8)             # region 10
                         + Int64(0 * 8),                  # slot 0 = entry
                         t_entry,
@@ -5208,7 +5450,7 @@ def _kernel_phase_0_to_4(
                     t_exit = _read_globaltimer_u64()
                     _st_global_u64(
                         region_timing_ptr
-                        + Int64(cta_id) * Int64(11 * 2 * 8)
+                        + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE)
                         + Int64(10 * 2 * 8)             # region 10
                         + Int64(1 * 8),                  # slot 1 = exit
                         t_exit,
diff --git a/vllm/v1/attention/backends/cute_paged/region_timing.py b/vllm/v1/attention/backends/cute_paged/region_timing.py
index b556732b07f4..a956431f50a6 100644
--- a/vllm/v1/attention/backends/cute_paged/region_timing.py
+++ b/vllm/v1/attention/backends/cute_paged/region_timing.py
@@ -44,10 +44,13 @@
     "phase3_3b_quant",
     "phase3_3c_fc2_atomic",
     "phase3_3d_arrival",
+    "phase1_pre_wo_wait",      # NEW R11: bx>0 W_O CTAs wait for attn producers
+    "phase1_gather_reduce",    # NEW R12: last-CTA gather of total_wo_slots partials
 ]
 PHASE0_REGIONS = {0}                        # single CTA per seq
 PHASE1_REGIONS = {1, 2, 3}                  # 4 CTAs per seq (bx==0, by<4)
-WAIT_NOT_WORK_REGIONS = {4}                 # all CTAs but reported separately
+WAIT_NOT_WORK_REGIONS = {4, 11}             # R4 grid barrier + R11 pre-W_O wait
+DYNAMIC_SINGLE_CTA_REGIONS = {12}           # R12 elected gather/reduce
 PHASE3_REGIONS = {5, 6, 7, 8, 9, 10}        # all CTAs
 
 
@@ -69,12 +72,31 @@ def _phase1_cta_ids(slice_ctas: int, num_k_tiles: int, num_seqs: int) -> np.ndar
     return np.array(out, dtype=np.int64)
 
 
+def _phase1_wo_split_cta_ids(
+    slice_ctas: int,
+    num_k_tiles: int,
+    num_seqs: int,
+    wo_split: int,
+    num_kv_heads: int,
+) -> np.ndarray:
+    """W_O active CTAs with K-parallel split: bx<wo_split && by<num_kv_heads.
+    Each CTA's id = bz * (slice_ctas * num_k_tiles) + by * slice_ctas + bx.
+    """
+    out: list[int] = []
+    for s in range(num_seqs):
+        base = s * slice_ctas * num_k_tiles
+        for by in range(min(num_kv_heads, num_k_tiles)):
+            for bx in range(min(wo_split, slice_ctas)):
+                out.append(base + by * slice_ctas + bx)
+    return np.array(out, dtype=np.int64)
+
+
 @dataclass
 class RegionRow:
     region_id: int
     region: str
     n_active_ctas: int
-    cta_class: str             # "phase0" | "phase1" | "phase3" | "barrier_wait"
+    cta_class: str             # "phase0" | "phase1" | "phase3" | "barrier_wait" | "dynamic_single"
     tick_source: str           # "globaltimer" | "clock64"
     mean_ticks: float
     median_ticks: float
@@ -92,14 +114,21 @@ def reduce_region_timings(
     num_seqs: int,
     tick_source: str,                       # "globaltimer" | "clock64"
     nsys_total_us: Optional[float] = None,
+    wo_split: int = 1,
+    num_kv_heads: int = 0,
 ) -> pd.DataFrame:
-    """Reduce a (num_ctas, 11, 2) tick buffer to per-region rows.
+    """Reduce a (num_ctas, 13, 2) tick buffer to per-region rows.
 
     Active-CTA masks are derived from (slice_ctas, num_k_tiles, num_seqs)
     so callers do NOT pass a "num_attn_active_ctas" count — that count
     is wrong for Phase 0 (1 CTA/seq) vs Phase 1 (4 CTAs/seq) which the
     earlier draft conflated as "32".
 
+    When wo_split > 1, regions {2, 3, 11} are masked using the
+    K-parallel W_O active-CTA layout (bx<wo_split && by<num_kv_heads)
+    via _phase1_wo_split_cta_ids; R12 keeps the all-CTA mask. wo_split=1
+    uses the legacy _phase1_cta_ids mask for full backward compatibility.
+
     median_us is reported ONLY when nsys_total_us is provided AND
     tick_source is globaltimer. With clock64, dynamic-clock effects make
     cycle→μs conversion unreliable, so the column stays NaN — caller
@@ -126,6 +155,17 @@ def reduce_region_timings(
     p0_ids = _phase0_cta_ids(slice_ctas, num_k_tiles, num_seqs)
     p1_ids = _phase1_cta_ids(slice_ctas, num_k_tiles, num_seqs)
     all_ids = np.arange(num_ctas, dtype=np.int64)
+    # K-parallel W_O mask: only valid when wo_split > 1 AND caller
+    # supplied num_kv_heads. Used for R2/R3/R11; R12 stays on all_ids.
+    if wo_split > 1:
+        assert num_kv_heads > 0, (
+            "wo_split>1 requires num_kv_heads>0 for the K-parallel mask"
+        )
+        wo_split_ids = _phase1_wo_split_cta_ids(
+            slice_ctas, num_k_tiles, num_seqs, wo_split, num_kv_heads,
+        )
+    else:
+        wo_split_ids = None
 
     rows: list[RegionRow] = []
     for r in range(num_regions):
@@ -134,11 +174,29 @@ def reduce_region_timings(
             active_ids = p0_ids
             cta_class = "phase0"
         elif r in PHASE1_REGIONS:
-            active_ids = p1_ids
+            # When wo_split>1, R2/R3 are the W_O GEMV/post regions and
+            # use the K-parallel mask. R1 (phase1_attn_pre_wo) is still
+            # the bx==0 && by<4 set so it stays on p1_ids.
+            if wo_split_ids is not None and r in (2, 3):
+                active_ids = wo_split_ids
+            else:
+                active_ids = p1_ids
             cta_class = "phase1"
         elif r in WAIT_NOT_WORK_REGIONS:
-            active_ids = all_ids
+            # R11 (phase1_pre_wo_wait) uses the K-parallel mask when
+            # wo_split>1 — it's the consumer wait for bx>0 W_O CTAs.
+            # R4 (grid_barrier_wait) stays on all_ids.
+            if wo_split_ids is not None and r == 11:
+                active_ids = wo_split_ids
+            else:
+                active_ids = all_ids
             cta_class = "barrier_wait"
+        elif r in DYNAMIC_SINGLE_CTA_REGIONS:
+            # R12 (phase1_gather_reduce) is the elected single-CTA
+            # gather. Even with wo_split>1 only one CTA writes a tick;
+            # nonzero filter handles it. Mask is all_ids.
+            active_ids = all_ids
+            cta_class = "dynamic_single"
         else:
             active_ids = all_ids
             cta_class = "phase3"
@@ -170,7 +228,7 @@ def reduce_region_timings(
         # source. For globaltimer that's *1000 (ns/μs); for clock64 we
         # cannot convert, so frac is reported as NaN unless caller
         # passes a clock64-calibrated total (not in the v1 reducer API).
-        if r in WAIT_NOT_WORK_REGIONS:
+        if r in WAIT_NOT_WORK_REGIONS or r in DYNAMIC_SINGLE_CTA_REGIONS:
             frac = float("nan")
         elif nsys_total_us is None:
             frac = float("nan")