diff --git a/README.md b/README.md
index 9c98818c3429..12b90ba4501f 100644
--- a/README.md
+++ b/README.md
@@ -48,7 +48,7 @@ Also on Docker Hub: `docker.io/naviailab/nvllm:latest`
 - `~/.cache/flashinfer` — FlashInfer JIT kernels
 - `~/.cache/vllm_compile` → `/root/.cache/vllm/torch_compile_cache` — CUDA graph cache
 
-**For gated models** (e.g., Gemma 4): pass `-e HF_TOKEN=hf_...` or mount a token file.
+**For gated models** (e.g., Gemma 4): pass `HF_TOKEN` via env or mount a credentials file.
 
 ### Prerequisites
 - NVIDIA DGX Spark (GB10) or GH200
@@ -88,8 +88,8 @@ curl http://localhost:8000/v1/chat/completions \
 
 | Script | Model | Status | Context |
 |--------|-------|--------|---------|
-| `serve.sh` | [Qwen3.5-27B-NVFP4-Opus-GB10](https://huggingface.co/natfii/Qwen3.5-27B-NVFP4-Opus-GB10) | Active (default) | 64K |
-| `serve-cute.sh` | [Qwen3.5-27B with CuTe Paged Attention](https://huggingface.co/ig1/Qwen3.5-35B-A3B-NVFP4) | Active (kernel dev) | 64K |
+| `serve.sh` | [Qwen3.5-27B-NVFP4](https://huggingface.co/ig1/Qwen3.5-27B-NVFP4) | Active (default) | 64K |
+| `serve-cute.sh` | [Qwen3.5-27B-NVFP4](https://huggingface.co/ig1/Qwen3.5-27B-NVFP4) (CuTe Paged Attention; override `HF_MODEL` env) | Active (kernel dev) | 64K |
 | `serve-nemotron.sh` | Nemotron-3-Super-120B-A12B-NVFP4 | Not Ready | 128K |
 | `serve-gemma4.sh` | Gemma 4 31B IT NVFP4 | Degraded (see script) | 32K |
 
@@ -104,6 +104,7 @@ curl http://localhost:8000/v1/chat/completions \
 
 **Now — Qwen3.5-27B kernel work**
 - CuTe DSL paged attention uber-kernel (fused attention + W_O GEMV + RMSNorm)
+- `CUTE_WO_SPLIT=8` opt-in K-parallel W_O GEMV prototype (default off; opt-in via env on `serve-cute.sh`). Region cluster (R2+R4+R11+R12) shrinks 6.49× at synthetic load; ~5% per-token speedup at GSM8K workload. See [evidence summary](benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/summary.md).
 - CUDA graph support (FULL_AND_PIECEWISE mode)
 - End-to-end fusion validation through Qwen3NextAttention
 
@@ -128,15 +129,17 @@ Benchmarked on Qwen3.5-27B-NVFP4 (rate=8, max-num-seqs=4):
 | TPOT p50 | 89.2 ms | 80.0 ms | **-10.2%** |
 | TPOT p99 | 91.7 ms | 82.7 ms | **-9.8%** |
 
-> **Warning:** Large models (>75 GB) that leave minimal memory headroom on the GB10's 128 GB unified memory may crash during CUDA graph capture with the stream-K kernel. Use `--debug` (eager mode) to test first, or use a smaller model.
+[Trace](benchmarks/nvllm/traces/gemm_stream_k_cudagraph/2026-04-21/) — committed `streamk_graphs.pt.trace.json.gz` + per-kernel CSVs.
+
+> **Warning:** Large models (>75 GB) that leave minimal memory headroom on the GB10's 128 GB unified memory may crash during CUDA graph capture with the stream-K kernel. Use a smaller model to test first.
 
 ### CuTe Paged Attention Backend (Prototype)
 
 Custom paged attention backend using CuTe Python DSL, targeting SM120/SM121 FP8 MMA instructions. Registered as `CUTE_PAGED` in vLLM's attention backend registry.
 
-**Status:** Backend interface validated end-to-end. PyTorch prototype serves live inference. CuTe DSL kernel replacement in progress.
+**Status:** Experimental CuTe DSL backend; production decode path since v0.3.0. β-coop fused kernel (attention + W_O + RMSNorm + MLP) is the default. Opt-in `CUTE_WO_SPLIT=8` K-parallel W_O GEMV prototype lands the W_O bottleneck reduction (5.99× R2, 8.68× R4 grid-barrier; see [evidence summary](benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/summary.md)).
 
-Launch with: `./scripts/serve-cute.sh --debug`
+Launch with: `./scripts/serve-cute.sh` (default PIECEWISE CUDA graphs). Avoid `--debug` (eager mode) on SM120 — it produces gibberish unrelated to kernel correctness; PIECEWISE is the validation path.
 
 ## Acknowledgments
 
@@ -147,5 +150,5 @@ Launch with: `./scripts/serve-cute.sh --debug`
 - **[CUTLASS PR #3030](https://github.com/NVIDIA/cutlass/pull/3030)** by blake-snc (Second Nature Computing) — SM120 Flash Attention v2 reference for fused multi-head attention on Blackwell.
   - [`docs/kernel-insights/2026-04-10-cutlass-pr3030-sm120-fmha.md`](docs/kernel-insights/2026-04-10-cutlass-pr3030-sm120-fmha.md) — SM120 FMHA patterns and tile configs
 - **[CUTLASS](https://github.com/NVIDIA/cutlass)** by NVIDIA — CuTe Python DSL for SM120 kernel development. The FP4 decode GEMM kernel with stream-K scheduling is adapted from CUTLASS test kernels.
-- **[Simon Veitner's CuTe DSL / NVFP4 blog](https://veitner.bearblog.dev/blog/)** — Reference reading for NVFP4 GEMV K-parallel reduction patterns identified as the next β-coop optimization direction (see [phaseE-tax bench](benchmarks/nvllm/traces/cute_paged_attn/2026-05-02-phaseE-tax-3leg/summary.md)).
+- **[Simon Veitner's CuTe DSL / NVFP4 blog](https://veitner.bearblog.dev/blog/)** — Reference reading for NVFP4 GEMV K-parallel reduction patterns. Applied to W_O GEMV in the `CUTE_WO_SPLIT=8` opt-in prototype (see [wo_split=8 evidence summary](benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/summary.md)).
 - **[vLLM](https://github.com/vllm-project/vllm)** — The upstream project this fork is based on.
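Reviewer note: the GSM8K result files committed in this change share one schema (`label`, `n`, `correct`, `errors`, and a `results` list whose entries carry `status` and `elapsed`). A minimal sketch of how such a file could be tallied is below. The helper names `summarize` and `load_and_summarize` are hypothetical and not part of the repository, and the three sample records are copied from `baseline_gsm8k.json` (entries 0, 16, and 17). Entries with status `ERROR` are excluded from the latency mean on the assumption that they only reflect the client's 180 s read timeout.

```python
import json
from collections import Counter
from statistics import mean


def summarize(report: dict) -> dict:
    """Tally per-question statuses and timing for one GSM8K result file."""
    results = report["results"]
    statuses = Counter(r["status"] for r in results)
    # Assumption: ERROR entries are client read timeouts, so their
    # elapsed time is the timeout value rather than a real latency.
    answered = [r["elapsed"] for r in results if r["status"] != "ERROR"]
    return {
        "label": report.get("label", ""),
        "n": len(results),
        "statuses": dict(statuses),
        "mean_elapsed_answered": mean(answered) if answered else None,
    }


def load_and_summarize(path: str) -> dict:
    """Read a result file from disk and summarize it (hypothetical helper)."""
    with open(path, encoding="utf-8") as fh:
        return summarize(json.load(fh))


# Three records copied from baseline_gsm8k.json, trimmed to the fields used here.
sample = {
    "label": "task10_wo_split_1_baseline",
    "results": [
        {"i": 0, "status": "ERROR", "elapsed": 180.1},
        {"i": 16, "status": "OK", "elapsed": 176.7},
        {"i": 17, "status": "OK", "elapsed": 75.8},
    ],
}

summary = summarize(sample)
print(summary["statuses"], summary["mean_elapsed_answered"])
```

Running the same tally over the baseline and full-think files would make the 46 timed-out requests in the baseline run visible next to its 4/50 accuracy figure, which may matter when reading that number as a quality signal.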
diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline.nsys-rep b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline.nsys-rep new file mode 100644 index 000000000000..316cfb72ac97 Binary files /dev/null and b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline.nsys-rep differ diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline.sqlite b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline.sqlite new file mode 100644 index 000000000000..f8268d45d770 Binary files /dev/null and b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline.sqlite differ diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_gsm8k.json b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_gsm8k.json new file mode 100644 index 000000000000..ef39e44bedc6 --- /dev/null +++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_gsm8k.json @@ -0,0 +1,463 @@ +{ + "label": "task10_wo_split_1_baseline", + "model": "default", + "api": "http://localhost:8000/v1", + "n": 50, + "seed": 42, + "correct": 4, + "errors": 46, + "accuracy": "4/50 (8.0%)", + "total_seconds": 8781.8, + "results": [ + { + "i": 0, + "expected": "2280", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "The girls are trying to raise money for a carnival. Kim raises $320 more than Al..." + }, + { + "i": 1, + "expected": "1", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Kalinda is working on a 360 piece puzzle with her mom. Kalinda can normally add ..." 
+ }, + { + "i": 2, + "expected": "5", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Tom's ship can travel at 10 miles per hour. He is sailing from 1 to 4 PM. He t..." + }, + { + "i": 3, + "expected": "12", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "James decides to buy birthday candles for his 2 sons. One of them is 12 and the..." + }, + { + "i": 4, + "expected": "273", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Mariah\u2019s grandma was teaching her to knit. Mariah used 1/4 of a skein of yarn. H..." + }, + { + "i": 5, + "expected": "45", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "While playing with her friends in their school playground, Katelyn saw 50 fairie..." + }, + { + "i": 6, + "expected": "21", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "If Ann is 9 years old and her brother is twice her age, how old will her brother..." + }, + { + "i": 7, + "expected": "145", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Twenty dozen cups cost $1200 less than the total cost of half a dozen plates sol..." + }, + { + "i": 8, + "expected": "60", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. 
(read timeout=180)", + "question": "Last year there were 50 students enrolled in a calligraphy class. This year, the..." + }, + { + "i": 9, + "expected": "122", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Rani has ten more crabs than Monic, who has 4 fewer crabs than Bo. If Bo has 40 ..." + }, + { + "i": 10, + "expected": "29", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Brian's basement was damp and musty, so he bought a dehumidifier to remove moist..." + }, + { + "i": 11, + "expected": "80", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "George, a grade six teacher, ordered 600 burritos for the sixth-grade class picn..." + }, + { + "i": 12, + "expected": "36", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Jim spends 2 hours watching TV and then decides to go to bed and reads for half ..." + }, + { + "i": 13, + "expected": "1430", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Janet buys a brooch for her daughter. She pays $500 for the material to make it..." + }, + { + "i": 14, + "expected": "5", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Ducks need to eat 3.5 pounds of insects each week to survive. If there is a floc..." 
+ }, + { + "i": 15, + "expected": "5", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Joey has 214 points before his turn in Scrabble. He scores 26 points. Then Marcy..." + }, + { + "i": 16, + "expected": "5", + "got": "5", + "status": "OK", + "elapsed": 176.7, + "raw_tail": "y books each child received:**\n Sarah divides these books equally among her 4 kids.\n $$20 \\text{ books} / 4 \\text{ kids} = 5 \\text{ books per kid}$$\n\n**Answer:** Each child received **5** books.", + "question": "Sarah went to buy books from the store and spent $300 on the books. If each book..." + }, + { + "i": 17, + "expected": "66", + "got": "66", + "status": "OK", + "elapsed": 75.8, + "raw_tail": "ook = $48\n 3 pencils \u00d7 $6/pencil = $1...\n\n\n$18\n\nTed spends **$48** on books ($16 \u00d7 3) and **$18** on pencils ($6 \u00d7 3).\n\nAdding these together gives the total amount spent:\n$48 + $18 = **$66**", + "question": "Ted starts with $200. He buys 3 books for 16 dollars each and 3 pencils for 6 do..." + }, + { + "i": 18, + "expected": "15", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Eve had 20 pieces of pomelos. After giving her friend some pomelos, Eve is left ..." + }, + { + "i": 19, + "expected": "40", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "The Doubtfire sisters are driving home with 7 kittens adopted from the local ani..." + }, + { + "i": 20, + "expected": "93", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Jen got 3 fish. 
They each need $1 worth of food a day. How much does she spend..." + }, + { + "i": 21, + "expected": "2000", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Elliott is trying to walk 10,000 steps a day. He finished half of his steps on h..." + }, + { + "i": 22, + "expected": "1520", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "While working at the restaurant, each of the forty customers who came into the r..." + }, + { + "i": 23, + "expected": "11050", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "On a certain day, the total cost of filling up 20 helium balloons was $900. Two ..." + }, + { + "i": 24, + "expected": "90", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "To raise money for their class fund, each of the 30 students from one class sold..." + }, + { + "i": 25, + "expected": "40000", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "John decides to get a loan by mortgaging his home. His house is worth $250,000...." + }, + { + "i": 26, + "expected": "21", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Liam and Mitchell own competing lemonade stands across the street from one anoth..." 
+ }, + { + "i": 27, + "expected": "18", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Melanie is a door-to-door saleswoman. She sold a third of her vacuum cleaners at..." + }, + { + "i": 28, + "expected": "14", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Some people got on a bus at the terminal. At the first bus stop, 5 more people g..." + }, + { + "i": 29, + "expected": "23", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "A bumper car rink has 12 red cars. They have 2 fewer green cars than they have r..." + }, + { + "i": 30, + "expected": "145", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Students in class 3B are collecting school points for behavior. If they get enou..." + }, + { + "i": 31, + "expected": "123", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "If a classroom has 3 times as many girls as they do boys, and 1/10 as many nonge..." + }, + { + "i": 32, + "expected": "98", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "The area of Billie's rectangular bedroom is 360 square feet. If the length of hi..." + }, + { + "i": 33, + "expected": "7", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. 
(read timeout=180)", + "question": "Manolo bought five lollipops and four candies that cost $3.20. If each lollipop ..." + }, + { + "i": 34, + "expected": "34", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Three friends spent $20.25 on 3 tickets to the fair. They also spent $4.50 less ..." + }, + { + "i": 35, + "expected": "38", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "At the local Pick Your Own fruit orchard, you could pick your own peaches for $2..." + }, + { + "i": 36, + "expected": "320", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Elaine initially had 20 Pokemon cards. After a month, she collected three times ..." + }, + { + "i": 37, + "expected": "50", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "In a student council election, candidate A got 20% of the votes while candidate ..." + }, + { + "i": 38, + "expected": "50", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Gerald and Julia divided $100 in the ratio 3:2. If Gerald spent $10 on a book, h..." + }, + { + "i": 39, + "expected": "84", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Madeline ate 6 grapes. Her brother wanted to make grape juice and used up 5 time..." 
+ }, + { + "i": 40, + "expected": "50", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Elvis has a monthly saving target of $1125. In April, he wants to save twice as ..." + }, + { + "i": 41, + "expected": "8000", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Marilyn's first record sold 10 times as many copies as Harald's. If they sold 88..." + }, + { + "i": 42, + "expected": "280", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "For the children's day celebration, children are to perform mass drills in the p..." + }, + { + "i": 43, + "expected": "30", + "got": "30", + "status": "OK", + "elapsed": 92.0, + "raw_tail": " Age**: Martha is 4 years younger than Mike.\n $$24 - 4 = 20$$\n So, Martha is **20** years old.\n3. **Jeff's Age**: Jeff is 10 years older than Martha.\n $$20 + 10 = 30$$\n\nJeff is **30** years old.", + "question": "Jeff is 10 years older than his younger sister, Martha. Martha, on the other han..." + }, + { + "i": 44, + "expected": "192", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Ten stalls have 20 cows each. Mr. Sylas buys 40 cows and divides them equally, p..." + }, + { + "i": 45, + "expected": "276", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Henry needs to assemble some toys, specifically 57 cars and 73 motorcycles. Henr..." 
+ }, + { + "i": 46, + "expected": "32", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Anakin and Locsin went to the beach today. Anakin caught 10 starfish, 6 sea hors..." + }, + { + "i": 47, + "expected": "25", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Nik has 200 crayons. He wants to separate them into groups of 8 and put them int..." + }, + { + "i": 48, + "expected": "10", + "got": "10", + "status": "OK", + "elapsed": 154.6, + "raw_tail": " $8 + (0.50 \\times 8) = 12$). However, because this is too heavy, he switches to a weight that is two pounds lighter than this new amount ($12 - 2$).\n\nTherefore, Jamaal now uses a **10**-pound weight.", + "question": "Jamaal is at the gym. He has been using an 8-pound weight. He increases the weig..." + }, + { + "i": 49, + "expected": "84", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "In a family, there are 2 brothers and 3 sisters. All sisters are the same age, w..." 
+ } + ] +} \ No newline at end of file diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_gsm8k_fullthink.json b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_gsm8k_fullthink.json new file mode 100644 index 000000000000..5109d06c72b0 --- /dev/null +++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_gsm8k_fullthink.json @@ -0,0 +1,463 @@ +{ + "label": "task10_wo_split_1_baseline_fullthink", + "model": "default", + "api": "http://localhost:8000/v1", + "n": 50, + "seed": 42, + "correct": 48, + "errors": 0, + "accuracy": "48/50 (96.0%)", + "total_seconds": 3760.3, + "results": [ + { + "i": 0, + "expected": "2280", + "got": "2180", + "status": "WRONG", + "elapsed": 119.2, + "raw_tail": "Sarah raised $300\n- Maryam raised $400 more than Sarah, so Maryam raised $300 + $400 = $700\n\nNow I need to add up all the amounts:\n$430 + $750 + $300 + $700 = $2180\n\nThe total amount raised is $2180.\n", + "question": "The girls are trying to raise money for a carnival. Kim raises $320 more than Al..." + }, + { + "i": 1, + "expected": "1", + "got": "1", + "status": "OK", + "elapsed": 50.6, + "raw_tail": "2 = 2 pieces per minute.\n3. Combined rate: 4 + 2 = 6 pieces per minute.\n4. Total pieces: 360.\n5. Time in minutes: 360 / 6 = 60 minutes.\n6. Convert to hours: 60 minutes = 1 hour.\nThe answer is 1 hour.\n", + "question": "Kalinda is working on a 360 piece puzzle with her mom. Kalinda can normally add ..." + }, + { + "i": 2, + "expected": "5", + "got": "5", + "status": "OK", + "elapsed": 51.8, + "raw_tail": " 3 = 30 miles.\n2. Now I need to find out how long it takes to travel back 30 miles at 6 mph. Time = Distance / Speed = 30 / 6 = 5 hours.\n3. So it takes Tom 5 hours to get back.\nThe answer is 5 hours.\n", + "question": "Tom's ship can travel at 10 miles per hour. He is sailing from 1 to 4 PM. He t..." 
+ }, + { + "i": 3, + "expected": "12", + "got": "12", + "status": "OK", + "elapsed": 85.1, + "raw_tail": ". Since each pack contains 5 candles, he needs 20 / 5 = 4 packs.\nFinally, I need to calculate the total cost. Each pack costs $3, so 4 packs cost 4 * $3 = $12.\nTherefore, James spends $12 on candles.\n", + "question": "James decides to buy birthday candles for his 2 sons. One of them is 12 and the..." + }, + { + "i": 4, + "expected": "273", + "got": "273", + "status": "OK", + "elapsed": 76.6, + "raw_tail": " I need to find the total yards of yarn they used altogether. I add Mariah's usage and her grandma's usage: 91 + 182 = 273 yards.\nTherefore, they used 273 yards of yarn altogether.\nThe answer is 273.\n", + "question": "Mariah\u2019s grandma was teaching her to knit. Mariah used 1/4 of a skein of yarn. H..." + }, + { + "i": 5, + "expected": "45", + "got": "45", + "status": "OK", + "elapsed": 53.8, + "raw_tail": ". Half of 50 is 25.\nSo, 25 fairies joined the original 50 fairies.\n50 + 25 = 75 fairies.\nThen, 30 fairies flew away.\n75 - 30 = 45 fairies.\nTherefore, there are 45 fairies remaining.\nThe answer is 45.\n", + "question": "While playing with her friends in their school playground, Katelyn saw 50 fairie..." + }, + { + "i": 6, + "expected": "21", + "got": "21", + "status": "OK", + "elapsed": 46.4, + "raw_tail": "her is twice her age, so he is 9 * 2 = 18 years old.\nNow, I need to find out how old he will be in 3 years.\n18 + 3 = 21 years old.\nSo, Ann's brother will be 21 years old in 3 years.\nThe answer is 21.\n", + "question": "If Ann is 9 years old and her brother is twice her age, how old will her brother..." + }, + { + "i": 7, + "expected": "145", + "got": "145", + "status": "OK", + "elapsed": 77.2, + "raw_tail": "otal cost of the plates, so $36,000 - $1200 = $34,800.\nNow, I need to find the cost per cup. 
Twenty dozen cups is 20 \u00d7 12 = 240 cups.\nSo, the cost per cup is $34,800 \u00f7 240 = $145.\nThe answer is $145.\n", + "question": "Twenty dozen cups cost $1200 less than the total cost of half a dozen plates sol..." + }, + { + "i": 8, + "expected": "60", + "got": "60", + "status": "OK", + "elapsed": 40.5, + "raw_tail": "age: 20%\n3. Calculate the increase: 50 \u00d7 0.20 = 10 students\n4. Add the increase to last year's enrollment: 50 + 10 = 60 students\n5. Therefore, this year's enrollment is 60 students.\nThe answer is 60.\n", + "question": "Last year there were 50 students enrolled in a calligraphy class. This year, the..." + }, + { + "i": 9, + "expected": "122", + "got": "122", + "status": "OK", + "elapsed": 60.3, + "raw_tail": " has 40 - 4 = 36 crabs\n- Rani has 10 more crabs than Monic, so Rani has 36 + 10 = 46 crabs\n\nNow I need to calculate the total:\nTotal = Bo + Monic + Rani = 40 + 36 + 46 = 122 crabs\n\nThe answer is 122.\n", + "question": "Rani has ten more crabs than Monic, who has 4 fewer crabs than Bo. If Bo has 40 ..." + }, + { + "i": 10, + "expected": "29", + "got": "29", + "status": "OK", + "elapsed": 187.5, + "raw_tail": "**\n * Sum of all periods: $3 \\text{ liters} + 6 \\text{ liters} + 20 \\text{ liters}$.\n * Calculation: $3 + 6 = 9$; $9 + 20 = 29$.\n\nThe total amount of water removed is 29 liters.\n\n29 liters.\n", + "question": "Brian's basement was damp and musty, so he bought a dehumidifier to remove moist..." + }, + { + "i": 11, + "expected": "80", + "got": "80", + "status": "OK", + "elapsed": 92.3, + "raw_tail": "ver burritos.\nLeftover burritos = Total burritos ordered - Total burritos consumed\nLeftover burritos = 600 - 520 = 80 burritos.\nTherefore, the total number of leftover burritos from the picnic is 80.\n", + "question": "George, a grade six teacher, ordered 600 burritos for the sixth-grade class picn..." 
+ }, + { + "i": 12, + "expected": "36", + "got": "36", + "status": "OK", + "elapsed": 52.6, + "raw_tail": "1 hour.\n3. Total time per session (TV + reading) = 2 + 1 = 3 hours.\n4. He does this 3 times a week, so weekly time = 3 * 3 = 9 hours.\n5. Over 4 weeks, total time = 9 * 4 = 36 hours.\nThe answer is 36.\n", + "question": "Jim spends 2 hours watching TV and then decides to go to bed and reads for half ..." + }, + { + "i": 13, + "expected": "1430", + "got": "1430", + "status": "OK", + "elapsed": 74.4, + "raw_tail": "0% of $1300 = 0.10 \u00d7 $1300 = $130\n\nFinally, I need to calculate the total amount Janet paid.\nTotal amount paid: $1300 + $130 = $1430\n\nTherefore, Janet paid $1430 for the brooch.\n\nThe answer is: $1430\n", + "question": "Janet buys a brooch for her daughter. She pays $500 for the material to make it..." + }, + { + "i": 14, + "expected": "5", + "got": "5", + "status": "OK", + "elapsed": 63.3, + "raw_tail": "t this to a daily amount.\n5. There are 7 days in a week, so I divide the weekly amount by 7.\n6. 35 \u00f7 7 = 5 pounds per day.\n7. Therefore, the flock needs 5 pounds of insects per day.\n\nThe answer is 5.\n", + "question": "Ducks need to eat 3.5 pounds of insects each week to survive. If there is a floc..." + }, + { + "i": 15, + "expected": "5", + "got": "5", + "status": "OK", + "elapsed": 44.0, + "raw_tail": "nts.\nNext, I need to calculate Marcy's new score: 225 + 10 = 235 points.\nNow I need to find the difference between their scores: 240 - 235 = 5 points.\nSo Joey is winning by 5 points.\nThe answer is 5.\n", + "question": "Joey has 214 points before his turn in Scrabble. He scores 26 points. Then Marcy..." 
+ }, + { + "i": 16, + "expected": "5", + "got": "5", + "status": "OK", + "elapsed": 45.7, + "raw_tail": " costs $15.\nSo, total books = $300 / $15 = 20 books.\nNext, she gave an equal number of books to her 4 kids.\nSo, books per child = 20 / 4 = 5 books.\nTherefore, each child got 5 books.\nThe answer is 5.\n", + "question": "Sarah went to buy books from the store and spent $300 on the books. If each book..." + }, + { + "i": 17, + "expected": "66", + "got": "66", + "status": "OK", + "elapsed": 41.8, + "raw_tail": " books is 3 * 16 = $48.\nHe also buys 3 pencils at $6 each, so the cost for pencils is 3 * 6 = $18.\nThe total amount spent is the sum of the cost of books and pencils: 48 + 18 = $66.\nThe answer is 66.\n", + "question": "Ted starts with $200. He buys 3 books for 16 dollars each and 3 pencils for 6 do..." + }, + { + "i": 18, + "expected": "15", + "got": "15", + "status": "OK", + "elapsed": 42.7, + "raw_tail": "4 of the original amount left.\n3. Calculate how many pomelos she has left: 20 \u00d7 1/4 = 5 pomelos.\n4. Calculate how many pomelos she gave away: 20 - 5 = 15 pomelos.\nTherefore, Eve gave away 15 pomelos.\n", + "question": "Eve had 20 pieces of pomelos. After giving her friend some pomelos, Eve is left ..." + }, + { + "i": 19, + "expected": "40", + "got": "40", + "status": "OK", + "elapsed": 39.1, + "raw_tail": "opted kittens.\n2. Patchy has had thrice the number of adopted kittens, which is 3 * 7 = 21 kittens.\n3. Trixie has had 12 kittens.\n4. The total number of kittens is 7 + 21 + 12 = 40.\nThe answer is 40.\n", + "question": "The Doubtfire sisters are driving home with 7 kittens adopted from the local ani..." + }, + { + "i": 20, + "expected": "93", + "got": "93", + "status": "OK", + "elapsed": 51.6, + "raw_tail": "$3 per day\n2. Next, I need to find out how many days are in May.\n - May has 31 days\n3. 
Finally, I need to calculate the total cost for the month.\n - $3 per day \u00d7 31 days = $93\n\nThe answer is $93.\n", + "question": "Jen got 3 fish. They each need $1 worth of food a day. How much does she spend..." + }, + { + "i": 21, + "expected": "2000", + "got": "2000", + "status": "OK", + "elapsed": 101.4, + "raw_tail": "ol and 1,000 steps with his friend, which is 5,000 + 1,000 = 6,000 steps.\n7. The remaining steps must be from his jog: 8,000 - 6,000 = 2,000 steps.\nTherefore, Elliott took 2,000 steps during his jog.\n", + "question": "Elliott is trying to walk 10,000 steps a day. He finished half of his steps on h..." + }, + { + "i": 22, + "expected": "1520", + "got": "1520", + "status": "OK", + "elapsed": 38.9, + "raw_tail": "Rafaela's tips: 40 customers \u00d7 $20 = $800\nJulieta's tips: $800 - (10% of $800) = $800 - $80 = $720\nTotal tips: $800 + $720 = $1520\nThe answer is 1520.\n", + "question": "While working at the restaurant, each of the forty customers who came into the r..." + }, + { + "i": 23, + "expected": "11050", + "got": "11050", + "status": "OK", + "elapsed": 62.3, + "raw_tail": ".\nNew cost per balloon = $45 + $20 = $65\n\nBentley filled 170 balloons at the new price.\nTotal cost = 170 \u00d7 $65 = $11,050\n\nTherefore, Bentley paid $11,050 to fill 170 balloons.\n\nThe answer is: $11,050\n", + "question": "On a certain day, the total cost of filling up 20 helium balloons was $900. Two ..." + }, + { + "i": 24, + "expected": "90", + "got": "90", + "status": "OK", + "elapsed": 74.3, + "raw_tail": "he total is 30 * 10 = 300 lollipops.\nFinally, I need to find the total profit. The profit per lollipop is $0.3 and they sold 300 lollipops, so the total profit is 300 * $0.3 = $90.\nThe answer is $90.\n", + "question": "To raise money for their class fund, each of the 30 students from one class sold..." 
+ }, + { + "i": 25, + "expected": "40000", + "got": "40000", + "status": "OK", + "elapsed": 68.1, + "raw_tail": "d to calculate how much he used to pay off debts: 60% of $100,000 = 0.60 \u00d7 $100,000 = $60,000\n3. Finally, I need to calculate the leftover amount: $100,000 - $60,000 = $40,000\n\nThe answer is $40,000.\n", + "question": "John decides to get a loan by mortgaging his home. His house is worth $250,000...." + }, + { + "i": 26, + "expected": "21", + "got": "21", + "status": "OK", + "elapsed": 49.9, + "raw_tail": "eekend.\nTo find out how many more dollars Mitchell made than Liam, I need to subtract Liam's earnings from Mitchell's earnings.\n$84 - $63 = $21.\nSo Mitchell made $21 more than Liam.\nThe answer is 21.\n", + "question": "Liam and Mitchell own competing lemonade stands across the street from one anoth..." + }, + { + "i": 27, + "expected": "18", + "got": "18", + "status": "OK", + "elapsed": 134.3, + "raw_tail": "1/3 at green house: 18 * 1/3 = 6 sold, 12 left.\n- Sold 2 to red house: 12 - 2 = 10 left.\n- Sold half at orange house: 10 * 1/2 = 5 sold, 5 left.\n- Final count: 5 vacuum cleaners. \u2713\n\nThe answer is 18.\n", + "question": "Melanie is a door-to-door saleswoman. She sold a third of her vacuum cleaners at..." + }, + { + "i": 28, + "expected": "14", + "got": "14", + "status": "OK", + "elapsed": 69.6, + "raw_tail": "+ 5) - 7 + 8 = x + 6.\n4. We know that at the third stop, there were 20 people, so x + 6 = 20.\n5. Solving for x: x = 20 - 6 = 14.\nTherefore, 14 people got on the bus at the terminal.\nThe answer is 14.\n", + "question": "Some people got on a bus at the terminal. At the first bus stop, 5 more people g..." + }, + { + "i": 29, + "expected": "23", + "got": "23", + "status": "OK", + "elapsed": 41.7, + "raw_tail": "1. Red cars: 12\n2. Green cars: 12 - 2 = 10\n3. Blue cars: 3 * 10 = 30\n4. Total cars: 75\n5. Yellow cars: 75 - (12 + 10 + 30) = 75 - 52 = 23\n\nThe answer is 23.\n", + "question": "A bumper car rink has 12 red cars. 
They have 2 fewer green cars than they have r..." + }, + { + "i": 30, + "expected": "145", + "got": "145", + "status": "OK", + "elapsed": 78.9, + "raw_tail": " Marta has 3 * 35 = 105 points.\nTotal points collected by the class = 50 + 65 + 35 + 105 = 255 points.\nThe minimum threshold is 400 points.\nPoints missing = 400 - 255 = 145 points.\nThe answer is 145.\n", + "question": "Students in class 3B are collecting school points for behavior. If they get enou..." + }, + { + "i": 31, + "expected": "123", + "got": "123", + "status": "OK", + "elapsed": 96.0, + "raw_tail": " girls, and nongendered children together.\n * Total = Boys + Girls + Nongendered\n * Total = $30 + 90 + 3$\n * Total = $123$.\n\nThe classroom has a total of 123 children.\n\n**Answer:** 123\n", + "question": "If a classroom has 3 times as many girls as they do boys, and 1/10 as many nonge..." + }, + { + "i": 32, + "expected": "98", + "got": "98", + "status": "OK", + "elapsed": 78.8, + "raw_tail": "idth = 360 \u00f7 9 = 40 feet\n\nNow I can find the perimeter:\nPerimeter = 2 \u00d7 (Length + Width)\nPerimeter = 2 \u00d7 (9 + 40)\nPerimeter = 2 \u00d7 49\nPerimeter = 98 feet\n\nThe perimeter of Billie's bedroom is 98 feet.\n", + "question": "The area of Billie's rectangular bedroom is 360 square feet. If the length of hi..." + }, + { + "i": 33, + "expected": "7", + "got": "7", + "status": "OK", + "elapsed": 97.7, + "raw_tail": "ow I need to find the cost of 10 lollipops and 10 candies.\nCost of 10 lollipops = 10 * $0.40 = $4.00.\nCost of 10 candies = 10 * $0.30 = $3.00.\nTotal cost = $4.00 + $3.00 = $7.00.\nThe answer is $7.00.\n", + "question": "Manolo bought five lollipops and four candies that cost $3.20. If each lollipop ..." + }, + { + "i": 34, + "expected": "34", + "got": "34", + "status": "OK", + "elapsed": 127.0, + "raw_tail": ". Since they agreed to split all the costs evenly, I need to divide the total cost by the number of friends. 
So each friend paid $102 / 3 = $34.\n\nTherefore, each of them paid $34.\n\nThe answer is: $34\n", + "question": "Three friends spent $20.25 on 3 tickets to the fair. They also spent $4.50 less ..." + }, + { + "i": 35, + "expected": "38", + "got": "38", + "status": "OK", + "elapsed": 49.7, + "raw_tail": "aches: 6 pounds \u00d7 $2.00 per pound = $12.00\n- Plums: 8 pounds \u00d7 $1.00 per pound = $8.00\n- Apricots: 6 pounds \u00d7 $3.00 per pound = $18.00\n- Total: $12.00 + $8.00 + $18.00 = $38.00\n\nThe answer is $38.00.\n", + "question": "At the local Pick Your Own fruit orchard, you could pick your own peaches for $2..." + }, + { + "i": 36, + "expected": "320", + "got": "320", + "status": "OK", + "elapsed": 75.3, + "raw_tail": "(60 + 40) = 2 * 100 = 200 cards\n\nNow let me calculate the total:\nInitial + First month + Second month + Third month\n= 20 + 60 + 40 + 200\n= 320 cards\n\nTherefore, Elaine has 320 Pokemon cards in total.\n", + "question": "Elaine initially had 20 Pokemon cards. After a month, she collected three times ..." + }, + { + "i": 37, + "expected": "50", + "got": "50", + "status": "OK", + "elapsed": 90.4, + "raw_tail": "al number of votes is 100, and we know that candidate A got 20 votes and candidate B got 30 votes. So, candidate C got 100 - 20 - 30 = 50 votes.\nTherefore, candidate C got 50 votes.\nThe answer is 50.\n", + "question": "In a student council election, candidate A got 20% of the votes while candidate ..." + }, + { + "i": 38, + "expected": "50", + "got": "50", + "status": "OK", + "elapsed": 73.6, + "raw_tail": "'s share = 3 parts \u00d7 $20 = $60.\nJulia's share = 2 parts \u00d7 $20 = $40.\nNow, Gerald spent $10 on a book.\nSo, Gerald's remaining money = $60 - $10 = $50.\nTherefore, Gerald had $50 left.\nThe answer is 50.\n", + "question": "Gerald and Julia divided $100 in the ratio 3:2. If Gerald spent $10 on a book, h..." + }, + { + "i": 39, + "expected": "84", + "got": "84", + "status": "OK", + "elapsed": 50.0, + "raw_tail": "3. 
Their mother made 4 pies, and each pie requires 12 grapes: 4 * 12 = 48 grapes.\n4. Total grapes = Madeline's grapes + Brother's grapes + Mother's grapes = 6 + 30 + 48 = 84 grapes.\nThe answer is 84.\n", + "question": "Madeline ate 6 grapes. Her brother wanted to make grape juice and used up 5 time..." + }, + { + "i": 40, + "expected": "50", + "got": "50", + "status": "OK", + "elapsed": 115.4, + "raw_tail": "o he saves $25 per day in the first half, and $2x = $50 per day in the second half.\n8. The question asks for the amount he needs to save each day in the second half, which is $50.\n\nThe answer is $50.\n", + "question": "Elvis has a monthly saving target of $1125. In April, he wants to save twice as ..." + }, + { + "i": 41, + "expected": "8000", + "got": "8000", + "status": "OK", + "elapsed": 68.0, + "raw_tail": "lso know that M + H = 88,000 (combined sales).\nSubstituting M = 10H into the second equation:\n10H + H = 88,000\n11H = 88,000\nH = 88,000 / 11\nH = 8,000\nSo Harald sold 8,000 copies.\nThe answer is 8,000.\n", + "question": "Marilyn's first record sold 10 times as many copies as Harald's. If they sold 88..." + }, + { + "i": 42, + "expected": "280", + "got": "280", + "status": "OK", + "elapsed": 50.7, + "raw_tail": "ows \u00d7 8 children per row = 56 children per school\n3. Calculate the total number of children:\n - 5 schools \u00d7 56 children per school = 280 children\n\nTherefore, there can be 280 children in the drill.\n", + "question": "For the children's day celebration, children are to perform mass drills in the p..." + }, + { + "i": 43, + "expected": "30", + "got": "30", + "status": "OK", + "elapsed": 36.4, + "raw_tail": "is 24 years old.\n2. Martha is 4 years younger than Mike, so Martha is 24 - 4 = 20 years old.\n3. Jeff is 10 years older than Martha, so Jeff is 20 + 10 = 30 years old.\nTherefore, Jeff is 30 years old.\n", + "question": "Jeff is 10 years older than his younger sister, Martha. Martha, on the other han..." 
+ }, + { + "i": 44, + "expected": "192", + "got": "1", + "status": "WRONG", + "elapsed": 225.9, + "raw_tail": "er: 176 cows in 8 stalls.\n\nLet me double-check:\n- 20 stalls \u00d7 20 cows = 400 initial cows\n- 40 new cows \u00f7 20 stalls = 2 new cows per stall\n- Each stall now has 20 + 2 = 22 cows\n- 8 stalls \u00d7 22 cows = 1", + "question": "Ten stalls have 20 cows each. Mr. Sylas buys 40 cows and divides them equally, p..." + }, + { + "i": 45, + "expected": "276", + "got": "276", + "status": "OK", + "elapsed": 94.0, + "raw_tail": "ut how many wheels will be left, I need to subtract the total wheels needed from the total wheels available. So, 650 - 374 = 276 wheels.\nTherefore, Henry will have 276 wheels left.\nThe answer is 276.\n", + "question": "Henry needs to assemble some toys, specifically 57 cars and 73 motorcycles. Henr..." + }, + { + "i": 46, + "expected": "32", + "got": "32", + "status": "OK", + "elapsed": 58.3, + "raw_tail": "r Anakin = 10 + 6 + 3 = 19\n\nLocsin caught:\n- Starfish: 10 - 5 = 5\n- Sea horses: 6 - 3 = 3\n- Clownfish: 3 + 2 = 5\nTotal for Locsin = 5 + 3 + 5 = 13\n\nTotal fish caught = 19 + 13 = 32\n\nThe answer is 32.\n", + "question": "Anakin and Locsin went to the beach today. Anakin caught 10 starfish, 6 sea hors..." + }, + { + "i": 47, + "expected": "25", + "got": "25", + "status": "OK", + "elapsed": 118.3, + "raw_tail": "I need to convert the total weight to pounds.\nThere are 16 ounces in 1 pound.\nTotal weight in pounds = 400 \u00f7 16 = 25 pounds.\n\nThe total weight of the crayons and boxes is 25 pounds.\nThe answer is 25.\n", + "question": "Nik has 200 crayons. He wants to separate them into groups of 8 and put them int..." + }, + { + "i": 48, + "expected": "10", + "got": "10", + "status": "OK", + "elapsed": 54.7, + "raw_tail": "nds.\n3. So the increased weight is: 8 + 4 = 12 pounds.\n4. This weight is too heavy, so he uses a weight 2 pounds lighter than 12 pounds.\n5. 
The final weight is: 12 - 2 = 10 pounds.\n\nThe answer is 10.\n", + "question": "Jamaal is at the gym. He has been using an 8-pound weight. He increases the weig..." + }, + { + "i": 49, + "expected": "84", + "got": "84", + "status": "OK", + "elapsed": 83.9, + "raw_tail": "s old, so their total age is 12 + 24 = 36 years.\nFinally, let's add the total age of the sisters and brothers together: 48 + 36 = 84 years.\nTherefore, the total age of all these siblings is 84 years.\n", + "question": "In a family, there are 2 brothers and 3 sisters. All sisters are the same age, w..." + } + ] +} \ No newline at end of file diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/config.json b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/config.json new file mode 100644 index 000000000000..f9f4c880e61e --- /dev/null +++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/config.json @@ -0,0 +1,37 @@ +{ + "git_sha": "unknown", + "wo_split": 1, + "total_wo_ctas": 4, + "slice_ctas": 8, + "gather_ctas": 32, + "total_grid_ctas_per_seq": 32, + "active_wo_ctas": 4, + "hidden_size": 5120, + "num_kv_heads": 4, + "num_q_heads": 24, + "head_dim": 256, + "K": 6144, + "num_k_groups": 384, + "num_k_tiles": 96, + "num_active_tokens": 1, + "seed": 4242, + "launches": 50, + "cooperative": true, + "ncu": false, + "warmup_launches": 0, + "dtypes": { + "attn_output": "bfloat16", + "wo_weight": "uint8", + "wo_scales": "uint8", + "wo_gs": "float32", + "wo_output": "float32", + "final_out": "float32" + }, + "cache_key": "35fee3f003016249", + "diagnostic_no_cooperative": false, + "image_id": null, + "torch_version": "2.12.0.dev20260402+cu132", + "python_version": "3.12.3 (main, Mar 3 2026, 12:15:18) [GCC 13.3.0]", + "effective_bytes_formula": "PAYLOAD = B*K*2 + H*K//2 + nmt*nkt*32*4*4 + 4 + B*H*4; SCRATCH = B*total_wo_ctas*H*4 + GATHER_CTAS*total_wo_ctas*B*H*4; EFFECTIVE = 
PAYLOAD + SCRATCH; GATHER_CTAS = slice_ctas*num_kv_heads (this run: 32)", + "cache_status_first_event": "HIT" +} diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/correctness_gate_split_order.json b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/correctness_gate_split_order.json new file mode 100644 index 000000000000..1287ebc647fc --- /dev/null +++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/correctness_gate_split_order.json @@ -0,0 +1,10 @@ +{ + "passes": true, + "max_abs": 0.0, + "max_rel": 0.0, + "rtol": 0.001, + "atol": 0.0001, + "ref_function": "reference_split_order(wo_split=1)", + "wo_split": 1, + "kind": "AUTHORITATIVE" +} diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/correctness_vs_chained.json b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/correctness_vs_chained.json new file mode 100644 index 000000000000..c098b65eb5b3 --- /dev/null +++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/correctness_vs_chained.json @@ -0,0 +1,6 @@ +{ + "max_abs": 0.0, + "max_rel": 0.0, + "ref_function": "reference_chained_fma", + "kind": "DIAGNOSTIC" +} diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/correctness_vs_matmul.json b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/correctness_vs_matmul.json new file mode 100644 index 000000000000..f019c09b03da --- /dev/null +++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/correctness_vs_matmul.json @@ -0,0 +1,6 @@ +{ + "max_abs": 0.06640625, + "max_rel": 0.006844626739621162, + "ref_function": "reference_matmul", + "kind": "DIAGNOSTIC" +} diff --git 
a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/timing.csv b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/timing.csv new file mode 100644 index 000000000000..ddb577f1ef64 --- /dev/null +++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/timing.csv @@ -0,0 +1,51 @@ +launch_idx,elapsed_us,payload_bytes,scratch_bytes,effective_bytes,effective_gbps,is_warmup +0,14802.335739,17727492,2703360,20430852,1.380245,0 +1,13730.976105,17727492,2703360,20430852,1.487939,0 +2,13796.223640,17727492,2703360,20430852,1.480902,0 +3,13867.839813,17727492,2703360,20430852,1.473254,0 +4,13700.991631,17727492,2703360,20430852,1.491195,0 +5,13705.504417,17727492,2703360,20430852,1.490704,0 +6,13693.375587,17727492,2703360,20430852,1.492025,0 +7,13770.688057,17727492,2703360,20430852,1.483648,0 +8,13704.992294,17727492,2703360,20430852,1.490760,0 +9,13722.335815,17727492,2703360,20430852,1.488876,0 +10,13803.551674,17727492,2703360,20430852,1.480116,0 +11,13716.256142,17727492,2703360,20430852,1.489536,0 +12,13741.279602,17727492,2703360,20430852,1.486823,0 +13,13814.111710,17727492,2703360,20430852,1.478984,0 +14,13842.975616,17727492,2703360,20430852,1.475900,0 +15,13775.135994,17727492,2703360,20430852,1.483169,0 +16,13746.368408,17727492,2703360,20430852,1.486273,0 +17,14579.999924,17727492,2703360,20430852,1.401293,0 +18,13786.591530,17727492,2703360,20430852,1.481936,0 +19,13732.192039,17727492,2703360,20430852,1.487807,0 +20,13801.471710,17727492,2703360,20430852,1.480339,0 +21,13725.376129,17727492,2703360,20430852,1.488546,0 +22,13752.511978,17727492,2703360,20430852,1.485609,0 +23,13766.400337,17727492,2703360,20430852,1.484110,0 +24,13796.799660,17727492,2703360,20430852,1.480840,0 +25,13753.600121,17727492,2703360,20430852,1.485491,0 +26,13727.775574,17727492,2703360,20430852,1.488286,0 +27,16090.143204,17727492,2703360,20430852,1.269774,0 
+28,28432.863235,17727492,2703360,20430852,0.718565,0 +29,14688.256264,17727492,2703360,20430852,1.390965,0 +30,15311.231613,17727492,2703360,20430852,1.334370,0 +31,14410.176277,17727492,2703360,20430852,1.417807,0 +32,14028.096199,17727492,2703360,20430852,1.456424,0 +33,13780.639648,17727492,2703360,20430852,1.482576,0 +34,13766.016006,17727492,2703360,20430852,1.484151,0 +35,13752.032280,17727492,2703360,20430852,1.485661,0 +36,13723.872185,17727492,2703360,20430852,1.488709,0 +37,13732.128143,17727492,2703360,20430852,1.487814,0 +38,13781.951904,17727492,2703360,20430852,1.482435,0 +39,13744.159698,17727492,2703360,20430852,1.486512,0 +40,13721.792221,17727492,2703360,20430852,1.488935,0 +41,13742.303848,17727492,2703360,20430852,1.486712,0 +42,13776.415825,17727492,2703360,20430852,1.483031,0 +43,13775.712013,17727492,2703360,20430852,1.483107,0 +44,13731.167793,17727492,2703360,20430852,1.487918,0 +45,13713.727951,17727492,2703360,20430852,1.489810,0 +46,13810.175896,17727492,2703360,20430852,1.479406,0 +47,13793.248177,17727492,2703360,20430852,1.481221,0 +48,13752.256393,17727492,2703360,20430852,1.485636,0 +49,13806.719780,17727492,2703360,20430852,1.479776,0 diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_region_breakdown.csv b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_region_breakdown.csv new file mode 100644 index 000000000000..9cc11c31e64a --- /dev/null +++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_region_breakdown.csv @@ -0,0 +1,14 @@ +region_id,region,n_active_ctas,cta_class,tick_source,mean_ticks,median_ticks,p99_ticks,raw_total_ticks,median_us,frac_of_kernel +0,phase0_pre_attn,1,phase0,globaltimer,12512.0,12512.0,12512.0,12512,12.512,0.0003128 +1,phase1_attn_pre_wo,4,phase1,globaltimer,245264.0,248320.0,248320.0,981056,248.32,0.006208 
+2,phase1_wo_gemv,4,phase1,globaltimer,14152008.0,14121472.0,14858986.24,56608032,14121.472,0.3530368 +3,phase1_wo_post,1,phase1,globaltimer,128.0,128.0,128.0,128,0.128,3.2e-06 +4,grid_barrier_wait,64,barrier_wait,globaltimer,14310038.0,15211456.0,15211520.0,915842432,15211.456, +5,phase3_load_x,64,phase3,globaltimer,1948.5,1952.0,1984.0,124704,1.952,4.88e-05 +6,phase3_partial_reset,64,phase3,globaltimer,102.5,96.0,160.0,6560,0.096,2.4e-06 +7,phase3_3a_fc1_silu,64,phase3,globaltimer,564574.5,564448.0,610369.6,36132768,564.448,0.0141112 +8,phase3_3b_quant,64,phase3,globaltimer,546.5,544.0,576.0,34976,0.544,1.36e-05 +9,phase3_3c_fc2_atomic,64,phase3,globaltimer,114624.5,113328.0,174219.84,7335968,113.328,0.0028332 +10,phase3_3d_arrival,64,phase3,globaltimer,314.5,96.0,2603.84,20128,0.096,2.4e-06 +11,phase1_pre_wo_wait,0,barrier_wait,globaltimer,0.0,0.0,0.0,0,0.0, +12,phase1_gather_reduce,1,dynamic_single,globaltimer,73312.0,73312.0,73312.0,73312,73.312, diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_region_timings.npy b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_region_timings.npy new file mode 100644 index 000000000000..9f84718c2bd1 Binary files /dev/null and b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_region_timings.npy differ diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed.nsys-rep b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed.nsys-rep new file mode 100644 index 000000000000..91cb212d5803 Binary files /dev/null and b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed.nsys-rep differ diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed.sqlite b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed.sqlite new file mode 100644 index 000000000000..ceb1960b7eee Binary files 
/dev/null and b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed.sqlite differ diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_gsm8k.json b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_gsm8k.json new file mode 100644 index 000000000000..62151fcede40 --- /dev/null +++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_gsm8k.json @@ -0,0 +1,463 @@ +{ + "label": "task11_wo_split_8", + "model": "default", + "api": "http://localhost:8000/v1", + "n": 50, + "seed": 42, + "correct": 4, + "errors": 46, + "accuracy": "4/50 (8.0%)", + "total_seconds": 8782.5, + "results": [ + { + "i": 0, + "expected": "2280", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "The girls are trying to raise money for a carnival. Kim raises $320 more than Al..." + }, + { + "i": 1, + "expected": "1", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Kalinda is working on a 360 piece puzzle with her mom. Kalinda can normally add ..." + }, + { + "i": 2, + "expected": "5", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Tom's ship can travel at 10 miles per hour. He is sailing from 1 to 4 PM. He t..." + }, + { + "i": 3, + "expected": "12", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "James decides to buy birthday candles for his 2 sons. One of them is 12 and the..." 
+ }, + { + "i": 4, + "expected": "273", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Mariah\u2019s grandma was teaching her to knit. Mariah used 1/4 of a skein of yarn. H..." + }, + { + "i": 5, + "expected": "45", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "While playing with her friends in their school playground, Katelyn saw 50 fairie..." + }, + { + "i": 6, + "expected": "21", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "If Ann is 9 years old and her brother is twice her age, how old will her brother..." + }, + { + "i": 7, + "expected": "145", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Twenty dozen cups cost $1200 less than the total cost of half a dozen plates sol..." + }, + { + "i": 8, + "expected": "60", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Last year there were 50 students enrolled in a calligraphy class. This year, the..." + }, + { + "i": 9, + "expected": "122", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Rani has ten more crabs than Monic, who has 4 fewer crabs than Bo. If Bo has 40 ..." + }, + { + "i": 10, + "expected": "29", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. 
(read timeout=180)", + "question": "Brian's basement was damp and musty, so he bought a dehumidifier to remove moist..." + }, + { + "i": 11, + "expected": "80", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "George, a grade six teacher, ordered 600 burritos for the sixth-grade class picn..." + }, + { + "i": 12, + "expected": "36", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Jim spends 2 hours watching TV and then decides to go to bed and reads for half ..." + }, + { + "i": 13, + "expected": "1430", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Janet buys a brooch for her daughter. She pays $500 for the material to make it..." + }, + { + "i": 14, + "expected": "5", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Ducks need to eat 3.5 pounds of insects each week to survive. If there is a floc..." + }, + { + "i": 15, + "expected": "5", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Joey has 214 points before his turn in Scrabble. He scores 26 points. Then Marcy..." 
+ }, + { + "i": 16, + "expected": "5", + "got": "5", + "status": "OK", + "elapsed": 177.4, + "raw_tail": "y books each child received:**\n Sarah divides these books equally among her 4 kids.\n $$20 \\text{ books} / 4 \\text{ kids} = 5 \\text{ books per kid}$$\n\n**Answer:** Each child received **5** books.", + "question": "Sarah went to buy books from the store and spent $300 on the books. If each book..." + }, + { + "i": 17, + "expected": "66", + "got": "66", + "status": "OK", + "elapsed": 75.8, + "raw_tail": "ook = $48\n 3 pencils \u00d7 $6/pencil = $1...\n\n\n$18\n\nTed spends **$48** on books ($16 \u00d7 3) and **$18** on pencils ($6 \u00d7 3).\n\nAdding these together gives the total amount spent:\n$48 + $18 = **$66**", + "question": "Ted starts with $200. He buys 3 books for 16 dollars each and 3 pencils for 6 do..." + }, + { + "i": 18, + "expected": "15", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Eve had 20 pieces of pomelos. After giving her friend some pomelos, Eve is left ..." + }, + { + "i": 19, + "expected": "40", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "The Doubtfire sisters are driving home with 7 kittens adopted from the local ani..." + }, + { + "i": 20, + "expected": "93", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Jen got 3 fish. They each need $1 worth of food a day. How much does she spend..." + }, + { + "i": 21, + "expected": "2000", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. 
(read timeout=180)", + "question": "Elliott is trying to walk 10,000 steps a day. He finished half of his steps on h..." + }, + { + "i": 22, + "expected": "1520", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "While working at the restaurant, each of the forty customers who came into the r..." + }, + { + "i": 23, + "expected": "11050", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "On a certain day, the total cost of filling up 20 helium balloons was $900. Two ..." + }, + { + "i": 24, + "expected": "90", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "To raise money for their class fund, each of the 30 students from one class sold..." + }, + { + "i": 25, + "expected": "40000", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "John decides to get a loan by mortgaging his home. His house is worth $250,000...." + }, + { + "i": 26, + "expected": "21", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Liam and Mitchell own competing lemonade stands across the street from one anoth..." + }, + { + "i": 27, + "expected": "18", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Melanie is a door-to-door saleswoman. She sold a third of her vacuum cleaners at..." 
+ }, + { + "i": 28, + "expected": "14", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Some people got on a bus at the terminal. At the first bus stop, 5 more people g..." + }, + { + "i": 29, + "expected": "23", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "A bumper car rink has 12 red cars. They have 2 fewer green cars than they have r..." + }, + { + "i": 30, + "expected": "145", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Students in class 3B are collecting school points for behavior. If they get enou..." + }, + { + "i": 31, + "expected": "123", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "If a classroom has 3 times as many girls as they do boys, and 1/10 as many nonge..." + }, + { + "i": 32, + "expected": "98", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "The area of Billie's rectangular bedroom is 360 square feet. If the length of hi..." + }, + { + "i": 33, + "expected": "7", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Manolo bought five lollipops and four candies that cost $3.20. If each lollipop ..." + }, + { + "i": 34, + "expected": "34", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. 
(read timeout=180)", + "question": "Three friends spent $20.25 on 3 tickets to the fair. They also spent $4.50 less ..." + }, + { + "i": 35, + "expected": "38", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "At the local Pick Your Own fruit orchard, you could pick your own peaches for $2..." + }, + { + "i": 36, + "expected": "320", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Elaine initially had 20 Pokemon cards. After a month, she collected three times ..." + }, + { + "i": 37, + "expected": "50", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "In a student council election, candidate A got 20% of the votes while candidate ..." + }, + { + "i": 38, + "expected": "50", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Gerald and Julia divided $100 in the ratio 3:2. If Gerald spent $10 on a book, h..." + }, + { + "i": 39, + "expected": "84", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Madeline ate 6 grapes. Her brother wanted to make grape juice and used up 5 time..." + }, + { + "i": 40, + "expected": "50", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Elvis has a monthly saving target of $1125. In April, he wants to save twice as ..." 
+ }, + { + "i": 41, + "expected": "8000", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Marilyn's first record sold 10 times as many copies as Harald's. If they sold 88..." + }, + { + "i": 42, + "expected": "280", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "For the children's day celebration, children are to perform mass drills in the p..." + }, + { + "i": 43, + "expected": "30", + "got": "30", + "status": "OK", + "elapsed": 92.2, + "raw_tail": " Age**: Martha is 4 years younger than Mike.\n $$24 - 4 = 20$$\n So, Martha is **20** years old.\n3. **Jeff's Age**: Jeff is 10 years older than Martha.\n $$20 + 10 = 30$$\n\nJeff is **30** years old.", + "question": "Jeff is 10 years older than his younger sister, Martha. Martha, on the other han..." + }, + { + "i": 44, + "expected": "192", + "got": "", + "status": "ERROR", + "elapsed": 180.0, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Ten stalls have 20 cows each. Mr. Sylas buys 40 cows and divides them equally, p..." + }, + { + "i": 45, + "expected": "276", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Henry needs to assemble some toys, specifically 57 cars and 73 motorcycles. Henr..." + }, + { + "i": 46, + "expected": "32", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Anakin and Locsin went to the beach today. Anakin caught 10 starfish, 6 sea hors..." 
+ }, + { + "i": 47, + "expected": "25", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "Nik has 200 crayons. He wants to separate them into groups of 8 and put them int..." + }, + { + "i": 48, + "expected": "10", + "got": "10", + "status": "OK", + "elapsed": 154.3, + "raw_tail": " $8 + (0.50 \\times 8) = 12$). However, because this is too heavy, he switches to a weight that is two pounds lighter than this new amount ($12 - 2$).\n\nTherefore, Jamaal now uses a **10**-pound weight.", + "question": "Jamaal is at the gym. He has been using an 8-pound weight. He increases the weig..." + }, + { + "i": 49, + "expected": "84", + "got": "", + "status": "ERROR", + "elapsed": 180.1, + "raw_tail": "ERROR: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=180)", + "question": "In a family, there are 2 brothers and 3 sisters. All sisters are the same age, w..." 
+ } + ] +} \ No newline at end of file diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_gsm8k_fullthink.json b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_gsm8k_fullthink.json new file mode 100644 index 000000000000..0dcd13c4266c --- /dev/null +++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_gsm8k_fullthink.json @@ -0,0 +1,463 @@ +{ + "label": "task11_wo_split_8_fullthink", + "model": "default", + "api": "http://localhost:8000/v1", + "n": 50, + "seed": 42, + "correct": 47, + "errors": 0, + "accuracy": "47/50 (94.0%)", + "total_seconds": 3663.7, + "results": [ + { + "i": 0, + "expected": "2280", + "got": "2180", + "status": "WRONG", + "elapsed": 118.0, + "raw_tail": "sed $300\n- Maryam raised $400 more than Sarah, so Maryam raised $300 + $400 = $700\n\nNow I need to add up all the amounts:\n$430 + $750 + $300 + $700 = $2180\n\nTherefore, they all raised $2180 in total.\n", + "question": "The girls are trying to raise money for a carnival. Kim raises $320 more than Al..." + }, + { + "i": 1, + "expected": "1", + "got": "1", + "status": "OK", + "elapsed": 44.3, + "raw_tail": "e = 4/2 = 2 pieces per minute\n3. Combined rate: 4 + 2 = 6 pieces per minute\n4. Total pieces: 360\n5. Time in minutes: 360/6 = 60 minutes\n6. Convert to hours: 60 minutes = 1 hour\n\nThe answer is 1 hour.\n", + "question": "Kalinda is working on a 360 piece puzzle with her mom. Kalinda can normally add ..." + }, + { + "i": 2, + "expected": "5", + "got": "5", + "status": "OK", + "elapsed": 56.0, + "raw_tail": " 30 miles.\n4. Now I need to find out how long it takes to travel back 30 miles at 6 mph.\n5. Time = Distance / Speed = 30 / 6 = 5 hours.\n6. So it takes Tom 5 hours to get back.\n\nThe answer is 5 hours.\n", + "question": "Tom's ship can travel at 10 miles per hour. He is sailing from 1 to 4 PM. He t..." 
+ }, + { + "i": 3, + "expected": "12", + "got": "12", + "status": "OK", + "elapsed": 80.0, + "raw_tail": ". Since each pack contains 5 candles, he needs 20 / 5 = 4 packs.\nFinally, I need to calculate the total cost. Each pack costs $3, so 4 packs cost 4 * $3 = $12.\nTherefore, James spends $12 on candles.\n", + "question": "James decides to buy birthday candles for his 2 sons. One of them is 12 and the..." + }, + { + "i": 4, + "expected": "273", + "got": "273", + "status": "OK", + "elapsed": 62.3, + "raw_tail": "1/2 of a skein.\nSo grandma used: 1/2 \u00d7 364 = 182 yards.\nNow I need to add these together to find the total:\n91 + 182 = 273 yards.\nTherefore, they used 273 yards of yarn altogether.\nThe answer is 273.\n", + "question": "Mariah\u2019s grandma was teaching her to knit. Mariah used 1/4 of a skein of yarn. H..." + }, + { + "i": 5, + "expected": "45", + "got": "45", + "status": "OK", + "elapsed": 55.9, + "raw_tail": "ed the original 50 fairies, so the total became 50 + 25 = 75 fairies.\nThen, 30 fairies flew away, so we subtract 30 from 75.\n75 - 30 = 45.\nTherefore, there are 45 fairies remaining.\nThe answer is 45.\n", + "question": "While playing with her friends in their school playground, Katelyn saw 50 fairie..." + }, + { + "i": 6, + "expected": "21", + "got": "21", + "status": "OK", + "elapsed": 42.8, + "raw_tail": "other is twice her age, so he is 9 * 2 = 18 years old.\nNow I need to find out how old he will be in 3 years.\n18 + 3 = 21 years old.\nSo Ann's brother will be 21 years old in 3 years.\nThe answer is 21.\n", + "question": "If Ann is 9 years old and her brother is twice her age, how old will her brother..." + }, + { + "i": 7, + "expected": "145", + "got": "145", + "status": "OK", + "elapsed": 76.2, + "raw_tail": "o the total cost of cups is $36000 - $1200 = $34800.\nNow, I need to find the cost of each cup. 
Twenty dozen cups is 20 * 12 = 240 cups.\nSo the cost per cup is $34800 / 240 = $145.\nThe answer is $145.\n", + "question": "Twenty dozen cups cost $1200 less than the total cost of half a dozen plates sol..." + }, + { + "i": 8, + "expected": "60", + "got": "60", + "status": "OK", + "elapsed": 64.5, + "raw_tail": "nts.\n4. **Calculate the total enrollment for this year:** Add the number of new students to the initial number.\n * $50 + 10 = 60$ students.\n\nTherefore, there are 60 students enrolled this year.\n", + "question": "Last year there were 50 students enrolled in a calligraphy class. This year, the..." + }, + { + "i": 9, + "expected": "122", + "got": "122", + "status": "OK", + "elapsed": 66.1, + "raw_tail": "crabs\n\nNow I need to calculate the total number of crabs:\nTotal = Bo's crabs + Monic's crabs + Rani's crabs\nTotal = 40 + 36 + 46 = 122 crabs\n\nThe total number of crabs the three have together is 122.\n", + "question": "Rani has ten more crabs than Monic, who has 4 fewer crabs than Bo. If Bo has 40 ..." + }, + { + "i": 10, + "expected": "29", + "got": "29", + "status": "OK", + "elapsed": 177.4, + "raw_tail": "**\n * Sum of all periods: $3 \\text{ liters} + 6 \\text{ liters} + 20 \\text{ liters}$.\n * Calculation: $3 + 6 = 9$; $9 + 20 = 29$.\n\nThe total amount of water removed is 29 liters.\n\n29 liters.\n", + "question": "Brian's basement was damp and musty, so he bought a dehumidifier to remove moist..." + }, + { + "i": 11, + "expected": "80", + "got": "80", + "status": "OK", + "elapsed": 78.6, + "raw_tail": ". George) = 520 burritos.\nFinally, I need to find the number of leftover burritos.\nTotal ordered = 600 burritos.\nLeftover = Total ordered - Total consumed = 600 - 520 = 80 burritos.\nThe answer is 80.\n", + "question": "George, a grade six teacher, ordered 600 burritos for the sixth-grade class picn..." 
+ }, + { + "i": 12, + "expected": "36", + "got": "36", + "status": "OK", + "elapsed": 100.1, + "raw_tail": " * Duration: 4 weeks.\n * Total time: $9 \\text{ hours/week} \\times 4 \\text{ weeks} = 36 \\text{ hours}$.\n\nThe total time Jim spends on TV and reading in 4 weeks is 36 hours.\n\nAnswer: 36 hours.\n", + "question": "Jim spends 2 hours watching TV and then decides to go to bed and reads for half ..." + }, + { + "i": 13, + "expected": "1430", + "got": "1430", + "status": "OK", + "elapsed": 55.7, + "raw_tail": "hich is 10% of the total:\n10% of $1300 = 0.10 \u00d7 $1300 = $130\n\nFinally, I need to add the insurance cost to the total:\n$1300 + $130 = $1430\n\nTherefore, Janet paid $1430 in total.\n\nThe answer is: $1430\n", + "question": "Janet buys a brooch for her daughter. She pays $500 for the material to make it..." + }, + { + "i": 14, + "expected": "5", + "got": "5", + "status": "OK", + "elapsed": 53.1, + "raw_tail": "ds\n\n2. Next, I need to convert this to a daily amount.\n - There are 7 days in a week\n - Daily amount = 35 \u00f7 7 = 5 pounds\n\nTherefore, the flock needs 5 pounds of insects per day.\n\nThe answer is: 5\n", + "question": "Ducks need to eat 3.5 pounds of insects each week to survive. If there is a floc..." + }, + { + "i": 15, + "expected": "5", + "got": "5", + "status": "OK", + "elapsed": 56.8, + "raw_tail": "'s new score = 225 + 10 = 235 points.\nNow I need to find the difference between Joey's score and Marcy's score.\nDifference = 240 - 235 = 5 points.\nSo Joey is now winning by 5 points.\nThe answer is 5.\n", + "question": "Joey has 214 points before his turn in Scrabble. He scores 26 points. Then Marcy..." 
+ }, + { + "i": 16, + "expected": "5", + "got": "5", + "status": "OK", + "elapsed": 47.1, + "raw_tail": "ks.\nNext, I need to find out how many books each child got.\nShe has 4 kids and she gave an equal number of books to each.\nSo, books per child = 20 books / 4 kids = 5 books per child.\nThe answer is 5.\n", + "question": "Sarah went to buy books from the store and spent $300 on the books. If each book..." + }, + { + "i": 17, + "expected": "66", + "got": "66", + "status": "OK", + "elapsed": 36.5, + "raw_tail": "each, so the cost for books is 3 * 16 = 48 dollars.\nHe also buys 3 pencils at $6 each, so the cost for pencils is 3 * 6 = 18 dollars.\nThe total amount spent is 48 + 18 = 66 dollars.\nThe answer is 66.\n", + "question": "Ted starts with $200. He buys 3 books for 16 dollars each and 3 pencils for 6 do..." + }, + { + "i": 18, + "expected": "15", + "got": "15", + "status": "OK", + "elapsed": 40.2, + "raw_tail": "4 of the original amount left.\n3. Calculate how many pomelos she has left: 20 \u00d7 1/4 = 5 pomelos.\n4. Calculate how many pomelos she gave away: 20 - 5 = 15 pomelos.\nTherefore, Eve gave away 15 pomelos.\n", + "question": "Eve had 20 pieces of pomelos. After giving her friend some pomelos, Eve is left ..." + }, + { + "i": 19, + "expected": "40", + "got": "40", + "status": "OK", + "elapsed": 36.8, + "raw_tail": "opted kittens.\n2. Patchy has had thrice the number of adopted kittens, which is 3 * 7 = 21 kittens.\n3. Trixie has had 12 kittens.\n4. The total number of kittens is 7 + 21 + 12 = 40.\nThe answer is 40.\n", + "question": "The Doubtfire sisters are driving home with 7 kittens adopted from the local ani..." + }, + { + "i": 20, + "expected": "93", + "got": "93", + "status": "OK", + "elapsed": 48.6, + "raw_tail": "$3 per day\n2. Next, I need to find out how many days are in May.\n - May has 31 days\n3. 
Finally, I need to calculate the total cost for the month.\n - $3 per day \u00d7 31 days = $93\n\nThe answer is $93.\n", + "question": "Jen got 3 fish. They each need $1 worth of food a day. How much does she spend..." + }, + { + "i": 21, + "expected": "2000", + "got": "5000", + "status": "WRONG", + "elapsed": 212.8, + "raw_tail": "= 8,000 - 6,000 = 2,000 steps.\n\nThe answer is 2,000 steps.\n\nLet me double-check:\n- School walks: 5,000 steps\n- Walk with friend: 1,000 steps\n- Jog: 2,000 steps\n- Remaining: 2,000 steps\nTotal: 5,000 + ", + "question": "Elliott is trying to walk 10,000 steps a day. He finished half of his steps on h..." + }, + { + "i": 22, + "expected": "1520", + "got": "1520", + "status": "OK", + "elapsed": 35.8, + "raw_tail": "Rafaela's tips: 40 customers \u00d7 $20 = $800\nJulieta's tips: $800 - (10% of $800) = $800 - $80 = $720\nTotal tips: $800 + $720 = $1520\nThe answer is 1520.\n", + "question": "While working at the restaurant, each of the forty customers who came into the r..." + }, + { + "i": 23, + "expected": "11050", + "got": "11050", + "status": "OK", + "elapsed": 53.3, + "raw_tail": "days later, the cost per balloon increased by $20, so the new cost per balloon is $45 + $20 = $65.\nBentley filled 170 balloons at the new price, so he paid 170 * $65 = $11,050.\nThe answer is $11,050.\n", + "question": "On a certain day, the total cost of filling up 20 helium balloons was $900. Two ..." + }, + { + "i": 24, + "expected": "90", + "got": "90", + "status": "OK", + "elapsed": 73.5, + "raw_tail": "ofit per lollipop is $0.8 - $0.5 = $0.3.\nFinally, I need to find the total profit. Since they sold 300 lollipops and made $0.3 profit on each, the total profit is 300 * $0.3 = $90.\nThe answer is $90.\n", + "question": "To raise money for their class fund, each of the 30 students from one class sold..." 
+ }, + { + "i": 25, + "expected": "40000", + "got": "40000", + "status": "OK", + "elapsed": 63.9, + "raw_tail": "d to calculate how much he used to pay off debts: 60% of $100,000 = 0.60 \u00d7 $100,000 = $60,000\n3. Finally, I need to calculate the leftover amount: $100,000 - $60,000 = $40,000\n\nThe answer is $40,000.\n", + "question": "John decides to get a loan by mortgaging his home. His house is worth $250,000...." + }, + { + "i": 26, + "expected": "21", + "got": "21", + "status": "OK", + "elapsed": 36.3, + "raw_tail": "d 21 lemonades at $4 each, so:\n21 \u00d7 $4 = $84\n\nNow I need to find the difference between what Mitchell made and what Liam made:\n$84 - $63 = $21\n\nSo Mitchell made $21 more than Liam.\n\nThe answer is 21.\n", + "question": "Liam and Mitchell own competing lemonade stands across the street from one anoth..." + }, + { + "i": 27, + "expected": "18", + "got": "18", + "status": "OK", + "elapsed": 97.4, + "raw_tail": "ify:\n- Start with 18\n- Sold 1/3 (6) at green house, remaining: 12\n- Sold 2 at red house, remaining: 10\n- Sold half of remaining (5) at orange house, remaining: 5\n- Final count: 5 \u2713\n\nThe answer is 18.\n", + "question": "Melanie is a door-to-door saleswoman. She sold a third of her vacuum cleaners at..." + }, + { + "i": 28, + "expected": "14", + "got": "14", + "status": "OK", + "elapsed": 61.9, + "raw_tail": "e total became (x + 5) - 7 + 8 = x + 6.\n4. We know that at the third stop, there were 20 people, so x + 6 = 20.\n5. Solving for x: x = 20 - 6 = 14.\nTherefore, 14 people got on the bus at the terminal.\n", + "question": "Some people got on a bus at the terminal. At the first bus stop, 5 more people g..." + }, + { + "i": 29, + "expected": "23", + "got": "23", + "status": "OK", + "elapsed": 39.9, + "raw_tail": "1. Red cars: 12\n2. Green cars: 12 - 2 = 10\n3. Blue cars: 3 * 10 = 30\n4. Total of red, green, and blue cars: 12 + 10 + 30 = 52\n5. 
Yellow cars: 75 - 52 = 23\nThe answer is 23.\n", + "question": "A bumper car rink has 12 red cars. They have 2 fewer green cars than they have r..." + }, + { + "i": 30, + "expected": "145", + "got": "145", + "status": "OK", + "elapsed": 70.2, + "raw_tail": " Marta has 3 * 35 = 105 points.\nTotal points collected by the class = 50 + 65 + 35 + 105 = 255 points.\nThe minimum threshold is 400 points.\nPoints missing = 400 - 255 = 145 points.\nThe answer is 145.\n", + "question": "Students in class 3B are collecting school points for behavior. If they get enou..." + }, + { + "i": 31, + "expected": "123", + "got": "123", + "status": "OK", + "elapsed": 86.7, + "raw_tail": "e number of boys, girls, and nongendered children together.\n * Total = Boys + Girls + Nongendered\n * Total = $30 + 90 + 3$\n * Total = $123$.\n\nThe classroom has a total of 123 children.\n", + "question": "If a classroom has 3 times as many girls as they do boys, and 1/10 as many nonge..." + }, + { + "i": 32, + "expected": "98", + "got": "98", + "status": "OK", + "elapsed": 66.9, + "raw_tail": "Area = Length \u00d7 Width\nSo, 360 = 9 \u00d7 Width\nWidth = 360 \u00f7 9 = 40 feet.\nNow I can find the perimeter.\nPerimeter = 2 \u00d7 (Length + Width)\nPerimeter = 2 \u00d7 (9 + 40) = 2 \u00d7 49 = 98 feet.\nThe answer is 98 feet.\n", + "question": "The area of Billie's rectangular bedroom is 360 square feet. If the length of hi..." + }, + { + "i": 33, + "expected": "7", + "got": "7", + "status": "OK", + "elapsed": 89.9, + "raw_tail": "ow I need to calculate the cost of 10 lollipops and 10 candies.\n10 lollipops cost 10 \u00d7 $0.40 = $4.00.\n10 candies cost 10 \u00d7 $0.30 = $3.00.\nThe total cost is $4.00 + $3.00 = $7.00.\nThe answer is $7.00.\n", + "question": "Manolo bought five lollipops and four candies that cost $3.20. If each lollipop ..." 
+ }, + { + "i": 34, + "expected": "34", + "got": "34", + "status": "OK", + "elapsed": 84.4, + "raw_tail": "up all expenses:\nTickets: $20.25\nFood: $15.75\nRides: $66\nTotal = $20.25 + $15.75 + $66 = $102.\nSince there are 3 friends splitting the cost evenly, each person pays $102 / 3 = $34.\nThe answer is $34.\n", + "question": "Three friends spent $20.25 on 3 tickets to the fair. They also spent $4.50 less ..." + }, + { + "i": 35, + "expected": "38", + "got": "38", + "status": "OK", + "elapsed": 46.6, + "raw_tail": "- Peaches: 6 pounds * $2.00/pound = $12.00\n- Plums: 8 pounds * $1.00/pound = $8.00\n- Apricots: 6 pounds * $3.00/pound = $18.00\n- Total: $12.00 + $8.00 + $18.00 = $38.00\nThe answer is $38.00.\n", + "question": "At the local Pick Your Own fruit orchard, you could pick your own peaches for $2..." + }, + { + "i": 36, + "expected": "320", + "got": "320", + "status": "OK", + "elapsed": 83.3, + "raw_tail": "+ 40) = 2 * 100 = 200 cards.\nNow, I need to add up all the cards she has:\nInitial cards: 20\nFirst month: 60\nSecond month: 40\nThird month: 200\nTotal: 20 + 60 + 40 + 200 = 320 cards.\nThe answer is 320.\n", + "question": "Elaine initially had 20 Pokemon cards. After a month, she collected three times ..." + }, + { + "i": 37, + "expected": "50", + "got": "50", + "status": "OK", + "elapsed": 115.5, + "raw_tail": "andidates A and B from the total number of voters:\nTotal votes = 100\nVotes for A = 20\nVotes for B = 30\nVotes for C = 100 - 20 - 30 = 50 votes.\n\nTherefore, candidate C got 50 votes.\n\nThe answer is 50.\n", + "question": "In a student council election, candidate A got 20% of the votes while candidate ..." 
+ }, + { + "i": 38, + "expected": "50", + "got": "50", + "status": "OK", + "elapsed": 67.3, + "raw_tail": "'s share = 3 parts \u00d7 $20 = $60.\nJulia's share = 2 parts \u00d7 $20 = $40.\nNow, Gerald spent $10 on a book.\nSo, Gerald's remaining money = $60 - $10 = $50.\nTherefore, Gerald had $50 left.\nThe answer is 50.\n", + "question": "Gerald and Julia divided $100 in the ratio 3:2. If Gerald spent $10 on a book, h..." + }, + { + "i": 39, + "expected": "84", + "got": "84", + "status": "OK", + "elapsed": 46.8, + "raw_tail": "3. Their mother made 4 pies, and each pie requires 12 grapes: 4 * 12 = 48 grapes.\n4. Total grapes = Madeline's grapes + Brother's grapes + Mother's grapes = 6 + 30 + 48 = 84 grapes.\nThe answer is 84.\n", + "question": "Madeline ate 6 grapes. Her brother wanted to make grape juice and used up 5 time..." + }, + { + "i": 40, + "expected": "50", + "got": "50", + "status": "OK", + "elapsed": 92.9, + "raw_tail": "gs target is $1125, so 45x = 1125\n6. Solving for x: x = 1125 \u00f7 45 = 25\n7. So Elvis saves $25 per day in the first half.\n8. In the second half, he saves $2x = 2 \u00d7 25 = $50 per day.\n\nThe answer is $50.\n", + "question": "Elvis has a monthly saving target of $1125. In April, he wants to save twice as ..." + }, + { + "i": 41, + "expected": "8000", + "got": "8000", + "status": "OK", + "elapsed": 90.6, + "raw_tail": "H = 88,000\nH = 88,000 / 11\nH = 8,000\n\nSo Harald sold 8,000 copies.\nLet me verify:\n- Harald: 8,000 copies\n- Marilyn: 10 \u00d7 8,000 = 80,000 copies\n- Total: 8,000 + 80,000 = 88,000 \u2713\n\nThe answer is 8,000.\n", + "question": "Marilyn's first record sold 10 times as many copies as Harald's. If they sold 88..." + }, + { + "i": 42, + "expected": "280", + "got": "280", + "status": "OK", + "elapsed": 48.3, + "raw_tail": "dren per row \u00d7 7 rows = 56 children per school\n\n3. 
Calculate total children for all schools:\n - 56 children per school \u00d7 5 schools = 280 children\n\nTherefore, there can be 280 children in the drill.\n", + "question": "For the children's day celebration, children are to perform mass drills in the p..." + }, + { + "i": 43, + "expected": "30", + "got": "30", + "status": "OK", + "elapsed": 34.1, + "raw_tail": "is 24 years old.\n2. Martha is 4 years younger than Mike, so Martha is 24 - 4 = 20 years old.\n3. Jeff is 10 years older than Martha, so Jeff is 20 + 10 = 30 years old.\nTherefore, Jeff is 30 years old.\n", + "question": "Jeff is 10 years older than his younger sister, Martha. Martha, on the other han..." + }, + { + "i": 44, + "expected": "192", + "got": "8", + "status": "WRONG", + "elapsed": 211.7, + "raw_tail": "uous - which 8 stalls? Let me assume it means any 8 stalls, or perhaps 8 of the stalls that originally had cows.\n\nIf we take 8 stalls that originally had 20 cows:\n- Each now has 22 cows\n- Total = 8 \u00d7 ", + "question": "Ten stalls have 20 cows each. Mr. Sylas buys 40 cows and divides them equally, p..." + }, + { + "i": 45, + "expected": "276", + "got": "276", + "status": "OK", + "elapsed": 63.5, + "raw_tail": "els for cars + 146 wheels for motorcycles = 374 wheels total.\nFinally, I need to find out how many wheels will be left.\n650 wheels in the box - 374 wheels needed = 276 wheels left.\nThe answer is 276.\n", + "question": "Henry needs to assemble some toys, specifically 57 cars and 73 motorcycles. Henr..." + }, + { + "i": 46, + "expected": "32", + "got": "32", + "status": "OK", + "elapsed": 73.8, + "raw_tail": "fish than Anakin, so Locsin caught 3 + 2 = 5 clownfish.\nNow, let's add up all the fish they caught:\nAnakin: 10 + 6 + 3 = 19 fish\nLocsin: 5 + 3 + 5 = 13 fish\nTotal: 19 + 13 = 32 fish\nThe answer is 32.\n", + "question": "Anakin and Locsin went to the beach today. Anakin caught 10 starfish, 6 sea hors..." 
+ }, + { + "i": 47, + "expected": "25", + "got": "25", + "status": "OK", + "elapsed": 104.7, + "raw_tail": "Total weight = 200 + 200 = 400 ounces.\n\nFinally, I need to convert the total weight to pounds.\nThere are 16 ounces in 1 pound.\nTotal weight in pounds = 400 \u00f7 16 = 25 pounds.\n\nThe answer is 25 pounds.\n", + "question": "Nik has 200 crayons. He wants to separate them into groups of 8 and put them int..." + }, + { + "i": 48, + "expected": "10", + "got": "10", + "status": "OK", + "elapsed": 55.8, + "raw_tail": " 8 + 4 = 12 pounds.\n4. This weight is too heavy, so he uses a weight that is 2 pounds lighter than 12 pounds.\n5. 12 - 2 = 10 pounds.\n6. Therefore, Jamaal now uses a 10-pound weight.\nThe answer is 10.\n", + "question": "Jamaal is at the gym. He has been using an 8-pound weight. He increases the weig..." + }, + { + "i": 49, + "expected": "84", + "got": "84", + "status": "OK", + "elapsed": 58.6, + "raw_tail": "lculate the total age of all siblings:\n- 3 sisters, each 16 years old: 3 * 16 = 48 years\n- 1 brother is 12 years old\n- 1 brother is 24 years old\nTotal age = 48 + 12 + 24 = 84 years.\nThe answer is 84.\n", + "question": "In a family, there are 2 brothers and 3 sisters. All sisters are the same age, w..." 
+ } + ] +} \ No newline at end of file diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/config.json b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/config.json new file mode 100644 index 000000000000..5b6403b2bf27 --- /dev/null +++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/config.json @@ -0,0 +1,37 @@ +{ + "git_sha": "3300f7776eb2b2c875097a98cec90c913f34aacf", + "wo_split": 8, + "total_wo_ctas": 32, + "slice_ctas": 8, + "gather_ctas": 32, + "total_grid_ctas_per_seq": 32, + "active_wo_ctas": 32, + "hidden_size": 5120, + "num_kv_heads": 4, + "num_q_heads": 24, + "head_dim": 256, + "K": 6144, + "num_k_groups": 384, + "num_k_tiles": 96, + "num_active_tokens": 1, + "seed": 4242, + "launches": 50, + "cooperative": true, + "ncu": false, + "warmup_launches": 0, + "dtypes": { + "attn_output": "bfloat16", + "wo_weight": "uint8", + "wo_scales": "uint8", + "wo_gs": "float32", + "wo_output": "float32", + "final_out": "float32" + }, + "cache_key": "a0950af2b637ba65", + "diagnostic_no_cooperative": false, + "image_id": "sha256:9c0f1d31c92c29488f66a2c136183950cea787035d735ff95dd6af193740f530", + "torch_version": "2.12.0.dev20260402+cu132", + "python_version": "3.12.3 (main, Mar 3 2026, 12:15:18) [GCC 13.3.0]", + "effective_bytes_formula": "PAYLOAD = B*K*2 + H*K//2 + nmt*nkt*32*4*4 + 4 + B*H*4; SCRATCH = B*total_wo_ctas*H*4 + GATHER_CTAS*total_wo_ctas*B*H*4; EFFECTIVE = PAYLOAD + SCRATCH; GATHER_CTAS = slice_ctas*num_kv_heads (this run: 32)", + "cache_status_first_event": "MISS" +} diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/correctness_gate_split_order.json b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/correctness_gate_split_order.json new file mode 100644 index 000000000000..590a15e7a5b3 --- /dev/null +++ 
b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/correctness_gate_split_order.json @@ -0,0 +1,10 @@ +{ + "passes": true, + "max_abs": 0.0, + "max_rel": 0.0, + "rtol": 0.001, + "atol": 0.0001, + "ref_function": "reference_split_order(wo_split=8)", + "wo_split": 8, + "kind": "AUTHORITATIVE" +} diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/correctness_vs_chained.json b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/correctness_vs_chained.json new file mode 100644 index 000000000000..ddd2759cab51 --- /dev/null +++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/correctness_vs_chained.json @@ -0,0 +1,6 @@ +{ + "max_abs": 0.05859375, + "max_rel": 0.0013783597387373447, + "ref_function": "reference_chained_fma", + "kind": "DIAGNOSTIC" +} diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/correctness_vs_matmul.json b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/correctness_vs_matmul.json new file mode 100644 index 000000000000..7ee237cca16a --- /dev/null +++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/correctness_vs_matmul.json @@ -0,0 +1,6 @@ +{ + "max_abs": 0.02734375, + "max_rel": 0.00547570176422596, + "ref_function": "reference_matmul", + "kind": "DIAGNOSTIC" +} diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/timing.csv b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/timing.csv new file mode 100644 index 000000000000..92e4a556f7bf --- /dev/null +++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/timing.csv @@ -0,0 +1,51 @@ +launch_idx,elapsed_us,payload_bytes,scratch_bytes,effective_bytes,effective_gbps,is_warmup 
+0,2687.776089,17727492,21626880,39354372,14.641983,0 +1,1689.120054,17727492,21626880,39354372,23.298742,0 +2,1650.464058,17727492,21626880,39354372,23.844428,0 +3,1642.079949,17727492,21626880,39354372,23.966173,0 +4,1636.736035,17727492,21626880,39354372,24.044422,0 +5,1640.416026,17727492,21626880,39354372,23.990483,0 +6,1634.464025,17727492,21626880,39354372,24.077845,0 +7,1647.168040,17727492,21626880,39354372,23.892142,0 +8,1645.344019,17727492,21626880,39354372,23.918628,0 +9,1642.431974,17727492,21626880,39354372,23.961036,0 +10,1648.895979,17727492,21626880,39354372,23.867104,0 +11,1767.647982,17727492,21626880,39354372,22.263693,0 +12,1647.168040,17727492,21626880,39354372,23.892142,0 +13,1642.560005,17727492,21626880,39354372,23.959169,0 +14,1640.064001,17727492,21626880,39354372,23.995632,0 +15,1638.656020,17727492,21626880,39354372,24.016250,0 +16,1644.032001,17727492,21626880,39354372,23.937717,0 +17,1645.984054,17727492,21626880,39354372,23.909328,0 +18,1646.816015,17727492,21626880,39354372,23.897249,0 +19,1639.968038,17727492,21626880,39354372,23.997036,0 +20,1645.087957,17727492,21626880,39354372,23.922351,0 +21,1645.280004,17727492,21626880,39354372,23.919559,0 +22,1645.087957,17727492,21626880,39354372,23.922351,0 +23,1645.663977,17727492,21626880,39354372,23.913978,0 +24,1643.167973,17727492,21626880,39354372,23.950304,0 +25,1644.991994,17727492,21626880,39354372,23.923747,0 +26,1711.967945,17727492,21626880,39354372,22.987797,0 +27,1640.992045,17727492,21626880,39354372,23.982061,0 +28,1641.855955,17727492,21626880,39354372,23.969443,0 +29,1638.592005,17727492,21626880,39354372,24.017188,0 +30,1639.263988,17727492,21626880,39354372,24.007343,0 +31,1641.792059,17727492,21626880,39354372,23.970375,0 +32,1642.207980,17727492,21626880,39354372,23.964304,0 +33,1639.616013,17727492,21626880,39354372,24.002188,0 +34,1641.407967,17727492,21626880,39354372,23.975985,0 +35,1641.183972,17727492,21626880,39354372,23.979257,0 
+36,1643.488050,17727492,21626880,39354372,23.945639,0 +37,1640.416026,17727492,21626880,39354372,23.990483,0 +38,1639.680028,17727492,21626880,39354372,24.001251,0 +39,1641.919971,17727492,21626880,39354372,23.968508,0 +40,1641.280055,17727492,21626880,39354372,23.977853,0 +41,1722.527981,17727492,21626880,39354372,22.846870,0 +42,1637.760043,17727492,21626880,39354372,24.029388,0 +43,1643.776059,17727492,21626880,39354372,23.941444,0 +44,1650.431991,17727492,21626880,39354372,23.844892,0 +45,1639.168024,17727492,21626880,39354372,24.008748,0 +46,1640.735984,17727492,21626880,39354372,23.985804,0 +47,1638.623953,17727492,21626880,39354372,24.016720,0 +48,1643.808007,17727492,21626880,39354372,23.940978,0 +49,1640.159965,17727492,21626880,39354372,23.994228,0 diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_region_breakdown.csv b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_region_breakdown.csv new file mode 100644 index 000000000000..a0ddf16a35cd --- /dev/null +++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_region_breakdown.csv @@ -0,0 +1,14 @@ +region_id,region,n_active_ctas,cta_class,tick_source,mean_ticks,median_ticks,p99_ticks,raw_total_ticks,median_us,frac_of_kernel +0,phase0_pre_attn,1,phase0,globaltimer,12288.0,12288.0,12288.0,12288,12.288,0.0003072 +1,phase1_attn_pre_wo,4,phase1,globaltimer,247904.0,250912.0,250912.0,991616,250.912,0.0062728 +2,phase1_wo_gemv,4,phase1,globaltimer,14376232.0,14341536.0,15108648.0,57504928,14341.536,0.3585384 +3,phase1_wo_post,2,phase1,globaltimer,32.0,32.0,32.0,64,0.032,8e-07 +4,grid_barrier_wait,64,barrier_wait,globaltimer,14538419.5,15453952.0,15454016.0,930458848,15453.952, +5,phase3_load_x,64,phase3,globaltimer,1938.0,1952.0,1984.0,124032,1.952,4.88e-05 +6,phase3_partial_reset,64,phase3,globaltimer,119.5,112.0,256.0,7648,0.112,2.8e-06 
+7,phase3_3a_fc1_silu,64,phase3,globaltimer,572357.0,573072.0,618823.04,36630848,573.072,0.0143268 +8,phase3_3b_quant,64,phase3,globaltimer,546.0,544.0,576.0,34944,0.544,1.36e-05 +9,phase3_3c_fc2_atomic,64,phase3,globaltimer,115795.0,115440.0,176603.2,7410880,115.44,0.002886 +10,phase3_3d_arrival,64,phase3,globaltimer,318.5,96.0,2592.0,20384,0.096,2.4e-06 +11,phase1_pre_wo_wait,0,barrier_wait,globaltimer,0.0,0.0,0.0,0,0.0, +12,phase1_gather_reduce,1,dynamic_single,globaltimer,74368.0,74368.0,74368.0,74368,74.368, diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_region_timings.npy b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_region_timings.npy new file mode 100644 index 000000000000..3354af9fb875 Binary files /dev/null and b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_region_timings.npy differ diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_v2_region_breakdown.csv b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_v2_region_breakdown.csv new file mode 100644 index 000000000000..4a17e3cb3177 --- /dev/null +++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_v2_region_breakdown.csv @@ -0,0 +1,14 @@ +region_id,region,n_active_ctas,cta_class,tick_source,mean_ticks,median_ticks,p99_ticks,raw_total_ticks,median_us,frac_of_kernel +0,phase0_pre_attn,1,phase0,globaltimer,12448.0,12448.0,12448.0,12448,12.448,0.0003112 +1,phase1_attn_pre_wo,4,phase1,globaltimer,246624.0,249664.0,249664.0,986496,249.664,0.0062416 +2,phase1_wo_gemv,32,phase1,globaltimer,2266876.0,2359712.0,2409204.16,72540032,2359.712,0.0589928 +3,phase1_wo_post,17,phase1,globaltimer,69.6470588235294,32.0,213.76,1184,0.032,8e-07 +4,grid_barrier_wait,64,barrier_wait,globaltimer,1561391.0,1752528.0,2822528.0,99929024,1752.528, 
+5,phase3_load_x,64,phase3,globaltimer,1937.0,1952.0,2016.0,123968,1.952,4.88e-05 +6,phase3_partial_reset,64,phase3,globaltimer,93.5,64.0,160.0,5984,0.064,1.6e-06 +7,phase3_3a_fc1_silu,64,phase3,globaltimer,569275.0,567936.0,617433.92,36433600,567.936,0.0141984 +8,phase3_3b_quant,64,phase3,globaltimer,549.5,544.0,576.0,35168,0.544,1.36e-05 +9,phase3_3c_fc2_atomic,64,phase3,globaltimer,116767.0,114608.0,177275.2,7473088,114.608,0.0028652 +10,phase3_3d_arrival,64,phase3,globaltimer,317.5,96.0,2571.84,20320,0.096,2.4e-06 +11,phase1_pre_wo_wait,28,barrier_wait,globaltimer,249904.0,249888.0,250007.36,6997312,249.888, +12,phase1_gather_reduce,3,dynamic_single,globaltimer,166368.0,166720.0,173431.04,499104,166.72, diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_v2_region_timings.npy b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_v2_region_timings.npy new file mode 100644 index 000000000000..b0a01f29d84e Binary files /dev/null and b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_v2_region_timings.npy differ diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/nsys_summary.md b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/nsys_summary.md new file mode 100644 index 000000000000..27ad1a9f8517 --- /dev/null +++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/nsys_summary.md @@ -0,0 +1,200 @@ +# wo_k_parallel_audit / 2026-05-03-wo-split-8-prod / nsys_summary + +NSYS evidence for the W_O K-parallel total-kernel performance claim. + +## Status + +DONE -- Plan B (harness microkernel under nsys) at production-equivalent +launch shape. 
Plan A (vLLM V1 server under nsys) was not attempted as a +production-trace path because vLLM V1 spawns the EngineCore as a separate +process whose CUPTI activity is not captured by an nsys profile attached +to the API server (per project memory `feedback_vllm_profiling`). The +`--target-processes=all` flag still requires the spawned subprocess to +inherit the CUPTI injection env; vLLM strips most env across the +multiprocessing spawn (per `feedback_vllm_enginecore_env_strip`), and the +sentinel-file workaround does not propagate CUPTI injection. + +The harness traces below capture the W_O microkernel only, at the same +launch shape produced inside the production beta-coop kernel +(slice_ctas=8 -> 32 cooperative-grid CTAs per seq, num_kv_heads=4), +which is the substantive kernel work the W_O K-parallel optimization +modifies. + +## Approach used + +Plan B: harness microkernel. + +- Driver: `docs/research/2026-05-03-w-o-k-parallel-harness/run_harness.py` +- nsys: `/opt/nvidia/nsight-systems/2025.6.3` (host bind-mount, --privileged) +- Trace flags: `-t cuda,nvtx` +- Capture: full process duration (no `--duration`, no `--capture-range`) +- Cache: warm `/tmp/cute_harness_cache_v3` (HIT for changed; MISS->stored + for baseline first run, then HIT on rerun) +- Both runs use 50 timed launches, B=1 active token, seed=4242, + cooperative=True (hardwired in microkernel.py) + +## Provenance + +- Branch: `evidence/wo-k-parallel-harness` +- HEAD at run time: `3300f7776eb2b2c875097a98cec90c913f34aacf` +- Image id: `sha256:9c0f1d31c92c29488f66a2c136183950cea787035d735ff95dd6af193740f530` +- Image tag: `nvllm:gb10` +- Hardware: NVIDIA GB10 (DGX Spark, SM120, 48 SMs) +- nsys version: 2025.6.3.541-256337736014v0 + +## Configs + +| | wo_split | slice_ctas | total_grid_ctas | active_W_O_ctas | gather_ctas | cache_key | +|---|---|---|---|---|---|---| +| baseline | 1 | 8 | 32 | 4 | 32 | `35fee3f003016249` | +| changed | 8 | 8 | 32 | 32 | 32 | `a0950af2b637ba65` | + +Both configs share 
the same total cooperative-grid size (32 CTAs per +seq, matching production beta-coop). The only axis varied is the W_O +K-parallel split (active W_O CTAs go 4 -> 32). + +## Correctness gate (AUTHORITATIVE) + +Both runs are bit-exact against `reference_split_order(wo_split=N)` +(the kernel and reference share the same reduction tree). + +| | passes | max_abs | max_rel | +|---|---|---|---| +| baseline (wo_split=1) | true | 0.0 | 0.0 | +| changed (wo_split=8) | true | 0.0 | 0.0 | + +## W_O kernel timings (nsys cuda_gpu_kern_sum, exact) + +Kernel symbol: `kernel_cutlass__wo_kernel_body_________________0` +Instances per run: 50 timed launches. + +| Stat | Baseline (wo_split=1) | Changed (wo_split=8) | Ratio | +|---|---:|---:|---:| +| Total (ns) | 706,025,504 | 79,899,712 | 8.84x | +| Avg (ns) | 14,120,510.1 | 1,597,994.2 | 8.84x | +| Med (ns) | 13,715,248.0 | 1,598,064.0 | 8.58x | +| Min (ns) | 13,643,232 | 1,585,248 | 8.61x | +| Max (ns) | 28,391,072 | 1,625,888 | 17.46x | +| StdDev (ns) | 2,101,617.2 | 5,677.2 | n/a | + +In microseconds (median, rounded to the nearest microsecond): + +- Baseline median: 13,715 us +- Changed median: 1,598 us +- Delta: -12,117 us (-88.3% / 8.58x) + +The baseline max (28.39 ms) is the launch-0 outlier (first-launch warmup +artifact; cold-cache JIT/driver). All other 49 launches are within +13.6-13.8 ms. The changed run is steady-state (StdDev ~5 us). + +These per-launch numbers correspond to the harness CUDA-event timings +(timing.csv) of: +- Baseline: 13,776-13,925 us at launches 1-49 (host-side CUDA event) +- Changed: 1,640-1,690 us at launches 1-49 (host-side CUDA event) + +Host CUDA-event timings include cooperative-launch overhead and any +launch-edge sync; nsys CUPTI kernel timings are device-time-only, so the +absolute values differ slightly while the two methods agree on the ratio. + +## Methodology caveats + +1.
**Harness microkernel ONLY -- not the full beta-coop kernel.** + The harness exercises only the W_O+gather portion of the beta-coop + kernel (the section the K-parallel optimization modifies). It does + not include Phase 0 (input LN), Phase 1 (attention RMS+QK+SDPA), + Phase 3 (MLP) or Phase 4 (post-attn LN), all of which run in the + production beta-coop kernel. These traces cannot be used to argue + end-to-end per-call cost for vLLM serving. + +2. **Production grid layout matched, not full kernel composition.** + The harness reproduces the 32-CTA cooperative grid, num_kv_heads=4, + num_q_heads=24, head_dim=256, K=6144, hidden_size=5120, + NUM_THREADS=128, tile_s, tile_k, FP4 NVFP4 weight layout. What it + does NOT match is the constexpr fan-in of inputs from upstream phases + (those are present inside the prod beta-coop kernel but not exercised + here, since the harness feeds synthetic attn_output directly). + +3. **No first-launch outlier rejection.** Baseline max (28.4 ms) is the + launch-0 outlier (cache-MISS first call on this process). The + 50-sample median (13.72 ms) is the canonical number. Reported + total/avg are inflated by ~2% by this single launch. + +4. **Plan A (vLLM V1 nsys) blocked by architecture.** vLLM V1 EngineCore + is a spawned subprocess. Per project memory `feedback_vllm_profiling`, + nsys does not capture EngineCore CUPTI activity. The + `--target-processes=all` flag is necessary but not sufficient because + the EngineCore subprocess does not inherit CUPTI injection env vars + from the API server's nsys-instrumented context. Project policy is to + use vLLM's torch profiler API (`/start_profile`/`/stop_profile` via + `VLLM_TORCH_PROFILER_DIR`) for V1 evidence -- but that produces a + torch-profiler trace, not an nsys trace. AGENTS.md sec.4 specifies + nsys traces, so the correct authoritative measurement at the + production GRID SHAPE is the harness microkernel under nsys (this + run). 
For the end-to-end serving cost we publish region-timing CSVs + and GSM8K evals (sibling files in this dir). + +5. **First baseline capture had to be re-run.** The first attempt + (`baseline.nsys-rep` size 1.33 MB, 11 kernel instances total, 0 + `wo_kernel` instances) had only the setup-phase kernels. Cause is + under-captured but coincided with a cache-MISS first-launch. The + re-run (cache-HIT) captured all 50 launches as expected. Both + traces below are the re-run captures. + +## Files produced + +- `benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline.nsys-rep` +- `benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed.nsys-rep` +- `benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/baseline_nsys_run/{config,correctness_gate_split_order,correctness_vs_chained,correctness_vs_matmul,timing}.{json,csv}` +- `benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/changed_nsys_run/{config,correctness_gate_split_order,correctness_vs_chained,correctness_vs_matmul,timing}.{json,csv}` +- `benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/nsys_summary.md` (this file) + +## Reproduction + +```bash +REPO=/home/natfii/docker/nvllm +cd "$REPO" +mkdir -p /tmp/cute_harness_cache_v3 +DST="$REPO/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod" + +for WS in 1 8; do + if [ "$WS" = "1" ]; then NAME=baseline; else NAME=changed; fi + docker run --rm --gpus all --privileged \ + -v /opt/nvidia/nsight-systems/2025.6.3:/opt/nsys \ + -v "$REPO:/work" \ + -v "$REPO:/app/nvllm" \ + -v "/tmp/cute_harness_cache_v3:/tmp/cute_harness_cache_v3" \ + --entrypoint /opt/nsys/bin/nsys \ + nvllm:gb10 \ + profile -t cuda,nvtx \ + -o "/work/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/${NAME}.nsys-rep" \ + --force-overwrite=true \ + /opt/venv/bin/python /work/docs/research/2026-05-03-w-o-k-parallel-harness/run_harness.py \ + --wo-split "$WS" \ + 
--slice-ctas 8 \ + --launches 50 \ + --out "/work/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/${NAME}_nsys_run" +done + +NSYS=/opt/nvidia/nsight-systems/2025.6.3/bin/nsys +$NSYS stats --report cuda_gpu_kern_sum:mangled "$DST/baseline.nsys-rep" | grep "wo_kernel_body" +$NSYS stats --report cuda_gpu_kern_sum:mangled "$DST/changed.nsys-rep" | grep "wo_kernel_body" +``` + +## Cross-references + +- Parity-gap audit (slice_ctas axis study): + `benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-parity-gap/README.md` + config B (wo_split=1, slice_ctas=8) and config C (wo_split=8, + slice_ctas=8) match the baseline/changed shapes here. The parity-gap + traces include --slice-ctas axis and audit kernel-level microbenchmarks; + this dir's traces are the focused two-point evidence at the production + grid. +- Region-timing CSVs in this same dir capture full-beta-coop kernel + region breakdown (not just W_O). +- GSM8K eval JSONs in this same dir confirm correctness end-to-end at + both wo_split values under live vLLM serving. 
+- Harness README: + `docs/research/2026-05-03-w-o-k-parallel-harness/README.md` +- Phase-E kernel: `vllm/v1/attention/backends/cute_paged/phase_e_kernel.py` + - W_O slot: line 3306 (`self._kernel_phase_0_to_4(...)`) + - wo_split env read: line 262 (`os.environ.get("CUTE_WO_SPLIT", "1")`) diff --git a/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/summary.md b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/summary.md new file mode 100644 index 000000000000..35dfc3b0dc51 --- /dev/null +++ b/benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-wo-split-8-prod/summary.md @@ -0,0 +1,201 @@ +# wo_split=8 K-parallel W_O GEMV — production prototype evidence + +**Commit:** `b3f75721d` on branch `evidence/wo-k-parallel-harness` +**Date captured:** 2026-05-04 +**Model:** `ig1/Qwen3.5-27B-NVFP4` (non-distilled, official llm-compressor VL recipe) + +## Serve config (both runs) + +``` +serve --model ig1/Qwen3.5-27B-NVFP4 --served-model-name default + --kv-cache-dtype fp8_e4m3 + --attention-backend CUTE_PAGED + --max-model-len 65536 --max-num-seqs 4 + --gpu-memory-utilization 0.70 + --kernel-config '{"enable_flashinfer_autotune":false}' + --compilation-config '{"cudagraph_mode":"PIECEWISE"}' + --language-model-only + --trust-remote-code +``` + +Container `nvllm:gb10` with β-coop fusion enabled (`CUTE_PHASE_E_FUSION=1`, layers 0-7). +Bind-mount: `vllm/v1/attention/backends/cute_paged` for live-update. + +The two runs differ ONLY in `CUTE_WO_SPLIT` (1 = baseline, 8 = changed). The env var +propagates to the EngineCore subprocess via the `/tmp/c2_diag/ENV` sentinel-file +workaround (commit `ef9f68028`). 
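The sentinel-file workaround mentioned above can be illustrated with a small standalone sketch. It assumes the file holds plain `KEY=VALUE` lines and that only allow-listed keys are applied with set-if-absent semantics, so a value already present in the environment is never overridden. The function name `load_sentinel_env` and the dict-based `environ` argument are hypothetical and chosen for illustration; this is not the code shipped in `qwen3_5.py`.

```python
import os
import tempfile


def load_sentinel_env(path, environ,
                      allowed_prefixes=("CUTE_C2_", "CUTE_WO_SPLIT=")):
    """Apply allow-listed KEY=VALUE lines from `path` to `environ`.

    Hypothetical helper: existing keys are never overwritten, and empty
    values are skipped so they cannot shadow a real setting.
    """
    if not os.path.isfile(path):
        return environ
    with open(path) as f:
        for line in f:
            # str.startswith accepts a tuple of candidate prefixes.
            if "=" not in line or not line.startswith(allowed_prefixes):
                continue
            key, value = line.strip().split("=", 1)
            if value:
                environ.setdefault(key, value)
    return environ


# Demo with plain dicts standing in for os.environ.
with tempfile.TemporaryDirectory() as tmp:
    env_file = os.path.join(tmp, "ENV")
    with open(env_file, "w") as f:
        f.write("CUTE_C2_DIAG_DUMP_DIR=\nCUTE_WO_SPLIT=8\nUNRELATED=1\n")
    fresh_env = load_sentinel_env(env_file, {})
    preset_env = load_sentinel_env(env_file, {"CUTE_WO_SPLIT": "1"})

print(fresh_env)
print(preset_env)
```

In the demo, the empty `CUTE_C2_DIAG_DUMP_DIR=` line and the non-allow-listed `UNRELATED=1` line are both ignored, and a preset `CUTE_WO_SPLIT` keeps its original value.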
+ +## Quality (GSM8K-50 full-think, seed=42, max_tokens=512, timeout=600s) + +| Run | Accuracy | Errors / timeouts | 50-question wall | Per-question OK median | +|---|---:|---:|---:|---:| +| **wo_split=1 baseline** | **48/50 (96.0%)** | 0 | 3760 s | 65.7 s | +| **wo_split=8 changed** | **47/50 (94.0%)** | 0 | 3664 s | 62.3 s | +| Δ | −1 question | identical | **−96 s (−2.6%)** | **−3.4 s (−5.2%)** | + +Quality parity — within ±2% noise, 0 errors both sides. + +Artifacts: `baseline_gsm8k_fullthink.json`, `changed_gsm8k_fullthink.json`. + +## Region timing (5-completion synthetic load) + +`CUTE_BETA_REGION_TIMING=1`, dumped via `scripts/trigger_region_timing_dump.sh`, +reduced via `docs/research/2026-05-02-beta-region-breakdown/extract_regions.py +--wo-split N`. + +| Region | wo_split=1 | wo_split=8 | Δ | Notes | +|---|---:|---:|---:|---| +| R2 `phase1_wo_gemv` | 14121 μs (4 active) | **2360 μs (32 active)** | **−11761 μs (5.99×)** | K-parallel split | +| R4 `grid_barrier_wait` | 15211 μs (64) | **1753 μs (64)** | **−13458 μs (8.68×)** | shrinks because R2 finishes faster | +| R11 `phase1_pre_wo_wait` | 0 μs (mask empty) | 250 μs (28 active) | +250 μs | new: bx>0 consumers spin-wait | +| R12 `phase1_gather_reduce` | 73 μs (1 elected) | 167 μs (1 elected) | +94 μs | gather of 32 partials vs 4 | +| **Cluster (R2+R4+R11+R12)** | **29405 μs** | **4530 μs** | **−24875 μs (6.49× / −84.6%)** | | + +Other regions unchanged (R0, R1, R3, R5-R10). + +R11 active CTA count = 28 = 32 W_O total − 4 attn producers (bx==0 producers skip +R11; intra-CTA ordering means their attn_output reads need no acquire fence). + +R12 is a dynamic-single-CTA region (only the elected CTA writes a tick); host +reducer uses nonzero filtering to drop the 63 zero-rows. 
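The nonzero filtering described for R12 can be sketched as follows. This is a minimal illustration, assuming the raw buffer is laid out as `(num_ctas, num_regions, 2)` entry/exit ticks and that a CTA which never wrote a tick leaves both slots at zero. The function name `reduce_region` and the exact filter condition are assumptions for illustration, not the host reducer in `region_timing.py`.

```python
from statistics import median

NUM_CTAS = 64
NUM_REGIONS = 13


def reduce_region(buf, region_id):
    """Return (n_active_ctas, median_ticks) for one region.

    buf[cta][region] is an (entry_tick, exit_tick) pair. Rows where both
    ticks are zero are treated as "CTA did not participate" and dropped.
    """
    durations = []
    for cta_rows in buf:
        entry, exit_ = cta_rows[region_id]
        if entry == 0 and exit_ == 0:
            continue  # untouched row: this CTA never wrote a tick
        durations.append(exit_ - entry)
    if not durations:
        return 0, 0.0
    return len(durations), float(median(durations))


# Synthetic buffer: only one elected CTA writes region 12.
buf = [[(0, 0) for _ in range(NUM_REGIONS)] for _ in range(NUM_CTAS)]
buf[5][12] = (1_000, 75_368)

n_active, median_ticks = reduce_region(buf, 12)
median_us = median_ticks / 1000.0  # globaltimer ticks are nanoseconds
print(n_active, median_ticks, median_us)
```

Without the filter, the 63 untouched rows would pull the median of a single-CTA region down to zero, which is why the reducer has to drop them before computing statistics.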
+ +Artifacts: +- `baseline_region_timings.npy`, `baseline_region_breakdown.csv` — wo_split=1 +- `changed_v2_region_timings.npy`, `changed_v2_region_breakdown.csv` — wo_split=8 + +## nsys total-kernel comparison + +Captured via the harness microkernel at production grid shape (slice_ctas=8, +32-CTA cooperative grid, num_kv_heads=4, hidden=5120, K=6144, NUM_THREADS=128, +NVFP4 weights). The harness microkernel reproduces the production W_O+gather +math bit-exactly (verified against `reference_split_order(wo_split=N)` — +`max_abs=0.0` at both wo_splits). vLLM V1 nsys against the EngineCore +subprocess is blocked by CUPTI injection inheritance through the multiprocess +spawn (per `feedback_vllm_profiling`); harness microkernel + production grid +is the authoritative nsys path for the W_O+gather portion of the kernel. + +| Metric | wo_split=1 | wo_split=8 | Delta | +|---|---:|---:|---:| +| Symbol | `kernel_cutlass__wo_kernel_body_________________0` | same | -- | +| 50-launch median | **13715.248 us** | **1598.064 us** | **-12117 us (-88.3% / 8.58x)** | +| 50-launch mean | 14120.510 us | 1597.994 us | -12522 us (8.84x) | +| 50-launch stddev | 2101.617 us | **5.677 us** | collapsed; high stability at wo_split=8 | +| Min / Max | 13643 / 28391 us | 1585 / 1626 us | wo_split=1 had 28391 us first-call cache-MISS outlier | +| GPU time (50 launches) | 706 ms | 80 ms | -- | +| Time fraction in trace | 92.7% | 58.8% | -- | + +`kernel_cutlass__wo_kernel_body` is the CuTe DSL emitted symbol for the W_O +microkernel body. The same kernel-body code path ships in production beta-coop +fusion (`_kernel_phase_0_to_4` in `phase_e_kernel.py`), but production emits a +different mangled symbol for the full fused kernel. 
The harness isolates the +W_O+gather portion using the same K-range slicing and slot-index formulas, +so the 8.58x harness speedup transfers to production R2 (verified via region +timing R2 = 14121 -> 2360 us = 5.99x; the gap reflects ignore-eos warm-cache +vs serving cold-launch variance). + +Files: +- `baseline.nsys-rep` -- wo_split=1 trace, 1.84 MB +- `changed.nsys-rep` -- wo_split=8 trace, 1.96 MB +- `nsys_summary.md` -- full subagent capture report + +Reproduction: +```bash +/opt/nvidia/nsight-systems/2025.6.3/bin/nsys stats \ + --report cuda_gpu_kern_sum:mangled baseline.nsys-rep \ + | grep wo_kernel_body +``` + +## Bit-exact correctness gate + +The K-parallel W_O kernel reproduces `reference_split_order(wo_split=N)` from +`docs/research/2026-05-03-w-o-k-parallel-harness/torch_reference.py` bit-exactly +at both wo_split=1 and wo_split=8 (`max_abs == 0.0`). + +Methodology: V=constant trick — set FP8 V-cache to `+1.0` (0x38) so Phase 1 +attention output is deterministically `attn_output = ones(NAT, K)`. With known +input, `wo_output[seq, 0, :]` (post-gather, written at `phase_e_kernel.py:4471` +and read by RMSNorm Pass 1 at `:4490`) equals +`reference_split_order(attn=ones, weighted, wo_split=N)`. + +Repro at `/tmp/wo_split_repro.py` (transient, not committed). Re-runs bit-exact +at any time on a warm container. 
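As a rough illustration of what the bit-exact gate depends on, the sketch below reproduces the K-range partition and slot indexing stated in the kernel comments of this change (`K_per_head = K // num_kv_heads`, `slot = by*wo_split + bx`). The function name `wo_k_slices` is hypothetical, and this is only the index arithmetic, not `reference_split_order` itself.

```python
def wo_k_slices(K, num_kv_heads, wo_split):
    """Map each W_O slot to its half-open K range [k_start, k_end)."""
    k_per_head = K // num_kv_heads
    slices = {}
    for by in range(num_kv_heads):          # KV-head index
        for bx in range(wo_split):          # K-parallel split index
            k_start_in_head = (k_per_head * bx) // wo_split
            k_end_in_head = (k_per_head * (bx + 1)) // wo_split
            slot = by * wo_split + bx
            slices[slot] = (by * k_per_head + k_start_in_head,
                            by * k_per_head + k_end_in_head)
    return slices


# Shapes quoted in the evidence: K=6144, num_kv_heads=4, wo_split=8.
slices = wo_k_slices(K=6144, num_kv_heads=4, wo_split=8)

# The slices should tile [0, K) with no gaps or overlaps.
ordered = [slices[s] for s in sorted(slices)]
contiguous = all(ordered[i][1] == ordered[i + 1][0]
                 for i in range(len(ordered) - 1))
print(len(slices), slices[0], slices[31], contiguous)
```

Because each slot owns a fixed, disjoint K range, a reference that accumulates partials in the same slot order can match the kernel's floating-point result exactly, which is the property the gate checks.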
+ +## Exact reproduction commands + +### Build +```bash +cd /home/natfii/docker/nvllm +git checkout evidence/wo-k-parallel-harness +git rev-parse HEAD # expect b3f75721d (or descendant) +docker images nvllm:gb10 +``` + +### wo_split=1 baseline GSM8K +```bash +docker stop nvllm; docker rm nvllm +NVLLM_BIND_MOUNT_CUTE_PAGED=1 \ +CUTE_PHASE_E_FUSION=1 \ +CUTE_PHASE_E_LAYERS=0,1,2,3,4,5,6,7 \ +CUTE_PHASE_E_FALLBACK_RAISE=1 \ + bash scripts/serve-cute.sh +until curl -s -f -m 3 http://localhost:8000/v1/models 2>/dev/null | grep -q '"data"'; do sleep 10; done +.venv/bin/python scripts/gsm8k_eval_50.py \ + --api http://localhost:8000/v1 --model default \ + --n 50 --seed 42 --max-tokens 512 --timeout 600 \ + --save baseline_gsm8k_fullthink.json --label task10_wo_split_1_baseline +``` + +### Region timing (wo_split=1) +```bash +docker stop nvllm; docker rm nvllm +NVLLM_BIND_MOUNT_CUTE_PAGED=1 \ +CUTE_PHASE_E_FUSION=1 \ +CUTE_PHASE_E_LAYERS=0,1,2,3,4,5,6,7 \ +CUTE_PHASE_E_FALLBACK_RAISE=1 \ +CUTE_BETA_REGION_TIMING=1 \ +VLLM_TORCH_PROFILER_DIR=/root/.cache/vllm/profiler \ + bash scripts/serve-cute.sh +for i in 1 2 3 4 5; do + curl -s -X POST http://localhost:8000/v1/completions \ + -H 'Content-Type: application/json' \ + -d '{"model":"default","prompt":"capital of france is","max_tokens":50,"temperature":0,"ignore_eos":true}' \ + > /dev/null +done +bash scripts/trigger_region_timing_dump.sh baseline_region_timings.npy +.venv/bin/python docs/research/2026-05-02-beta-region-breakdown/extract_regions.py \ + --buf baseline_region_timings.npy --kernel-mean-us 40000 \ + --slice-ctas 8 --num-k-tiles 8 --num-seqs 1 \ + --tick-source globaltimer --wo-split 1 --num-kv-heads 4 \ + --out baseline_region_breakdown.csv +``` + +### wo_split=8 changed +Prepend `CUTE_WO_SPLIT=8 \\` to the `bash scripts/serve-cute.sh` line above. The +sentinel-file workaround (commit `ef9f68028`) propagates the env var to +EngineCore. 
For the region-timing capture at wo_split=8, pass `--wo-split 8` to +extract_regions. + +### Bit-exact gate +```bash +.venv/bin/python /tmp/wo_split_repro.py --wo-split 1 --seed 4242 # PASS, max_abs=0 +.venv/bin/python /tmp/wo_split_repro.py --wo-split 8 --seed 4242 # PASS, max_abs=0 +``` + +## Caveats + +- **Region timing captured at synthetic 5-completion ignore_eos load**, not GSM8K + workload. Per-region speedups at GSM8K may differ (workload sensitivity). +- **R11 active CTA count (28) reflects bx>0 consumers only**; the host reducer's + `_phase1_wo_split_cta_ids` mask spans all 32 W_O CTAs but the kernel only + writes ticks for bx>0 — nonzero filter drops the bx==0 zeros. +- **R12 active CTA count varies (1-3 across runs)** depending on how many + concurrent decodes are in flight at the dump moment; each seq elects its own + last CTA. +- **`CUTE_WO_SPLIT` accepted values restricted to `{1, 2, 4, 8}`** (commit + `4331362e2`). The kernel logic works for arbitrary 1..slice_ctas but only the + powers-of-2 subset has reference-validation evidence. +- **wo_split=8 stays opt-in** via env var; default is 1 (no behavioral change + for production callers who don't set the var). +- **R11 / R12 are timing-instrumentation regions**, NOT production gates. The + underlying mechanisms (consumer wait, single-CTA gather) are production + behavior; the timing samples are debug-gated by `CUTE_BETA_REGION_TIMING=1`. diff --git a/docs/research/2026-05-02-beta-region-breakdown/extract_regions.py b/docs/research/2026-05-02-beta-region-breakdown/extract_regions.py index e504c4e1b760..2e82b0fc6acf 100644 --- a/docs/research/2026-05-02-beta-region-breakdown/extract_regions.py +++ b/docs/research/2026-05-02-beta-region-breakdown/extract_regions.py @@ -2,7 +2,7 @@ per-region CSV. Inputs: - - region_timings.npy: (num_ctas, 11, 2) int64 raw buffer. + - region_timings.npy: (num_ctas, 13, 2) int64 raw buffer. 
- profile_kernels.csv: per-kernel μs from docs/research/gemm_sweep/extract_e2e_kernels.py. Column name is `kernel_symbol` (NOT `Kernel Name`); rows include CuTe kernel name @@ -44,6 +44,14 @@ def main() -> None: p.add_argument("--kernel-symbol-regex", default="PhaseE_Beta_Kernel", help="Substring used to find the β-coop row in " "profile_kernels.csv.kernel_symbol") + p.add_argument("--wo-split", type=int, default=1, + help="K-parallel split factor for W_O (regions " + "{2,3,11,12}). Default 1 keeps legacy mask " + "(_phase1_cta_ids); >1 dispatches " + "_phase1_wo_split_cta_ids.") + p.add_argument("--num-kv-heads", type=int, default=4, + help="KV-head count for the W_O mask when " + "--wo-split>1. Qwen3.5-27B = 4.") args = p.parse_args() buf = np.load(args.buf) @@ -86,6 +94,8 @@ def main() -> None: num_seqs=args.num_seqs, tick_source=args.tick_source, nsys_total_us=nsys_total_us, + wo_split=args.wo_split, + num_kv_heads=args.num_kv_heads, ) df.to_csv(args.out, index=False) diff --git a/scripts/serve-cute.sh b/scripts/serve-cute.sh index 16beaa5b97e2..b3177116df89 100755 --- a/scripts/serve-cute.sh +++ b/scripts/serve-cute.sh @@ -74,6 +74,11 @@ mkdir -p /tmp/c2_diag echo "CUTE_C2_DIAG_DUMP_DIR=${CUTE_C2_DIAG_DUMP_DIR:-}" echo "CUTE_C2_DIAG_TOL_ATOL=${CUTE_C2_DIAG_TOL_ATOL:-}" echo "CUTE_C2_DIAG_TOL_RTOL=${CUTE_C2_DIAG_TOL_RTOL:-}" + # vLLM's EngineCore subprocess strips most -e env vars (per + # feedback_vllm_enginecore_env_strip). Sentinel-file workaround: + # qwen3_5.py reads /tmp/c2_diag/ENV at module import and calls + # os.environ.setdefault(). Same pattern as CUTE_C2_* above. 
+ echo "CUTE_WO_SPLIT=${CUTE_WO_SPLIT:-1}" } > /tmp/c2_diag/ENV # Optional bind-mount of the cute_paged subdir for Python-only iteration @@ -116,6 +121,7 @@ docker run -d \ -e CUTE_PHASE_E_LAYERS="${CUTE_PHASE_E_LAYERS:-}" \ -e CUTE_PHASE_E_FALLBACK_RAISE="${CUTE_PHASE_E_FALLBACK_RAISE:-0}" \ -e CUTE_BETA_REGION_TIMING="${CUTE_BETA_REGION_TIMING:-0}" \ + -e CUTE_WO_SPLIT="${CUTE_WO_SPLIT:-1}" \ -e VLLM_TORCH_PROFILER_DIR="${VLLM_TORCH_PROFILER_DIR:-}" \ "$NVLLM_IMAGE" \ serve \ diff --git a/vllm/nvllm/models/qwen3_5.py b/vllm/nvllm/models/qwen3_5.py index fcc28a13dc83..8c0788d0cfc5 100644 --- a/vllm/nvllm/models/qwen3_5.py +++ b/vllm/nvllm/models/qwen3_5.py @@ -49,7 +49,10 @@ if os.path.isfile(_C2_ENV_FILE): with open(_C2_ENV_FILE) as _c2_f: for _c2_ln in _c2_f: - if "=" in _c2_ln and _c2_ln.startswith("CUTE_C2_"): + if "=" in _c2_ln and ( + _c2_ln.startswith("CUTE_C2_") + or _c2_ln.startswith("CUTE_WO_SPLIT=") + ): _c2_k, _c2_v = _c2_ln.strip().split("=", 1) if _c2_v: # skip empty values so we don't shadow real env os.environ.setdefault(_c2_k, _c2_v) diff --git a/vllm/v1/attention/backends/cute_paged/_backend.py b/vllm/v1/attention/backends/cute_paged/_backend.py index bf07ed00e7a2..b536979634ef 100644 --- a/vllm/v1/attention/backends/cute_paged/_backend.py +++ b/vllm/v1/attention/backends/cute_paged/_backend.py @@ -60,7 +60,7 @@ _REGION_TIMING_ENABLED = ( os.environ.get("CUTE_BETA_REGION_TIMING", "0") == "1" ) -_REGION_TIMING_NUM_REGIONS = 11 +_REGION_TIMING_NUM_REGIONS = 13 # CuTe DSL disk cache — runtime hookup. Without this call, the env vars # B12X_CUTE_COMPILE_DISK_CACHE and B12X_CUTE_COMPILE_CACHE_DIR are inert @@ -911,7 +911,9 @@ def attach_mlp_fusion( # inside this `try:` so an OOM trips the except handler # that nulls _phase_e_coop_kernel. 
self._phase_e_coop_wo_output = torch.zeros( - max_num_seqs, 4, hidden_dim, + max_num_seqs, + self.num_kv_heads * self._phase_e_coop_kernel.wo_split, + hidden_dim, dtype=torch.float32, device="cuda", ) self._phase_e_coop_mlp_partial_fp32 = torch.zeros( @@ -928,6 +930,19 @@ def attach_mlp_fusion( self._phase_e_coop_phase1_arrival_count = torch.zeros( max_num_seqs, dtype=torch.int32, device="cuda", ) + # Task 6: pre-W_O arrival counter — producers (bx==0 attn + # CTAs) atomic_add 1 after attn_output is written; + # consumers (bx>0 W_O CTAs, only at wo_split>1) spin-wait + # until the counter reaches num_kv_heads. At wo_split=1 the + # consumer mask is empty and no CTA reads this counter, so + # the increment to num_kv_heads is harmless. Reset by host + # zero_() before each launch (Task 6 chose host-zero_ + # approach over kernel atomic-subtract for symmetry with + # mlp_arrival_count.zero_() that already runs at every + # launch). + self._phase_e_coop_pre_wo_arrival_count = torch.zeros( + max_num_seqs, dtype=torch.int32, device="cuda", + ) if _REGION_TIMING_ENABLED: # Per-CTA region timing scratch. Layout: # (num_ctas, num_regions, 2) u64 — entry+exit ticks. @@ -1640,6 +1655,11 @@ def _run_paged() -> torch.Tensor: mlp_arrival_count=self._phase_e_coop_mlp_arrival_count[:nat], grid_barrier_i32=self._phase_e_coop_grid_barrier_i32[:nat], phase1_arrival_count=self._phase_e_coop_phase1_arrival_count[:nat], + # Task 6: pre-W_O arrival counter (dormant at + # wo_split=1 — consumer mask `bx>0 && bxslice_ctas requires grid changes). + self.wo_split = int(os.environ.get("CUTE_WO_SPLIT", "1")) + assert self.wo_split in (1, 2, 4, 8) \ + and self.wo_split <= self.slice_ctas, ( + f"wo_split={self.wo_split} must be in {{1, 2, 4, 8}} " + f"and <= slice_ctas={self.slice_ctas}" + ) # FC2 thread mapping — same two-path choice as Phase_D (see # mlp_kernel.py:374). 
if tile_k >= self.num_threads: @@ -334,7 +355,7 @@ def __init__( ) # Number of measured regions; matches the host reducer in # vllm/v1/attention/backends/cute_paged/region_timing.py. - self._region_timing_num_regions = 11 + self._region_timing_num_regions = 13 # ----------------------------------------------------------------- # Python-level debug entry point (phase-0-only). @@ -2796,6 +2817,12 @@ def run_beta_coop_full( mlp_arrival_count: torch.Tensor, # [nat, num_k_tiles] u32 grid_barrier_i32: torch.Tensor, # [nat] i32 phase1_arrival_count: torch.Tensor, # [nat] i32 + # Task 6: pre-W_O arrival counter (i32 [nat]). Producers (bx==0 + # attn CTAs, by0 W_O CTAs at wo_split>1) spin until + # counter == num_kv_heads. Dormant at wo_split=1 (consumer mask + # `bx>0 && bx0 W_O CTAs at wo_split>1) spin until counter == num_kv_heads. + # Dormant at wo_split=1 (consumer mask `bx>0 && bx tuple: self.slice_ctas, self.num_slices, self.num_k_tiles, + self.wo_split, # K-parallel W_O split factor self.slices_per_cta, self._rows_per_thread, self._threads_per_row, @@ -3195,12 +3237,18 @@ def _jit_launch_phase_0_to_4( wo_output_ptr: Int64, wo_gs_ptr: Int64, phase1_arrival_ptr: Int64, + # Task 6: pre-W_O arrival counter (i32 [nat]). Producers (bx==0 + # attn CTAs, by0 W_O CTAs at wo_split>1) spin + # until counter == num_kv_heads. Dormant at wo_split=1 + # (consumer mask `bx>0 && bx0 W_O CTAs at + # wo_split>1) spin until counter == num_kv_heads. Dormant + # at wo_split=1 (consumer mask `bx>0 && bx1 the + # bx>0 W_O CTAs spin-acquire on the counter at kernel- + # level R11 (placed between this parent and the new + # W_O block). + # ======================================================== _threadfence() cute.arch.sync_threads() + if tid == Int32(0): + _atomic_add_u32( + pre_wo_arrival_ptr + + Int64(seq_idx * Int32(4)), + Int32(1)) + # End of attn-producer parent block. Task 8 lifted W_O + # body out to kernel-level (after R11 wait). 
- attn_base = seq_idx * num_q_heads * hd \ - + q_head_start * hd - hd_wo = Int32(self.hidden_size) - n_per_thr_wo = Int32( - self.hidden_size // self.num_threads) - my_row_base = tid * n_per_thr_wo - - wo_gs = _ld_global_f32(wo_gs_ptr) + # =================================================================== + # Task 6: R11 = pre_wo_wait. Consumer-side spin-wait + bracketing + # timing. Runs at kernel-level (OUTSIDE the bx==0 parent above). + # + # At wo_split=1 the consumer mask `bx>0 && bx1 (Task 8), the K-parallel W_O CTAs (bx∈[1, wo_split), + # by 0` (NOT `bx >= 0`) — bx==0 is the producer path + # and must not enter the consumer wait or it deadlocks. + # =================================================================== + + # bx==0 producer CTAs skip the R11 wait (consumer mask FALSE) and + # re-enter the W_O gate below: their attn_output reads are intra-CTA, + # no acquire-fence needed (only bx>0 consumers acquire below). + pre_wo_consumer_active = ( + (bx > Int32(0)) + and (bx < Int32(wo_split_const)) + and (by < Int32(self.num_kv_heads)) + ) - for _out_group in cutlass.range_constexpr( - self.hidden_size // self.num_threads // 8 - ): - out_base_wo = my_row_base \ - + Int32(_out_group * 8) + # Region 11 entry: pre_wo_wait (consumer-only mask). + if region_timing_enabled: + if pre_wo_consumer_active and tid == Int32(0): + cta_id = ( + bz * Int32(self.slice_ctas * self.num_k_tiles) + + by * Int32(self.slice_ctas) + + bx + ) + t_entry = _read_globaltimer_u64() + _st_global_u64( + region_timing_ptr + + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE) + + Int64(11 * 2 * 8) # region 11 + + Int64(0 * 8), # slot 0 = entry + t_entry, + ) - a0 = Float32(0.0) - a1 = Float32(0.0) - a2 = Float32(0.0) - a3 = Float32(0.0) - a4 = Float32(0.0) - a5 = Float32(0.0) - a6 = Float32(0.0) - a7 = Float32(0.0) + # Spin-wait for producers (consumer-only mask). At wo_split=1 the + # mask is empty so no CTA enters the loop. 
At wo_split>1 each + # consumer CTA (bx>0) loops on a volatile load until the counter + # reaches num_kv_heads (= number of producers). + if pre_wo_consumer_active: + pre_wo_arrived = Int32(0) + while pre_wo_arrived < num_kv_heads: + pre_wo_arrived = _ld_volatile_u32( + pre_wo_arrival_ptr + + Int64(seq_idx * Int32(4)) + ) + _acquire_fence() + cute.arch.sync_threads() - k_dim = group_size_p1 * hd - k_idx = Int32(0) - while k_idx < k_dim: - attn_val = _ld_global_b16_to_f32( - attn_output_ptr - + Int64((attn_base + k_idx) * Int32(2))) - abs_k = (kv_head_idx * group_size_p1 * hd - + k_idx) - k_byte = abs_k >> Int32(1) - k_is_hi = abs_k & Int32(1) - k_grp = abs_k >> Int32(4) + # Region 11 exit: pre_wo_wait (consumer-only mask). + if region_timing_enabled: + if pre_wo_consumer_active and tid == Int32(0): + cta_id = ( + bz * Int32(self.slice_ctas * self.num_k_tiles) + + by * Int32(self.slice_ctas) + + bx + ) + t_exit = _read_globaltimer_u64() + _st_global_u64( + region_timing_ptr + + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE) + + Int64(11 * 2 * 8) # region 11 + + Int64(1 * 8), # slot 1 = exit + t_exit, + ) - for _oi in cutlass.range_constexpr(8): - out_row = out_base_wo + Int32(_oi) - if out_row < hd_wo: - w_addr = wo_weight_ptr + Int64( - out_row * wo_weight_row_stride - + k_byte) - aligned = w_addr & Int64( - 0xFFFFFFFFFFFFFFFC) - raw = _ld_global_b32(aligned) - bpos = Int32(w_addr & Int64(3)) - the_byte = _extract_byte_from_b32( - raw, bpos) - nib_shift = k_is_hi << Int32(2) - nib = (the_byte >> nib_shift) \ - & Int32(0x0F) - w_f32 = _fp4_nibble_to_f32(nib) - sf = _ld_swizzled_scale( - wo_scale_ptr, out_row, k_grp, - wo_num_k_tiles) - w_dequant = w_f32 * sf * wo_gs + # =================================================================== + # Task 8: K-parallel W_O GEMV body (lifted out of attn-producer + # parent). Mask: `bx < wo_split && by < num_kv_heads`. 
+ # + # At wo_split_const=1, mask `bx<1 && by<4` ≡ `bx==0 && by<4` — + # the legacy single-CTA-per-KV-head behavior is bit-exact preserved. + # + # At wo_split_const>1, mask spans `bx ∈ [0, wo_split)` × `by ∈ [0, + # num_kv_heads)`, giving total_wo_slots = num_kv_heads * wo_split + # K-parallel partial-GEMV CTAs. Each CTA owns a 1/wo_split slice + # of one KV-head's K range, accumulates a chained-FMA partial, + # and writes to wo_output[seq, slot, :] where slot = by*wo_split+bx + # (matches torch_reference.py:438-439, microkernel.py:225-227). + # + # K-range slicing follows torch_reference.py:443-446 EXACTLY: + # K_per_head = K // num_kv_heads + # k_start_in_head = (K_per_head * bx) // wo_split + # k_end_in_head = (K_per_head * (bx + 1)) // wo_split + # k_start = by * K_per_head + k_start_in_head + # k_end = by * K_per_head + k_end_in_head + # =================================================================== + if (bx < Int32(wo_split_const)) and (by < Int32(self.num_kv_heads)): + # Region 2 entry: Phase 1 W_O GEMV body (now at kernel-level). + if region_timing_enabled: + if tid == Int32(0): + cta_id = ( + bz * Int32(self.slice_ctas * self.num_k_tiles) + + by * Int32(self.slice_ctas) + + bx + ) + t_entry = _read_globaltimer_u64() + _st_global_u64( + region_timing_ptr + + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE) + + Int64(2 * 2 * 8) # region 2 + + Int64(0 * 8), # slot 0 = entry + t_entry, + ) - if _oi == 0: - a0 = a0 + w_dequant * attn_val - if _oi == 1: - a1 = a1 + w_dequant * attn_val - if _oi == 2: - a2 = a2 + w_dequant * attn_val - if _oi == 3: - a3 = a3 + w_dequant * attn_val - if _oi == 4: - a4 = a4 + w_dequant * attn_val - if _oi == 5: - a5 = a5 + w_dequant * attn_val - if _oi == 6: - a6 = a6 + w_dequant * attn_val - if _oi == 7: - a7 = a7 + w_dequant * attn_val + _threadfence() + cute.arch.sync_threads() - k_idx = k_idx + Int32(1) + # --- W_O body locals (recomputed; no longer inherited from + # attn-producer parent). 
+ kv_head_idx_wo = by + group_size_p1_wo = num_q_heads // num_kv_heads + q_head_start_wo = kv_head_idx_wo * group_size_p1_wo + hd_wo_dim = Int32(self.head_dim) + attn_base_wo = (seq_idx * num_q_heads * hd_wo_dim + + q_head_start_wo * hd_wo_dim) + + # K_per_head as constexpr (= num_q_heads*head_dim/num_kv_heads). + # The runtime form `group_size_p1_wo * hd_wo_dim` agrees on + # non-negative operands. Use runtime form to honor the + # kernel's existing dynamic shape contract. + k_per_head = group_size_p1_wo * hd_wo_dim + # K-range slicing (Python integer-divide; PTX i32 sdiv agrees + # for non-negative operands). + k_start_in_head = (k_per_head * bx) // Int32(wo_split_const) + k_end_in_head = (k_per_head * (bx + Int32(1))) // Int32(wo_split_const) + + hd_wo = Int32(self.hidden_size) + n_per_thr_wo = Int32( + self.hidden_size // self.num_threads) + my_row_base = tid * n_per_thr_wo + + wo_gs = _ld_global_f32(wo_gs_ptr) + + for _out_group in cutlass.range_constexpr( + self.hidden_size // self.num_threads // 8 + ): + out_base_wo = my_row_base \ + + Int32(_out_group * 8) + + a0 = Float32(0.0) + a1 = Float32(0.0) + a2 = Float32(0.0) + a3 = Float32(0.0) + a4 = Float32(0.0) + a5 = Float32(0.0) + a6 = Float32(0.0) + a7 = Float32(0.0) + + # K-loop iterates over [k_start_in_head, k_end_in_head) + # local-in-head. abs_k = kv_head_idx*K_per_head + k_idx + # (matches the attn_output layout: K dim is + # num_kv_heads-major → K_per_head-minor). 
+ k_idx = k_start_in_head + while k_idx < k_end_in_head: + attn_val = _ld_global_b16_to_f32( + attn_output_ptr + + Int64((attn_base_wo + k_idx) * Int32(2))) + abs_k = (kv_head_idx_wo * k_per_head + + k_idx) + k_byte = abs_k >> Int32(1) + k_is_hi = abs_k & Int32(1) + k_grp = abs_k >> Int32(4) - cta_idx = bx * num_kv_heads + by - wo_slot_base = wo_output_ptr + Int64( - (seq_idx * total_ctas_per_seq_attn + cta_idx) - * hd_wo * Int32(4)) for _oi in cutlass.range_constexpr(8): out_row = out_base_wo + Int32(_oi) if out_row < hd_wo: + w_addr = wo_weight_ptr + Int64( + out_row * wo_weight_row_stride + + k_byte) + aligned = w_addr & Int64( + 0xFFFFFFFFFFFFFFFC) + raw = _ld_global_b32(aligned) + bpos = Int32(w_addr & Int64(3)) + the_byte = _extract_byte_from_b32( + raw, bpos) + nib_shift = k_is_hi << Int32(2) + nib = (the_byte >> nib_shift) \ + & Int32(0x0F) + w_f32 = _fp4_nibble_to_f32(nib) + sf = _ld_swizzled_scale( + wo_scale_ptr, out_row, k_grp, + wo_num_k_tiles) + w_dequant = w_f32 * sf * wo_gs + if _oi == 0: - _st_global_f32( - wo_slot_base + Int64( - out_row * Int32(4)), a0) + a0 = a0 + w_dequant * attn_val if _oi == 1: - _st_global_f32( - wo_slot_base + Int64( - out_row * Int32(4)), a1) + a1 = a1 + w_dequant * attn_val if _oi == 2: - _st_global_f32( - wo_slot_base + Int64( - out_row * Int32(4)), a2) + a2 = a2 + w_dequant * attn_val if _oi == 3: - _st_global_f32( - wo_slot_base + Int64( - out_row * Int32(4)), a3) + a3 = a3 + w_dequant * attn_val if _oi == 4: - _st_global_f32( - wo_slot_base + Int64( - out_row * Int32(4)), a4) + a4 = a4 + w_dequant * attn_val if _oi == 5: - _st_global_f32( - wo_slot_base + Int64( - out_row * Int32(4)), a5) + a5 = a5 + w_dequant * attn_val if _oi == 6: - _st_global_f32( - wo_slot_base + Int64( - out_row * Int32(4)), a6) + a6 = a6 + w_dequant * attn_val if _oi == 7: - _st_global_f32( - wo_slot_base + Int64( - out_row * Int32(4)), a7) + a7 = a7 + w_dequant * attn_val + + k_idx = k_idx + Int32(1) + + # Slot index: slot = by * wo_split + bx 
(matches + # torch_reference.py:438-439 — slot_id // wo_split == by, + # slot_id % wo_split == bx). + slot_idx = by * Int32(wo_split_const) + bx + wo_slot_base = wo_output_ptr + Int64( + (seq_idx * total_wo_slots + slot_idx) + * hd_wo * Int32(4)) + for _oi in cutlass.range_constexpr(8): + out_row = out_base_wo + Int32(_oi) + if out_row < hd_wo: + if _oi == 0: + _st_global_f32( + wo_slot_base + Int64( + out_row * Int32(4)), a0) + if _oi == 1: + _st_global_f32( + wo_slot_base + Int64( + out_row * Int32(4)), a1) + if _oi == 2: + _st_global_f32( + wo_slot_base + Int64( + out_row * Int32(4)), a2) + if _oi == 3: + _st_global_f32( + wo_slot_base + Int64( + out_row * Int32(4)), a3) + if _oi == 4: + _st_global_f32( + wo_slot_base + Int64( + out_row * Int32(4)), a4) + if _oi == 5: + _st_global_f32( + wo_slot_base + Int64( + out_row * Int32(4)), a5) + if _oi == 6: + _st_global_f32( + wo_slot_base + Int64( + out_row * Int32(4)), a6) + if _oi == 7: + _st_global_f32( + wo_slot_base + Int64( + out_row * Int32(4)), a7) - # === Phase B.5 + C: last-CTA gather + RMSNorm === - _threadfence() + # === Phase B.5 + C: last-CTA gather + RMSNorm === + _threadfence() - # Region 2 exit: Phase 1 W_O GEMV body. Recorded just - # after the W_O writes are published by _threadfence. - # We are inside the bx==0 && by<4 block, gate on tid==0. - if region_timing_enabled: - if tid == Int32(0): - cta_id = ( - bz * Int32(self.slice_ctas * self.num_k_tiles) - + by * Int32(self.slice_ctas) - + bx - ) - t_exit = _read_globaltimer_u64() - _st_global_u64( - region_timing_ptr - + Int64(cta_id) * Int64(11 * 2 * 8) - + Int64(2 * 2 * 8) # region 2 - + Int64(1 * 8), # slot 1 = exit - t_exit, - ) + # Region 2 exit: Phase 1 W_O GEMV body. Recorded just after + # the W_O writes are published by _threadfence. 
+ if region_timing_enabled: + if tid == Int32(0): + cta_id = ( + bz * Int32(self.slice_ctas * self.num_k_tiles) + + by * Int32(self.slice_ctas) + + bx + ) + t_exit = _read_globaltimer_u64() + _st_global_u64( + region_timing_ptr + + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE) + + Int64(2 * 2 * 8) # region 2 + + Int64(1 * 8), # slot 1 = exit + t_exit, + ) + + # Region 3 entry: W_O end → barrier-arrive (cleanup). + if region_timing_enabled: + if tid == Int32(0): + cta_id = ( + bz * Int32(self.slice_ctas * self.num_k_tiles) + + by * Int32(self.slice_ctas) + + bx + ) + t_entry = _read_globaltimer_u64() + _st_global_u64( + region_timing_ptr + + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE) + + Int64(3 * 2 * 8) # region 3 + + Int64(0 * 8), # slot 0 = entry + t_entry, + ) + + # Region 3 exit: just before _atomic_add_u32 to + # phase1_arrival_count (the per-CTA barrier-arrive signal). + if region_timing_enabled: + if tid == Int32(0): + cta_id = ( + bz * Int32(self.slice_ctas * self.num_k_tiles) + + by * Int32(self.slice_ctas) + + bx + ) + t_exit = _read_globaltimer_u64() + _st_global_u64( + region_timing_ptr + + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE) + + Int64(3 * 2 * 8) # region 3 + + Int64(1 * 8), # slot 1 = exit + t_exit, + ) + + # Election: each W_O CTA's tid==0 atomic-adds 1; the writer + # of (total_wo_slots-1) wins and broadcasts is_last_cta=1 + # via SMEM. Other CTAs broadcast 0. + if tid == Int32(0): + old_count = _atomic_add_u32( + phase1_arrival_ptr + + Int64(seq_idx * Int32(4)), + Int32(1)) + if old_count == total_wo_slots - Int32(1): + _st_shared_f32(sync_md, Float32(1.0)) + else: + _st_shared_f32(sync_md, Float32(0.0)) + cute.arch.sync_threads() - # Region 3 entry: W_O end → barrier-arrive (cleanup). - # Same site as region 2 exit (W_O sync), but recorded - # as a separate block to keep the buffer boundary clean. 
+ is_last_cta = _ld_shared_f32(sync_md) + + if is_last_cta > Float32(0.5): + # Region 12 entry: gather_reduce (elected single-CTA only). if region_timing_enabled: if tid == Int32(0): cta_id = ( @@ -4175,15 +4433,130 @@ def _kernel_phase_0_to_4( t_entry = _read_globaltimer_u64() _st_global_u64( region_timing_ptr - + Int64(cta_id) * Int64(11 * 2 * 8) - + Int64(3 * 2 * 8) # region 3 + + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE) + + Int64(12 * 2 * 8) # region 12 + Int64(0 * 8), # slot 0 = entry t_entry, ) - # Region 3 exit: just before the _atomic_add_u32 to - # phase1_arrival_count (the per-CTA barrier-arrive - # signal). We are inside bx==0 && by<4, gate on tid==0. + hd_c = hidden_dim + n_per_thr_c = hd_c // Int32(128) + + res_base_c = residual_in_ptr + Int64( + seq_idx * hd_c * Int32(2)) + wo_base_c = wo_output_ptr + Int64( + seq_idx * total_wo_slots + * hd_c * Int32(4)) + gamma_base_c = post_attn_gamma_ptr + out_base_c = attn_output_ptr + Int64( + seq_idx * hd_c * Int32(2)) + resout_base_c = residual_output_ptr + Int64( + seq_idx * hd_c * Int32(2)) + + my_start_c = tid * n_per_thr_c + + # Phase B.5: gather per-CTA slots into slot 0. 
+ for _grp in cutlass.range_constexpr( + self.hidden_size // self.num_threads // 8 + ): + for _ei in cutlass.range_constexpr(8): + idx_c = my_start_c + Int32(_grp * 8 + _ei) + gather_acc = Float32(0.0) + cta_i = Int32(0) + while cta_i < total_wo_slots: + slot_addr = wo_output_ptr + Int64( + (seq_idx * total_wo_slots + + cta_i) + * hd_c * Int32(4) + + idx_c * Int32(4)) + gather_acc = gather_acc \ + + _ld_global_f32(slot_addr) + cta_i = cta_i + Int32(1) + _st_global_f32( + wo_base_c + + Int64(idx_c * Int32(4)), + gather_acc, + ) + _threadfence() + cute.arch.sync_threads() + + # Pass 1: residual add + sum-of-squares + ss = Float32(0.0) + for _grp in cutlass.range_constexpr( + self.hidden_size // self.num_threads // 8 + ): + base_idx = my_start_c + Int32(_grp * 8) + for _ei in cutlass.range_constexpr(8): + idx_c = base_idx + Int32(_ei) + res_f32 = _ld_global_b16_to_f32( + res_base_c + + Int64(idx_c * Int32(2))) + wo_f32 = _ld_global_f32( + wo_base_c + + Int64(idx_c * Int32(4))) + nr = res_f32 + wo_f32 + ss = ss + nr * nr + + ss = ss + shfl_xor_sync(ss, Int32(1)) + ss = ss + shfl_xor_sync(ss, Int32(2)) + ss = ss + shfl_xor_sync(ss, Int32(4)) + ss = ss + shfl_xor_sync(ss, Int32(8)) + ss = ss + shfl_xor_sync(ss, Int32(16)) + + if lane == Int32(0): + _st_shared_f32( + sync_md + Int64(warp * Int32(4)), ss) + cute.arch.sync_threads() + + if warp == Int32(0): + if lane == Int32(0): + total_ss = _ld_shared_f32(sync_md) + total_ss = total_ss + _ld_shared_f32( + sync_md + Int64(4)) + total_ss = total_ss + _ld_shared_f32( + sync_md + Int64(8)) + total_ss = total_ss + _ld_shared_f32( + sync_md + Int64(12)) + variance = total_ss / Float32(hd_c) + inv_rms = _rsqrt_approx_f32( + variance + rms_eps) + _st_shared_f32(sync_md, inv_rms) + cute.arch.sync_threads() + + inv_rms_val = _ld_shared_f32(sync_md) + + # Pass 3: re-read, scale, write BF16 output + for _grp in cutlass.range_constexpr( + self.hidden_size // self.num_threads // 8 + ): + base_idx = my_start_c + Int32(_grp * 8) + for 
_oi in cutlass.range_constexpr(8): + idx_c = base_idx + Int32(_oi) + res_f32 = _ld_global_b16_to_f32( + res_base_c + + Int64(idx_c * Int32(2))) + wo_f32 = _ld_global_f32( + wo_base_c + + Int64(idx_c * Int32(4))) + new_res = res_f32 + wo_f32 + + gamma_f32 = _ld_global_b16_to_f32( + gamma_base_c + + Int64(idx_c * Int32(2))) + # Qwen3_5RMSNorm uses x * (1 + γ) — see vllm/nvllm/layers/layernorm.py:78 + hidden_val = new_res * inv_rms_val \ + * (Float32(1.0) + gamma_f32) + + _st_global_bf16_from_f32( + out_base_c + + Int64(idx_c * Int32(2)), + hidden_val) + _st_global_bf16_from_f32( + resout_base_c + + Int64(idx_c * Int32(2)), + new_res) + + # Region 12 exit: gather_reduce. if region_timing_enabled: if tid == Int32(0): cta_id = ( @@ -4194,149 +4567,18 @@ def _kernel_phase_0_to_4( t_exit = _read_globaltimer_u64() _st_global_u64( region_timing_ptr - + Int64(cta_id) * Int64(11 * 2 * 8) - + Int64(3 * 2 * 8) # region 3 + + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE) + + Int64(12 * 2 * 8) + Int64(1 * 8), # slot 1 = exit t_exit, ) + # Reset arrival counter for next call. if tid == Int32(0): - old_count = _atomic_add_u32( + _atomic_add_u32( phase1_arrival_ptr + Int64(seq_idx * Int32(4)), - Int32(1)) - if old_count == total_ctas_per_seq_attn - Int32(1): - _st_shared_f32(sync_md, Float32(1.0)) - else: - _st_shared_f32(sync_md, Float32(0.0)) - cute.arch.sync_threads() - - is_last_cta = _ld_shared_f32(sync_md) - - if is_last_cta > Float32(0.5): - hd_c = hidden_dim - n_per_thr_c = hd_c // Int32(128) - - res_base_c = residual_in_ptr + Int64( - seq_idx * hd_c * Int32(2)) - wo_base_c = wo_output_ptr + Int64( - seq_idx * total_ctas_per_seq_attn - * hd_c * Int32(4)) - gamma_base_c = post_attn_gamma_ptr - out_base_c = attn_output_ptr + Int64( - seq_idx * hd_c * Int32(2)) - resout_base_c = residual_output_ptr + Int64( - seq_idx * hd_c * Int32(2)) - - my_start_c = tid * n_per_thr_c - - # Phase B.5: gather per-CTA slots into slot 0. 
- for _grp in cutlass.range_constexpr( - self.hidden_size // self.num_threads // 8 - ): - for _ei in cutlass.range_constexpr(8): - idx_c = my_start_c + Int32(_grp * 8 + _ei) - gather_acc = Float32(0.0) - cta_i = Int32(0) - while cta_i < total_ctas_per_seq_attn: - slot_addr = wo_output_ptr + Int64( - (seq_idx * total_ctas_per_seq_attn - + cta_i) - * hd_c * Int32(4) - + idx_c * Int32(4)) - gather_acc = gather_acc \ - + _ld_global_f32(slot_addr) - cta_i = cta_i + Int32(1) - _st_global_f32( - wo_base_c - + Int64(idx_c * Int32(4)), - gather_acc, - ) - _threadfence() - cute.arch.sync_threads() - - # Pass 1: residual add + sum-of-squares - ss = Float32(0.0) - for _grp in cutlass.range_constexpr( - self.hidden_size // self.num_threads // 8 - ): - base_idx = my_start_c + Int32(_grp * 8) - for _ei in cutlass.range_constexpr(8): - idx_c = base_idx + Int32(_ei) - res_f32 = _ld_global_b16_to_f32( - res_base_c - + Int64(idx_c * Int32(2))) - wo_f32 = _ld_global_f32( - wo_base_c - + Int64(idx_c * Int32(4))) - nr = res_f32 + wo_f32 - ss = ss + nr * nr - - ss = ss + shfl_xor_sync(ss, Int32(1)) - ss = ss + shfl_xor_sync(ss, Int32(2)) - ss = ss + shfl_xor_sync(ss, Int32(4)) - ss = ss + shfl_xor_sync(ss, Int32(8)) - ss = ss + shfl_xor_sync(ss, Int32(16)) - - if lane == Int32(0): - _st_shared_f32( - sync_md + Int64(warp * Int32(4)), ss) - cute.arch.sync_threads() - - if warp == Int32(0): - if lane == Int32(0): - total_ss = _ld_shared_f32(sync_md) - total_ss = total_ss + _ld_shared_f32( - sync_md + Int64(4)) - total_ss = total_ss + _ld_shared_f32( - sync_md + Int64(8)) - total_ss = total_ss + _ld_shared_f32( - sync_md + Int64(12)) - variance = total_ss / Float32(hd_c) - inv_rms = _rsqrt_approx_f32( - variance + rms_eps) - _st_shared_f32(sync_md, inv_rms) - cute.arch.sync_threads() - - inv_rms_val = _ld_shared_f32(sync_md) - - # Pass 3: re-read, scale, write BF16 output - for _grp in cutlass.range_constexpr( - self.hidden_size // self.num_threads // 8 - ): - base_idx = my_start_c + 
Int32(_grp * 8) - for _oi in cutlass.range_constexpr(8): - idx_c = base_idx + Int32(_oi) - res_f32 = _ld_global_b16_to_f32( - res_base_c - + Int64(idx_c * Int32(2))) - wo_f32 = _ld_global_f32( - wo_base_c - + Int64(idx_c * Int32(4))) - new_res = res_f32 + wo_f32 - - gamma_f32 = _ld_global_b16_to_f32( - gamma_base_c - + Int64(idx_c * Int32(2))) - # Qwen3_5RMSNorm uses x * (1 + γ) — see vllm/nvllm/layers/layernorm.py:78 - hidden_val = new_res * inv_rms_val \ - * (Float32(1.0) + gamma_f32) - - _st_global_bf16_from_f32( - out_base_c - + Int64(idx_c * Int32(2)), - hidden_val) - _st_global_bf16_from_f32( - resout_base_c - + Int64(idx_c * Int32(2)), - new_res) - - # Reset arrival counter for next call. - if tid == Int32(0): - _atomic_add_u32( - phase1_arrival_ptr - + Int64(seq_idx * Int32(4)), - Int32(0) - total_ctas_per_seq_attn) + Int32(0) - total_wo_slots) # Region 4 entry: grid barrier wait (all 64 CTAs participate). # Entry tick recorded at the moment a CTA arrives at the @@ -4353,7 +4595,7 @@ def _kernel_phase_0_to_4( t_entry = _read_globaltimer_u64() _st_global_u64( region_timing_ptr - + Int64(cta_id) * Int64(11 * 2 * 8) + + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE) + Int64(4 * 2 * 8) # region 4 + Int64(0 * 8), # slot 0 = entry t_entry, @@ -4395,7 +4637,7 @@ def _kernel_phase_0_to_4( t_exit = _read_globaltimer_u64() _st_global_u64( region_timing_ptr - + Int64(cta_id) * Int64(11 * 2 * 8) + + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE) + Int64(4 * 2 * 8) # region 4 + Int64(1 * 8), # slot 1 = exit t_exit, @@ -4417,7 +4659,7 @@ def _kernel_phase_0_to_4( t_entry = _read_globaltimer_u64() _st_global_u64( region_timing_ptr - + Int64(cta_id) * Int64(11 * 2 * 8) + + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE) + Int64(5 * 2 * 8) # region 5 + Int64(0 * 8), # slot 0 = entry t_entry, @@ -4492,7 +4734,7 @@ def _kernel_phase_0_to_4( t_exit = _read_globaltimer_u64() _st_global_u64( region_timing_ptr - + Int64(cta_id) * Int64(11 * 2 * 8) + + 
Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE) + Int64(5 * 2 * 8) # region 5 + Int64(1 * 8), # slot 1 = exit t_exit, @@ -4511,7 +4753,7 @@ def _kernel_phase_0_to_4( t_entry = _read_globaltimer_u64() _st_global_u64( region_timing_ptr - + Int64(cta_id) * Int64(11 * 2 * 8) + + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE) + Int64(6 * 2 * 8) # region 6 + Int64(0 * 8), # slot 0 = entry t_entry, @@ -4580,7 +4822,7 @@ def _kernel_phase_0_to_4( t_exit = _read_globaltimer_u64() _st_global_u64( region_timing_ptr - + Int64(cta_id) * Int64(11 * 2 * 8) + + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE) + Int64(6 * 2 * 8) # region 6 + Int64(1 * 8), # slot 1 = exit t_exit, @@ -4602,7 +4844,7 @@ def _kernel_phase_0_to_4( t_entry = _read_globaltimer_u64() _st_global_u64( region_timing_ptr - + Int64(cta_id) * Int64(11 * 2 * 8) + + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE) + Int64(7 * 2 * 8) # region 7 + Int64(0 * 8), # slot 0 = entry t_entry, @@ -4741,7 +4983,7 @@ def _kernel_phase_0_to_4( t_exit = _read_globaltimer_u64() _st_global_u64( region_timing_ptr - + Int64(cta_id) * Int64(11 * 2 * 8) + + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE) + Int64(7 * 2 * 8) # region 7 + Int64(1 * 8), # slot 1 = exit t_exit, @@ -4760,7 +5002,7 @@ def _kernel_phase_0_to_4( t_entry = _read_globaltimer_u64() _st_global_u64( region_timing_ptr - + Int64(cta_id) * Int64(11 * 2 * 8) + + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE) + Int64(8 * 2 * 8) # region 8 + Int64(0 * 8), # slot 0 = entry t_entry, @@ -4873,7 +5115,7 @@ def _kernel_phase_0_to_4( t_exit = _read_globaltimer_u64() _st_global_u64( region_timing_ptr - + Int64(cta_id) * Int64(11 * 2 * 8) + + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE) + Int64(8 * 2 * 8) # region 8 + Int64(1 * 8), # slot 1 = exit t_exit, @@ -4892,7 +5134,7 @@ def _kernel_phase_0_to_4( t_entry = _read_globaltimer_u64() _st_global_u64( region_timing_ptr - + Int64(cta_id) * Int64(11 * 2 * 8) + + Int64(cta_id) * 
Int64(_REGION_TIMING_PER_CTA_STRIDE) + Int64(9 * 2 * 8) # region 9 + Int64(0 * 8), # slot 0 = entry t_entry, @@ -5092,7 +5334,7 @@ def _kernel_phase_0_to_4( t_exit = _read_globaltimer_u64() _st_global_u64( region_timing_ptr - + Int64(cta_id) * Int64(11 * 2 * 8) + + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE) + Int64(9 * 2 * 8) # region 9 + Int64(1 * 8), # slot 1 = exit t_exit, @@ -5109,7 +5351,7 @@ def _kernel_phase_0_to_4( t_entry = _read_globaltimer_u64() _st_global_u64( region_timing_ptr - + Int64(cta_id) * Int64(11 * 2 * 8) + + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE) + Int64(10 * 2 * 8) # region 10 + Int64(0 * 8), # slot 0 = entry t_entry, @@ -5208,7 +5450,7 @@ def _kernel_phase_0_to_4( t_exit = _read_globaltimer_u64() _st_global_u64( region_timing_ptr - + Int64(cta_id) * Int64(11 * 2 * 8) + + Int64(cta_id) * Int64(_REGION_TIMING_PER_CTA_STRIDE) + Int64(10 * 2 * 8) # region 10 + Int64(1 * 8), # slot 1 = exit t_exit, diff --git a/vllm/v1/attention/backends/cute_paged/region_timing.py b/vllm/v1/attention/backends/cute_paged/region_timing.py index b556732b07f4..a956431f50a6 100644 --- a/vllm/v1/attention/backends/cute_paged/region_timing.py +++ b/vllm/v1/attention/backends/cute_paged/region_timing.py @@ -44,10 +44,13 @@ "phase3_3b_quant", "phase3_3c_fc2_atomic", "phase3_3d_arrival", + "phase1_pre_wo_wait", # NEW R11: bx>0 W_O CTAs wait for attn producers + "phase1_gather_reduce", # NEW R12: last-CTA gather of total_wo_slots partials ] PHASE0_REGIONS = {0} # single CTA per seq PHASE1_REGIONS = {1, 2, 3} # 4 CTAs per seq (bx==0, by<4) -WAIT_NOT_WORK_REGIONS = {4} # all CTAs but reported separately +WAIT_NOT_WORK_REGIONS = {4, 11} # R4 grid barrier + R11 pre-W_O wait +DYNAMIC_SINGLE_CTA_REGIONS = {12} # R12 elected gather/reduce PHASE3_REGIONS = {5, 6, 7, 8, 9, 10} # all CTAs @@ -69,12 +72,31 @@ def _phase1_cta_ids(slice_ctas: int, num_k_tiles: int, num_seqs: int) -> np.ndar return np.array(out, dtype=np.int64) +def _phase1_wo_split_cta_ids( + 
slice_ctas: int, + num_k_tiles: int, + num_seqs: int, + wo_split: int, + num_kv_heads: int, +) -> np.ndarray: + """W_O active CTAs with K-parallel split: bx pd.DataFrame: - """Reduce a (num_ctas, 11, 2) tick buffer to per-region rows. + """Reduce a (num_ctas, 13, 2) tick buffer to per-region rows. Active-CTA masks are derived from (slice_ctas, num_k_tiles, num_seqs) so callers do NOT pass a "num_attn_active_ctas" count — that count is wrong for Phase 0 (1 CTA/seq) vs Phase 1 (4 CTAs/seq) which the earlier draft conflated as "32". + When wo_split > 1, regions {2, 3, 11, 12} are masked using the + K-parallel W_O active-CTA layout (bx 1 AND caller + # supplied num_kv_heads. Used for R2/R3/R11/R12. + if wo_split > 1: + assert num_kv_heads > 0, ( + "wo_split>1 requires num_kv_heads>0 for the K-parallel mask" + ) + wo_split_ids = _phase1_wo_split_cta_ids( + slice_ctas, num_k_tiles, num_seqs, wo_split, num_kv_heads, + ) + else: + wo_split_ids = None rows: list[RegionRow] = [] for r in range(num_regions): @@ -134,11 +174,29 @@ def reduce_region_timings( active_ids = p0_ids cta_class = "phase0" elif r in PHASE1_REGIONS: - active_ids = p1_ids + # When wo_split>1, R2/R3 are the W_O GEMV/post regions and + # use the K-parallel mask. R1 (phase1_attn_pre_wo) is still + # the bx==0 && by<4 set so it stays on p1_ids. + if wo_split_ids is not None and r in (2, 3): + active_ids = wo_split_ids + else: + active_ids = p1_ids cta_class = "phase1" elif r in WAIT_NOT_WORK_REGIONS: - active_ids = all_ids + # R11 (phase1_pre_wo_wait) uses the K-parallel mask when + # wo_split>1 — it's the consumer wait for bx>0 W_O CTAs. + # R4 (grid_barrier_wait) stays on all_ids. + if wo_split_ids is not None and r == 11: + active_ids = wo_split_ids + else: + active_ids = all_ids cta_class = "barrier_wait" + elif r in DYNAMIC_SINGLE_CTA_REGIONS: + # R12 (phase1_gather_reduce) is the elected single-CTA + # gather. Even with wo_split>1 only one CTA writes a tick; + # nonzero filter handles it. 
Mask is all_ids.
+            active_ids = all_ids
+            cta_class = "dynamic_single"
         else:
             active_ids = all_ids
             cta_class = "phase3"
@@ -170,7 +228,7 @@ def reduce_region_timings(
         # source. For globaltimer that's *1000 (ns/μs); for clock64 we
         # cannot convert, so frac is reported as NaN unless caller
         # passes a clock64-calibrated total (not in the v1 reducer API).
-        if r in WAIT_NOT_WORK_REGIONS:
+        if r in WAIT_NOT_WORK_REGIONS or r in DYNAMIC_SINGLE_CTA_REGIONS:
             frac = float("nan")
         elif nsys_total_us is None:
             frac = float("nan")