Systematic AI-driven GPU kernel optimization — Using Claude Code + structured playbooks to perform phased, traceable performance tuning on the aiter operator library. Target hardware: AMD Instinct MI300X (gfx942).
GPU kernel optimization is a critical bottleneck in LLM inference, yet traditional approaches rely on expert experience, involve long iteration cycles, and are hard to reproduce. We observed that:
- Optimization follows a repeatable pattern: bottleneck classification → strategy matching → layered implementation → validation & recording
- AI excels at pattern matching and code analysis, but requires structured constraints to produce reliable results
- Lessons from each optimization can be accumulated, making subsequent optimizations faster
kernel-forge encodes kernel optimization expertise into Claude-executable playbooks, enabling AI to complete the full pipeline from bottleneck diagnosis to code modification under human supervision.
```mermaid
flowchart LR
subgraph Input
A[User specifies operator]
end
subgraph "Phase 1 — Triage"
B{Implementation\nclassification}
B -->|ASM| B1[asm.md]
B -->|CK| B2[ck.md]
B -->|Triton| B3[triton.md]
B -->|HIP| B4[hip.md]
end
subgraph "Phase 2 — Measure"
C[Baseline measurement\nrocprof + roofline]
end
subgraph "Phase 3 — Optimize"
D[Bottleneck classification]
D -->|Memory| D1[Vectorization / occupancy]
D -->|Compute| D2[Tile / ILP]
D -->|Sync| D3[Fence removal / persistence]
E[Layered optimization\nL1→L2→L3→L4\nOne axis at a time]
end
subgraph "Phase 4 — Validate"
F[Correctness check ✓\nPerformance comparison ✓]
end
subgraph "Phase 5 — Record"
G[notes.md\nv1_tag.cu]
end
A --> B
B1 & B2 & B3 & B4 --> C
C --> D
D1 & D2 & D3 --> E
E --> F
F --> G
```
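Phase 3's bottleneck classification can be sketched as a roofline-style decision plus a sync check. A minimal sketch, assuming illustrative peak numbers (the `peak_flops` and `peak_bw` defaults below are placeholders, not authoritative MI300X specs; the real reference values live in reference.md):

```python
# Hedged sketch: classify a kernel as compute-, memory-, or sync-bound
# from basic counters. Peak numbers are illustrative placeholders.

def classify_bottleneck(flops, bytes_moved, time_s,
                        peak_flops=1.3e15, peak_bw=5.3e12,
                        busy_fraction=None):
    """Return 'compute', 'memory', or 'sync'.

    flops / bytes_moved: work performed by the kernel
    time_s: measured kernel time
    busy_fraction: fraction of time wavefronts were NOT stalled on
        barriers/fences (from profiler counters). If it is low, the
        kernel is sync-bound regardless of its roofline position.
    """
    if busy_fraction is not None and busy_fraction < 0.5:
        return "sync"  # invisible on a pure roofline plot
    achieved_flops = flops / time_s
    achieved_bw = bytes_moved / time_s
    # Whichever resource sits closer to its peak is the limiter.
    if achieved_flops / peak_flops >= achieved_bw / peak_bw:
        return "compute"
    return "memory"
```

This mirrors why sync-bound kernels need the extra counter: a kernel stalled on fences can sit far below both roofline ceilings and still be neither compute- nor memory-limited.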
Claude's role at each phase:
| Phase | Claude does | Human does |
|---|---|---|
| Triage | Automatically classify implementation, check optimization history | Confirm classification, select target |
| Measure | Generate benchmark scripts, analyze rocprof data | Execute on GPU environment |
| Optimize | Read playbook, propose code changes | Review changes, decide whether to adopt |
| Validate | Generate tests, analyze before/after | Execute on GPU environment |
| Record | Auto-populate notes.md | Review record accuracy |
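The Measure phase's benchmark scripts follow a fixed discipline: warm up, repeat many times, report the median. A host-side sketch of that discipline (`fn` stands in for the real kernel dispatch; on-GPU scripts would use device-side timing and explicit synchronization instead of `perf_counter`):

```python
import statistics
import time

def bench(fn, *args, warmup=10, iters=100):
    """Median latency of fn(*args) in microseconds.

    Host-side illustration only: for real GPU kernels, replace
    time.perf_counter with device-side event timing and synchronize
    between iterations so launch overhead is not all you measure.
    """
    for _ in range(warmup):              # stabilize caches / clocks
        fn(*args)
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - t0) * 1e6)
    return statistics.median(samples)    # median resists outliers
```

Reporting the median rather than the mean keeps one preempted iteration from distorting a before/after comparison.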
| Operator | Bottleneck | Best Speedup | Typical Speedup | Status |
|---|---|---|---|---|
| topk_softmax | Sync | 15.04× | 3–9× | V1 done |
| moe_fused_gate | Memory+Sync | 5.3× | 1.2–4.5× | V6 done |
| topk_per_row | Memory | 2.07× | 1.1–1.4× | V4 done |
Detailed data → Leaderboard
```bash
# 1. Clone the project
git clone https://github.com/sijyang/kernel-forge
cd kernel-forge

# 2. Install skill (creates symlink to ~/.claude/skills/)
./install.sh

# 3. Start (or restart) Claude Code, then type:
/kernel-forge <op-name> impl:hip goal:latency
```

Examples:

```bash
/kernel-forge moe_fused_gate impl:hip goal:latency
/kernel-forge rmsnorm impl:hip goal:latency
/kernel-forge gated_rmsnorm_quant impl:hip goal:latency
```
- Four implementation classes, each with its own playbook — ASM / CK / Triton / HIP
- Three bottleneck types, beyond roofline — compute / memory / sync (sync-bound is invisible on roofline)
- Layered optimization with single-axis attribution — structural → sync → instruction → resource, one axis at a time
- Claim vs Measured — all claims must be empirically verified; discrepancies are documented
- Failed experiments are first-class citizens — preventing repeated pitfalls
- Knowledge flywheel — insights from each optimization are written back into playbooks
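The "Claim vs Measured" principle implies a hard correctness gate before any speedup is recorded: the candidate kernel must match the baseline numerically. A hedged sketch with NumPy stand-ins for the real operators (tolerances here are illustrative; fp16 kernels need looser bounds than fp32):

```python
import numpy as np

def validate(baseline_fn, candidate_fn, inputs, rtol=1e-3, atol=1e-3):
    """Correctness gate: the candidate must reproduce the baseline
    before any performance claim is written into notes.md.

    Illustrative harness only; tolerances should be chosen per dtype.
    """
    ref = baseline_fn(*inputs)
    out = candidate_fn(*inputs)
    if not np.allclose(out, ref, rtol=rtol, atol=atol):
        max_err = np.max(np.abs(out - ref))
        raise AssertionError(f"mismatch: max abs error {max_err:.3e}")
    return True
```

Failing this gate is exactly the kind of result that gets recorded anyway — a failed experiment written down once is a pitfall nobody revisits.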
```
kernel-forge/
├── SKILL.md                 # Claude Code skill entry point (Triage + phase routing)
├── README.md                # This file
├── reference.md             # MI300X peak performance / roofline / rocprof reference
├── playbooks/
│   ├── asm.md               # ASM kernel optimization playbook
│   ├── ck.md                # CK/CK-Tile configuration tuning playbook
│   ├── triton.md            # Triton autotune / static table playbook
│   └── hip.md               # HIP C++ full-source playbook + field experience library
├── shared/
│   ├── profiling.md         # Measurement discipline and rocprof usage
│   └── validation.md        # Correctness verification and report template
├── kernels/                 # Per-operator optimization workspace
│   ├── README.md            # Leaderboard overview
│   ├── topk_softmax/        # V1 done, up to 15×
│   ├── moe_fused_gate/      # V6 done, up to 5.3×
│   ├── topk_per_row/        # V4 done, up to 2.07×
│   └── gated_rmsnorm_quant/ # Baseline already optimal
└── install.sh               # One-click install
```