Systematic AI-driven GPU kernel optimization — Using Claude Code + structured playbooks to perform phased, traceable performance tuning on the aiter operator library. Target hardware: AMD Instinct MI300X (gfx942).
GPU kernel optimization is a critical bottleneck in LLM inference, yet traditional approaches rely on expert experience, involve long iteration cycles, and are hard to reproduce. We observed that:
- Optimization follows a repeatable pattern: bottleneck classification → strategy matching → layered implementation → validation & recording
- AI excels at pattern matching and code analysis, but requires structured constraints to produce reliable results
- Lessons from each optimization can be accumulated, making subsequent optimizations faster
kernel-forge encodes kernel optimization expertise into Claude-executable playbooks, enabling AI to complete the full pipeline from bottleneck diagnosis to code modification under human supervision.
```mermaid
flowchart LR
subgraph Input
A[User specifies operator]
end
subgraph "Phase 1 — Triage"
B{Implementation\nclassification}
B -->|ASM| B1[asm.md]
B -->|CK| B2[ck.md]
B -->|Triton| B3[triton.md]
B -->|HIP| B4[hip.md]
end
subgraph "Phase 2 — Measure"
C[Baseline measurement\nrocprof + roofline]
end
subgraph "Phase 3 — Optimize"
D[Bottleneck classification]
D -->|Memory| D1[Vectorization / occupancy]
D -->|Compute| D2[Tile / ILP]
D -->|Sync| D3[Fence removal / persistence]
E[Layered optimization\nL1→L2→L3→L4\nOne axis at a time]
end
subgraph "Phase 4 — Validate"
F[Correctness check ✓\nPerformance comparison ✓]
end
subgraph "Phase 5 — Record"
G[notes.md\nv1_tag.cu]
end
A --> B
B1 & B2 & B3 & B4 --> C
C --> D
D1 & D2 & D3 --> E
E --> F
F --> G
```
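Phase 3's bottleneck classification can be sketched as a roofline-style decision plus a sync check. A minimal sketch, assuming illustrative peak numbers (the `peak_flops` and `peak_bw` defaults below are placeholders, not authoritative MI300X specs; the real reference values live in reference.md):

```python
# Hedged sketch: classify a kernel as compute-, memory-, or sync-bound
# from basic counters. Peak numbers are illustrative placeholders.

def classify_bottleneck(flops, bytes_moved, time_s,
                        peak_flops=1.3e15, peak_bw=5.3e12,
                        busy_fraction=None):
    """Return 'compute', 'memory', or 'sync'.

    flops / bytes_moved: work performed by the kernel
    time_s: measured kernel time
    busy_fraction: fraction of time wavefronts were NOT stalled on
        barriers/fences (from profiler counters). If it is low, the
        kernel is sync-bound regardless of its roofline position.
    """
    if busy_fraction is not None and busy_fraction < 0.5:
        return "sync"  # invisible on a pure roofline plot
    achieved_flops = flops / time_s
    achieved_bw = bytes_moved / time_s
    # Whichever resource sits closer to its peak is the limiter.
    if achieved_flops / peak_flops >= achieved_bw / peak_bw:
        return "compute"
    return "memory"
```

This mirrors why sync-bound kernels need the extra counter: a kernel stalled on fences can sit far below both roofline ceilings and still be neither compute- nor memory-limited.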
Claude's role at each phase:
| Phase | Claude does | Human does |
|---|---|---|
| Triage | Automatically classify implementation, check optimization history | Confirm classification, select target |
| Measure | Generate benchmark scripts, analyze rocprof data | Execute on GPU environment |
| Optimize | Read playbook, propose code changes | Review changes, decide whether to adopt |
| Validate | Generate tests, analyze before/after | Execute on GPU environment |
| Record | Auto-populate notes.md | Review record accuracy |
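The Measure phase's benchmark scripts follow a fixed discipline: warm up, repeat many times, report the median. A host-side sketch of that discipline (`fn` stands in for the real kernel dispatch; on-GPU scripts would use device-side timing and explicit synchronization instead of `perf_counter`):

```python
import statistics
import time

def bench(fn, *args, warmup=10, iters=100):
    """Median latency of fn(*args) in microseconds.

    Host-side illustration only: for real GPU kernels, replace
    time.perf_counter with device-side event timing and synchronize
    between iterations so launch overhead is not all you measure.
    """
    for _ in range(warmup):              # stabilize caches / clocks
        fn(*args)
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - t0) * 1e6)
    return statistics.median(samples)    # median resists outliers
```

Reporting the median rather than the mean keeps one preempted iteration from distorting a before/after comparison.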
| Operator | Bottleneck | Best Speedup | Typical Speedup | Status |
|---|---|---|---|---|
| topk_softmax | Sync | 15.04× | 3–9× | V1 done |
| moe_fused_gate | Memory+Sync | 5.3× | 1.2–4.5× | V6 done |
| topk_per_row | Memory | 2.07× | 1.1–1.4× | V4 done |
Detailed data → Leaderboard
```bash
# 1. Clone the project
git clone https://github.com/sijyang/kernel-forge
cd kernel-forge

# 2. Install skill (creates symlink to ~/.claude/skills/)
./install.sh

# 3. Start (or restart) Claude Code, then type:
/kernel-forge <op-name> impl:hip goal:latency
```

Examples:

```bash
/kernel-forge moe_fused_gate impl:hip goal:latency
/kernel-forge rmsnorm impl:hip goal:latency
/kernel-forge gated_rmsnorm_quant impl:hip goal:latency
```
- Four implementation classes, each with its own playbook — ASM / CK / Triton / HIP
- Three bottleneck types, beyond roofline — compute / memory / sync (sync-bound is invisible on roofline)
- Layered optimization with single-axis attribution — structural → sync → instruction → resource, one axis at a time
- Claim vs Measured — all claims must be empirically verified; discrepancies are documented
- Failed experiments are first-class citizens — preventing repeated pitfalls
- Knowledge flywheel — insights from each optimization are written back into playbooks
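The "Claim vs Measured" principle implies a hard correctness gate before any speedup is recorded: the candidate kernel must match the baseline numerically. A hedged sketch with NumPy stand-ins for the real operators (tolerances here are illustrative; fp16 kernels need looser bounds than fp32):

```python
import numpy as np

def validate(baseline_fn, candidate_fn, inputs, rtol=1e-3, atol=1e-3):
    """Correctness gate: the candidate must reproduce the baseline
    before any performance claim is written into notes.md.

    Illustrative harness only; tolerances should be chosen per dtype.
    """
    ref = baseline_fn(*inputs)
    out = candidate_fn(*inputs)
    if not np.allclose(out, ref, rtol=rtol, atol=atol):
        max_err = np.max(np.abs(out - ref))
        raise AssertionError(f"mismatch: max abs error {max_err:.3e}")
    return True
```

Failing this gate is exactly the kind of result that gets recorded anyway — a failed experiment written down once is a pitfall nobody revisits.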
```
kernel-forge/
├── SKILL.md                 # Claude Code skill entry point (Triage + phase routing)
├── README.md                # This file
├── reference.md             # MI300X peak performance / roofline / rocprof reference
├── playbooks/
│   ├── asm.md               # ASM kernel optimization playbook
│   ├── ck.md                # CK/CK-Tile configuration tuning playbook
│   ├── triton.md            # Triton autotune / static table playbook
│   └── hip.md               # HIP C++ full-source playbook + field experience library
├── shared/
│   ├── profiling.md         # Measurement discipline and rocprof usage
│   └── validation.md        # Correctness verification and report template
├── kernels/                 # Per-operator optimization workspace
│   ├── README.md            # Leaderboard overview
│   ├── topk_softmax/        # V1 done, up to 15×
│   ├── moe_fused_gate/      # V6 done, up to 5.3×
│   ├── topk_per_row/        # V4 done, up to 2.07×
│   └── gated_rmsnorm_quant/ # Baseline already optimal
└── install.sh               # One-click install
```