# TopK Hunters

Systematic AI-driven GPU kernel optimization — Using Claude Code + structured playbooks to perform phased, traceable performance tuning on the aiter operator library. Target hardware: AMD Instinct MI300X (gfx942).

## Why This Project

GPU kernel optimization is a critical bottleneck in LLM inference, yet traditional approaches rely on expert experience, involve long iteration cycles, and are hard to reproduce. We observed that:

- Optimization follows a repeatable pattern: bottleneck classification → strategy matching → layered implementation → validation & recording
- AI excels at pattern matching and code analysis, but requires structured constraints to produce reliable results
- Lessons from each optimization can be accumulated, making subsequent optimizations faster

TopK Hunters encodes kernel optimization expertise into Claude-executable playbooks, enabling AI to complete the full pipeline from bottleneck diagnosis to code modification under human supervision.

## Workflow Architecture

```mermaid
flowchart LR
    subgraph Input
        A[User specifies operator]
    end

    subgraph "Phase 1 — Triage"
        B{Implementation\nclassification}
        B -->|ASM| B1[asm.md]
        B -->|CK| B2[ck.md]
        B -->|Triton| B3[triton.md]
        B -->|HIP| B4[hip.md]
    end

    subgraph "Phase 2 — Measure"
        C[Baseline measurement\nrocprof + roofline]
    end

    subgraph "Phase 3 — Optimize"
        D[Bottleneck classification]
        D -->|Memory| D1[Vectorization / occupancy]
        D -->|Compute| D2[Tile / ILP]
        D -->|Sync| D3[Fence removal / persistence]
        E[Layered optimization\nL1→L2→L3→L4\nOne axis at a time]
    end

    subgraph "Phase 4 — Validate"
        F[Correctness check ✓\nPerformance comparison ✓]
    end

    subgraph "Phase 5 — Record"
        G[notes.md\nv1_tag.cu]
    end

    A --> B
    B1 & B2 & B3 & B4 --> C
    C --> D
    D1 & D2 & D3 --> E
    E --> F
    F --> G
```
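The Phase 1 triage step amounts to a lookup from an operator's implementation class to the playbook that governs it. A minimal sketch of that routing, where the `PLAYBOOKS` mapping mirrors the repo's `playbooks/` directory and the file-extension heuristic in `classify_impl` is an illustrative assumption (the real skill classifies with more context):

```python
# Hypothetical sketch of Phase 1 triage: map a kernel source file to the
# playbook that should drive its optimization. The extension-based
# heuristic below is an assumption for illustration, not the repo's logic.
from pathlib import Path

PLAYBOOKS = {
    "asm": "playbooks/asm.md",
    "ck": "playbooks/ck.md",
    "triton": "playbooks/triton.md",
    "hip": "playbooks/hip.md",
}

def classify_impl(source_file: str) -> str:
    """Guess the implementation class from the kernel source path."""
    path = Path(source_file)
    if path.suffix == ".s":        # hand-written assembly
        return "asm"
    if path.suffix == ".py":       # Triton kernels live in Python files
        return "triton"
    if "ck" in path.parts:         # Composable Kernel configuration code
        return "ck"
    return "hip"                   # .cu / .hip / .cpp default to HIP C++

def select_playbook(source_file: str) -> str:
    return PLAYBOOKS[classify_impl(source_file)]
```

Once the playbook is selected, all later phases (measure, optimize, validate, record) are routed through its rules.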

Claude's role at each phase:

| Phase    | Claude does                                                       | Human does                             |
|----------|-------------------------------------------------------------------|----------------------------------------|
| Triage   | Automatically classify implementation, check optimization history | Confirm classification, select target  |
| Measure  | Generate benchmark scripts, analyze rocprof data                  | Execute on GPU environment             |
| Optimize | Read playbook, propose code changes                               | Review changes, decide whether to adopt |
| Validate | Generate tests, analyze before/after                              | Execute on GPU environment             |
| Record   | Auto-populate notes.md                                            | Review record accuracy                 |

## Results

| Operator       | Bottleneck  | Best Speedup | Typical  | Status  |
|----------------|-------------|--------------|----------|---------|
| topk_softmax   | Sync        | 15.04×       | 3–9×     | V1 done |
| moe_fused_gate | Memory+Sync | 5.3×         | 1.2–4.5× | V6 done |
| topk_per_row   | Memory      | 2.07×        | 1.1–1.4× | V4 done |

Detailed data → Leaderboard

## Quick Start

```bash
# 1. Clone the project
git clone https://github.com/sijyang/kernel-forge
cd kernel-forge

# 2. Install skill (creates symlink to ~/.claude/skills/)
./install.sh

# 3. Start (or restart) Claude Code, then type:
/kernel-forge <op-name> impl:hip goal:latency
```

Examples:

```
/kernel-forge moe_fused_gate impl:hip goal:latency
/kernel-forge rmsnorm impl:hip goal:latency
/kernel-forge gated_rmsnorm_quant impl:hip goal:latency
```

## Core Design

  1. Four implementation classes, each with its own playbook — ASM / CK / Triton / HIP
  2. Three bottleneck types, beyond roofline — compute / memory / sync (sync-bound is invisible on roofline)
  3. Layered optimization with single-axis attribution — structural → sync → instruction → resource, one axis at a time
  4. Claim vs Measured — all claims must be empirically verified; discrepancies are documented
  5. Failed experiments are first-class citizens — preventing repeated pitfalls
  6. Knowledge flywheel — insights from each optimization are written back into playbooks
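Item 2 above is the key departure from a plain roofline analysis: a roofline plot separates compute-bound from memory-bound kernels, but a kernel whose wavefronts mostly wait on barriers or fences can sit far below the roofline on both axes. A minimal sketch of that three-way decision, where the threshold, the stall-fraction metric, and the peak numbers are illustrative assumptions rather than the repo's actual rocprof pipeline:

```python
# Hypothetical sketch of three-way bottleneck classification.
# Thresholds and peak figures are illustrative assumptions; MI300X peaks
# (~163.4 FP32 vector TFLOPS, ~5.3 TB/s HBM3) are used for scale only.
def classify_bottleneck(flops: float, bytes_moved: float, seconds: float,
                        sync_stall_frac: float,
                        peak_tflops: float = 163.4,
                        peak_bw_tbs: float = 5.3) -> str:
    """Return 'sync', 'memory', or 'compute' for one kernel run."""
    # Sync first: barrier/fence stalls are invisible on a roofline plot.
    if sync_stall_frac > 0.3:          # most stall cycles spent at barriers/fences
        return "sync"
    # Otherwise, compare achieved fractions of the two hardware peaks.
    compute_util = (flops / seconds) / (peak_tflops * 1e12)
    mem_util = (bytes_moved / seconds) / (peak_bw_tbs * 1e12)
    return "memory" if mem_util > compute_util else "compute"
```

Whichever class wins then selects the optimization branch in Phase 3 (vectorization/occupancy for memory, tiling/ILP for compute, fence removal/persistence for sync).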

## Project Structure

```
kernel-forge/
├── SKILL.md              # Claude Code skill entry point (Triage + phase routing)
├── README.md             # This file
├── reference.md          # MI300X peak performance / roofline / rocprof reference
├── playbooks/
│   ├── asm.md            # ASM kernel optimization playbook
│   ├── ck.md             # CK/CK-Tile configuration tuning playbook
│   ├── triton.md         # Triton autotune / static table playbook
│   └── hip.md            # HIP C++ full-source playbook + field experience library
├── shared/
│   ├── profiling.md      # Measurement discipline and rocprof usage
│   └── validation.md     # Correctness verification and report template
├── kernels/              # Per-operator optimization workspace
│   ├── README.md         # Leaderboard overview
│   ├── topk_softmax/     # V1 done, up to 15×
│   ├── moe_fused_gate/   # V6 done, up to 5.3×
│   ├── topk_per_row/     # V4 done, up to 2.07×
│   └── gated_rmsnorm_quant/  # Baseline already optimal
└── install.sh            # One-click install
```
