AgentKernelArena is a standardized evaluation arena built by AMD's AI Group (AIG) to measure how well AI coding agents perform on real GPU kernel optimization tasks. It provides an end-to-end, siloed benchmarking environment where LLM-powered agents (Cursor Agent, Claude Code, Codex, SWE-agent, GEAK, and custom agents) are evaluated side-by-side on the same kernel tasks using objective and reproducible metrics.
AgentKernelArena enables systematic evaluation of AI agents on GPU kernel optimization tasks by combining:
- Multi-Agent Arena: Cursor, Claude Code, SWE-agent, OpenEvolve (GEAK), single LLM calls (Codex/others), and custom agents
- Multi-Model Support: OpenAI (GPT-5), Anthropic Claude (Opus 4.5, Sonnet 4.5), and other models via OpenRouter or vLLM
- Task Categories: HIP (ROCm examples, rocPRIM, customer HIP), Triton (TritonBench, ROCmBench), and Torch2HIP conversions
- Real Metrics: Automated evaluation of compilation success, correctness, and real GPU performance speedups
- Designed for Fair Comparison: Standardized tasks, environments, prompts, and scoring for leaderboard-style evaluation
- Workspace Isolation: Each task runs in a timestamped duplicate workspace for reproducibility
- Comprehensive Logging: Detailed logs with timestamps, prompts, outputs, and results for every task execution
- Flexible Configuration: YAML-based configuration for tasks, agents, and LLM parameters
AgentKernelArena is actively under development. Upcoming releases will publish detailed evaluation results comparing agent performance across multiple task categories, using standardized correctness and performance scores. As AI coding agents rapidly improve, we need more than cherry-picked demos -- especially in specialized domains like GPU programming. AgentKernelArena is built to answer a simple, critical question: which agents actually deliver real performance gains on real kernels?
```text
AgentKernelArena/
├── main.py                          # Main orchestration entry point
├── config.yaml                      # Global configuration
├── src/
│   ├── module_registration.py       # Dynamic agent/prompt/post-processing loading
│   ├── preprocessing.py             # Workspace setup and environment checks
│   ├── prompt_builder.py            # Task prompt construction
│   ├── postprocessing.py            # Result analysis and report generation
│   ├── score.py                     # Scoring logic for evaluation metrics
│   ├── tasks.py                     # Task discovery and registration
│   └── utils/
│       └── report_generation.py     # Aggregate report analysis utilities
├── agents/
│   ├── cursor/                      # Cursor agent integration
│   ├── claude_code/                 # Claude Code agent integration
│   ├── SWE_agent/                   # SWE-agent integration
│   ├── openevolve/                  # OpenEvolve (GEAK) integration
│   ├── geak_optimagentv2/           # GEAK OptimAgent v2 integration
│   ├── geak_hip/                    # GEAK HIP integration
│   ├── geak_ourllm_kernel2kernel/   # GEAK OurLLM kernel-to-kernel integration
│   ├── single_llm_call/             # Single LLM call implementation
│   └── __init__.py                  # Agent registry
└── tasks/                           # Task definitions
    ├── rocm-examples/               # ROCm example kernels
    ├── rocprim/                     # rocPRIM kernels
    ├── customer_hip/                # Custom HIP kernels
    ├── triton/                      # Triton benchmark kernels
    └── torch2hip/                   # Torch2HIP conversion tasks
```
- Configuration Loading: Load `config.yaml` with agent, task, and LLM settings
- Agent Registration: Dynamically load the agent launcher, prompt builder, and post-processing handler based on the `AgentType` enum
- Task Discovery: Scan the `tasks/` directory for task configurations matching the specified categories
- Workspace Setup: Create an isolated, timestamped workspace for each task
- Prompt Building: Construct task-specific prompts from config, source code, and instructions/cheatsheets
- Agent Execution: Launch agent in workspace with constructed prompt
- Result Collection: Save agent output, logs, and modified code
- Post-Processing: Run compilation, correctness tests, performance profiling, and scoring
- Report Generation: Generate a comprehensive evaluation report with metrics (this flow is sketched below)
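
The stripped-down Python sketch below illustrates this flow end to end. It is a minimal, hedged illustration: the structure mirrors the steps above, but the function body, field handling, and file layout are simplified placeholders, not the actual code in `main.py`.

```python
"""Illustrative sketch of the evaluation loop; names and details are
simplified placeholders, not AgentKernelArena's actual main.py."""
import shutil
import time
from pathlib import Path

import yaml


def run_arena(config_path: str = "config.yaml") -> None:
    config = yaml.safe_load(Path(config_path).read_text())        # 1. configuration loading
    agent_name = config["agent"]["template"]                       # 2. agent selection by name
    log_dir = Path(config["log_directory"])
    log_dir.mkdir(parents=True, exist_ok=True)

    for task_name in config["tasks"]:                              # 3. task discovery (wildcards omitted)
        task_dir = Path("tasks") / task_name
        task_cfg = yaml.safe_load((task_dir / "config.yaml").read_text())

        # 4. isolated, timestamped workspace per task
        stamp = time.strftime("%Y%m%d_%H%M%S")
        workspace = Path(f"{config['workspace_directory_prefix']}_{stamp}") / task_name
        shutil.copytree(task_dir, workspace)

        # 5. build a task-specific prompt from source files and instructions
        sources = "\n".join((workspace / p).read_text() for p in task_cfg["source_file_path"])
        prompt = (f"Optimize kernels {task_cfg['target_kernel_functions']} "
                  f"for {config['target_gpu_model']}:\n{sources}")

        # 6-7. launch the selected agent in the workspace and save its output
        output = f"[{agent_name} would be launched here with the prompt]"
        (log_dir / f"{stamp}_{task_name.replace('/', '_')}.log").write_text(prompt + "\n" + output)

        # 8-9. post-processing (compilation, correctness tests, profiling, scoring)
        # and report generation would then run against the workspace.
```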
- Python 3.12+
- ROCm toolkit (for HIP kernels): `hipcc`, `rocprof-compute`
- Triton (for Triton kernels)
- Git
```bash
# Clone the repository
git clone <repository-url>
cd AgentKernelArena

# Install dependencies
pip install -r requirements.txt

# Set up API keys (choose one or more)
export OPENAI_API_KEY="your_openai_key"
export ANTHROPIC_API_KEY="your_anthropic_key"
export OPENROUTER_API_KEY="your_openrouter_key"

# Install agent CLIs (using claude_code as an example)
# For Claude Code:
npm install -g @anthropic-ai/claude-code
```
- Configure `config.yaml`:

```yaml
# Select agent type
agent:
  template: claude_code   # Options: cursor, claude_code, swe_agent, single_llm_call, openevolve, geak_optimagentv2, geak_hip, geak_ourllm_kernel2kernel
  max_iterations: 5

# Specify tasks to run
tasks:
  - rocm-examples/bitonic_sort
  - customer_hip/silu
  # - all  # Run ALL tasks

target_gpu_model: MI300
log_directory: logs
workspace_directory_prefix: workspace
```
- Run evaluation:

```bash
python main.py
```

Tasks can also be selected with wildcards:

```yaml
tasks:
  - rocm-examples/*         # All ROCm examples
  - rocprim/*               # All rocPRIM tasks
  - customer_hip/mmcv/*     # All MMCV HIP kernels
  - triton/tritonbench/*    # All Triton benchmarks
  - torch2hip/*             # All Torch2HIP conversion tasks
```
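
To make the wildcard behavior concrete, here is a hypothetical illustration of how such patterns could be expanded against the `tasks/` directory. The real discovery logic lives in `src/tasks.py` and may differ; the `all` handling below is an assumption based on the config comment above.

```python
# Hypothetical sketch of wildcard task expansion; not the actual src/tasks.py code.
from pathlib import Path


def expand_tasks(patterns: list[str], tasks_root: str = "tasks") -> list[Path]:
    """Expand entries like 'rocm-examples/*' into concrete task directories."""
    root = Path(tasks_root)
    if "all" in patterns:
        patterns = ["**"]  # assumption: 'all' selects every task in every category
    task_dirs: list[Path] = []
    for pattern in patterns:
        for candidate in sorted(root.glob(pattern)):
            # A directory counts as a task only if it carries its own config.yaml.
            if (candidate / "config.yaml").is_file():
                task_dirs.append(candidate)
    return task_dirs


print(expand_tasks(["rocm-examples/*", "customer_hip/silu"]))
```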
Each task is defined by a `config.yaml` in its directory:

```yaml
# tasks/rocm-examples/bitonic_sort/config.yaml
source_file_path:
  - main.hip
target_kernel_functions:
  - bitonic_sort_kernel
compile_command:
  - make
correctness_command:
  - ./applications_bitonic_sort -l 15
performance_command:
  - rocprof-compute profile -n kernelgen --path rocprof_compute_profile --no-roof --join-type kernel -b SQ -b TCP -b TCC -- ./applications_bitonic_sort -l 15
  - rocprof-compute analyze --path rocprof_compute_profile -b 2
task_type: hip2hip
prompt:
  source_code: null    # Optional: override default source code inclusion
  instructions: null   # Optional: custom instructions
  cheatsheet: null     # Optional: provide cheatsheet/reference
```
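
To show how these fields drive the evaluation, here is a hedged sketch of a post-processing step that runs a task's commands inside its workspace. The function names and the returned dictionary are illustrative assumptions, not the actual code in `src/postprocessing.py` or `src/score.py`.

```python
# Illustrative only: how a task's compile/correctness commands might be executed.
import subprocess
from pathlib import Path

import yaml


def run_stage(commands: list[str], workspace: Path) -> bool:
    """Run each shell command in the task workspace; fail fast on a non-zero exit."""
    for cmd in commands:
        result = subprocess.run(cmd, shell=True, cwd=workspace,
                                capture_output=True, text=True)
        if result.returncode != 0:
            return False
    return True


def evaluate_task(workspace: Path) -> dict:
    cfg = yaml.safe_load((workspace / "config.yaml").read_text())
    compiled = run_stage(cfg["compile_command"], workspace)
    correct = compiled and run_stage(cfg["correctness_command"], workspace)
    return {"compiled": compiled, "correct": correct}
```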
AgentKernelArena uses a cumulative scoring system:

| Metric | Points | Description |
|---|---|---|
| Compilation | 20 | Code compiles successfully without errors |
| Correctness | 100 | Code produces correct output (passes tests) |
| Speedup | ratio × 100 | Performance improvement over baseline |
Example: A submission that compiles (20), passes correctness (100), and achieves 1.5× speedup (150) would score 270 points.
Note: this is not the only way to score; users can define their own scheme (the scoring logic lives in `src/score.py`).
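
For clarity, the sketch below turns the table above into code. The 20/100/ratio×100 weights come directly from the table; the function name and signature are illustrative.

```python
# Cumulative score = compilation (20) + correctness (100) + speedup ratio * 100.
def cumulative_score(compiled: bool, correct: bool, speedup: float | None) -> float:
    score = 0.0
    if compiled:
        score += 20
    if correct:
        score += 100
    if speedup is not None:
        score += speedup * 100   # e.g. a 1.5x speedup contributes 150 points
    return score


# Matches the worked example: compiles + correct + 1.5x speedup -> 270 points.
assert cumulative_score(True, True, 1.5) == 270
```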
- Create agent directory: `agents/your_agent/`

- Implement launch function:

```python
# agents/your_agent/launch_agent.py
from agents import register_agent


@register_agent("your_agent")
def launch_agent(prompt: str, log_directory: str, workspace: str) -> str:
    """
    Launch your agent.

    Returns:
        str: Agent output
    """
    result = ""  # Your agent implementation goes here
    return result
```
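
As a purely illustrative example, a launcher that shells out to a command-line agent could look like the following; `your-agent-cli` and its flags are hypothetical placeholders for whatever tool you wrap.

```python
# agents/your_agent/launch_agent.py -- hypothetical CLI-backed launcher.
import subprocess
from pathlib import Path

from agents import register_agent


@register_agent("your_agent")
def launch_agent(prompt: str, log_directory: str, workspace: str) -> str:
    # Run the (hypothetical) CLI agent inside the task workspace.
    result = subprocess.run(
        ["your-agent-cli", "--prompt", prompt],
        cwd=workspace, capture_output=True, text=True,
    )
    # Keep a full transcript alongside the other task logs.
    log_path = Path(log_directory) / "your_agent.log"
    log_path.parent.mkdir(parents=True, exist_ok=True)
    log_path.write_text(result.stdout + result.stderr)
    return result.stdout
```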
- Register in `module_registration.py`:

```python
# Add to AgentType enum
class AgentType(Enum):
    YOUR_AGENT = "your_agent"

# Add import in load_agent_launcher
if agent_type == AgentType.YOUR_AGENT:
    from agents.your_agent import launch_agent
```

- Add prompt builder support (if needed):

```python
# In load_prompt_builder
if agent_type in [..., AgentType.YOUR_AGENT]:
    return prompt_builder
```

- Add post-processing support (if needed):

```python
# In load_post_processing_handler
if agent_type in [..., AgentType.YOUR_AGENT]:
    return general_post_processing
```
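
The `@register_agent` decorator used above comes from the agent registry in `agents/__init__.py`. The sketch below shows what such a decorator-based registry could look like; it illustrates the pattern only and is not the project's actual implementation.

```python
# agents/__init__.py -- hypothetical sketch of a decorator-based agent registry.
from typing import Callable, Dict

AGENT_REGISTRY: Dict[str, Callable[[str, str, str], str]] = {}


def register_agent(name: str):
    """Decorator that records a launch function under the given agent name."""
    def decorator(launch_fn: Callable[[str, str, str], str]):
        AGENT_REGISTRY[name] = launch_fn
        return launch_fn
    return decorator


def resolve_agent(name: str) -> Callable[[str, str, str], str]:
    """Look up a registered launch function by agent name."""
    return AGENT_REGISTRY[name]
```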
- Create task directory: `tasks/category/task_name/`

- Add source files: `main.hip`, `Makefile`, etc.

- Create `config.yaml`:

```yaml
source_file_path:
  - main.hip
target_kernel_functions:
  - your_kernel_function
compile_command:
  - make
correctness_command:
  - ./your_executable --test
performance_command:
  - rocprof-compute profile ... -- ./your_executable
  - rocprof-compute analyze ...
task_type: triton2triton
prompt:
  source_code: null
  instructions: null
  cheatsheet: null
```

- Add baseline performance (optional): Create `baseline.txt` with expected performance metrics
- Enhance A/B Testing with Better Interactivity and User Experience
- Benchmarking State-of-the-Art Agents for Technical Reporting
- Standardize Holdout Tests with Comprehensive Shape Coverage
- Add Holdout Test Evaluation via Independent Agent
- New Feature: Support Multiple Agents on a Multi-GPU Server
- New Feature: Resume an Evaluation From a Previous Experiment
- Agents Can Hang During Task Execution, Blocking Test Completion
- Expand PyTorch2HIP Task Set to 100+ Tasks
- Expand CUDA2HIP Task Set to 100+ Tasks
- Expand Triton2Triton Task Set to 100+ Tasks
- Expand HIP2HIP Task Set to 100+ Tasks
- Restructure Task Directory by Task Type and Difficulty Level