GeisYaO/t0-gpu

T0-GPU

Bare-Metal GPU Kernel Compiler & KFD Runtime for RDNA3


Overview

T0-GPU is a pure-Rust GPU programming framework targeting AMD RDNA3 (GFX1100) hardware. It bypasses the HIP/ROCm userspace libraries entirely and communicates directly with the GPU through the Linux KFD driver interface. It is about 50,000 lines of Rust with zero external dependencies.

Core Components

| Component | Description |
| --- | --- |
| T0 Compiler | DSL → SSA IR → 6-pass optimization → register allocation → GFX1100 ISA → AMD HSA ELF |
| GEMM Generator | Parameterized bf16 WMMA GEMM: cooperative load + LDS double-buffer + K-loop pipelining |
| ISA Encoder | Machine-code encoding for the full GFX1100 instruction set (VOP1/VOP2/VOP3/SMEM/FLAT/WMMA/DS/MUBUF) |
| Code Object Generator | Hand-built AMD HSA ELF binaries (no dependency on the LLVM linker) |
| KFD Runtime | Bare-metal GPU scheduling: AQL queues, VRAM management, doorbell dispatch (~2 μs) |
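
To make the "AQL queue" part of the runtime more concrete, here is a minimal sketch of the 64-byte kernel-dispatch packet that an AQL queue carries. The field order and header bit positions follow the HSA specification; the names `AqlDispatchPacket`, `dispatch_header`, and `packet_1d` are illustrative assumptions and are not taken from T0-GPU's source.

```rust
/// HSA AQL kernel-dispatch packet: 64 bytes, field order per the HSA spec.
/// Type and field names are illustrative, not T0-GPU's own.
#[repr(C)]
#[derive(Debug, Default, Clone, Copy)]
pub struct AqlDispatchPacket {
    pub header: u16,               // packet type + acquire/release fence scopes
    pub setup: u16,                // number of grid dimensions in bits 0..1
    pub workgroup_size: [u16; 3],  // work-items per workgroup (x, y, z)
    pub reserved0: u16,
    pub grid_size: [u32; 3],       // total work-items (x, y, z)
    pub private_segment_size: u32, // scratch bytes per work-item
    pub group_segment_size: u32,   // LDS bytes per workgroup
    pub kernel_object: u64,        // address of the kernel descriptor
    pub kernarg_address: u64,      // address of the kernel-argument buffer
    pub reserved2: u64,
    pub completion_signal: u64,    // signal handle, 0 = none
}

const PACKET_TYPE_KERNEL_DISPATCH: u16 = 2; // header bits 0..7
const FENCE_SCOPE_SYSTEM: u16 = 2;
const ACQUIRE_FENCE_SHIFT: u16 = 9; // header bits 9..10
const RELEASE_FENCE_SHIFT: u16 = 11; // header bits 11..12

/// Header for a kernel dispatch with system-scope acquire and release fences.
pub fn dispatch_header() -> u16 {
    PACKET_TYPE_KERNEL_DISPATCH
        | (FENCE_SCOPE_SYSTEM << ACQUIRE_FENCE_SHIFT)
        | (FENCE_SCOPE_SYSTEM << RELEASE_FENCE_SHIFT)
}

/// Fill a packet for a 1-D launch. The grid is rounded up to a whole number
/// of workgroups, so the kernel itself must bounds-check its global id.
pub fn packet_1d(
    n_elems: u32,
    wg_size: u16,
    kernel_object: u64,
    kernarg_address: u64,
) -> AqlDispatchPacket {
    let wg = wg_size as u32;
    let grid_x = (n_elems + wg - 1) / wg * wg;
    AqlDispatchPacket {
        header: dispatch_header(),
        setup: 1, // one dimension
        workgroup_size: [wg_size, 1, 1],
        grid_size: [grid_x, 1, 1],
        kernel_object,
        kernarg_address,
        ..Default::default()
    }
}
```

In an HSA-style queue the packet body is normally written first and the header last, after which the doorbell is rung; how T0-GPU sequences these steps is not shown here.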

🏆 Performance Highlights

🏆 Surpasses rocBLAS. On large matrices, T0 outperforms AMD's official rocBLAS library:

  • 4096³: 96.4 TF vs rocBLAS 91.1 TF (+6%)
  • 2048³: 83.2 TF vs rocBLAS 71.2 TF (+17%)
  • 8192³: 114.1 TF (rocBLAS not measured)

A pure-Rust JIT compiler generates the GEMM kernels at runtime and selects the best one automatically. No pre-compiled kernel libraries, no external dependencies.

Zero-Overhead Dispatch: async dispatch latency is 2.26 μs (HIP: 2.6 μs) and sync dispatch latency is 14.96 μs (HIP: 20.5 μs), i.e. 13-27% lower latency than HIP.

Zero-Dependency: pure Rust, zero external dependencies; only /dev/kfd and /dev/dri are required.

Why Not HIP?

| | HIP Runtime | KFD Bare-Metal |
| --- | --- | --- |
| Sync dispatch latency | 20.5 μs | 14.96 μs (−27%) |
| Async dispatch latency | 2.6 μs | 2.26 μs (−13%) |
| Memory management | hipMalloc/hipFree | Direct VRAM mmap |
| Dependencies | libhip, libhsakmt, ROCr | /dev/kfd + /dev/dri |
| Compiler stack | Python + LLVM + ROCm (several GB) | Single Rust binary |

Quick Start

Requirements

  • GPU: AMD RDNA3 (RX 7900 XTX / 7900 XT / 7800 XT, etc.)
  • OS: Linux, kernel 5.15+ (Ubuntu 22.04+ recommended)
  • Driver: amdgpu KFD (built into the kernel module, no separate installation needed)
  • Toolchain: Rust 1.70+, LLVM 17+ (llvm-mc, ld.lld)

Verify Setup

# Check the KFD devices
ls -la /dev/kfd /dev/dri/renderD128

# Check user group membership
groups | grep -E "video|render"
# To add yourself if needed: sudo usermod -aG video,render $USER && newgrp video

# Verify LLVM
llvm-mc --version    # requires 17+

Build

# Build only the T0 compiler (no GPU required)
cargo build --release --lib

# Build with the KFD runtime
cargo build --release --lib --features rocm

# Run the GEMM benchmark (see "Reproducing Benchmarks" below)
cargo test --release --features rocm -- test_wgp_k64_benchmark \
  --nocapture --ignored --test-threads=1

# Run the correctness test
cargo test --release --features rocm -- test_tile_ir_correctness \
  --nocapture --test-threads=1

# Dump ISA assembly (for debugging)
T0_DUMP_ASM=1 cargo test --release --features rocm -- <test_name>

Example: BlockDSL API

use t0_gpu::t0::block_dsl::*;

// Vector add kernel (the sum is written back into x)
let mut kb = BlockKernel::new("vadd", 256);
let x = kb.arg_ptr("x");
let y = kb.arg_ptr("y");
let n = kb.arg_u32("n_elems");
let gid = kb.global_id();

// Only threads with gid < n_elems do any work
kb.if_lt(gid, n, |kb| {
    let a = kb.load_f32(x, gid);
    let b = kb.load_f32(y, gid);
    let c = kb.add(a, b);
    kb.store_f32(x, gid, c);
});

let compiled = kb.compile(Target::GFX1100)?;
// → AMD HSA ELF binary, ready for KFD dispatch

T0 Compiler Architecture

T0 is a multi-layer GPU kernel compiler with two independent compilation paths:

flowchart TD
    subgraph "Path A: General Kernels"
        A1["BlockDSL<br>(block_dsl.rs)"] --> A2["SSA IR<br>(ssa_ir.rs)"]
        A2 --> A3["Optimization<br>DCE · CSE · LICM<br>CopyProp · AlgSimp"]
        A3 --> A4["SSA RegAlloc<br>(ssa_regalloc.rs)"]
    end

    subgraph "Path B: GEMM Kernels"
        B1["GemmConfig<br>(gemm_gen.rs)"] --> B2["TileIR<br>(tile_ir.rs)"]
        B2 --> B3["Tile SSA<br>(tile_ssa.rs)"]
        B3 --> B4["Tile Lower<br>(tile_ssa_lower.rs)"]
    end

    A4 --> C["T0Kernel<br>(compile.rs)"]
    B4 --> C

    subgraph "Backend"
        C --> D["AsmEmitter<br>→ rdna3_asm.rs"]
        D --> E["AMD HSA ELF<br>→ rdna3_code_object.rs"]
    end

    subgraph "KFD Runtime"
        E --> F["GpuKernel::load()"]
        F --> G["AQL Dispatch<br>~2μs async"]
        G --> H["🚀 RX 7900 XTX"]
    end

    style A1 fill:#4CAF50,color:#fff
    style B1 fill:#4CAF50,color:#fff
    style H fill:#FF5722,color:#fff
    style E fill:#2196F3,color:#fff

Path A: BlockDSL → SSA (General Kernels)

For elementwise ops, Softmax, RoPE, Cross-Entropy Loss, etc.

  • BlockDSL: Triton-style declarative kernel DSL (loops, conditionals, LDS, WMMA, wave reduce)
  • SSA IR: static single assignment intermediate representation with phi nodes and a control-flow graph
  • 6-pass optimization: DCE, CSE (barrier-aware), LICM, copy propagation, algebraic simplification, waitcnt refinement
  • SSA RegAlloc: linear scan + gap reclaim + 8-aligned group allocation for WMMA
Path B: TileIR → GEMM (GEMM-Specific)

For bf16 WMMA matrix multiplication.

  • GemmConfig: parameterized configuration (tile_m/n/k, split_k, wg_size, transpose)
  • TileIR: double-buffered K-loop pipeline + cooperative load + graduated LDS waits
  • Tile SSA: VGPR pressure estimation + acc_swap detection
  • Auto-Select: auto_select(M, K, N) automatically picks the best configuration
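
The real selection logic behind auto_select is not reproduced in this README. As a rough sketch of what a size-driven rule could look like, the code below hard-codes the tile configurations that this README's GEMM performance table reports as best for each cubic size. `TileChoice`, `pick_tile`, and `workgroup_count` are hypothetical names, and the thresholds are an assumption read off that table, not T0's actual heuristic (which also takes K into account).

```rust
/// A tile configuration: output tile is tile_m × tile_n, K step is tile_k.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TileChoice {
    pub tile_m: u32,
    pub tile_n: u32,
    pub tile_k: u32,
}

/// Pick a tile from the smaller output dimension, using the configurations
/// the benchmark table lists: 64×64 k64 up to 1024³, 128×64 k32 at 2048³,
/// 128×128 k32 from 4096³ upward. Thresholds in between are assumed.
pub fn pick_tile(m: u32, n: u32) -> TileChoice {
    let size = m.min(n);
    if size >= 4096 {
        TileChoice { tile_m: 128, tile_n: 128, tile_k: 32 }
    } else if size >= 2048 {
        TileChoice { tile_m: 128, tile_n: 64, tile_k: 32 }
    } else {
        TileChoice { tile_m: 64, tile_n: 64, tile_k: 64 }
    }
}

/// Number of workgroups needed when each one computes a single output tile.
pub fn workgroup_count(m: u32, n: u32, t: TileChoice) -> u32 {
    let tiles_m = (m + t.tile_m - 1) / t.tile_m;
    let tiles_n = (n + t.tile_n - 1) / t.tile_n;
    tiles_m * tiles_n
}
```

Larger tiles reuse more data per load but produce fewer workgroups, which is one plausible reason the small sizes favor 64×64 tiles.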

Built-in Kernels

| Kernel | Description |
| --- | --- |
| GEMM | bf16 WMMA, cooperative load, LDS double-buffer, auto-select |
| RMSNorm | Forward + backward |
| Softmax | Online safe softmax (numerically stable) |
| Cross-Entropy | log_softmax + NLL loss + backward |
| RoPE | Rotary position embedding, forward + backward |
| Causal Mask | Upper-triangular mask → -inf |
| Elementwise | scale, relu, sigmoid, SiLU, gelu, exp, fma, and fused combinations |
| Transpose | f32/bf16 matrix transpose |
| Format | f32 ⇆ bf16 conversion |

🏆 GEMM Performance

T0 vs rocBLAS vs Triton: Head-to-Head Comparison (2026-03-31)

BF16 GEMM on RX 7900 XTX, measured on the same machine in the same session. rocBLAS baseline: PyTorch 2.9.1+rocm6.4 torch.mm(); Triton 3.6.0 with @triton.autotune.

| Matrix M×N×K | rocBLAS (TF) | Triton-AT (TF) | T0 (TF) | T0 Config | T0 vs rocBLAS |
| --- | --- | --- | --- | --- | --- |
| 256³ | 3.4 | 2.3 | 2.4 | 64×64 k64 | 71% |
| 512³ | 14.3 | 17.3 | 14.2 | 64×64 k64 | 99% |
| 1024³ | 51.5 | 55.4 | 47.1 | 64×64 k64 | 91% |
| 2048³ | 71.2 | 78.7 | 83.2 | 128×64 k32 | 117% 🔥 |
| 4096³ | 91.1 | 88.2 | 96.4 | 128×128 k32 | 106% 🔥 |
| 8192³ | not measured | | 114.1 | 128×128 k32 | |

🏆 Surpasses rocBLAS on large matrices: T0 beats rocBLAS by 17% at 2048³ and by 6% at 4096³.

⚡ JIT vs pre-compiled: rocBLAS uses Tensile to exhaustively pre-generate 1000+ kernel variants offline (a process that takes days); T0 JIT-compiles and selects the best kernel in ~1 second at runtime.

⚠️ Small matrices (≤256³): dispatch overhead dominates, and rocBLAS's HIP dispatch path is faster for tiny matrices.
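
For readers who want to check the figures, the sketch below shows how a TFLOPS number and a relative speedup can be derived from a measured kernel time. It assumes the usual convention of 2·M·N·K floating-point operations per GEMM; the README does not state which convention its benchmark harness uses, and the function names are illustrative.

```rust
/// Throughput of an M×N×K GEMM in TFLOPS, assuming 2·M·N·K operations
/// (one multiply and one add per inner-product term).
pub fn gemm_tflops(m: u64, n: u64, k: u64, seconds: f64) -> f64 {
    let flops = 2.0 * (m as f64) * (n as f64) * (k as f64);
    flops / seconds / 1e12
}

/// Relative advantage of `ours` over `baseline`, in percent.
pub fn speedup_percent(ours: f64, baseline: f64) -> f64 {
    (ours / baseline - 1.0) * 100.0
}
```

Under that convention a 4096³ GEMM is about 1.37 × 10¹¹ operations, so 96.4 TF corresponds to a kernel time of roughly 1.43 ms, and `speedup_percent(96.4, 91.1)` rounds to the 6% quoted for 4096³.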

Performance Evolution

| Date | Version | 4096³ TFLOPS | Key Optimization |
| --- | --- | --- | --- |
| 2026-03-21 | gemm_gen Split-K | 67.3 | Split-K + WGP mode |
| 2026-03-29 | TileIR v1 | 79.2 | Graduated lgkmcnt + Gap Reclaim |
| 2026-03-30 | TileIR v2 (Phase 1) | 84.1 | soffset addressing |
| 2026-03-30 | TileIR v2 (Phase 2) | 89.0 | LDS offset folding |
| 2026-03-30 | TileIR v2 (Phase 3) | 87.1 | Concurrent VMEM overlap |
| 2026-03-31 | TileIR v3 (Autotuner) | 103.7 | Data-driven auto-select + spill filter + WGP fix |
| 2026-03-31 | TileIR v3 (8192³) | 116.8 | 128×128 k64, 70.8% of peak |
| 2026-03-31 | TileIR v4 (LDS Fix) | 96.4 | LDS XOR swizzle correctness fix (k>16) |
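Optimization Techniques

| Technique | Description |
| --- | --- |
| Cooperative Loading | Threads in a workgroup cooperatively load each tile; every thread issues a buffer_load_b128 (16 B) |
| LDS Double Buffering | Double-buffered K-loop pipeline that hides global-memory latency |
| Graduated LDS Waits | Refines lgkmcnt(N) instruction by instruction after ds_load to maximize WMMA/LDS overlap |
| soffset Addressing | Precomputed SGPR row offsets eliminate the serial v_add chain in the inner loop |
| LDS Offset Folding | Folds the row address into the ds_store immediate offset field, at zero VGPR cost |
| Concurrent VMEM | Issues buffer_load for the X and WT matrices concurrently, giving a 92+ instruction VMEM overlap window |
| Gap Reclaim | Reclaims VGPRs lost to alignment gaps, saving ~15 VGPRs |
| Auto-Select + K-Clamp | Data-driven automatic choice of the best tile configuration (based on a full-spectrum benchmark) |
| Autotuner | Precompiles candidates at runtime and benchmarks them in batch; picks the best kernel in ~1 second |
| VGPR Spill Filter | Detects LDS spills at compile time and automatically skips inefficient kernels |
| Split-K | Compile-time parallelization over the K dimension (sk = 1 to 16) |
| Dual Grid Layout | M-on-X / N-on-X, chosen adaptively for L2 locality |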
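
The Split-K entry can be illustrated on the CPU: the K dimension is cut into `splits` slices, each slice produces a partial product, and the partials are summed. This is a plain reference sketch in f32 with hypothetical function names, not the T0 kernel; on a GPU each slice would typically run in its own workgroup and the partials would be combined by a reduction step.

```rust
/// C = A[:, k0..k1] · B[k0..k1, :] for row-major A (m×k) and B (k×n).
pub fn gemm_k_range(
    a: &[f32], b: &[f32],
    m: usize, n: usize, k: usize,
    k0: usize, k1: usize,
) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            for p in k0..k1 {
                c[i * n + j] += a[i * k + p] * b[p * n + j];
            }
        }
    }
    c
}

/// Split-K GEMM: compute one partial result per K slice, then add them up.
pub fn gemm_split_k(
    a: &[f32], b: &[f32],
    m: usize, n: usize, k: usize,
    splits: usize,
) -> Vec<f32> {
    assert!(splits > 0, "need at least one K slice");
    let chunk = (k + splits - 1) / splits; // slice length, rounded up
    let mut c = vec![0.0f32; m * n];
    for s in 0..splits {
        // Clamp so trailing slices become empty when splits does not divide k.
        let k0 = (s * chunk).min(k);
        let k1 = ((s + 1) * chunk).min(k);
        let partial = gemm_k_range(a, b, m, n, k, k0, k1);
        for (acc, p) in c.iter_mut().zip(&partial) {
            *acc += *p; // reduction across slices
        }
    }
    c
}
```

Splitting K creates more independent work when M×N alone yields too few tiles to fill the GPU, at the price of the extra reduction. With floating-point inputs the summation order changes, so results can differ from the unsplit GEMM in the last bits.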

Project Structure

t0-gpu/  (~50,000 LOC)
├── Cargo.toml
├── README.md
├── docs/
│   ├── T0_技术手册.md           # 📖 Full technical manual (1000+ lines)
│   ├── architecture.md          # System architecture diagrams
│   ├── T0_SSA_Safety_Guide.md   # SSA pipeline safety guide
│   └── *.md                     # 18 experiment logs
├── examples/
│   ├── bench_gemm_sweep.rs      # GEMM multi-configuration sweep
│   ├── bench_gemm_variants.rs   # GEMM variant comparison
│   └── hello_gemm_gen.rs        # Auto-selected GEMM
└── src/
    ├── lib.rs
    ├── prelude.rs
    ├── rdna3_asm.rs              # ISA encoder (3,100 LOC)
    ├── rdna3_code_object.rs      # ELF generator (1,400 LOC)
    ├── kfd/
    │   └── mod.rs                # KFD bare-metal runtime (3,000 LOC)
    └── t0/                       # T0 compiler (34 files, ~38K LOC)
        ├── block_dsl.rs          #   BlockDSL front end (2,000 LOC)
        ├── block_dsl_to_ssa.rs   #   DSL → SSA translation (1,800 LOC)
        ├── ssa_ir.rs             #   SSA intermediate representation (3,400 LOC)
        ├── opt_passes.rs         #   6-pass SSA optimization (1,400 LOC)
        ├── ssa_regalloc.rs       #   SSA register allocation (1,000 LOC)
        ├── domtree.rs            #   Dominator tree (600 LOC)
        ├── ir.rs                 #   T0 IR (~80 Op types) (1,100 LOC)
        ├── compile.rs            #   Main compilation logic (1,400 LOC)
        ├── asm_emitter.rs        #   ISA emitter (1,000 LOC)
        ├── tile_ir.rs            #   GEMM TileIR (4,000 LOC)
        ├── tile_ssa.rs           #   Tile SSA (2,100 LOC)
        ├── tile_ssa_lower.rs     #   Tile → T0Kernel (2,300 LOC)
        ├── gemm_gen.rs           #   Parameterized GEMM (1,400 LOC)
        ├── math.rs               #   Math kernel library (7,900 LOC)
        ├── cost_model.rs         #   GFX1100 cost model (900 LOC)
        ├── kloop_simulator.rs    #   4-pipeline K-loop simulator (1,500 LOC)
        ├── hw_probe.rs           #   GPU instruction latency probe (1,700 LOC)
        ├── isa_probe.rs          #   Automatic ISA encoding verification (1,000 LOC)
        ├── isa_verifier.rs       #   Static ISA verifier (500 LOC)
        ├── gpu_printf.rs         #   GPU-side printf (400 LOC)
        ├── softmax_kernels.rs    #   Softmax forward/backward
        ├── ce_loss_kernels.rs    #   Cross-Entropy Loss
        ├── rope_kernels.rs       #   RoPE forward/backward
        ├── causal_mask_kernels.rs #  Causal Mask
        └── ...

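Diagnostics & Tools

| Tool | Description |
| --- | --- |
| ISA Verifier | Static pre-compilation check for hang patterns (stale VCC, unbalanced EXEC, missing waitcnt) |
| HW Probe | Runs microbenchmarks on the GPU to measure per-instruction latency and throughput (s_getreg SHADER_CYCLES) |
| ISA Probe | Calls llvm-mc to discover the instructions available on GFX1100 automatically, plus diff analysis |
| K-loop Simulator | 4-pipeline simulator (VALU/WMMA, LDS, VMEM, SALU) with RAW dependency tracking |
| GPU Printf | Bare-metal KFD ring-buffer printf (the GPU writes via atomic_add, the CPU reads) |
| ASM Dump | T0_DUMP_ASM=1 dumps human-readable ISA assembly |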
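
The GPU Printf row describes a ring buffer in which writers claim slots with an atomic add and the CPU reads them back. The sketch below simulates that pattern on the CPU with `std::sync::atomic`; `PrintfRing` and its methods are hypothetical names, the records are bare `u32` values, and the actual record format and synchronization in gpu_printf.rs are not shown in this README.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// A fixed-size ring of u32 records. Writers reserve a slot with fetch_add,
/// which stands in for the GPU-side atomic_add. Hypothetical sketch only.
pub struct PrintfRing {
    write_cursor: AtomicU32,
    slots: Vec<AtomicU32>,
}

impl PrintfRing {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "ring needs at least one slot");
        Self {
            write_cursor: AtomicU32::new(0),
            slots: (0..capacity).map(|_| AtomicU32::new(0)).collect(),
        }
    }

    /// Writer side: take a unique ticket, then store into the slot it maps to.
    /// Once the ring is full, new records overwrite the oldest slots.
    pub fn push(&self, value: u32) {
        let ticket = self.write_cursor.fetch_add(1, Ordering::Relaxed) as usize;
        self.slots[ticket % self.slots.len()].store(value, Ordering::Release);
    }

    /// Reader side: call after all writers have finished (for a GPU, after
    /// the kernel has completed). Returns the records in slot order.
    pub fn drain(&self) -> Vec<u32> {
        let written = self.write_cursor.load(Ordering::Acquire) as usize;
        let filled = written.min(self.slots.len());
        self.slots[..filled]
            .iter()
            .map(|slot| slot.load(Ordering::Acquire))
            .collect()
    }
}
```

Because every writer receives a distinct ticket, concurrent writers never share a slot as long as the ring has not wrapped; the order of records reflects the order in which tickets were claimed, not program order across threads.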

🔬 Reproducing Benchmarks

Setup

# Make sure the GPU and driver are ready
ls /dev/kfd /dev/dri/renderD128
groups | grep -E "video|render"

# Build (--features rocm is required to enable the KFD runtime)
cargo build --release --lib --features rocm

Benchmark Commands

# ┌─────────────────────────────────────────────────────────────┐
# │  ★ Primary: 4096³ Autotuner (auto-select best tile)       │
# └─────────────────────────────────────────────────────────────┘
cargo test --release --features rocm -- test_tune_tile_ir_4096 \
  --nocapture --ignored --test-threads=1

# Expected output:
#   [tile_tune]   tile_gemm_128x128_k32_db → 96.4 TF
#   [tile_tune]   tile_gemm_64x128_k32_db  → 94.9 TF
#   [tile_tune]   tile_gemm_64x64_k64_db   → 91.5 TF
#   [tile_tune] ✓ Best: tile_gemm_128x128_k32_db (96.4 TF)

# ┌─────────────────────────────────────────────────────────────┐
# │  Full spectrum: 256³ to 8192³ sweep                        │
# └─────────────────────────────────────────────────────────────┘
cargo test --release --features rocm -- test_tune_tile_ir_all_sizes \
  --nocapture --ignored --test-threads=1

# Output includes the best config and TFLOPS for each size from 256³ to 8192³

# ┌─────────────────────────────────────────────────────────────┐
# │  Correctness: GPU vs CPU reference comparison             │
# └─────────────────────────────────────────────────────────────┘
cargo run --release --features rocm --example test_gemm_correctness

# Expected output:
#   39 configs tested, 39 PASS, 0 FAIL
#   ✅ PASS max_err in BF16 precision range (~1e-5)

# ┌─────────────────────────────────────────────────────────────┐
# │  rocBLAS / Triton comparison (requires PyTorch + Triton)  │
# └─────────────────────────────────────────────────────────────┘
python3 benchmarks/bench_triton_rocblas.py

# Expected output:
#   4096×4096×4096 | rocBLAS 91.10 TF | Triton 88.16 TF | ★ rocBLAS
# Used to cross-validate the T0 results

# ┌─────────────────────────────────────────────────────────────┐
# │  ISA dump (analyze inner loop quality)                    │
# └─────────────────────────────────────────────────────────────┘
T0_DUMP_ASM=1 cargo test --release --features rocm \
  -- test_lower_gemm_128x128_k32_compiles --nocapture

# Outputs GFX1100 ISA assembly with register allocation info

Performance Notes

  • First run: the Autotuner compiles 13 candidate kernels on its first run (~5 s); subsequent runs use the cache (~0.1 s).
  • GPU clocks: run 2-3 times and take the best result (the first run may be slower while the GPU clock ramps up).
  • GPU temperature: thermal throttling may cause 1-3% variance after sustained runs.
  • --test-threads=1: GPU tests must run single-threaded; otherwise they contend for GPU resources and the results become inaccurate.
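
The "run several times and take the best" advice can be wrapped in a small helper. This is a generic sketch with a hypothetical `best_of` function that times any closure on the CPU side; it is not part of T0-GPU and does not account for GPU-side timers.

```rust
use std::time::{Duration, Instant};

/// Run `work` `runs` times and return the shortest wall-clock duration.
/// Taking the minimum discards slow early runs caused by clock ramp-up.
pub fn best_of<F: FnMut()>(runs: usize, mut work: F) -> Duration {
    assert!(runs > 0, "need at least one run");
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            work();
            start.elapsed()
        })
        .min()
        .expect("runs > 0 was checked above")
}
```

A call such as `best_of(3, || run_benchmark())`, where `run_benchmark` is whatever blocking dispatch you want to time, returns the fastest of three attempts.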

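Roadmap

| Status | Feature | Description |
| --- | --- | --- |
| | Surpasses rocBLAS | Large-matrix GEMM beats AMD's official library by 6-17% (2048³ and 4096³) |
| | Data-driven Autotuner | Precompile + batch benchmark; picks the best kernel in ~1 s |
| | VGPR Spill Filter | Detects LDS spills at compile time and skips inefficient configurations |
| | RSRC1 WGP Fix | Corrected RSRC1 bit 29 mapping + KFD loader cleanup |
| | soffset Addressing | Precomputed SGPR row offsets remove the serial dependency in the inner loop |
| | LDS Offset Folding | Folds into the ds_store immediate offset, at zero VGPR cost |
| | Concurrent VMEM Load | Issues buffer_load for X/WT concurrently, 92+ instructions of overlap |
| | Graduated lgkmcnt(N) | Precise waitcnt maximizes WMMA/LDS pipeline overlap |
| | buffer_load Optimization | SRD addressing replaces global_load, reducing VGPR pressure |
| | LDS Bank Conflict Optimization | Stride padding + swizzle eliminate bank conflicts |
| 🟡 | Graph-level Op Fusion | Fused GEMM+Bias+RMSNorm kernel |
| 🟡 | Async GPU Dispatch | GpuFuture + submit_async() |
| 🟢 | Multi-GPU | Multi-queue scheduling, PCIe P2P transfers |
| 🟢 | RDNA4 / GFX12 Support | GFX12 ISA adaptation |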

Documentation


License

Licensed under either of:

  • Apache License, Version 2.0 (LICENSE-APACHE)
  • MIT license (LICENSE-MIT)

at your option.

Hardware Target

| Item | Detail |
| --- | --- |
| GPU | AMD Radeon RX 7900 XTX (Navi 31) |
| Architecture | RDNA3, Wave32, 96 CU |
| ISA Target | amdgcn-amd-amdhsa--gfx1100 |
| VRAM | 24 GB GDDR6 |
| Peak Compute | 165 TFLOPS (bf16 WMMA, 2.5 GHz boost) |

Support & Sponsor

T0-GPU is an independent open-source project that I developed entirely during my unemployment. If you find this bare-metal approach inspiring, or useful for your research or work, consider supporting its ongoing development.

🪙 Crypto:

  • ETH / ERC20: 0x5C28A5e66302800ba4Cc8950055715f7119562C4
  • BTC: bc1q0844xxw9s3r4usu96l8rs6j82er0sce7p7yg8t

☕️ WeChat / Alipay (for supporters in mainland China): see DONATE.md for the sponsorship QR codes.