
GPU Architecture Reference

AMD CDNA Architecture Peak Performance

MI250X (gfx90a)

Dtype         Peak TFLOPS
FP64          47.9
FP32          47.9
FP16 / BF16   383.0
INT8          383.0
  • HBM2e bandwidth: 3.2 TB/s (aggregate, 2 GCDs)
  • Per-GCD: 1.6 TB/s
  • LDS bandwidth: ~51.2 TB/s per GCD
  • L2 cache: 8 MB per GCD
  • LDS: 64 KB per CU
  • Max VGPRs: 512 per SIMD (registers are 32-bit; FP64 values occupy register pairs)
  • Wavefront size: 64

MI300X (gfx942)

Dtype         Peak TFLOPS
FP64          163.4
FP32          163.4
FP16 / BF16   1307.4
FP8           2614.9
INT8          2614.9
  • HBM3 bandwidth: 5.3 TB/s
  • L2 cache: 4 MB per XCD; Infinity Cache: 256 MB (shared across XCDs)
  • LDS: 64 KB per CU
  • Max VGPRs: 512 per SIMD (unified register file)
  • Wavefront size: 64
  • 8 XCDs, 304 CUs total

Roofline Analysis

Arithmetic Intensity (AI) = FLOPs / Bytes accessed

If AI < Peak FLOPS / Peak BW  =>  Memory-bound
If AI > Peak FLOPS / Peak BW  =>  Compute-bound

Ridge point (AI where compute meets memory)

GPU                FP16/BF16 ridge    FP8 ridge
MI250X (per GCD)   ~120 FLOP/Byte     -
MI300X             ~247 FLOP/Byte     ~493 FLOP/Byte

(MI250X per GCD: 191.5 TFLOPS / 1.6 TB/s ≈ 120 FLOP/Byte.)
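The ridge points follow directly from dividing peak FLOPS by peak bandwidth. A minimal sketch, using the MI300X numbers from the tables above (function names are illustrative):

```python
def ridge_point(peak_tflops: float, peak_bw_tbs: float) -> float:
    """AI (FLOP/Byte) where the compute and memory rooflines meet.

    TFLOP/s divided by TB/s cancels the tera factors, leaving FLOP/Byte.
    """
    return peak_tflops / peak_bw_tbs

def bound(ai: float, ridge: float) -> str:
    """Classify an operator by comparing its AI to the ridge point."""
    return "memory-bound" if ai < ridge else "compute-bound"

mi300x_fp16 = ridge_point(1307.4, 5.3)  # ~247 FLOP/Byte
mi300x_fp8 = ridge_point(2614.9, 5.3)   # ~493 FLOP/Byte
```

Any operator whose AI sits below the ridge cannot reach peak FLOPS no matter how well it is scheduled; bandwidth is the ceiling.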

Common operator AI ranges

Operator                            Typical AI   Bound
PA Decode (single query)            1-4          Memory
PA Prefill (long seq)               64-256       Compute
GEMM (large M,N,K)                  128-512      Compute
GEMM (small M, decode)              2-16         Memory
RMSNorm                             2-4          Memory
RoPE                                4-8          Memory
Activation (SiLU, GELU)             ~1           Memory
MoE routing                         1-2          Memory
TopK / Radix select (multi-block)   n/a          Sync
Custom all-reduce (intra-node)      n/a          Sync
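The GEMM rows fall out of the AI formula directly: a GEMM does 2·M·N·K FLOPs over the A, B, and C tiles. A sketch for FP16, assuming each matrix moves between HBM and the chip exactly once (an idealization; real kernels see less reuse, which is why the table's large-GEMM range tops out lower):

```python
def gemm_ai_fp16(M: int, N: int, K: int) -> float:
    """Ideal arithmetic intensity of an FP16 GEMM, single pass per matrix."""
    flops = 2 * M * N * K
    bytes_moved = 2 * (M * K + K * N + M * N)  # fp16 = 2 bytes/element
    return flops / bytes_moved

large = gemm_ai_fp16(4096, 4096, 4096)  # ~1365 FLOP/Byte -> compute-bound
decode = gemm_ai_fp16(1, 4096, 4096)    # ~1 FLOP/Byte    -> memory-bound
```

With M = 1 the B matrix dominates the byte count and is touched once per FLOP pair, so decode-shaped GEMMs sit far below any ridge point regardless of tiling.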

Sync-bound operators

A third category beyond compute- and memory-bound. Roofline analysis doesn't apply directly because the dominant cost is synchronization (__threadfence, atomicInc spin-waits, kernel launches, __syncthreads), not FLOPs or bytes. Symptoms in rocprof: VALUUtilization and MemUnitBusy both low at the same time, plus long gaps between memory ops in ATT traces. Multi-block cooperative kernels (radix top-k, all-reduce, multi-stage fused ops) commonly land here.


gfx942 / MI300X memory-model notes

These are non-obvious behaviors that change what code is needed:

  • Device-scope atomicAdd to global memory completes through L2 with global visibility — a __threadfence() immediately after such an atomicAdd is redundant on gfx942. Each unnecessary __threadfence() costs ~4 μs on MI300X (measured). Removing 3 redundant fences from a multi-block top-k kernel saved ~12 μs end-to-end.
  • Counters read only by the host need no in-kernel fence — the implicit kernel-end fence at completion is sufficient.
  • Slots allocated by atomicAdd are unique by construction — if each block's output position is allocated this way, there is no write conflict and therefore no fence needed to order writes.
  • Persistent kernels are very cheap on gfx942 — converting a 3-launch host loop into a single in-kernel pass loop saves both launch overhead and unlocks further fence removal (no host-visible state between passes).
  • Always document the safety argument inline when removing a fence — even when correct, the next reader cannot audit a silent removal.
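The "unique by construction" argument can be illustrated with a host-side analog. Like HIP's atomicAdd, the fetch_add below returns the pre-increment value, so every block-equivalent claims a disjoint slot range even under contention; no write ever conflicts, which is the whole safety argument for dropping the fence. This is a Python sketch, not HIP code; the class and helper names are made up:

```python
import threading

class AtomicCounter:
    """Host-side stand-in for a device-scope atomicAdd on a global counter."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_add(self, n: int) -> int:
        with self._lock:
            old = self._value  # like atomicAdd, return the OLD value
            self._value += n
            return old

def claim_slots(counter, out, per_block=4):
    # Each "block" reserves a contiguous range of output slots.
    # No two claimants ever share a slot, so no ordering fence is
    # needed between their writes.
    base = counter.fetch_add(per_block)
    for i in range(per_block):
        out[base + i] = threading.get_ident()

counter = AtomicCounter()
out = [None] * 64
threads = [threading.Thread(target=claim_slots, args=(counter, out))
           for _ in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# All 64 slots end up filled exactly once, in whatever claim order occurred.
```

The same reasoning transfers to the device: uniqueness comes from the atomic's return value, not from any fence, which is why the fence removal in the top-k kernel above was safe.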

HIP Launch Config Guidelines

Block size selection (CDNA)

  • Wavefront = 64 threads; block size should always be a multiple of 64
  • 64 threads: low occupancy, max registers per thread
  • 128 threads: balanced
  • 256 threads: high occupancy, fewer registers

Occupancy targets

  • Memory-bound: maximize occupancy (hide latency with more waves)
  • Compute-bound: moderate occupancy, use saved registers for tiling/ILP
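The register/occupancy trade-off can be estimated back-of-envelope. A sketch assuming 512 VGPRs per SIMD, 64 KB LDS per CU, 4 SIMDs per CU, and a hardware cap of 8 waves per SIMD; it ignores VGPR allocation granularity, so treat the compiler's resource report and rocprof as ground truth:

```python
def waves_per_cu(vgprs_per_thread: int, lds_bytes_per_block: int,
                 threads_per_block: int, vgpr_budget: int = 512,
                 lds_budget: int = 64 * 1024, simds_per_cu: int = 4,
                 max_waves_per_simd: int = 8) -> int:
    """Rough occupancy estimate: waves resident per CU, min over the limiters."""
    waves_per_block = threads_per_block // 64  # wavefront = 64
    # Register limiter: how many waves fit in each SIMD's VGPR file.
    reg_limit = (vgpr_budget // vgprs_per_thread) * simds_per_cu
    # LDS limiter: co-resident blocks share the CU's 64 KB.
    blocks_by_lds = (lds_budget // lds_bytes_per_block
                     if lds_bytes_per_block else 10**9)
    lds_limit = blocks_by_lds * waves_per_block
    hw_limit = max_waves_per_simd * simds_per_cu
    return min(reg_limit, lds_limit, hw_limit)

# 256-thread blocks, 64 VGPRs/thread, 16 KB LDS each: LDS is the limiter.
waves = waves_per_cu(64, 16 * 1024, 256)  # -> 16 waves per CU
```

For memory-bound kernels you want this number high; for compute-bound kernels a lower value can be fine if the freed registers buy tiling/ILP, per the targets above.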

LDS usage

  • 64 KB per CU, shared across all active wavefronts
  • Bank conflicts: 32 banks, 4-byte stride
  • Padding trick: allocate [N][M+1] instead of [N][M] to avoid conflicts
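The padding trick can be checked arithmetically: with 32 banks of 4 bytes, a column walk through a [32][32] float tile maps every access to the same bank, while a [32][33] tile spreads the column across all 32 banks. A sketch assuming 4-byte elements:

```python
BANKS = 32  # CDNA LDS: 32 banks, 4-byte wide

def column_banks(rows: int, row_stride_elems: int, col: int = 0,
                 elem_bytes: int = 4):
    """Bank index touched by each thread reading one column of an LDS tile."""
    return [((r * row_stride_elems + col) * elem_bytes // 4) % BANKS
            for r in range(rows)]

unpadded = column_banks(32, 32)  # [32][32]: every access hits one bank
padded = column_banks(32, 33)    # [32][33]: accesses spread over all banks
```

The extra column costs 4 bytes per row of LDS but turns a 32-way serialized access into a fully parallel one.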

ROCm profiling toolchain

Tool                                 Use
rocprof (legacy)                     Quick --stats summaries, --pmc counter sweeps. Good for first-pass triage.
rocprofv3                            Newer; richer event model. Required for ATT (Advanced Thread Trace).
rocprofv3 --att + ThreadTraceView    Per-instruction trace with cycle-level timing. Essential for sync-bound ops: shows the actual gaps between barriers/fences that simple counter views miss. Output: .att (visualization) + .csv (instruction & cycle data).
omniperf / omnitrace                 Higher-level dashboards over rocprof data. Good for "where is time going" surveys.

For sync-bound debugging, prefer rocprofv3 --att over counter sweeps — counters average across the kernel and hide the per-barrier gaps that are exactly the thing you need to see.


rocprof Counters Quick Reference

Counter            What it tells you
FETCH_SIZE         Bytes read from HBM
WRITE_SIZE         Bytes written to HBM
VALUUtilization    % of cycles the VALU is active
SALUUtilization    % of cycles the SALU is active
LDSBankConflict    LDS bank-conflict count
MemUnitBusy        % of cycles the memory unit is busy
Wavefronts         Total wavefronts launched
VALUInsts          VALU instructions executed
SALUInsts          SALU instructions executed
LDSInsts           LDS instructions executed
FlatVMemInsts      Flat/global memory instructions
WriteUnitStalled   % of cycles the write unit is stalled

Useful derived metrics

  • Achieved BW = (FETCH_SIZE + WRITE_SIZE) / kernel_duration
  • VALU efficiency = VALUInsts / (total_cycles * num_CUs)
  • LDS conflict ratio = LDSBankConflict / LDSInsts
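The derived metrics are one-liners over a counter row. A sketch with made-up counter values, assuming byte units for FETCH_SIZE/WRITE_SIZE as the table above states (check your rocprof version's units before trusting absolute numbers):

```python
def achieved_bw_gbs(fetch_bytes: float, write_bytes: float,
                    duration_s: float) -> float:
    """Achieved HBM bandwidth in GB/s from FETCH_SIZE + WRITE_SIZE."""
    return (fetch_bytes + write_bytes) / duration_s / 1e9

def lds_conflict_ratio(bank_conflicts: int, lds_insts: int) -> float:
    """Fraction of LDS instructions that hit a bank conflict."""
    return bank_conflicts / lds_insts if lds_insts else 0.0

# Example: 2 GiB fetched + 1 GiB written in 1 ms -> ~3221 GB/s,
# well under MI300X's 5.3 TB/s peak, suggesting the kernel is not
# yet bandwidth-limited (or is sync-bound; cross-check with ATT).
bw = achieved_bw_gbs(2 * 2**30, 1 * 2**30, 1e-3)
```

Comparing achieved bandwidth against the peak-BW figures earlier in this document is the fastest sanity check on whether a memory-bound kernel has headroom left.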