
GPU Architecture Reference

AMD CDNA Architecture Peak Performance

MI250X (gfx90a)

Dtype         Peak TFLOPS
FP64          47.9
FP32          47.9
FP16 / BF16   383.0
INT8          383.0
  • HBM2e bandwidth: 3.2 TB/s (aggregate, 2 GCDs)
  • Per-GCD: 1.6 TB/s
  • LDS bandwidth: ~51.2 TB/s per GCD
  • L2 cache: 8 MB per GCD
  • LDS: 64 KB per CU
  • Max VGPRs: 512 per SIMD (registers are 32-bit; FP64 values occupy register pairs)
  • Wavefront size: 64

MI300X (gfx942)

Dtype         Peak TFLOPS
FP64          163.4
FP32          163.4
FP16 / BF16   1307.4
FP8           2614.9
INT8          2614.9
  • HBM3 bandwidth: 5.3 TB/s
  • L2 cache: 4 MB per XCD; Infinity Cache: 256 MB (shared across XCDs)
  • LDS: 64 KB per CU
  • Max VGPRs: 512 per SIMD (unified register file)
  • Wavefront size: 64
  • 8 XCDs, 304 CUs total

Roofline Analysis

Arithmetic Intensity (AI) = FLOPs / Bytes accessed

If AI < Peak FLOPS / Peak BW  =>  Memory-bound
If AI > Peak FLOPS / Peak BW  =>  Compute-bound

Ridge point (AI where compute meets memory)

GPU                FP16/BF16 ridge    FP8 ridge
MI250X (per GCD)   ~120 FLOP/Byte     -
MI300X             ~247 FLOP/Byte     ~493 FLOP/Byte

(MI250X per GCD: 191.5 TFLOPS / 1.6 TB/s ≈ 120 FLOP/Byte.)
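The ridge points follow directly from dividing peak FLOPS by peak bandwidth. A minimal sketch, using the MI300X numbers from the tables above (function names are illustrative):

```python
def ridge_point(peak_tflops: float, peak_bw_tbs: float) -> float:
    """AI (FLOP/Byte) where the compute and memory rooflines meet.

    TFLOP/s divided by TB/s cancels the tera factors, leaving FLOP/Byte.
    """
    return peak_tflops / peak_bw_tbs

def bound(ai: float, ridge: float) -> str:
    """Classify an operator by comparing its AI to the ridge point."""
    return "memory-bound" if ai < ridge else "compute-bound"

mi300x_fp16 = ridge_point(1307.4, 5.3)  # ~247 FLOP/Byte
mi300x_fp8 = ridge_point(2614.9, 5.3)   # ~493 FLOP/Byte
```

Any operator whose AI sits below the ridge cannot reach peak FLOPS no matter how well it is scheduled; bandwidth is the ceiling.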

Common operator AI ranges

Operator                            Typical AI   Bound
PA Decode (single query)            1-4          Memory
PA Prefill (long seq)               64-256       Compute
GEMM (large M,N,K)                  128-512      Compute
GEMM (small M, decode)              2-16         Memory
RMSNorm                             2-4          Memory
RoPE                                4-8          Memory
Activation (SiLU, GELU)             ~1           Memory
MoE routing                         1-2          Memory
TopK / Radix select (multi-block)   n/a          Sync
Custom all-reduce (intra-node)      n/a          Sync
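The GEMM rows fall out of the AI formula directly: a GEMM does 2·M·N·K FLOPs over the A, B, and C tiles. A sketch for FP16, assuming each matrix moves between HBM and the chip exactly once (an idealization; real kernels see less reuse, which is why the table's large-GEMM range tops out lower):

```python
def gemm_ai_fp16(M: int, N: int, K: int) -> float:
    """Ideal arithmetic intensity of an FP16 GEMM, single pass per matrix."""
    flops = 2 * M * N * K
    bytes_moved = 2 * (M * K + K * N + M * N)  # fp16 = 2 bytes/element
    return flops / bytes_moved

large = gemm_ai_fp16(4096, 4096, 4096)  # ~1365 FLOP/Byte -> compute-bound
decode = gemm_ai_fp16(1, 4096, 4096)    # ~1 FLOP/Byte    -> memory-bound
```

With M = 1 the B matrix dominates the byte count and is touched once per FLOP pair, so decode-shaped GEMMs sit far below any ridge point regardless of tiling.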

Sync-bound operators

A third category beyond compute- and memory-bound. Roofline analysis doesn't apply directly because the dominant cost is synchronization (__threadfence, atomicInc spin-waits, kernel launches, __syncthreads), not FLOPs or bytes. Symptoms in rocprof: VALUUtilization and MemUnitBusy both low at the same time, plus long gaps between memory ops in ATT traces. Multi-block cooperative kernels (radix top-k, all-reduce, multi-stage fused ops) commonly land here.


gfx942 / MI300X memory-model notes

These are non-obvious behaviors that change what code is needed:

  • Device-scope atomicAdd to global memory completes through L2 with global visibility — a __threadfence() immediately after such an atomicAdd is redundant on gfx942. Each unnecessary __threadfence() costs ~4 μs on MI300X (measured). Removing 3 redundant fences from a multi-block top-k kernel saved ~12 μs end-to-end.
  • Counters read only by the host need no in-kernel fence — the implicit kernel-end fence at completion is sufficient.
  • Slots allocated by atomicAdd are unique by construction — if each block's output position is allocated this way, there is no write conflict and therefore no fence needed to order writes.
  • Persistent kernels are very cheap on gfx942 — converting a 3-launch host loop into a single in-kernel pass loop saves both launch overhead and unlocks further fence removal (no host-visible state between passes).
  • Always document the safety argument inline when removing a fence — even when correct, the next reader cannot audit a silent removal.
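The "unique by construction" argument can be illustrated with a host-side analog. Like HIP's atomicAdd, the fetch_add below returns the pre-increment value, so every block-equivalent claims a disjoint slot range even under contention; no write ever conflicts, which is the whole safety argument for dropping the fence. This is a Python sketch, not HIP code; the class and helper names are made up:

```python
import threading

class AtomicCounter:
    """Host-side stand-in for a device-scope atomicAdd on a global counter."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_add(self, n: int) -> int:
        with self._lock:
            old = self._value  # like atomicAdd, return the OLD value
            self._value += n
            return old

def claim_slots(counter, out, per_block=4):
    # Each "block" reserves a contiguous range of output slots.
    # No two claimants ever share a slot, so no ordering fence is
    # needed between their writes.
    base = counter.fetch_add(per_block)
    for i in range(per_block):
        out[base + i] = threading.get_ident()

counter = AtomicCounter()
out = [None] * 64
threads = [threading.Thread(target=claim_slots, args=(counter, out))
           for _ in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# All 64 slots end up filled exactly once, in whatever claim order occurred.
```

The same reasoning transfers to the device: uniqueness comes from the atomic's return value, not from any fence, which is why the fence removal in the top-k kernel above was safe.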

HIP Launch Config Guidelines

Block size selection (CDNA)

  • Wavefront = 64 threads; block size should always be a multiple of 64
  • 64 threads: low occupancy, max registers per thread
  • 128 threads: balanced
  • 256 threads: high occupancy, fewer registers

Occupancy targets

  • Memory-bound: maximize occupancy (hide latency with more waves)
  • Compute-bound: moderate occupancy, use saved registers for tiling/ILP
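The register/occupancy trade-off can be estimated back-of-envelope. A sketch assuming 512 VGPRs per SIMD, 64 KB LDS per CU, 4 SIMDs per CU, and a hardware cap of 8 waves per SIMD; it ignores VGPR allocation granularity, so treat the compiler's resource report and rocprof as ground truth:

```python
def waves_per_cu(vgprs_per_thread: int, lds_bytes_per_block: int,
                 threads_per_block: int, vgpr_budget: int = 512,
                 lds_budget: int = 64 * 1024, simds_per_cu: int = 4,
                 max_waves_per_simd: int = 8) -> int:
    """Rough occupancy estimate: waves resident per CU, min over the limiters."""
    waves_per_block = threads_per_block // 64  # wavefront = 64
    # Register limiter: how many waves fit in each SIMD's VGPR file.
    reg_limit = (vgpr_budget // vgprs_per_thread) * simds_per_cu
    # LDS limiter: co-resident blocks share the CU's 64 KB.
    blocks_by_lds = (lds_budget // lds_bytes_per_block
                     if lds_bytes_per_block else 10**9)
    lds_limit = blocks_by_lds * waves_per_block
    hw_limit = max_waves_per_simd * simds_per_cu
    return min(reg_limit, lds_limit, hw_limit)

# 256-thread blocks, 64 VGPRs/thread, 16 KB LDS each: LDS is the limiter.
waves = waves_per_cu(64, 16 * 1024, 256)  # -> 16 waves per CU
```

For memory-bound kernels you want this number high; for compute-bound kernels a lower value can be fine if the freed registers buy tiling/ILP, per the targets above.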

LDS usage

  • 64 KB per CU, shared across all active wavefronts
  • Bank conflicts: 32 banks, 4-byte stride
  • Padding trick: allocate [N][M+1] instead of [N][M] to avoid conflicts
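The padding trick can be checked arithmetically: with 32 banks of 4 bytes, a column walk through a [32][32] float tile maps every access to the same bank, while a [32][33] tile spreads the column across all 32 banks. A sketch assuming 4-byte elements:

```python
BANKS = 32  # CDNA LDS: 32 banks, 4-byte wide

def column_banks(rows: int, row_stride_elems: int, col: int = 0,
                 elem_bytes: int = 4):
    """Bank index touched by each thread reading one column of an LDS tile."""
    return [((r * row_stride_elems + col) * elem_bytes // 4) % BANKS
            for r in range(rows)]

unpadded = column_banks(32, 32)  # [32][32]: every access hits one bank
padded = column_banks(32, 33)    # [32][33]: accesses spread over all banks
```

The extra column costs 4 bytes per row of LDS but turns a 32-way serialized access into a fully parallel one.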

ROCm profiling toolchain

Tool                                 Use
rocprof (legacy)                     Quick --stats summaries, --pmc counter sweeps. Good for first-pass triage.
rocprofv3                            Newer; richer event model. Required for ATT (Advanced Thread Trace).
rocprofv3 --att + ThreadTraceView    Per-instruction trace with cycle-level timing. Essential for sync-bound ops: shows the actual gaps between barriers/fences that simple counter views miss. Output: .att (visualization) + .csv (instruction & cycle data).
omniperf / omnitrace                 Higher-level dashboards over rocprof data. Good for "where is time going" surveys.

For sync-bound debugging, prefer rocprofv3 --att over counter sweeps — counters average across the kernel and hide the per-barrier gaps that are exactly the thing you need to see.


rocprof Counters Quick Reference

Counter            What it tells you
FETCH_SIZE         Bytes read from HBM
WRITE_SIZE         Bytes written to HBM
VALUUtilization    % of cycles the VALU is active
SALUUtilization    % of cycles the SALU is active
LDSBankConflict    LDS bank-conflict count
MemUnitBusy        % of cycles the memory unit is busy
Wavefronts         Total wavefronts launched
VALUInsts          VALU instructions executed
SALUInsts          SALU instructions executed
LDSInsts           LDS instructions executed
FlatVMemInsts      Flat/global memory instructions
WriteUnitStalled   % of cycles the write unit is stalled

Useful derived metrics

  • Achieved BW = (FETCH_SIZE + WRITE_SIZE) / kernel_duration
  • VALU efficiency = VALUInsts / (total_cycles * num_CUs)
  • LDS conflict ratio = LDSBankConflict / LDSInsts
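The derived metrics are one-liners over a counter row. A sketch with made-up counter values, assuming byte units for FETCH_SIZE/WRITE_SIZE as the table above states (check your rocprof version's units before trusting absolute numbers):

```python
def achieved_bw_gbs(fetch_bytes: float, write_bytes: float,
                    duration_s: float) -> float:
    """Achieved HBM bandwidth in GB/s from FETCH_SIZE + WRITE_SIZE."""
    return (fetch_bytes + write_bytes) / duration_s / 1e9

def lds_conflict_ratio(bank_conflicts: int, lds_insts: int) -> float:
    """Fraction of LDS instructions that hit a bank conflict."""
    return bank_conflicts / lds_insts if lds_insts else 0.0

# Example: 2 GiB fetched + 1 GiB written in 1 ms -> ~3221 GB/s,
# well under MI300X's 5.3 TB/s peak, suggesting the kernel is not
# yet bandwidth-limited (or is sync-bound; cross-check with ATT).
bw = achieved_bw_gbs(2 * 2**30, 1 * 2**30, 1e-3)
```

Comparing achieved bandwidth against the peak-BW figures earlier in this document is the fastest sanity check on whether a memory-bound kernel has headroom left.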