Arithmetic Intensity (AI) = FLOPs / Bytes accessed
If AI < Peak FLOPS / Peak BW => Memory-bound
If AI > Peak FLOPS / Peak BW => Compute-bound
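The classification above is easy to script. A minimal sketch — the peak numbers in the example call are illustrative MI300X-class figures, not exact datasheet values:

```python
def bound_class(flops, bytes_accessed, peak_flops, peak_bw):
    """Classify a kernel as memory- or compute-bound via the roofline model."""
    ai = flops / bytes_accessed      # arithmetic intensity, FLOP/Byte
    ridge = peak_flops / peak_bw     # ridge point, FLOP/Byte
    return "memory-bound" if ai < ridge else "compute-bound"

# Example: a decode-shaped op with AI ~2 on a GPU whose ridge is ~245
print(bound_class(flops=2e9, bytes_accessed=1e9,
                  peak_flops=1.3e15, peak_bw=5.3e12))  # → memory-bound
```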
## Ridge point (AI where compute meets memory)

| GPU | FP16/BF16 ridge | FP8 ridge |
|---|---|---|
| MI250X (per GCD) | 239 FLOP/Byte | - |
| MI300X | 247 FLOP/Byte | 493 FLOP/Byte |
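The MI300X ridge values follow directly from commonly quoted peak numbers (~1307 TFLOPS dense FP16/BF16, ~2614 TFLOPS FP8, ~5.3 TB/s HBM — treat these as approximate, and substitute your own datasheet figures):

```python
# Ridge point = peak compute / peak memory bandwidth.
# Peaks below are commonly quoted MI300X figures (dense, no sparsity).
PEAK_FP16_FLOPS = 1307e12   # ~1307 TFLOPS FP16/BF16
PEAK_FP8_FLOPS  = 2614e12   # ~2614 TFLOPS FP8
PEAK_HBM_BW     = 5.3e12    # ~5.3 TB/s HBM3

print(round(PEAK_FP16_FLOPS / PEAK_HBM_BW))  # → 247 FLOP/Byte
print(round(PEAK_FP8_FLOPS / PEAK_HBM_BW))   # → 493 FLOP/Byte
```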
## Common operator AI ranges

| Operator | Typical AI | Bound |
|---|---|---|
| PA Decode (single query) | 1-4 | Memory |
| PA Prefill (long seq) | 64-256 | Compute |
| GEMM (large M,N,K) | 128-512 | Compute |
| GEMM (small M, decode) | 2-16 | Memory |
| RMSNorm | 2-4 | Memory |
| RoPE | 4-8 | Memory |
| Activation (SiLU, GELU) | ~1 | Memory |
| MoE routing | 1-2 | Memory |
| TopK / Radix select (multi-block) | n/a | Sync |
| Custom all-reduce (intra-node) | n/a | Sync |
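Why GEMM appears in both the compute- and memory-bound rows falls out of the arithmetic. A sketch of the ideal-case AI (each matrix moved once; real kernels land lower because tiling re-reads data):

```python
def gemm_ai(M, N, K, bytes_per_elt=2):
    """Upper-bound AI for C = A @ B in FP16 (2 bytes/element), assuming
    each matrix crosses the memory bus exactly once."""
    flops = 2 * M * N * K
    bytes_moved = bytes_per_elt * (M * K + K * N + M * N)
    return flops / bytes_moved

print(round(gemm_ai(4096, 4096, 4096)))  # → 1365, deep in compute-bound
print(round(gemm_ai(1, 4096, 4096), 1))  # → 1.0, decode GEMV, memory-bound
```

Shrinking M to a decode-sized batch collapses the FLOPs while the weight-matrix traffic (K·N) stays fixed, which is exactly why decode GEMMs drop into the memory-bound row.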
## Sync-bound operators

A third category beyond compute-/memory-bound. Roofline analysis doesn't apply directly because the dominant cost is synchronization — `__threadfence`, spinning on `atomicInc`, kernel launch, `__syncthreads` — not FLOPs or bytes. Symptoms in rocprof: both VALUUtilization and MemUnitBusy simultaneously low, and long gaps between memory ops in ATT traces. Multi-block cooperative kernels (radix top-k, all-reduce, multi-stage fused ops) commonly land here.
## gfx942 / MI300X memory-model notes

These are non-obvious behaviors that change what code is needed:

- Device-scope `atomicAdd` to global memory completes through L2 with global visibility — a `__threadfence()` immediately after such an `atomicAdd` is redundant on gfx942. Each unnecessary `__threadfence()` costs ~4 μs on MI300X (measured). Removing 3 redundant fences from a multi-block top-k kernel saved ~12 μs end-to-end.
- Counters read only by the host need no in-kernel fence — the implicit kernel-end fence at completion is sufficient.
- Slots allocated by `atomicAdd` are unique by construction — if each block's output position is allocated this way, there is no write conflict and therefore no fence is needed to order the writes.
- Persistent kernels are very cheap on gfx942 — converting a 3-launch host loop into a single in-kernel pass loop saves launch overhead and unlocks further fence removal (no host-visible state between passes).
- Always document the safety argument inline when removing a fence — even when the removal is correct, the next reader cannot audit a silent one.
## HIP Launch Config Guidelines

### Block size selection (CDNA)

- Wavefront = 64 threads, so block size should always be a multiple of 64
- 64 threads: low occupancy, max registers per thread
- 128 threads: balanced
- 256 threads: high occupancy, fewer registers per thread

### Occupancy targets

- Memory-bound: maximize occupancy (hide latency with more in-flight waves)
- Compute-bound: moderate occupancy; spend the saved registers on tiling/ILP

### LDS usage

- 64 KB per CU, shared across all active wavefronts
- Bank conflicts: 32 banks, 4-byte stride
- Padding trick: allocate `[N][M+1]` instead of `[N][M]` to avoid conflicts
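The padding trick can be checked arithmetically: with 32 banks at 4-byte stride, bank = (word index) mod 32, so walking a column of a `[32][32]` float tile hits one bank 32 times, while `[32][33]` spreads row starts across all banks. A quick sketch:

```python
def banks_hit(rows, row_stride_words, num_banks=32):
    """Distinct LDS banks touched when lanes read element [r][0] of
    successive rows of a tile (4-byte elements: one word each)."""
    return {(r * row_stride_words) % num_banks for r in range(rows)}

# Unpadded [32][32] tile: every row start maps to bank 0 → 32-way conflict.
print(len(banks_hit(32, row_stride_words=32)))  # → 1
# Padded [32][33] tile: row starts spread over all 32 banks, conflict-free.
print(len(banks_hit(32, row_stride_words=33)))  # → 32
```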
## ROCm profiling toolchain

| Tool | Use |
|---|---|
| rocprof (legacy) | Quick `--stats` summary, `--pmc` counter sweeps. Good for first-pass triage. |
| rocprofv3 | Newer; richer event model. Required for ATT (Advanced Thread Trace). |
| rocprofv3 `--att` + ThreadTraceView | Per-instruction trace with cycle-level timing. Essential for sync-bound ops — shows the actual gaps between barriers/fences that simple counter views miss. Output: `.att` (visualization) + `.csv` (instruction & cycle data). |
| omniperf / omnitrace | Higher-level dashboards over rocprof data. Good for "where is time going" surveys. |
For sync-bound debugging, prefer `rocprofv3 --att` over counter sweeps — counters average across the whole kernel and hide the per-barrier gaps that are exactly the thing you need to see.