@willzhou-amd willzhou-amd commented Aug 5, 2025

Continuation of #736.

Note that the end-to-end (E2E) fused feed-forward (FF) kernel suffers from heavy atomic contention when the GEMM is large, so it typically outperforms the two-kernel implementation (a fused GEMM + activation + gating kernel followed by a second GEMM) only when M < 128.
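To make the contention point concrete, here is a minimal pure-Python sketch, not the kernel from this PR. It assumes the fused kernel splits the intermediate dimension into tiles and that each tile adds its partial result into the shared output, which is where atomics would come in. The names `fused_ff_tiled`, `two_kernel_ff`, and `tile_n` are hypothetical, and the ungated ReLU variant is used only to keep the example short.

```python
def matmul(a, b):
    """Plain matrix multiply on lists of lists (a: m x k, b: k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def two_kernel_ff(x, w_up, w_down):
    """Reference: GEMM + ReLU, then a second GEMM. No accumulation conflicts."""
    hidden = [[max(h, 0.0) for h in row] for row in matmul(x, w_up)]
    return matmul(hidden, w_down)


def fused_ff_tiled(x, w_up, w_down, tile_n):
    """Assumed fused scheme: one 'program' per tile of the intermediate dim.

    Each tile computes its slice of the hidden activations, multiplies by the
    matching rows of w_down, and adds the partial result into the output.
    On a GPU those adds would have to be atomic, because every tile writes
    to the same output elements.
    """
    m, n, k_out = len(x), len(w_up[0]), len(w_down[0])
    out = [[0.0] * k_out for _ in range(m)]
    atomic_adds = 0
    for n0 in range(0, n, tile_n):
        w_up_tile = [row[n0:n0 + tile_n] for row in w_up]  # columns n0..n0+tile_n
        w_down_tile = w_down[n0:n0 + tile_n]               # matching rows
        hidden = [[max(h, 0.0) for h in row] for row in matmul(x, w_up_tile)]
        partial = matmul(hidden, w_down_tile)
        for i in range(m):
            for j in range(k_out):
                out[i][j] += partial[i][j]  # stands in for an atomic add
                atomic_adds += 1
    return out, atomic_adds
```

Under this assumption the number of conflicting adds is M × (output width) × (number of intermediate tiles), so it grows linearly with M, which is consistent with the fused path only winning at small M.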

Adds:

  • E2E fused gated GEMM (e.g., FF with SwiGLU)
  • E2E fused ungated GEMM (e.g., FF with ReLU)
  • Benchmarks.
  • Tests.
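For readers unfamiliar with the two variants, the following is a rough pure-Python reference of what a gated and an ungated FF compute. It is a sketch under assumptions, not the code added in this PR: the function names, the separate `w_gate`/`w_up` weights, and the row-major layout are all hypothetical, and the actual kernels may pack or transpose the weights differently.

```python
import math


def matmul(a, b):
    """Plain matrix multiply on lists of lists (a: m x k, b: k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def silu(v):
    """SiLU (swish) activation, the nonlinearity used inside SwiGLU."""
    return v / (1.0 + math.exp(-v))


def relu(v):
    return max(v, 0.0)


def ff_gated(x, w_gate, w_up, w_down, act=silu):
    """Gated FF: (act(x @ w_gate) * (x @ w_up)) @ w_down."""
    gate = matmul(x, w_gate)
    up = matmul(x, w_up)
    hidden = [[act(g) * u for g, u in zip(g_row, u_row)]
              for g_row, u_row in zip(gate, up)]
    return matmul(hidden, w_down)


def ff_ungated(x, w_up, w_down, act=relu):
    """Ungated FF: act(x @ w_up) @ w_down."""
    hidden = [[act(h) for h in row] for row in matmul(x, w_up)]
    return matmul(hidden, w_down)
```

A reference like this is the kind of thing a test could compare fused-kernel output against, within a floating-point tolerance.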

All metrics were collected on a single MI350X.

Metrics

.do_bench: (2 images)

RocprofV3: (2 images)

@willzhou-amd willzhou-amd self-assigned this Aug 5, 2025
@willzhou-amd willzhou-amd requested review from azaidy and vgokhale August 5, 2025 20:39
@rahulbatra85 rahulbatra85 merged commit 1a725b1 into main Aug 19, 2025
14 checks passed
@rahulbatra85 rahulbatra85 deleted the willz/e2e-fused-ff branch August 19, 2025 20:57