@willzhou-amd willzhou-amd commented Jul 28, 2025

Adds fused GEMM + activation + gating options to reduce memory-movement overhead in a feed-forward (FF) block.

Changes:

  • Add an activation string parameter to gemm_a16w16 to fuse an activation function with the GEMM.
  • Add gemm_a16w16_gating.py, which treats the first half of the (M, N) output along the N dimension as the gating activations and the second half as the layer activations, producing a tensor of shape (M, N//2). This kernel also takes an activation parameter for the gating activation.
  • Write tests.
  • Write a fully fused end-to-end FF kernel (to land in a follow-up PR).
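For reference, the unfused semantics of the gating kernel described above can be sketched in NumPy. This is a hypothetical reference implementation, not the Triton kernel itself; the function name mirrors the kernel file name, and the split order (gating half first along N) follows the PR description.

```python
import numpy as np

def gemm_a16w16_gating_reference(x, w, activation="silu"):
    """Unfused reference for fused GEMM + gating.

    x: (M, K) activations, w: (K, N) weights with N even.
    The first N//2 output columns are the gating branch, the
    second N//2 are the layer branch; the result is (M, N//2).
    """
    y = x @ w                       # plain (M, N) GEMM
    n = y.shape[-1] // 2
    gate, up = y[:, :n], y[:, n:]   # split along the N dimension
    if activation == "silu":
        gate = gate * (1.0 / (1.0 + np.exp(-gate)))  # silu(x) = x * sigmoid(x)
    elif activation == "relu":
        gate = np.maximum(gate, 0.0)
    else:
        raise ValueError(f"unsupported activation: {activation}")
    return gate * up                # gated output, shape (M, N//2)
```

The fused kernel computes the same result in one pass, avoiding materializing the full (M, N) intermediate in global memory.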

Benchmarks on 350x: [benchmark image]

Post-tuning: [benchmark image]

Benchmarks on 300x: [benchmark image]

Multikernel-ff is the current baseline (i.e., launching two Aiter Triton GEMMs).

@willzhou-amd willzhou-amd self-assigned this Jul 28, 2025
@rahulbatra85 rahulbatra85 left a comment


Please add logging for the new op

@rahulbatra85 rahulbatra85 merged commit 4822e67 into main Aug 2, 2025
14 checks passed
@rahulbatra85 rahulbatra85 deleted the willz/fused-ff-gemms branch August 2, 2025 01:50
