Conversation

@tjtanaa (Contributor) commented Dec 18, 2025

Motivation

There has been strong interest in the vLLM community in using AITER Triton kernels on Radeon, and testing has shown significant performance benefits on Radeon GPUs as well: vllm-project/vllm#28649 (comment)

We are working with @hongxiayang on enabling and upstreaming all of the tuning configs and tuning scripts to AITER so that Radeon is properly supported.

The work will be broken down into multiple PRs for upstreaming to AITER.

Technical Details

Phase 1 (Done)

Tasks

Understand which triton op is failing for gfx1201
Run unit tests using the community patch vllm-project/vllm#28649

The results below are based on commit f4e4188.

All of the important Triton kernels run on RDNA 4:

  1. test_gemm_a8w8.log: all passed.
  2. test_gemm_a8w8_per_token_scale.log: all passed.
  3. test_gemm_a8w8_block_scale.log: all passed.
  4. test_batched_gemm_a8w8.log: all passed, aside from some OOM cases.
  5. test_batched_gemm_bf16.log: all passed, aside from some OOM cases.
  6. test_moe.log: passed with only 4/850 failures.
  7. test_unified_attention.log: passed except for 41/823 failures caused by a hardware config issue, which has been fixed in EmbeddedLLM@1574097 on our branch.
  8. test_rmsnorm.log: all passed, aside from a very small numerical mismatch and some OOM cases.
  9. test_mha.log: forward pass all working, aside from some OOM cases.
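
For context, the a8w8 GEMM tests boil down to comparing the fused Triton kernel against a dequantize-then-matmul reference. A minimal sketch of that check, assuming per-tensor scales for simplicity (the per-token and per-block variants in items 2 and 3 differ only in scale shape; names, shapes, and tolerances here are illustrative, not AITER's actual test code):

```python
import torch

def ref_gemm_a8w8(a_int8, w_int8, a_scale, w_scale):
    # Dequantize-then-matmul reference. The fused Triton kernel computes the
    # same result in a single pass over the int8 inputs and scales.
    out = torch.matmul(a_int8.float(), w_int8.float().t())
    return (out * a_scale * w_scale).to(torch.bfloat16)

# Illustrative shapes; the real tests sweep many (M, N, K) combinations.
M, N, K = 128, 256, 512
a = torch.randint(-128, 128, (M, K), dtype=torch.int8, device="cuda")
w = torch.randint(-128, 128, (N, K), dtype=torch.int8, device="cuda")
a_scale = torch.rand((), device="cuda")
w_scale = torch.rand((), device="cuda")

ref = ref_gemm_a8w8(a, w, a_scale, w_scale)
# out = <call the Triton a8w8 GEMM kernel here>
# torch.testing.assert_close(out, ref, rtol=1e-2, atol=1e-2)
```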

Phase 2 (Enable GPU Arch on gfx1201)

Tasks:

  1. Add gfx1201 to the supported GPU arches.
  2. Run all unit tests and make sure the kernels that matter for actual deployments pass; fix any failures.
  3. Add tuning scripts for Radeon GPUs with a proper search space (see the sketch after this list).
  4. Evaluate the performance gain in vLLM.
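
For task 3, the search space needs to reflect RDNA's hardware characteristics rather than Instinct's. A hedged sketch of what such a space could look like (the block-size names and ranges are illustrative assumptions, not the actual AITER tuning script):

```python
import triton

# Illustrative RDNA search space (an assumption, not AITER's actual tuning
# script). RDNA 4 executes in wave32 and has different LDS and register
# budgets than CDNA, so the sweep favors smaller tiles and fewer pipeline
# stages than an Instinct-oriented search space typically would.
def rdna_gemm_search_space():
    return [
        triton.Config(
            {"BLOCK_M": bm, "BLOCK_N": bn, "BLOCK_K": bk},
            num_warps=nw,
            num_stages=ns,
        )
        for bm in (32, 64, 128)
        for bn in (32, 64, 128)
        for bk in (32, 64)
        for nw in (2, 4)
        for ns in (1, 2)
    ]

# These configs can be passed to @triton.autotune(configs=..., key=["M", "N", "K"])
# on a kernel, or swept offline with the best config per shape recorded so the
# autotune cost stays off the serving hot path.
```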

The checklist for this phase is the list of ops to enable:

Current progress:

  • gemm_a16w16 ✅
  • gemm_a8w8_block_scale ✅
  • gmm
  • gemm_a8w8 ✅
  • gemm_a8w8_per_token_scale ✅
  • batched_gemm_a8w8
  • batched_gemm_bf16
  • unified_attention ✅
  • moe
  • rmsnorm
  • mha (forward)
  • gemm_a16w16_gated

Test Plan

Ensure we can run the unit tests.
Ensure the kernels are tuned.
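
Both the arch enablement (Phase 2, task 1) and the selection of tuned configs depend on identifying the GPU at runtime. A minimal detection sketch on ROCm builds of PyTorch (the exact gating AITER uses may differ):

```python
import torch

def is_gfx1201() -> bool:
    # On ROCm builds of PyTorch, gcnArchName reports the device target,
    # e.g. "gfx1201" for RDNA 4 or "gfx942" for MI300-series Instinct GPUs
    # (possibly with feature suffixes, hence startswith).
    props = torch.cuda.get_device_properties(torch.cuda.current_device())
    return getattr(props, "gcnArchName", "").startswith("gfx1201")
```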

Test Result

Phase 1 (DONE): Test results of the unit tests using the community patch.

  1. Run all unit tests and make sure the kernels that matter for actual deployments pass; fix any failures.

The per-test results are identical to those listed under Technical Details → Phase 1 above.

Submission Checklist

@tjtanaa (Contributor, Author) commented Dec 18, 2025

CC @hongxiayang @mgehre-amd

@tjtanaa (Contributor, Author) commented Dec 18, 2025

@valarLip could we get some preliminary thoughts on this? Are there any concerns with us upstreaming the configs and tuning scripts, and would a fix like the unified attention fix (EmbeddedLLM@1574097) be acceptable?
We will do our best to keep the changes as small as possible, since the main AITER repo is designed for Instinct GPUs.
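
One way to keep the diff small, sketched below under the assumption that tuned configs can live in standalone per-arch files (the file layout and loader here are hypothetical, not AITER's actual mechanism):

```python
import json
import os
from typing import Optional

# Hypothetical layout: tuned configs live in standalone per-arch JSON files,
# e.g. gemm_a8w8-gfx1201.json. Radeon support then only *adds* files; the
# Instinct code paths and their defaults are left untouched.
def load_tuned_config(op_name: str, arch: str, config_dir: str) -> Optional[dict]:
    path = os.path.join(config_dir, f"{op_name}-{arch}.json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return None  # caller falls back to the existing (Instinct-tuned) defaults
```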
