Conversation

@Chi-Chu319 Chi-Chu319 commented Jul 3, 2025

Authors: @Chi-Chu319 @juuso-oskari

WARNING: Don't merge this yet. It fails tests on small shapes with our num_stages=3 change. Investigation is ongoing. Reproducer:

pytest op_tests/triton_tests/test_moe.py::test_fused_moe[True-True-dtype0-False-False-True-64-14336-4096-2-8]

We have changed num_stages to 2 as a workaround; we will investigate this outside the scope of this PR. If it turns out to be compiler-related, we will raise it with the compiler team.
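
For context, this is roughly what a tuned-config entry looks like with the workaround applied. The key names and values below follow the usual fused-MoE tuning-config layout and are illustrative, not this repo's exact schema:

```python
# Hypothetical tuned-config entry; keys and values are illustrative only.
config = {
    "BLOCK_SIZE_M": 64,
    "BLOCK_SIZE_N": 128,
    "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 1,
    "num_warps": 8,
    "num_stages": 2,  # pinned to 2 for now; 3 fails the small-shape test above
}
```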

  • Fixed config loading
  • Added configs for MI350
  • Added various benchmark support for future tuning
  • EVEN_K loading of a and b (see the sketch after this list)
  • Fixed an int32 overflow when loading the 405B expert weight matrix (also covered in the sketch below)
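
A minimal Triton-style sketch of the two loading changes above. It is illustrative only: the helper name, argument list, and stride names mirror the common fused-MoE kernel layout and are assumptions, not this repo's exact code.

```python
import triton
import triton.language as tl


@triton.jit
def _load_b_tile(b_ptr, expert_ids_ptr, pid_m, k, K,
                 stride_be, stride_bk, stride_bn,
                 BLOCK_SIZE_K: tl.constexpr, BLOCK_SIZE_N: tl.constexpr,
                 EVEN_K: tl.constexpr):
    # Illustrative device-side helper: load one BLOCK_SIZE_K x BLOCK_SIZE_N tile
    # of the expert weight matrix B for K-iteration k.
    offs_k = tl.arange(0, BLOCK_SIZE_K)
    offs_bn = tl.arange(0, BLOCK_SIZE_N)
    # int32-overflow fix: for llama3-405B-sized expert weights, expert_id * stride_be
    # can exceed 2**31 - 1, so the offset is promoted to int64 before the pointer add.
    off_expert = tl.load(expert_ids_ptr + pid_m).to(tl.int64)
    b_ptrs = (b_ptr + off_expert * stride_be
              + offs_k[:, None] * stride_bk + offs_bn[None, :] * stride_bn)
    # EVEN_K: when K % BLOCK_SIZE_K == 0 every K-tile is full, so the bounds mask
    # (and the predicated load it forces) can be skipped.
    if EVEN_K:
        return tl.load(b_ptrs)
    return tl.load(b_ptrs, mask=offs_k[:, None] < K - k * BLOCK_SIZE_K, other=0.0)
```

The same EVEN_K branch applies to the A-tile load; the int64 promotion is the same change that later fixed the persistent-kernel segfaults (see the commit log below).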

Fixed XCD remapping: previously we used GRID_MN = EM // BLOCK_SIZE_M * num_pid_n for xcd_remap(pid, GRID_MN, NUM_XCD=8). Because EM = num_tokens_post_padded + dummy tokens, this sent all the dummy tiles, which have no work and return early, to the last XCD, so the last XCD sat idle while the other XCDs still had work. GRID_MN is now calculated as GRID_MN = num_tokens_post_padded // BLOCK_SIZE_M * num_pid_n (sketched below).
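
To make the fix concrete, here is a host-side Python illustration of the remap arithmetic (the kernel does the equivalent in Triton; the body of xcd_remap below is a typical XCD swizzle and is not copied from this repo):

```python
def xcd_remap(pid: int, grid_mn: int, num_xcd: int = 8) -> int:
    """Renumber a program id so that blocks dispatched round-robin across
    XCDs end up with contiguous logical ids on each XCD."""
    xcd = pid % num_xcd           # XCD this block was dispatched to
    local_pid = pid // num_xcd    # position of the block within its XCD
    pids_per_xcd = (grid_mn + num_xcd - 1) // num_xcd
    # When grid_mn is not divisible by num_xcd, the first few XCDs get one extra block.
    tall_xcds = grid_mn % num_xcd or num_xcd
    if xcd < tall_xcds:
        return xcd * pids_per_xcd + local_pid
    return (tall_xcds * pids_per_xcd
            + (xcd - tall_xcds) * (pids_per_xcd - 1) + local_pid)


# Before: GRID_MN = (EM // BLOCK_SIZE_M) * num_pid_n, so the trailing dummy tiles
#         were all renumbered onto the last XCD, which then sat idle.
# After:  GRID_MN = (num_tokens_post_padded // BLOCK_SIZE_M) * num_pid_n, so only
#         tiles with real work are spread across the 8 XCDs.
```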

Perf comparison to main

main:

Profiling Summary (kernel times in ns, TFLOPs, and Bandwidth in GB/s):
+------------------+-------------------+------------------------+--------+------------------+-----------------+
|    M | model     | _fused_moe_kernel | Other Kernels Sum (ns) | TFLOPs | Bandwidth (GB/s) | AI (Flops/Byte) |
+------------------+-------------------+------------------------+--------+------------------+-----------------+
| 128 llama3-405B  |    13150217.33    |       1164348.65       | 33.97  |     1063.87      |      31.93      |
| 256 llama3-405B  |    14980420.00    |       1185801.70       | 59.63  |      935.99      |      63.71      |
| 512 llama3-405B  |    17507314.60    |       1252937.57       | 102.05 |      804.49      |      126.86     |
| 1024 llama3-405B |    23240926.78    |       1388565.29       | 153.76 |      611.43      |      251.47     |
| 2048 llama3-405B |    33828796.57    |       1304083.53       | 211.26 |      427.50      |      494.18     |
+------------------+-------------------+------------------------+--------+------------------+-----------------+

moe-tuning-mi350:

Profiling Summary (kernel times in ns, TFLOPs, and Bandwidth in GB/s):
+------------------+-------------------+------------------------+--------+------------------+-----------------+
|    M | model     | _fused_moe_kernel | Other Kernels Sum (ns) | TFLOPs | Bandwidth (GB/s) | AI (Flops/Byte) |
+------------------+-------------------+------------------------+--------+------------------+-----------------+
| 128 llama3-405B  |     2838261.55    |       1094917.86       | 157.38 |     4929.11      |      31.93      |
| 256 llama3-405B  |     2931990.21    |       1110832.19       | 304.69 |     4782.27      |      63.71      |
| 512 llama3-405B  |     4357953.36    |       1192987.41       | 409.99 |     3231.90      |      126.86     |
| 1024 llama3-405B |     6993911.33    |       1260398.59       | 510.93 |     2031.81      |      251.47     |
| 2048 llama3-405B |    12335273.92    |       1241415.39       | 579.38 |     1172.41      |      494.18     |
+------------------+-------------------+------------------------+--------+------------------+-----------------+

@Chi-Chu319 Chi-Chu319 self-assigned this Jul 3, 2025
@juuso-oskari juuso-oskari self-assigned this Jul 3, 2025
@rahulbatra85 rahulbatra85 changed the title Moe tuning mi350 [TRITON]: Moe tuning mi350 Jul 7, 2025
@rahulbatra85

@Chi-Chu319 Please address black CI issues

@Chi-Chu319 Chi-Chu319 requested a review from rahulbatra85 July 9, 2025 13:12
@rahulbatra85 rahulbatra85 merged commit c7bca6a into main Jul 10, 2025
13 checks passed
@rahulbatra85 rahulbatra85 deleted the moe-tuning-mi350 branch July 10, 2025 15:57
fsx950223 pushed a commit that referenced this pull request Jul 11, 2025
* Updated benchmark script

* Fixed config parsing and mi350 config

* format

* Correct the xcd_remapping to use num_tokens_post_padded for the remapped pids, rather than the EM which contains the empty tiles

* fix

* New remapping for all

* fixed tests and moe remapping

* Linter

* chunked remap jit

* Fixed comments

* fix remap on fp4 and gelu moe, and remove TODOs from the persistent versions

* print time

* num_pid_m calculated based on tokens post padded in gelu, silu and fp4 moe versions

* fix pid grid

* MI350 medium M config tuned

* even_k

* silu fused moe even_k

* mi350 small tuning

* black linting

* remove outdated TODO in persistent

* reverse black linting for json files

* fix segfault in the persistent version by casting pointers to tl.int64

* fix segfault in the persistent version by casting pointers to tl.int64,
    gelu and silu fused

* int64

* Black linter

* Even K refactor

* num_stage=2

---------

Co-authored-by: Juuso Korhonen <40278371+juuso-oskari@users.noreply.github.com>
cagrikymk pushed a commit that referenced this pull request Jul 30, 2025
(same squashed commit message as above)