[TRITON]: Fp4gemm m=256 tuning #533
Conversation
@Chi-Chu319 LGTM!
Also, I was able to reproduce the performance numbers and everything looks good!
rahulbatra85
left a comment
Looks good except for the conflict. Also, can you please run the config changes by Cagri Mehmut and Ali?
Thanks!
I compared the performance against main and, aside from the following shapes, there is no regression, only performance improvement or no change. "x" means all M dimensions I tried. The regressions are likely due to the freshly changed configs that are still missing.
I am tuning them. Most of them are group-size related; with remapping you want a smaller group size.
Looks good to me. No regression for any of the shapes I tested (LLAMA- and DS-related ones), only improvements or no change.
juuso-oskari
left a comment
LGTM
Authors: @Chi-Chu319 @juuso-oskari
This PR primarily adds tuning for M=256, N=16384 and XCD remapping, along with some additional tunings.
The full performance results are available at
https://amdcloud-my.sharepoint.com/:x:/r/personal/alizaidy_amd_com/_layouts/15/doc2.aspx?sourcedoc=%7B2117442c-b906-49c5-8ace-0c07b925dc14%7D&action=edit&activeCell=%27fp4gemm_m256-tuning
We move away from split-K because we can satisfy the parallelism with GRID_MN while still respecting the constraint num_warps * 32 <= BLOCK_N (required for the preshuffled scales to work).
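To make the constraint concrete, here is a minimal, hypothetical sketch of filtering candidate tuning configs so that only those satisfying num_warps * 32 <= BLOCK_N are kept; the option lists and config format below are illustrative, not the actual tuning space used in this PR.

```python
import itertools

# Hypothetical option lists for illustration only; the real tuning space in
# this PR may differ.
BLOCK_N_OPTIONS = (32, 64, 128, 256)
NUM_WARPS_OPTIONS = (1, 2, 4, 8)

def candidate_configs():
    """Yield only (BLOCK_N, num_warps) combinations that keep the
    preshuffled-scales constraint num_warps * 32 <= BLOCK_N satisfied."""
    for block_n, num_warps in itertools.product(BLOCK_N_OPTIONS, NUM_WARPS_OPTIONS):
        if num_warps * 32 <= block_n:
            yield {"BLOCK_N": block_n, "num_warps": num_warps}
```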
We also made a chunked version of xcd_remap, which now brings a perf boost (as opposed to previously degrading perf). The benefit of the chunked version:
With the previous remapping, we divided all the pids into 8 chunks and sent one chunk to each of the 8 XCDs. This effectively made a single XCD process its own contiguous chunk of the B matrix of size (K x N // 8). This is good for L2 usage, but the L2 is most likely already saturated by caching the A matrix of size (M x K). At the same time it is disastrous for L3 caching, because the concurrent memory reads coming from different XCDs are now separated by up to N // 8 elements.
We solved this by having xcd_remap map into multiple contiguous chunks of size CHUNK_SIZE (a tunable variable) instead of a single contiguous chunk of size num_pid_n // 8.
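As a rough illustration of the idea (not the exact kernel code in this PR), a chunked remap can be written as a pure-Python sketch; the function name, the remainder handling, and num_xcds=8 are assumptions here.

```python
def remap_xcd_chunked(pid: int, grid_size: int, chunk_size: int, num_xcds: int = 8) -> int:
    """Hypothetical sketch of a chunked XCD remap (not the PR's exact code).

    The hardware dispatches consecutive pids round-robin across the XCDs
    (XCD = pid % num_xcds). Instead of giving each XCD one large contiguous
    range of pids (grid_size // num_xcds), this remap gives each XCD many
    small contiguous chunks of `chunk_size` pids, which keeps the concurrent
    reads from different XCDs closer together in memory.
    """
    super_chunk = num_xcds * chunk_size        # pids covered by one round of chunks
    base = (pid // super_chunk) * super_chunk  # start of the current super-chunk
    offset = pid - base
    xcd = offset % num_xcds                    # XCD this pid lands on by default
    local = offset // num_xcds                 # index within that XCD's chunk
    # Remainder handling for the last, partial super-chunk is omitted for brevity.
    return base + xcd * chunk_size + local
```

Note that with chunk_size = grid_size // num_xcds this degenerates into the previous single-chunk-per-XCD behavior, which is why making CHUNK_SIZE tunable matters.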
Performance
The performance for M=256, N=16384:
_gemm_afp4_wfp4_kernel: main (commit e31c2e0) vs. fp4gemm_m256-tuning
gemm_afp4wfp4_preshuffled_scales: main (commit e31c2e0) vs. fp4gemm_m256-tuning
Performance for varying M (showing that the pid remapping does not hurt perf and even improves it in some cases):