@Chi-Chu319 Chi-Chu319 commented Jun 10, 2025

Authors: @Chi-Chu319 @juuso-oskari

This PR primarily tunes the M=256, N=16384 shapes and reworks the XCD remapping, along with some related tuning.

The full performance is available at
https://amdcloud-my.sharepoint.com/:x:/r/personal/alizaidy_amd_com/_layouts/15/doc2.aspx?sourcedoc=%7B2117442c-b906-49c5-8ace-0c07b925dc14%7D&action=edit&activeCell=%27fp4gemm_m256-tuning

We move away from split-K because GRID_MN alone provides enough parallelism while still respecting the constraint num_warps * 32 <= BLOCK_N (required for the preshuffled scales to work).
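As a rough illustration of this trade-off, the sketch below counts the programs launched with and without split-K and checks the scale constraint. The tile sizes, `num_warps` values, and helper names are hypothetical and are not the configs tuned in this PR.

```python
def cdiv(a, b):
    """Ceiling division for positive integers."""
    return -(-a // b)


def grid_size(M, N, BLOCK_M, BLOCK_N, num_ksplit=1):
    """Number of programs launched: GRID_MN times the split-K factor."""
    grid_mn = cdiv(M, BLOCK_M) * cdiv(N, BLOCK_N)
    return grid_mn * num_ksplit


def scales_constraint_ok(num_warps, BLOCK_N):
    """Constraint quoted above for the preshuffled scales."""
    return num_warps * 32 <= BLOCK_N


# Hypothetical tile shape, chosen only for illustration.
M, N = 256, 16384
BLOCK_M, BLOCK_N, num_warps = 128, 128, 4

print("programs without split-K:", grid_size(M, N, BLOCK_M, BLOCK_N))
print("constraint satisfied:", scales_constraint_ok(num_warps, BLOCK_N))
```

If GRID_MN already gives enough programs to occupy the device, a split-K factor above 1 only adds the separate reduce kernel seen in the `main` tables.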

We also made a chunked version of xcd_remap, which now improves performance (the previous remapping degraded it). The benefit of the chunked version is as follows:

With the previous remapping, we divided all the pids into 8 chunks and sent one chunk to each of the 8 XCDs. Each XCD therefore processed its own contiguous chunk of the B matrix of size (K x N // 8). This is good for L2 usage, but the L2 is most likely already saturated by caching the A matrix of size (M x K). It is also disastrous for L3 caching, because concurrent memory reads coming from different XCDs are now separated as far as possible, by N // 8 elements.

We solved this by having xcd_remap map into multiple contiguous chunks of size CHUNK_SIZE (a tunable variable) instead of one contiguous chunk of size num_pid_n // 8.
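The difference between the two mappings can be sketched in plain Python. This is an illustration only, not the kernel code from this PR: it assumes the hardware hands program `pid` to XCD `pid % 8` in round-robin order, and the function names and the `grid` and `chunk_size` parameters are chosen here for the example.

```python
NUM_XCDS = 8


def remap_contiguous(pid, grid, num_xcds=NUM_XCDS):
    """Previous scheme (sketch): each XCD owns one contiguous range of tiles.

    Assumes grid is divisible by num_xcds.
    """
    pids_per_xcd = grid // num_xcds
    xcd = pid % num_xcds      # assumed round-robin hardware assignment
    local = pid // num_xcds   # index of this program within its XCD
    return xcd * pids_per_xcd + local


def remap_chunked(pid, grid, chunk_size, num_xcds=NUM_XCDS):
    """Chunked scheme (sketch): each XCD owns many small contiguous chunks."""
    span = num_xcds * chunk_size
    # Programs in the tail that does not fill a whole span keep their pid.
    if pid >= (grid // span) * span:
        return pid
    xcd = pid % num_xcds
    local = pid // num_xcds
    chunk_idx = local // chunk_size
    pos_in_chunk = local % chunk_size
    return chunk_idx * span + xcd * chunk_size + pos_in_chunk


def tiles_of_xcd(xcd, grid, remap, num_xcds=NUM_XCDS):
    """Tiles processed by one XCD under the given remap function."""
    return [remap(pid) for pid in range(grid) if pid % num_xcds == xcd]


grid = 64
print(tiles_of_xcd(0, grid, lambda p: remap_contiguous(p, grid)))
print(tiles_of_xcd(0, grid, lambda p: remap_chunked(p, grid, chunk_size=2)))
```

With the chunked mapping, tiles handled at the same time by different XCDs sit next to each other in N rather than N // 8 apart, which is the L3 locality argument made above.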

Performance

The performance for M=256, N=16384:

_gemm_afp4_wfp4_kernel

main (commit e31c2e0)

Profiling Summary (kernel times in µs, TFLOPs, and Bandwidth in GB/s):
+-----------------+---------------------------+----------------------------------+------------------------+--------------------+--------------------+
|    M | N | K    | _gemm_afp4_wfp4_kernel.kd | _gemm_afp4_wfp4_reduce_kernel.kd | Other Kernels Sum (µs) |       TFLOPs       |  Bandwidth (GB/s)  |
+-----------------+---------------------------+----------------------------------+------------------------+--------------------+--------------------+
|  256 16384 2048 |           19.073          |               None               |         69.081         | 900.7428922560688  | 1389.0278404026633 |
|  256 16384 2048 |           19.073          |               None               |         69.081         | 900.7428922560688  | 1389.0278404026633 |
|  256 16384 6656 |           44.991          |               None               |        119.574         | 1241.016533262208  | 1494.2465826498635 |
|  256 16384 8192 |          181.061          |               6.41               |        143.834         | 366.5605706269236  | 431.03140219020537 |
| 256 16384 13312 |          278.465          |              6.864               |        208.056         | 391.3697860925459  | 441.82921469601763 |
| 256 16384 16384 |          339.051          |               7.47               |         252.06         | 396.6251784798035  | 442.17570652283695 |
| 256 16384 26624 |          539.954          |              7.317               |         386.3          | 408.09452609767374 | 445.38221100697825 |
+-----------------+---------------------------+----------------------------------+------------------------+--------------------+--------------------+
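For readers who want to sanity-check the TFLOPs and Bandwidth columns, the sketch below shows one way they can be derived from the kernel time. The byte model is an assumption inferred from the numbers rather than taken from the benchmark script: fp4 operands at two values per byte, one 1-byte scale per 32 elements along K for both A and B, and a 2-byte output element. The columns appear to use only the main kernel time.

```python
def gemm_flops(M, N, K):
    """One multiply and one add per (m, n, k) triple."""
    return 2 * M * N * K


def gemm_bytes(M, N, K):
    """Assumed memory traffic for an fp4 x fp4 GEMM (see lead-in)."""
    a_bytes = M * K // 2               # fp4: two values per byte
    b_bytes = N * K // 2
    scale_bytes = (M + N) * K // 32    # assumed: 1 byte per 32 K-elements
    c_bytes = M * N * 2                # assumed: 2-byte output elements
    return a_bytes + b_bytes + scale_bytes + c_bytes


def tflops(M, N, K, time_us):
    return gemm_flops(M, N, K) / (time_us * 1e-6) / 1e12


def bandwidth_gbs(M, N, K, time_us):
    return gemm_bytes(M, N, K) / (time_us * 1e-6) / 1e9


# First row of the table above: M=256, N=16384, K=2048, 19.073 µs.
print(tflops(256, 16384, 2048, 19.073))
print(bandwidth_gbs(256, 16384, 2048, 19.073))
```

Under these assumptions the first row of the table is reproduced to within rounding.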

fp4gemm_m256-tuning

Profiling Summary (kernel times in µs, TFLOPs, and Bandwidth in GB/s):
+-----------------+---------------------------+----------------------------------+------------------------+--------------------+--------------------+
|    M | N | K    | _gemm_afp4_wfp4_kernel.kd | _gemm_afp4_wfp4_reduce_kernel.kd | Other Kernels Sum (µs) |       TFLOPs       |  Bandwidth (GB/s)  |
+-----------------+---------------------------+----------------------------------+------------------------+--------------------+--------------------+
|  256 16384 2048 |           15.761          |               None               |         67.329         | 1090.0240583719308 | 1680.9166931032296 |
|  256 16384 2048 |           15.761          |               None               |         67.329         | 1090.0240583719308 | 1680.9166931032296 |
|  256 16384 6656 |           48.896          |               None               |         119.89         | 1141.9047539267015 | 1374.9109947643979 |
|  256 16384 8192 |           44.463          |               None               |        137.355         | 1545.5429623732093 | 1817.3737264691993 |
| 256 16384 13312 |           68.03           |               None               |        199.743         | 1641.469200293988  | 1853.1043363222109 |
| 256 16384 16384 |           85.242          |               None               |        240.081         | 1612.3384419886906 | 1797.5078951690477 |
| 256 16384 26624 |          137.011          |               None               |        386.911         | 1630.0756829159702 |  1779.01604980622  |
+-----------------+---------------------------+----------------------------------+------------------------+--------------------+--------------------+
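To compare the two `_gemm_afp4_wfp4_kernel` tables end to end, the reduce kernel time on `main` should be counted as well, since it disappears once split-K is dropped. The sketch below does this with the times copied from the tables above; the dictionary layout and names are chosen here for illustration.

```python
# K -> (main kernel µs, main reduce kernel µs, branch kernel µs),
# copied from the two tables above (M=256, N=16384).
times_us = {
    2048: (19.073, 0.0, 15.761),
    6656: (44.991, 0.0, 48.896),
    8192: (181.061, 6.41, 44.463),
    13312: (278.465, 6.864, 68.03),
    16384: (339.051, 7.47, 85.242),
    26624: (539.954, 7.317, 137.011),
}


def speedup(main_kernel, main_reduce, branch_kernel):
    """Speedup of the branch over main, including main's reduce kernel."""
    return (main_kernel + main_reduce) / branch_kernel


speedups = {K: speedup(*t) for K, t in times_us.items()}
for K, s in speedups.items():
    print(f"K={K:5d}: {s:.2f}x")
```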

gemm_afp4wfp4_preshuffled_scales

main (commit e31c2e0)

Profiling Summary (kernel times in µs, TFLOPs, and Bandwidth in GB/s):
+-----------------+----------------------------------------------+---------------------------+----------------------------------+------------------------+--------------------+--------------------+
|    M | N | K    | _gemm_afp4_wfp4_kernel_preshuffled_scales.kd | _gemm_afp4_wfp4_kernel.kd | _gemm_afp4_wfp4_reduce_kernel.kd | Other Kernels Sum (µs) |       TFLOPs       |  Bandwidth (GB/s)  |
+-----------------+----------------------------------------------+---------------------------+----------------------------------+------------------------+--------------------+--------------------+
|  256 16384 2048 |                    19.843                    |            None           |               None               |         78.761         | 865.7899099934486  | 1335.1271481126846 |
|  256 16384 2048 |                    19.843                    |            None           |               None               |         78.761         | 865.7899099934486  | 1335.1271481126846 |
|  256 16384 6656 |                    64.045                    |            None           |               None               |         130.85         | 871.8022460457491  | 1049.6939339526893 |
|  256 16384 8192 |                    53.171                    |            None           |              7.002               |        155.862         | 1142.0317540425108 |  1342.89279244844  |
| 256 16384 13312 |                    82.582                    |            None           |              6.975               |         220.07         | 1246.9058777761652 | 1407.6698415534242 |
| 256 16384 16384 |                    97.697                    |            None           |              7.091               |        264.826         | 1311.5905778524257 | 1462.2205596060617 |
| 256 16384 26624 |                   154.077                    |            None           |              7.441               |        405.901         | 1382.7455725801458 | 1509.0873339194393 |
+-----------------+----------------------------------------------+---------------------------+----------------------------------+------------------------+--------------------+--------------------+

fp4gemm_m256-tuning

Profiling Summary (kernel times in µs, TFLOPs, and Bandwidth in GB/s):
+-----------------+----------------------------------------------+---------------------------+----------------------------------+------------------------+--------------------+--------------------+
|    M | N | K    | _gemm_afp4_wfp4_kernel_preshuffled_scales.kd | _gemm_afp4_wfp4_kernel.kd | _gemm_afp4_wfp4_reduce_kernel.kd | Other Kernels Sum (µs) |       TFLOPs       |  Bandwidth (GB/s)  |
+-----------------+----------------------------------------------+---------------------------+----------------------------------+------------------------+--------------------+--------------------+
|  256 16384 2048 |                    15.829                    |            None           |               None               |         75.528         | 1085.3414103228251 | 1673.695621959694  |
|  256 16384 2048 |                    15.829                    |            None           |               None               |         75.528         | 1085.3414103228251 | 1673.695621959694  |
|  256 16384 6656 |                    63.651                    |            None           |               None               |        130.299         |  877.198706194718  | 1056.1915445161899 |
|  256 16384 8192 |                    46.289                    |            None           |               None               |        149.175         | 1484.5746664650349 | 1745.682300330532  |
| 256 16384 13312 |                    73.493                    |            None           |               None               |        214.465         | 1519.4528689262925 | 1715.356401289919  |
| 256 16384 16384 |                    86.779                    |            None           |               None               |        259.582         | 1583.7812543587736 | 1765.6710494474473 |
| 256 16384 26624 |                   139.834                    |            None           |               None               |        401.384         | 1597.167351230745  | 1743.1008767538651 |
+-----------------+----------------------------------------------+---------------------------+----------------------------------+------------------------+--------------------+--------------------+

image
Performance for varying M (the pid remapping does not hurt performance and improves it in some cases):
image

@juuso-oskari juuso-oskari requested a review from vgokhale June 10, 2025 13:54
@rahulbatra85 rahulbatra85 changed the title from "Fp4gemm m=256 tuning" to "[TRITON]: Fp4gemm m=256 tuning" Jun 11, 2025
@rahulbatra85

@Chi-Chu319 LGTM!
Can you please take care of the black CI issues?
Thanks!

@cagrikymk

Also, I was able to reproduce the performance numbers and everything looks good!

@Chi-Chu319 Chi-Chu319 requested a review from rahulbatra85 June 19, 2025 13:27
@rahulbatra85 rahulbatra85 requested a review from cagrikymk June 24, 2025 15:47

@rahulbatra85 rahulbatra85 left a comment


Looks good except for the merge conflict. Also, can you please run the config changes by Cagri Mehmut and Ali?

Thanks!

@cagrikymk

I compared the performance against main. Apart from the following shapes there is no regression, only performance improvement or no change. "X" means all M dimensions I tried. The regressions are likely caused by the recently changed configs that are missing from this branch.

M N K
X 2112 7168
X 3072 1536
X 7168 2048
X 7168 256
X 512 7168
32 53248 16384
64 26624 16384
64 53248 16384

@Chi-Chu319

7168 2048
X 7168 256
X 512 7168

I am tuning them. Most of them are related to group size: with remapping, you want a smaller group size.
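For context, and assuming "group size" here refers to the GROUP_SIZE_M parameter of the usual Triton grouped tile ordering, the sketch below shows how that parameter maps a linear pid to a tile coordinate. The function name is hypothetical, and the kernel in this PR may compute this differently.

```python
def pid_to_tile(pid, num_pid_m, num_pid_n, group_size_m):
    """Grouped ordering: walk group_size_m rows of tiles before moving in N."""
    num_pid_in_group = group_size_m * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * group_size_m
    # The last group may have fewer rows than group_size_m.
    rows = min(num_pid_m - first_pid_m, group_size_m)
    pid_m = first_pid_m + (pid % num_pid_in_group) % rows
    pid_n = (pid % num_pid_in_group) // rows
    return pid_m, pid_n


# A 4 x 4 tile grid with groups of 2 rows, for illustration only.
for pid in range(8):
    print(pid, pid_to_tile(pid, num_pid_m=4, num_pid_n=4, group_size_m=2))
```

A larger group spreads consecutive pids over more rows of A, while a smaller group keeps consecutive pids closer to a plain sweep along N. Because the remapped pid feeds into this ordering, the best group size can change once the remapping is in place.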

@cagrikymk

Looks good to me. No regression for any shapes I tested (LLAMA and DS related ones), only improvements or no change.

@Chi-Chu319 Chi-Chu319 requested a review from rahulbatra85 June 27, 2025 21:10
rahulbatra85 previously approved these changes Jun 30, 2025
@Chi-Chu319 Chi-Chu319 requested a review from juuso-oskari July 3, 2025 11:17

@juuso-oskari juuso-oskari left a comment


LGTM

@Chi-Chu319 Chi-Chu319 merged commit f9c7bb0 into main Jul 3, 2025
13 checks passed
@Chi-Chu319 Chi-Chu319 deleted the fp4gemm_m256-tuning branch July 3, 2025 11:21