CUDA: add attention sinks for tile and wmma #15178

Merged
am17an merged 2 commits into ggml-org:master from am17an:cuda_fattn_tile_wmma
Aug 9, 2025

am17an commented Aug 8, 2025

This PR adds attention sink support for older GPUs (Volta and below). This would complete support for attention sinks in the flash attention code.
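For readers unfamiliar with the term, here is a minimal host-side sketch of how a sink can be folded into a flash-attention-style online softmax. It assumes the sink is one extra per-head logit that enlarges the softmax denominator but contributes no value vector (the gpt-oss-style formulation). The function name `attend_with_sink` and the scalar values are hypothetical and for illustration only; this is not the kernel code from this PR.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of ggml: softmax-weighted sum of scalar values
// for one query row, with one extra "sink" logit in the denominator.
// Assumes all scores are finite (no fully masked entries).
float attend_with_sink(const std::vector<float>& scores,
                       const std::vector<float>& values,
                       float sink) {
    // Online-softmax state that a flash attention kernel keeps per row.
    float running_max = -INFINITY;
    float running_sum = 0.0f;
    float acc         = 0.0f;

    for (std::size_t i = 0; i < scores.size(); ++i) {
        const float new_max = std::max(running_max, scores[i]);
        const float scale   = std::exp(running_max - new_max); // rescale old state
        const float p       = std::exp(scores[i] - new_max);
        running_sum = running_sum * scale + p;
        acc         = acc * scale + p * values[i];
        running_max = new_max;
    }

    // Fold in the sink once, after all KV positions: it may raise the maximum
    // and it adds to the denominator, but it adds nothing to the accumulator.
    const float new_max = std::max(running_max, sink);
    const float scale   = std::exp(running_max - new_max);
    running_sum = running_sum * scale + std::exp(sink - new_max);
    acc *= scale;

    return acc / running_sum;
}
```

With two equal scores and a sink of 0, each token receives weight 1/3 instead of 1/2; a sink of negative infinity recovers the ordinary softmax.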

on P100
master

ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla P100-SXM2-16GB, compute capability 6.0, VMM: yes
| model                          |       size |     params | backend    | ngl | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |      512 |  1 |          pp8192 |        443.35 ± 1.08 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |      512 |  1 |           tg128 |         52.81 ± 0.05 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |     1024 |  1 |          pp8192 |        501.63 ± 0.68 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |     1024 |  1 |           tg128 |         52.77 ± 0.04 |

PR

  Device 0: Tesla P100-SXM2-16GB, compute capability 6.0, VMM: yes
| model                          |       size |     params | backend    | ngl | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |      512 |  1 |          pp8192 |        687.26 ± 2.64 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |      512 |  1 |           tg128 |         52.83 ± 0.03 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |     1024 |  1 |          pp8192 |        823.87 ± 1.32 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |     1024 |  1 |           tg128 |         52.76 ± 0.05 |

on V100

master (with fix). At the moment it looks like this model is broken solely on Volta, because it goes through the wmma path even though attention sinks are not supported there.

  Device 0: Tesla V100-PCIE-16GB, compute capability 7.0, VMM: yes
| model                          |       size |     params | backend    | ngl | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |      512 |  1 |          pp8192 |       1081.62 ± 2.53 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |      512 |  1 |           tg128 |        117.00 ± 0.20 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |     1024 |  1 |          pp8192 |       1189.98 ± 3.06 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |     1024 |  1 |           tg128 |        117.38 ± 0.29 |

PR

  Device 0: Tesla V100-PCIE-16GB, compute capability 7.0, VMM: yes
| model                          |       size |     params | backend    | ngl | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |      512 |  1 |          pp8192 |      2231.48 ± 15.04 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |      512 |  1 |           tg128 |        117.85 ± 0.10 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |     1024 |  1 |          pp8192 |      2801.53 ± 29.66 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |     1024 |  1 |           tg128 |        117.79 ± 0.13 |

github-actions bot added the labels Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) on Aug 8, 2025

JohannesGaessler left a comment

This PR should produce correct results, but I think some of the synchronizations can be optimized out. In addition to the usual tests for correctness, please also run compute-sanitizer --tool=racecheck ./tests/test-backend-ops -o FLASH_ATTN_EXT to check for race conditions. The compute sanitizer should come with the CUDA installation, but it may not be on the PATH (on my system it's under /opt/cuda/bin/compute-sanitizer).

Outdated comment threads: ggml/src/ggml-cuda/fattn-tile-f16.cu (4), ggml/src/ggml-cuda/fattn-tile-f32.cu (1), ggml/src/ggml-cuda/fattn-wmma-f16.cu (5)

am17an commented Aug 9, 2025

@JohannesGaessler the compute-sanitizer tests are all green. Tested on P100 and V100.

@am17an am17an merged commit 34c9d76 into ggml-org:master Aug 9, 2025
47 checks passed
@am17an am17an deleted the cuda_fattn_tile_wmma branch August 9, 2025 12:00

IMbackK commented Aug 9, 2025

If possible, I would like to be tagged on PRs that touch the wmma code.

Thireus added a commit to Thireus/ik_llama.cpp that referenced this pull request Aug 11, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 6, 2025
blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026
* CUDA: add attention sinks for tile and wmma

* Review: formatting changes + remove syncthreads from tile + remove warp_reduce_max from wmma
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026

3 participants