CUDA: add attention sinks for tile and wmma by am17an · Pull Request #15178 · ggml-org/llama.cpp

am17an · 2025-08-08T17:16:07Z

This PR adds attention sink support for older GPUs (Volta and below), which completes support for attention sinks in the flash attention code.
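For readers unfamiliar with attention sinks, here is a minimal CPU sketch of the idea, not the kernel code from this PR. It assumes the sink is one extra logit per head that takes part in the softmax normalization but has no value vector. The names `softmax_with_sink`, `FlashState` and `fold_in_sink` are hypothetical and chosen only for illustration. The second function shows how a flash-attention style running state (max, sum, unnormalized accumulator) could absorb the sink after all KV tokens have been processed.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Reference softmax over one row of attention logits plus one sink logit.
// The sink only enlarges the denominator; it gets no output weight.
std::vector<float> softmax_with_sink(const std::vector<float> & logits, float sink) {
    assert(std::isfinite(sink));

    // Subtract the overall max (including the sink) so no exponent overflows.
    float max_val = sink;
    for (float x : logits) {
        max_val = std::max(max_val, x);
    }

    float sum = std::exp(sink - max_val);
    for (float x : logits) {
        sum += std::exp(x - max_val);
    }

    std::vector<float> weights(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i) {
        weights[i] = std::exp(logits[i] - max_val) / sum;
    }
    return weights;
}

// Hypothetical running state of an online-softmax (flash attention) loop.
struct FlashState {
    float              kq_max; // maximum logit seen so far
    float              kq_sum; // sum of exp(logit - kq_max)
    std::vector<float> vkq;    // sum of exp(logit - kq_max) * V, not yet normalized
};

// Fold the sink into the state once all KV tokens have been accumulated.
// The final output is then vkq[i] / kq_sum.
void fold_in_sink(FlashState & st, float sink) {
    const float new_max = std::max(st.kq_max, sink);
    const float scale   = std::exp(st.kq_max - new_max); // rescale old terms to the new max

    st.kq_sum = st.kq_sum * scale + std::exp(sink - new_max);
    for (float & v : st.vkq) {
        v *= scale;
    }
    st.kq_max = new_max;
}
```

Both routes should give the same result: normalizing the rescaled accumulator by the updated sum matches taking the dot product of the `softmax_with_sink` weights with the values.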

on P100
master

ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla P100-SXM2-16GB, compute capability 6.0, VMM: yes
| model                          |       size |     params | backend    | ngl | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |      512 |  1 |          pp8192 |        443.35 ± 1.08 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |      512 |  1 |           tg128 |         52.81 ± 0.05 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |     1024 |  1 |          pp8192 |        501.63 ± 0.68 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |     1024 |  1 |           tg128 |         52.77 ± 0.04 |

PR

  Device 0: Tesla P100-SXM2-16GB, compute capability 6.0, VMM: yes
| model                          |       size |     params | backend    | ngl | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |      512 |  1 |          pp8192 |        687.26 ± 2.64 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |      512 |  1 |           tg128 |         52.83 ± 0.03 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |     1024 |  1 |          pp8192 |        823.87 ± 1.32 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |     1024 |  1 |           tg128 |         52.76 ± 0.05 |

on V100

master (with fix) - at the moment it looks like this model is broken only on Volta, because it goes through the wmma path even though attention sinks are not supported there

  Device 0: Tesla V100-PCIE-16GB, compute capability 7.0, VMM: yes
| model                          |       size |     params | backend    | ngl | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |      512 |  1 |          pp8192 |       1081.62 ± 2.53 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |      512 |  1 |           tg128 |        117.00 ± 0.20 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |     1024 |  1 |          pp8192 |       1189.98 ± 3.06 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |     1024 |  1 |           tg128 |        117.38 ± 0.29 |

PR

 Device 0: Tesla V100-PCIE-16GB, compute capability 7.0, VMM: yes
| model                          |       size |     params | backend    | ngl | n_ubatch | fa |            test |                  t/s | 
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |      512 |  1 |          pp8192 |      2231.48 ± 15.04 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |      512 |  1 |           tg128 |        117.85 ± 0.10 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |     1024 |  1 |          pp8192 |      2801.53 ± 29.66 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |     1024 |  1 |           tg128 |        117.79 ± 0.13 |
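To make the comparison explicit, the pp8192 rows from the four tables can be turned into speedup ratios. The sketch below only copies the reported t/s means and divides them; `BenchRow` and `speedup` are illustrative names, and the error bars are ignored. On these numbers prompt processing comes out at roughly 1.55x and 1.64x on the P100 and roughly 2.06x and 2.35x on the V100 (against master with the fix), while tg128 is essentially unchanged.

```cpp
#include <cassert>

// One pp8192 measurement pair: tokens per second on master and with this PR.
struct BenchRow {
    const char * gpu;
    int          n_ubatch;
    double       master_tps;
    double       pr_tps;
};

// Mean t/s values copied from the pp8192 rows of the tables.
static const BenchRow pp8192_rows[] = {
    {"P100",  512,  443.35,  687.26},
    {"P100", 1024,  501.63,  823.87},
    {"V100",  512, 1081.62, 2231.48},
    {"V100", 1024, 1189.98, 2801.53},
};

// Ratio of PR throughput to master throughput.
double speedup(const BenchRow & row) {
    assert(row.master_tps > 0.0);
    return row.pr_tps / row.master_tps;
}
```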

JohannesGaessler

This PR should produce correct results, but I think some of the synchronizations can be optimized out. In addition to the usual tests for correctness, please also check compute-sanitizer --tool=racecheck ./tests/test-backend-ops -o FLASH_ATTN_EXT. The compute sanitizer should come with the CUDA installation, but it may not be on the PATH (on my system it's under /opt/cuda/bin/compute-sanitizer).


am17an · 2025-08-09T11:30:55Z

@JohannesGaessler the compute-sanitizer tests are all green. Tested on P100 and V100

IMbackK · 2025-08-09T19:47:18Z

If possible, I would like to be tagged on PRs that touch the wmma code.

CUDA: add attention sinks for tile and wmma

4946c19

am17an requested a review from JohannesGaessler as a code owner August 8, 2025 17:16

github-actions bot added the Nvidia GPU and ggml labels Aug 8, 2025

JohannesGaessler reviewed Aug 9, 2025


Review: formatting changes + remove syncthreads from tile + remove warp_reduce_max from wmma

1ef7fd0

JohannesGaessler approved these changes Aug 9, 2025


am17an merged commit 34c9d76 into ggml-org:master Aug 9, 2025
47 checks passed

am17an deleted the cuda_fattn_tile_wmma branch August 9, 2025 12:00

Thireus added a commit to Thireus/ik_llama.cpp that referenced this pull request Aug 11, 2025

CUDA: add attention sinks for tile and wmma

f71ef6b

Port of ggml-org/llama.cpp#15178

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 6, 2025

Revert "CUDA: add attention sinks for tile and wmma (ggml-org#15178)"

b46e828

This reverts commit 34c9d76.

blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026

CUDA: add attention sinks for tile and wmma (#15178)

76e2486

* CUDA: add attention sinks for tile and wmma * Review: formatting changes + remove syncthreads from tile + remove warp_reduce_max from wmma

am17an merged 2 commits into ggml-org:master from am17an:cuda_fattn_tile_wmma

3 participants
