
CUDA: fix tile FA kernel on Pascal #22541

Merged
JohannesGaessler merged 1 commit into ggml-org:master from JohannesGaessler:cuda-fa-fix-pascal-compile on Apr 30, 2026

Conversation

@JohannesGaessler
Contributor

Fixes #22491.

The problem is that the new kernel for Mistral Small 4 is compiled unconditionally with 32 columns per CUDA block. On Pascal that puts it above the 48 KiB per-block shared memory limit (0xc000 bytes in the compile error). This PR keeps 32 columns per block on AMD, where the tile still fits, and instead uses 2 CUDA blocks with 16 columns each on Pascal.
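
For illustration, here is a minimal sketch of how such a per-architecture fallback can be expressed; the function names, tile layout, and byte counts are assumptions made for the example, not the actual kernel's constants:

```cpp
#include <cstddef>
#include <cuda_fp16.h>

constexpr size_t PASCAL_SMEM_LIMIT = 48 * 1024; // 0xc000 bytes per CUDA block on Pascal

// Hypothetical estimate of the shared memory a tile with ncols columns needs.
constexpr size_t tile_smem_bytes(int ncols, int head_dim) {
    return size_t(ncols) * size_t(head_dim) * sizeof(half)    // K/V tile in fp16 (assumed layout)
         + size_t(ncols) * size_t(head_dim) * sizeof(float);  // fp32 accumulators (assumed)
}

// Keep 32 columns per block where the tile fits (e.g. AMD with 64 KiB of LDS),
// otherwise fall back to 16 columns and launch twice as many blocks.
constexpr int pick_cols_per_block(int head_dim, size_t smem_limit) {
    return tile_smem_bytes(32, head_dim) <= smem_limit ? 32 : 16;
}
```

With 16 columns per block, two blocks together cover the same 32 columns, which matches the Pascal path described above.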


@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Apr 30, 2026
@JohannesGaessler JohannesGaessler merged commit e82aaf2 into ggml-org:master on Apr 30, 2026
44 checks passed
@IMbackK
Collaborator

IMbackK commented Apr 30, 2026

I guess this needlessly nerfs MUSA, which AFAIK also has 64 KiB of shared memory.
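
If so, the fallback could be gated per backend rather than applied wherever the 32-column tile does not fit on Pascal. A sketch under that assumption, reusing pick_cols_per_block() from the example above (the 64 KiB figures are taken from this comment and common AMD LDS sizes, not verified here):

```cpp
// Hypothetical per-backend shared-memory limits in bytes per block/workgroup.
constexpr size_t SMEM_LIMIT_PASCAL = 48 * 1024; // 0xc000, as in the compile error
constexpr size_t SMEM_LIMIT_AMD    = 64 * 1024; // LDS per workgroup (assumed)
constexpr size_t SMEM_LIMIT_MUSA   = 64 * 1024; // per this comment; unverified

// pick_cols_per_block(head_dim, SMEM_LIMIT_MUSA) would then leave MUSA at
// 32 columns per block instead of forcing the 16-column Pascal fallback.
```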

tekintian added a commit to tekintian/llama.cpp that referenced this pull request May 1, 2026
* 'master' of github.com:tekintian/llama.cpp: (659 commits)
  ggml-webgpu: Improve performance of mat-vec and mat-mat for MUL_MAT_ID (ggml-org#22464)
  Update llama-mmap to use ftello/fseeko (ggml-org#22497)
  common : check for null getpwuid in hf-cache (ggml-org#22550)
  vulkan: add get/set tensor 2d functions (ggml-org#22514)
  spec: fix argument typo (ggml-org#22552)
  ci : bump ty to 0.0.33 (ggml-org#22535)
  vendor : update cpp-httplib to 0.43.2 (ggml-org#22548)
  CUDA: fix tile FA kernel on Pascal (ggml-org#22541)
  scripts : add wc2wt.sh - create worktree from current HEAD (ggml-org#22513)
  add fast matmul iquants (ggml-org#22504)
  spec : fix draft model checkpoints (ggml-org#22521)
  spec : fix vocab compat checks in spec example (ggml-org#22426)
  common : do not pass prompt tokens to reasoning budget sampler (ggml-org#22488)
  hexagon: make vmem and buffer-size configurable (ggml-org#22487)
  CUDA: fuse SSM_CONV + ADD(bias) + SILU (ggml-org#22478)
  spec : disacard last drafted token with low prob (ggml-org#22506)
  sync : ggml
  ggml : bump version to 0.10.1 (ggml/1469)
  webui: fix slow mic stop and WAV encode (ggml-org#22480)
  ggml-cpu : disable tiled matmul on AIX to fix page boundary segfault (ggml-org#22293)
  ...

# Conflicts:
#	.gitignore

Development

Successfully merging this pull request may close these issues.

Compile bug: Entry function flash_attn_tile (mangled) uses too much shared data (0xd100 bytes, 0xc000 max)
