CUDA: fix FTZ in FA for Gemma 3 #13991
Conversation
ggerganov left a comment
This seems like a good solution, though I have some small remaining concerns that there might be something else going on. I tried the same approach with the Metal implementation (i.e. keep accumulating the output in F16 and FTZ the scores like in the CUDA code), and Gemma 3 27B keeps outputting garbage for large prompts. It's hard to say what the root cause is, as the Metal implementation does not provide many tools for debugging.
Anyway, this should be OK to merge since @mostlygeek confirmed that it is working, but we should keep an eye out for any remaining issues.
> I don't have multimodal Gemma 3 set up
Btw, you don't need multimodal Gemma to reproduce the issue. Just load the text-only model and ask it to summarize something around 100k tokens long (for example, server.cpp + llama-context.cpp).
Well, I hope not. If the CUDA code had to use FP32 for the accumulation of VKQ, that would be a pretty big headache for me due to register pressure. BF16 could partially solve the issue, but then the new issue would be that not all instructions are available on all GPUs.
Fixes #12433 (comment).
What I think is happening is that there is an underflow in the FlashAttention code when rescaling the FP16 VKQ accumulators. This PR flushes the scale to 0 if it is below 2.06e-9. I don't have multimodal Gemma 3 set up, so I did not reproduce the issue on my machine.
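For context, here is a minimal sketch of what such a flush-to-zero step can look like; it is not the actual kernel code. The function name, the array layout, and the SOFTMAX_FTZ_THRESHOLD constant are assumptions for illustration; the -20.0f cutoff is chosen because expf(-20) ≈ 2.06e-9, matching the threshold mentioned above.

```cpp
#include <cuda_fp16.h>

// Hypothetical cutoff: rescale factors smaller than expf(-20) ~= 2.06e-9
// are treated as zero instead of being applied to the FP16 accumulators.
#define SOFTMAX_FTZ_THRESHOLD -20.0f

// Rescale n half2 VKQ accumulators after the running softmax maximum changed.
// kq_max_diff = old_running_max - new_running_max (always <= 0).
__device__ void rescale_vkq_accumulators(half2 * VKQ, int n, float kq_max_diff) {
    float scale = expf(kq_max_diff);

    // For very negative diffs the scale becomes subnormal (or rounds
    // unpredictably) once converted to FP16; multiplying the accumulators by
    // it can underflow in inconsistent ways. Flush it to a clean zero so the
    // old contribution is discarded outright.
    if (kq_max_diff < SOFTMAX_FTZ_THRESHOLD) {
        scale = 0.0f;
    }

    const half2 scale_h2 = __float2half2_rn(scale);
    for (int i = 0; i < n; ++i) {
        VKQ[i] *= scale_h2; // accumulation stays in FP16
    }
}
```

The point of the check is only to replace an ill-behaved FP16 underflow with an explicit zero; the accumulation itself can stay in half precision.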