
Misc. bug: Vulkan performance degradation (TG) on A770 from b7194 and FA problem #17628

@savvadesogle

Description

Name and Version

llama-cli -v
load_backend: loaded RPC backend from C:\llm\llama-cpp\VULKAN\b7209\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from C:\llm\llama-cpp\VULKAN\b7209\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\llm\llama-cpp\VULKAN\b7209\ggml-cpu-haswell.dll
build: 7209 (7f8ef50) with clang version 19.1.5 for x86_64-pc-windows-msvc

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-bench
llama-server

Command line

llama-bench -m T:\models\lmstudio-community\gpt-oss-20b-GGUF\gpt-oss-20b-MXFP4.gguf  -ngl 100 -fa 0,1


llama-bench -m T:\models\lmstudio-community\Meta-Llama-3.1-8B-Instruct-GGUF\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -ngl 100 -fa 0,1

Problem description & steps to reproduce

Hello.

Token generation (TG) performance dropped on the Intel Arc A770 somewhere between b7189 and b7209.

Driver: 8250
CPU: Xeon 2699 v3 x2
GPU: 1x A770

Models:
lmstudio-community\Meta-Llama-3.1-8B-Instruct-GGUF\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf: 53 -> 42 t/s
lmstudio-community\gpt-oss-20b-GGUF\gpt-oss-20b-MXFP4.gguf: 60 -> 54 t/s
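For scale, the tg128 numbers from the tables below work out to roughly a 21% and an 11% regression. A quick sketch of the arithmetic (values copied from the llama-bench output in this report):

```python
# tg128 throughput (t/s) at b7189 (before) and b7209 (after),
# taken from the llama-bench tables in this report.
results = {
    "Meta-Llama-3.1-8B-Instruct-Q4_K_M": (53.63, 42.39),
    "gpt-oss-20b-MXFP4": (60.95, 54.20),
}

for model, (before, after) in results.items():
    drop_pct = (before - after) / before * 100
    print(f"{model}: {before} -> {after} t/s ({drop_pct:.1f}% slower)")
```

This prints roughly 21.0% for the Llama 3.1 8B run and 11.1% for the gpt-oss 20B run.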

B7209

| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     | 100 |  0 |           pp512 |        921.00 ± 3.22 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     | 100 |  0 |           tg128 |         42.39 ± 0.07 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     | 100 |  1 |           pp512 |        280.12 ± 0.44 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     | 100 |  1 |           tg128 |         29.39 ± 0.02 |

build: 7f8ef50cc (7209)

Sometimes flash attention crashes with no error message; the bench just stops, so the fa=1 rows are missing from the table below.

| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     | 100 |  0 |           pp512 |        885.48 ± 5.59 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     | 100 |  0 |           tg128 |         54.20 ± 0.07 |

B7189

| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     | 100 |  0 |           pp512 |        918.06 ± 4.28 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     | 100 |  0 |           tg128 |         53.63 ± 0.10 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     | 100 |  1 |           pp512 |        280.26 ± 0.70 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     | 100 |  1 |           tg128 |         34.50 ± 0.04 |

| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     | 100 |  0 |           pp512 |        884.45 ± 5.87 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     | 100 |  0 |           tg128 |         60.95 ± 0.06 |

First Bad Commit

Between b7189 and b7209
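To narrow this range down to a single first bad commit, a `git bisect` over the release tags could work. A sketch, assuming a local llama.cpp checkout with the Vulkan build dependencies installed; the build flags and model path are illustrative, not the exact setup above:

```shell
# Bisect llama.cpp between the known-bad and known-good releases.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git bisect start b7209 b7189      # bad revision first, then good

# At each bisect step: rebuild the Vulkan backend and benchmark tg128.
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
build/bin/llama-bench -m path/to/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -ngl 100 -fa 0

# Mark the result and repeat until git reports the first bad commit:
git bisect good   # if tg128 is near 53 t/s
git bisect bad    # if tg128 is near 42 t/s
```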

Metadata

Labels

bug: Something isn't working
need feedback: Testing and feedback with results are needed
performance: Speed related topics
regression: A regression introduced in a new build (something that was previously working correctly)
