### Name and Version

```
llama-cli -v
load_backend: loaded RPC backend from C:\llm\llama-cpp\VULKAN\b7209\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from C:\llm\llama-cpp\VULKAN\b7209\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\llm\llama-cpp\VULKAN\b7209\ggml-cpu-haswell.dll
build: 7209 (7f8ef50) with clang version 19.1.5 for x86_64-pc-windows-msvc
```
### Operating systems

Windows
### Which llama.cpp modules do you know to be affected?

- llama-bench
- llama-server
### Command line

```
llama-bench -m T:\models\lmstudio-community\gpt-oss-20b-GGUF\gpt-oss-20b-MXFP4.gguf -ngl 100 -fa 0,1
llama-bench -m T:\models\lmstudio-community\Meta-Llama-3.1-8B-Instruct-GGUF\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -ngl 100 -fa 0,1
```
### Problem description & steps to reproduce

Hello.

Token generation (TG) performance dropped on an Intel Arc A770 between builds b7189 and b7209.

- Driver: 8250
- CPU: 2x Xeon E5-2699 v3
- GPU: 1x Arc A770

Models:
- 53 -> 42 t/s: lmstudio-community\Meta-Llama-3.1-8B-Instruct-GGUF\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
- 60 -> 54 t/s: lmstudio-community\gpt-oss-20b-GGUF\gpt-oss-20b-MXFP4.gguf
#### b7209
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Vulkan | 100 | 0 | pp512 | 921.00 ± 3.22 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Vulkan | 100 | 0 | tg128 | 42.39 ± 0.07 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Vulkan | 100 | 1 | pp512 | 280.12 ± 0.44 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Vulkan | 100 | 1 | tg128 | 29.39 ± 0.02 |
build: 7f8ef50cc (7209)
Sometimes flash attention crashes silently: no error is printed, the bench just stops.

| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 0 | pp512 | 885.48 ± 5.59 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 0 | tg128 | 54.20 ± 0.07 |
#### b7189
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Vulkan | 100 | 0 | pp512 | 918.06 ± 4.28 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Vulkan | 100 | 0 | tg128 | 53.63 ± 0.10 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Vulkan | 100 | 1 | pp512 | 280.26 ± 0.70 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Vulkan | 100 | 1 | tg128 | 34.50 ± 0.04 |
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 0 | pp512 | 884.45 ± 5.87 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 0 | tg128 | 60.95 ± 0.06 |
### First Bad Commit

Between b7189 and b7209
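If it helps narrow the window down, the regression can be bisected locally with `git bisect` between the two release tags. A minimal sketch (tag names follow llama.cpp's bNNNN convention; the bench command and threshold are this issue's repro, and `DRY_RUN=1` only prints the commands so the outline can be checked without a checkout):

```shell
#!/bin/sh
# Dry-run by default: echo each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# b7209 is known bad (slow), b7189 known good (fast).
run git bisect start b7209 b7189

# At each bisect step: rebuild the Vulkan backend and re-run the bench,
# then mark the commit good or bad based on the measured tg128 t/s.
run cmake -B build -DGGML_VULKAN=ON
run cmake --build build --config Release -j
run ./build/bin/llama-bench -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -ngl 100 -fa 0
run git bisect good   # or: run git bisect bad, depending on the result
```

Repeating the build/bench/mark step until bisect converges would point at the exact first bad commit instead of the 20-commit range.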