Name and Version
version: 7901 (8a98ba4)
built with GNU 15.2.1 for Linux x86_64
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-bench
Command line
./build/bin/llama-bench --model ./models/GLM-4.7-Flash-IQ4_XS.gguf -ngl 99 -fa 0,1 -d 0,4096,8192
Problem description & steps to reproduce
With FA enabled, I get generally worse performance with Vulkan than with ROCm, but I'm not here about PP this time. What is interesting is that a smaller pre-filled context results in worse TG performance than a larger one: tg128 @ d4096 is much slower than tg128 @ d8192 (see the logs below for the full table, including fa=0 and ROCm).
This is on Fedora 43, 7900 XTX, ROCm 6.4.2, Vulkan 1.4.342, Mesa RADV 26.1.0-0.5.gita8fac76:
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | Vulkan | 99 | 1 | tg128 | 102.97 ± 0.08 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | Vulkan | 99 | 1 | tg128 @ d4096 | 41.81 ± 0.01 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | Vulkan | 99 | 1 | tg128 @ d8192 | 117.59 ± 0.07 |
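To quantify the anomaly, here is a minimal Python sketch comparing the Vulkan fa=1 tg128 numbers copied from the rows above. The expectation that TG throughput decreases monotonically with prefill depth is my assumption about normal behavior, not something llama-bench asserts:

```python
# Vulkan, fa=1, tg128 throughput (t/s) keyed by prefill depth,
# values copied from the benchmark table above.
tg = {0: 102.97, 4096: 41.81, 8192: 117.59}

# Normally TG slows down as the KV cache grows, so a deeper prefill
# should never be faster than a shallower one.
for shallow, deep in [(0, 4096), (4096, 8192)]:
    ratio = tg[deep] / tg[shallow]
    print(f"d{deep} vs d{shallow}: {ratio:.2f}x")

# The anomaly: d8192 is ~2.8x faster than d4096, and even beats d0.
assert tg[8192] > tg[4096]
```

Running this shows d4096 at roughly 0.41x of the empty-context rate, while d8192 jumps back to ~2.81x of d4096, which is the inversion described above.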
First Bad Commit
No response
Relevant log output
Logs
# device info on ROCm
Device 0: Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
# device info on Vulkan
ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
# reproducible results
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | ROCm | 99 | 0 | pp512 | 2590.64 ± 22.66 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | Vulkan | 99 | 0 | pp512 | 916.86 ± 2.69 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | ROCm | 99 | 0 | tg128 | 83.17 ± 0.03 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | Vulkan | 99 | 0 | tg128 | 74.51 ± 0.06 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | ROCm | 99 | 0 | pp512 @ d4096 | 1146.33 ± 20.77 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | Vulkan | 99 | 0 | pp512 @ d4096 | 272.53 ± 2.52 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | ROCm | 99 | 0 | tg128 @ d4096 | 44.94 ± 0.02 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | Vulkan | 99 | 0 | tg128 @ d4096 | 23.02 ± 0.12 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | ROCm | 99 | 0 | pp512 @ d8192 | 734.88 ± 5.89 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | Vulkan | 99 | 0 | pp512 @ d8192 | 341.09 ± 1.12 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | ROCm | 99 | 0 | tg128 @ d8192 | 27.81 ± 0.02 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | Vulkan | 99 | 0 | tg128 @ d8192 | 73.74 ± 0.04 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | ROCm | 99 | 1 | pp512 | 2684.54 ± 19.55 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | Vulkan | 99 | 1 | pp512 | 786.92 ± 1.40 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | ROCm | 99 | 1 | tg128 | 89.74 ± 0.08 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | Vulkan | 99 | 1 | tg128 | 102.97 ± 0.08 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | ROCm | 99 | 1 | pp512 @ d4096 | 1202.67 ± 3.16 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | Vulkan | 99 | 1 | pp512 @ d4096 | 748.20 ± 2.73 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | ROCm | 99 | 1 | tg128 @ d4096 | 82.65 ± 0.07 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | Vulkan | 99 | 1 | tg128 @ d4096 | 41.81 ± 0.01 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | ROCm | 99 | 1 | pp512 @ d8192 | 775.44 ± 2.65 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | Vulkan | 99 | 1 | pp512 @ d8192 | 637.03 ± 3.20 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | ROCm | 99 | 1 | tg128 @ d8192 | 76.71 ± 0.03 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | Vulkan | 99 | 1 | tg128 @ d8192 | 117.59 ± 0.07 |