Name and Version
version: 7901 (8a98ba4)
built with GNU 15.2.1 for Linux x86_64
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-bench
Command line
./build/bin/llama-bench --model ./models/GLM-4.7-Flash-IQ4_XS.gguf -ngl 99 -fa 0,1 -d 0,4096,8192
Problem description & steps to reproduce
With FA enabled, I get generally worse performance with Vulkan than with ROCm, but I'm not here about PP this time. What is interesting is that a smaller pre-filled context results in worse TG performance than a larger one: tg128 @ d4096 is much slower than tg128 @ d8192 (see the logs below for the full table, including fa=0 and ROCm).
This is on Fedora 43, 7900 XTX, ROCm 6.4.2, Vulkan 1.4.342, Mesa RADV 26.1.0-0.5.gita8fac76:
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | Vulkan | 99 | 1 | tg128 | 102.97 ± 0.08 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | Vulkan | 99 | 1 | tg128 @ d4096 | 41.81 ± 0.01 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | Vulkan | 99 | 1 | tg128 @ d8192 | 117.59 ± 0.07 |
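To quantify the anomaly, here is a minimal Python sketch comparing the Vulkan fa=1 tg128 numbers copied from the rows above. The expectation that TG throughput decreases monotonically with prefill depth is my assumption about normal behavior, not something llama-bench asserts:

```python
# Vulkan, fa=1, tg128 throughput (t/s) keyed by prefill depth,
# values copied from the benchmark table above.
tg = {0: 102.97, 4096: 41.81, 8192: 117.59}

# Normally TG slows down as the KV cache grows, so a deeper prefill
# should never be faster than a shallower one.
for shallow, deep in [(0, 4096), (4096, 8192)]:
    ratio = tg[deep] / tg[shallow]
    print(f"d{deep} vs d{shallow}: {ratio:.2f}x")

# The anomaly: d8192 is ~2.8x faster than d4096, and even beats d0.
assert tg[8192] > tg[4096]
```

Running this shows d4096 at roughly 0.41x of the empty-context rate, while d8192 jumps back to ~2.81x of d4096, which is the inversion described above.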
First Bad Commit
No response
Relevant log output
Logs
# device info on ROCm
Device 0: Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
# device info on Vulkan
ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
# reproducible results
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | ROCm | 99 | 0 | pp512 | 2590.64 ± 22.66 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | Vulkan | 99 | 0 | pp512 | 916.86 ± 2.69 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | ROCm | 99 | 0 | tg128 | 83.17 ± 0.03 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | Vulkan | 99 | 0 | tg128 | 74.51 ± 0.06 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | ROCm | 99 | 0 | pp512 @ d4096 | 1146.33 ± 20.77 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | Vulkan | 99 | 0 | pp512 @ d4096 | 272.53 ± 2.52 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | ROCm | 99 | 0 | tg128 @ d4096 | 44.94 ± 0.02 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | Vulkan | 99 | 0 | tg128 @ d4096 | 23.02 ± 0.12 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | ROCm | 99 | 0 | pp512 @ d8192 | 734.88 ± 5.89 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | Vulkan | 99 | 0 | pp512 @ d8192 | 341.09 ± 1.12 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | ROCm | 99 | 0 | tg128 @ d8192 | 27.81 ± 0.02 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | Vulkan | 99 | 0 | tg128 @ d8192 | 73.74 ± 0.04 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | ROCm | 99 | 1 | pp512 | 2684.54 ± 19.55 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | Vulkan | 99 | 1 | pp512 | 786.92 ± 1.40 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | ROCm | 99 | 1 | tg128 | 89.74 ± 0.08 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | Vulkan | 99 | 1 | tg128 | 102.97 ± 0.08 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | ROCm | 99 | 1 | pp512 @ d4096 | 1202.67 ± 3.16 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | Vulkan | 99 | 1 | pp512 @ d4096 | 748.20 ± 2.73 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | ROCm | 99 | 1 | tg128 @ d4096 | 82.65 ± 0.07 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | Vulkan | 99 | 1 | tg128 @ d4096 | 41.81 ± 0.01 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | ROCm | 99 | 1 | pp512 @ d8192 | 775.44 ± 2.65 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | Vulkan | 99 | 1 | pp512 @ d8192 | 637.03 ± 3.20 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | ROCm | 99 | 1 | tg128 @ d8192 | 76.71 ± 0.03 |
| deepseek2 30B.A3B IQ4_XS - 4.25 bpw | 15.15 GiB | 29.94 B | Vulkan | 99 | 1 | tg128 @ d8192 | 117.59 ± 0.07 |