Name and Version
./llama-cli --version
load_backend: loaded RPC backend from ~/Applications/llama-prism/libggml-rpc.so
load_backend: loaded Vulkan backend from ~/Applications/llama-prism/libggml-vulkan.so
load_backend: loaded CPU backend from ~/Applications/llama-prism/libggml-cpu-zen4.so
version: 8846 (d104cf1b6)
built with GNU 11.4.0 for Linux x86_64
llama-cli --list-devices
load_backend: loaded RPC backend from ~/Applications/llama-prism/libggml-rpc.so
load_backend: loaded Vulkan backend from ~/Applications/llama-prism/libggml-vulkan.so
load_backend: loaded CPU backend from ~/Applications/llama-prism/libggml-cpu-zen4.so
Available devices:
Vulkan0: AMD Radeon 780M Graphics (RADV PHOENIX) (40361 MiB, 19410 MiB free)
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
./llama-server -m ~/.lmstudio/models/prism-ml/Ternary-Bonsai-8B-gguf/Ternary-Bonsai-8B-Q2_0.gguf \
--alias bonsai-8b \
--ctx-size 8192 \
--jinja --gpu-layers all \
--temp 0.3 --top-p 1.0 --min-p 0.01 \
--sleep-idle-seconds 600 \
--host 0.0.0.0 --port 1234
Problem description & steps to reproduce
I've been trying the most recent Ternary 8B model with the Linux Vulkan build, and performance is terrible (2.4 t/s), worse than the CPU build (2.73 t/s). For comparison, the same hardware runs a much larger model (Qwen 3.6 MoE) at 24 t/s. The GPU has 16 GB of dedicated memory, but it only fills to about 25% (~4 GB at 8K context). Interestingly, while chatting the GPU is barely involved and the CPU spikes, even with --gpu-layers 'all'. From my POV, it looks like the Vulkan build "forgets" it has a GPU available for some operations, but of course I may be wrong.
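For reference, this is roughly how I compared the two numbers above with llama-bench from the same builds (a sketch; the prompt/generation token counts are illustrative, and the model path matches the server command above):

# Vulkan build, all layers offloaded
./llama-bench -m ~/.lmstudio/models/prism-ml/Ternary-Bonsai-8B-gguf/Ternary-Bonsai-8B-Q2_0.gguf \
  -ngl 99 -p 512 -n 128

# CPU-only comparison on the same model
./llama-bench -m ~/.lmstudio/models/prism-ml/Ternary-Bonsai-8B-gguf/Ternary-Bonsai-8B-Q2_0.gguf \
  -ngl 0 -p 512 -n 128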
OS Specs
OS: Linux Mint 22.3 x86_64
Host: Venus series
Kernel: 6.8.0-110-generic
Shell: bash 5.2.21
Terminal: /dev/pts/0
CPU: AMD Ryzen 9 7940HS w/ Radeon 780M Graphics (16) @ 5.263GHz
GPU: AMD ATI c4:00.0 Phoenix1
Memory: 3545MiB / 47954MiB