Misc. bug: Linux Vulkan build is slower than CPU with Ternary 8B, and 10 times slower than a larger model #28

@pmanna

Description

Name and Version

./llama-cli --version
load_backend: loaded RPC backend from ~/Applications/llama-prism/libggml-rpc.so
load_backend: loaded Vulkan backend from ~/Applications/llama-prism/libggml-vulkan.so
load_backend: loaded CPU backend from ~/Applications/llama-prism/libggml-cpu-zen4.so
version: 8846 (d104cf1b6)
built with GNU 11.4.0 for Linux x86_64
llama-cli --list-devices
load_backend: loaded RPC backend from ~/Applications/llama-prism/libggml-rpc.so
load_backend: loaded Vulkan backend from ~/Applications/llama-prism/libggml-vulkan.so
load_backend: loaded CPU backend from ~/Applications/llama-prism/libggml-cpu-zen4.so
Available devices:
  Vulkan0: AMD Radeon 780M Graphics (RADV PHOENIX) (40361 MiB, 19410 MiB free)

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

./llama-server -m ~/.lmstudio/models/prism-ml/Ternary-Bonsai-8B-gguf/Ternary-Bonsai-8B-Q2_0.gguf \
  --alias bonsai-8b \
  --ctx-size 8192 \
  --jinja --gpu-layers all \
  --temp 0.3 --top-p 1.0 --min-p 0.01 \
  --sleep-idle-seconds 600 \
  --host 0.0.0.0 --port 1234

Problem description & steps to reproduce

I've been trying the most recent Ternary 8B model (Ternary-Bonsai-8B, Q2_0) with the Linux Vulkan build, and the performance is terrible: 2.4 t/s, which is even worse than the CPU build (2.73 t/s). For comparison, the same hardware runs a much larger model (Qwen 3.6 MoE) at 24 t/s, ten times faster. The GPU has 16 GB of dedicated memory, but only about 25% of it is used (~4 GB at 8K context).

Interestingly, while chatting the GPU is barely involved and the CPU spikes, even with --gpu-layers all. From my point of view, it looks like the Vulkan build "forgets" that it has a GPU available for some tasks, but of course I may be wrong.
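As a quick sanity check of the ratios quoted above, the snippet below recomputes them using only the numbers in this report (no new measurements). It is a minimal bash sketch that assumes `awk` is available for floating-point division; `ratio` is a hypothetical helper name, not part of llama.cpp.

```shell
#!/usr/bin/env bash
# Recompute the ratios quoted in this report from its own figures.
# Assumption: awk is installed (bash itself has no floating-point arithmetic).
set -euo pipefail

vulkan_tps=2.4   # Ternary-Bonsai-8B, Vulkan build (t/s)
cpu_tps=2.73     # Ternary-Bonsai-8B, CPU build (t/s)
qwen_tps=24      # Qwen 3.6 MoE on the same machine (t/s)
vram_gib=16      # dedicated GPU memory as stated in the report
used_gib=4       # approximate GPU memory used at 8K context

# ratio A B: print A / B rounded to one decimal place (hypothetical helper)
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'; }

echo "Qwen 3.6 MoE vs Bonsai 8B (Vulkan): $(ratio "$qwen_tps" "$vulkan_tps")x faster"
echo "CPU vs Vulkan (Bonsai 8B):          $(ratio "$cpu_tps" "$vulkan_tps")x faster"
echo "GPU memory in use:                  $(ratio "$((used_gib * 100))" "$vram_gib")%"
```

This prints 10.0x, 1.1x and 25.0%, consistent with the "10 times slower" and "25%" figures above.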

OS Specs

OS: Linux Mint 22.3 x86_64
Host: Venus series
Kernel: 6.8.0-110-generic
Shell: bash 5.2.21
Terminal: /dev/pts/0
CPU: AMD Ryzen 9 7940HS w/ Radeon 780M Graphics (16) @ 5.263GHz
GPU: AMD ATI c4:00.0 Phoenix1
Memory: 3545MiB / 47954MiB
