Name and Version
./llama-cli --version
load_backend: loaded RPC backend from ~/Applications/llama-prism/libggml-rpc.so
load_backend: loaded Vulkan backend from ~/Applications/llama-prism/libggml-vulkan.so
load_backend: loaded CPU backend from ~/Applications/llama-prism/libggml-cpu-zen4.so
version: 8846 (d104cf1b6)
built with GNU 11.4.0 for Linux x86_64
llama-cli --list-devices
load_backend: loaded RPC backend from ~/Applications/llama-prism/libggml-rpc.so
load_backend: loaded Vulkan backend from ~/Applications/llama-prism/libggml-vulkan.so
load_backend: loaded CPU backend from ~/Applications/llama-prism/libggml-cpu-zen4.so
Available devices:
Vulkan0: AMD Radeon 780M Graphics (RADV PHOENIX) (40361 MiB, 19410 MiB free)
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
./llama-server -m ~/.lmstudio/models/prism-ml/Ternary-Bonsai-8B-gguf/Ternary-Bonsai-8B-Q2_0.gguf \
--alias bonsai-8b \
--ctx-size 8192 \
--jinja --gpu-layers all \
--temp 0.3 --top-p 1.0 --min-p 0.01 \
--sleep-idle-seconds 600 \
--host 0.0.0.0 --port 1234
Problem description & steps to reproduce
I've been trying the most recent Ternary 8B model with the Linux Vulkan build, and performance is terrible (2.4 t/s), worse than the CPU build (2.73 t/s). For comparison, the same hardware runs a much larger model (Qwen 3.6 MoE) at 24 t/s. The GPU has 16 GB of dedicated memory, but it only fills to about 25% (~4 GB at 8K context). Interestingly, while chatting the GPU is barely involved and the CPU spikes, even with --gpu-layers 'all'. From my POV, it looks like the Vulkan build "forgets" it has a GPU available for some operations, but of course I may be wrong.
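For reference, this is roughly how I compared the two numbers above with llama-bench from the same builds (a sketch; the prompt/generation token counts are illustrative, and the model path matches the server command above):

# Vulkan build, all layers offloaded
./llama-bench -m ~/.lmstudio/models/prism-ml/Ternary-Bonsai-8B-gguf/Ternary-Bonsai-8B-Q2_0.gguf \
  -ngl 99 -p 512 -n 128

# CPU-only comparison on the same model
./llama-bench -m ~/.lmstudio/models/prism-ml/Ternary-Bonsai-8B-gguf/Ternary-Bonsai-8B-Q2_0.gguf \
  -ngl 0 -p 512 -n 128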
OS Specs
OS: Linux Mint 22.3 x86_64
Host: Venus series
Kernel: 6.8.0-110-generic
Shell: bash 5.2.21
Terminal: /dev/pts/0
CPU: AMD Ryzen 9 7940HS w/ Radeon 780M Graphics (16) @ 5.263GHz
GPU: AMD ATI c4:00.0 Phoenix1
Memory: 3545MiB / 47954MiB