Name and Version
zekromllm@zekromllm:/HomeLab/GPU/TurboQuantAmesianX$ docker compose run --rm --remove-orphans --entrypoint llama-cli turboa-llama-cpp --model /models/gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf --mmproj /models/mmproj-google_gemma-4-26B-A4B-it-f16.gguf --flash-attn on --mlock --no-warmup --parallel 4 --n-gpu-layers -1 --n-cpu-moe 18 --cache-type-k tbq4_0 --cache-type-v tbq4_0 --threads 12 --ctx-size 128000 --batch-size 4096 --ubatch-size 1024 --n-predict 4096 -p "Why is the sky blue?" --version
Container turboquantamesianx-turboa-llama-cpp-run-7a90e7ab92d8 Creating
Container turboquantamesianx-turboa-llama-cpp-run-7a90e7ab92d8 Created
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15842 MiB):
Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, VRAM: 15842 MiB
version: 8702 (cf2170e5f)
built with GNU 13.3.0 for Linux x86_64
Version 1.5.2.
Operating systems
Docker compose
Builder: nvidia/cuda:13.1.1-devel-ubuntu24.04
Runtime: nvidia/cuda:13.1.1-runtime-ubuntu24.04
docker-compose.yml
Dockerfile.txt
GGML backends
CUDA
Hardware
Ryzen 9900x + 5070Ti @PCIe5 + 64GB DDR5
Models
gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF
gemma-3-27b-it-IQ4_XS.gguf https://huggingface.co/unsloth/gemma-3-27b-it-GGUF
Problem description & steps to reproduce
When I set --cache-type-k tbq4_0 --cache-type-v tbq4_0 for Gemma 4 26B A4B Q6_K_XL, it produces gibberish, just as it did with TheTom's version with turbo4. This happens even without the mmproj.
docker compose run --rm --entrypoint llama-cli turboa-llama-cpp \
--model /models/gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf \
--mmproj /models/mmproj-google_gemma-4-26B-A4B-it-f16.gguf \
--flash-attn on \
--mlock \
--no-warmup \
--parallel 4 \
--n-gpu-layers -1 \
--n-cpu-moe 18 \
--cache-type-k tbq4_0 \
--cache-type-v tbq4_0 \
--threads 12 \
--ctx-size 128000 \
--batch-size 4096 \
--ubatch-size 1024 \
--n-predict 4096 \
-p "Why is the sky blue?"
Note: I also tried to run Gemma 3 and it failed to initialize at all; there were no errors, it just stopped. It works fine with q4_0:
zekromllm@zekromllm:/HomeLab/GPU/TurboQuantAmesianX$ docker compose run --rm --remove-orphans --entrypoint llama-cli turboa-llama-cpp \
--model /models/gemma-3-27b-it-IQ4_XS.gguf \
--fit off \
--flash-attn on \
--mlock \
--no-warmup \
--parallel 1 \
--n-gpu-layers 50 \
--cache-type-k tbq3_0 \
--cache-type-v tbq3_0 \
--threads 12 \
--ctx-size 16000 \
--batch-size 2048 \
--ubatch-size 512 \
-p "Why is the sky blue?"
Container turboquantamesianx-turboa-llama-cpp-run-f4ce0c6cce1b Creating
Container turboquantamesianx-turboa-llama-cpp-run-f4ce0c6cce1b Created
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15842 MiB):
Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, VRAM: 15842 MiB
Loading model... /
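To narrow this down, a sweep over cache types with the same prompt could show which types hang or produce gibberish. This is only a sketch reusing the Gemma 3 flags from the command above; the 300-second timeout and the cache-sweep.log file name are arbitrary choices, and which cache-type values this build accepts is an assumption:

```shell
#!/usr/bin/env bash
# Hypothetical helper: run the same prompt once per KV-cache type and keep
# the tail of each run for side-by-side comparison. Requires docker compose
# and the model file from the report.
set -uo pipefail

for ct in q4_0 tbq3_0 tbq4_0; do
  echo "=== cache type: $ct ===" | tee -a cache-sweep.log
  timeout 300 docker compose run --rm --entrypoint llama-cli turboa-llama-cpp \
    --model /models/gemma-3-27b-it-IQ4_XS.gguf \
    --flash-attn on --mlock --no-warmup \
    --n-gpu-layers 50 \
    --cache-type-k "$ct" --cache-type-v "$ct" \
    --threads 12 --ctx-size 16000 \
    --n-predict 128 \
    -p "Why is the sky blue?" 2>&1 | tail -n 20 | tee -a cache-sweep.log \
    || echo "cache type $ct failed or timed out" | tee -a cache-sweep.log
done
```

The timeout catches the silent-hang case seen above, so a stuck init shows up as "failed or timed out" instead of blocking the sweep.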
First Bad Commit
I am not sure, though it appeared to become unstable starting with version 1.5.0.
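Since the first bad commit is unknown, a git bisect over the project's source tree could automate the search. This is only a sketch: the last-good ref and the check-output.sh helper (which would rebuild and rerun the tbq4_0 repro, exiting non-zero when the output is gibberish) are hypothetical placeholders:

```shell
# Hypothetical bisect sketch; <last-good-ref> and check-output.sh are placeholders.
git bisect start
git bisect bad HEAD               # current build (b8702-cf2170e5f) misbehaves
git bisect good <last-good-ref>   # e.g. a ref before 1.5.0, when output was still stable
git bisect run ./check-output.sh  # rebuild + run the tbq4_0 repro; non-zero exit = bad
```

git bisect run then walks the history automatically, so only the check script needs to distinguish coherent output from gibberish.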
Relevant log output
Logs
zekromllm@zekromllm:/HomeLab/GPU/TurboQuantAmesianX$ docker compose run --rm --remove-orphans --entrypoint llama-cli turboa-llama-cpp \
--model /models/gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf \
--mmproj /models/mmproj-google_gemma-4-26B-A4B-it-f16.gguf \
--flash-attn on \
--mlock \
--no-warmup \
--parallel 4 \
--n-gpu-layers -1 \
--n-cpu-moe 18 \
--cache-type-k tbq4_0 \
--cache-type-v tbq4_0 \
--threads 12 \
--ctx-size 128000 \
--batch-size 4096 \
--ubatch-size 1024 \
--n-predict 4096 \
-p "Why is the sky blue?"
Container turboquantamesianx-turboa-llama-cpp-run-5f31db41abcb Creating
Container turboquantamesianx-turboa-llama-cpp-run-5f31db41abcb Created
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15842 MiB):
Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, VRAM: 15842 MiB
Loading model...
▄▄ ▄▄
██ ██
██ ██ ▀▀█▄ ███▄███▄ ▀▀█▄ ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██ ██ ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
██ ██
▀▀ ▀▀
build : b8702-cf2170e5f
model : gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf
modalities : text, vision
available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read <file> add a text file
/glob <pattern> add text files using globbing pattern
/image <file> add an image file
> Why is the sky blue?
瑞 de- actually-----—--—–--------—-
10.-10-1-1-1--1-1--1-0-0--1-1--1--1--1-1--1--1--1--1--1--1--1-0-1--1--1-0--1-1--1--1--1--1--1-1--1-1--1--1--1--1--1-1--1--1--1-1--1--1-1--1--1--1--—-— [1-10-10-10--10-10-0-1--1--1--1--1--1--1--1--1--1--1--1--1--1--1--1--1--1--1-1--1--1-1-1--1--1--1--1--1---1--1--1--1--1-
[ Prompt: 105.0 t/s | Generation: 46.7 t/s ]
zekromllm@zekromllm:/HomeLab/GPU/TurboQuantAmesianX$ docker compose run --rm --remove-orphans --entrypoint llama-cli turboa-llama-cpp \
--model /models/gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf \
--fit off \
--flash-attn on \
--mlock \
--no-warmup \
--parallel 4 \
--n-gpu-layers -1 \
--n-cpu-moe 18 \
--cache-type-k tbq4_0 \
--cache-type-v tbq4_0 \
--threads 12 \
--ctx-size 128000 \
--batch-size 4096 \
--ubatch-size 1024 \
--n-predict 4096 \
-p "Why is the sky blue?"
Container turboquantamesianx-turboa-llama-cpp-run-b5c9ba13613d Creating
Container turboquantamesianx-turboa-llama-cpp-run-b5c9ba13613d Created
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15842 MiB):
Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, VRAM: 15842 MiB
Loading model...
▄▄ ▄▄
██ ██
██ ██ ▀▀█▄ ███▄███▄ ▀▀█▄ ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██ ██ ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
██ ██
▀▀ ▀▀
build : b8702-cf2170e5f
model : gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf
modalities : text
available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read <file> add a text file
/glob <pattern> add text files using globbing pattern
> Why is the sky blue?
horizon-100-v-1.0-11.-10-10-0-0-0-0-0-0-0-0-0-0-0-0-0--0-0-0-0-0-0-0-0-0-0-0--10-10-0-0-0-0-0-0-0--0-0-0-0-0-0-0--
Expected result
zekromllm@zekromllm:/HomeLab/GPU/TurboQuantAmesianX$ docker compose run --rm --remove-orphans --entrypoint llama-cli turboa-llama-cpp \
--model /models/gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf \
--fit off \
--flash-attn on \
--mlock \
--no-warmup \
--parallel 4 \
--n-gpu-layers -1 \
--n-cpu-moe 18 \
--cache-type-k tbq3_0 \
--cache-type-v tbq3_0 \
--threads 12 \
--ctx-size 128000 \
--batch-size 4096 \
--ubatch-size 1024 \
--n-predict 4096 \
-p "Why is the sky blue?"
Container turboquantamesianx-turboa-llama-cpp-run-45926ad6657a Creating
Container turboquantamesianx-turboa-llama-cpp-run-45926ad6657a Created
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15842 MiB):
Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, VRAM: 15842 MiB
Loading model...
▄▄ ▄▄
██ ██
██ ██ ▀▀█▄ ███▄███▄ ▀▀█▄ ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██ ██ ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
██ ██
▀▀ ▀▀
build : b8702-cf2170e5f
model : gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf
modalities : text
available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read <file> add a text file
/glob <pattern> add text files using globbing pattern
> Why is the sky blue?
[Start thinking]
"Why is the sky blue?"
Science/Physics/Atmospheric science.
General audience (needs to be clear, accurate, but not overly dense with jargon unless explained).
* Sunlight (white light) is made of all colors of the rainbow (ROYGBIV).
* Light travels as waves.
* Different colors have different wavelengths (Red = long, Blue/Violet = short).
* Earth's atmosphere is filled with gases (Nitrogen, Oxygen).
* *Rayleigh Scattering:* When light hits gas molecules, shorter wavelengths (blue) scatter more easily in all directions than longer wavelengths (red).
* The human eye is more sensitive to blue than violet
Gemma4tbq4.txt
gemma3load.txt