
Eval bug: gemma-4-26b-a4b-it-turbo using tbq4_0 is insane (and gemma3 fails to load correctly) #17

@zekrom-vale

Description


Name and Version

zekromllm@zekromllm:/HomeLab/GPU/TurboQuantAmesianX$ docker compose run --rm --remove-orphans --entrypoint llama-cli turboa-llama-cpp   --model /models/gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf   --mmproj /models/mmproj-google_gemma-4-26B-A4B-it-f16.gguf   --flash-attn on   --mlock   --no-warmup   --parallel 4   --n-gpu-layers -1   --n-cpu-moe 18   --cache-type-k tbq4_0   --cache-type-v tbq4_0   --threads 12   --ctx-size 128000   --batch-size 4096   --ubatch-size 1024   --n-predict 4096   -p "Why is the sky blue?" --version
Container turboquantamesianx-turboa-llama-cpp-run-7a90e7ab92d8 Creating 
Container turboquantamesianx-turboa-llama-cpp-run-7a90e7ab92d8 Created 
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15842 MiB):
  Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, VRAM: 15842 MiB
version: 8702 (cf2170e5f)
built with GNU 13.3.0 for Linux x86_64

Version 1.5.2

Operating systems

Docker Compose

Builder: nvidia/cuda:13.1.1-devel-ubuntu24.04
Runtime: nvidia/cuda:13.1.1-runtime-ubuntu24.04

docker-compose.yml

Dockerfile.txt

GGML backends

CUDA

Hardware

Ryzen 9900x + 5070Ti @PCIe5 + 64GB DDR5

Models

gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF
gemma-3-27b-it-IQ4_XS.gguf https://huggingface.co/unsloth/gemma-3-27b-it-GGUF

Problem description & steps to reproduce

When I set --cache-type-k tbq4_0 --cache-type-v tbq4_0 for Gemma4 26B A4B Q6_K_XL, it produces gibberish, just as it did with TheTom's version using turbo4. This happens even without the mmproj.

docker compose run --rm --entrypoint llama-cli turboa-llama-cpp \
  --model /models/gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf \
  --mmproj /models/mmproj-google_gemma-4-26B-A4B-it-f16.gguf \
  --flash-attn on \
  --mlock \
  --no-warmup \
  --parallel 4 \
  --n-gpu-layers -1 \
  --n-cpu-moe 18 \
  --cache-type-k tbq4_0 \
  --cache-type-v tbq4_0 \
  --threads 12 \
  --ctx-size 128000 \
  --batch-size 4096 \
  --ubatch-size 1024 \
  --n-predict 4096 \
  -p "Why is the sky blue?"
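As an extra control run (not something I originally captured), the same invocation with the cache-type flags dropped, so the KV cache falls back to the default f16, should isolate the tbq4_0 quantization as the cause; everything else is kept identical:

```shell
# Hypothetical control run: identical to the failing command above, but with
# --cache-type-k / --cache-type-v removed so the KV cache uses the default f16.
# If this produces coherent output, the gibberish is isolated to tbq4_0.
docker compose run --rm --entrypoint llama-cli turboa-llama-cpp \
  --model /models/gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf \
  --flash-attn on \
  --mlock \
  --no-warmup \
  --parallel 4 \
  --n-gpu-layers -1 \
  --n-cpu-moe 18 \
  --threads 12 \
  --ctx-size 128000 \
  --batch-size 4096 \
  --ubatch-size 1024 \
  --n-predict 4096 \
  -p "Why is the sky blue?"
```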

Note: I also tried to run Gemma3 and it failed to init at all; no errors, it just stopped. It works fine with q4_0:

zekromllm@zekromllm:/HomeLab/GPU/TurboQuantAmesianX$ docker compose run --rm --remove-orphans --entrypoint llama-cli turboa-llama-cpp \
  --model /models/gemma-3-27b-it-IQ4_XS.gguf \
  --fit off \
  --flash-attn on \
  --mlock \
  --no-warmup \
  --parallel 1 \
  --n-gpu-layers 50 \
  --cache-type-k tbq3_0 \
  --cache-type-v tbq3_0 \
  --threads 12 \
  --ctx-size 16000 \
  --batch-size 2048 \
  --ubatch-size 512 \
  -p "Why is the sky blue?"
Container turboquantamesianx-turboa-llama-cpp-run-f4ce0c6cce1b Creating 
Container turboquantamesianx-turboa-llama-cpp-run-f4ce0c6cce1b Created 
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15842 MiB):
  Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, VRAM: 15842 MiB

Loading model... /

First Bad Commit

I am not sure, though it already appeared to be unstable as of version 1.5.0.

Relevant log output

Logs
zekromllm@zekromllm:/HomeLab/GPU/TurboQuantAmesianX$ docker compose run --rm --remove-orphans --entrypoint llama-cli turboa-llama-cpp \
  --model /models/gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf \
  --mmproj /models/mmproj-google_gemma-4-26B-A4B-it-f16.gguf \
  --flash-attn on \
  --mlock \
  --no-warmup \
  --parallel 4 \
  --n-gpu-layers -1 \
  --n-cpu-moe 18 \
  --cache-type-k tbq4_0 \
  --cache-type-v tbq4_0 \
  --threads 12 \
  --ctx-size 128000 \
  --batch-size 4096 \
  --ubatch-size 1024 \
  --n-predict 4096 \
  -p "Why is the sky blue?"
Container turboquantamesianx-turboa-llama-cpp-run-5f31db41abcb Creating 
Container turboquantamesianx-turboa-llama-cpp-run-5f31db41abcb Created 
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15842 MiB):
  Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, VRAM: 15842 MiB

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8702-cf2170e5f
model      : gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf
modalities : text, vision

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern
  /image <file>       add an image file


> Why is the sky blue?

瑞 de- actually-----—--—–--------—-
10.-10-1-1-1--1-1--1-0-0--1-1--1--1--1-1--1--1--1--1--1--1--1-0-1--1--1-0--1-1--1--1--1--1--1-1--1-1--1--1--1--1--1-1--1--1--1-1--1--1-1--1--1--1--—-— [1-10-10-10--10-10-0-1--1--1--1--1--1--1--1--1--1--1--1--1--1--1--1--1--1--1-1--1--1-1-1--1--1--1--1--1---1--1--1--1--1-

[ Prompt: 105.0 t/s | Generation: 46.7 t/s ]


zekromllm@zekromllm:/HomeLab/GPU/TurboQuantAmesianX$ docker compose run --rm --remove-orphans --entrypoint llama-cli turboa-llama-cpp \
  --model /models/gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf \
  --fit off \
  --flash-attn on \
  --mlock \
  --no-warmup \
  --parallel 4 \
  --n-gpu-layers -1 \
  --n-cpu-moe 18 \
  --cache-type-k tbq4_0 \
  --cache-type-v tbq4_0 \
  --threads 12 \
  --ctx-size 128000 \
  --batch-size 4096 \
  --ubatch-size 1024 \
  --n-predict 4096 \
  -p "Why is the sky blue?"
Container turboquantamesianx-turboa-llama-cpp-run-b5c9ba13613d Creating 
Container turboquantamesianx-turboa-llama-cpp-run-b5c9ba13613d Created 
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15842 MiB):
  Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, VRAM: 15842 MiB

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8702-cf2170e5f
model      : gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern


> Why is the sky blue?

horizon-100-v-1.0-11.-10-10-0-0-0-0-0-0-0-0-0-0-0-0-0--0-0-0-0-0-0-0-0-0-0-0--10-10-0-0-0-0-0-0-0--0-0-0-0-0-0-0--

Expected result

zekromllm@zekromllm:/HomeLab/GPU/TurboQuantAmesianX$ docker compose run --rm --remove-orphans --entrypoint llama-cli turboa-llama-cpp \
  --model /models/gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf \
  --fit off \
  --flash-attn on \
  --mlock \
  --no-warmup \
  --parallel 4 \
  --n-gpu-layers -1 \
  --n-cpu-moe 18 \
  --cache-type-k tbq3_0 \
  --cache-type-v tbq3_0 \
  --threads 12 \
  --ctx-size 128000 \
  --batch-size 4096 \
  --ubatch-size 1024 \
  --n-predict 4096 \
  -p "Why is the sky blue?"
Container turboquantamesianx-turboa-llama-cpp-run-45926ad6657a Creating 
Container turboquantamesianx-turboa-llama-cpp-run-45926ad6657a Created 
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15842 MiB):
  Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, VRAM: 15842 MiB

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8702-cf2170e5f
model      : gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern


> Why is the sky blue?

[Start thinking]
"Why is the sky blue?"
Science/Physics/Atmospheric science.
General audience (needs to be clear, accurate, but not overly dense with jargon unless explained).

    *   Sunlight (white light) is made of all colors of the rainbow (ROYGBIV).
    *   Light travels as waves.
    *   Different colors have different wavelengths (Red = long, Blue/Violet = short).
    *   Earth's atmosphere is filled with gases (Nitrogen, Oxygen).
    *   *Rayleigh Scattering:* When light hits gas molecules, shorter wavelengths (blue) scatter more easily in all directions than longer wavelengths (red).
    *   The human eye is more sensitive to blue than violet

Gemma4tbq4.txt

gemma3load.txt
