Eval bug: Gemma 3 on ARM64 CPU (Cortex-A76) produces wrong logits; Qwen unaffected — b8816 #22011

@mhamann

Description


Name and Version

version: b8816 (also reproduced on b8708)
built: ubuntu-24.04-arm runner, cmake -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv8-a
host: Raspberry Pi 5 (Cortex-A76, ARMv8.2-A, features: fphp asimdhp asimddp asimdrdm)

Operating systems

Linux

GGML backends

CPU

Hardware

Raspberry Pi 5 (Broadcom BCM2712), 4× Cortex-A76 @ 2.4 GHz, 8 GB RAM. ARMv8.2-A with half-precision + dot product. No SVE.

Models

Primary repro: Gemma 3 270M fine-tune (base: unsloth/functiongemma-270m-it), quantized to Q8_0.

Also reproduced, to a lesser degree, on unsloth/gemma-3-1b-it-GGUF (Q8_0), which segfaults on b8708 and produces numerically wrong (but non-crashing) output on b8816.

Problem description & steps to reproduce

Same GGUF on Mac CPU produces correct output; on the Pi ARM64 CPU build the forward pass silently produces a wrong probability distribution.

Reproduction via the llama-cpp-python API (llama-cli directly gives the same result):

from llama_cpp import Llama

llm = Llama(model_path="/path/model.Q8_0.gguf", n_ctx=512, n_gpu_layers=0, logits_all=True)
out = llm.create_completion("The capital of France is", max_tokens=1, temperature=0, logprobs=10)
print(out["choices"][0]["logprobs"]["top_logprobs"][0])

Mac CPU (llama-cpp-python 0.3.20, bundled ggml 0.9.11):

top tokens:
  ' Paris': -0.003
  ' **':    -6.139
  ' Bast':  -8.627
  ...

Pi ARM64 CPU (b8816 built from GitHub workflow with -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv8-a, ggml 0.9.11):

top tokens:
  ' bel':   -0.602
  ' el':    -1.468
  ' where': -2.533
  ' bal':   -2.695
  ' the':   -3.475
  ...  (" Paris" not in top 10)

The wrong output reproduces with both Q8_0 and bf16 (the garbage differs between the two quants, but " Paris" never appears on the Pi). The result is identical across n_threads=1/2/4 and unchanged with flash_attn on/off.
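To quantify the divergence rather than eyeballing the lists, here is a small hypothetical helper (not part of llama.cpp) that measures top-k token overlap between two `top_logprobs` dicts; the sample values are the ones reported above:

```python
def topk_overlap(a: dict, b: dict) -> float:
    """Fraction of tokens shared between two top-k logprob dicts."""
    shared = set(a) & set(b)
    return len(shared) / max(len(a), len(b))

# Top tokens observed in this report (truncated lists):
mac = {' Paris': -0.003, ' **': -6.139, ' Bast': -8.627}
pi  = {' bel': -0.602, ' el': -1.468, ' where': -2.533,
       ' bal': -2.695, ' the': -3.475}

print(topk_overlap(mac, pi))  # 0.0 -- the two distributions share no tokens at all
```

An overlap of 0.0 is far beyond ordinary cross-platform fp rounding noise, which normally only reorders near-ties.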

Isolated the bug to Gemma 3 architecture specifically:

  • Qwen 2.5 0.5B Q8_0 on same Pi/binary: " Paris" at -0.918 ✓ (matches Mac to within expected precision)
  • Gemma 3 1B base Q8_0: segfaults on b8708; produces wrong-but-not-crashing output on b8816 (" France" top instead of " Paris")
  • Our 270M Gemma 3 fine-tune: byte-identical garbage logits between b8708 and b8816

Running strings on the binary shows the standard Gemma 3 ISWA implementation symbols (llm_build_gemma3, llama_kv_cache_iswa). Qwen uses standard GQA and is unaffected. This strongly points at the ARM64 CPU kernel path for Gemma 3's interleaved sliding-window attention, or at an fp16 accumulation issue specific to this architecture/shape.
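As a toy illustration of the fp16-accumulation hypothesis (this is not the ggml kernel, just stdlib Python emulating an fp16 accumulator via struct's binary16 format), accumulating a long dot product in half precision drifts measurably from a double-precision reference:

```python
import random
import struct

def f16(x: float) -> float:
    # Round to the nearest IEEE 754 binary16 value ('e' = half precision),
    # emulating an fp16 accumulator register.
    return struct.unpack('e', struct.pack('e', x))[0]

random.seed(0)
a = [f16(random.gauss(0, 1)) for _ in range(4096)]
b = [f16(random.gauss(0, 1)) for _ in range(4096)]

# Accumulate entirely in fp16, rounding after every multiply and add.
acc16 = 0.0
for x, y in zip(a, b):
    acc16 = f16(acc16 + f16(x * y))

# Double-precision reference over the same fp16 inputs.
acc64 = sum(x * y for x, y in zip(a, b))

print(acc16, acc64, abs(acc16 - acc64))
```

The inputs are identical in both sums; only the accumulator precision differs, yet the results diverge. A real kernel bug would of course produce far larger errors than this rounding drift, so this only shows why accumulator width matters at these sequence lengths.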

First Bad Commit

Bisection not performed yet; b8708 and b8816 both show the symptom. The 1B segfault is fixed somewhere in that range, but the 270M numerics bug is present across the full range tested.
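For the offered bisection, a predicate along these lines could serve as the good/bad test at each commit (hypothetical helper, not existing tooling; it would be fed the `top_logprobs` dict from the repro above and its boolean mapped to an exit code for `git bisect run`):

```python
def build_is_good(top_logprobs: dict, expected: str = " Paris") -> bool:
    """A build is 'good' if the expected token has the highest logprob."""
    if not top_logprobs:
        return False
    return max(top_logprobs, key=top_logprobs.get) == expected

# Values observed in this report:
print(build_is_good({' Paris': -0.003, ' **': -6.139}))   # True  (Mac)
print(build_is_good({' bel': -0.602, ' el': -1.468}))     # False (Pi)
```

Keying on the argmax token rather than exact logit values keeps the predicate robust to ordinary cross-build rounding differences.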

Relevant log output

llama_context: n_ctx         = 4096
llama_context: n_ctx_seq     = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 2048
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_kv_cache_iswa: using full-size SWA cache
llama_kv_cache_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_iswa: creating     SWA KV cache, size = 4096 cells
llama_kv_cache: K (f16): 6.00 MiB, V (f16): 6.00 MiB
llama_kv_cache: K (f16): 30.00 MiB, V (f16): 30.00 MiB
sched_reserve:        CPU compute buffer size =  2058.01 MiB
Warmup completed in 49.2ms

No errors at load time. The model loads cleanly, reports the expected graph layout, and warmup completes. The divergence appears in the forward pass logits directly.

I'm happy to bisect the range and try additional reproductions if useful. Also happy to share the fine-tuned GGUFs privately if helpful.
