Name and Version
version: b8816 (also reproduced on b8708)
built: ubuntu-24.04-arm runner, cmake -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv8-a
host: Raspberry Pi 5 (Cortex-A76, ARMv8.2-A, features: fphp asimdhp asimddp asimdrdm)
Operating systems
Linux
GGML backends
CPU
Hardware
Raspberry Pi 5 (Broadcom BCM2712), 4× Cortex-A76 @ 2.4 GHz, 8 GB RAM. ARMv8.2-A with half-precision + dot product. No SVE.
Models
Primary repro: Gemma 3 270M fine-tune (base: unsloth/functiongemma-270m-it), quantized to Q8_0.
Also reproduced to a lesser degree on: unsloth/gemma-3-1b-it-GGUF (Q8_0), which segfaults on b8708 and produces numerically wrong (but non-crashing) output on b8816.
Problem description & steps to reproduce
The same GGUF produces correct output on Mac CPU; on the Pi ARM64 CPU build, the forward pass silently produces a wrong probability distribution.
Reproduction via the llama-cpp-python API (llama-cli directly gives the same result):
```python
from llama_cpp import Llama

llm = Llama(model_path="/path/model.Q8_0.gguf", n_ctx=512, n_gpu_layers=0, logits_all=True)
out = llm.create_completion("The capital of France is", max_tokens=1, temperature=0, logprobs=10)
```
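The top-token listings below were pulled out of the response along these lines (a minimal sketch assuming the OpenAI-style logprobs layout that `create_completion` returns; the model path above is a placeholder):

```python
# Sort and print the top-10 next-token logprobs from the completion above.
top = out["choices"][0]["logprobs"]["top_logprobs"][0]  # dict: token -> logprob
for tok, lp in sorted(top.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tok!r}: {lp:.3f}")
```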
Mac CPU (llama-cpp-python 0.3.20, bundled ggml 0.9.11):

```
top tokens:
' Paris': -0.003
' **':    -6.139
' Bast':  -8.627
...
```
Pi ARM64 CPU (b8816 built from the GitHub workflow with -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv8-a, ggml 0.9.11):

```
top tokens:
' bel':   -0.602
' el':    -1.468
' where': -2.533
' bal':   -2.695
' the':   -3.475
... (" Paris" not in top 10)
```
The output is wrong for both Q8_0 and bf16 (the garbage differs between quants, but " Paris" never appears in the Pi's top 10). The result is identical across n_threads=1/2/4, and the same with flash_attn on or off.
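For completeness, the thread/flash-attention sweep looked roughly like this (a sketch, not the exact harness; `n_threads` and `flash_attn` are standard `Llama` constructor parameters in llama-cpp-python 0.3.x, and the model path is a placeholder):

```python
from llama_cpp import Llama

PROMPT = "The capital of France is"

# Sweep settings that should not change greedy (temperature=0) output;
# on the Pi the wrong top token reproduces for every combination.
for n_threads in (1, 2, 4):
    for flash_attn in (False, True):
        llm = Llama(model_path="/path/model.Q8_0.gguf", n_ctx=512,
                    n_gpu_layers=0, n_threads=n_threads,
                    flash_attn=flash_attn, verbose=False)
        out = llm.create_completion(PROMPT, max_tokens=1, temperature=0)
        print(n_threads, flash_attn, repr(out["choices"][0]["text"]))
```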
Isolated the bug to Gemma 3 architecture specifically:
- Qwen 2.5 0.5B Q8_0 on the same Pi/binary: " Paris" at -0.918 ✓ (matches Mac to within expected precision)
- Gemma 3 1B base Q8_0: segfaults on b8708; produces wrong but non-crashing output on b8816 (" France" on top instead of " Paris")
- Our 270M Gemma 3 fine-tune: byte-identical garbage logits between b8708 and b8816
Running `strings` on the binary shows the standard Gemma 3 ISWA implementation symbols (`llm_build_gemma3`, `llama_kv_cache_iswa`). Qwen uses standard GQA and is unaffected. This strongly points at the ARM64 CPU kernel path for Gemma 3's interleaved sliding-window attention, or at an upstream fp16 accumulation issue specific to this architecture/shape.
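A diff-friendly way to capture the divergence across hosts, if useful for triage (a sketch; the model path and output filename are placeholders):

```python
import json
from llama_cpp import Llama

# Capture the head of the next-token distribution to a JSON file on each
# machine; diffing the Mac and Pi files shows the disagreement directly,
# with sampling taken out of the picture.
llm = Llama(model_path="/path/model.Q8_0.gguf", n_ctx=512,
            n_gpu_layers=0, logits_all=True, verbose=False)
out = llm.create_completion("The capital of France is", max_tokens=1,
                            temperature=0, logprobs=10)
with open("top_logprobs.json", "w") as f:
    json.dump(out["choices"][0]["logprobs"]["top_logprobs"][0], f,
              indent=2, sort_keys=True)
```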
First Bad Commit
Bisection not performed yet; b8708 and b8816 both show the symptom. The 1B segfault does get fixed somewhere in that range, but the 270M numerics bug is present across the full range tested.
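Once a known-good tag is identified, a `git bisect run` helper along these lines is what I'd use (a hypothetical sketch; the model path and script name are mine, and the `llama-cli` flags are the standard `-m/-p/-n/--temp` ones):

```python
#!/usr/bin/env python3
# Hypothetical bisect helper: exit 0 = good (" Paris" predicted),
# 1 = bad, 125 = commit does not build (git bisect skips it).
import subprocess
import sys

MODEL = "/path/model.Q8_0.gguf"  # placeholder

def run(*cmd):
    return subprocess.run(cmd, capture_output=True, text=True)

# Build the current checkout with the same flags as the failing binary.
if run("cmake", "-B", "build", "-DGGML_NATIVE=OFF",
       "-DGGML_CPU_ARM_ARCH=armv8-a").returncode != 0 or \
   run("cmake", "--build", "build", "-j", "4").returncode != 0:
    sys.exit(125)

# Greedy single-token completion; newer builds may need -no-cnv to force
# plain completion mode instead of conversation mode.
r = run("./build/bin/llama-cli", "-m", MODEL,
        "-p", "The capital of France is", "-n", "1", "--temp", "0")
sys.exit(0 if "Paris" in r.stdout else 1)
```

Invoked as `git bisect start <bad> <good>` followed by `git bisect run python3 bisect_check.py`.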
Relevant log output
```
llama_context: n_ctx = 4096
llama_context: n_ctx_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 2048
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_kv_cache_iswa: using full-size SWA cache
llama_kv_cache_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_iswa: creating SWA KV cache, size = 4096 cells
llama_kv_cache: K (f16): 6.00 MiB, V (f16): 6.00 MiB
llama_kv_cache: K (f16): 30.00 MiB, V (f16): 30.00 MiB
sched_reserve: CPU compute buffer size = 2058.01 MiB
Warmup completed in 49.2ms
```
No errors at load time: the model loads cleanly, reports the expected graph layout, and warmup completes. The divergence appears directly in the forward-pass logits.
I'm happy to bisect the range and try additional reproductions if useful. Also happy to share the fine-tuned GGUFs privately if helpful.