Eval bug: Gemma 3 on ARM64 CPU (Cortex-A76) produces wrong logits; Qwen unaffected — b8816 #22011

@mhamann

Description


Name and Version

version: b8816 (also reproduced on b8708)
built: ubuntu-24.04-arm runner, cmake -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv8-a
host: Raspberry Pi 5 (Cortex-A76, ARMv8.2-A, features: fphp asimdhp asimddp asimdrdm)

Operating systems

Linux

GGML backends

CPU

Hardware

Raspberry Pi 5 (Broadcom BCM2712), 4× Cortex-A76 @ 2.4 GHz, 8 GB RAM. ARMv8.2-A with half-precision + dot product. No SVE.

Models

Primary repro: Gemma 3 270M fine-tune (base: unsloth/functiongemma-270m-it), quantized to Q8_0.

Also reproduced, to a lesser degree, on unsloth/gemma-3-1b-it-GGUF (Q8_0), which segfaults on b8708 and produces numerically wrong (but non-crashing) output on b8816.

Problem description & steps to reproduce

Same GGUF on Mac CPU produces correct output; on the Pi ARM64 CPU build the forward pass silently produces a wrong probability distribution.

Reproduction via the llama-cpp-python API (llama-cli directly gives the same result):

from llama_cpp import Llama

llm = Llama(model_path="/path/model.Q8_0.gguf", n_ctx=512, n_gpu_layers=0, logits_all=True)
out = llm.create_completion("The capital of France is", max_tokens=1, temperature=0, logprobs=10)
print(out["choices"][0]["logprobs"]["top_logprobs"][0])

Mac CPU (llama-cpp-python 0.3.20, bundled ggml 0.9.11):

top tokens:
  ' Paris': -0.003
  ' **':    -6.139
  ' Bast':  -8.627
  ...

Pi ARM64 CPU (b8816 built from GitHub workflow with -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv8-a, ggml 0.9.11):

top tokens:
  ' bel':   -0.602
  ' el':    -1.468
  ' where': -2.533
  ' bal':   -2.695
  ' the':   -3.475
  ...  (" Paris" not in top 10)

The wrong output reproduces with both Q8_0 and bf16 (the garbage differs between the two quants, but " Paris" never appears on the Pi). The result is identical across n_threads=1/2/4 and unchanged with flash_attn on/off.
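To quantify the divergence rather than eyeballing the lists, here is a small hypothetical helper (not part of llama.cpp) that measures top-k token overlap between two `top_logprobs` dicts; the sample values are the ones reported above:

```python
def topk_overlap(a: dict, b: dict) -> float:
    """Fraction of tokens shared between two top-k logprob dicts."""
    shared = set(a) & set(b)
    return len(shared) / max(len(a), len(b))

# Top tokens observed in this report (truncated lists):
mac = {' Paris': -0.003, ' **': -6.139, ' Bast': -8.627}
pi  = {' bel': -0.602, ' el': -1.468, ' where': -2.533,
       ' bal': -2.695, ' the': -3.475}

print(topk_overlap(mac, pi))  # 0.0 -- the two distributions share no tokens at all
```

An overlap of 0.0 is far beyond ordinary cross-platform fp rounding noise, which normally only reorders near-ties.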

Isolated the bug to Gemma 3 architecture specifically:

  • Qwen 2.5 0.5B Q8_0 on same Pi/binary: " Paris" at -0.918 ✓ (matches Mac to within expected precision)
  • Gemma 3 1B base Q8_0: segfaults on b8708; produces wrong-but-not-crashing output on b8816 (" France" top instead of " Paris")
  • Our 270M Gemma 3 fine-tune: byte-identical garbage logits between b8708 and b8816

Running strings on the binary shows the standard Gemma 3 ISWA implementation symbols (llm_build_gemma3, llama_kv_cache_iswa). Qwen uses standard GQA and is unaffected. This strongly points at the ARM64 CPU kernel path for Gemma 3's interleaved sliding-window attention, or at an fp16 accumulation issue specific to this architecture/shape.
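As a toy illustration of the fp16-accumulation hypothesis (this is not the ggml kernel, just stdlib Python emulating an fp16 accumulator via struct's binary16 format), accumulating a long dot product in half precision drifts measurably from a double-precision reference:

```python
import random
import struct

def f16(x: float) -> float:
    # Round to the nearest IEEE 754 binary16 value ('e' = half precision),
    # emulating an fp16 accumulator register.
    return struct.unpack('e', struct.pack('e', x))[0]

random.seed(0)
a = [f16(random.gauss(0, 1)) for _ in range(4096)]
b = [f16(random.gauss(0, 1)) for _ in range(4096)]

# Accumulate entirely in fp16, rounding after every multiply and add.
acc16 = 0.0
for x, y in zip(a, b):
    acc16 = f16(acc16 + f16(x * y))

# Double-precision reference over the same fp16 inputs.
acc64 = sum(x * y for x, y in zip(a, b))

print(acc16, acc64, abs(acc16 - acc64))
```

The inputs are identical in both sums; only the accumulator precision differs, yet the results diverge. A real kernel bug would of course produce far larger errors than this rounding drift, so this only shows why accumulator width matters at these sequence lengths.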

First Bad Commit

Bisection not performed yet; b8708 and b8816 both show the symptom. The 1B segfault is fixed somewhere in that range, but the 270M numerics bug is present across the full range tested.
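For the offered bisection, a predicate along these lines could serve as the good/bad test at each commit (hypothetical helper, not existing tooling; it would be fed the `top_logprobs` dict from the repro above and its boolean mapped to an exit code for `git bisect run`):

```python
def build_is_good(top_logprobs: dict, expected: str = " Paris") -> bool:
    """A build is 'good' if the expected token has the highest logprob."""
    if not top_logprobs:
        return False
    return max(top_logprobs, key=top_logprobs.get) == expected

# Values observed in this report:
print(build_is_good({' Paris': -0.003, ' **': -6.139}))   # True  (Mac)
print(build_is_good({' bel': -0.602, ' el': -1.468}))     # False (Pi)
```

Keying on the argmax token rather than exact logit values keeps the predicate robust to ordinary cross-build rounding differences.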

Relevant log output

llama_context: n_ctx         = 4096
llama_context: n_ctx_seq     = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 2048
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_kv_cache_iswa: using full-size SWA cache
llama_kv_cache_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_iswa: creating     SWA KV cache, size = 4096 cells
llama_kv_cache: K (f16): 6.00 MiB, V (f16): 6.00 MiB
llama_kv_cache: K (f16): 30.00 MiB, V (f16): 30.00 MiB
sched_reserve:        CPU compute buffer size =  2058.01 MiB
Warmup completed in 49.2ms

No errors at load time. The model loads cleanly, reports the expected graph layout, and warmup completes. The divergence appears in the forward pass logits directly.

I'm happy to bisect the range and try additional reproductions if useful. Also happy to share the fine-tuned GGUFs privately if helpful.
