Both gemma-4-31B-it (Dense) and gemma-4-26B-A4B-it (MoE) exhibit token-level repetition collapse during long text generation. A word or token fragment doubles, then collapses into a single repeated token that fills the remaining generation budget. This occurs most reliably when combined with grammar-constrained (structured JSON) output, but the underlying tendency toward word doubling is visible even in unconstrained generation.
This may be related to #610, which reports repetition loops in the 26B MoE model during list generation. Our findings extend the scope to the 31B Dense variant and systematically isolate the trigger conditions.
Evidence
Token doubling in unconstrained output
Even without any grammar constraint, successful outputs from both models show word doubling:
- "the act of observing a a peaceful environment"
- "the the waves", "sapphire sapphire"
This suggests the model's logit distribution has an inherent tendency to degenerate toward repeating the previous token, which does not self-correct reliably during sustained generation.
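The doubling precursor is easy to flag mechanically. The helper below is not part of the original test scripts; it is a minimal sketch of the check we describe, scanning a generation for words that repeat in immediate succession:

```python
import re

def doubled_words(text: str) -> list[str]:
    """Return words that appear twice in immediate succession."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    return [w for w, nxt in zip(words, words[1:]) if w == nxt]

sample = "the act of observing a a peaceful environment near the the waves"
print(doubled_words(sample))  # → ['a', 'the']
```

Running this over successful (non-collapsed) outputs is how the precursor shows up even when generation completes normally.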
Repetition collapse with grammar-constrained output
When using structured JSON output (Ollama's format= parameter, which uses llama.cpp's grammar sampling), the word doubling escalates into a full collapse:
1. Normal generation starts fine inside a JSON string value
2. A word doubles: "contemplative contemplative"
3. The doubling collapses into a single repeated token: "own own own own own..."
4. The loop fills the remaining generation budget (thousands of tokens)
The grammar constraint cannot prevent this because repeated words are valid JSON string content.
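Since the grammar cannot forbid this, the only practical guard is downstream detection. The function below is our own sketch (the run-length threshold is an arbitrary assumption, not from any llama.cpp or Ollama API) of how a caller could spot a collapsed generation:

```python
def longest_run(text: str) -> tuple[str, int]:
    """Return the word with the longest immediate-repetition run and its length."""
    words = text.split()
    best_word, best, run = "", 0, 1
    for prev, cur in zip(words, words[1:]):
        run = run + 1 if cur == prev else 1
        if run > best:
            best_word, best = cur, run
    return best_word, best

# Simulated collapsed output, modeled on the failures described above
collapsed = '{"analysis": "a calm, ' + "own " * 50 + '..."}'
word, n = longest_run(collapsed)
print(word, n)  # → own 50
```

A run length far beyond anything natural language produces (say, above 10) is a reliable signal to abort and resample.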
Test results (Ollama, 10 seeds per test)

| Test | gemma4:26b (MoE): Repetition / Valid JSON | gemma4:31b (Dense): Repetition / Valid JSON |
|---|---|---|
| Short output + JSON schema | 4/10 / 1/10 | 7/10 / 1/10 |
| 1000+ words + JSON schema | 5/10 / 1/10 | 10/10 / 0/10 |
| Complex nested schema | 4/10 / 0/10 | 7/8 / 1/8 |
| 6 free-text fields | 4/10 / 0/10 | 3/10 / 7/10 |
| No JSON schema (free generation) | 0/10 / n/a | 0/10 / n/a |
Key observations:

- Both architectures are affected: Dense (31B) has higher repetition rates, but MoE (26B) is equally broken for JSON validity
- Without grammar constraints, no repetition loops occur, but the word-doubling precursor is still present
- gemma3:27b is clean (0/10 repetition, 10/10 valid JSON on the same tests), so this is a gemma4-generation regression
- repeat_penalty has no effect: tested at 1.0, 1.15, and 1.5, identical seeds fail identically at all values
- Repeated tokens differ between architectures: 31B produces clean English words ("own", "beach", "same"), while 26B produces more exotic fragments ("$\text{}$", "visually-cent,", "sing_er,")
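The repeat_penalty result is consistent with how the classic llama.cpp repetition penalty works: positive logits of recently seen tokens are divided by the penalty (negative ones multiplied). If the degenerate state puts the repeated token's logit far above every alternative, even a 1.5 penalty cannot flip the argmax. The sketch below uses made-up logit values to illustrate that arithmetic; it is not instrumented from the model:

```python
def apply_repeat_penalty(logits: dict, penalized_ids: list, penalty: float) -> dict:
    # llama.cpp-style penalty: divide positive logits by the penalty,
    # multiply negative logits by it, for tokens in the recent window
    out = dict(logits)
    for tid in penalized_ids:
        x = out[tid]
        out[tid] = x / penalty if x > 0 else x * penalty
    return out

# Hypothetical degenerate state: the repeated token dominates by a wide margin
logits = {"own": 24.0, "beach": 9.0, "the": 8.5}
for penalty in (1.0, 1.15, 1.5):
    adjusted = apply_repeat_penalty(logits, ["own"], penalty)
    winner = max(adjusted, key=adjusted.get)
    print(penalty, winner, round(adjusted["own"], 2))  # winner stays "own" at every value
```

This would explain why identical seeds fail identically across penalty settings: the penalty rescales the winning logit but never changes which token wins.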
Minimal reproduction
Requires Ollama with gemma4:31b or gemma4:26b. Text-only, no images needed.
```python
import ollama

SCHEMA = {
    "type": "object",
    "required": ["description", "analysis", "tags"],
    "properties": {
        "description": {
            "type": "string",
            "description": "At least 3 detailed sentences.",
        },
        "analysis": {
            "type": "string",
            "description": "Several paragraphs of analysis.",
        },
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}

response = ollama.chat(
    model="gemma4:31b",  # also reproduces with gemma4:26b
    messages=[{"role": "user", "content": (
        "Describe a beach scene at sunset in detail. "
        "Write at least 3 full sentences for description "
        "and several paragraphs for analysis."
    )}],
    format=SCHEMA,
    options={
        "num_ctx": 32768,
        "num_predict": 8192,
        "repeat_penalty": 1.15,
        "repeat_last_n": 256,
        "seed": 0,
    },
)
print(response.message.content[-200:])
```
The test methodology was designed with assistance from Claude Code (Anthropic). All tests were run locally. Results are deterministic and independently reproducible.
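The per-seed tallies in the table above can be reproduced by running the script across seeds 0-9 and classifying each response. The classifier below is our own sketch (the run-length threshold of 20 is an assumption, not from the original scripts); each seed's `response.message.content` is fed through it:

```python
import json

def classify(content: str) -> tuple[bool, bool]:
    """Return (repetition_collapse, valid_json) for one generation.

    Crude heuristic: more than 20 identical consecutive
    whitespace-separated tokens counts as a collapse.
    """
    words = content.split()
    run, collapse = 1, False
    for prev, cur in zip(words, words[1:]):
        run = run + 1 if cur == prev else 1
        if run > 20:
            collapse = True
            break
    try:
        json.loads(content)
        valid = True
    except json.JSONDecodeError:
        valid = False
    return collapse, valid

good = json.dumps({"description": "Golden light fades over the water."})
bad = '{"analysis": "own ' + "own " * 40  # truncated, collapsed output
print(classify(good))  # → (False, True)
print(classify(bad))   # → (True, False)
```

Summing the two flags over 10 seeds gives the "Repetition / Valid JSON" counts reported per test.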
Related issues

<unused24> tokens from GEMV buffer overlap (distinct root cause, fixed in b8702)
Tested on 2026-04-11. Full test data and scripts: ollama/ollama#15502.