[BUG] Gemma 4 token repetition collapse during long generation — affects both 31b Dense and 26b MoE #622

@rnh0

Summary

Both gemma-4-31B-it (Dense) and gemma-4-26B-A4B-it (MoE) exhibit token-level repetition collapse during long text generation. A word or token fragment doubles, then collapses into a single repeated token that fills the remaining generation budget. This occurs most reliably when combined with grammar-constrained (structured JSON) output, but the underlying tendency toward word doubling is visible even in unconstrained generation.

This may be related to #610, which reports repetition loops in the 26B MoE model during list generation. Our findings extend the scope to the 31b Dense variant and provide systematic isolation of the trigger conditions.

Evidence

Token doubling in unconstrained output

Even without any grammar constraint, successful outputs from both models show word doubling:

  • "the act of observing a a peaceful environment"
  • "the the waves", "sapphire sapphire"

This suggests that both models' output distributions are biased toward repeating the previous token, and that this bias does not reliably self-correct during sustained generation.
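
One way to scan unconstrained outputs for this precursor is a simple adjacent-word check. The sketch below is illustrative only: `find_doubled_words` is a hypothetical helper, not part of the test scripts, and it works on whole words rather than model tokens, so it is a proxy for token-level doubling. Legitimate doubles in English ("had had", "that that") will also be flagged.

```python
import re


def find_doubled_words(text: str) -> list[tuple[int, str]]:
    """Return (index, word) for every word identical to the word before it.

    Operates on lowercase whole words, not model tokens, so it only
    approximates token-level doubling.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return [
        (i, word)
        for i, (prev, word) in enumerate(zip(words, words[1:]), start=1)
        if prev == word
    ]


# The three fragments quoted above.
samples = [
    "the act of observing a a peaceful environment",
    "the the waves",
    "sapphire sapphire",
]
for sample in samples:
    print(sample, "->", find_doubled_words(sample))
```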

Repetition collapse with grammar-constrained output

When using structured JSON output (Ollama's format= parameter, which uses llama.cpp's grammar sampling), the word doubling escalates into a full collapse:

  1. Generation starts normally inside a JSON string value.
  2. A word doubles: "contemplative contemplative".
  3. The output collapses into a single repeated token: "own own own own own...".
  4. The repeated token fills the remaining generation budget (thousands of tokens).

The grammar constraint cannot prevent this because repeated words are valid JSON string content.
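
The point can be illustrated with a small classifier, sketched here under stated assumptions: `longest_repeat_run` and `classify` are hypothetical helpers, the threshold of 20 repeats is arbitrary, and whitespace-separated words stand in for model tokens. A string value full of repeated words still parses as JSON; the output presumably becomes invalid only when the generation budget runs out before the string and object are closed.

```python
import json
from itertools import groupby


def longest_repeat_run(text: str) -> tuple[int, str]:
    """Return (length, token) of the longest run of identical adjacent tokens."""
    runs = [(len(list(group)), tok) for tok, group in groupby(text.split())]
    return max(runs, default=(0, ""))


def classify(raw: str, min_run: int = 20) -> dict:
    """Flag a repetition run and check whether the raw output parses as JSON."""
    run_length, _ = longest_repeat_run(raw)
    try:
        json.loads(raw)
        valid = True
    except json.JSONDecodeError:
        valid = False
    return {"repetition": run_length >= min_run, "valid_json": valid}


# Synthetic outputs: the same collapse, once closed properly and once cut off.
complete = '{"description": "my ' + "own " * 50 + '"}'
truncated = '{"description": "my ' + "own " * 50

print(classify(complete))   # repeated words are legal string content
print(classify(truncated))  # the cut-off output no longer parses
```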

Test results (Ollama, 10 seeds per test)

| Test | gemma4:26b (MoE) repetition | gemma4:26b (MoE) valid JSON | gemma4:31b (Dense) repetition | gemma4:31b (Dense) valid JSON |
| --- | --- | --- | --- | --- |
| Short output + JSON schema | 4/10 | 1/10 | 7/10 | 1/10 |
| 1000+ words + JSON schema | 5/10 | 1/10 | 10/10 | 0/10 |
| Complex nested schema | 4/10 | 0/10 | 7/8 | 1/8 |
| 6 free-text fields | 4/10 | 0/10 | 3/10 | 7/10 |
| No JSON schema (free generation) | 0/10 | n/a | 0/10 | n/a |

Key observations:

  • Both architectures are affected: Dense (31b) has the higher repetition rate in three of the four schema tests, and MoE (26b) never exceeds 1/10 valid JSON
  • Without grammar constraints, no repetition loops occur — but the word-doubling precursor is still present
  • gemma3:27b is clean (0/10 repetition, 10/10 valid JSON on the same tests) — this is a gemma4-generation regression
  • repeat_penalty has no effect: at 1.0, 1.15, and 1.5, identical seeds fail identically
  • Repeated tokens differ between architectures: 31b produces clean English words ("own", "beach", "same"), 26b produces more exotic fragments ("$\text{}$", "visually-cent,", "sing_er,")
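
A seed-by-penalty sweep of the kind summarized above could be tallied roughly as follows. This is a sketch, not the actual test script: `sweep`, `looks_collapsed`, and `fake_generate` are hypothetical names, the 20-token tail window is an arbitrary threshold, and the stub generator stands in for a real model call so the example runs without Ollama.

```python
import json
from collections import Counter
from typing import Callable, Iterable


def looks_collapsed(raw: str, min_run: int = 20) -> bool:
    """True if the last `min_run` whitespace-separated tokens are identical."""
    tokens = raw.split()
    return len(tokens) >= min_run and len(set(tokens[-min_run:])) == 1


def is_valid_json(raw: str) -> bool:
    try:
        json.loads(raw)
        return True
    except json.JSONDecodeError:
        return False


def sweep(
    generate: Callable[[int, float], str],
    seeds: Iterable[int],
    penalties: Iterable[float],
) -> dict[float, Counter]:
    """Count collapsed and valid-JSON outputs for every repeat_penalty value."""
    seeds = list(seeds)
    results = {}
    for penalty in penalties:
        tally = Counter()
        for seed in seeds:
            raw = generate(seed, penalty)
            tally["repetition"] += looks_collapsed(raw)
            tally["valid_json"] += is_valid_json(raw)
        results[penalty] = tally
    return results


def fake_generate(seed: int, repeat_penalty: float) -> str:
    """Stub model: even seeds collapse regardless of the penalty."""
    if seed % 2 == 0:
        return '{"description": "my ' + "own " * 40
    return '{"description": "A calm beach."}'


results = sweep(fake_generate, range(10), [1.0, 1.15, 1.5])
for penalty, tally in results.items():
    print(penalty, dict(tally))
```

To run this against a real model, `generate` would wrap a call such as the `ollama.chat` request in the Minimal reproduction section, passing `seed` and `repeat_penalty` through `options`.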

Minimal reproduction

Requires Ollama with gemma4:31b or gemma4:26b. Text-only, no images needed.

import ollama

SCHEMA = {
    "type": "object",
    "required": ["description", "analysis", "tags"],
    "properties": {
        "description": {
            "type": "string",
            "description": "At least 3 detailed sentences.",
        },
        "analysis": {
            "type": "string",
            "description": "Several paragraphs of analysis.",
        },
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}

response = ollama.chat(
    model="gemma4:31b",  # also reproduces with gemma4:26b
    messages=[{"role": "user", "content": (
        "Describe a beach scene at sunset in detail. "
        "Write at least 3 full sentences for description "
        "and several paragraphs for analysis."
    )}],
    format=SCHEMA,
    options={
        "num_ctx": 32768,
        "num_predict": 8192,
        "repeat_penalty": 1.15,
        "repeat_last_n": 256,
        "seed": 0,
    },
)
# Show the tail of the output; a collapsed run ends in one repeated token.
print(response.message.content[-200:])
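
Because a collapsed run fills the whole `num_predict` budget, reproduction runs can take a long time. The sketch below shows one possible way to stop reading a streamed response once a collapse is evident. `read_until_collapse` is a hypothetical helper, the window of 50 identical chunks is an arbitrary threshold, and it assumes each streamed chunk corresponds to roughly one token.

```python
from collections import deque
from typing import Iterable


def read_until_collapse(chunks: Iterable[str], window: int = 50) -> tuple[str, bool]:
    """Concatenate streamed chunks, stopping early if the last `window`
    non-empty chunks are all identical. Returns (text, collapsed)."""
    recent = deque(maxlen=window)
    parts = []
    for chunk in chunks:
        parts.append(chunk)
        recent.append(chunk.strip())
        if len(recent) == window and recent[0] and len(set(recent)) == 1:
            return "".join(parts), True
    return "".join(parts), False


# Synthetic stream standing in for a model response.
fake_stream = ["my"] + [" own"] * 60
text, collapsed = read_until_collapse(fake_stream)
print(collapsed, len(text))
```

With the Ollama Python client, a request made with `stream=True` yields parts whose `message.content` holds the new text, so a generator such as `(part.message.content for part in stream)` could be passed in as `chunks`. That wiring is untested here.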

Environment

Related issues

Why this is likely a model-level issue

  1. Both architectures (Dense 31b and MoE 26b) are affected
  2. gemma3:27b on the same inference engine with the same parameters is clean
  3. The word-doubling precursor appears even without grammar constraints
  4. The bug reproduces across multiple backends (Ollama, LM Studio, Cloudflare; see #610)
  5. Grammar-constrained sampling merely exposes and amplifies an underlying model tendency

Tested on 2026-04-11. Full test data and scripts: ollama/ollama#15502.

The test methodology was designed with assistance from Claude Code (Anthropic). All tests were run locally. Results are deterministic and independently reproducible.
