Both gemma-4-31B-it (Dense) and gemma-4-26B-A4B-it (MoE) exhibit token-level repetition collapse during long text generation. A word or token fragment doubles, then collapses into a single repeated token that fills the remaining generation budget. This occurs most reliably when combined with grammar-constrained (structured JSON) output, but the underlying tendency toward word doubling is visible even in unconstrained generation.
This may be related to #610, which reports repetition loops in the 26B MoE model during list generation. Our findings extend the scope to the 31B Dense variant and systematically isolate the trigger conditions.
Evidence
Token doubling in unconstrained output
Even without any grammar constraint, successful outputs from both models show word doubling:
- "the act of observing a a peaceful environment"
- "the the waves", "sapphire sapphire"
This suggests the model's logit distribution has an inherent tendency to degenerate toward repeating the previous token, which does not self-correct reliably during sustained generation.
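The doubling precursor is easy to flag mechanically. The helper below is not part of the original test scripts; it is a minimal sketch of the check we describe, scanning a generation for words that repeat in immediate succession:

```python
import re

def doubled_words(text: str) -> list[str]:
    """Return words that appear twice in immediate succession."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    return [w for w, nxt in zip(words, words[1:]) if w == nxt]

sample = "the act of observing a a peaceful environment near the the waves"
print(doubled_words(sample))  # → ['a', 'the']
```

Running this over successful (non-collapsed) outputs is how the precursor shows up even when generation completes normally.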
Repetition collapse with grammar-constrained output
When using structured JSON output (Ollama's format= parameter, which uses llama.cpp's grammar sampling), the word doubling escalates into a full collapse:
1. Normal generation starts fine inside a JSON string value
2. A word doubles: "contemplative contemplative"
3. The doubling collapses into a single repeated token: "own own own own own..."
4. The loop fills the remaining generation budget (thousands of tokens)
The grammar constraint cannot prevent this because repeated words are valid JSON string content.
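Since the grammar cannot forbid this, the only practical guard is downstream detection. The function below is our own sketch (the run-length threshold is an arbitrary assumption, not from any llama.cpp or Ollama API) of how a caller could spot a collapsed generation:

```python
def longest_run(text: str) -> tuple[str, int]:
    """Return the word with the longest immediate-repetition run and its length."""
    words = text.split()
    best_word, best, run = "", 0, 1
    for prev, cur in zip(words, words[1:]):
        run = run + 1 if cur == prev else 1
        if run > best:
            best_word, best = cur, run
    return best_word, best

# Simulated collapsed output, modeled on the failures described above
collapsed = '{"analysis": "a calm, ' + "own " * 50 + '..."}'
word, n = longest_run(collapsed)
print(word, n)  # → own 50
```

A run length far beyond anything natural language produces (say, above 10) is a reliable signal to abort and resample.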
Test results (Ollama, 10 seeds per test)

| Test | gemma4:26b (MoE): Repetition / Valid JSON | gemma4:31b (Dense): Repetition / Valid JSON |
|---|---|---|
| Short output + JSON schema | 4/10 / 1/10 | 7/10 / 1/10 |
| 1000+ words + JSON schema | 5/10 / 1/10 | 10/10 / 0/10 |
| Complex nested schema | 4/10 / 0/10 | 7/8 / 1/8 |
| 6 free-text fields | 4/10 / 0/10 | 3/10 / 7/10 |
| No JSON schema (free generation) | 0/10 / n/a | 0/10 / n/a |
Key observations:

- Both architectures are affected: Dense (31B) has higher repetition rates, but MoE (26B) is equally broken for JSON validity
- Without grammar constraints, no repetition loops occur, but the word-doubling precursor is still present
- gemma3:27b is clean (0/10 repetition, 10/10 valid JSON on the same tests), so this is a gemma4-generation regression
- repeat_penalty has no effect: tested at 1.0, 1.15, and 1.5, identical seeds fail identically at all values
- Repeated tokens differ between architectures: 31B produces clean English words ("own", "beach", "same"), while 26B produces more exotic fragments ("$\text{}$", "visually-cent,", "sing_er,")
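The repeat_penalty result is consistent with how the classic llama.cpp repetition penalty works: positive logits of recently seen tokens are divided by the penalty (negative ones multiplied). If the degenerate state puts the repeated token's logit far above every alternative, even a 1.5 penalty cannot flip the argmax. The sketch below uses made-up logit values to illustrate that arithmetic; it is not instrumented from the model:

```python
def apply_repeat_penalty(logits: dict, penalized_ids: list, penalty: float) -> dict:
    # llama.cpp-style penalty: divide positive logits by the penalty,
    # multiply negative logits by it, for tokens in the recent window
    out = dict(logits)
    for tid in penalized_ids:
        x = out[tid]
        out[tid] = x / penalty if x > 0 else x * penalty
    return out

# Hypothetical degenerate state: the repeated token dominates by a wide margin
logits = {"own": 24.0, "beach": 9.0, "the": 8.5}
for penalty in (1.0, 1.15, 1.5):
    adjusted = apply_repeat_penalty(logits, ["own"], penalty)
    winner = max(adjusted, key=adjusted.get)
    print(penalty, winner, round(adjusted["own"], 2))  # winner stays "own" at every value
```

This would explain why identical seeds fail identically across penalty settings: the penalty rescales the winning logit but never changes which token wins.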
Minimal reproduction
Requires Ollama with gemma4:31b or gemma4:26b. Text-only, no images needed.
```python
import ollama

SCHEMA = {
    "type": "object",
    "required": ["description", "analysis", "tags"],
    "properties": {
        "description": {
            "type": "string",
            "description": "At least 3 detailed sentences.",
        },
        "analysis": {
            "type": "string",
            "description": "Several paragraphs of analysis.",
        },
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}

response = ollama.chat(
    model="gemma4:31b",  # also reproduces with gemma4:26b
    messages=[{"role": "user", "content": (
        "Describe a beach scene at sunset in detail. "
        "Write at least 3 full sentences for description "
        "and several paragraphs for analysis."
    )}],
    format=SCHEMA,
    options={
        "num_ctx": 32768,
        "num_predict": 8192,
        "repeat_penalty": 1.15,
        "repeat_last_n": 256,
        "seed": 0,
    },
)
print(response.message.content[-200:])
```
The test methodology was designed with assistance from Claude Code (Anthropic). All tests were run locally. Results are deterministic and independently reproducible.
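The per-seed tallies in the table above can be reproduced by running the script across seeds 0-9 and classifying each response. The classifier below is our own sketch (the run-length threshold of 20 is an assumption, not from the original scripts); each seed's `response.message.content` is fed through it:

```python
import json

def classify(content: str) -> tuple[bool, bool]:
    """Return (repetition_collapse, valid_json) for one generation.

    Crude heuristic: more than 20 identical consecutive
    whitespace-separated tokens counts as a collapse.
    """
    words = content.split()
    run, collapse = 1, False
    for prev, cur in zip(words, words[1:]):
        run = run + 1 if cur == prev else 1
        if run > 20:
            collapse = True
            break
    try:
        json.loads(content)
        valid = True
    except json.JSONDecodeError:
        valid = False
    return collapse, valid

good = json.dumps({"description": "Golden light fades over the water."})
bad = '{"analysis": "own ' + "own " * 40  # truncated, collapsed output
print(classify(good))  # → (False, True)
print(classify(bad))   # → (True, False)
```

Summing the two flags over 10 seeds gives the "Repetition / Valid JSON" counts reported per test.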
Related issues

<unused24> tokens from GEMV buffer overlap (distinct root cause, fixed in b8702)
Tested on 2026-04-11. Full test data and scripts: ollama/ollama#15502.