sampling: add segment-level repetition loop detection #22007

Frank-Schruefer wants to merge 1 commit into ggml-org:master
Conversation
Ports the repeat-line loop detection from ollama/ollama#15212 to llama.cpp. Thinking models (and other models at low temperature) can enter infinite repetition loops where the same sentence or paragraph repeats endlessly. The existing `repeat_penalty` operates on token-level n-grams and cannot reliably break multi-token phrase repetition. This adds a new sampler, `llama_sampler_init_repeat_line()`, that detects segment-level loops and temporarily boosts the temperature when a repeated segment is detected, nudging the model out of the loop.

Four new parameters (all exposed via the server API):

- `repeat_line_window`: number of past segments to track (0 = disabled)
- `repeat_line_min_length`: minimum segment length to consider (default 20)
- `repeat_line_delimiters`: characters that end a segment (default `"\n.!?:"`)
- `repeat_line_temp_boost`: temperature boost on loop detection (default 0.5)
Hi @Frank-Schruefer, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
AI was used as an assistant throughout (code lookup, boilerplate, review). The design, parameter choices, and the underlying algorithm are based on our existing Ollama implementation, which has been in production use. I reviewed all generated code; end-to-end testing of the loop detection itself is still pending.
Why is it impractical? The existing penalty samplers (
This sounds very similar to the existing DRY sampler (#9702), which supports sequence breakers; that is, arbitrary strings that break up sequences of repetition. IMO this new sampler seems redundant given that DRY exists and is already supported.
Please also be aware of the AI Usage Policy, part of CONTRIBUTING.md, which reads:
It sounds to me like the better fix is to increase the temperature. No model is trained for temperature 0.0, if I understand correctly. And if the model still repeats itself at higher temp, that's a problem with the model itself. I don't think any LLM runtime out there will accept a hacky fix like this one.
For what it's worth, I've got endless repetition loops in thinking with Gemma 4 26B (Q4) at T=1.0 with all the latest fixes. It happens far less often now, but it still happens. An example from a couple of days ago ran endlessly until I noticed and stopped it at over 28,000 repeating tokens.
If there are existing settings that might work well with Gemma 4 without seriously hampering its reasoning ability, I'd sincerely like to know how to use them. Could you give an example? My own attempts at changing the DRY and repetition penalty settings either didn't stop the repetition or seriously destroyed the output (e.g. it could no longer tool call because it would be unable to correctly reproduce the current path). I really like the idea of detecting a very obvious (to the user) repeating pattern and nudging the model out of that loop, while otherwise not affecting the model's output.
Ah, thanks. I'm currently using the reasoning budget as well. It does a similar job, but as I mentioned in the issue thread, my real problem is the loop itself rather than the verbosity. It also kicks the model out of reasoning entirely. It would be nice if there were a way to target the repetition and nudge the model out of that loop, rather than hard-limit the reasoning amount.
@strawberrymelonpanda as @ddh0 already mentioned here, there's the oft-forgotten DRY sampler :)
I mean, unless I'm misreading something, "DRY" literally stands for "Do not Repeat Yourself".
Sure, and I've tried it, but the result I got wasn't good. In my experience, it absolutely destroys Gemma 4's ability to successfully do agentic tasks with tool calls, just from setting `--dry-multiplier` to 0.05. I don't claim to understand its internals, but what counts as "repeating yourself" to an LLM? Code and paths that should match exactly seem purposely not to. But like I said, if someone knows better settings, I'm very interested. There are at least 5 knobs to tweak here.
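For context, the DRY knobs the thread is alluding to are llama.cpp server flags. A hedged starting point, with illustrative values rather than a tested recommendation, could look like:

```sh
# Illustrative DRY settings for llama-server; tune per model.
# --dry-multiplier enables DRY, --dry-allowed-length is how many tokens
# may repeat before the penalty kicks in, and --dry-sequence-breaker adds
# strings that reset matching (which can help exempt structured text).
./llama-server -m model.gguf \
  --dry-multiplier 0.8 \
  --dry-base 1.75 \
  --dry-allowed-length 4 \
  --dry-penalty-last-n -1 \
  --dry-sequence-breaker "\n"
```

A higher `--dry-allowed-length` is one way to keep short exact repeats (paths, identifiers) unpenalized while still catching long repeated phrases.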

Problem
Thinking models (and other models at low temperature) can enter infinite repetition loops where the same sentence or paragraph repeats endlessly. The existing `repeat_penalty` operates on token-level n-grams and cannot reliably break multi-token phrase repetition: the penalty window would need to cover the entire repeated phrase to be effective, which is impractical.

Solution
Adds a new sampler, `llama_sampler_init_repeat_line()`, that tracks completed segments (delimited by configurable characters) and temporarily boosts the temperature when a repeated segment is detected, nudging the model out of the loop.

This is a port of ollama/ollama#15212.
New parameters
- `repeat_line_window`: number of past segments to track (0 = disabled)
- `repeat_line_min_length`: minimum segment length to consider (default 20)
- `repeat_line_delimiters`: characters that end a segment (default `"\n.!?:"`)
- `repeat_line_temp_boost`: temperature boost on loop detection (default 0.5)

All parameters are exposed via the server API (`/v1/chat/completions` and `/completion`).

Caveats

The feature works well in our Ollama deployment. The llama.cpp port is fresh; production testing in this context is still pending.
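Assuming the server request fields mirror the parameter names above (not verified against the PR's code), a `/completion` request enabling the sampler might look like:

```json
{
  "prompt": "Solve the puzzle step by step.",
  "temperature": 0.2,
  "repeat_line_window": 8,
  "repeat_line_min_length": 20,
  "repeat_line_delimiters": "\n.!?:",
  "repeat_line_temp_boost": 0.5
}
```

With `repeat_line_window` at its disabled value of 0, the sampler should be a no-op.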