
sampling: add segment-level repetition loop detection #22007

Open

Frank-Schruefer wants to merge 1 commit into ggml-org:master from Frank-Schruefer:feature/repeat-line-loop-detection

Conversation

@Frank-Schruefer

Problem

Thinking models (and other models at low temperature) can enter infinite repetition loops where the same sentence or paragraph repeats endlessly. The existing repeat_penalty operates on token-level n-grams and cannot reliably break multi-token phrase repetition — the penalty window would need to cover the entire repeated phrase to be effective, which is impractical.

Solution

Adds a new sampler llama_sampler_init_repeat_line() that tracks completed segments (delimited by configurable characters) and temporarily boosts the temperature when a repeated segment is detected, nudging the model out of the loop.

This is a port of ollama/ollama#15212.
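
For intuition, here is a minimal sketch of the detection idea, with hypothetical names; it is not the PR's actual code. Completed segments are kept in a small ring buffer, and a loop is flagged when a newly completed segment of sufficient length matches a recent one; the caller then raises the temperature by repeat_line_temp_boost.

  #include <deque>
  #include <string>

  // Sketch only: segment tracking behind llama_sampler_init_repeat_line().
  struct repeat_line_state {
      std::deque<std::string> history;           // last `window` completed segments
      std::string             current;           // segment currently being built
      size_t                  window     = 8;    // repeat_line_window
      size_t                  min_length = 20;   // repeat_line_min_length
      std::string             delimiters = "\n.!?:";
  };

  // Feed the detokenized text of each sampled token. Returns true when a
  // just-completed segment matches one already in the history window.
  static bool repeat_line_accept(repeat_line_state & st, const std::string & piece) {
      bool loop_detected = false;
      for (char c : piece) {
          if (st.delimiters.find(c) == std::string::npos) {
              st.current.push_back(c);           // still inside the segment
              continue;
          }
          // a delimiter character closes the current segment
          if (st.current.size() >= st.min_length) {
              for (const auto & prev : st.history) {
                  if (prev == st.current) {
                      loop_detected = true;      // caller applies the temp boost
                      break;
                  }
              }
              st.history.push_back(st.current);
              if (st.history.size() > st.window) {
                  st.history.pop_front();        // keep only the last `window`
              }
          }
          st.current.clear();
      }
      return loop_detected;
  }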

New parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| repeat_line_window | int | 0 (disabled) | Number of past segments to track |
| repeat_line_min_length | int | 20 | Minimum segment length (avoids false positives from short phrases) |
| repeat_line_delimiters | string | "\n.!?:" | Characters that end a segment |
| repeat_line_temp_boost | float | 0.5 | Temperature boost when a loop is detected |

All parameters are exposed via the server API (/v1/chat/completions and /completion).
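
For example, a hypothetical /completion request enabling the sampler, with the field names proposed above and illustrative values:

  {
    "prompt": "...",
    "temperature": 0.2,
    "repeat_line_window": 8,
    "repeat_line_min_length": 20,
    "repeat_line_delimiters": "\n.!?:",
    "repeat_line_temp_boost": 0.5
  }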

Caveats

The feature works well in our Ollama deployment. The llama.cpp port is fresh — production testing in this context is still pending.

Ports the repeat-line loop detection from ollama/ollama#15212 to llama.cpp.

Thinking models (and other models at low temperature) can enter infinite
repetition loops where the same sentence or paragraph repeats endlessly.
The existing repeat_penalty operates on token-level n-grams and cannot
reliably break multi-token phrase repetition.

This adds a new sampler llama_sampler_init_repeat_line() that detects
segment-level loops and temporarily boosts the temperature when a repeated
segment is detected, nudging the model out of the loop.

Four new parameters (all exposed via the server API):
  repeat_line_window     - number of past segments to track (0 = disabled)
  repeat_line_min_length - minimum segment length to consider (default 20)
  repeat_line_delimiters - characters that end a segment (default "\n.!?:")
  repeat_line_temp_boost - temperature boost on loop detection (default 0.5)
@ggml-gh-bot

ggml-gh-bot Bot commented Apr 16, 2026

Hi @Frank-Schruefer, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@Frank-Schruefer
Author

AI was used as an assistant throughout (code lookup, boilerplate, review). The design, parameter choices, and the underlying algorithm are based on our existing Ollama implementation which has been in production use. I reviewed all generated code; end-to-end testing of the loop detection itself is still pending.

@ddh0
Contributor

ddh0 commented Apr 16, 2026

> The existing repeat_penalty operates on token-level n-grams and cannot reliably break multi-token phrase repetition — the penalty window would need to cover the entire repeated phrase to be effective, which is impractical.

Why is it impractical? The existing penalty samplers (repeat, presence, and frequency) already support arbitrary window sizes. The existing adaptive-p sampler maintains a rolling window of arbitrary size.
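
For example (illustrative values), widening the window is just a matter of:

  --repeat-last-n 2048 --repeat-penalty 1.05

which applies the repeat penalty over the last 2048 tokens.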

> Adds a new sampler llama_sampler_init_repeat_line() that tracks completed segments (delimited by configurable characters) and temporarily boosts the temperature when a repeated segment is detected, nudging the model out of the loop.

This sounds very similar to the existing DRY sampler (#9702), which supports sequence breakers; that is, arbitrary strings that break up sequences of repetition.

IMO this new sampler seems redundant given that DRY exists and is already supported.

@ddh0
Contributor

ddh0 commented Apr 16, 2026

Please also be aware of the AI Usage Policy, part of CONTRIBUTING.md, which reads:

> It is strictly prohibited to use AI to write your posts for you (bug reports, feature requests, pull request descriptions, Github discussions, responding to humans, ...).

@ngxson
Contributor

ngxson commented Apr 16, 2026

> Thinking models (and other models at low temperature) can enter infinite repetition loops where the same sentence or paragraph repeats endlessly.

It sounds to me like the better fix is to increase the temperature. No model is trained for temperature 0.0, if I understand correctly.

And if the model still repeats itself at a higher temperature, that's a problem with the model itself. I don't think any LLM runtime out there would accept a hacky fix like this one.

@Frank-Schruefer
Author

  • repeat, presence, and frequency penalties are token-level and always-on
  • this sampler is reactive; those apply constant output manipulation
  • you want low temperature for thinking and coding

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Apr 17, 2026

> if the model still repeats itself at a higher temperature, that's a problem with the model itself.

For what it's worth, I've hit endless repetition loops during thinking with Gemma 4 26B (Q4) at T=1.0, with all the latest fixes. It happens far less often now, but it still happens. Example from a couple of days ago:

temperature = 1.0; top-p = 0.95; top-k = 64
Wait! I see it!\n
If line 168 prints the `AVG` header, then the output should be:\n
[...]
But the output I got was:\n
[...]
And the only way that can happen is if there is[...]\n
Wait! I see it!\n
[loop]

It repeated endlessly until I noticed and stopped it at over 28,000 repeated tokens.

> Why is it impractical? The existing penalty samplers (repeat, presence, and frequency) already support arbitrary window sizes. The existing adaptive-p sampler maintains a rolling window of arbitrary size.

> This sounds very similar to the existing DRY sampler (#9702), which supports sequence breakers;

If there are existing settings that might work well with Gemma 4 without seriously hampering its reasoning ability, I'd sincerely like to know how to use them. Could you give an example?

My own attempts at tuning the DRY and repetition-penalty settings either didn't stop the repetition or seriously degraded the output (e.g. the model could no longer tool-call because it was unable to reproduce the current path correctly).

I really like the idea of detecting a very obvious (to the user) repeating pattern and nudging the model out of that loop, but otherwise not affecting the model's output.

@ddh0
Contributor

ddh0 commented Apr 17, 2026

> If there are existing settings that might work well with Gemma 4 without seriously hampering its reasoning ability, I'd sincerely like to know how to use them. Could you give an example?

These options work well for me:

[screenshot of the relevant llama-server options]

e.g. --reasoning-budget 4096 to limit CoT to 4096 tokens, and --reasoning-budget-message "...enough. Need to give the final output now." to end CoT gracefully.

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Apr 17, 2026

> e.g. --reasoning-budget 4096 to limit CoT to 4096 tokens, and --reasoning-budget-message "...enough. Need to give the final output now." to end CoT gracefully.

Ah, thanks. I'm currently using the reasoning budget as well. I use

reasoning-budget = 20480
reasoning-budget-message = "... reasoning budget exceeded. We may be in a loop or overcomplicating things. Let's try to answer now."

It does a similar job, but as I mentioned in the issue thread, my real problem is the loop itself rather than the verbosity, and it also kicks the model out of reasoning entirely. It would be nice to have a way to target the repetition and nudge the model out of the loop, rather than hard-limiting the amount of reasoning.

@pwilkin
Member

pwilkin commented Apr 17, 2026

@strawberrymelonpanda as @ddh0 already mentioned here, there's the oft-forgotten DRY sampler :)

@pwilkin
Member

pwilkin commented Apr 17, 2026

I mean, unless I'm misreading something, "DRY" literally stands for "Do not Repeat Yourself".

@strawberrymelonpanda
Contributor

Sure, and I've tried it, but the result I got wasn't good. In my experience, it absolutely destroys Gemma 4's ability to successfully do agentic tasks with tool calls, even when just setting --dry-multiplier to 0.05.

✗ edit failed
Error: Could not find oldString in the file. It must match exactly, including whitespace, indentation, and line endings.
✗ edit failed
Error: Could not find oldString in the file. It must match exactly, including whitespace, indentation, and line endings.
✗ edit failed
Error: Could not find oldString in the file. It must match exactly, including whitespace, indentation, and line endings.
✗ edit failed
Error: Could not find oldString in the file. It must match exactly, including whitespace, indentation, and line endings.

I don't claim to understand the internals, but what counts as "repeating yourself" to an LLM? Code and paths that should match exactly seem to deliberately not match.

But like I said, if someone knows better settings, I'm very interested. There's at least 5 knobs to tweak here.

--dry-multiplier N                      set DRY sampling multiplier (default: 0.00, 0.0 = disabled)
--dry-base N                            set DRY sampling base value (default: 1.75)
--dry-allowed-length N                  set allowed length for DRY sampling (default: 2)
--dry-penalty-last-n N                  set DRY penalty for the last n tokens (default: -1, 0 = disable, -1 = context size)
--dry-sequence-breaker STRING           add sequence breaker for DRY sampling, clearing out default breakers
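
For illustration, a run touching all five knobs might look like this (arbitrary values, not a recommendation):

  llama-server -m model.gguf \
    --dry-multiplier 0.8 \
    --dry-base 1.75 \
    --dry-allowed-length 4 \
    --dry-penalty-last-n -1 \
    --dry-sequence-breaker "\n"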

