sampling: add segment-level repetition loop detection #22007

Frank-Schruefer wants to merge 1 commit into ggml-org:master
Conversation
Ports the repeat-line loop detection from ollama/ollama#15212 to llama.cpp. Thinking models (and other models at low temperature) can enter infinite repetition loops where the same sentence or paragraph repeats endlessly. The existing `repeat_penalty` operates on token-level n-grams and cannot reliably break multi-token phrase repetition. This adds a new sampler, `llama_sampler_init_repeat_line()`, that detects segment-level loops and temporarily boosts the temperature when a repeated segment is detected, nudging the model out of the loop.

Four new parameters (all exposed via the server API):

- `repeat_line_window`: number of past segments to track (0 = disabled)
- `repeat_line_min_length`: minimum segment length to consider (default 20)
- `repeat_line_delimiters`: characters that end a segment (default `"\n.!?:"`)
- `repeat_line_temp_boost`: temperature boost on loop detection (default 0.5)
Hi @Frank-Schruefer, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
AI was used as an assistant throughout (code lookup, boilerplate, review). The design, parameter choices, and the underlying algorithm are based on our existing Ollama implementation, which has been in production use. I reviewed all generated code; end-to-end testing of the loop detection itself is still pending.
Why is it impractical? The existing penalty samplers (
This sounds very similar to the existing DRY sampler (#9702), which supports sequence breakers; that is, arbitrary strings that break up sequences of repetition. IMO this new sampler seems redundant given that DRY exists and is already supported.
Please also be aware of the AI Usage Policy, part of CONTRIBUTING.md, which reads:
It sounds to me like the better fix is to increase the temperature. No model is trained for temperature 0.0, if I understand correctly. And if the model still repeats itself at higher temp, that's a problem with the model itself. I don't think any LLM runtime out there will accept a hacky fix like this one.
For what it's worth, I've got endless repetition loops in thinking with Gemma 4 26B (Q4) at T=1.0 with all the latest fixes. It happens far less often now, but it still happens. An example from a couple of days ago ran endlessly until I noticed and stopped it at over 28,000 repeating tokens.
If there are existing settings that might work well with Gemma 4 without seriously hampering its reasoning ability, I'd sincerely like to know how to use them. Could you give an example? My own attempts at changing the DRY and repetition penalty settings either didn't stop the repetition or seriously destroyed the output (e.g. it could no longer tool call because it would be unable to correctly reproduce the current path). I really like the idea of detecting a very obvious (to the user) repeating pattern and nudging the model out of that loop, while otherwise not affecting the model's output.
Ah, thanks. I'm currently using the reasoning budget as well. It does a similar job, but as I mentioned in the issue thread, my real problem is the loop itself rather than the verbosity. It also kicks the model out of reasoning entirely. It would be nice if there were a way to target the repetition and nudge the model out of that loop, rather than hard-limit the reasoning amount.
@strawberrymelonpanda as @ddh0 already mentioned here, there's the oft-forgotten DRY sampler :)
I mean, unless I'm misreading something, "DRY" literally stands for "Do not Repeat Yourself".
Sure, and I've tried it, but the result I got wasn't good. In my experience, it absolutely destroys Gemma 4's ability to successfully do agentic tasks with tool calls, just from setting `--dry-multiplier` to 0.05. I don't claim to understand its internals, but what counts as "repeating yourself" to an LLM? Code and paths that should match exactly seem purposely not to. But like I said, if someone knows better settings, I'm very interested. There are at least 5 knobs to tweak here.
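For context, the DRY knobs the thread is alluding to are llama.cpp server flags. A hedged starting point, with illustrative values rather than a tested recommendation, could look like:

```sh
# Illustrative DRY settings for llama-server; tune per model.
# --dry-multiplier enables DRY, --dry-allowed-length is how many tokens
# may repeat before the penalty kicks in, and --dry-sequence-breaker adds
# strings that reset matching (which can help exempt structured text).
./llama-server -m model.gguf \
  --dry-multiplier 0.8 \
  --dry-base 1.75 \
  --dry-allowed-length 4 \
  --dry-penalty-last-n -1 \
  --dry-sequence-breaker "\n"
```

A higher `--dry-allowed-length` is one way to keep short exact repeats (paths, identifiers) unpenalized while still catching long repeated phrases.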

Problem
Thinking models (and other models at low temperature) can enter infinite repetition loops where the same sentence or paragraph repeats endlessly. The existing `repeat_penalty` operates on token-level n-grams and cannot reliably break multi-token phrase repetition: the penalty window would need to cover the entire repeated phrase to be effective, which is impractical.

Solution
Adds a new sampler, `llama_sampler_init_repeat_line()`, that tracks completed segments (delimited by configurable characters) and temporarily boosts the temperature when a repeated segment is detected, nudging the model out of the loop.

This is a port of ollama/ollama#15212.
New parameters
- `repeat_line_window`: number of past segments to track (0 = disabled)
- `repeat_line_min_length`: minimum segment length to consider (default 20)
- `repeat_line_delimiters`: characters that end a segment (default `"\n.!?:"`)
- `repeat_line_temp_boost`: temperature boost on loop detection (default 0.5)

All parameters are exposed via the server API (`/v1/chat/completions` and `/completion`).

Caveats

The feature works well in our Ollama deployment. The llama.cpp port is fresh; production testing in this context is still pending.
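Assuming the server request fields mirror the parameter names above (not verified against the PR's code), a `/completion` request enabling the sampler might look like:

```json
{
  "prompt": "Solve the puzzle step by step.",
  "temperature": 0.2,
  "repeat_line_window": 8,
  "repeat_line_min_length": 20,
  "repeat_line_delimiters": "\n.!?:",
  "repeat_line_temp_boost": 0.5
}
```

With `repeat_line_window` at its disabled value of 0, the sampler should be a no-op.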