fix: Add duplicate response loop breaker to prevent infinite loops #1265

Closed
gdeyoung wants to merge 1 commit into agent0ai:main from gdeyoung:fix/duplicate-loop-breaker
@gdeyoung
Contributor

Summary

This PR adds a duplicate response loop breaker to prevent the agent from getting stuck in infinite loops when receiving "You have sent the same message again" errors from external APIs.

Problem

The agent can get stuck in an infinite loop when:

  1. External LLM API (e.g., ZAI/GLM-5, Ollama) rejects a message as a duplicate
  2. Agent receives "You have sent the same message again" error
  3. Agent retries with the exact same message
  4. Loop continues indefinitely

Solution

  • Add duplicate_retries counter to track consecutive duplicate responses
  • Break loop after 3 consecutive identical responses with HandledException
  • Log clear error message when loop is broken
  • Reset duplicate_retries on successful iteration
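
The bullets above can be sketched roughly as follows. This is a minimal illustration, not the actual diff to agent.py: `DuplicateLoopBreaker`, its `check` method, and the local `HandledException` class are hypothetical stand-ins, and the sketch assumes the counter counts repeats of the previous response, so the exception is raised on the third consecutive repeat.

```python
import logging

logger = logging.getLogger("loop_breaker")


class HandledException(Exception):
    """Stand-in for the project's HandledException (assumption: defined elsewhere)."""


class DuplicateLoopBreaker:
    """Hypothetical helper that tracks consecutive duplicate responses."""

    def __init__(self, max_duplicates: int = 3):
        self.max_duplicates = max_duplicates  # threshold named in the PR description
        self.duplicate_retries = 0
        self._last_response = None

    def check(self, response: str) -> None:
        """Record a response; raise once it has repeated max_duplicates times in a row."""
        if response == self._last_response:
            self.duplicate_retries += 1
        else:
            # A different response counts as a successful iteration: reset the counter.
            self.duplicate_retries = 0
            self._last_response = response

        if self.duplicate_retries >= self.max_duplicates:
            message = (
                f"Loop breaker: identical response repeated "
                f"{self.duplicate_retries} times in a row; stopping."
            )
            logger.error(message)
            raise HandledException(message)
```

In a message loop, `check` would be called once per model response; any response that differs from the previous one resets the counter, so only an unbroken run of repeats trips the breaker.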

Changes

  • agent.py: Added duplicate_retries counter and loop breaker logic (13 lines added)

Testing

Tested by triggering duplicate response scenarios - agent now breaks out after 3 attempts with clear error message.

Related Issues

Fixes #1056
Fixes #1000
Related to #1187, #1011

Notes

This is a core code fix that cannot be solved through agent behavior changes alone, as the LLM is not in a coherent state during a loop to recognize and break the pattern.

- Add duplicate_retries counter to track consecutive duplicate responses
- Break loop after 3 consecutive identical responses with HandledException
- Log error message when loop is broken
- Reset duplicate_retries on successful iteration

Fixes agent0ai#1056, agent0ai#1000 - Prevents agent from getting stuck in infinite loop
when receiving 'You have sent the same message again' from external APIs
gdeyoung pushed a commit to gdeyoung/agent-zero that referenced this pull request Mar 15, 2026
- Add duplicate_retries counter to track consecutive duplicate responses
- Pass retry_count to fw.msg_repeat.md for context
- Enhanced warning message with specific guidance on breaking loops
- Provides 4 concrete alternatives when stuck in a loop
- Reset duplicate_retries on successful iteration

This addresses the ROOT CAUSE by giving the LLM:
1. Context (retry count) so it knows it's in a loop
2. Specific alternatives instead of generic 'do something else'
3. Self-correction capability before circuit breaker kicks in

Works in conjunction with PR agent0ai#1265 (circuit breaker) for defense-in-depth.

Related to agent0ai#1056, agent0ai#1000, agent0ai#1187, agent0ai#1011
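
As a rough sketch of the "pass retry_count for context" idea in the commit message above: the template text below is a hypothetical stand-in, since the real wording of fw.msg_repeat.md and its four alternatives are not shown in this thread, and `render_repeat_warning` is an illustrative name rather than a function from the codebase.

```python
from string import Template

# Hypothetical stand-in for the contents of fw.msg_repeat.md (assumption).
MSG_REPEAT_TEMPLATE = Template(
    "You have sent the same message again (repeat #$retry_count). "
    "You have to do something else!"
)


def render_repeat_warning(retry_count: int) -> str:
    """Fill the retry count into the warning so the model can see it is looping."""
    return MSG_REPEAT_TEMPLATE.substitute(retry_count=retry_count)


if __name__ == "__main__":
    print(render_repeat_warning(2))
```

The point of the design is that the warning changes between attempts, which gives the model a signal to self-correct before a hard limit such as the circuit breaker in this PR is reached.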
@gdeyoung
Contributor Author

🚨 Report: Increased Looping After Today's Update

After today's platform update, I'm experiencing significantly MORE looping issues where patterns get stuck, and it's happening EARLIER in chats.

Evidence: Chat Session Restarts Today (March 25)

Time   Session    Notes
05:03  Session 1  -
08:01  Session 2  -
14:28  Session 3  -
14:32  Session 4  ⚠️ 4 min gap - restart
14:49  Session 5  ⚠️ 17 min gap - restart
15:49  Session 6  -
16:08  Session 7  -
16:39  Session 8  -
16:42  Session 9  ⚠️ 3 min gap - restart
18:05  Current    -

Key Finding: 3 restarts within 21 minutes (14:28 → 14:49) indicates severe looping/hang issues!
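
The gaps in the table can be recomputed from the session start times with a short script. This is only a sketch: the 18-minute cutoff is an assumption chosen so that the flagged rows match the table, and the function names are illustrative.

```python
from datetime import datetime

# Session start times copied from the table above (same day, 24-hour clock).
SESSION_STARTS = [
    "05:03", "08:01", "14:28", "14:32", "14:49",
    "15:49", "16:08", "16:39", "16:42", "18:05",
]

SHORT_GAP_MINUTES = 18  # assumption: cutoff picked to reproduce the flagged rows


def gaps_in_minutes(starts):
    """Return the gap in whole minutes between each pair of consecutive start times."""
    times = [datetime.strptime(s, "%H:%M") for s in starts]
    return [int((b - a).total_seconds() // 60) for a, b in zip(times, times[1:])]


def short_gaps(starts, cutoff=SHORT_GAP_MINUTES):
    """Return (start_time, gap) pairs for sessions that began within `cutoff` minutes of the previous one."""
    gaps = gaps_in_minutes(starts)
    return [(start, gap) for start, gap in zip(starts[1:], gaps) if gap < cutoff]


if __name__ == "__main__":
    for start, gap in short_gaps(SESSION_STARTS):
        print(f"{start}: {gap} min after previous session")
```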

Additional Warning on Startup

Seeing this in Docker logs on startup:
/opt/venv-a0/lib/python3.12/site-packages/requests/__init__.py:113: RequestsDependencyWarning:
urllib3 (2.6.3) or chardet (7.3.0)/charset_normalizer (3.4.6) doesn't match a supported version!

Manifestation

The loops manifest as:

  • Agent getting stuck repeating similar actions
  • Patterns that don't break naturally
  • Requiring manual restart to recover

Request

This reinforces the NEED for the circuit breaker fix in this PR. Please prioritize review - this is actively impacting production use.


Related: #1266 (Enhanced duplicate response guidance)

@anglerfish27

I feel the need to jump in here and post. Today is March 27th. I was using an older version of A0 (.9x) that I had downloaded about a month ago, give or take, to test A0 out as a POC for my job. Sadly I'm no AI expert; you can barely call me a beginner! I'm just a systems admin! So if you ask me deep technical AI questions, or if how I say something sounds weird and causes an eye roll, sorry. I don't know what I don't know! (yet!)

Working with our enterprise co-pilot AI at work, I was able to wire up that version of A0 on my (personal) Mac laptop, which had Docker Desktop and the official latest (at the time) image on it. I finally got my backend for A0 in the mail, a DGX Spark 10 "super computer" (their words, not mine). I installed LM Studio on it (CLI version only, since I run the Spark headless) and Ollama (again CLI only). I had a few models downloaded to LM Studio: I was using qwen/qwen3-30b-A3b-2507 for the "chat" portion, nvidia/nemotron-3-nano for the utility portion, and qwen3-8B for web. For embedding I was using Ollama as the backend with a small text embedding model; the name escapes me, but it's irrelevant.

Since things were running in Docker on the Spark and on my laptop, networking was a mess (for me anyways; I've never used Docker, so there's that). Co-pilot forged ahead and got it all connected and working by using an open, running terminal window that made the system behave in a bridged mode (oh yeah, I forgot, I suck at networking too). After fussing around we got it all working, and it was rather amazing to see the power of this come alive. I mean, I was throwing all sorts of hard things at it and giving it documents to reference, and it did an outstanding job at all of it: no running out of context, no hallucinations, it was spot on. I was blown away.

Enter today. I wanted to move off this bridged mode connection because that's not how we would run it in the enterprise. That's when things fell apart. I spent hours fighting with it and co-pilot to try and figure it out; we would get some parts working and others wouldn't. After I realized we were so wildly out of date, I decided to ditch the old version and go for the latest as of today, 1.3.

Working with co-pilot we began wiring it up again and hit issue after issue, mostly around the embedding model with Ollama. Co-pilot determined it was in our interest to ditch Ollama and go with OpenAI. Being ignorant I said sure, I mean at this point it's broke so...

Well, that turned into a mess, and after deleting the container and adding some environment variables to the newly deployed container (is there really no way to add env vars after the container is created?) we were able to fix the wiring, and embedding was happy. It was an API key issue: A0 wanted a key even though we didn't need or use one previously on the Spark side, but that was with Ollama, not OpenAI, for the embedding model.

API keys were no longer showing up as an issue. We made sure my Mac could talk to all the endpoints and that models were loaded by lms and openai (that bit us for a bit, grr). We got to the point where all curl tests worked for both the embedding and chat models, both from my Mac laptop (where the Docker A0 is) and locally on the Spark.

Thinking we were good, I restarted the container for the last time, and that's when we hit the new and current blocker... the infinite loop of:
You have sent the same message again. You have to do something else!

CP had me try different browsers and incognito mode. No joy. It just kept on looping this error. CP had me delete the container and add even more env vars to try and disable the loop (and also had me try disabling streaming), thinking that might be causing issues with chat.

Yeah nope. Still stuck in this loop.

Everything is running local off the spark (no cloud services).

I went rogue (on my own) and tried messing with the API endpoints: with http:// and without it, adding /v1 or removing it, adding v1/models or not, etc. Basically all the different variations I could think of. Well, no, that broke things worse, so I guess that shows we are correctly configured for the endpoints. I hope?

CP finally gave up. After I posted logs of the failure (I would type a message and then be stuck in this loop), that was the final straw; I think it ran out of things to try and decided it's a bug in the code, and that it can't be overridden with env vars. Again, I have no idea if this is true. I'm relying on CP to guide me. Yes, I know this sucks because it can be and is wrong sometimes, many times. But it did get it working perfectly with this bridged network, so it knows enough to figure it out and fix the wiring issues we had. Now it's recommending I downgrade to version 1.2, as that allegedly doesn't have the looping issue or mechanism...

If someone wants me to test something out, and has the patience to guide me some, I will be more than glad to test away. I know that having everything local may be a curse or a blessing. But the point is there's no penalty for me to test a million times over: no credits to use, no cost. I have a $4K machine that handles all the models locally. We can try other inference engines if we want. I've got nothing to lose since 1.3 is roasted right now, unless someone knows how to fix it! So while adding my name to the hat of people saying 'yep, it's not fixed', I'm also saying if you want to use me as a test bed, I'm willing to be one. Just gotta understand this stuff is crazy new to me. I fully understand that all the different configurations I've been through probably make me a high target for "ah, it's just misconfigured on his end", which is valid and very well may be true! Since I was using AI to help me, I can't say with any certainty that I'm not the problem. But I was glad to see that others were posting this issue, and as recently as 2 days ago too, giving me hope there is a bug in there. Hoping the A0 team can figure this out for good. A0 looks so awesome, and when it was working on the old version, man, I was so impressed. I could feed it information or tell it specific websites to ingest and make it an expert in an instant. Which is our goal: to have a reliable system admin in a box, so to speak. With lots of guardrails, of course.

Nafania added a commit to Nafania/agent-zero that referenced this pull request Mar 31, 2026
- #3: duplicate response loop breaker (breaks after 3 identical responses)
- #4: dynamic output truncation threshold based on context window size
- #2: resolve §§secret() / $$secret() placeholders in MCP server env/args/url/headers
- #19: scheduler update_task tool method + prompt documentation

Already applied (verified, skipping): #22 parallel MCP init, agent0ai#62 context window optimization

Upstream: PR agent0ai#1265, PR agent0ai#857, PR agent0ai#1150, PR agent0ai#1105
Made-with: Cursor
@gdeyoung gdeyoung closed this Apr 6, 2026
@gdeyoung gdeyoung deleted the fix/duplicate-loop-breaker branch April 6, 2026 02:18