fix: Add duplicate response loop breaker to prevent infinite loops #1265
gdeyoung wants to merge 1 commit into
- Add duplicate_retries counter to track consecutive duplicate responses
- Break loop after 3 consecutive identical responses with HandledException
- Log error message when loop is broken
- Reset duplicate_retries on successful iteration

Fixes agent0ai#1056, agent0ai#1000
- Prevents agent from getting stuck in infinite loop when receiving 'You have sent the same message again' from external APIs
- Add duplicate_retries counter to track consecutive duplicate responses
- Pass retry_count to fw.msg_repeat.md for context
- Enhanced warning message with specific guidance on breaking loops
- Provides 4 concrete alternatives when stuck in a loop
- Reset duplicate_retries on successful iteration

This addresses the ROOT CAUSE by giving the LLM:
1. Context (retry count) so it knows it's in a loop
2. Specific alternatives instead of generic 'do something else'
3. Self-correction capability before circuit breaker kicks in

Works in conjunction with PR agent0ai#1265 (circuit breaker) for defense-in-depth.

Related to agent0ai#1056, agent0ai#1000, agent0ai#1187, agent0ai#1011
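The idea of passing the retry count into the repeat warning can be sketched roughly as follows. This is only an illustration: `build_repeat_warning`, its wording, and the example alternatives are hypothetical and are not taken from fw.msg_repeat.md or from the PR's actual code.

```python
def build_repeat_warning(retry_count, alternatives):
    """Build a repeat warning that tells the model how often it has repeated itself.

    Hypothetical helper: the real prompt lives in fw.msg_repeat.md and its
    wording is not shown in this thread.
    """
    lines = [
        f"You have sent the same message again (consecutive repeat {retry_count}).",
        "Do not resend it. Choose one of these alternatives instead:",
    ]
    # Number the alternatives so the model can pick a concrete next step.
    lines.extend(f"{i}. {alt}" for i, alt in enumerate(alternatives, start=1))
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder alternatives, invented for this example only.
    print(build_repeat_warning(2, ["try a different tool", "ask the user for clarification"]))
```

Including the count lets the warning escalate with each repeat instead of showing the model the same text every time.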
🚨 Report: Increased Looping After Today's Update
After today's platform update, I'm experiencing significantly MORE looping issues where patterns get stuck, and it's happening EARLIER in chats.
Evidence: Chat Session Restarts Today (March 25)
Key Finding: 3 restarts within 21 minutes (14:28 → 14:49) indicates severe looping/hang issues!
Additional Warning on Startup
Seeing this in Docker logs on startup:
Manifestation
The loops manifest as:
Request
This reinforces the NEED for the circuit breaker fix in this PR. Please prioritize review - this is actively impacting production use.
Related: #1266 (Enhanced duplicate response guidance)
I feel the need to jump in here and post. Today is March 27th. I was using an older version of A0 (.9x) that I had downloaded about a month ago, give or take, to test A0 out as a POC for my job. Sadly I'm no AI expert, you can barely call me a beginner! I'm just a systems admin! So if you ask me deep technical AI questions, or if how I say something sounds weird and causes an eye roll, sorry. I don't know what I don't know! (yet!)

Working with our enterprise co-pilot AI at work, I was able to wire up that version of A0 on my (personal) Mac laptop, with Docker Desktop on it and the official latest (at the time) image. I finally got my backend for A0 in the mail, a DGX Spark 10 "super computer" (their words not mine). I got LM Studio (the CLI version only, since I'm using the Spark headless) installed on it, and ollama (again CLI only). I had a few models downloaded to LM Studio: I was using qwen/qwen3-30b-A3b-2507 for the "chat" portion, nvidia/nemotron-3-nano for the utility portion, and qwen3-8B for web. For embedding I was using ollama as the backend with a small text embedding model; the name escapes me, but it's irrelevant.

Since things were running in docker on the Spark and on my laptop, networking was a mess (for me anyways, I've never used docker, so there's that). Co-pilot forged ahead and got it all connected and working by using an open running terminal window that was making the system behave in a bridged mode (oh yeah, I forgot, I suck at networking too). After fussing around we got it all working, and it was rather amazing to see the power of this come alive. I mean, I was throwing all sorts of hard things at it and giving it documents to reference, and it did an outstanding job at all of it: no running out of context, no hallucinations, it was spot on. I was blown away.

Enter today. I wanted to move off this bridged mode connection because that's not how we would run it in the enterprise.
That's when things fell apart. I spent hours fighting with it and co-pilot to try and figure it out; we would get some parts working, others wouldn't. After I realized we were so wildly out of date, I decided to ditch the old version and go for the latest as of today, 1.3. Working with co-pilot we began wiring it up again, issue after issue, mostly around the embedding model with ollama. Co-pilot determined it was in our interest to ditch ollama and go with openai. Being ignorant I said sure, I mean at this point it's broke so...

Well, that turned into a mess, and after deleting the container and adding some environment variables to the newly deployed container (is there really no way to add env vars after the container is created?) we were able to fix the wiring and embedding was happy. It was an API key issue: A0 wanted a key even though we didn't need or use one previously on the Spark side, but that was with ollama, not openai, for the embedding model. API keys were no longer showing up as an issue. We made sure my Mac could talk to all the endpoints, and that models were loaded by lms and openai (that bit us for a bit, grr). We passed the point where all curl tests worked for both embedding and chat models, both on my Mac laptop (where the docker A0 is) and locally on the Spark.

Thinking we were good, I restarted the container for the last time, and that's when we hit the new and current blocker... the infinite loop of:

CP had me try different browsers/incognito. No joy, it just kept on looping this error. CP had me delete the container and add even more env vars to try and disable the loop (and also had me try disabling streaming), thinking that might be causing issues with chat. Yeah, nope. Still stuck in this loop. Everything is running local off the Spark (no cloud services).
I went rogue (on my own) and tried messing with the API endpoints: http:// with it and without it, adding /v1 or removing it, adding v1/models or not, etc... basically all the different variations I could think of. Well, no, that broke things worse. I guess that's showing we are correctly configured for the endpoints. I hope?

CP finally gave up after posting logs of the failure from when I would type a message; being stuck in this loop was the final straw. I think it ran out of things to try and decided it's a bug in the code, and that it can't be overridden with env vars. Again, I have no idea if this is true. I'm relying on CP to guide me. Yes, I know this sucks because it can be and is wrong sometimes, many times. But it did get it working perfectly with this bridged network, so it knows enough to figure it out and fix the wiring issues we had. Now it's recommending I downgrade to version 1.2, as that allegedly doesn't have the looping issue or mechanism...

If someone wants me to test something out, and has the patience to guide me some, I will be more than glad to test away. I know that having everything local may be a curse or a blessing. But the point is there's no penalty for me to test a million times over: no credits to use, no cost. I have a $4K machine that handles all the models locally. We can try other inference engines if we want. I've got nothing to lose since 1.3 is roasted right now, unless someone knows how to fix it!

So while adding my name to the hat of people saying 'yep, it's not fixed', I'm also saying if you want to use me as a test bed, I'm willing to be one. Just gotta understand this stuff is crazy new to me. I fully understand that all the different configurations I've been through probably make me a high target for "ah, it's just misconfigured on his end", which is valid and very well may be true! Since I was using AI to help me, I can't say with any certainty that I'm not the problem.
But I was glad to see that others were posting about this issue, and as recently as 2 days ago too. That gives me hope there is a bug in there. Hoping the A0 team can figure this out for good. A0 looks so awesome, and when it was working on the old version, man, I was so impressed. I could feed it information or tell it specific websites to ingest and make it an expert in an instant. Which is our goal: to have a reliable system admin in a box, so to speak. With lots of guardrails of course.
- #3: duplicate response loop breaker (breaks after 3 identical responses)
- #4: dynamic output truncation threshold based on context window size
- #2: resolve §§secret() / $$secret() placeholders in MCP server env/args/url/headers
- #19: scheduler update_task tool method + prompt documentation

Already applied (verified, skipping): #22 parallel MCP init, agent0ai#62 context window optimization

Upstream: PR agent0ai#1265, PR agent0ai#857, PR agent0ai#1150, PR agent0ai#1105

Made-with: Cursor
Summary
This PR adds a duplicate response loop breaker to prevent the agent from getting stuck in infinite loops when receiving "You have sent the same message again" errors from external APIs.
Problem
The agent can get stuck in an infinite loop when:
Solution
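- Add duplicate_retries counter to track consecutive duplicate responses
- Break loop after 3 consecutive identical responses with HandledException
- Log error message when loop is broken
- Reset duplicate_retries on successful iteration
Changes
- agent.py: Added duplicate_retries counter and loop breaker logic (13 lines added)
Testing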
Tested by triggering duplicate response scenarios - agent now breaks out after 3 attempts with clear error message.
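For readers who want to picture the behaviour, here is a minimal, self-contained sketch of the loop breaker described above. It is not the code from agent.py: `run_loop`, `is_final`, the `"final:"` convention, and the local `HandledException` stand-in are assumptions for illustration. It also assumes that a "duplicate" means a response identical to the one immediately before it, and that the breaker trips on the third such duplicate in a row.

```python
import logging

logger = logging.getLogger("agent")

MAX_DUPLICATE_RETRIES = 3  # threshold stated in the PR description


class HandledException(Exception):
    """Local stand-in for the project's HandledException (assumed, not imported)."""


def run_loop(responses, is_final=lambda r: r.startswith("final:")):
    """Consume responses until a final one arrives or the breaker trips.

    `responses` is any iterable of strings standing in for successive LLM
    outputs; `is_final` is a hypothetical predicate marking the end of the loop.
    """
    last_response = None
    duplicate_retries = 0
    for response in responses:
        if response == last_response:
            # Same output as the previous iteration: count it and maybe break.
            duplicate_retries += 1
            if duplicate_retries >= MAX_DUPLICATE_RETRIES:
                message = (
                    f"Loop broken after {duplicate_retries} "
                    "consecutive duplicate responses"
                )
                logger.error(message)
                raise HandledException(message)
            continue
        # A new response counts as a successful iteration: reset the counter.
        duplicate_retries = 0
        last_response = response
        if is_final(response):
            return response
    return None  # input exhausted without a final response
```

With these assumptions, `run_loop(["a", "a", "b", "b", "final: x"])` returns `"final: x"` because the counter resets whenever the response changes, while `run_loop(["a", "a", "a", "a"])` raises `HandledException` on the third consecutive duplicate.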
Related Issues
Fixes #1056
Fixes #1000
Related to #1187, #1011
Notes
This is a core code fix that cannot be achieved through agent behavior changes alone, because during a loop the LLM is not in a coherent enough state to recognize and break the pattern.