fix(web): stop busy-looping on stale gate_response messages#96
Merged
jrob5756 merged 3 commits intomicrosoft:mainfrom Apr 20, 2026
Merged
Conversation
`wait_for_gate_response` previously re-queued any message whose `agent_name` did not match the expected agent, then slept 10ms and retried. Because `asyncio.Queue` has no deduplication, the same stale message would immediately be dequeued again on the next iteration — producing a tight loop that re-enqueues and re-inspects the same dict forever (cpu-bound, 100 iter/sec per stale message). In practice this never triggered during normal flow because conductor presents one gate at a time and the `has_connections()` short-circuit used to hide the issue. With that short-circuit now gone (PR to always race both tasks), and with any client ever producing a duplicate click, the spin becomes reachable. Fix: since conductor only ever awaits one gate at a time, any non-matching `gate_response` is definitionally stale (a duplicate click from a dashboard that missed the first resolution, or a message for a gate that already completed). Re-queueing can never deliver it — the matching `wait_for_gate_response` call is already gone. Discard stale messages with a warning log so the queue drains cleanly and the next `await .get()` blocks properly on an empty queue. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #96 +/- ##
=======================================
Coverage ? 85.58%
=======================================
Files ? 46
Lines ? 6603
Branches ? 0
=======================================
Hits ? 5651
Misses ? 952
Partials ? 0 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Adds TestWaitForGateResponse with two cases: - Returns the matching message. - Discards stale (non-matching) messages without re-queueing, regression-covering the busy-loop bug fixed in this PR. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
WebDashboardServer.wait_for_gate_responsehad a latent busy-loop bug that becomes reachable as soon as anygate_responseWebSocket message targets a different agent than the one currently being awaited.python while True: msg = await self._gate_response_queue.get() if msg.get(""agent_name"") == agent_name: return msg # Not for this agent — put it back self._gate_response_queue.put_nowait(msg) await asyncio.sleep(0.01)asyncio.Queuehas no deduplication. The same stale message is dequeued, re-enqueued, and re-inspected forever at ~100 iterations/sec, pinning a CPU core and preventing the waiter from ever blocking on a truly-empty queue.Why does this happen at all?
Conductor only presents one gate at a time, so in principle the queue should only ever contain messages targeting the active gate. But stale
gate_responsemessages can reach the queue:Previously this was masked by the
has_connections()short-circuit in_handle_gate_with_webthat skipped the web path entirely when no client was connected at gate-presentation time. With #95 removing that short-circuit, the spin becomes trivially reachable.Fix
Since only one gate is ever awaited concurrently, a non-matching
gate_responseis definitionally stale — the coroutine that would have accepted it is already gone, and re-queueing can never deliver it. Discard stale messages with a warning log, letting the nextawait .get()block properly on an empty queue until a genuine matching message arrives.Risk
Low. The
# so they are not lostcomment in the original code suggests re-queueing was defensive for a hypothetical future where multiple gates could be active simultaneously. Conductor's current architecture doesn't support that, and if it's ever added, a per-agent dispatch (e.g.dict[str, Future]) would be the right primitive rather than a shared queue with round-robin rescheduling.Tests
No existing tests cover
wait_for_gate_response. A unit test that pushes a non-matching then matching message would document the contract — happy to add if desired.Related