Skip to content

fix(web): stop busy-looping on stale gate_response messages#96

Merged
jrob5756 merged 3 commits intomicrosoft:mainfrom
PolyphonyRequiem:fix/gate-response-queue-spin
Apr 20, 2026
Merged

fix(web): stop busy-looping on stale gate_response messages#96
jrob5756 merged 3 commits intomicrosoft:mainfrom
PolyphonyRequiem:fix/gate-response-queue-spin

Conversation

@PolyphonyRequiem
Copy link
Copy Markdown
Member

Summary

WebDashboardServer.wait_for_gate_response had a latent busy-loop bug that becomes reachable as soon as any gate_response WebSocket message targets a different agent than the one currently being awaited.

python while True: msg = await self._gate_response_queue.get() if msg.get(""agent_name"") == agent_name: return msg # Not for this agent — put it back self._gate_response_queue.put_nowait(msg) await asyncio.sleep(0.01)

asyncio.Queue has no deduplication. The same stale message is dequeued, re-enqueued, and re-inspected forever at ~100 iterations/sec, pinning a CPU core and preventing the waiter from ever blocking on a truly-empty queue.

Why does this happen at all?

Conductor only presents one gate at a time, so in principle the queue should only ever contain messages targeting the active gate. But stale gate_response messages can reach the queue:

  • A user double-clicks an option in the per-run dashboard.
  • Two clients are connected and each sends a response.
  • A slow client's frame arrives after the gate already resolved via another path.
  • A malformed / buggy client sends a response with the wrong agent_name.

Previously this was masked by the has_connections() short-circuit in _handle_gate_with_web that skipped the web path entirely when no client was connected at gate-presentation time. With #95 removing that short-circuit, the spin becomes trivially reachable.

Fix

Since only one gate is ever awaited concurrently, a non-matching gate_response is definitionally stale — the coroutine that would have accepted it is already gone, and re-queueing can never deliver it. Discard stale messages with a warning log, letting the next await .get() block properly on an empty queue until a genuine matching message arrives.

Risk

Low. The # so they are not lost comment in the original code suggests re-queueing was defensive for a hypothetical future where multiple gates could be active simultaneously. Conductor's current architecture doesn't support that, and if it's ever added, a per-agent dispatch (e.g. dict[str, Future]) would be the right primitive rather than a shared queue with round-robin rescheduling.

Tests

No existing tests cover wait_for_gate_response. A unit test that pushes a non-matching then matching message would document the contract — happy to add if desired.

Related

`wait_for_gate_response` previously re-queued any message whose
`agent_name` did not match the expected agent, then slept 10ms and
retried. Because `asyncio.Queue` has no deduplication, the same stale
message would immediately be dequeued again on the next iteration —
producing a tight loop that re-enqueues and re-inspects the same dict
forever (cpu-bound, 100 iter/sec per stale message).

In practice this never triggered during normal flow because conductor
presents one gate at a time and the `has_connections()` short-circuit
used to hide the issue. With that short-circuit now gone (PR to always
race both tasks), and with any client ever producing a duplicate click,
the spin becomes reachable.

Fix: since conductor only ever awaits one gate at a time, any
non-matching `gate_response` is definitionally stale (a duplicate
click from a dashboard that missed the first resolution, or a message
for a gate that already completed). Re-queueing can never deliver it
— the matching `wait_for_gate_response` call is already gone.
Discard stale messages with a warning log so the queue drains cleanly
and the next `await .get()` blocks properly on an empty queue.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@82ec042). Learn more about missing BASE report.

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #96   +/-   ##
=======================================
  Coverage        ?   85.58%           
=======================================
  Files           ?       46           
  Lines           ?     6603           
  Branches        ?        0           
=======================================
  Hits            ?     5651           
  Misses          ?      952           
  Partials        ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Daniel Green and others added 2 commits April 17, 2026 13:23
Adds TestWaitForGateResponse with two cases:
- Returns the matching message.
- Discards stale (non-matching) messages without re-queueing,
  regression-covering the busy-loop bug fixed in this PR.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Collaborator

@jrob5756 jrob5756 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. Thanks!

@jrob5756 jrob5756 merged commit fd04acc into microsoft:main Apr 20, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants