Parent: #204 | Phase 1: Single Rig, Single Polecat
Goal
Fix three reliability issues discovered during local dev testing that cause the agent dispatch loop to retry indefinitely, waste resources, and prevent agents from completing real work.
Context
During end-to-end testing of the Mayor chat flow (PR 7 → PR 6 → PR 5 → PR 4), we found that while the full pipeline works (alarm fires → container dispatch → kilo serve → session creation → prompt delivery), agents exit immediately and the system enters an infinite retry loop. Three issues need to be addressed before PR 8 (Manual Merge Flow) can work correctly.
Issues
1. No API credentials passed to kilo serve
The container starts kilo serve without KILO_API_URL or any API key, so the kilo session has no way to call an LLM. The agent starts a session and sends the prompt, but the session completes instantly with no useful work because there are no model credentials.
Fix: The startAgentInContainer flow needs to pass KILO_API_URL (and any required auth) through to the container's buildAgentEnv(). The Rig DO config or worker environment should supply these values so kilo serve can route LLM calls through the Kilo gateway.
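A minimal sketch of the fix, assuming buildAgentEnv takes a base environment plus some Rig config object; KILO_API_KEY and the RigConfig field names are illustrative assumptions, not the project's real schema:

```typescript
// Sketch: merge Kilo gateway credentials into the agent's environment.
// KILO_API_KEY and the RigConfig fields are assumed names; the real
// buildAgentEnv signature may differ.
interface RigConfig {
  kiloApiUrl: string;
  kiloApiKey: string; // assumed field name
}

function buildAgentEnv(
  base: Record<string, string>,
  config: RigConfig,
): Record<string, string> {
  if (!config.kiloApiUrl || !config.kiloApiKey) {
    // Fail fast rather than dispatching an agent that will exit instantly.
    throw new Error('Missing Kilo gateway credentials in Rig config');
  }
  return {
    ...base,
    KILO_API_URL: config.kiloApiUrl,
    KILO_API_KEY: config.kiloApiKey,
  };
}
```

Failing fast here also gives issue 2's circuit breaker a clear error to escalate instead of a silent instant exit.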
2. No retry limit / circuit breaker for agent dispatch
When an agent exits immediately (e.g., due to missing credentials or a crash), the alarm loop runs indefinitely:
- witnessPatrol sees agent container status = exited → resets agent to idle
- schedulePendingWork finds idle agent with hooked bead → re-dispatches → 201 success
- 30s later, agent has exited again → repeat forever
There is no max retry count, backoff, or circuit breaker. The system creates dozens of kilo sessions per minute in the container, all of which exit immediately.
Fix: Track dispatch attempts per agent (or per bead). After N consecutive failed dispatches (e.g., agent exits within a short window), mark the bead as failed and stop retrying. Optionally create an escalation. Consider exponential backoff before the hard limit.
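The retry limit and backoff can be sketched as a small pure decision the alarm loop consults before re-dispatching. MAX_ATTEMPTS, the failure window, the backoff base, and the record shape are illustrative assumptions, not the project's real schema:

```typescript
// Sketch of a per-bead circuit breaker. An exit shortly after dispatch
// counts as a failed attempt; a healthy long-running agent resets the
// counter. All constants and field names here are assumptions.
const MAX_ATTEMPTS = 3;
const FAILURE_WINDOW_MS = 60_000; // exits within 60s of dispatch are failures
const BASE_BACKOFF_MS = 30_000;

interface DispatchRecord {
  attempts: number;      // consecutive fast-exit dispatches so far
  lastDispatchAt: number; // epoch ms of the last dispatch
}

type Verdict =
  | { action: 'dispatch'; delayMs: number }
  | { action: 'fail-bead' };

function nextDispatch(rec: DispatchRecord, now: number, agentExited: boolean): Verdict {
  const failedFast = agentExited && now - rec.lastDispatchAt < FAILURE_WINDOW_MS;
  const attempts = failedFast ? rec.attempts + 1 : 1; // healthy run resets the counter
  if (attempts >= MAX_ATTEMPTS) {
    return { action: 'fail-bead' }; // stop retrying; mark bead failed / escalate
  }
  // Exponential backoff before the hard limit: 30s, 60s, ...
  return { action: 'dispatch', delayMs: BASE_BACKOFF_MS * 2 ** (attempts - 1) };
}
```

In the DO, the record would live alongside the bead and the delayMs would feed the next setAlarm, so retries slow down instead of firing every patrol cycle.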
3. Agent completion does not close the bead
When an agent's session completes (detected via SSE isCompletionEvent), the process-manager.ts sets agent.status = 'exited' and agent.exitReason = 'completed', but nothing calls back to the Rig DO to transition the bead from in_progress to closed. The agent exits, witnessPatrol finds it, resets the agent to idle — but the bead stays in_progress with the agent still hooked, so schedulePendingWork re-dispatches.
In the normal flow (PR 8), gt_done handles this via agentDone(). But for the Mayor agent (which may complete without calling gt_done), there needs to be a mechanism to detect completion and close the bead. Options:
- The container's process manager could call a Rig DO endpoint on agent completion (e.g., POST /api/rigs/:rigId/agents/:agentId/done)
- witnessPatrol could detect exitReason = 'completed' and auto-close the bead
- The heartbeat mechanism could report completion state back to the DO
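The second option above can be sketched as a pure decision inside witnessPatrol. The status strings come from this issue; the function name and the 'error' exit reason are assumptions:

```typescript
// Sketch of option 2: witnessPatrol maps an exited agent's exitReason to
// a bead transition. 'completed', 'closed', 'failed' and 'in_progress'
// come from the issue text; the 'error' reason is an assumed value.
type BeadStatus = 'in_progress' | 'closed' | 'failed';

function beadStatusOnExit(exitReason: string | undefined): BeadStatus {
  if (exitReason === 'completed') return 'closed'; // session finished cleanly
  if (exitReason === 'error') return 'failed';     // session crashed
  return 'in_progress'; // unknown exit: leave hooked, let retry logic decide
}
```

Keeping the mapping pure makes it easy to unit-test, and leaves unknown exits to issue 2's retry/circuit-breaker path rather than silently closing a bead.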
Dependencies
- [Gastown] PR 5: Rig DO Alarm — Work Scheduler #212 (schedulePendingWork, witnessPatrol)
- kilo serve for Agent Management #305
- [Gastown] PR 6: tRPC Routes — Town & Rig Management #268 (sendMessage)
Acceptance Criteria
- KILO_API_URL and auth credentials are passed through to kilo serve processes in the container
- After N consecutive failed dispatches, retries stop and the bead is marked failed
- When an agent's session completes, the bead transitions to closed (or failed if the session errored)
- witnessPatrol or the container reports agent completion back to the Rig DO