Bug
When agents exhaust their dispatch_attempts limit (currently 2), they go idle and have no way to recover. The reconciler detects the stuck state (emitting ~1,400 invariant violations per hour) and emits dispatch actions, but agents won't retry past their limit. Affected towns stay dead until someone intervenes manually.
Evidence
Customer town a7b9c59e had a convoy running overnight. At ~03:00 UTC on Apr 2, all agents hit dispatch_attempts: 2 simultaneously (likely a container infrastructure issue). The town has been stuck for 12+ hours:
- 9 agents idle, 6 with dispatch_attempts: 2
- 2 MR reviews in_progress assigned to idle refinery
- 2 issues in_review waiting on reviews that will never come
- 1 issue not started
- Reconciler: ~700 ticks/hour, ~1,400 violations/hour, ~1,400 actions/hour — all wasted
- Last productive work: 04:13 UTC. Zero work since.
Root Cause
dispatch_attempts is incremented on each failed dispatch but never reset. Once an agent reaches the max (2), it's permanently excluded from dispatch. There's no mechanism to:
- Reset attempts after a cooldown period (e.g., reset to 0 after 30 minutes)
- Reset attempts when the container comes back healthy (heartbeat received)
- Reset attempts when the user manually intervenes (settings change, model change)
- Distinguish between "transient failure" (container briefly unavailable) and "permanent failure" (bad config)
Fix
Fix 1 (Critical): Auto-reset dispatch_attempts after a cooldown
When an agent has dispatch_attempts >= max and last_activity_at is older than a cooldown period (e.g., 30 minutes), reset dispatch_attempts to 0. This allows agents to retry after a reasonable backoff.
Add to reconcileAgents or reconcileBeads:
    // Rule: Reset exhausted agents after cooldown
    for (const agent of exhaustedAgents) {
      if (staleMs(agent.last_activity_at, DISPATCH_RESET_COOLDOWN_MS)) {
        actions.push({
          type: 'transition_agent',
          agent_id: agent.bead_id,
          reason: 'dispatch attempts reset after cooldown',
          // Also reset dispatch_attempts to 0
        });
      }
    }
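For reference, the rule above can be written as a self-contained function that is easy to unit test. This is only a sketch: the Agent and ResetAction shapes, the reset_dispatch_attempts action type, and the staleMs semantics (true when the timestamp is older than the threshold) are assumptions, not the actual reconciler types.

```typescript
// Hypothetical shapes; the real agent record and action types may differ.
interface Agent {
  bead_id: string;
  dispatch_attempts: number;
  last_activity_at: string; // ISO-8601 timestamp
}

interface ResetAction {
  type: 'reset_dispatch_attempts'; // assumed action type, not from the codebase
  agent_id: string;
  reason: string;
}

const MAX_DISPATCH_ATTEMPTS = 2; // current limit per this report
const DISPATCH_RESET_COOLDOWN_MS = 30 * 60 * 1000; // 30 minutes

// Assumed semantics: true when `timestamp` is more than `thresholdMs` in the past.
function staleMs(timestamp: string, thresholdMs: number, now: number = Date.now()): boolean {
  return now - Date.parse(timestamp) > thresholdMs;
}

// Emits one reset action per agent that is both exhausted and past the cooldown.
function resetExhaustedAgents(agents: Agent[], now: number = Date.now()): ResetAction[] {
  const actions: ResetAction[] = [];
  for (const agent of agents) {
    const exhausted = agent.dispatch_attempts >= MAX_DISPATCH_ATTEMPTS;
    if (exhausted && staleMs(agent.last_activity_at, DISPATCH_RESET_COOLDOWN_MS, now)) {
      actions.push({
        type: 'reset_dispatch_attempts',
        agent_id: agent.bead_id,
        reason: 'dispatch attempts reset after cooldown',
      });
    }
  }
  return actions;
}
```

Passing `now` explicitly keeps the rule deterministic in tests; the handler for the emitted action would be the place that actually writes dispatch_attempts = 0.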
Fix 2: Reset attempts on successful heartbeat
When the container sends a heartbeat confirming an agent is running, reset its dispatch_attempts to 0. A heartbeat proves the container is functional — there's no reason to keep a stale failure count.
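A minimal sketch of the heartbeat reset, assuming agents are held in an in-memory map keyed by agent id. The AgentState shape, the last_heartbeat_at field, and the onAgentHeartbeat name are illustrative and not taken from the codebase.

```typescript
// Hypothetical per-agent state; field names other than dispatch_attempts are assumptions.
interface AgentState {
  agent_id: string;
  dispatch_attempts: number;
  last_heartbeat_at: number | null; // epoch milliseconds
}

// Records the heartbeat and clears the failure count.
// Returns false when the heartbeat refers to an unknown agent.
function onAgentHeartbeat(
  agents: Map<string, AgentState>,
  agentId: string,
  receivedAt: number,
): boolean {
  const agent = agents.get(agentId);
  if (!agent) return false;
  agent.last_heartbeat_at = receivedAt;
  agent.dispatch_attempts = 0; // a live heartbeat makes the old count stale
  return true;
}
```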
Fix 3: Increase the max from 2
dispatch_attempts: 2 is very aggressive — two failures and the agent is permanently dead. Increase to at least 5-10, with exponential backoff between attempts (30s, 1m, 2m, 5m, 10m).
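One way the backoff could be expressed is a lookup table that uses the delays listed above verbatim (they are not a strict doubling sequence). The function names and the default limit of 6 attempts are assumptions chosen so that all five delays are used; the real limit would be whatever value is settled on.

```typescript
// Delays taken from the schedule above: 30s, 1m, 2m, 5m, 10m.
const BACKOFF_SCHEDULE_MS: readonly number[] = [
  30 * 1000,
  60 * 1000,
  2 * 60 * 1000,
  5 * 60 * 1000,
  10 * 60 * 1000,
];

// Delay to wait before the next attempt, given how many attempts have already failed.
// Attempts beyond the end of the schedule reuse the last (longest) delay.
function nextDispatchDelayMs(failedAttempts: number): number {
  if (failedAttempts <= 0) return 0;
  const index = Math.min(failedAttempts, BACKOFF_SCHEDULE_MS.length) - 1;
  return BACKOFF_SCHEDULE_MS[index];
}

// True when the agent is under the limit and its backoff window has elapsed.
// maxAttempts defaults to 6 here purely for illustration.
function canDispatch(
  failedAttempts: number,
  lastAttemptAt: number,
  now: number,
  maxAttempts: number = 6,
): boolean {
  if (failedAttempts >= maxAttempts) return false;
  return now - lastAttemptAt >= nextDispatchDelayMs(failedAttempts);
}
```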
Fix 4: Reset on container restart
When a container is confirmed restarted (first heartbeat after being not_found/exited), reset dispatch_attempts for ALL agents in that town. The container restart likely fixed whatever caused the dispatch failures.
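The restart rule can be sketched as a status transition check: a heartbeat that arrives while the recorded status is not_found or exited is treated as a restart. The TownState shape and the onContainerHeartbeat name are assumptions; only the not_found and exited status values come from this report.

```typescript
type ContainerStatus = 'running' | 'not_found' | 'exited';

// Hypothetical town state; the real storage layout may differ.
interface TownState {
  container_status: ContainerStatus;
  agents: { agent_id: string; dispatch_attempts: number }[];
}

// Handles a container heartbeat. If the container was previously down,
// this is its first heartbeat after a restart, so every agent's counter is cleared.
// Returns true when a restart was detected.
function onContainerHeartbeat(town: TownState): boolean {
  const wasDown =
    town.container_status === 'not_found' || town.container_status === 'exited';
  town.container_status = 'running';
  if (!wasDown) return false;
  for (const agent of town.agents) {
    agent.dispatch_attempts = 0;
  }
  return true;
}
```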
Fix 5: Add manual user control to Agent Drawer UI
In the agent drawer UI, next to the dispatch attempts count, add a button that lets users reset the dispatch attempts counter.
Related
Files
src/dos/town/reconciler.ts — dispatch attempt checks in reconcileBeads and reconcileReviewQueue
src/dos/town/agents.ts — dispatch_attempts field, increment logic
src/dos/town/actions.ts — dispatch_agent action handler (increments attempts)
Acceptance Criteria
- Reset dispatch_attempts to 0 after a configurable cooldown (default: 30 min)
- Reset dispatch_attempts on successful heartbeat
- Reset dispatch_attempts on container restart detection