fix(gastown): Agents permanently stuck after exhausting dispatch_attempts — no reset mechanism #1932

@jrf0110

Bug

When agents exhaust their dispatch_attempts limit (currently 2), they go permanently idle with no mechanism to recover. The reconciler detects the stuck state (emitting ~1,400 invariant violations per hour) and emits dispatch actions, but agents won't retry past their limit. Towns go permanently dead until manual intervention.

Evidence

Customer town a7b9c59e had a convoy running overnight. At ~03:00 UTC on Apr 2, all agents hit dispatch_attempts: 2 simultaneously (likely a container infrastructure issue). The town has been stuck for 12+ hours:

  • 9 agents idle, 6 with dispatch_attempts: 2
  • 2 MR reviews in_progress assigned to idle refinery
  • 2 issues in_review waiting on reviews that will never come
  • 1 issue not started
  • Reconciler: ~700 ticks/hour, ~1,400 violations/hour, ~1,400 actions/hour — all wasted
  • Last productive work: 04:13 UTC. Zero work since.

Root Cause

dispatch_attempts is incremented on each failed dispatch but never reset. Once an agent reaches the max (2), it's permanently excluded from dispatch. There's no mechanism to:

  1. Reset attempts after a cooldown period (e.g., reset to 0 after 30 minutes)
  2. Reset attempts when the container comes back healthy (heartbeat received)
  3. Reset attempts when the user manually intervenes (settings change, model change)
  4. Distinguish between "transient failure" (container briefly unavailable) and "permanent failure" (bad config)

Fix

Fix 1 (Critical): Auto-reset dispatch_attempts after a cooldown

When an agent has dispatch_attempts >= max and last_activity_at is older than a cooldown period (e.g., 30 minutes), reset dispatch_attempts to 0. This allows agents to retry after a reasonable backoff.

Add to reconcileAgents or reconcileBeads:

// Rule: Reset exhausted agents after cooldown
for (const agent of exhaustedAgents) {
  if (staleMs(agent.last_activity_at, DISPATCH_RESET_COOLDOWN_MS)) {
    actions.push({
      type: 'transition_agent',
      agent_id: agent.bead_id,
      reason: 'dispatch attempts reset after cooldown',
      // The handler for this action must also reset dispatch_attempts to 0
    });
  }
}
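
A self-contained sketch of the cooldown rule follows, for illustration only. The `Agent` and `ResetAction` shapes, the `reset_dispatch_attempts` action type, and the `isStale` helper are assumptions made for this example, not the actual types in reconciler.ts; timestamps are assumed to be epoch milliseconds, and `now` is passed in so the rule can be tested deterministically.

```typescript
// Hypothetical shapes; the real agent and action types may differ.
interface Agent {
  bead_id: string;
  dispatch_attempts: number;
  last_activity_at: number; // epoch milliseconds (assumed representation)
}

interface ResetAction {
  type: 'reset_dispatch_attempts'; // hypothetical action type
  agent_id: string;
  reason: string;
}

const MAX_DISPATCH_ATTEMPTS = 2; // current limit described in this issue
const DISPATCH_RESET_COOLDOWN_MS = 30 * 60 * 1000; // 30 minutes

// True when `timestamp` is at least `thresholdMs` older than `now`.
function isStale(timestamp: number, thresholdMs: number, now: number): boolean {
  return now - timestamp >= thresholdMs;
}

// Emits one reset action per agent that is exhausted AND past the cooldown.
function resetExhaustedAgents(agents: Agent[], now: number): ResetAction[] {
  const actions: ResetAction[] = [];
  for (const agent of agents) {
    const exhausted = agent.dispatch_attempts >= MAX_DISPATCH_ATTEMPTS;
    if (exhausted && isStale(agent.last_activity_at, DISPATCH_RESET_COOLDOWN_MS, now)) {
      actions.push({
        type: 'reset_dispatch_attempts',
        agent_id: agent.bead_id,
        reason: 'dispatch attempts reset after cooldown',
      });
    }
  }
  return actions;
}
```

Agents below the limit, or exhausted agents that were active within the cooldown window, produce no action, so the rule is safe to evaluate on every reconciler tick.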

Fix 2: Reset attempts on successful heartbeat

When the container sends a heartbeat confirming an agent is running, reset its dispatch_attempts to 0. A heartbeat proves the container is functional — there's no reason to keep a stale failure count.
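
As a rough sketch of this rule, a heartbeat handler could clear the counter whenever it records the heartbeat. The `AgentState` shape, the `last_heartbeat_at` field, and the `applyHeartbeat` function are hypothetical names used for illustration; the real heartbeat path may look different.

```typescript
// Hypothetical agent state; only dispatch_attempts is taken from this issue.
interface AgentState {
  bead_id: string;
  dispatch_attempts: number;
  last_heartbeat_at: number | null; // epoch milliseconds, assumed field
}

// Returns a new state with the heartbeat recorded and the failure count cleared.
// A heartbeat shows the container is functional, so the old count is stale.
function applyHeartbeat(agent: AgentState, receivedAt: number): AgentState {
  return {
    ...agent,
    dispatch_attempts: 0,
    last_heartbeat_at: receivedAt,
  };
}
```

The function returns a copy rather than mutating its input, which keeps the reset easy to test in isolation.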

Fix 3: Increase the max from 2

dispatch_attempts: 2 is very aggressive — two failures and the agent is permanently dead. Increase to at least 5-10, with exponential backoff between attempts (30s, 1m, 2m, 5m, 10m).
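
One possible way to encode the schedule above is a lookup table. The listed delays do not double exactly (2m to 5m), so this sketch reproduces them verbatim instead of computing powers of two. The names `BACKOFF_SCHEDULE_MS`, `backoffDelayMs`, and `canDispatch` are hypothetical, and the limit of 6 is an assumed value inside the proposed 5-10 range, chosen so that all five delays are used.

```typescript
// Delays from this issue: 30s, 1m, 2m, 5m, 10m.
const BACKOFF_SCHEDULE_MS: number[] = [
  30 * 1000,
  60 * 1000,
  2 * 60 * 1000,
  5 * 60 * 1000,
  10 * 60 * 1000,
];

const MAX_DISPATCH_ATTEMPTS = 6; // assumed value within the proposed 5-10 range

// Delay to wait before the next dispatch, given the number of failed attempts so far.
// No failures means no delay; counts beyond the table reuse the last entry.
function backoffDelayMs(failedAttempts: number): number {
  if (failedAttempts <= 0) return 0;
  const index = Math.min(failedAttempts - 1, BACKOFF_SCHEDULE_MS.length - 1);
  return BACKOFF_SCHEDULE_MS[index];
}

// An agent may be dispatched when it is under the limit and its backoff has elapsed.
// Timestamps are epoch milliseconds (assumed representation).
function canDispatch(failedAttempts: number, lastAttemptAt: number, now: number): boolean {
  if (failedAttempts >= MAX_DISPATCH_ATTEMPTS) return false;
  return now - lastAttemptAt >= backoffDelayMs(failedAttempts);
}
```

With these values, an agent that fails repeatedly waits a total of 18.5 minutes across its retries before reaching the limit, after which one of the reset mechanisms in this issue would have to apply.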

Fix 4: Reset on container restart

When a container is confirmed restarted (first heartbeat after being not_found/exited), reset dispatch_attempts for ALL agents in that town. The container restart likely fixed whatever caused the dispatch failures.
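
A minimal sketch of restart detection, assuming the reconciler can compare the previously observed container status with the current one. The `not_found` and `exited` statuses come from the text above; the `running` status, the `TownAgent` shape, and both function names are assumptions for illustration.

```typescript
// 'not_found' and 'exited' appear in this issue; 'running' is an assumed name
// for the status implied by a received heartbeat.
type ContainerStatus = 'running' | 'not_found' | 'exited';

// Hypothetical minimal agent shape.
interface TownAgent {
  bead_id: string;
  dispatch_attempts: number;
}

// A restart is the first 'running' observation after 'not_found' or 'exited'.
function isContainerRestart(previous: ContainerStatus, current: ContainerStatus): boolean {
  return current === 'running' && (previous === 'not_found' || previous === 'exited');
}

// On a detected restart, returns copies of ALL agents in the town with the counter cleared.
// Otherwise returns the input array unchanged.
function resetTownOnRestart(
  agents: TownAgent[],
  previous: ContainerStatus,
  current: ContainerStatus,
): TownAgent[] {
  if (!isContainerRestart(previous, current)) return agents;
  return agents.map((agent) => ({ ...agent, dispatch_attempts: 0 }));
}
```

Because the reset only fires on the transition into `running`, steady-state heartbeats from a healthy container do not trigger a town-wide reset.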

Fix 5: Add manual user control to Agent Drawer UI

In the agent drawer UI, next to the dispatch attempts count, add a button that lets users reset the dispatch attempts counter.

Related

Files

  • src/dos/town/reconciler.ts — dispatch attempt checks in reconcileBeads and reconcileReviewQueue
  • src/dos/town/agents.ts — dispatch_attempts field, increment logic
  • src/dos/town/actions.ts — dispatch_agent action handler (increments attempts)

Acceptance Criteria

  • Agents auto-reset dispatch_attempts to 0 after a configurable cooldown (default: 30 min)
  • Agents reset dispatch_attempts on successful heartbeat
  • Agents reset dispatch_attempts on container restart detection
  • Max dispatch attempts increased to 5-10 with exponential backoff
  • Towns self-recover from mass dispatch failures within 30-60 minutes without manual intervention

Labels

  • P0 (Blocks soft launch)
  • bug (Something isn't working)
  • gt:core (Reconciler, state machine, bead lifecycle, convoy flow)
