Parent
Sub-issue of #204 (Phase 4: Hardening)
Summary
Polecats that survive a container restart get stuck in a 5-minute reset cycle. The bead cycles in_progress → open → in_progress every 5 minutes, restarting the agent's work from checkpoint each time.
Root Cause
A timing race between startAgentInContainer's 60-second AbortSignal.timeout and the container's actual agent startup time:
- dispatchAgent sets the agent to working and calls startAgentInContainer
- The container takes >60s to respond (cold start: git clone + worktree). The timeout fires → startAgentInContainer returns false
- dispatchAgent's !started path sets the agent back to idle (scheduling.ts:166)
- The container DID start the agent — the timeout was for the HTTP response, not the process
- The agent starts working and sends heartbeats via touchAgentHeartbeat → updates last_activity_at. But touchAgent only updates last_activity_at — it does NOT restore status to working. The agent remains idle in agent_metadata.
- 5 minutes later, reconcileBeads Rule 3 (reconciler.ts:580-633) fires: the bead is in_progress and stale (STALE_IN_PROGRESS_TIMEOUT_MS = 5 min), and the rule looks for a working/stalled agent hooked to it (lines 610-613). The agent IS hooked, but its status is idle, so no working/stalled agent is found → the rule matches → bead reset to open, assignee cleared
- Normal scheduling re-hooks and re-dispatches → the cycle repeats
The "already running" detection in startAgentInContainer doesn't help because the retry paths (schedulePendingWork / reconcileBeads Rule 2) only dispatch agents whose hooked bead is open. The bead is in_progress when the race happens.
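The shape of the race reduces to a few lines. The following is a simulation with made-up timings and helper names, not the real dispatch code; it only demonstrates how a response deadline misreports a slow but successful start:

```typescript
// Simulated race (hypothetical, not the real Gastown code): the start
// request's timeout bounds only the HTTP response, so a slow cold start
// reports started=false while the agent process is already alive.
type RaceResult = { started: boolean; processAlive: boolean };

function withTimeout<T>(p: Promise<T>, ms: number): Promise<T | "timeout"> {
  return Promise.race([
    p,
    new Promise<"timeout">((resolve) => setTimeout(() => resolve("timeout"), ms)),
  ]);
}

async function demoStartRace(): Promise<RaceResult> {
  let processAlive = false;
  // Cold start: the container spawns the agent process immediately...
  const coldStart = (async () => {
    processAlive = true;
    // ...but the HTTP reply arrives after the caller's deadline.
    await new Promise((resolve) => setTimeout(resolve, 50));
    return true;
  })();
  // Stand-in for the 60s AbortSignal.timeout: here 10ms vs a 50ms reply.
  const reply = await withTimeout(coldStart, 10);
  return { started: reply !== "timeout", processAlive };
}
```

The caller sees started=false and demotes the agent, even though processAlive is true by the time the timeout fires.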
Proposed Fix
Option A (simplest): In touchAgentHeartbeat / touchAgent, if the agent's status is idle but it's receiving heartbeats, restore status to working. A heartbeat is proof the agent is alive in the container:
```sql
-- In touchAgent:
UPDATE agent_metadata
SET last_activity_at = ?,
    status = CASE WHEN status = 'idle' THEN 'working' ELSE status END
WHERE bead_id = ?
```
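An in-memory model of that rule (hypothetical row shape, not the actual agent_metadata schema) makes the intended behavior concrete: a heartbeat refreshes last_activity_at, revives an agent wrongly demoted to idle, and leaves every other status alone:

```typescript
// Hypothetical in-memory model of Option A; mirrors the SQL above.
type AgentStatus = "idle" | "working" | "stalled";
type AgentRow = { beadId: string; status: AgentStatus; lastActivityAt: number };

function touchAgent(rows: AgentRow[], beadId: string, now: number): void {
  for (const row of rows) {
    if (row.beadId !== beadId) continue;
    row.lastActivityAt = now;
    // status = CASE WHEN status = 'idle' THEN 'working' ELSE status END
    if (row.status === "idle") row.status = "working";
  }
}
```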
Option B: In reconcileBeads Rule 3, also check last_activity_at freshness. If the agent has a recent heartbeat (within 90s), skip the rule — the agent is alive regardless of its status field.
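A sketch of that guard, with assumed field names rather than the real reconciler types, could look like:

```typescript
// Option B sketch: Rule 3 skips the reset when the hooked agent has a
// fresh heartbeat, regardless of what its status field says.
const HEARTBEAT_FRESH_MS = 90_000;

type HookedAgent = { status: "idle" | "working" | "stalled"; lastActivityAt: number };

function shouldResetStaleBead(agent: HookedAgent | undefined, now: number): boolean {
  if (!agent) return true; // nothing hooked: reset the stale bead
  if (agent.status === "working" || agent.status === "stalled") return false;
  // Option B addition: a recent heartbeat proves liveness.
  if (now - agent.lastActivityAt < HEARTBEAT_FRESH_MS) return false;
  return true;
}
```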
Option C: Increase startAgentInContainer timeout beyond typical cold-start time (120s+), or make it not set the agent to idle on timeout (leave it working — if the agent truly didn't start, reconcileAgents heartbeat check catches it after 90s).
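In outline (constant names assumed, not taken from the codebase), Option C moves the liveness decision from the HTTP timeout to the heartbeat window:

```typescript
// Option C sketch: raise the dispatch timeout above the cold-start worst
// case, and let the heartbeat reconciler, not the HTTP timeout, decide
// whether an agent is dead.
const START_AGENT_TIMEOUT_MS = 120_000; // up from 60_000
const HEARTBEAT_STALE_MS = 90_000;

// Backstop check: demote only after a full missed-heartbeat window.
function isAgentDead(nowMs: number, lastActivityAtMs: number): boolean {
  return nowMs - lastActivityAtMs > HEARTBEAT_STALE_MS;
}
```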
Reproduction
- Deploy to production (pnpm deploy:prod)
- Container eviction + restart occurs
- Create a convoy with 2+ beads
- Observe beads cycling in_progress → open → in_progress every 5 minutes
- Agent status messages show active work (tool calls, file reads), but agent_metadata.status stays idle
Affected Code
cloudflare-gastown/src/dos/town/scheduling.ts:155-178 — dispatchAgent !started path
cloudflare-gastown/src/dos/town/agents.ts:540-567 — touchAgent (updates last_activity_at only)
cloudflare-gastown/src/dos/town/reconciler.ts:580-633 — reconcileBeads Rule 3
cloudflare-gastown/src/dos/town/reconciler.ts:55-56 — STALE_IN_PROGRESS_TIMEOUT_MS = 5 min