Parent
Part of #204 (Phase 3: Multi-Rig + Scaling)
Problem
Triage agents enter a feedback loop in production. The cycle repeats every ~5 seconds with the failure count climbing (59 → 61 → 62 → ...).
Important context: The container is NOT dead. The Mayor works fine, polecats can be dispatched and complete work. The triage agent specifically is failing.
Production Evidence
Worker logs show the following failure modes, all occurring in the same loop:
- "The container is not running, consider calling start()"
- "The container is not listening in the TCP address 10.0.0.1:8080"
- "Timeout waiting for server to start after 30000ms"
- "Agent ab8032fb... is already running"
Activity logs show a ~15-second cycle of: triage request created → batch bead created → agent hooked → both beads closed → agent unhooked → repeat.
Root Causes
1. Triage agent dispatched as role: 'polecat' with unnecessary git clone — THE PRIMARY CAUSE
File: src/dos/Town.do.ts:2576 and container/src/agent-runner.ts:467-498
maybeDispatchTriageAgent() calls getOrCreateAgent(sql, 'polecat', ...) and dispatches with role: 'polecat'. This forces the full git clone + worktree flow in agent-runner.ts (only role === 'mayor' has a skip). The triage agent does no code work — it just resolves triage request beads via gt_triage_resolve. The git clone is unnecessary and is likely what's causing the startup failure or timeout.
The worker logs show "Timeout waiting for server to start after 30000ms" and "Agent ab8032fb... is already running". The 30-second timeout suggests the git clone is hanging or slow. The "already running" error occurs when a previous triage agent is still mid-startup (in starting or running status in the container's agents Map at process-manager.ts:296-298) when the next 5-second alarm tick tries to dispatch the same agent ID again.
2. detectCrashLoops() doesn't exclude triage agent failures
File: src/dos/town/patrol.ts, detectCrashLoops() function
The query counts ALL status_changed → failed events per agent_id in bead_events. It does not filter by bead type or label. When a triage batch bead is marked failed (because the dispatch didn't work), the event is counted alongside regular polecat failures. After 3 triage dispatch failures in 30 minutes, detectCrashLoops creates a triage request about the triage agent's own crash loop — feeding the loop.
3. No dispatch cooldown on triage batch beads
File: src/dos/Town.do.ts, maybeDispatchTriageAgent()
The batch bead dedup guard (checking for existing open/in_progress batch beads with gt:triage label) doesn't catch terminal-state beads. Once a batch bead is failed or closed, the guard passes on the next alarm tick (5 seconds later), allowing immediate re-dispatch. The 5-second alarm interval means a failed dispatch is retried almost immediately.
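A minimal sketch of the gap, assuming the guard is a SQL check over a beads table (table and column names here are illustrative, not the actual schema):

```typescript
// Hypothetical shape of the dedup guard in maybeDispatchTriageAgent().
// SqlStorage is the Durable Object SQLite type (ambient via @cloudflare/workers-types).
// Only non-terminal beads block a new dispatch, so a batch bead that just
// moved to 'failed' or 'closed' stops guarding on the very next alarm tick.
function triageDedupGuardPasses(sql: SqlStorage): boolean {
  const blocking = sql
    .exec(
      `SELECT id FROM beads
       WHERE labels LIKE '%gt:triage%'
         AND status IN ('open', 'in_progress')  -- terminal states are not considered
       LIMIT 1`
    )
    .toArray();
  return blocking.length === 0; // true again ~5 seconds after a failed dispatch
}
```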
4. Agent ID collision in container process manager
File: container/src/process-manager.ts:296-298
When startAgent is called for an agent ID that's still in the agents Map with running or starting status, it throws "Agent X is already running". The previous triage agent may still be mid-git-clone (takes >5 seconds) when the next alarm tick dispatches with the same polecat agent ID (since getOrCreateAgent reuses idle polecats, and the TownDO has already unhooked+reset the agent to idle after the previous failure at Town.do.ts:2604-2606). The TownDO thinks the agent is idle; the container thinks it's still starting.
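A paraphrased sketch of the collision, assuming the container keeps an in-memory agents Map keyed by agent ID (the real process-manager.ts fields may differ):

```typescript
// Paraphrased guard from process-manager.ts (field names assumed).
type AgentStatus = 'starting' | 'running' | 'exited';
const agents = new Map<string, { status: AgentStatus }>();

function startAgent(agentId: string): void {
  const existing = agents.get(agentId);
  if (existing && (existing.status === 'starting' || existing.status === 'running')) {
    // The previous triage agent is still mid-git-clone here, while the TownDO
    // has already reset its own record of the same agent ID back to idle,
    // so the next 5-second alarm tick re-dispatches and hits this throw.
    throw new Error(`Agent ${agentId} is already running`);
  }
  agents.set(agentId, { status: 'starting' });
  // ... spawn the agent process, then flip status to 'running'
}
```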
Fixes Needed
Fix 1: Exclude triage-related beads from crash loop detection
detectCrashLoops() should filter out bead_events where the bead has label gt:triage or type triage_request. Triage agent failures are an operational concern, not a crash loop to be triaged by another triage agent.
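One possible shape of the filter, assuming bead_events joins back to a beads table with type and labels columns (the schema names are assumptions, not the actual patrol.ts query):

```typescript
// Sketch: count crash-loop failures per agent while excluding triage beads.
// SqlStorage is the Durable Object SQLite type (ambient via @cloudflare/workers-types).
function detectCrashLoopCandidates(sql: SqlStorage, windowMs = 30 * 60 * 1000) {
  return sql
    .exec(
      `SELECT e.agent_id, COUNT(*) AS failures
       FROM bead_events e
       JOIN beads b ON b.id = e.bead_id
       WHERE e.event_type = 'status_changed'
         AND e.new_status = 'failed'
         AND e.created_at > ?
         AND b.type != 'triage_request'                              -- skip triage request beads
         AND (b.labels IS NULL OR b.labels NOT LIKE '%gt:triage%')   -- skip triage batch beads
       GROUP BY e.agent_id
       HAVING failures >= 3`,
      Date.now() - windowMs
    )
    .toArray();
}
```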
Fix 2: Add a cooldown/backoff on triage dispatch
After a triage dispatch failure, maybeDispatchTriageAgent() should back off (see the sketch after this list). Options:
- Track a last_triage_attempt_at timestamp and skip dispatch if within a cooldown (e.g., 5 minutes)
- Use exponential backoff based on consecutive failures
- Check container health before attempting triage dispatch — if ensureContainerReady() is failing, skip triage dispatch entirely
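A minimal sketch combining the first two options, assuming the TownDO can persist two small values in Durable Object storage (the key names and helper are hypothetical):

```typescript
// Sketch of a cooldown + exponential backoff gate for maybeDispatchTriageAgent().
// DurableObjectStorage is the standard DO storage API; key names are assumed.
const BASE_COOLDOWN_MS = 5 * 60 * 1000;  // 5-minute base cooldown
const MAX_COOLDOWN_MS = 60 * 60 * 1000;  // cap backoff at 1 hour

async function shouldAttemptTriageDispatch(storage: DurableObjectStorage): Promise<boolean> {
  const lastAttempt = (await storage.get<number>('last_triage_attempt_at')) ?? 0;
  const failures = (await storage.get<number>('triage_dispatch_failures')) ?? 0;
  // 5m, 10m, 20m, ... after consecutive failures; reset to the base on success.
  const cooldown = Math.min(BASE_COOLDOWN_MS * 2 ** failures, MAX_COOLDOWN_MS);
  return Date.now() - lastAttempt >= cooldown;
}

// On a failed dispatch: increment 'triage_dispatch_failures' and stamp
// 'last_triage_attempt_at'. On a successful dispatch: reset the counter to 0.
```

The third option (checking ensureContainerReady() first) can sit in front of this gate so a known-bad container short-circuits without burning a retry.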
Fix 3: Skip git clone for triage agents
Either dispatch with role: 'triage' and add a skip in agent-runner.ts (similar to the mayor's createMayorWorkspace), or add the triage role to the mayor-like branch that creates a minimal workspace. The triage agent only needs a working directory for the Kilo SDK process, not a git repo.
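A sketch of what the triage branch in agent-runner.ts could look like, with the existing mayor and polecat flows passed in as stand-ins (function names other than those cited above are hypothetical):

```typescript
import * as fs from 'node:fs/promises';
import * as os from 'node:os';
import * as path from 'node:path';

type AgentRole = 'mayor' | 'polecat' | 'triage';

// Sketch: give triage agents a throwaway directory instead of the full
// git clone + worktree flow. The two callbacks stand in for whatever
// agent-runner.ts already does for mayors and polecats.
async function prepareWorkspace(
  role: AgentRole,
  agentId: string,
  createMayorWorkspace: (agentId: string) => Promise<string>,
  cloneAndCreateWorktree: (agentId: string) => Promise<string>,
): Promise<string> {
  if (role === 'mayor') return createMayorWorkspace(agentId);
  if (role === 'triage') {
    // Triage agents only call gt_triage_resolve; they need a working
    // directory for the Kilo SDK process, not a git checkout.
    return fs.mkdtemp(path.join(os.tmpdir(), `triage-${agentId}-`));
  }
  return cloneAndCreateWorktree(agentId); // existing polecat flow
}
```

On the Town.do.ts side this pairs with dispatching the agent as role: 'triage' instead of role: 'polecat'.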
Fix 4: Cap triage request creation rate
Add a maximum number of open triage requests (e.g., 10). If the cap is reached, stop creating new ones until existing ones are resolved. This prevents unbounded growth during extended outages.
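A sketch of the cap check before creating a new triage request (the beads schema and column names are assumptions):

```typescript
// Sketch: stop creating triage requests once too many are already open.
// SqlStorage is the Durable Object SQLite type; schema names are assumed.
const MAX_OPEN_TRIAGE_REQUESTS = 10;

function canCreateTriageRequest(sql: SqlStorage): boolean {
  const { n } = sql
    .exec<{ n: number }>(
      `SELECT COUNT(*) AS n FROM beads
       WHERE type = 'triage_request'
         AND status IN ('open', 'in_progress')`
    )
    .one();
  // Once the cap is hit, skip creation until existing requests are resolved.
  return n < MAX_OPEN_TRIAGE_REQUESTS;
}
```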
Acceptance Criteria
- detectCrashLoops() excludes triage batch bead failures from crash loop detection
- Triage agents are dispatched without a git clone (role: 'triage' handling)
Notes
- No data migration needed — cloud Gastown hasn't deployed to production (but this is actively happening on a production town right now)
- The original polecat crash loop detection is correct and valuable — the fix should preserve that while preventing the triage system from triaging itself
- The container being dead is the root trigger, but even when the container recovers, the backlog of triage requests and the accumulated failure count in bead_events would keep the loop running for the full 30-minute crash loop detection window