
Bug: Triage agent feedback loop — crash loop detection triggers on its own failures #965

@jrf0110

Description


Parent

Part of #204 (Phase 3: Multi-Rig + Scaling)

Problem

Triage agents enter a feedback loop in production. The cycle repeats every ~5 seconds with the failure count climbing (59 → 61 → 62 → ...).

Important context: The container is NOT dead. The Mayor works fine, polecats can be dispatched and complete work. The triage agent specifically is failing.

Production Evidence

Worker logs show four failure modes, all occurring in the same loop:

"The container is not running, consider calling start()"
"The container is not listening in the TCP address 10.0.0.1:8080"
"Timeout waiting for server to start after 30000ms"
"Agent ab8032fb... is already running"

Activity logs show a ~15-second cycle of: triage request created → batch bead created → agent hooked → both beads closed → agent unhooked → repeat.

Root Causes

1. Triage agent dispatched as role: 'polecat' with unnecessary git clone — THE PRIMARY CAUSE

File: src/dos/Town.do.ts:2576 and container/src/agent-runner.ts:467-498

maybeDispatchTriageAgent() calls getOrCreateAgent(sql, 'polecat', ...) and dispatches with role: 'polecat'. This forces the full git clone + worktree flow in agent-runner.ts (only role === 'mayor' has a skip). The triage agent does no code work — it just resolves triage request beads via gt_triage_resolve. The git clone is unnecessary and is likely what's causing the startup failure or timeout.

The worker logs show "Timeout waiting for server to start after 30000ms" and "Agent ab8032fb... is already running". The 30-second timeout suggests the git clone is hanging or slow. The "already running" error occurs when a previous triage agent is still mid-startup (in starting or running status in the container's agents Map at process-manager.ts:296-298) when the next 5-second alarm tick tries to dispatch the same agent ID again.

2. detectCrashLoops() doesn't exclude triage agent failures

File: src/dos/town/patrol.ts, detectCrashLoops() function

The query counts ALL status_changed → failed events per agent_id in bead_events, with no filter on bead type or label. When a triage batch bead is marked failed (because dispatch didn't work), the event is counted alongside regular polecat failures. After 3 triage dispatch failures within 30 minutes, detectCrashLoops() creates a triage request about the triage agent's own crash loop — feeding the loop.
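
A toy illustration of the false positive (the BeadEvent shape and field names are invented for this sketch, not the real bead_events schema):

```ts
// Illustrative only: why unfiltered counting flags the triage agent itself.
type BeadEvent = { agentId: string; eventType: string; newStatus: string; labels: string[] };

const events: BeadEvent[] = [
  { agentId: 'triage-agent', eventType: 'status_changed', newStatus: 'failed', labels: ['gt:triage'] },
  { agentId: 'triage-agent', eventType: 'status_changed', newStatus: 'failed', labels: ['gt:triage'] },
  { agentId: 'triage-agent', eventType: 'status_changed', newStatus: 'failed', labels: ['gt:triage'] },
];

// Counting every status_changed -> failed event with no label filter reaches
// the crash-loop threshold (3 in 30 minutes) purely from failed triage batch
// beads, so detectCrashLoops() files a triage request about the triage agent,
// which creates the next batch bead destined to fail.
const failures = events.filter(e => e.eventType === 'status_changed' && e.newStatus === 'failed').length;
console.log(failures >= 3); // true
```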

3. No dispatch cooldown on triage batch beads

File: src/dos/Town.do.ts, maybeDispatchTriageAgent()

The batch bead dedup guard (checking for existing open/in_progress batch beads with gt:triage label) doesn't catch terminal-state beads. Once a batch bead is failed or closed, the guard passes on the next alarm tick (5 seconds later), allowing immediate re-dispatch. The 5-second alarm interval means a failed dispatch is retried almost immediately.

4. Agent ID collision in container process manager

File: container/src/process-manager.ts:296-298

When startAgent is called for an agent ID that's still in the agents Map with running or starting status, it throws "Agent X is already running". The previous triage agent may still be mid-git-clone (which takes >5 seconds) when the next alarm tick dispatches the same polecat agent ID (getOrCreateAgent reuses idle polecats, and the TownDO has already unhooked the agent and reset it to idle after the previous failure at Town.do.ts:2604-2606). The TownDO thinks the agent is idle; the container thinks it's still starting.
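
A minimal, self-contained reproduction of the race (class and method names are invented for the sketch, not taken from process-manager.ts):

```ts
// Sketch of the collision: the container-side guard still sees the agent as
// 'starting' while the TownDO has already reset it to idle and re-dispatches.
type AgentStatus = 'starting' | 'running' | 'idle';

class FakeProcessManager {
  private agents = new Map<string, { status: AgentStatus }>();

  startAgent(agentId: string): void {
    const existing = this.agents.get(agentId);
    if (existing && (existing.status === 'starting' || existing.status === 'running')) {
      throw new Error(`Agent ${agentId} is already running`);
    }
    this.agents.set(agentId, { status: 'starting' });
    // ...git clone takes >5 seconds, so status stays 'starting' across alarm ticks...
  }
}

const pm = new FakeProcessManager();
pm.startAgent('ab8032fb');     // first alarm tick: starts, clone is still running
try {
  pm.startAgent('ab8032fb');   // next 5-second tick: TownDO believes the agent is idle
} catch (e) {
  console.log((e as Error).message); // "Agent ab8032fb is already running"
}
```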

Fixes Needed

Fix 1: Exclude triage-related beads from crash loop detection

detectCrashLoops() should filter out bead_events where the bead has label gt:triage or type triage_request. Triage agent failures are an operational concern, not a crash loop to be triaged by another triage agent.
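
A sketch of the kind of exclusion that could be added to the detectCrashLoops() query. The table join and column names (event_type, new_status, labels) are assumptions based on the description above, not verified against the actual schema:

```ts
// Hypothetical shape of the crash-loop query with triage beads excluded.
const CRASH_LOOP_QUERY = /* sql */ `
  SELECT be.agent_id, COUNT(*) AS failures
  FROM bead_events be
  JOIN beads b ON b.id = be.bead_id
  WHERE be.event_type = 'status_changed'
    AND be.new_status = 'failed'
    AND be.created_at > datetime('now', '-30 minutes')
    AND b.type != 'triage_request'
    AND b.labels NOT LIKE '%gt:triage%'   -- exclude triage batch beads
  GROUP BY be.agent_id
  HAVING COUNT(*) >= 3
`;
```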

Fix 2: Add a cooldown/backoff on triage dispatch

After a triage dispatch failure, maybeDispatchTriageAgent() should back off. Options (a combined sketch follows this list):

  • Track a last_triage_attempt_at timestamp and skip dispatch if within a cooldown (e.g., 5 minutes)
  • Use exponential backoff based on consecutive failures
  • Check container health before attempting triage dispatch — if ensureContainerReady() is failing, skip triage dispatch entirely
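
A combined cooldown-plus-backoff gate might look like the following. The field names (lastTriageAttemptAt, triageFailureStreak) and thresholds are hypothetical; the real state would live in the TownDO's storage:

```ts
// Sketch of a cooldown + exponential backoff gate for triage dispatch.
interface TriageDispatchState {
  lastTriageAttemptAt: number;   // epoch ms of the last dispatch attempt (hypothetical field)
  triageFailureStreak: number;   // consecutive failed dispatches (hypothetical field)
}

const BASE_COOLDOWN_MS = 5 * 60 * 1000;   // 5-minute cooldown after the first failure
const MAX_COOLDOWN_MS = 60 * 60 * 1000;   // cap the backoff at 1 hour

function triageDispatchAllowed(state: TriageDispatchState, now: number): boolean {
  // Double the cooldown for each consecutive failure, up to the cap.
  const cooldown = Math.min(
    BASE_COOLDOWN_MS * 2 ** Math.max(0, state.triageFailureStreak - 1),
    MAX_COOLDOWN_MS,
  );
  return state.triageFailureStreak === 0 || now - state.lastTriageAttemptAt >= cooldown;
}

// Example: after 3 consecutive failures the next attempt is gated for 20 minutes.
const state: TriageDispatchState = { lastTriageAttemptAt: Date.now(), triageFailureStreak: 3 };
console.log(triageDispatchAllowed(state, Date.now() + 10 * 60 * 1000)); // false
console.log(triageDispatchAllowed(state, Date.now() + 25 * 60 * 1000)); // true
```

Either way, a failed dispatch should never be retried on the very next 5-second alarm tick.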

Fix 3: Skip git clone for triage agents

Either dispatch with role: 'triage' and add a skip in agent-runner.ts (similar to the mayor's createMayorWorkspace), or add the triage role to the mayor-like branch that creates a minimal workspace. The triage agent only needs a working directory for the Kilo SDK process, not a git repo.
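
A sketch of the role-aware workspace step, assuming a dedicated role: 'triage'. The helper names are invented for illustration and are not the actual agent-runner.ts API:

```ts
// Sketch: skip the clone + worktree flow for mayor and triage agents.
type AgentRole = 'mayor' | 'polecat' | 'triage';

async function prepareWorkspace(role: AgentRole, workDir: string): Promise<void> {
  if (role === 'mayor' || role === 'triage') {
    // Mayor and triage agents only need an empty working directory for the
    // Kilo SDK process -- no repository checkout.
    await createMinimalWorkspace(workDir);
    return;
  }
  // Polecats still get the full clone + worktree flow.
  await cloneRepoAndCreateWorktree(workDir);
}

// Stubs so the sketch type-checks on its own; the real implementations live
// in agent-runner.ts.
async function createMinimalWorkspace(_dir: string): Promise<void> {}
async function cloneRepoAndCreateWorktree(_dir: string): Promise<void> {}
```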

Fix 4: Cap triage request creation rate

Add a maximum number of open triage requests (e.g., 10). If the cap is reached, stop creating new ones until existing ones are resolved. This prevents unbounded growth during extended outages.
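
A minimal sketch of the cap check; countOpenTriageRequests is a hypothetical stand-in for whatever query the TownDO already uses to look up open triage_request beads:

```ts
// Sketch of a cap on open triage requests before creating another one.
const MAX_OPEN_TRIAGE_REQUESTS = 10;

async function shouldCreateTriageRequest(
  countOpenTriageRequests: () => Promise<number>,
): Promise<boolean> {
  const open = await countOpenTriageRequests();
  if (open >= MAX_OPEN_TRIAGE_REQUESTS) {
    // During an extended outage, stop minting new requests; the existing
    // backlog already captures the problem and can be resolved once healthy.
    return false;
  }
  return true;
}
```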

Acceptance Criteria

  • detectCrashLoops() excludes triage batch bead failures from crash loop detection
  • Triage dispatch has a cooldown/backoff after failures (no retry every 5 seconds)
  • Triage agent skips git clone (minimal workspace like mayor, or dedicated role: 'triage' handling)
  • Triage request creation is capped to prevent unbounded growth
  • The feedback loop cannot occur: triage system failures must not create more triage work

Notes

  • No data migration needed — cloud Gastown hasn't deployed to production (but this is actively happening on a production town right now)
  • The original polecat crash loop detection is correct and valuable — the fix should preserve that while preventing the triage system from triaging itself
  • An initial container outage is the root trigger, but even after the container recovers, the backlog of triage requests and the accumulated failure count in bead_events would keep the loop running for the full 30-minute crash loop detection window

    Labels

    bug (Something isn't working), kilo-auto-fix (Auto-generated label by Kilo), kilo-triaged (Auto-generated label by Kilo)
