
Bug: Triage agent feedback loop — crash loop detection triggers on its own failures #965

@jrf0110

Description


Parent

Part of #204 (Phase 3: Multi-Rig + Scaling)

Problem

Triage agents enter a feedback loop in production. The cycle repeats every ~5 seconds with the failure count climbing (59 → 61 → 62 → ...).

Important context: The container is NOT dead. The Mayor works fine, polecats can be dispatched and complete work. The triage agent specifically is failing.

Production Evidence

Worker logs show four failure modes, all occurring in the same loop:

"The container is not running, consider calling start()"
"The container is not listening in the TCP address 10.0.0.1:8080"
"Timeout waiting for server to start after 30000ms"
"Agent ab8032fb... is already running"

Activity logs show a ~15-second cycle of: triage request created → batch bead created → agent hooked → both beads closed → agent unhooked → repeat.

Root Causes

1. Triage agent dispatched as role: 'polecat' with unnecessary git clone — THE PRIMARY CAUSE

File: src/dos/Town.do.ts:2576 and container/src/agent-runner.ts:467-498

maybeDispatchTriageAgent() calls getOrCreateAgent(sql, 'polecat', ...) and dispatches with role: 'polecat'. This forces the full git clone + worktree flow in agent-runner.ts (only role === 'mayor' has a skip). The triage agent does no code work — it just resolves triage request beads via gt_triage_resolve. The git clone is unnecessary and is likely what's causing the startup failure or timeout.

The worker logs show "Timeout waiting for server to start after 30000ms" and "Agent ab8032fb... is already running". The 30-second timeout suggests the git clone is hanging or slow. The "already running" error occurs when a previous triage agent is still mid-startup (in starting or running status in the container's agents Map at process-manager.ts:296-298) when the next 5-second alarm tick tries to dispatch the same agent ID again.

2. detectCrashLoops() doesn't exclude triage agent failures

File: src/dos/town/patrol.ts, detectCrashLoops() function

The query counts ALL status_changed → failed events per agent_id in bead_events, with no filter on bead type or label. When a triage batch bead is marked failed (because dispatch didn't work), the event is counted alongside regular polecat failures. After 3 triage dispatch failures within 30 minutes, detectCrashLoops() creates a triage request about the triage agent's own crash loop — feeding the loop.
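
A toy illustration of the false positive (the BeadEvent shape and field names are invented for this sketch, not the real bead_events schema):

```ts
// Illustrative only: why unfiltered counting flags the triage agent itself.
type BeadEvent = { agentId: string; eventType: string; newStatus: string; labels: string[] };

const events: BeadEvent[] = [
  { agentId: 'triage-agent', eventType: 'status_changed', newStatus: 'failed', labels: ['gt:triage'] },
  { agentId: 'triage-agent', eventType: 'status_changed', newStatus: 'failed', labels: ['gt:triage'] },
  { agentId: 'triage-agent', eventType: 'status_changed', newStatus: 'failed', labels: ['gt:triage'] },
];

// Counting every status_changed -> failed event with no label filter reaches
// the crash-loop threshold (3 in 30 minutes) purely from failed triage batch
// beads, so detectCrashLoops() files a triage request about the triage agent,
// which creates the next batch bead destined to fail.
const failures = events.filter(e => e.eventType === 'status_changed' && e.newStatus === 'failed').length;
console.log(failures >= 3); // true
```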

3. No dispatch cooldown on triage batch beads

File: src/dos/Town.do.ts, maybeDispatchTriageAgent()

The batch bead dedup guard (checking for existing open/in_progress batch beads with gt:triage label) doesn't catch terminal-state beads. Once a batch bead is failed or closed, the guard passes on the next alarm tick (5 seconds later), allowing immediate re-dispatch. The 5-second alarm interval means a failed dispatch is retried almost immediately.

4. Agent ID collision in container process manager

File: container/src/process-manager.ts:296-298

When startAgent is called for an agent ID that's still in the agents Map with running or starting status, it throws "Agent X is already running". The previous triage agent may still be mid-git-clone (which takes >5 seconds) when the next alarm tick dispatches the same polecat agent ID (getOrCreateAgent reuses idle polecats, and the TownDO has already unhooked the agent and reset it to idle after the previous failure at Town.do.ts:2604-2606). The TownDO thinks the agent is idle; the container thinks it's still starting.
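
A minimal, self-contained reproduction of the race (class and method names are invented for the sketch, not taken from process-manager.ts):

```ts
// Sketch of the collision: the container-side guard still sees the agent as
// 'starting' while the TownDO has already reset it to idle and re-dispatches.
type AgentStatus = 'starting' | 'running' | 'idle';

class FakeProcessManager {
  private agents = new Map<string, { status: AgentStatus }>();

  startAgent(agentId: string): void {
    const existing = this.agents.get(agentId);
    if (existing && (existing.status === 'starting' || existing.status === 'running')) {
      throw new Error(`Agent ${agentId} is already running`);
    }
    this.agents.set(agentId, { status: 'starting' });
    // ...git clone takes >5 seconds, so status stays 'starting' across alarm ticks...
  }
}

const pm = new FakeProcessManager();
pm.startAgent('ab8032fb');     // first alarm tick: starts, clone is still running
try {
  pm.startAgent('ab8032fb');   // next 5-second tick: TownDO believes the agent is idle
} catch (e) {
  console.log((e as Error).message); // "Agent ab8032fb is already running"
}
```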

Fixes Needed

Fix 1: Exclude triage-related beads from crash loop detection

detectCrashLoops() should filter out bead_events where the bead has label gt:triage or type triage_request. Triage agent failures are an operational concern, not a crash loop to be triaged by another triage agent.
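
A sketch of the kind of exclusion that could be added to the detectCrashLoops() query. The table join and column names (event_type, new_status, labels) are assumptions based on the description above, not verified against the actual schema:

```ts
// Hypothetical shape of the crash-loop query with triage beads excluded.
const CRASH_LOOP_QUERY = /* sql */ `
  SELECT be.agent_id, COUNT(*) AS failures
  FROM bead_events be
  JOIN beads b ON b.id = be.bead_id
  WHERE be.event_type = 'status_changed'
    AND be.new_status = 'failed'
    AND be.created_at > datetime('now', '-30 minutes')
    AND b.type != 'triage_request'
    AND b.labels NOT LIKE '%gt:triage%'   -- exclude triage batch beads
  GROUP BY be.agent_id
  HAVING COUNT(*) >= 3
`;
```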

Fix 2: Add a cooldown/backoff on triage dispatch

After a triage dispatch failure, maybeDispatchTriageAgent() should back off. Options (a combined sketch follows this list):

  • Track a last_triage_attempt_at timestamp and skip dispatch if within a cooldown (e.g., 5 minutes)
  • Use exponential backoff based on consecutive failures
  • Check container health before attempting triage dispatch — if ensureContainerReady() is failing, skip triage dispatch entirely
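
A combined cooldown-plus-backoff gate might look like the following. The field names (lastTriageAttemptAt, triageFailureStreak) and thresholds are hypothetical; the real state would live in the TownDO's storage:

```ts
// Sketch of a cooldown + exponential backoff gate for triage dispatch.
interface TriageDispatchState {
  lastTriageAttemptAt: number;   // epoch ms of the last dispatch attempt (hypothetical field)
  triageFailureStreak: number;   // consecutive failed dispatches (hypothetical field)
}

const BASE_COOLDOWN_MS = 5 * 60 * 1000;   // 5-minute cooldown after the first failure
const MAX_COOLDOWN_MS = 60 * 60 * 1000;   // cap the backoff at 1 hour

function triageDispatchAllowed(state: TriageDispatchState, now: number): boolean {
  // Double the cooldown for each consecutive failure, up to the cap.
  const cooldown = Math.min(
    BASE_COOLDOWN_MS * 2 ** Math.max(0, state.triageFailureStreak - 1),
    MAX_COOLDOWN_MS,
  );
  return state.triageFailureStreak === 0 || now - state.lastTriageAttemptAt >= cooldown;
}

// Example: after 3 consecutive failures the next attempt is gated for 20 minutes.
const state: TriageDispatchState = { lastTriageAttemptAt: Date.now(), triageFailureStreak: 3 };
console.log(triageDispatchAllowed(state, Date.now() + 10 * 60 * 1000)); // false
console.log(triageDispatchAllowed(state, Date.now() + 25 * 60 * 1000)); // true
```

Either way, a failed dispatch should never be retried on the very next 5-second alarm tick.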

Fix 3: Skip git clone for triage agents

Either dispatch with role: 'triage' and add a skip in agent-runner.ts (similar to the mayor's createMayorWorkspace), or add the triage role to the mayor-like branch that creates a minimal workspace. The triage agent only needs a working directory for the Kilo SDK process, not a git repo.
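
A sketch of the role-aware workspace step, assuming a dedicated role: 'triage'. The helper names are invented for illustration and are not the actual agent-runner.ts API:

```ts
// Sketch: skip the clone + worktree flow for mayor and triage agents.
type AgentRole = 'mayor' | 'polecat' | 'triage';

async function prepareWorkspace(role: AgentRole, workDir: string): Promise<void> {
  if (role === 'mayor' || role === 'triage') {
    // Mayor and triage agents only need an empty working directory for the
    // Kilo SDK process -- no repository checkout.
    await createMinimalWorkspace(workDir);
    return;
  }
  // Polecats still get the full clone + worktree flow.
  await cloneRepoAndCreateWorktree(workDir);
}

// Stubs so the sketch type-checks on its own; the real implementations live
// in agent-runner.ts.
async function createMinimalWorkspace(_dir: string): Promise<void> {}
async function cloneRepoAndCreateWorktree(_dir: string): Promise<void> {}
```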

Fix 4: Cap triage request creation rate

Add a maximum number of open triage requests (e.g., 10). If the cap is reached, stop creating new ones until existing ones are resolved. This prevents unbounded growth during extended outages.
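
A minimal sketch of the cap check; countOpenTriageRequests is a hypothetical stand-in for whatever query the TownDO already uses to look up open triage_request beads:

```ts
// Sketch of a cap on open triage requests before creating another one.
const MAX_OPEN_TRIAGE_REQUESTS = 10;

async function shouldCreateTriageRequest(
  countOpenTriageRequests: () => Promise<number>,
): Promise<boolean> {
  const open = await countOpenTriageRequests();
  if (open >= MAX_OPEN_TRIAGE_REQUESTS) {
    // During an extended outage, stop minting new requests; the existing
    // backlog already captures the problem and can be resolved once healthy.
    return false;
  }
  return true;
}
```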

Acceptance Criteria

  • detectCrashLoops() excludes triage batch bead failures from crash loop detection
  • Triage dispatch has a cooldown/backoff after failures (no retry every 5 seconds)
  • Triage agent skips git clone (minimal workspace like mayor, or dedicated role: 'triage' handling)
  • Triage request creation is capped to prevent unbounded growth
  • The feedback loop cannot occur: triage system failures must not create more triage work

Notes

  • No data migration needed — cloud Gastown hasn't deployed to production (but this is actively happening on a production town right now)
  • The original polecat crash loop detection is correct and valuable — the fix should preserve that while preventing the triage system from triaging itself
  • An initial container outage is the root trigger, but even after the container recovers, the backlog of triage requests and the accumulated failure count in bead_events would keep the loop running for the full 30-minute crash loop detection window

    Labels

    bug (Something isn't working), kilo-auto-fix (Auto-generated label by Kilo), kilo-triaged (Auto-generated label by Kilo)
