Parent: #204 | Phase 1: Single Rig, Single Polecat
Goal
Fix three reliability issues discovered during local dev testing that cause the agent dispatch loop to retry indefinitely, waste resources, and prevent agents from completing real work.
Context
During end-to-end testing of the Mayor chat flow (PR 7 → PR 6 → PR 5 → PR 4), we found that while the full pipeline works (alarm fires → container dispatch → kilo serve → session creation → prompt delivery), agents exit immediately and the system enters an infinite retry loop. Three issues need to be addressed before PR 8 (Manual Merge Flow) can work correctly.
Issues
1. No API credentials passed to kilo serve
The container starts kilo serve without KILO_API_URL or any API key, so the kilo session has no way to call an LLM. The agent starts a session and sends the prompt, but the session completes instantly with no useful work because there are no model credentials.
Fix: The startAgentInContainer flow needs to pass KILO_API_URL (and any required auth) through to the container's buildAgentEnv(). The Rig DO config or worker environment should supply these values so kilo serve can route LLM calls through the Kilo gateway.
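A minimal sketch of the fix, assuming buildAgentEnv takes a base environment plus some Rig config object; KILO_API_KEY and the RigConfig field names are illustrative assumptions, not the project's real schema:

```typescript
// Sketch: merge Kilo gateway credentials into the agent's environment.
// KILO_API_KEY and the RigConfig fields are assumed names; the real
// buildAgentEnv signature may differ.
interface RigConfig {
  kiloApiUrl: string;
  kiloApiKey: string; // assumed field name
}

function buildAgentEnv(
  base: Record<string, string>,
  config: RigConfig,
): Record<string, string> {
  if (!config.kiloApiUrl || !config.kiloApiKey) {
    // Fail fast rather than dispatching an agent that will exit instantly.
    throw new Error('Missing Kilo gateway credentials in Rig config');
  }
  return {
    ...base,
    KILO_API_URL: config.kiloApiUrl,
    KILO_API_KEY: config.kiloApiKey,
  };
}
```

Failing fast here also gives issue 2's circuit breaker a clear error to escalate instead of a silent instant exit.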
2. No retry limit / circuit breaker for agent dispatch
When an agent exits immediately (e.g., due to missing credentials or a crash), the alarm loop runs indefinitely:
- witnessPatrol sees agent container status = exited → resets agent to idle
- schedulePendingWork finds idle agent with hooked bead → re-dispatches → 201 success
- 30s later, agent has exited again → repeat forever
There is no max retry count, backoff, or circuit breaker. The system creates dozens of kilo sessions per minute in the container, all of which exit immediately.
Fix: Track dispatch attempts per agent (or per bead). After N consecutive failed dispatches (e.g., agent exits within a short window), mark the bead as failed and stop retrying. Optionally create an escalation. Consider exponential backoff before the hard limit.
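The retry limit and backoff can be sketched as a small pure decision the alarm loop consults before re-dispatching. MAX_ATTEMPTS, the failure window, the backoff base, and the record shape are illustrative assumptions, not the project's real schema:

```typescript
// Sketch of a per-bead circuit breaker. An exit shortly after dispatch
// counts as a failed attempt; a healthy long-running agent resets the
// counter. All constants and field names here are assumptions.
const MAX_ATTEMPTS = 3;
const FAILURE_WINDOW_MS = 60_000; // exits within 60s of dispatch are failures
const BASE_BACKOFF_MS = 30_000;

interface DispatchRecord {
  attempts: number;      // consecutive fast-exit dispatches so far
  lastDispatchAt: number; // epoch ms of the last dispatch
}

type Verdict =
  | { action: 'dispatch'; delayMs: number }
  | { action: 'fail-bead' };

function nextDispatch(rec: DispatchRecord, now: number, agentExited: boolean): Verdict {
  const failedFast = agentExited && now - rec.lastDispatchAt < FAILURE_WINDOW_MS;
  const attempts = failedFast ? rec.attempts + 1 : 1; // healthy run resets the counter
  if (attempts >= MAX_ATTEMPTS) {
    return { action: 'fail-bead' }; // stop retrying; mark bead failed / escalate
  }
  // Exponential backoff before the hard limit: 30s, 60s, ...
  return { action: 'dispatch', delayMs: BASE_BACKOFF_MS * 2 ** (attempts - 1) };
}
```

In the DO, the record would live alongside the bead and the delayMs would feed the next setAlarm, so retries slow down instead of firing every patrol cycle.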
3. Agent completion does not close the bead
When an agent's session completes (detected via SSE isCompletionEvent), the process-manager.ts sets agent.status = 'exited' and agent.exitReason = 'completed', but nothing calls back to the Rig DO to transition the bead from in_progress to closed. The agent exits, witnessPatrol finds it, resets the agent to idle — but the bead stays in_progress with the agent still hooked, so schedulePendingWork re-dispatches.
In the normal flow (PR 8), gt_done handles this via agentDone(). But for the Mayor agent (which may complete without calling gt_done), there needs to be a mechanism to detect completion and close the bead. Options:
- The container's process manager could call a Rig DO endpoint on agent completion (e.g., POST /api/rigs/:rigId/agents/:agentId/done)
- witnessPatrol could detect exitReason = 'completed' and auto-close the bead
- The heartbeat mechanism could report completion state back to the DO
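The second option above can be sketched as a pure decision inside witnessPatrol. The status strings come from this issue; the function name and the 'error' exit reason are assumptions:

```typescript
// Sketch of option 2: witnessPatrol maps an exited agent's exitReason to
// a bead transition. 'completed', 'closed', 'failed' and 'in_progress'
// come from the issue text; the 'error' reason is an assumed value.
type BeadStatus = 'in_progress' | 'closed' | 'failed';

function beadStatusOnExit(exitReason: string | undefined): BeadStatus {
  if (exitReason === 'completed') return 'closed'; // session finished cleanly
  if (exitReason === 'error') return 'failed';     // session crashed
  return 'in_progress'; // unknown exit: leave hooked, let retry logic decide
}
```

Keeping the mapping pure makes it easy to unit-test, and leaves unknown exits to issue 2's retry/circuit-breaker path rather than silently closing a bead.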
Dependencies
- [Gastown] PR 5: Rig DO Alarm — Work Scheduler #212 (schedulePendingWork, witnessPatrol)
- kilo serve for Agent Management #305
- [Gastown] PR 6: tRPC Routes — Town & Rig Management #268 (sendMessage)
Acceptance Criteria
- KILO_API_URL and auth credentials are passed through to kilo serve processes in the container
- After N consecutive failed dispatches, retries stop and the bead is marked failed
- When an agent's session completes, the bead transitions to closed (or failed if the session errored)
- witnessPatrol or the container reports agent completion back to the Rig DO