fix(gastown): Platform-wide polecat dispatch failures — 4K failures/hour across 151+ towns #1850

@jrf0110

Description

Bug

As of 2026-04-01 15:00 UTC, polecat dispatch is failing across 151+ towns at a rate of ~4,075 failures per hour (~300 per 5-minute bucket). The failure rate has been steady for at least 3 hours (since ~12:05 UTC). All dispatch failures have empty error strings — the actual failure reason is not being logged.

Evidence

Analytics Engine data for the last hour:

  • 151+ unique town IDs with agent.dispatch_failed events
  • All failures are for polecat role (refinery and mayor not affected)
  • Error field (blob5) is empty on all events
  • Failure rate is steady (not a spike — chronic issue)
  • Top offender: town d498f44e-... with 595 failures in 1 hour

Additionally:

  • One DO (7c017069-...) is in overload state: 477 "Durable Object is overloaded" errors in 1 hour
  • One DO (505b54c4-...) hitting SQLITE_TOOBIG errors on agent-events.create
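
The per-town numbers above can be reproduced with an Analytics Engine query along these lines. The dataset name (GASTOWN_AGENT_EVENTS) and the column mapping for event name (index1), town ID (blob1), and role (blob2) are assumptions; only blob5-as-error-field and the agent.dispatch_failed event name come from this issue. Note that Analytics Engine samples data, so counts should be computed as SUM(_sample_interval) rather than count().

```sql
-- Hypothetical dataset/column names; adjust to the real schema.
SELECT
  blob1 AS town_id,
  SUM(_sample_interval) AS failures   -- sampling-corrected event count
FROM GASTOWN_AGENT_EVENTS
WHERE timestamp > NOW() - INTERVAL '1' HOUR
  AND index1 = 'agent.dispatch_failed'
  AND blob2 = 'polecat'               -- role; refinery/mayor unaffected
GROUP BY town_id
ORDER BY failures DESC
LIMIT 10
```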

Impact

High — polecats are the primary work-doing agents. If dispatch is failing, no beads are being worked on across the platform. Towns appear to be "doing nothing" even when there is open work.

Likely Causes

  1. Container infrastructure issue — polecat containers may be failing to start, hitting resource limits, or encountering image pull failures. The empty error string suggests the failure happens before the error can be captured (e.g., the startAgentInContainer HTTP call times out or gets a non-JSON error response).

  2. The #1653 pattern at scale (fix(gastown): No circuit breaker on dispatch failures — dead container causes 70h runaway loop (+ spend) #1653). With no circuit breaker on dispatch failures, every failed polecat is retried on every tick in every affected town: 151 towns × ~27 retries/hour/town ≈ 4,075 failures/hour. The actual number of "stuck" polecats could be much smaller; most of the failures are retries.

  3. Polecat-specific configuration issue — since refinery and mayor are NOT affected, the issue may be specific to how polecats are dispatched (different Container image, different startup sequence, different env vars).
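If cause 2 is right, most of the 4K/hour is retry amplification, and a per-agent circuit breaker in the reconciler would collapse it. A minimal sketch, assuming a tick-based reconciler; all names here (DispatchBreaker, etc.) are hypothetical, not the actual gastown code:

```typescript
// Per-agent circuit breaker for dispatch retries (hypothetical sketch).
// After `threshold` consecutive failures, dispatch is skipped until a
// backoff window elapses, so a dead container stops emitting a failure
// event on every single tick.

interface BreakerState {
  consecutiveFailures: number;
  openUntil: number; // epoch ms; 0 = breaker closed
}

class DispatchBreaker {
  private states = new Map<string, BreakerState>();

  constructor(
    private threshold = 3,
    private baseBackoffMs = 60_000,          // 1 minute
    private maxBackoffMs = 60 * 60_000,      // cap at 1 hour
  ) {}

  /** Should the reconciler attempt dispatch for this agent on this tick? */
  shouldDispatch(agentId: string, now: number): boolean {
    const s = this.states.get(agentId);
    return !s || now >= s.openUntil;
  }

  recordFailure(agentId: string, now: number): void {
    const s = this.states.get(agentId) ?? { consecutiveFailures: 0, openUntil: 0 };
    s.consecutiveFailures++;
    if (s.consecutiveFailures >= this.threshold) {
      // Exponential backoff: 1 min, 2 min, 4 min, ... capped at maxBackoffMs.
      const exp = s.consecutiveFailures - this.threshold;
      s.openUntil = now + Math.min(this.baseBackoffMs * 2 ** exp, this.maxBackoffMs);
    }
    this.states.set(agentId, s);
  }

  recordSuccess(agentId: string): void {
    this.states.delete(agentId); // any success fully closes the breaker
  }
}
```

Once the threshold trips, a dead container produces one failure per backoff window instead of one per tick, so platform-wide stuck polecats would show up as dozens of failures per hour rather than thousands.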

Investigation Needed

  1. Check the polecat container startup path: what is the ACTUAL error when startAgentInContainer fails? The empty error string itself needs to be fixed (Fix 3 in #1653).
  2. Check Cloudflare Container dashboard for failed instances or resource limits.
  3. Check if a recent deployment changed polecat dispatch behavior.
  4. Sample a few affected towns and check their debug endpoint for polecat agent status.
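On investigation item 1: the empty error string is consistent with the dispatch path swallowing timeouts and non-JSON response bodies. A sketch of what the container start call could do instead; the function signature and types here are assumptions, not the actual src/dos/town/container-dispatch.ts code:

```typescript
// Classify a failed container-start attempt into a non-empty,
// human-readable error string (pure helper, easy to unit test).
function describeDispatchError(e: unknown, aborted: boolean): string {
  if (aborted) return "timeout waiting for container start";
  if (e instanceof SyntaxError) return `non-JSON response body: ${String(e)}`;
  return `network error: ${String(e)}`;
}

type DispatchResult = { ok: true } | { ok: false; error: string };

// Wrap the container-start HTTP call so every failure mode yields a
// concrete error string to log (e.g. into the blob5 field).
async function startAgentInContainer(
  url: string,
  body: unknown,
  timeoutMs = 10_000,
): Promise<DispatchResult> {
  const ctrl = new AbortController();
  const timer = setTimeout(() => ctrl.abort(), timeoutMs);
  try {
    const res = await fetch(url, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(body),
      signal: ctrl.signal,
    });
    if (!res.ok) {
      // A failing container runtime often returns an HTML/plain-text
      // error page; capture a bounded slice of it instead of "".
      const text = (await res.text()).slice(0, 256);
      return { ok: false, error: `HTTP ${res.status}: ${text}` };
    }
    await res.json(); // throws SyntaxError on a 200 with a non-JSON body
    return { ok: true };
  } catch (e) {
    return { ok: false, error: describeDispatchError(e, ctrl.signal.aborted) };
  } finally {
    clearTimeout(timer);
  }
}
```

With something like this in place, the blob5 field would distinguish timeouts, bad gateways, and malformed responses, which directly separates likely causes 1 and 3.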

Related

  • fix(gastown): No circuit breaker on dispatch failures — dead container causes 70h runaway loop (+ spend) #1653

Files

  • src/dos/town/actions.ts: dispatch_agent action handler (where the error should be logged)
  • src/dos/town/container-dispatch.ts: startAgentInContainer (where the actual failure occurs)

Metadata

Assignees

No one assigned

Labels

  • P0 (Blocks soft launch)
  • bug (Something isn't working)
  • gt:container (Container management, agent processes, SDK, heartbeat)
  • gt:core (Reconciler, state machine, bead lifecycle, convoy flow)
