fix(gastown): Platform-wide polecat dispatch failures — 4K failures/hour across 151+ towns #1850

@jrf0110

Description

Bug

As of 2026-04-01 15:00 UTC, polecat dispatch is failing across 151+ towns at a rate of ~4,075 failures per hour (~300 per 5-minute bucket). The failure rate has been steady for at least 3 hours (since ~12:05 UTC). All dispatch failures have empty error strings — the actual failure reason is not being logged.

Evidence

Analytics Engine data for the last hour:

  • 151+ unique town IDs with agent.dispatch_failed events
  • All failures are for polecat role (refinery and mayor not affected)
  • Error field (blob5) is empty on all events
  • Failure rate is steady (not a spike — chronic issue)
  • Top offender: town d498f44e-... with 595 failures in 1 hour

Additionally:

  • One DO (7c017069-...) is in overload state: 477 "Durable Object is overloaded" errors in 1 hour
  • One DO (505b54c4-...) hitting SQLITE_TOOBIG errors on agent-events.create
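
The per-town numbers above can be reproduced with an Analytics Engine query along these lines. The dataset name (GASTOWN_AGENT_EVENTS) and the column mapping for event name (index1), town ID (blob1), and role (blob2) are assumptions; only blob5-as-error-field and the agent.dispatch_failed event name come from this issue. Note that Analytics Engine samples data, so counts should be computed as SUM(_sample_interval) rather than count().

```sql
-- Hypothetical dataset/column names; adjust to the real schema.
SELECT
  blob1 AS town_id,
  SUM(_sample_interval) AS failures   -- sampling-corrected event count
FROM GASTOWN_AGENT_EVENTS
WHERE timestamp > NOW() - INTERVAL '1' HOUR
  AND index1 = 'agent.dispatch_failed'
  AND blob2 = 'polecat'               -- role; refinery/mayor unaffected
GROUP BY town_id
ORDER BY failures DESC
LIMIT 10
```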

Impact

High — polecats are the primary work-doing agents. If dispatch is failing, no beads are being worked on across the platform. Towns appear to be "doing nothing" even when there is open work.

Likely Causes

  1. Container infrastructure issue — polecat containers may be failing to start, hitting resource limits, or encountering image pull failures. The empty error string suggests the failure happens before the error can be captured (e.g., the startAgentInContainer HTTP call times out or gets a non-JSON error response).

  2. The #1653 pattern at scale (fix(gastown): No circuit breaker on dispatch failures — dead container causes 70h runaway loop (+ spend) #1653). With no circuit breaker on dispatch failures, every failed polecat is retried on every tick in every affected town: 151 towns × ~27 retries/hour/town ≈ 4,075 failures/hour. The actual number of "stuck" polecats could be much smaller; most of the failures are retries.

  3. Polecat-specific configuration issue — since refinery and mayor are NOT affected, the issue may be specific to how polecats are dispatched (different Container image, different startup sequence, different env vars).
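If cause 2 is right, most of the 4K/hour is retry amplification, and a per-agent circuit breaker in the reconciler would collapse it. A minimal sketch, assuming a tick-based reconciler; all names here (DispatchBreaker, etc.) are hypothetical, not the actual gastown code:

```typescript
// Per-agent circuit breaker for dispatch retries (hypothetical sketch).
// After `threshold` consecutive failures, dispatch is skipped until a
// backoff window elapses, so a dead container stops emitting a failure
// event on every single tick.

interface BreakerState {
  consecutiveFailures: number;
  openUntil: number; // epoch ms; 0 = breaker closed
}

class DispatchBreaker {
  private states = new Map<string, BreakerState>();

  constructor(
    private threshold = 3,
    private baseBackoffMs = 60_000,          // 1 minute
    private maxBackoffMs = 60 * 60_000,      // cap at 1 hour
  ) {}

  /** Should the reconciler attempt dispatch for this agent on this tick? */
  shouldDispatch(agentId: string, now: number): boolean {
    const s = this.states.get(agentId);
    return !s || now >= s.openUntil;
  }

  recordFailure(agentId: string, now: number): void {
    const s = this.states.get(agentId) ?? { consecutiveFailures: 0, openUntil: 0 };
    s.consecutiveFailures++;
    if (s.consecutiveFailures >= this.threshold) {
      // Exponential backoff: 1 min, 2 min, 4 min, ... capped at maxBackoffMs.
      const exp = s.consecutiveFailures - this.threshold;
      s.openUntil = now + Math.min(this.baseBackoffMs * 2 ** exp, this.maxBackoffMs);
    }
    this.states.set(agentId, s);
  }

  recordSuccess(agentId: string): void {
    this.states.delete(agentId); // any success fully closes the breaker
  }
}
```

Once the threshold trips, a dead container produces one failure per backoff window instead of one per tick, so platform-wide stuck polecats would show up as dozens of failures per hour rather than thousands.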

Investigation Needed

  1. Check the polecat container startup path: what is the ACTUAL error when startAgentInContainer fails? The empty error string itself needs to be fixed (Fix 3 in #1653).
  2. Check Cloudflare Container dashboard for failed instances or resource limits.
  3. Check if a recent deployment changed polecat dispatch behavior.
  4. Sample a few affected towns and check their debug endpoint for polecat agent status.
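On investigation item 1: the empty error string is consistent with the dispatch path swallowing timeouts and non-JSON response bodies. A sketch of what the container start call could do instead; the function signature and types here are assumptions, not the actual src/dos/town/container-dispatch.ts code:

```typescript
// Classify a failed container-start attempt into a non-empty,
// human-readable error string (pure helper, easy to unit test).
function describeDispatchError(e: unknown, aborted: boolean): string {
  if (aborted) return "timeout waiting for container start";
  if (e instanceof SyntaxError) return `non-JSON response body: ${String(e)}`;
  return `network error: ${String(e)}`;
}

type DispatchResult = { ok: true } | { ok: false; error: string };

// Wrap the container-start HTTP call so every failure mode yields a
// concrete error string to log (e.g. into the blob5 field).
async function startAgentInContainer(
  url: string,
  body: unknown,
  timeoutMs = 10_000,
): Promise<DispatchResult> {
  const ctrl = new AbortController();
  const timer = setTimeout(() => ctrl.abort(), timeoutMs);
  try {
    const res = await fetch(url, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(body),
      signal: ctrl.signal,
    });
    if (!res.ok) {
      // A failing container runtime often returns an HTML/plain-text
      // error page; capture a bounded slice of it instead of "".
      const text = (await res.text()).slice(0, 256);
      return { ok: false, error: `HTTP ${res.status}: ${text}` };
    }
    await res.json(); // throws SyntaxError on a 200 with a non-JSON body
    return { ok: true };
  } catch (e) {
    return { ok: false, error: describeDispatchError(e, ctrl.signal.aborted) };
  } finally {
    clearTimeout(timer);
  }
}
```

With something like this in place, the blob5 field would distinguish timeouts, bad gateways, and malformed responses, which directly separates likely causes 1 and 3.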

Related

  • fix(gastown): No circuit breaker on dispatch failures — dead container causes 70h runaway loop (+ spend) #1653

Files

  • src/dos/town/actions.ts: dispatch_agent action handler (where the error should be logged)
  • src/dos/town/container-dispatch.ts: startAgentInContainer (where the actual failure occurs)

Metadata

Assignees

No one assigned

Labels

  • P0 (Blocks soft launch)
  • bug (Something isn't working)
  • gt:container (Container management, agent processes, SDK, heartbeat)
  • gt:core (Reconciler, state machine, bead lifecycle, convoy flow)
