
fix(gastown): Town containers never go idle — mayor holds alarm at 5s, constant health checks reset sleep timer #1450

@jrf0110

Description


Bug

Town containers never reach their sleepAfter = 30m idle threshold, even when no work is active and no user is chatting. Containers stay alive indefinitely, consuming resources and incurring charges.

Root Cause

Three factors compound to keep the container permanently awake:

1. Mayor stays working forever (primary cause)

When the mayor is started (ensureMayor or sendMayorMessage), its agent status is set to working. It only transitions back if the container process calls agentCompleted. However, the mayor is designed to stay alive waiting for user input and does not exit when idle the way polecats do: per handleIdleEvent in process-manager.ts:429, the 2-minute idle timer applies to polecats and refineries, not to the mayor.

Because the mayor is working:

  • hasActiveWork() in scheduling.ts returns true
  • The alarm stays on the 5-second interval (instead of 60s for idle towns)
  • ensureContainerReady() fires every 5s → sends GET /health to the container
  • Container status observation fires every 5s → sends one status check per working agent
  • The container receives HTTP requests every 5 seconds, resetting its sleepAfter idle timer every time
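To make the 5-second cadence concrete, here is a minimal sketch of the interval selection and the resulting request count. `ACTIVE_ALARM_INTERVAL_MS`, `nextAlarmDelayMs` and `containerRequestsInWindow` are hypothetical names; the sketch assumes one `GET /health` plus one status request per working agent on each tick, as listed above, and ignores token refreshes.

```typescript
// Hypothetical constant name for the 5 s active interval;
// IDLE_ALARM_INTERVAL_MS matches the name used in #1409.
const ACTIVE_ALARM_INTERVAL_MS = 5_000;
const IDLE_ALARM_INTERVAL_MS = 60_000;

// Picks the next alarm delay from the result of hasActiveWork().
function nextAlarmDelayMs(hasActiveWork: boolean): number {
  return hasActiveWork ? ACTIVE_ALARM_INTERVAL_MS : IDLE_ALARM_INTERVAL_MS;
}

// Counts container requests generated by alarm ticks within a window:
// one GET /health per tick plus one status check per working agent.
function containerRequestsInWindow(
  windowMs: number,
  hasActiveWork: boolean,
  workingAgents: number,
): number {
  if (!hasActiveWork) return 0; // no health checks, no status observations
  const ticks = Math.floor(windowMs / nextAlarmDelayMs(hasActiveWork));
  return ticks * (1 + workingAgents);
}

const THIRTY_MINUTES_MS = 30 * 60 * 1000;
// Only the mayor is "working", yet the container is contacted continuously.
console.log(containerRequestsInWindow(THIRTY_MINUTES_MS, true, 1)); // 720
```

With only the mayor marked working, that is 360 ticks and 720 container requests in every 30-minute window.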

2. ensureContainerReady() health check resets the container idle timer

Even when the mayor is the only agent, ensureContainerReady() (Town.do.ts:3452) calls container.fetch("http://container/health") every 5 seconds because hasActiveWork() is true (the mayor is working). Each fetch resets the container's sleepAfter 30-minute idle countdown.
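Why `sleepAfter` never fires can be modelled with a simple idle timer. This sketch assumes, as stated above, that any inbound request restarts the 30-minute countdown; `firstSleepTimeMs` is a hypothetical helper, not the container runtime's implementation.

```typescript
// Simplified idle-timer model: every inbound request resets the countdown.
// Returns the time at which the container would first sleep, or null if it
// stays awake for the whole horizon.
function firstSleepTimeMs(
  requestTimesMs: number[], // sorted timestamps of inbound requests
  sleepAfterMs: number,
  horizonMs: number,
): number | null {
  let lastActivity = 0; // the container is considered active at t = 0
  for (const t of requestTimesMs) {
    if (t - lastActivity >= sleepAfterMs) {
      return lastActivity + sleepAfterMs; // the idle window elapsed before this request
    }
    lastActivity = t;
  }
  const sleepAt = lastActivity + sleepAfterMs;
  return sleepAt <= horizonMs ? sleepAt : null;
}

const SLEEP_AFTER_MS = 30 * 60 * 1000;
const TWO_HOURS_MS = 2 * 60 * 60 * 1000;

// A health check every 5 s for two hours.
const everyFiveSeconds = Array.from({ length: TWO_HOURS_MS / 5000 }, (_, i) => (i + 1) * 5000);
console.log(firstSleepTimeMs(everyFiveSeconds, SLEEP_AFTER_MS, TWO_HOURS_MS)); // null (never sleeps)
console.log(firstSleepTimeMs([], SLEEP_AFTER_MS, TWO_HOURS_MS)); // 1800000
```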

3. refreshContainerToken in-memory throttle resets on DO eviction (#1409)

The hourly token refresh throttle is stored as an in-memory property (lastContainerTokenRefreshAt = 0). When the DO is evicted and re-instantiated, the throttle resets and the refresh fires on the next alarm tick. The refresh sends two requests to the container:

  • container.setEnvVar(...)
  • container.fetch("/refresh-token")

Both reset the container idle timer. Even with a 60-second idle alarm interval (that is, once the mayor issue is fixed), the refresh would fire on the first alarm after every DO eviction instead of once per hour.

The Wake-Up Chain

Mayor started → status = "working" → never transitions to idle
  → hasActiveWork() = true
  → alarm fires every 5s
  → ensureContainerReady() → GET /health → resets container idle timer
  → container status observation → GET /agents/:mayorId/status → resets container idle timer
  → refreshContainerToken() (when throttle is cold) → POST /refresh-token → resets container idle timer
  → container NEVER reaches 30 min of inactivity
  → sleepAfter = "30m" NEVER triggers

Fix

Fix 1 (Critical): Mayor lifecycle — idle the mayor when not actively processing

When the mayor finishes processing a prompt and returns to waiting-for-input (session.idle event), the container should transition the mayor to an idle-like state that the TownDO recognizes as "not actively working."
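One way to express the intended mayor lifecycle is a small transition function. The `waiting` status comes from Option B below; the event names other than `session.idle` (`prompt.received`, `agent.completed`) and the function itself are hypothetical.

```typescript
// Status set including the proposed "waiting" status (Option B).
type AgentStatus = "idle" | "working" | "waiting";

// Hypothetical lifecycle events; only "session.idle" is named in this issue.
type MayorEvent = "prompt.received" | "session.idle" | "agent.completed";

// Pure transition: the status the TownDO should record for the mayor.
function nextMayorStatus(current: AgentStatus, event: MayorEvent): AgentStatus {
  switch (event) {
    case "prompt.received":
      return "working"; // LLM work in progress
    case "session.idle":
      // Prompt finished; still alive in the container for instant replies.
      return current === "working" ? "waiting" : current;
    case "agent.completed":
      return "idle"; // process gone; the next message re-dispatches (Option A)
  }
}
```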

Options:

  • A) The container calls agentCompleted when the mayor goes idle. The reconciler transitions it to idle. When the user sends a new message, sendMayorMessage re-dispatches. This is the simplest but adds latency to mayor responses (re-dispatch takes a few seconds).
  • B) Introduce a new agent status waiting (or use idle with a flag) that means "alive in container, waiting for input, but not doing LLM work." hasActiveWork() treats waiting agents as inactive. The alarm drops to 60s. Container health checks stop. But the container stays alive for the sleepAfter window, so the mayor is still available for instant responses if the user messages within 30 minutes.
  • C) hasActiveWork() explicitly excludes mayors: WHERE status = 'working' AND role != 'mayor'. This is the simplest code change, but a bit hacky because it conflates role with lifecycle.
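The difference between Options B and C can be sketched against an in-memory list of agents standing in for the `hasActiveWork()` query. `AgentRow` and both functions are hypothetical; the real query lives in `scheduling.ts`.

```typescript
type AgentRole = "mayor" | "polecat" | "refinery";
type AgentStatus = "idle" | "working" | "waiting";

// Hypothetical row shape for the agents table.
interface AgentRow {
  id: string;
  role: AgentRole;
  status: AgentStatus;
}

// Option B: the predicate is unchanged; a mayor in "waiting" simply is not "working".
function hasActiveWorkOptionB(agents: AgentRow[]): boolean {
  return agents.some((a) => a.status === "working");
}

// Option C: statuses unchanged, mayor excluded by role
// (in-memory equivalent of WHERE status = 'working' AND role != 'mayor').
function hasActiveWorkOptionC(agents: AgentRow[]): boolean {
  return agents.some((a) => a.status === "working" && a.role !== "mayor");
}
```

Option B fixes the data (the mayor's status) and leaves the predicate alone, while Option C changes the predicate and keys off role, which is why it is only a stopgap.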

Recommendation: Option B provides the best UX (instant mayor responses within the sleep window) with correct semantics. Option C is an acceptable stopgap.

Fix 2: Gate health checks on non-mayor active work

If Fix 1 uses Option C (exclude mayors from hasActiveWork), also update ensureContainerReady() to skip when the only working agent is the mayor:

if (!hasWork) {
  // No polecat or refinery is working; the only "work" may be the mayor
  // waiting for user input, so skip the health check unless just configured
  if (!isRecentlyConfigured) return;
}

With Option B, this is automatic — hasActiveWork() returns false when the mayor is waiting.
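A sketch of the gated health check, with the Durable Object state replaced by a hypothetical `ReadinessDeps` interface so the logic can be exercised in isolation. It assumes `hasActiveWork()` already excludes a mayor that is only waiting for input (Option B or C).

```typescript
// Hypothetical dependencies; the real ensureContainerReady() reads DO state.
interface ReadinessDeps {
  hasActiveWork(): boolean; // assumed to ignore a waiting mayor
  isRecentlyConfigured(): boolean;
  fetchHealth(): Promise<void>; // stands in for container.fetch("http://container/health")
}

// Returns true if a health check was sent.
async function ensureContainerReady(deps: ReadinessDeps): Promise<boolean> {
  if (!deps.hasActiveWork() && !deps.isRecentlyConfigured()) {
    return false; // nothing to keep warm: let the sleepAfter countdown run
  }
  await deps.fetchHealth();
  return true;
}
```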

Fix 3: Persist token refresh throttle (#1409)

Already filed as #1409. Store lastContainerTokenRefreshAt in ctx.storage instead of an in-memory property. This prevents the throttle from resetting on DO eviction.
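A minimal sketch of the persisted throttle. `ThrottleStorage` mirrors only the `get`/`put` calls of Durable Object storage, `MapStorage` is an in-memory stand-in for local testing, and the storage key and `maybeRefreshContainerToken` are hypothetical names. Because the timestamp is read from storage on every call, a freshly re-instantiated DO sees the last refresh time instead of starting from 0.

```typescript
// Minimal subset of the storage API used here.
interface ThrottleStorage {
  get<T>(key: string): Promise<T | undefined>;
  put<T>(key: string, value: T): Promise<void>;
}

// In-memory stand-in for ctx.storage, for local experimentation only.
class MapStorage implements ThrottleStorage {
  private data = new Map<string, unknown>();
  async get<T>(key: string): Promise<T | undefined> {
    return this.data.get(key) as T | undefined;
  }
  async put<T>(key: string, value: T): Promise<void> {
    this.data.set(key, value);
  }
}

const TOKEN_REFRESH_INTERVAL_MS = 60 * 60 * 1000; // hourly
const LAST_REFRESH_KEY = "lastContainerTokenRefreshAt"; // hypothetical key

// `refresh` stands in for the setEnvVar + fetch("/refresh-token") pair.
// Returns true if a refresh was performed.
async function maybeRefreshContainerToken(
  storage: ThrottleStorage,
  nowMs: number,
  refresh: () => Promise<void>,
): Promise<boolean> {
  const last = (await storage.get<number>(LAST_REFRESH_KEY)) ?? 0;
  if (nowMs - last < TOKEN_REFRESH_INTERVAL_MS) return false;
  await refresh();
  await storage.put(LAST_REFRESH_KEY, nowMs); // survives DO eviction
  return true;
}
```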

Fix 4: Increase idle alarm interval

Already proposed in #1409: change IDLE_ALARM_INTERVAL_MS from 60s to 300s (5 min). With the mayor fix, the alarm would drop to the 5-minute interval whenever no polecats or refineries are active. At that interval, ensureContainerReady() would not fire (no active work), and the container would receive at most one token refresh (two requests) per tick, and only when the hourly throttle has expired. The container would therefore go idle and sleep after 30 minutes.
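A rough timeline check for an idle town after all four fixes. It assumes a 300-second alarm, no active work (so no health checks or status observations), and an hourly token refresh that sends two requests; the function names are hypothetical, and whether a refresh wakes a container that is already asleep is not modelled.

```typescript
const IDLE_ALARM_INTERVAL_MS = 300_000; // 5 min, as proposed in #1409
const TOKEN_REFRESH_INTERVAL_MS = 3_600_000; // hourly throttle
const SLEEP_AFTER_MS = 1_800_000; // sleepAfter = "30m"

// Timestamps of container requests caused by alarm ticks up to horizonMs.
function containerRequestTimes(horizonMs: number, lastRefreshAtMs: number): number[] {
  const times: number[] = [];
  let lastRefresh = lastRefreshAtMs;
  for (let t = IDLE_ALARM_INTERVAL_MS; t <= horizonMs; t += IDLE_ALARM_INTERVAL_MS) {
    if (t - lastRefresh >= TOKEN_REFRESH_INTERVAL_MS) {
      times.push(t, t); // setEnvVar(...) and fetch("/refresh-token")
      lastRefresh = t;
    }
  }
  return times;
}

// Longest stretch with no inbound request, measured from t = 0 to horizonMs.
function longestQuietGapMs(times: number[], horizonMs: number): number {
  let prev = 0;
  let longest = 0;
  for (const t of [...times, horizonMs]) {
    longest = Math.max(longest, t - prev);
    prev = t;
  }
  return longest;
}

const TWO_HOURS_MS = 7_200_000;
const times = containerRequestTimes(TWO_HOURS_MS, 0);
console.log(times.length, longestQuietGapMs(times, TWO_HOURS_MS) >= SLEEP_AFTER_MS); // 4 true
```

Over two hours this yields four requests with 60-minute quiet gaps, comfortably longer than the 30-minute `sleepAfter` window.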

Expected Result After All Fixes

User stops chatting with mayor
  → Mayor session.idle fires → container sets mayor to "waiting"
  → agentCompleted or status update propagated to TownDO
  → hasActiveWork() returns false (no polecats/refineries working)
  → Alarm drops to 5-minute interval
  → No health checks, no status observations
  → Token refresh: at most once/hour, persisted throttle survives eviction
  → Container receives no requests apart from the occasional hourly token refresh
  → After 30 minutes of inactivity → sleepAfter triggers → container sleeps
  → User sends new message → sendMayorMessage dispatches mayor → container wakes

Files

  • src/dos/Town.do.ts: hasActiveWork() usage (alarm interval), ensureContainerReady(), refreshContainerToken()
  • src/dos/town/scheduling.ts: hasActiveWork() query
  • container/src/process-manager.ts: handleIdleEvent(), mayor idle timer exclusion
  • container/src/control-server.ts: agentCompleted reporting for mayor
  • src/dos/town/agents.ts: agent status enum (if adding waiting)

Related

Metadata


Assignees

No one assigned

    Labels

    P0 (Blocks soft launch), bug (Something isn't working), gt:container (Container management, agent processes, SDK, heartbeat), gt:core (Reconciler, state machine, bead lifecycle, convoy flow), gt:mayor (Mayor agent, chat interface, delegation tools), kilo-duplicate (Auto-generated label by Kilo), kilo-triaged (Auto-generated label by Kilo)

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests
