Bug: Agent JWT expires after 8h with no refresh — Mayor tool calls 401 #923

@jrf0110

Parent

Part of #204 (Phase 3: Multi-Rig + Scaling)

Problem

Agent tool calls fail with 401 "Token expired" after 8 hours. The GASTOWN_SESSION_TOKEN (agent JWT) is minted once per startAgentInContainer() dispatch with an 8-hour expiry, but there is no token refresh mechanism anywhere in the codebase. Persistent agents — especially the Mayor — can run far longer than 8 hours without being re-dispatched, causing all API calls back to the worker to fail.

This is happening in production right now: a town created a week ago is getting 401s on the Mayor's tool calls.

Root Cause

Token lifecycle

  1. mintAgentToken() (src/dos/town/container-dispatch.ts:45-58) signs a JWT with GASTOWN_JWT_SECRET, 8-hour expiry, claims { agentId, rigId, townId, userId }
  2. startAgentInContainer() (container-dispatch.ts:152-274) calls mintAgentToken() and passes the JWT in the request body's envVars.GASTOWN_SESSION_TOKEN to the container's /agents/start endpoint
  3. The container stores the token on the ManagedAgent record (container/src/process-manager.ts:318) — this value is never updated
  4. The agent's plugin client reads GASTOWN_SESSION_TOKEN once at init and uses it for all subsequent API calls (container/plugin/client.ts:322, 53)
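The lifecycle above can be modeled with plain arithmetic on the claims. This is a minimal sketch, not the real container-dispatch.ts code: buildAgentClaims, isExpired, and the example IDs are hypothetical, and signing with GASTOWN_JWT_SECRET is deliberately omitted so that only the expiry behavior is shown.

```typescript
// Hypothetical model of the agent token claims; signing is omitted on purpose.
interface AgentClaims {
  agentId: string;
  rigId: string;
  townId: string;
  userId: string;
  iat: number; // issued-at, seconds since epoch
  exp: number; // expiry, seconds since epoch
}

type AgentIdentity = Pick<AgentClaims, "agentId" | "rigId" | "townId" | "userId">;

const TOKEN_TTL_SECONDS = 8 * 60 * 60; // the 8-hour expiry described in the issue

// Builds the claims once, at dispatch time.
function buildAgentClaims(identity: AgentIdentity, nowSeconds: number): AgentClaims {
  return { ...identity, iat: nowSeconds, exp: nowSeconds + TOKEN_TTL_SECONDS };
}

// A JWT must be rejected on or after its exp time.
function isExpired(claims: AgentClaims, nowSeconds: number): boolean {
  return nowSeconds >= claims.exp;
}

// Minted once at dispatch and never replaced, as in steps 1-4.
const dispatchedAt = 1_700_000_000;
const mayorClaims = buildAgentClaims(
  { agentId: "mayor", rigId: "rig-1", townId: "town-1", userId: "user-1" },
  dispatchedAt,
);

console.log(isExpired(mayorClaims, dispatchedAt + 7 * 3600)); // false
console.log(isExpired(mayorClaims, dispatchedAt + 9 * 3600)); // true
```

Because nothing ever calls buildAgentClaims again for a running agent, the second check stays true for the rest of the agent's lifetime.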

The failure path (Mayor)

  1. Mayor is dispatched → gets fresh 8h JWT
  2. User interacts periodically (within 30-min windows, preventing container sleep)
  3. sendMayorMessage() (Town.do.ts:627) checks container status → sees isAlive = true → calls sendMessageToAgent() which sends a prompt to the running agent without minting a new JWT
  4. After 8 hours, the original JWT expires
  5. All Mayor tool calls → authMiddleware → 401 "Token expired"
  6. Heartbeats, event persistence, and completion reports also fail (all use the same stale token)

Why polecats are less affected

Polecat agents exit on session.idle (process-manager.ts:259). A typical session completes in minutes to hours. Each new dispatch mints a fresh token. Only the Mayor (and potentially long-running refinery reviews) survive past 8 hours.

Why container sleep/wake isn't a safety net

Container sleep would reset the clock: on wake, the next startAgentInContainer() call mints a fresh JWT. But if the user keeps the container alive by interacting within the 30-min sleepAfter window, the container never sleeps and the Mayor accumulates time against its original JWT indefinitely.

Everything affected by the stale token

| Component | File | What fails |
| --- | --- | --- |
| Tool calls (gt_sling, gt_done, etc.) | container/plugin/client.ts:53 | All agent API calls back to worker |
| Heartbeats | container/src/heartbeat.ts:62 | Agent liveness POSTs |
| Event persistence | container/src/process-manager.ts:107 | Agent event storage in AgentDO |
| Completion reporting | container/src/completion-reporter.ts:36 | Agent completed callbacks |
| Merge callbacks | container/src/control-server.ts:268-270 | Review complete notifications |

Acceptance Criteria

  • Mayor (and any persistent agent) can operate indefinitely without JWT expiration causing 401s
  • Token refresh does not require restarting the agent or container
  • Fresh tokens are delivered to already-running agents
  • The container updates the agent's stored token and plugin client transparently
  • No security regression — tokens should still have bounded expiry (don't "fix" this by making tokens never expire)

Possible Approaches

The exact approach is left to the implementer, but here are the options identified:

Option A — Refresh on message delivery: When sendMayorMessage() takes the isAlive path, mint a fresh JWT and include it in the /agents/:id/message request. The container updates the ManagedAgent.gastownSessionToken and the plugin client's Bearer token. Requires the container to have a mechanism for hot-swapping the token on a running agent.
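One way the hot-swap in Option A might look is sketched below, assuming the plugin client reads the token from a shared mutable holder on every request instead of capturing it once at init. SessionTokenHolder, PluginClient, handleAgentMessage, and the freshToken field are hypothetical names for illustration, not existing code.

```typescript
// Hypothetical sketch of hot-swapping a session token on a running agent.
class SessionTokenHolder {
  private token: string;

  constructor(initialToken: string) {
    this.token = initialToken;
  }

  get(): string {
    return this.token;
  }

  set(freshToken: string): void {
    this.token = freshToken;
  }
}

class PluginClient {
  private readonly tokens: SessionTokenHolder;

  constructor(tokens: SessionTokenHolder) {
    this.tokens = tokens;
  }

  // Headers are built at call time, so a swapped token is picked up immediately.
  authHeaders(): Record<string, string> {
    return { Authorization: `Bearer ${this.tokens.get()}` };
  }
}

// Assumed shape of the /agents/:id/message body with an optional fresh JWT.
interface MessagePayload {
  prompt: string;
  freshToken?: string;
}

function handleAgentMessage(holder: SessionTokenHolder, payload: MessagePayload): void {
  if (payload.freshToken) {
    holder.set(payload.freshToken);
  }
  // Delivering payload.prompt to the running agent is omitted here.
}

const holder = new SessionTokenHolder("token-minted-at-dispatch");
const client = new PluginClient(holder);
handleAgentMessage(holder, { prompt: "status?", freshToken: "token-minted-on-message" });
console.log(client.authHeaders().Authorization); // Bearer token-minted-on-message
```

Any other consumer of the token (heartbeats, completion reporting) would need to read from the same holder for the swap to cover it.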

Option B — Proactive alarm-based refresh: The TownDO alarm periodically (e.g., every hour) re-mints JWTs for all running agents and pushes them to the container via a new /agents/:id/refresh-token endpoint. Container updates the stored token.
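A rough sketch of what one alarm tick in Option B could compute, assuming the worker already knows which agents are running and has some mint function available. planTokenRefresh, MintFn, and the request body shape are hypothetical; only the /agents/:id/refresh-token path comes from the proposal above, and that endpoint does not exist yet.

```typescript
// Hypothetical planner for one alarm tick: one refresh request per running agent.
interface RefreshRequest {
  path: string;
  body: { token: string };
}

// Stand-in for whatever mints a per-agent JWT on the worker side.
type MintFn = (agentId: string) => string;

function planTokenRefresh(runningAgentIds: string[], mint: MintFn): RefreshRequest[] {
  return runningAgentIds.map((agentId) => ({
    path: `/agents/${encodeURIComponent(agentId)}/refresh-token`,
    body: { token: mint(agentId) },
  }));
}

// Example with a fake mint function; a real one would sign a bounded-expiry JWT.
let mintCount = 0;
const fakeMint: MintFn = (agentId) => `jwt-${agentId}-${++mintCount}`;

const requests = planTokenRefresh(["mayor", "refinery-7"], fakeMint);
for (const request of requests) {
  console.log(request.path, request.body.token);
}
```

Keeping the planning step pure makes it easy to test separately from the code that actually sends the requests to the container.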

Option C — Container-side 401 retry with token request: When the plugin client receives a 401, it calls a dedicated (unauthenticated or separately authenticated) token refresh endpoint on the worker, gets a fresh JWT, retries the original request. Requires a way for the container to prove agent identity without the expired JWT.
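The retry flow in Option C might be shaped roughly like this. It is a sketch under assumptions: sendWithRefresh, SendFn, and RefreshFn are hypothetical, the send and refresh functions are injected so the example runs without a network, and the hard part (how the container proves agent identity to the refresh endpoint) is left entirely to the refresh callback.

```typescript
// Hypothetical 401-retry wrapper: refresh once, retry once, never loop.
interface MinimalResponse {
  status: number;
}

type SendFn = (token: string) => Promise<MinimalResponse>;
type RefreshFn = () => Promise<string>;

async function sendWithRefresh(
  currentToken: string,
  send: SendFn,
  refresh: RefreshFn,
): Promise<{ response: MinimalResponse; token: string }> {
  const first = await send(currentToken);
  if (first.status !== 401) {
    return { response: first, token: currentToken };
  }
  // The token was rejected: obtain a fresh one and retry exactly once.
  const freshToken = await refresh();
  const second = await send(freshToken);
  return { response: second, token: freshToken };
}

// Demo with fakes: the server accepts only "fresh-jwt".
async function demo(): Promise<void> {
  const send: SendFn = async (token) => ({ status: token === "fresh-jwt" ? 200 : 401 });
  const refresh: RefreshFn = async () => "fresh-jwt";
  const result = await sendWithRefresh("expired-jwt", send, refresh);
  console.log(result.response.status, result.token); // 200 fresh-jwt
}

demo();
```

The caller is expected to store the returned token so later requests do not pay for the extra round trip.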

Option D — Hybrid: Combine A (refresh on interaction) with B (alarm-based refresh for agents that don't receive messages but stay running, e.g., refinery during long reviews).

Notes

  • No data migration needed: cloud Gastown hasn't been deployed to production (though this bug IS affecting a production town right now)
  • KILOCODE_TOKEN (Kilo LLM gateway token) has a 30-day expiry and IS persisted on the container DO via setEnvVar() — it's not affected by this bug
  • The GASTOWN_SESSION_TOKEN is intentionally NOT persisted on the container DO (it's per-agent, not per-container). This is correct — the fix should refresh per-agent tokens, not change the storage model.
  • verifyAgentJWT() in src/util/jwt.util.ts:18 also sets maxAge: '8h' redundantly on top of the exp claim — both need to be considered when adjusting expiry
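The interaction between the two limits in the last note can be shown with a small calculation. This sketch assumes maxAge is measured from the token's iat claim, which is how JWT libraries commonly treat it; effectiveExpiry is a hypothetical helper, not part of jwt.util.ts.

```typescript
// Hypothetical helper: the earlier of the exp claim and iat + maxAge wins.
function effectiveExpiry(iatSeconds: number, expSeconds: number, maxAgeSeconds: number): number {
  return Math.min(expSeconds, iatSeconds + maxAgeSeconds);
}

const HOUR = 3600;
const issuedAt = 1_700_000_000;

// Raising only the signing expiry to 24h has no effect while maxAge stays at 8h.
const lifetimeSeconds = effectiveExpiry(issuedAt, issuedAt + 24 * HOUR, 8 * HOUR) - issuedAt;
console.log(lifetimeSeconds / HOUR); // 8
```

The same holds in the other direction: loosening maxAge alone does nothing if the exp claim is still set 8 hours after minting.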

Metadata

    Labels

    bug (Something isn't working), kilo-auto-fix (Auto-generated label by Kilo), kilo-triaged (Auto-generated label by Kilo)
