## Parent
Part of #204 (Phase 3: Multi-Rig + Scaling)
## Problem
Agent tool calls fail with 401 "Token expired" after 8 hours. The `GASTOWN_SESSION_TOKEN` (agent JWT) is minted once per `startAgentInContainer()` dispatch with an 8-hour expiry, but there is no token refresh mechanism anywhere in the codebase. Persistent agents — especially the Mayor — can run far longer than 8 hours without being re-dispatched, causing all API calls back to the worker to fail.
This is happening in production right now: a town created a week ago is getting 401s on the Mayor's tool calls.
## Root Cause
### Token lifecycle
- `mintAgentToken()` (`src/dos/town/container-dispatch.ts:45-58`) signs a JWT with `GASTOWN_JWT_SECRET`, an 8-hour expiry, and claims `{ agentId, rigId, townId, userId }` (sketched after this list)
- `startAgentInContainer()` (`container-dispatch.ts:152-274`) calls `mintAgentToken()` and passes the JWT in the request body's `envVars.GASTOWN_SESSION_TOKEN` to the container's `/agents/start` endpoint
- The container stores the token on the `ManagedAgent` record (`container/src/process-manager.ts:318`) — this value is never updated
- The agent's plugin client reads `GASTOWN_SESSION_TOKEN` once at init and uses it for all subsequent API calls (`container/plugin/client.ts:322, 53`)
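
For concreteness, a minimal sketch of the mint step, assuming the `jose` library (the real `mintAgentToken()` may use a different library or options):

```ts
import { SignJWT } from "jose";

// Hedged sketch of mintAgentToken(); details may differ from the actual
// implementation in src/dos/town/container-dispatch.ts.
async function mintAgentToken(
  secret: string, // GASTOWN_JWT_SECRET
  claims: { agentId: string; rigId: string; townId: string; userId: string },
): Promise<string> {
  return new SignJWT(claims)
    .setProtectedHeader({ alg: "HS256" })
    .setIssuedAt()
    .setExpirationTime("8h") // the hardcoded 8-hour expiry behind this bug
    .sign(new TextEncoder().encode(secret));
}
```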
### The failure path (Mayor)
1. Mayor is dispatched → gets a fresh 8h JWT
2. User interacts periodically (within 30-min windows, preventing container sleep)
3. `sendMayorMessage()` (`Town.do.ts:627`) checks container status → sees `isAlive = true` → calls `sendMessageToAgent()`, which sends a prompt to the running agent without minting a new JWT (see the sketch after this list)
4. After 8 hours, the original JWT expires
5. All Mayor tool calls → `authMiddleware` → 401 "Token expired"
6. Heartbeats, event persistence, and completion reports also fail (all use the same stale token)
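
A sketch of the problematic branch; `sendMayorMessage()`, `sendMessageToAgent()`, and `startAgentInContainer()` are real per the description above, but the collaborator shapes are assumptions:

```ts
// Collaborators declared only so the sketch type-checks; shapes are assumed.
declare const container: { getStatus(): Promise<{ isAlive: boolean }> };
declare function sendMessageToAgent(agentId: string, prompt: string): Promise<void>;
declare function startAgentInContainer(agentId: string, prompt: string): Promise<void>;

async function sendMayorMessage(mayorAgentId: string, prompt: string): Promise<void> {
  const { isAlive } = await container.getStatus();
  if (isAlive) {
    // BUG: the warm path never mints a new JWT, so a long-lived Mayor keeps
    // using the token from its original dispatch until it expires at 8h.
    await sendMessageToAgent(mayorAgentId, prompt);
  } else {
    // Only this cold path gets a fresh token, via mintAgentToken() inside
    // startAgentInContainer().
    await startAgentInContainer(mayorAgentId, prompt);
  }
}
```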
### Why polecats are less affected
Polecat agents exit on `session.idle` (`process-manager.ts:259`). A typical session completes in minutes to hours. Each new dispatch mints a fresh token. Only the Mayor (and potentially long-running refinery reviews) survive past 8 hours.
### Why container sleep/wake isn't a safety net
Container sleep would fix this — on wake, the next `startAgentInContainer()` call mints a fresh JWT. But if the user keeps the container alive by interacting within the 30-min `sleepAfter` window, the container never sleeps and the Mayor accumulates time against its original JWT indefinitely.
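
For reference, the idle-timeout knob looks roughly like this, assuming the container binding follows the `@cloudflare/containers` pattern (the class name and port here are assumptions):

```ts
import { Container } from "@cloudflare/containers";

// Assumed container class. Every user interaction resets the idle timer,
// so a chatty user can keep the instance awake (and the JWT stale) forever.
export class AgentContainer extends Container {
  defaultPort = 8080; // assumed
  sleepAfter = "30m"; // the 30-minute window described above
}
```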
### Everything affected by the stale token
| Component | File | What fails |
| --- | --- | --- |
| Tool calls (`gt_sling`, `gt_done`, etc.) | `container/plugin/client.ts:53` | All agent API calls back to worker |
| Heartbeats | `container/src/heartbeat.ts:62` | Agent liveness POSTs |
| Event persistence | `container/src/process-manager.ts:107` | Agent event storage in AgentDO |
| Completion reporting | `container/src/completion-reporter.ts:36` | Agent completed callbacks |
| Merge callbacks | `container/src/control-server.ts:268-270` | Review complete notifications |
## Acceptance Criteria
## Possible Approaches
The exact approach is left to the implementer, but here are the options identified:
**Option A — Refresh on message delivery:** When `sendMayorMessage()` takes the `isAlive` path, mint a fresh JWT and include it in the `/agents/:id/message` request. The container updates `ManagedAgent.gastownSessionToken` and the plugin client's Bearer token. Requires the container to have a mechanism for hot-swapping the token on a running agent.
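
A sketch of Option A's container side; `ManagedAgent` and the message endpoint are real per the sections above, but the handler shape and the piggybacked field are assumptions:

```ts
interface ManagedAgent {
  id: string;
  gastownSessionToken: string; // stored at /agents/start, currently never updated
}

const agents = new Map<string, ManagedAgent>();

// Assumed handler for POST /agents/:id/message, extended so the worker can
// piggyback a freshly minted JWT on every message delivery.
function handleAgentMessage(
  agentId: string,
  body: { prompt: string; gastownSessionToken?: string },
): void {
  const agent = agents.get(agentId);
  if (!agent) throw new Error(`Unknown agent ${agentId}`);
  if (body.gastownSessionToken) {
    // Hot-swap: subsequent tool calls, heartbeats, etc. pick up the new token.
    agent.gastownSessionToken = body.gastownSessionToken;
  }
  // ...deliver body.prompt to the running agent process...
}
```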
**Option B — Proactive alarm-based refresh:** The TownDO alarm periodically (e.g., every hour) re-mints JWTs for all running agents and pushes them to the container via a new `/agents/:id/refresh-token` endpoint. Container updates the stored token.
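
A sketch of Option B's alarm handler; `mintAgentToken()` is real per the lifecycle above (signature simplified here), while `listRunningAgents()`, `containerFetch()`, and the refresh endpoint are assumptions:

```ts
// Assumed helpers, declared so the sketch type-checks.
declare function listRunningAgents(): Promise<Array<{ id: string }>>;
declare function mintAgentToken(agentId: string): Promise<string>; // simplified
declare function containerFetch(path: string, init: RequestInit): Promise<Response>;
declare const storage: { setAlarm(time: number): Promise<void> };

const REFRESH_INTERVAL_MS = 60 * 60 * 1000; // hourly, comfortably inside 8h

async function alarm(): Promise<void> {
  for (const agent of await listRunningAgents()) {
    const token = await mintAgentToken(agent.id);
    await containerFetch(`/agents/${agent.id}/refresh-token`, {
      method: "POST",
      body: JSON.stringify({ gastownSessionToken: token }),
    });
  }
  await storage.setAlarm(Date.now() + REFRESH_INTERVAL_MS); // re-arm the alarm
}
```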
**Option C — Container-side 401 retry with token request:** When the plugin client receives a 401, it calls a dedicated (unauthenticated or separately authenticated) token refresh endpoint on the worker, gets a fresh JWT, and retries the original request. Requires a way for the container to prove agent identity without the expired JWT.
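
A sketch of Option C in the plugin client; the `/auth/agent-token` endpoint and its identity proof are assumptions (and exactly the open design question):

```ts
// Assumed environment; in the real client the token is read once at init
// (container/plugin/client.ts:322).
declare const WORKER_URL: string;
declare const AGENT_ID: string;
let bearerToken = process.env.GASTOWN_SESSION_TOKEN ?? "";

async function apiFetch(
  url: string,
  init: { method?: string; body?: string; headers?: Record<string, string> } = {},
): Promise<Response> {
  const withAuth = () => ({
    ...init,
    headers: { ...init.headers, Authorization: `Bearer ${bearerToken}` },
  });
  let res = await fetch(url, withAuth());
  if (res.status === 401) {
    // Assumed refresh endpoint; how the container proves agent identity
    // without the expired JWT is the open question noted above.
    const refresh = await fetch(`${WORKER_URL}/auth/agent-token`, {
      method: "POST",
      body: JSON.stringify({ agentId: AGENT_ID }),
    });
    bearerToken = ((await refresh.json()) as { token: string }).token;
    res = await fetch(url, withAuth()); // retry the original request once
  }
  return res;
}
```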
**Option D — Hybrid:** Combine A (refresh on interaction) with B (alarm-based refresh for agents that don't receive messages but stay running, e.g., refinery during long reviews).
## Notes
- No data migration needed — cloud Gastown hasn't deployed to production (though this bug IS affecting a production town right now)
- `KILOCODE_TOKEN` (Kilo LLM gateway token) has a 30-day expiry and IS persisted on the container DO via `setEnvVar()` — it's not affected by this bug
- The `GASTOWN_SESSION_TOKEN` is intentionally NOT persisted on the container DO (it's per-agent, not per-container). This is correct — the fix should refresh per-agent tokens, not change the storage model.
- `verifyAgentJWT()` in `src/util/jwt.util.ts:18` also sets `maxAge: '8h'` redundantly on top of the `exp` claim — both need to be considered when adjusting expiry (sketched below)
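
A sketch of that double check, assuming the `jsonwebtoken` library (which the `maxAge` option name suggests):

```ts
import jwt from "jsonwebtoken";

// Hedged sketch of verifyAgentJWT(): jsonwebtoken rejects the token if its
// own `exp` claim has passed AND if `iat` is older than `maxAge`, so raising
// the expiry means touching both the mint side and this verify option.
function verifyAgentJWT(token: string, secret: string): string | jwt.JwtPayload {
  return jwt.verify(token, secret, { maxAge: "8h" }); // redundant with exp
}
```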