fix(gastown): add JWT token refresh for persistent agents#964
fix(gastown): add JWT token refresh for persistent agents#964
Conversation
Agent JWTs minted once at dispatch with 8h expiry caused 401s on all tool calls for persistent agents (Mayor) running longer than 8 hours. Hybrid approach (Option D): - On message delivery: sendMayorMessage/ensureMayor mint and push a fresh JWT when the agent is already alive in the container - Proactive alarm refresh: TownDO alarm refreshes tokens for all working/stalled agents hourly (safety net for agents not receiving user messages, e.g. long refinery reviews) Container side: - POST /agents/:id/refresh-token endpoint hot-swaps tokens on running agents (ManagedAgent record, heartbeat module, plugin clients) - GastownClient/MayorGastownClient gain setToken() for live updates - Global plugin client registry (Symbol.for keyed on globalThis) lets the control server reach plugin instances across TS project roots Closes #923
| // Also update the heartbeat module's token so heartbeat POSTs use the | ||
| // fresh JWT. The heartbeat uses a single module-level token shared | ||
| // across all agents, so any refresh keeps heartbeats alive. | ||
| updateHeartbeatToken(parsed.data.token); |
There was a problem hiding this comment.
WARNING: Refreshing one agent's token breaks heartbeats for every other running agent
updateHeartbeatToken() stores a single module-level JWT, but /api/towns/:townId/rigs/:rigId/agents/:agentId/heartbeat is protected by agentOnlyMiddleware, so the token has to match the specific :agentId on each request. As soon as one agent refreshes here, the heartbeat loop starts sending that agent's token for all other active agents and their heartbeats will begin returning 403/401.
|
|
||
| /** Register a plugin client so its token can be refreshed externally. */ | ||
| export function registerPluginClient(client: TokenRefreshable): void { | ||
| getRegistry().push(client); |
There was a problem hiding this comment.
WARNING: The plugin client registry never releases completed sessions
Every plugin initialization pushes another client into this global array, but there is no matching unregister on session teardown. In a long-lived town container this grows without bound and every token refresh will sweep stale clients from old sessions, which is both a memory leak and an avoidable O(n) cost on each refresh.
Code Review SummaryStatus: 2 Issues Found | Recommendation: Address before merge Overview
Issue Details (click to expand)WARNING
Fix these issues in Kilo Cloud Other Observations (not in diff)None. Files Reviewed (9 files)
|
|
Closing in favor of #988 |
Summary
sendMayorMessage/ensureMayorisAlive paths) AND proactively by the TownDO alarm (hourly for all working/stalled agents). This covers both the common case (user keeps chatting) and the edge case (agents running without messages).ManagedAgentrecord (broadcastEvent, completion reporting), the heartbeat module, AND the plugin client instances (tool-call API requests) via aglobalThissymbol-keyed registry that bridges the containersrc/andplugin/TypeScript project roots.Closes #923
Verification
pnpm typecheck— passes with zero errorspnpm vitest run— all 108 tests pass across 7 test files (including 6 new token refresh tests)Visual Changes
N/A
Reviewer Notes
Symbol.for('gastown.pluginClientRegistry')onglobalThisto bridge thecontainer/src/andcontainer/plugin/TypeScript project roots (they can't cross-import). Both sides access the same array in the same Bun process. This is a deliberate design choice — the alternative would be restructuring the TS project boundaries.refreshRunningAgentTokens()is throttled to once per hour via a simple timestamp check (not persisted across DO restarts, which is fine — a restart re-dispatches agents with fresh tokens anyway).ensureMayorpath fires avoid(fire-and-forget) token refresh because it returns immediately without waiting for the agent to process a message. ThesendMayorMessagepath awaits the refresh before sending the message to ensure the agent has a valid token for any tool calls triggered by that message.townIdtoTEST_ENVand updated URL expectations to include/towns/{townId}/— the test fixture was stale (missing the townId field added during the beads-centric refactor).