fix(gastown): replace per-agent JWTs with per-container JWT#988
Merged
Conversation
Contributor
Code Review Summary
Status: 3 Issues Found | Recommendation: Address before merge
Other Observations (not in diff): Issues found in unchanged code that cannot receive inline comments: N/A
Files Reviewed: 19 files
Reviewed by gpt-5.4-20260305 · 2,001,884 tokens
25eb647 to 552708e
552708e to a4a6093
…r landing (#962) (#13)

Primary fix: In completeReviewWithResult(), after closeBead() on the source bead, explicitly call updateConvoyProgress() with the source bead ID. The polecat already closed the source bead before gt_done, so closeBead is a no-op (guard in updateBeadStatus short-circuits on same status). Calling updateConvoyProgress directly ensures the convoy recounts after the MR bead transitions to 'closed', allowing the source bead to pass the NOT EXISTS guard and count toward closedCount.

Secondary fix: Fix getConvoyForBead() to handle the case where the bead IS the convoy itself. When processConvoyLandings() creates the final landing MR, it passes convoyId as the source bead. The old lookup (find 'tracks' edge from bead) returns null for convoy beads. Now also checks for convoy_metadata presence so the landing MR receives the correct convoy context (merge mode, isIntermediateStep=false) and the refinery sees the 'Final Landing' section in its system prompt.

Co-authored-by: Maple (gastown) <Maple@gastown.local>
- detectCrashLoops: exclude agents hooked to triage beads via NOT EXISTS subquery so triage failures don't create new crash-loop triage requests
- createTriageRequest: add global cap (MAX_OPEN_TRIAGE_REQUESTS=5) to prevent unbounded accumulation during feedback loops
- maybeDispatchTriageAgent: pass role='triage' to skip git clone in container; apply DISPATCH_COOLDOWN_MS on failure via last_activity_at
- agent-runner: handle role='triage' with createLightweightWorkspace (no git clone); refactor createMayorWorkspace to share the same helper
- Add 'triage' to AgentRole enum in both worker and container type files

Co-authored-by: Birch (gastown) <Birch@gastown.local>
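For illustration only, the NOT EXISTS exclusion described above could look roughly like this; the table and column names (agents, hooks, beads, recent_crash_count, and so on) are hypothetical, since the actual schema is not shown in this PR:

```typescript
// Hypothetical schema and query shape; the real detectCrashLoops query differs.
// The NOT EXISTS subquery skips agents that are already hooked to an open
// triage bead, so a failing triage agent cannot spawn yet another triage request.
const CRASH_LOOP_QUERY = `
  SELECT a.agent_id
  FROM agents a
  WHERE a.recent_crash_count >= 3
    AND NOT EXISTS (
      SELECT 1
      FROM hooks h
      JOIN beads b ON b.bead_id = h.bead_id
      WHERE h.agent_id = a.agent_id
        AND b.bead_type = 'triage'
        AND b.status = 'open'
    )
`;
```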
…923)

Agent JWTs (8h expiry) caused 401s for persistent agents (Mayor) running longer than 8 hours. Instead of adding token refresh complexity, replace the auth model with HMAC-based container secrets that never expire — they live as long as the container does. When the container sleeps and wakes, a new secret is minted automatically.

Container secret design:
- Format: townId:nonce:hmac (HMAC-SHA256 signed with GASTOWN_JWT_SECRET)
- No expiry — lives as long as the container process
- Stateless verification — no DO lookup needed, just HMAC check
- Town-scoped — cross-town access prevented by HMAC input binding
- Agent identity via X-Gastown-Agent-Id/Rig-Id headers, trusted because the container secret proves the request origin

Backwards compatible:
- Auth middleware accepts both container secrets AND legacy JWTs
- Container code prefers GASTOWN_CONTAINER_SECRET, falls back to GASTOWN_SESSION_TOKEN
- Legacy JWT minting retained (marked deprecated) for rollout safety

Closes #923
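The secret format and stateless verification described above could be sketched as follows; this assumes Node-style crypto in the container, and the helper names (mintContainerSecret, verifyContainerSecret) are illustrative rather than the actual exports of container-secret.util.ts:

```typescript
import { createHmac, randomBytes, timingSafeEqual } from 'node:crypto';

// Illustrative helper: mint a townId:nonce:hmac secret signed with GASTOWN_JWT_SECRET.
function mintContainerSecret(townId: string, jwtSecret: string): string {
  const nonce = randomBytes(16).toString('hex');
  const mac = createHmac('sha256', jwtSecret).update(`${townId}:${nonce}`).digest('hex');
  return `${townId}:${nonce}:${mac}`;
}

// Stateless verification: no DO lookup, just recompute the HMAC and compare.
// Town scoping comes from binding townId into the HMAC input.
function verifyContainerSecret(secret: string, expectedTownId: string, jwtSecret: string): boolean {
  const [townId, nonce, mac] = secret.split(':');
  if (!townId || !nonce || !mac || townId !== expectedTownId) return false;
  const expected = createHmac('sha256', jwtSecret).update(`${townId}:${nonce}`).digest('hex');
  // Constant-time comparison of equal-length hex digests.
  return mac.length === expected.length &&
    timingSafeEqual(Buffer.from(mac, 'hex'), Buffer.from(expected, 'hex'));
}
```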
The mayor's gt_list_rigs tool requires a userId (to look up rigs via GastownUserDO). With JWT auth, userId was embedded in the token payload. With container secrets, it was missing — causing 401 'Missing userId in token'. Fix: inject GASTOWN_USER_ID as an env var in startAgentInContainer, propagate it through buildAgentEnv, and send it as X-Gastown-User-Id header alongside the container secret.
Replaces the HMAC-based container secret approach with a simpler
per-container JWT. The container JWT:
- Carries { townId, userId, scope: 'container' } — same JWT format
the auth middleware already understands
- Has 8h expiry (same as legacy agent JWTs) but is proactively
refreshed hourly by the TownDO alarm
- Is shared by all agents in the container (one token per town)
- Eliminates the X-Gastown-* identity headers — userId lives in the
JWT, agentId/rigId come from route params
Removes container-secret.util.ts entirely. The auth middleware tries
container JWT verification first (scope: 'container'), falls back to
legacy agent JWT verification.
The TownDO alarm calls refreshContainerToken() once per hour,
which mints a fresh JWT and pushes it to the TownContainerDO via
setEnvVar('GASTOWN_CONTAINER_TOKEN', ...).
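Concretely, the verification fallback described above might look like this minimal sketch, assuming a jose-based HS256 setup; the real verifyContainerJWT/verifyAgentJWT helpers and payload types may be structured differently:

```typescript
import { jwtVerify, type JWTPayload } from 'jose';

// Payload shape from the commit text; the real type lives in the worker auth code.
interface ContainerJWTPayload extends JWTPayload {
  townId: string;
  userId: string;
  scope: 'container';
}

// Both token kinds are HS256-signed with the same GASTOWN_JWT_SECRET, so one
// verification suffices; the scope claim decides which path applies.
async function authenticate(token: string, jwtSecret: string) {
  const key = new TextEncoder().encode(jwtSecret);
  const { payload } = await jwtVerify(token, key);

  if (payload.scope === 'container') {
    // Container JWT: town-scoped; agentId/rigId come from route params instead.
    const { townId, userId } = payload as ContainerJWTPayload;
    return { kind: 'container' as const, townId, userId };
  }

  // Legacy per-agent JWT: identity is embedded in the token itself.
  return {
    kind: 'agent' as const,
    agentId: payload.agentId as string,
    rigId: payload.rigId as string,
    townId: payload.townId as string,
    userId: payload.userId as string,
  };
}
```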
- Add auth guard to startMergeInContainer (missing null check for tokens)
- Add POST /refresh-token endpoint to container control server so the alarm-based refresh actually updates process.env on the running Bun process (setEnvVar only takes effect on next boot)
- Plugin clients read process.env.GASTOWN_CONTAINER_TOKEN on each request to pick up refreshed tokens without needing to restart
- dispatch.refreshContainerToken() pushes fresh JWT to both the ContainerDO (setEnvVar for next boot) and the running container (POST /refresh-token for current process)
- Clarify auth middleware comments re: intentional no-op checks for container JWTs (town-scoped, rig/agent identity from route params)
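A minimal sketch of the /refresh-token endpoint described above, assuming the container control server is a plain Bun.serve() handler and that the body shape is { token: string } (both assumptions; the real endpoint may differ):

```typescript
// Minimal sketch of a Bun control-server route; the real server handles more routes.
Bun.serve({
  port: 8080, // illustrative port
  async fetch(req) {
    const url = new URL(req.url);
    if (req.method === 'POST' && url.pathname === '/refresh-token') {
      const body = (await req.json()) as { token?: string };
      if (!body.token) return new Response('missing token', { status: 400 });
      // Update the live process env so plugin clients pick up the new JWT immediately;
      // setEnvVar on the ContainerDO only takes effect on the next boot.
      process.env.GASTOWN_CONTAINER_TOKEN = body.token;
      return new Response('ok');
    }
    return new Response('not found', { status: 404 });
  },
});
```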
- broadcastEvent: read live token from process.env instead of cached ManagedAgent field, so event persistence uses refreshed tokens
- heartbeat: read process.env.GASTOWN_CONTAINER_TOKEN on each tick instead of the module-level cached token from startHeartbeat()
- completion-reporter: same pattern — prefer live process.env token
- Strip gastownContainerToken from /agents/start response to prevent leaking the town-wide bearer token to dashboard callers
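The read-the-live-token pattern from these bullets might look like the following sketch; the currentToken helper and the heartbeat URL are illustrative, not the actual plugin code:

```typescript
// Illustrative helper: always resolve the freshest token at call time,
// preferring the container JWT and falling back to the legacy session token.
function currentToken(): string | undefined {
  return process.env.GASTOWN_CONTAINER_TOKEN ?? process.env.GASTOWN_SESSION_TOKEN;
}

// Example heartbeat tick: re-resolve the token on every request instead of
// capturing it once at startHeartbeat() time.
async function sendHeartbeat(baseUrl: string): Promise<void> {
  const token = currentToken();
  if (!token) return; // nothing to authenticate with yet
  await fetch(`${baseUrl}/heartbeat`, {
    method: 'POST',
    headers: { Authorization: `Bearer ${token}` },
  });
}
```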
- ensureContainerToken now pushes to the running container via POST /refresh-token (not just setEnvVar), so existing agents pick up the fresh token immediately on every agent start — no gap until the next alarm-based refresh
- refreshContainerToken is now an alias for ensureContainerToken since both do the same thing (setEnvVar + POST /refresh-token)
- Move throttle timestamp update to after successful refresh so failed refreshes are retried on the next alarm tick instead of being throttled away for an hour
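Putting the pieces together, a rough sketch of the TownDO side; the ContainerStub interface, containerFetch, and the jose-based signing are assumptions, and the real ensureContainerToken in Town.do.ts may be organized differently:

```typescript
import { SignJWT } from 'jose';

// Hypothetical stub surface: setEnvVar is named in the commits, containerFetch is illustrative.
interface ContainerStub {
  setEnvVar(key: string, value: string): Promise<void>;
  containerFetch(path: string, init?: RequestInit): Promise<Response>;
}

// In-memory throttle on the TownDO; not persisted across DO restarts.
let lastContainerTokenRefreshAt = 0;
const REFRESH_INTERVAL_MS = 60 * 60 * 1000; // hourly

async function ensureContainerToken(
  stub: ContainerStub,
  townId: string,
  userId: string,
  jwtSecret: string,
): Promise<void> {
  if (Date.now() - lastContainerTokenRefreshAt < REFRESH_INTERVAL_MS) return;

  const token = await new SignJWT({ townId, userId, scope: 'container' })
    .setProtectedHeader({ alg: 'HS256' })
    .setExpirationTime('8h')
    .sign(new TextEncoder().encode(jwtSecret));

  // 1) Persist for the next container boot.
  await stub.setEnvVar('GASTOWN_CONTAINER_TOKEN', token);

  // 2) Push to the running process so existing agents pick it up immediately.
  const resp = await stub.containerFetch('/refresh-token', {
    method: 'POST',
    body: JSON.stringify({ token }),
  });
  if (!resp.ok) throw new Error(`refresh-token failed: ${resp.status}`);

  // Advance the throttle only after success so a failed refresh is retried
  // on the next alarm tick instead of being throttled away for an hour.
  lastContainerTokenRefreshAt = Date.now();
}
```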
…d param

Container JWTs don't carry agentId, and routes like /triage/resolve and /mail don't have :agentId in the URL. This left agentId as '' in the auth payload, breaking handleResolveTriage (which requires a non-empty agentId) and weakening ownership checks in other handlers.

Fix: tryContainerJWTAuth falls back to X-Gastown-Agent-Id and X-Gastown-Rig-Id headers when route params are absent. Both plugin clients (GastownClient, MayorGastownClient) now send these headers when using a container-scoped JWT.
The /agent-events route doesn't have :agentId in the URL, so with a container JWT the getEnforcedAgentId() ownership check became a no-op. Add X-Gastown-Agent-Id/Rig-Id headers when using the container token so the handler can still verify agent_id ownership.
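A sketch of the header fallback described in the last two commits, assuming a Hono-style middleware context (the real tryContainerJWTAuth may read route params and headers differently):

```typescript
import type { Context } from 'hono';

// After container JWT verification succeeds, resolve agent/rig identity:
// route params win; the X-Gastown-* headers are the fallback for routes
// like /triage/resolve, /mail, and /agent-events that have no :agentId.
function resolveAgentIdentity(c: Context): { agentId: string; rigId: string } {
  const agentId = c.req.param('agentId') ?? c.req.header('X-Gastown-Agent-Id') ?? '';
  const rigId = c.req.param('rigId') ?? c.req.header('X-Gastown-Rig-Id') ?? '';
  return { agentId, rigId };
}
```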
container.fetch() only throws on transport errors. A 4xx/5xx from /refresh-token was silently swallowed, causing the alarm throttle to advance even though the container never accepted the new token. Now check resp.ok and throw on non-2xx so the error propagates to refreshContainerToken() in Town.do.ts, which only advances lastContainerTokenRefreshAt after success.
…tainer owner_user_id is optional in TownConfigSchema, so the fallback was minting a container JWT with userId: '' which broke resolveUserId() in mayor tool handlers. Match the pattern used in refreshContainerToken by falling back to townId.
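A minimal sketch of that fallback, with the config shape reduced to the one field named in the commit:

```typescript
interface TownConfig {
  owner_user_id?: string; // optional in TownConfigSchema
}

// Never sign a container JWT with userId: ''; fall back to the townId,
// matching the pattern already used in refreshContainerToken.
function resolveJwtUserId(config: TownConfig, townId: string): string {
  return config.owner_user_id || townId;
}
```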
a4a6093 to 335df17
// propagate the error so the alarm retries on the next tick.
const isContainerDown =
  err instanceof TypeError || (err instanceof Error && err.message.includes('fetch'));
if (!isContainerDown) throw err;
Contributor
WARNING: Refresh failures now block new dispatches
ensureContainerToken() throws on any non-transport /refresh-token failure, but both startAgentInContainer() and startMergeInContainer() call it before minting the legacy per-agent JWT fallback. A transient 4xx/5xx from the running container will therefore make new agents and merges return false even though the fallback credential path still exists.
pandemicsyn approved these changes · Mar 11, 2026
Summary
- A single per-container JWT carrying { townId, userId, scope: 'container' } with 8h expiry, proactively refreshed hourly by the TownDO alarm. This eliminates the per-agent token overhead while keeping short-lived tokens to limit blast radius from exfiltration.
- The auth middleware accepts both container JWTs (scope: 'container') and legacy per-agent JWTs. Container code prefers GASTOWN_CONTAINER_TOKEN, falls back to GASTOWN_SESSION_TOKEN.

How it works
- ensureContainerToken() signs a container JWT and stores it on the TownContainerDO via setEnvVar('GASTOWN_CONTAINER_TOKEN', ...). Called from startAgentInContainer and startMergeInContainer.
- refreshContainerToken() in Town.do.ts runs in the alarm handler, throttled to once per hour. Mints a fresh JWT and pushes it to the ContainerDO env var.
- The auth middleware tries verifyContainerJWT() first (recognizes scope: 'container'), falls back to verifyAgentJWT(). The container JWT carries townId and userId; agentId/rigId come from route params (trusted because the JWT proves the request came from the right town's container).

Why per-container instead of per-agent
The original design minted one JWT per agent, encoding { agentId, rigId, townId, userId }. This meant every agent dispatch required a fresh signing operation, and the 8h expiry was a ticking bomb for long-running agents. The per-container model:

- userId (needed by mayor tool routes like listRigs) lives in the JWT instead of requiring a separate env var or header
- One setEnvVar call per hour instead of tracking N agents

Closes #923
Verification
- pnpm typecheck — zero errors
- pnpm vitest run — 105 tests pass across 7 test files

Visual Changes
N/A
Reviewer Notes
- The scope: 'container' field in the JWT payload is how the middleware distinguishes container JWTs from legacy agent JWTs. Both are HS256-signed with the same GASTOWN_JWT_SECRET. The ContainerJWTPayload Zod schema enforces scope: z.literal('container') so a legacy agent JWT can never accidentally parse as a container JWT.
- agentOnlyMiddleware is relaxed for container JWTs: the agentId is populated from the route param rather than the token, so the middleware allows empty agentId in the payload.
- refreshContainerToken() uses a simple in-memory timestamp (lastContainerTokenRefreshAt) for throttling. Not persisted across DO restarts, which is fine — a restart re-dispatches agents with fresh tokens anyway.
- TEST_ENV was missing townId and URL expectations didn't include /towns/{townId}/ in the path (stale fixtures from the beads-centric refactor).
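For concreteness, the scope-enforcing Zod schema mentioned in the first note could look like this sketch; only the fields named in this PR are included, and the real ContainerJWTPayload may carry more claims:

```typescript
import { z } from 'zod';

// scope: z.literal('container') is the discriminator: a legacy per-agent JWT
// (which has no such scope claim) can never parse as a container JWT.
const ContainerJWTPayload = z.object({
  townId: z.string(),
  userId: z.string(),
  scope: z.literal('container'),
});

type ContainerJWTPayload = z.infer<typeof ContainerJWTPayload>;

// Example: distinguishing a decoded payload after signature verification.
function isContainerPayload(payload: unknown): payload is ContainerJWTPayload {
  return ContainerJWTPayload.safeParse(payload).success;
}
```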