
fix(gastown): replace per-agent JWTs with per-container JWT #988

Merged
jrf0110 merged 13 commits into main from 923-container-secret-auth on Mar 11, 2026


Conversation

@jrf0110 (Contributor) commented Mar 10, 2026

Summary

  • Fix for Bug: Agent JWT expires after 8h with no refresh — Mayor tool calls 401 #923: Agent JWTs were minted per-agent with 8h expiry and no refresh, causing 401s on all tool calls for persistent agents (Mayor) running longer than 8 hours.
  • New model: One JWT per container, shared by all agents in the town. Carries { townId, userId, scope: 'container' } with 8h expiry, proactively refreshed hourly by the TownDO alarm. This eliminates the per-agent token overhead while keeping short-lived tokens to limit blast radius from exfiltration.
  • Backwards compatible: Auth middleware accepts both container JWTs (scope: 'container') and legacy per-agent JWTs. Container code prefers GASTOWN_CONTAINER_TOKEN, falls back to GASTOWN_SESSION_TOKEN.

How it works

  1. Minting: ensureContainerToken() signs a container JWT and stores it on the TownContainerDO via setEnvVar('GASTOWN_CONTAINER_TOKEN', ...). Called from startAgentInContainer and startMergeInContainer.
  2. Refresh: refreshContainerToken() in Town.do.ts runs in the alarm handler, throttled to once per hour. Mints a fresh JWT and pushes it to the ContainerDO env var.
  3. Auth: Middleware tries verifyContainerJWT() first (recognizes scope: 'container'), falls back to verifyAgentJWT(). The container JWT carries townId and userId; agentId/rigId come from route params (trusted because the JWT proves the request came from the right town's container).
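The verification order in step 3 can be sketched as follows. This is a hypothetical, dependency-free sketch: the real `verifyContainerJWT()`/`verifyAgentJWT()` signatures and the route-param plumbing are assumptions, but it shows the container-first, legacy-fallback ordering and why `agentId`/`rigId` can safely come from route params.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const b64url = (s: string) => Buffer.from(s).toString("base64url");

// Mint an HS256 JWT (what signing with GASTOWN_JWT_SECRET amounts to).
function signJWT(payload: object, secret: string): string {
  const head = b64url(JSON.stringify({ alg: "HS256", typ: "JWT" }));
  const body = b64url(JSON.stringify(payload));
  const sig = createHmac("sha256", secret).update(`${head}.${body}`).digest("base64url");
  return `${head}.${body}.${sig}`;
}

// Verify signature and expiry; returns the payload or null.
function verifyHS256(token: string, secret: string): any {
  const [head, body, sig] = token.split(".");
  if (!head || !body || !sig) return null;
  const expected = createHmac("sha256", secret).update(`${head}.${body}`).digest("base64url");
  const a = Buffer.from(sig);
  const b = Buffer.from(expected);
  if (a.length !== b.length || !timingSafeEqual(a, b)) return null;
  const payload = JSON.parse(Buffer.from(body, "base64url").toString());
  if (typeof payload.exp === "number" && payload.exp < Date.now() / 1000) return null;
  return payload;
}

// Middleware sketch: container JWT first, legacy per-agent JWT as fallback.
function authenticate(
  token: string,
  secret: string,
  routeParams: { agentId?: string; rigId?: string },
) {
  const payload = verifyHS256(token, secret);
  if (!payload) return null;
  if (payload.scope === "container") {
    // agentId/rigId come from route params: the JWT already proves the
    // request originated from the right town's container.
    return {
      townId: payload.townId,
      userId: payload.userId,
      agentId: routeParams.agentId ?? "",
      rigId: routeParams.rigId ?? "",
    };
  }
  // Legacy per-agent JWT: identity is embedded in the token itself.
  return {
    townId: payload.townId,
    userId: payload.userId,
    agentId: payload.agentId,
    rigId: payload.rigId,
  };
}
```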

Why per-container instead of per-agent

The original design minted one JWT per agent, encoding { agentId, rigId, townId, userId }. This meant every agent dispatch required a fresh signing operation, and the 8h expiry was a ticking bomb for long-running agents. The per-container model:

  • Mints once per container boot, shared by all agents
  • The userId (needed by mayor tool routes like listRigs) lives in the JWT instead of requiring a separate env var or header
  • Refresh is trivial — one setEnvVar call per hour instead of tracking N agents

Closes #923

Verification

  • pnpm typecheck — zero errors
  • pnpm vitest run — 105 tests pass across 7 test files

Visual Changes

N/A

Reviewer Notes

  • The scope: 'container' field in the JWT payload is how the middleware distinguishes container JWTs from legacy agent JWTs. Both are HS256-signed with the same GASTOWN_JWT_SECRET. The ContainerJWTPayload Zod schema enforces scope: z.literal('container') so a legacy agent JWT can never accidentally parse as a container JWT.
  • agentOnlyMiddleware is relaxed for container JWTs: the agentId is populated from the route param rather than the token, so the middleware allows empty agentId in the payload.
  • refreshContainerToken() uses a simple in-memory timestamp (lastContainerTokenRefreshAt) for throttling. Not persisted across DO restarts, which is fine — a restart re-dispatches agents with fresh tokens anyway.
  • Test fixes included: TEST_ENV was missing townId and URL expectations didn't include /towns/{townId}/ in the path (stale fixtures from the beads-centric refactor).
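The scope check in the first note can be sketched as a discriminating guard. The PR enforces it with a Zod schema (`scope: z.literal('container')`); the dependency-free type guard below is a hypothetical equivalent showing why a legacy agent JWT, which carries no `scope` field, can never parse as a container payload.

```typescript
type ContainerJWTPayload = { townId: string; userId: string; scope: "container" };

// Hand-rolled stand-in for the ContainerJWTPayload Zod schema: the literal
// "container" check is what keeps legacy agent JWTs from matching.
function isContainerJWTPayload(p: unknown): p is ContainerJWTPayload {
  if (typeof p !== "object" || p === null) return false;
  const o = p as Record<string, unknown>;
  return (
    typeof o.townId === "string" &&
    typeof o.userId === "string" &&
    o.scope === "container" // legacy agent JWTs lack this field entirely
  );
}
```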

kilo-code-bot (Bot) commented Mar 10, 2026

Code Review Summary

Status: 3 Issues Found | Recommendation: Address before merge

Overview

| Severity | Count |
| --- | --- |
| CRITICAL | 0 |
| WARNING | 3 |
| SUGGESTION | 0 |

Issue Details

WARNING

| File | Line | Issue |
| --- | --- | --- |
| cloudflare-gastown/src/dos/town/container-dispatch.ts | 116 | Refresh-token failures abort new agent/merge dispatches before the legacy JWT fallback can be minted. |
| cloudflare-gastown/src/dos/Town.do.ts | 2538 | Triage agents are still persisted as polecat, so recovery paths redispatch them without the new lightweight triage behavior. |
| cloudflare-gastown/src/dos/town/patrol.ts | 602 | Crash-loop exclusion checks the current hook, so failed triage batches are counted again after unhooking and can recreate the feedback loop. |
Other Observations (not in diff)

Issues found in unchanged code that cannot receive inline comments:

N/A

Files Reviewed (19 files)
  • cloudflare-gastown/container/plugin/client.test.ts - 0 issues
  • cloudflare-gastown/container/plugin/client.ts - 0 issues
  • cloudflare-gastown/container/plugin/types.ts - 0 issues
  • cloudflare-gastown/container/src/agent-runner.ts - 0 issues
  • cloudflare-gastown/container/src/completion-reporter.ts - 0 issues
  • cloudflare-gastown/container/src/control-server.ts - 0 issues
  • cloudflare-gastown/container/src/heartbeat.ts - 0 issues
  • cloudflare-gastown/container/src/process-manager.ts - 0 issues
  • cloudflare-gastown/container/src/types.ts - 0 issues
  • cloudflare-gastown/src/dos/Town.do.ts - 1 issue
  • cloudflare-gastown/src/dos/town/beads.ts - 0 issues
  • cloudflare-gastown/src/dos/town/container-dispatch.ts - 1 issue
  • cloudflare-gastown/src/dos/town/patrol.ts - 1 issue
  • cloudflare-gastown/src/dos/town/review-queue.ts - 0 issues
  • cloudflare-gastown/src/middleware/auth.middleware.ts - 0 issues
  • cloudflare-gastown/src/middleware/mayor-auth.middleware.ts - 0 issues
  • cloudflare-gastown/src/types.ts - 0 issues
  • cloudflare-gastown/src/util/jwt.util.ts - 0 issues
  • plans/gastown-org-level-architecture.md - 0 issues

Reviewed by gpt-5.4-20260305 · 2,001,884 tokens

@jrf0110 changed the title from "fix(gastown): replace per-agent JWTs with per-container HMAC secrets" to "fix(gastown): replace per-agent JWTs with per-container JWT" on Mar 10, 2026
@jrf0110 force-pushed the 923-container-secret-auth branch from 25eb647 to 552708e on March 10, 2026 21:23
@jrf0110 force-pushed the 923-container-secret-auth branch from 552708e to a4a6093 on March 10, 2026 22:26
kilo-code-bot Bot and others added 12 commits March 10, 2026 19:56
…r landing (#962) (#13)

Primary fix: In completeReviewWithResult(), after closeBead() on the source bead,
explicitly call updateConvoyProgress() with the source bead ID. The polecat already
closed the source bead before gt_done, so closeBead is a no-op (guard in
updateBeadStatus short-circuits on same status). Calling updateConvoyProgress
directly ensures the convoy recounts after the MR bead transitions to 'closed',
allowing the source bead to pass the NOT EXISTS guard and count toward closedCount.

Secondary fix: Fix getConvoyForBead() to handle the case where the bead IS the
convoy itself. When processConvoyLandings() creates the final landing MR, it passes
convoyId as the source bead. The old lookup (find 'tracks' edge from bead) returns
null for convoy beads. Now also checks for convoy_metadata presence so the landing
MR receives the correct convoy context (merge mode, isIntermediateStep=false) and
the refinery sees the 'Final Landing' section in its system prompt.

Co-authored-by: Maple (gastown) <Maple@gastown.local>

- detectCrashLoops: exclude agents hooked to triage beads via NOT EXISTS
  subquery so triage failures don't create new crash-loop triage requests
- createTriageRequest: add global cap (MAX_OPEN_TRIAGE_REQUESTS=5) to
  prevent unbounded accumulation during feedback loops
- maybeDispatchTriageAgent: pass role='triage' to skip git clone in
  container; apply DISPATCH_COOLDOWN_MS on failure via last_activity_at
- agent-runner: handle role='triage' with createLightweightWorkspace
  (no git clone); refactor createMayorWorkspace to share the same helper
- Add 'triage' to AgentRole enum in both worker and container type files

Co-authored-by: Birch (gastown) <Birch@gastown.local>

…923)

Agent JWTs (8h expiry) caused 401s for persistent agents (Mayor)
running longer than 8 hours. Instead of adding token refresh complexity,
replace the auth model with HMAC-based container secrets that never
expire — they live as long as the container does. When the container
sleeps and wakes, a new secret is minted automatically.

Container secret design:
- Format: townId:nonce:hmac (HMAC-SHA256 signed with GASTOWN_JWT_SECRET)
- No expiry — lives as long as the container process
- Stateless verification — no DO lookup needed, just HMAC check
- Town-scoped — cross-town access prevented by HMAC input binding
- Agent identity via X-Gastown-Agent-Id/Rig-Id headers, trusted
  because the container secret proves the request origin

Backwards compatible:
- Auth middleware accepts both container secrets AND legacy JWTs
- Container code prefers GASTOWN_CONTAINER_SECRET, falls back to
  GASTOWN_SESSION_TOKEN
- Legacy JWT minting retained (marked deprecated) for rollout safety

Closes #923

The mayor's gt_list_rigs tool requires a userId (to look up rigs via
GastownUserDO). With JWT auth, userId was embedded in the token payload.
With container secrets, it was missing — causing 401 'Missing userId
in token'.

Fix: inject GASTOWN_USER_ID as an env var in startAgentInContainer,
propagate it through buildAgentEnv, and send it as X-Gastown-User-Id
header alongside the container secret.

Replaces the HMAC-based container secret approach with a simpler
per-container JWT. The container JWT:

- Carries { townId, userId, scope: 'container' } — same JWT format
  the auth middleware already understands
- Has 8h expiry (same as legacy agent JWTs) but is proactively
  refreshed hourly by the TownDO alarm
- Is shared by all agents in the container (one token per town)
- Eliminates the X-Gastown-* identity headers — userId lives in the
  JWT, agentId/rigId come from route params

Removes container-secret.util.ts entirely. The auth middleware tries
container JWT verification first (scope: 'container'), falls back to
legacy agent JWT verification.

The TownDO alarm calls refreshContainerToken() once per hour,
which mints a fresh JWT and pushes it to the TownContainerDO via
setEnvVar('GASTOWN_CONTAINER_TOKEN', ...).

- Add auth guard to startMergeInContainer (missing null check for tokens)
- Add POST /refresh-token endpoint to container control server so the
  alarm-based refresh actually updates process.env on the running Bun
  process (setEnvVar only takes effect on next boot)
- Plugin clients read process.env.GASTOWN_CONTAINER_TOKEN on each request
  to pick up refreshed tokens without needing to restart
- dispatch.refreshContainerToken() pushes fresh JWT to both the
  ContainerDO (setEnvVar for next boot) and the running container
  (POST /refresh-token for current process)
- Clarify auth middleware comments re: intentional no-op checks for
  container JWTs (town-scoped, rig/agent identity from route params)
- broadcastEvent: read live token from process.env instead of cached
  ManagedAgent field, so event persistence uses refreshed tokens
- heartbeat: read process.env.GASTOWN_CONTAINER_TOKEN on each tick
  instead of the module-level cached token from startHeartbeat()
- completion-reporter: same pattern — prefer live process.env token
- Strip gastownContainerToken from /agents/start response to prevent
  leaking the town-wide bearer token to dashboard callers
- ensureContainerToken now pushes to running container via POST
  /refresh-token (not just setEnvVar), so existing agents pick up
  the fresh token immediately on every agent start — no gap until
  the next alarm-based refresh
- refreshContainerToken is now an alias for ensureContainerToken
  since both do the same thing (setEnvVar + POST /refresh-token)
- Move throttle timestamp update to after successful refresh so
  failed refreshes are retried on the next alarm tick instead of
being throttled away for an hour

…d param

Container JWTs don't carry agentId, and routes like /triage/resolve
and /mail don't have :agentId in the URL. This left agentId as ''
in the auth payload, breaking handleResolveTriage (which requires a
non-empty agentId) and weakening ownership checks in other handlers.

Fix: tryContainerJWTAuth falls back to X-Gastown-Agent-Id and
X-Gastown-Rig-Id headers when route params are absent. Both plugin
clients (GastownClient, MayorGastownClient) now send these headers
when using a container-scoped JWT.

The /agent-events route doesn't have :agentId in the URL, so with a
container JWT the getEnforcedAgentId() ownership check became a no-op.
Add X-Gastown-Agent-Id/Rig-Id headers when using the container token
so the handler can still verify agent_id ownership.

container.fetch() only throws on transport errors. A 4xx/5xx from
/refresh-token was silently swallowed, causing the alarm throttle
to advance even though the container never accepted the new token.

Now check resp.ok and throw on non-2xx so the error propagates to
refreshContainerToken() in Town.do.ts, which only advances
lastContainerTokenRefreshAt after success.

…tainer

owner_user_id is optional in TownConfigSchema, so the fallback was
minting a container JWT with userId: '' which broke resolveUserId()
in mayor tool handlers. Match the pattern used in refreshContainerToken
by falling back to townId.
@jrf0110 force-pushed the 923-container-secret-auth branch from a4a6093 to 335df17 on March 11, 2026 00:56
```ts
// propagate the error so the alarm retries on the next tick.
const isContainerDown =
  err instanceof TypeError || (err instanceof Error && err.message.includes('fetch'));
if (!isContainerDown) throw err;
```
WARNING: Refresh failures now block new dispatches

ensureContainerToken() throws on any non-transport /refresh-token failure, but both startAgentInContainer() and startMergeInContainer() call it before minting the legacy per-agent JWT fallback. A transient 4xx/5xx from the running container will therefore make new agents and merges return false even though the fallback credential path still exists.
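One way to read this concern, as a hypothetical sketch (the function shapes below are stand-ins, not the PR's actual dispatch code): if the container-token refresh throws before the legacy mint is reached, catching the error is what keeps the fallback credential path alive.

```typescript
// Hypothetical stand-ins illustrating the reviewer's point: ensureContainerToken()
// may throw on a /refresh-token 4xx/5xx, but dispatch could still fall back to
// minting a legacy per-agent JWT instead of failing outright.
function dispatchToken(
  ensureContainerToken: () => string, // may throw on refresh failure
  mintLegacyAgentJWT: () => string,   // legacy fallback credential
): string {
  try {
    return ensureContainerToken();
  } catch {
    return mintLegacyAgentJWT(); // fallback survives transient refresh failures
  }
}
```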

@jrf0110 jrf0110 merged commit 29b558b into main Mar 11, 2026
18 checks passed
@jrf0110 jrf0110 deleted the 923-container-secret-auth branch March 11, 2026 13:44