chore(gastown): promote gastown-staging to main#2974

Open
jrf0110 wants to merge 6 commits into main from gastown-staging

@jrf0110 jrf0110 commented Apr 30, 2026

Summary

Batch promote of gastown-staging to main. Four commits since the last promotion, each addressing a distinct gastown issue:

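| # | Commit   | PR                    | Type                      |
|---|----------|-----------------------|---------------------------|
| 1 | 21a14d04 | (no PR, direct push)  | dev-only fix              |
| 2 | c532f40e | #2999                 | bug fix                   |
| 3 | db584066 | #3047                 | bug fix                   |
| 4 | 47e95a8b | #3055                 | observability + race fix  |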

Constituent changes

  • fix(gastown): point dev GIT_TOKEN_SERVICE binding at git-token-service-dev (21a14d04, no PR — direct push)
    Local-dev binding fix. services/git-token-service/wrangler.jsonc overrides its worker name to git-token-service-dev in env.dev, but gastown's env.dev.services binding still referenced the base git-token-service name. Wrangler's local dev registry does exact-name matching, so the binding showed [not connected] whenever both workers ran side by side. Production binding (top-level services) is untouched; only the env.dev block was changed.

  • fix(gastown): push new model onto resumed mayor session on hot-swap (#2999)
    Regression where changing the mayor's model in town settings persisted the config but the running mayor kept using the previous model — users had to run /model manually. Root cause: a prior fix (9785570b9) skipped the session.prompt(...) call on resumed sessions to avoid a duplicate startup turn, but that call was also responsible for pushing the new model field onto the session. Fix sends a session.prompt with noReply: true and the new model on resume, so the model updates without injecting a synthetic user turn.

  • fix(gastown): stop reconciler log spam from orphaned bead_cancelled events (#3047)
    Production logs were filling with "reconciler: applyEvent failed … Bead <id> not found", repeated on every alarm tick indefinitely. Two cooperating bugs: (1) deleteBead cascaded cleanup to satellite tables but not to the town_events reconciler queue, leaving orphan events; (2) the drain loop in Town.do.ts intentionally never marked failed events as processed, expecting retries to eventually succeed, but for a deleted bead they never can. Fix cleans up town_events in deleteBead/deleteBeads, pre-checks bead existence in applyEvent, and classifies Bead/Agent … not found errors as terminal in the drain loop.

  • feat(gastown-container): add crash visibility + per-agent start mutex (#3055)
    Investigation hooks for repeated container restarts seen on a specific town (~1.5–2 min cadence). Adds an unhandledRejection listener with full stack logging (no process.exit), periodic RSS memory logging via the heartbeat path, and a per-agentId mutex in startAgent that fixes a real concurrency race exposed by duplicate /agents/start log lines arriving in the same millisecond for the same mayor.

Verification

  • Each constituent PR was reviewed and merged into gastown-staging independently — see PR links above for per-change verification details.
  • 21a14d04 was manually verified locally (wrangler dev --env dev for both workers, binding now connects).

Reviewer notes

  • All four changes are scoped to services/gastown (container + DO/worker code) and the env.dev block of services/gastown/wrangler.jsonc. No cross-service changes.
  • Production worker bindings are unchanged.

…e-dev

git-token-service's wrangler env.dev overrides the worker name to
'git-token-service-dev', but gastown's env.dev.services binding was
still referencing the base 'git-token-service' name. Wrangler's local
dev registry does exact-name matching, so the binding showed as
[not connected] whenever both workers were running side by side.

Every other consumer in the repo (cloud-agent-next, security-sync,
security-auto-analysis) already uses 'git-token-service-dev' in their
env.dev block; gastown was the outlier.

kilo-code-bot Bot commented Apr 30, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Files Reviewed (9 files)
  • services/gastown/wrangler.jsonc
  • services/gastown/container/src/main.ts
  • services/gastown/container/src/process-manager.ts
  • services/gastown/container/src/process-manager.test.ts
  • services/gastown/container/vitest.config.ts
  • services/gastown/src/dos/Town.do.ts
  • services/gastown/src/dos/town/beads.ts
  • services/gastown/src/dos/town/reconciler.ts
  • services/gastown/test/integration/event-cleanup.test.ts

Reviewed by gpt-5.5-20260423 · 550,886 tokens

jrf0110 and others added 3 commits May 1, 2026 16:22
…2999)

When a user changes the mayor's model in town settings, updateAgentModel
restarts the SDK server with new KILO_CONFIG_CONTENT and resumes the
existing session from kilo.db. Commit 9785570 intentionally stopped
sending any session.prompt on resume to avoid duplicating the
MAYOR_STARTUP_PROMPT, but that also dropped the model param — so the
resumed session kept its prior per-session model until the user ran
/model manually.

Extract the fresh vs. resumed session-prompt logic into applyModelToSession
and on resume send a noReply:true prompt carrying only the new model
param. This updates the SDK server's per-session model without replaying
the startup prompt. Errors on the resume path are swallowed so the
hot-swap still succeeds: the SDK server already fell back to the
config-loaded model at startup, and that config had been updated.
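
A minimal sketch of the fresh vs. resumed split described above, under
stated assumptions: the SessionClient interface, the PromptRequest shape
and the option names are hypothetical stand-ins, since the real SDK types
are not shown in this commit. Only the function name applyModelToSession,
the noReply flag and the swallowed-error behaviour come from the message.

```typescript
// Hypothetical request/client shapes; the real SDK client is not shown in this PR.
type PromptRequest = {
  sessionId: string;
  model: string;
  noReply?: boolean;
  text?: string;
};

interface SessionClient {
  prompt(req: PromptRequest): Promise<void>;
}

type ApplyModelOptions = {
  sessionId: string;
  model: string;
  resumed: boolean;
  startupPrompt: string;
};

async function applyModelToSession(
  client: SessionClient,
  opts: ApplyModelOptions,
): Promise<void> {
  if (!opts.resumed) {
    // Fresh session: the startup prompt carries the model with it.
    await client.prompt({
      sessionId: opts.sessionId,
      model: opts.model,
      text: opts.startupPrompt,
    });
    return;
  }
  try {
    // Resumed session: send only the model, with noReply so that no
    // startup prompt is replayed and no new turn is requested.
    await client.prompt({
      sessionId: opts.sessionId,
      model: opts.model,
      noReply: true,
    });
  } catch (err) {
    // Swallowed on purpose: the restarted server already loaded the
    // updated model from its config, so the hot-swap can still succeed.
    console.warn("applyModelToSession: model push on resume failed", err);
  }
}
```

Passing the client in as a parameter is a choice made for this sketch so
that both paths can be exercised with a fake in a container test.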

Add container tests covering both fresh and resumed paths.

Co-authored-by: John Fawcett <john@kilcoode.ai>
…vents (#3047)

Two independent bugs compose to flood production logs every alarm tick
with 'Bead <id> not found' errors:

1. deleteBead / deleteBeads did not clean up the town_events queue,
   leaving bead_cancelled and container_status rows pointing at deleted
   beads/agents.
2. applyEvent threw on missing beads and the drain loop never marked
   the failing event processed — so it retried forever.

Fix 1: purge town_events rows (by bead_id OR agent_id, since agents are
beads) from deleteBead and the deleteBeads bulk path.

Fix 2a: reconciler.applyEvent('bead_cancelled') checks for the target
bead up front and returns (with a warn) when it's missing, instead of
throwing.

Fix 2b: the Town.do.ts drain loop recognises 'Bead/Agent <uuid> not
found' terminal errors, logs them at warn, and marks the offending
event processed so it stops retrying.
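
Fix 2b can be sketched roughly as follows. This is an illustration, not
the Town.do.ts code: the TownEvent shape, the in-memory event array and
the names isTerminalError and drainEvents are assumptions made for the
example. Only the 'Bead/Agent <uuid> not found' message format and the
"mark processed on terminal error" behaviour come from the text above.

```typescript
// Hypothetical event shape; the real town_events row has more columns.
type TownEvent = { id: number; type: string; processed: boolean };

// Matches 'Bead <uuid> not found' or 'Agent <uuid> not found'.
const TERMINAL_NOT_FOUND = /\b(Bead|Agent) [0-9a-f-]{36} not found\b/i;

function isTerminalError(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  return TERMINAL_NOT_FOUND.test(message);
}

function drainEvents(
  events: TownEvent[],
  applyEvent: (event: TownEvent) => void,
): void {
  for (const event of events) {
    if (event.processed) continue;
    try {
      applyEvent(event);
      event.processed = true;
    } catch (err) {
      if (isTerminalError(err)) {
        // The target is gone for good, so retrying can never succeed.
        console.warn(`event ${event.id}: terminal error, marking processed`);
        event.processed = true;
      } else {
        // Possibly transient: leave unprocessed so the next tick retries.
        console.error(`event ${event.id}: applyEvent failed, will retry`);
      }
    }
  }
}
```

Matching on the message text is brittle if the wording ever changes; a
typed error class would be a sturdier signal, though the commit describes
string recognition.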

Adds debug RPCs (debugTownEvents, debugInsertTownEvent,
debugRecordContainerStatus) and integration coverage in
event-cleanup.test.ts.

Co-authored-by: John Fawcett <john@kilcoode.ai>
…#3055)

* feat(gastown-container): add crash visibility + per-agent start mutex

Diagnostic changes to investigate frequent container restarts for town
4d82f099-ccb7-4eaf-8676-73562e0a27eb (~1.5–2 min boot-hydration loops).

- main.ts: add unhandledRejection listener that logs full error/stack
  without exiting (Bun/Node silently drop rejections without a handler,
  making fire-and-forget failures like void saveDbSnapshot()/void
  subscribeToEvents() invisible). Include uptime and active-agent count
  for correlation.
- main.ts: improve uncaughtException log with name/uptime/agent count.
- main.ts: 30s periodic container.memory_usage log (rss/heap/external)
  so OOM-class failures (external SIGKILL from Cloudflare Containers
  runtime when the memory ceiling is hit) become observable — these
  leave no exception behind.
- main.ts: wrap bootHydration() in try/catch so a rare synchronous throw
  before the first await doesn't crash the process.
- process-manager.ts: add per-agentId mutex for startAgent. Production
  logs show two /agents/start requests for the same agentId logged at
  the same millisecond; both pass the re-entrancy check before either
  commits a 'starting' record, then race on startupAbortController,
  session creation, idle timers, and SDK sessionCount. Serialising
  per agentId makes the re-entrant path observe a consistent snapshot.
- process-manager.test.ts: three tests for the mutex — same-id
  serialisation, different-id concurrency, lock release on throw.
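
The per-agentId mutex in the list above could look something like the
sketch below. It is an assumption-laden illustration: agentLocks and
withAgentLock are hypothetical names, and the real startAgent lock in
process-manager.ts may be structured differently. It shows the three
properties the tests are said to cover: same-id serialisation,
different-id concurrency, and lock release on throw.

```typescript
// Hypothetical names; one pending "release" promise per agentId.
const agentLocks = new Map<string, Promise<void>>();

async function withAgentLock<T>(
  agentId: string,
  fn: () => Promise<T>,
): Promise<T> {
  // Whoever currently holds (or is queued for) this agentId's lock.
  const previous = agentLocks.get(agentId) ?? Promise.resolve();

  // Explicit new Promise rather than Promise.withResolvers.
  let release: () => void = () => {};
  const current = new Promise<void>((resolve) => {
    release = resolve;
  });

  // Register before awaiting, so a caller arriving in the same tick
  // queues behind this one instead of racing it.
  agentLocks.set(agentId, current);
  await previous;

  try {
    return await fn();
  } finally {
    // Always release, even when fn throws.
    release();
    // Drop the entry if nobody queued behind us.
    if (agentLocks.get(agentId) === current) agentLocks.delete(agentId);
  }
}
```

Wrapping the body of a start routine as
withAgentLock(agentId, () => doStart(agentId)), where doStart is a
placeholder, would make two same-millisecond requests for one agent run
back to back while requests for other agents proceed in parallel.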

* fix(container): replace Promise.withResolvers with explicit new Promise

Promise.withResolvers is a newer API not available on older Bun
runtimes. Since process-manager.ts is imported during container
startup, a missing global would throw before crash handlers are
registered and prevent the control server from starting. Use the
same explicit new Promise pattern as the existing sdkServerLock.
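
For illustration, the explicit new Promise pattern that stands in for
Promise.withResolvers can be written as a small helper. The names
Deferred and createDeferred are hypothetical, and the existing
sdkServerLock code is not reproduced here; this only shows the general
shape of the replacement.

```typescript
// Same three values that Promise.withResolvers() would return.
type Deferred<T> = {
  promise: Promise<T>;
  resolve: (value: T | PromiseLike<T>) => void;
  reject: (reason?: unknown) => void;
};

function createDeferred<T>(): Deferred<T> {
  // The executor runs synchronously, so both are assigned before return.
  let resolve!: Deferred<T>["resolve"];
  let reject!: Deferred<T>["reject"];
  const promise = new Promise<T>((res, rej) => {
    resolve = res;
    reject = rej;
  });
  return { promise, resolve, reject };
}
```

Because this relies only on the Promise constructor, importing a module
that uses it cannot fail on a runtime that lacks the newer global.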

* feat(gastown/container): include townId in crash and memory logs

Per review feedback, attach the container's GASTOWN_TOWN_ID to
unhandled_rejection, uncaught_exception, cold_start, memory_usage,
and boot_hydration_failed log entries so production crash logs can
be correlated with a specific town without needing to also have an
agent registered.
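
A rough sketch of a crash log entry that carries the town id, assuming a
Node-compatible process global. The CrashLog field names and the helpers
buildCrashLog and installCrashLogging are hypothetical; only the event
names, the GASTOWN_TOWN_ID variable and the log-without-exiting behaviour
are taken from the commit messages above.

```typescript
// Hypothetical log shape; the real entries may carry more fields.
type CrashLog = {
  event: "unhandled_rejection" | "uncaught_exception";
  townId: string | null;
  name: string;
  message: string;
  stack: string | null;
  uptimeSec: number;
};

function buildCrashLog(
  event: CrashLog["event"],
  reason: unknown,
  townId: string | undefined,
  uptimeSec: number,
): CrashLog {
  // Rejections can carry any value, so normalise to an Error first.
  const err = reason instanceof Error ? reason : new Error(String(reason));
  return {
    event,
    townId: townId ?? null,
    name: err.name,
    message: err.message,
    stack: err.stack ?? null,
    uptimeSec: Math.round(uptimeSec),
  };
}

function installCrashLogging(): void {
  process.on("unhandledRejection", (reason) => {
    // Log only; deliberately no process.exit() here.
    const entry = buildCrashLog(
      "unhandled_rejection",
      reason,
      process.env.GASTOWN_TOWN_ID,
      process.uptime(),
    );
    console.error(JSON.stringify(entry));
  });
}
```

Reading the town id from the environment rather than from a registered
agent means a crash during boot, before any agent exists, can still be
attributed to a town.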

---------

Co-authored-by: John Fawcett <john@kilcoode.ai>
chore(gastown): fix format and lint CI failures on staging

Co-authored-by: John Fawcett <john@kilcoode.ai>
…dler (#3074)

Co-authored-by: John Fawcett <john@kilcoode.ai>