
feat(gastown-container): add crash visibility + per-agent start mutex #3055

Merged

kilo-code-bot[bot] merged 3 commits into gastown-staging from gt/toast/f0491780 on May 5, 2026

Conversation


@jrf0110 (Contributor) commented on May 5, 2026

Summary

Diagnostic changes for the investigation bead: town 4d82f099-ccb7-4eaf-8676-73562e0a27eb is restarting its container every ~1.5–2 min for sustained periods. Root cause is not yet known — this PR adds the observability and safety we need to tell H1–H6 apart from production logs, plus fixes one concrete race that shows up in those logs.

main.ts — crash visibility

  • New unhandledRejection listener logs container.unhandled_rejection with full message/stack, container uptime, and active-agent count. It does not call process.exit, so visibility is the only side effect. Bun/Node silently drop rejections without a handler, making fire-and-forget failures (void saveDbSnapshot(), void subscribeToEvents(), setInterval(() => void sendHeartbeats())) invisible today.
  • Existing uncaughtException handler now also logs name, uptimeMs, and activeAgents alongside message/stack. Still fatal (process.exit(1)) — an exception escaping every try/catch is a genuine invariant break. Both handlers are sketched after this list.
  • bootHydration() is wrapped in try/catch at the call site so a rare synchronous throw before its first await doesn't crash the container.
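A minimal sketch of what the two handlers above could look like. The logger, the agent-count helper, and the exact field names are assumptions for illustration, not the actual main.ts code:

```ts
// Stand-ins for the container's real structured logger and agent registry
// (assumptions for this sketch only).
const log = (event: string, fields: Record<string, unknown>) =>
  console.log(JSON.stringify({ event, ...fields }));
const getActiveAgentCount = () => 0; // placeholder
const bootTime = Date.now();

process.on("unhandledRejection", (reason) => {
  const err = reason instanceof Error ? reason : new Error(String(reason));
  log("container.unhandled_rejection", {
    message: err.message,
    stack: err.stack,
    uptimeMs: Date.now() - bootTime,
    activeAgents: getActiveAgentCount(),
  });
  // Intentionally no process.exit(): visibility is the only side effect.
});

process.on("uncaughtException", (err) => {
  log("container.uncaught_exception", {
    name: err.name,
    message: err.message,
    stack: err.stack,
    uptimeMs: Date.now() - bootTime,
    activeAgents: getActiveAgentCount(),
  });
  process.exit(1); // still fatal: an escaped exception is an invariant break
});
```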

main.ts — OOM observability

  • 30s periodic container.memory_usage log (rss/heap/external/agents/uptime). Cadence matches the heartbeat. This is what catches H3 (memory leak + external SIGKILL) — those failures leave no stack behind.
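A sketch of the periodic log, reusing the log / getActiveAgentCount / bootTime stand-ins from the sketch above (field names and rounding are assumptions):

```ts
const MEMORY_LOG_INTERVAL_MS = 30_000; // matches the heartbeat cadence

setInterval(() => {
  const { rss, heapUsed, heapTotal, external } = process.memoryUsage();
  log("container.memory_usage", {
    rssMB: Math.round(rss / 1024 / 1024),
    heapUsedMB: Math.round(heapUsed / 1024 / 1024),
    heapTotalMB: Math.round(heapTotal / 1024 / 1024),
    externalMB: Math.round(external / 1024 / 1024),
    activeAgents: getActiveAgentCount(),
    uptimeMs: Date.now() - bootTime,
  });
}, MEMORY_LOG_INTERVAL_MS);
```

An rssMB series that grows monotonically and then restarts with no preceding exception log is the H3 signature the hypothesis table below looks for.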

process-manager.ts — per-agentId startAgent mutex (fixes H6)

  • The /agents/start log line appears twice at the same millisecond for the same agentId in production logs. Both callers pass the re-entrancy check at the top of startAgent (because neither has committed a 'starting' record yet), then race on startupAbortController, session.create(), idle timers, sdkInstance.sessionCount, and the agents map — leaving leaked sessions and a confused lifecycle.
  • Added withStartAgentLock(agentId, fn) (chained-promise mutex, same shape as the existing sdkServerLock) and wrapped the body of startAgent with it. The second concurrent caller now waits for the first to finish (or abort cleanly) before proceeding; see the sketch after this list.
  • Three unit tests cover: same-id serialisation, cross-id concurrency, and lock release on throw.
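A minimal sketch of a per-key chained-promise mutex in that shape, using an explicit new Promise for the release (the pattern the review comment later in this thread asked for, instead of Promise.withResolvers). The map name and cleanup details are assumptions; the real helper lives in process-manager.ts:

```ts
// One pending promise per agentId. Callers for the same id queue up;
// callers for different ids proceed concurrently.
const startLocks = new Map<string, Promise<void>>();

export async function withStartAgentLock<T>(
  agentId: string,
  fn: () => Promise<T>,
): Promise<T> {
  const previous = startLocks.get(agentId) ?? Promise.resolve();

  // Explicit new Promise release pattern, same shape as sdkServerLock.
  let release!: () => void;
  const current = new Promise<void>((resolve) => {
    release = resolve;
  });
  startLocks.set(agentId, current);

  await previous; // wait for any in-flight start for this agentId
  try {
    return await fn();
  } finally {
    release(); // unblock the next waiter even if fn threw
    if (startLocks.get(agentId) === current) {
      startLocks.delete(agentId); // last in line: avoid unbounded map growth
    }
  }
}
```

With startAgent wrapped in a helper like this, the second concurrent caller only runs after the first has committed (or cleanly aborted) its 'starting' record, so the existing re-entrancy check sees a consistent snapshot.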

Investigation plan (next step after this lands)

With this deployed to staging, pull 1–2 hours of logs for town 4d82f099-... and classify:

Hypothesis | Signal in logs after this PR
H1 (unhandled throw / rejection) | container.unhandled_rejection or container.uncaught_exception lines clustered right before each restart
H2 (corrupt kilo.db) | ${MANAGER_LOG} session.create failed … stale DB recovery repeating
H3 (OOM / external kill) | container.memory_usage showing rssMB monotonically growing, followed by restart with NO preceding exception log
H4 (user-triggered stop/destroy) | Absence of any container-side crash log + tRPC forceRestartContainer / destroyContainer calls on the worker side
H5 (/refresh-token loop) | High cadence of refresh_token.received logs
H6 (concurrent /agents/start race) | Previously would manifest as leaked sessions / 'failed' agents. The mutex eliminates this class, so if restarts stop after deploy, H6 was the cause

The follow-up bead/PR with the actual fix (once H1–H6 is narrowed down) is a separate deliverable — this PR is the instrumentation required to get there. The mutex is pre-emptive because the race it fixes is real and visible in the current logs regardless of whether it's the root cause of the restarts.

Verification

  • Container typecheck clean: cd services/gastown/container && pnpm typecheck
  • Container tests: pnpm test — 62 pass; 2 pre-existing JWT-mock failures in plugin/client.test.ts confirmed unrelated (reproduce on main without this patch).
  • New tests in process-manager.test.ts directly exercise withStartAgentLock (same-agentId serialisation, different-agentId concurrency, lock release on throw).
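As a rough illustration of two of those cases, assuming bun:test and an illustrative import path (the actual assertions in process-manager.test.ts may differ):

```ts
import { describe, expect, test } from "bun:test";
import { withStartAgentLock } from "./process-manager"; // path is illustrative

describe("withStartAgentLock", () => {
  test("serialises calls for the same agentId", async () => {
    const order: string[] = [];
    const first = withStartAgentLock("agent-1", async () => {
      order.push("first:start");
      await new Promise((r) => setTimeout(r, 20));
      order.push("first:end");
    });
    const second = withStartAgentLock("agent-1", async () => {
      order.push("second:start");
    });
    await Promise.all([first, second]);
    expect(order).toEqual(["first:start", "first:end", "second:start"]);
  });

  test("releases the lock when fn throws", async () => {
    await expect(
      withStartAgentLock("agent-2", async () => {
        throw new Error("boom");
      }),
    ).rejects.toThrow("boom");
    // A follow-up call for the same agentId must still run.
    expect(await withStartAgentLock("agent-2", async () => "ok")).toBe("ok");
  });
});
```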

Visual Changes

N/A — backend-only instrumentation.

Reviewer Notes

  • The unhandledRejection handler intentionally does not exit. If production shows that a specific rejection leaves the process in a wedged state, we can escalate individually; visibility first.
  • withStartAgentLock is exported so the test file can exercise it without booting the full SDK harness. The only production caller is startAgent in the same module.
  • Promise.withResolvers is Bun-native and already used in this codebase; no polyfill needed.
  • No changes to bootHydration, subscribeToEvents, or saveDbSnapshot themselves — those audits can happen after we have the crash-class data from the new logs.

Diagnostic changes to investigate frequent container restarts for town
4d82f099-ccb7-4eaf-8676-73562e0a27eb (~1.5–2 min boot-hydration loops).

- main.ts: add unhandledRejection listener that logs full error/stack
  without exiting (Bun/Node silently drop rejections without a handler,
  making fire-and-forget failures like void saveDbSnapshot()/void
  subscribeToEvents() invisible). Include uptime and active-agent count
  for correlation.
- main.ts: improve uncaughtException log with name/uptime/agent count.
- main.ts: 30s periodic container.memory_usage log (rss/heap/external)
  so OOM-class failures (external SIGKILL from Cloudflare Containers
  runtime when the memory ceiling is hit) become observable — these
  leave no exception behind.
- main.ts: wrap bootHydration() in try/catch so a rare synchronous throw
  before the first await doesn't crash the process.
- process-manager.ts: add per-agentId mutex for startAgent. Production
  logs show two /agents/start requests for the same agentId logged at
  the same millisecond; both pass the re-entrancy check before either
  commits a 'starting' record, then race on startupAbortController,
  session creation, idle timers, and SDK sessionCount. Serialising
  per agentId makes the re-entrant path observe a consistent snapshot.
- process-manager.test.ts: three tests for the mutex — same-id
  serialisation, different-id concurrency, lock release on throw.
Comment thread: services/gastown/container/src/process-manager.ts (Outdated)

kilo-code-bot (bot) commented on May 5, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Resolved Issues
File | Line | Issue
services/gastown/container/src/process-manager.ts | 93 | Resolved: Promise.withResolvers was replaced with the explicit new Promise release pattern, removing the older-Bun startup risk.
services/gastown/container/src/main.ts | 41 | Resolved: container diagnostic logs now include townId from GASTOWN_TOWN_ID.
Files Reviewed (1 file incremental, 3 files total)
  • services/gastown/container/src/main.ts
  • services/gastown/container/src/process-manager.ts
  • services/gastown/container/src/process-manager.test.ts

Reviewed by gpt-5.5-20260423 · 354,083 tokens

Promise.withResolvers is a newer API not available on older Bun
runtimes. Since process-manager.ts is imported during container
startup, a missing global would throw before crash handlers are
registered and prevent the control server from starting. Use the
same explicit new Promise pattern as the existing sdkServerLock.
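For illustration, the two shapes are equivalent; the second uses only long-standing Promise APIs (identifiers here are illustrative):

```ts
// Newer-runtime form: depends on the Promise.withResolvers global.
const { promise: gate1, resolve: release1 } = Promise.withResolvers<void>();

// Portable form, matching the existing sdkServerLock-style pattern.
let release2!: () => void;
const gate2 = new Promise<void>((resolve) => {
  release2 = resolve;
});
```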
Comment thread: services/gastown/container/src/main.ts
Per review feedback, attach the container's GASTOWN_TOWN_ID to
unhandled_rejection, uncaught_exception, cold_start, memory_usage,
and boot_hydration_failed log entries so production crash logs can
be correlated with a specific town without needing to also have an
agent registered.
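A sketch of that correlation, assuming a simple structured logger (the actual logger in main.ts may differ):

```ts
// Read the town id once at boot and attach it to every diagnostic entry,
// so a crash log can be tied to a town without an agent being registered.
const townId = process.env.GASTOWN_TOWN_ID ?? "unknown";

const log = (event: string, fields: Record<string, unknown>) =>
  console.log(JSON.stringify({ event, townId, ...fields }));

// e.g. log("container.cold_start", { uptimeMs: 0 }) now carries townId.
```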
kilo-code-bot (bot) merged commit 47e95a8 into gastown-staging on May 5, 2026
2 checks passed
kilo-code-bot (bot) deleted the gt/toast/f0491780 branch on May 5, 2026 at 20:23
