
feat(gastown-container): add crash visibility + per-agent start mutex #3055

Merged

kilo-code-bot[bot] merged 3 commits into gastown-staging from gt/toast/f0491780 on May 5, 2026

Conversation


@jrf0110 (Contributor) commented on May 5, 2026

Summary

Diagnostic changes for the investigation bead: town 4d82f099-ccb7-4eaf-8676-73562e0a27eb is restarting its container every ~1.5–2 min for sustained periods. Root cause is not yet known — this PR adds the observability and safety we need to tell H1–H6 apart from production logs, plus fixes one concrete race that shows up in those logs.

main.ts — crash visibility

  • New unhandledRejection listener logs container.unhandled_rejection with full message/stack, container uptime, and active-agent count. It does not call process.exit, so visibility is the only side effect. Bun/Node silently drop rejections without a handler, making fire-and-forget failures (void saveDbSnapshot(), void subscribeToEvents(), setInterval(() => void sendHeartbeats())) invisible today.
  • Existing uncaughtException handler now also logs name, uptimeMs, and activeAgents alongside message/stack. Still fatal (process.exit(1)) — an exception escaping every try/catch is a genuine invariant break. Both handlers are sketched after this list.
  • bootHydration() is wrapped in try/catch at the call site so a rare synchronous throw before its first await doesn't crash the container.
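A minimal sketch of what the two handlers above could look like. The logger, the agent-count helper, and the exact field names are assumptions for illustration, not the actual main.ts code:

```ts
// Stand-ins for the container's real structured logger and agent registry
// (assumptions for this sketch only).
const log = (event: string, fields: Record<string, unknown>) =>
  console.log(JSON.stringify({ event, ...fields }));
const getActiveAgentCount = () => 0; // placeholder
const bootTime = Date.now();

process.on("unhandledRejection", (reason) => {
  const err = reason instanceof Error ? reason : new Error(String(reason));
  log("container.unhandled_rejection", {
    message: err.message,
    stack: err.stack,
    uptimeMs: Date.now() - bootTime,
    activeAgents: getActiveAgentCount(),
  });
  // Intentionally no process.exit(): visibility is the only side effect.
});

process.on("uncaughtException", (err) => {
  log("container.uncaught_exception", {
    name: err.name,
    message: err.message,
    stack: err.stack,
    uptimeMs: Date.now() - bootTime,
    activeAgents: getActiveAgentCount(),
  });
  process.exit(1); // still fatal: an escaped exception is an invariant break
});
```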

main.ts — OOM observability

  • 30s periodic container.memory_usage log (rss/heap/external/agents/uptime). Cadence matches the heartbeat. This is what catches H3 (memory leak + external SIGKILL) — those failures leave no stack behind.
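A sketch of the periodic log, reusing the log / getActiveAgentCount / bootTime stand-ins from the sketch above (field names and rounding are assumptions):

```ts
const MEMORY_LOG_INTERVAL_MS = 30_000; // matches the heartbeat cadence

setInterval(() => {
  const { rss, heapUsed, heapTotal, external } = process.memoryUsage();
  log("container.memory_usage", {
    rssMB: Math.round(rss / 1024 / 1024),
    heapUsedMB: Math.round(heapUsed / 1024 / 1024),
    heapTotalMB: Math.round(heapTotal / 1024 / 1024),
    externalMB: Math.round(external / 1024 / 1024),
    activeAgents: getActiveAgentCount(),
    uptimeMs: Date.now() - bootTime,
  });
}, MEMORY_LOG_INTERVAL_MS);
```

An rssMB series that grows monotonically and then restarts with no preceding exception log is the H3 signature the hypothesis table below looks for.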

process-manager.ts — per-agentId startAgent mutex (fixes H6)

  • The /agents/start log line appears twice at the same millisecond for the same agentId in production logs. Both callers pass the re-entrancy check at the top of startAgent (because neither has committed a 'starting' record yet), then race on startupAbortController, session.create(), idle timers, sdkInstance.sessionCount, and the agents map — leaving leaked sessions and a confused lifecycle.
  • Added withStartAgentLock(agentId, fn) (chained-promise mutex, same shape as the existing sdkServerLock) and wrapped the body of startAgent with it. The second concurrent caller now waits for the first to finish (or abort cleanly) before proceeding; see the sketch after this list.
  • Three unit tests cover: same-id serialisation, cross-id concurrency, and lock release on throw.
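A minimal sketch of a per-key chained-promise mutex in that shape, using an explicit new Promise for the release (the pattern the review comment later in this thread asked for, instead of Promise.withResolvers). The map name and cleanup details are assumptions; the real helper lives in process-manager.ts:

```ts
// One pending promise per agentId. Callers for the same id queue up;
// callers for different ids proceed concurrently.
const startLocks = new Map<string, Promise<void>>();

export async function withStartAgentLock<T>(
  agentId: string,
  fn: () => Promise<T>,
): Promise<T> {
  const previous = startLocks.get(agentId) ?? Promise.resolve();

  // Explicit new Promise release pattern, same shape as sdkServerLock.
  let release!: () => void;
  const current = new Promise<void>((resolve) => {
    release = resolve;
  });
  startLocks.set(agentId, current);

  await previous; // wait for any in-flight start for this agentId
  try {
    return await fn();
  } finally {
    release(); // unblock the next waiter even if fn threw
    if (startLocks.get(agentId) === current) {
      startLocks.delete(agentId); // last in line: avoid unbounded map growth
    }
  }
}
```

With startAgent wrapped in a helper like this, the second concurrent caller only runs after the first has committed (or cleanly aborted) its 'starting' record, so the existing re-entrancy check sees a consistent snapshot.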

Investigation plan (next step after this lands)

With this deployed to staging, pull 1–2 hours of logs for town 4d82f099-... and classify:

Hypothesis | Signal in logs after this PR
H1 (unhandled throw / rejection) | container.unhandled_rejection or container.uncaught_exception lines clustered right before each restart
H2 (corrupt kilo.db) | ${MANAGER_LOG} session.create failed … stale DB recovery repeating
H3 (OOM / external kill) | container.memory_usage showing rssMB monotonically growing, followed by restart with NO preceding exception log
H4 (user-triggered stop/destroy) | Absence of any container-side crash log + tRPC forceRestartContainer / destroyContainer calls on the worker side
H5 (/refresh-token loop) | High cadence of refresh_token.received logs
H6 (concurrent /agents/start race) | Previously would manifest as leaked sessions / 'failed' agents. The mutex eliminates this class, so if restarts stop after deploy, H6 was the cause

The follow-up bead/PR with the actual fix (once H1–H6 is narrowed down) is a separate deliverable — this PR is the instrumentation required to get there. The mutex is pre-emptive because the race it fixes is real and visible in the current logs regardless of whether it's the root cause of the restarts.

Verification

  • Container typecheck clean: cd services/gastown/container && pnpm typecheck
  • Container tests: pnpm test — 62 pass; 2 pre-existing JWT-mock failures in plugin/client.test.ts confirmed unrelated (reproduce on main without this patch).
  • New tests in process-manager.test.ts directly exercise withStartAgentLock (same-agentId serialisation, different-agentId concurrency, lock release on throw).
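As a rough illustration of two of those cases, assuming bun:test and an illustrative import path (the actual assertions in process-manager.test.ts may differ):

```ts
import { describe, expect, test } from "bun:test";
import { withStartAgentLock } from "./process-manager"; // path is illustrative

describe("withStartAgentLock", () => {
  test("serialises calls for the same agentId", async () => {
    const order: string[] = [];
    const first = withStartAgentLock("agent-1", async () => {
      order.push("first:start");
      await new Promise((r) => setTimeout(r, 20));
      order.push("first:end");
    });
    const second = withStartAgentLock("agent-1", async () => {
      order.push("second:start");
    });
    await Promise.all([first, second]);
    expect(order).toEqual(["first:start", "first:end", "second:start"]);
  });

  test("releases the lock when fn throws", async () => {
    await expect(
      withStartAgentLock("agent-2", async () => {
        throw new Error("boom");
      }),
    ).rejects.toThrow("boom");
    // A follow-up call for the same agentId must still run.
    expect(await withStartAgentLock("agent-2", async () => "ok")).toBe("ok");
  });
});
```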

Visual Changes

N/A — backend-only instrumentation.

Reviewer Notes

  • The unhandledRejection handler intentionally does not exit. If production shows that a specific rejection leaves the process in a wedged state, we can escalate individually; visibility first.
  • withStartAgentLock is exported so the test file can exercise it without booting the full SDK harness. The only production caller is startAgent in the same module.
  • Promise.withResolvers is Bun-native and already used in this codebase; no polyfill needed.
  • No changes to bootHydration, subscribeToEvents, or saveDbSnapshot themselves — those audits can happen after we have the crash-class data from the new logs.

Diagnostic changes to investigate frequent container restarts for town
4d82f099-ccb7-4eaf-8676-73562e0a27eb (~1.5–2 min boot-hydration loops).

- main.ts: add unhandledRejection listener that logs full error/stack
  without exiting (Bun/Node silently drop rejections without a handler,
  making fire-and-forget failures like void saveDbSnapshot()/void
  subscribeToEvents() invisible). Include uptime and active-agent count
  for correlation.
- main.ts: improve uncaughtException log with name/uptime/agent count.
- main.ts: 30s periodic container.memory_usage log (rss/heap/external)
  so OOM-class failures (external SIGKILL from Cloudflare Containers
  runtime when the memory ceiling is hit) become observable — these
  leave no exception behind.
- main.ts: wrap bootHydration() in try/catch so a rare synchronous throw
  before the first await doesn't crash the process.
- process-manager.ts: add per-agentId mutex for startAgent. Production
  logs show two /agents/start requests for the same agentId logged at
  the same millisecond; both pass the re-entrancy check before either
  commits a 'starting' record, then race on startupAbortController,
  session creation, idle timers, and SDK sessionCount. Serialising
  per agentId makes the re-entrant path observe a consistent snapshot.
- process-manager.test.ts: three tests for the mutex — same-id
  serialisation, different-id concurrency, lock release on throw.
Comment thread: services/gastown/container/src/process-manager.ts (Outdated)

kilo-code-bot (bot) commented on May 5, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Resolved Issues
File | Line | Issue
services/gastown/container/src/process-manager.ts | 93 | Resolved: Promise.withResolvers was replaced with the explicit new Promise release pattern, removing the older-Bun startup risk.
services/gastown/container/src/main.ts | 41 | Resolved: container diagnostic logs now include townId from GASTOWN_TOWN_ID.
Files Reviewed (1 file incremental, 3 files total)
  • services/gastown/container/src/main.ts
  • services/gastown/container/src/process-manager.ts
  • services/gastown/container/src/process-manager.test.ts

Reviewed by gpt-5.5-20260423 · 354,083 tokens

Promise.withResolvers is a newer API not available on older Bun
runtimes. Since process-manager.ts is imported during container
startup, a missing global would throw before crash handlers are
registered and prevent the control server from starting. Use the
same explicit new Promise pattern as the existing sdkServerLock.
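For illustration, the two shapes are equivalent; the second uses only long-standing Promise APIs (identifiers here are illustrative):

```ts
// Newer-runtime form: depends on the Promise.withResolvers global.
const { promise: gate1, resolve: release1 } = Promise.withResolvers<void>();

// Portable form, matching the existing sdkServerLock-style pattern.
let release2!: () => void;
const gate2 = new Promise<void>((resolve) => {
  release2 = resolve;
});
```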
Comment thread: services/gastown/container/src/main.ts
Per review feedback, attach the container's GASTOWN_TOWN_ID to
unhandled_rejection, uncaught_exception, cold_start, memory_usage,
and boot_hydration_failed log entries so production crash logs can
be correlated with a specific town without needing to also have an
agent registered.
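A sketch of that correlation, assuming a simple structured logger (the actual logger in main.ts may differ):

```ts
// Read the town id once at boot and attach it to every diagnostic entry,
// so a crash log can be tied to a town without an agent being registered.
const townId = process.env.GASTOWN_TOWN_ID ?? "unknown";

const log = (event: string, fields: Record<string, unknown>) =>
  console.log(JSON.stringify({ event, townId, ...fields }));

// e.g. log("container.cold_start", { uptimeMs: 0 }) now carries townId.
```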
kilo-code-bot (bot) merged commit 47e95a8 into gastown-staging on May 5, 2026
2 checks passed
kilo-code-bot (bot) deleted the gt/toast/f0491780 branch on May 5, 2026 at 20:23
