feat(gastown-container): add crash visibility + per-agent start mutex #3055
Merged
kilo-code-bot[bot] merged 3 commits into gastown-staging · May 5, 2026
Conversation
Diagnostic changes to investigate frequent container restarts for town 4d82f099-ccb7-4eaf-8676-73562e0a27eb (~1.5–2 min boot-hydration loops).

- main.ts: add an unhandledRejection listener that logs the full error/stack without exiting (Bun/Node silently drop rejections without a handler, making fire-and-forget failures like void saveDbSnapshot() / void subscribeToEvents() invisible). Includes uptime and active-agent count for correlation.
- main.ts: improve the uncaughtException log with name/uptime/agent count.
- main.ts: add a 30s periodic container.memory_usage log (rss/heap/external) so OOM-class failures (external SIGKILL from the Cloudflare Containers runtime when the memory ceiling is hit) become observable — these leave no exception behind.
- main.ts: wrap bootHydration() in try/catch so a rare synchronous throw before the first await doesn't crash the process.
- process-manager.ts: add a per-agentId mutex for startAgent. Production logs show two /agents/start requests for the same agentId logged at the same millisecond; both pass the re-entrancy check before either commits a 'starting' record, then race on startupAbortController, session creation, idle timers, and SDK sessionCount. Serialising per agentId makes the re-entrant path observe a consistent snapshot.
- process-manager.test.ts: three tests for the mutex — same-id serialisation, different-id concurrency, lock release on throw.
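The crash-visibility handlers described above might look roughly like the following. This is a sketch, not the PR's actual code: `log`, `activeAgentCount`, and `memorySnapshot` are hypothetical stand-ins for the container's real logger and agent registry.

```typescript
const startedAt = Date.now();

// Hypothetical structured logger and agent-count helpers (stand-ins for the
// container's real ones).
function log(event: string, fields: Record<string, unknown>): void {
  console.log(JSON.stringify({ event, ...fields }));
}
function activeAgentCount(): number {
  return 0; // placeholder for the real agents-map size
}

// One snapshot of the fields the periodic memory log would emit.
function memorySnapshot() {
  const m = process.memoryUsage();
  return {
    rssMB: Math.round(m.rss / 1e6),
    heapMB: Math.round(m.heapUsed / 1e6),
    externalMB: Math.round(m.external / 1e6),
    agents: activeAgentCount(),
    uptimeMs: Date.now() - startedAt,
  };
}

process.on("unhandledRejection", (reason) => {
  const err = reason instanceof Error ? reason : new Error(String(reason));
  log("container.unhandled_rejection", {
    message: err.message,
    stack: err.stack,
    uptimeMs: Date.now() - startedAt,
    activeAgents: activeAgentCount(),
  });
  // Intentionally no process.exit: visibility is the only side effect.
});

// 30s cadence so an OOM-class death (external SIGKILL) leaves a memory trail.
const interval = setInterval(
  () => log("container.memory_usage", memorySnapshot()),
  30_000,
);
// unref so the diagnostic timer never keeps the process alive on its own.
(interval as unknown as { unref?: () => void }).unref?.();
```

Note that merely registering the unhandledRejection listener changes runtime behaviour: without one, newer Node versions crash on unhandled rejections, while a listener keeps the process running.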
Contributor
Code Review Summary
Status: No Issues Found | Recommendation: Merge

Resolved Issues
Files Reviewed (1 file incremental, 3 files total)
Reviewed by gpt-5.5-20260423 · 354,083 tokens
Promise.withResolvers is a newer API not available on older Bun runtimes. Since process-manager.ts is imported during container startup, a missing global would throw before crash handlers are registered and prevent the control server from starting. Use the same explicit new Promise pattern as the existing sdkServerLock.
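The explicit new Promise pattern the reviewer asks for could look like this per-key chained-promise mutex. The names (withStartAgentLock, startLocks) follow the PR description, but this is a sketch — the actual implementation in process-manager.ts may differ.

```typescript
// One pending promise per agentId; absence means the lock is free.
const startLocks = new Map<string, Promise<void>>();

async function withStartAgentLock<T>(
  agentId: string,
  fn: () => Promise<T>,
): Promise<T> {
  // Whatever currently holds the lock (already-resolved if nobody does).
  const previous = startLocks.get(agentId) ?? Promise.resolve();

  // Explicit resolver capture — works on runtimes without Promise.withResolvers.
  let release!: () => void;
  const current = new Promise<void>((resolve) => {
    release = resolve;
  });
  startLocks.set(agentId, current);

  await previous;
  try {
    return await fn();
  } finally {
    // Release even on throw; drop the map entry only if we're still the tail,
    // so a queued waiter's lock isn't deleted out from under it.
    release();
    if (startLocks.get(agentId) === current) startLocks.delete(agentId);
  }
}
```

Because each caller chains onto the previous promise for the same agentId, a second concurrent call observes the first one's committed state before running, while calls for different agentIds proceed independently.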
jrf0110
commented
May 5, 2026
Per review feedback, attach the container's GASTOWN_TOWN_ID to unhandled_rejection, uncaught_exception, cold_start, memory_usage, and boot_hydration_failed log entries so production crash logs can be correlated with a specific town without needing to also have an agent registered.
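A minimal way to implement that feedback — illustrative only: the helper name `withTownId` and the "unknown" fallback are assumptions, not code from the PR — is to read GASTOWN_TOWN_ID from the container environment and fold it into each diagnostic entry's fields:

```typescript
// Attach the container's town ID to a log entry's fields so crash logs can be
// correlated with a town even when no agent is registered. The env var is read
// per call so late-set environments are still picked up.
function withTownId(fields: Record<string, unknown>): Record<string, unknown> {
  return { townId: process.env.GASTOWN_TOWN_ID ?? "unknown", ...fields };
}

// e.g. log("container.memory_usage", withTownId({ rssMB: 123 }));
```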
Summary
Diagnostic changes for the investigation bead: town 4d82f099-ccb7-4eaf-8676-73562e0a27eb is restarting its container every ~1.5–2 min for sustained periods. Root cause is not yet known — this PR adds the observability and safety we need to tell H1–H6 apart from production logs, plus fixes one concrete race that shows up in those logs.

main.ts — crash visibility

- An unhandledRejection listener logs container.unhandled_rejection with full message/stack, container uptime, and active-agent count. It does not call process.exit, so visibility is the only side effect. Bun/Node silently drop rejections without a handler, making fire-and-forget failures (void saveDbSnapshot(), void subscribeToEvents(), setInterval(() => void sendHeartbeats())) invisible today.
- The uncaughtException handler now also logs name, uptimeMs, and activeAgents alongside message/stack. Still fatal (process.exit(1)) — an exception escaping every try/catch is a genuine invariant break.
- bootHydration() is wrapped in try/catch at the call site so a rare synchronous throw before its first await doesn't crash the container.

main.ts — OOM observability

- A container.memory_usage log (rss/heap/external/agents/uptime). Cadence matches the heartbeat. This is what catches H3 (memory leak + external SIGKILL) — those failures leave no stack behind.

process-manager.ts — per-agentId startAgent mutex (fixes H6)

- The /agents/start log line appears twice at the same millisecond for the same agentId in production logs. Both callers pass the re-entrancy check at the top of startAgent (because neither has committed a 'starting' record yet), then race on startupAbortController, session.create(), idle timers, sdkInstance.sessionCount, and the agents map — leaving leaked sessions and a confused lifecycle.
- Added withStartAgentLock(agentId, fn) (a chained-promise mutex, same shape as the existing sdkServerLock) and wrapped the body of startAgent with it. The second concurrent caller now waits for the first to finish (or abort cleanly) before proceeding.

Investigation plan (next step after this lands)
With this deployed to staging, pull 1–2 hours of logs for town 4d82f099-... and classify:

- container.unhandled_rejection or container.uncaught_exception lines clustered right before each restart
- ${MANAGER_LOG} session.create failed … stale DB recovery repeating
- container.memory_usage showing rssMB monotonically growing, followed by a restart with no preceding exception log
- forceRestartContainer / destroyContainer calls on the worker side
- refresh_token.received logs

The follow-up bead/PR with the actual fix (once H1–H6 is narrowed down) is a separate deliverable — this PR is the instrumentation required to get there. The mutex is pre-emptive because the race it fixes is real and visible in the current logs regardless of whether it's the root cause of the restarts.
Verification
- cd services/gastown/container && pnpm typecheck ✅
- pnpm test — 62 pass; 2 pre-existing JWT-mock failures in plugin/client.test.ts confirmed unrelated (they reproduce on main without this patch).
- process-manager.test.ts directly exercises withStartAgentLock (same-agentId serialisation, different-agentId concurrency, lock release on throw).

Visual Changes
N/A — backend-only instrumentation.
Reviewer Notes
- The unhandledRejection handler intentionally does not exit. If production shows that a specific rejection leaves the process in a wedged state, we can escalate individually; visibility first.
- withStartAgentLock is exported so the test file can exercise it without booting the full SDK harness. The only production caller is startAgent in the same module.
- Promise.withResolvers is Bun-native and already used in this codebase; no polyfill needed.
- bootHydration, subscribeToEvents, and saveDbSnapshot themselves are not audited here — those audits can happen after we have the crash-class data from the new logs.