feat(router): capture OOMKilled + exit reason on worker exits (#1193)
Merged — zbigniewsobiecki merged 1 commit into dev, Apr 25, 2026
When a worker container exits non-zero, the router previously logged
only `statusCode` and stamped the run record's `error` field with the
generic `Worker crashed with exit code N`. That collapsed every kind
of failure — cgroup OOM, internal SIGKILL, codex CLI crash, runtime
abort — into the same opaque message. Investigating ucho's recent
exit-137 runs required ssh + syslog grep on the bauer host because
cascade itself had discarded the diagnostic signal.
This commit captures the signal at the source: `container.inspect()`
runs immediately after `wait()` (before AutoRemove or our manual
`removeContainer` reaps the container) and pulls `State.OOMKilled`,
`State.Error`, and the actual `StartedAt → FinishedAt` duration from
Docker's own clocks. The crash reason on the run record is now a
structured, grep-stable string:
Worker crashed with exit code 137 · OOMKilled=true · reason="Out of memory"
`OOMKilled=true` is the *definitive* cgroup-OOM signal; a 137 exit
*without* it means the kill came from inside the container or from a
non-cgroup signal, not memory. Future post-mortems get the answer
from `cascade runs show <id> --json` instead of the host's syslog.
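A minimal sketch of the formatter this describes. The names `ExitDetails` and `formatCrashReason` come from the PR itself; the exact field names and implementation details are assumptions:

```typescript
// Illustrative sketch only: field names beyond those shown in the
// PR description are assumptions, not the shipped implementation.
interface ExitDetails {
  oomKilled?: boolean;
  exitReason?: string; // Docker's State.Error, when non-empty
  durationMs?: number;
}

function formatCrashReason(exitCode: number, details?: ExitDetails): string {
  const parts = [`Worker crashed with exit code ${exitCode}`];
  if (details?.oomKilled !== undefined) {
    parts.push(`OOMKilled=${details.oomKilled}`);
  }
  if (details?.exitReason) {
    parts.push(`reason="${details.exitReason}"`);
  }
  // The "· " separator is load-bearing: regression tests pin it as
  // the de-facto parsing API for any future dashboard.
  return parts.join(" · ");
}
```

With details present the string matches the example above verbatim; without them it falls back to the bare message.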
Also: `[WorkerManager] Resolved spawn settings` is now emitted at
every spawn with both `projectWatchdogTimeoutMs` and
`globalWorkerTimeoutMs` so the "did the per-project override actually
win?" question is one log query away. This was a real load-bearing
unknown during the ucho exit-137 investigation — the project config
said 45 min but production behavior matched the global 30 min env
default.
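A hedged sketch of how such a resolver might emit both values. Only `resolveSpawnSettings`, `projectWatchdogTimeoutMs`, and `globalWorkerTimeoutMs` appear in the PR; `ProjectConfig` and its field are invented here for illustration:

```typescript
// Hypothetical sketch: the real resolver loads project config itself;
// this version takes it as a parameter to stay self-contained.
interface ProjectConfig {
  watchdogTimeoutMinutes?: number; // e.g. 45 in a per-project override
}

function resolveSpawnSettings(
  project: ProjectConfig | null,
  globalWorkerTimeoutMs: number, // e.g. the 30-min env default
): { projectWatchdogTimeoutMs?: number; globalWorkerTimeoutMs: number } {
  const projectWatchdogTimeoutMs =
    project?.watchdogTimeoutMinutes !== undefined
      ? Math.round(project.watchdogTimeoutMinutes * 60_000)
      : undefined;
  if (project !== null) {
    // Emitting BOTH values at spawn answers "did the per-project
    // override actually win?" in a single log query.
    console.log("[WorkerManager] Resolved spawn settings", {
      projectWatchdogTimeoutMs,
      globalWorkerTimeoutMs,
    });
  }
  return { projectWatchdogTimeoutMs, globalWorkerTimeoutMs };
}
```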
Surfaces:
* `src/router/active-workers.ts` — new `ExitDetails` type +
`formatCrashReason(exitCode, details?)` helper. `cleanupWorker()`
takes an optional third `details` arg and packs the diagnostics
into `failOrphanedRun` / `failOrphanedRunFallback`'s reason string.
Existing callers that don't pass details fall back to the bare
message.
* `src/router/container-manager.ts` — `inspectExitedContainer()`
reads Docker's `State` immediately post-wait. `logWorkerTail()` and
`onWorkerExit()` extracted from the wait callback to reduce complexity.
`resolveSpawnSettings()` now emits the structured `Resolved spawn
settings` log. Defensive: malformed/sentinel timestamps yield
`durationMs: undefined` rather than `NaN` leaking into Sentry.
* `CLAUDE.md` — new "Worker exit diagnostics" paragraph documenting
the format and the load-bearing logs.
* `AGENTS.md` — symlinked to `CLAUDE.md`. The previous untracked
copy was stale (broken `Codex setup-token` / `~/.Codex.json`
artefacts from a botched search-replace, missing the work-item
concurrency lock + post-completion review dispatch sections, and
outdated integration text). One source of truth.
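The inspect-then-compute step can be sketched as follows. The `State` field names (`OOMKilled`, `Error`, `StartedAt`, `FinishedAt`) and the `0001-01-01` sentinel are Docker's; `durationFromState` and the simplified types are illustrative only:

```typescript
// Simplified view of Docker's inspect State; the real payload has
// many more fields.
interface DockerState {
  OOMKilled?: boolean;
  Error?: string;
  StartedAt?: string; // RFC 3339; Docker uses "0001-01-01T00:00:00Z" as a sentinel
  FinishedAt?: string;
}

function durationFromState(state: DockerState): number | undefined {
  if (!state.StartedAt || !state.FinishedAt) return undefined;
  const start = Date.parse(state.StartedAt);
  const end = Date.parse(state.FinishedAt);
  // Defensive: sentinel / malformed / inverted timestamps yield
  // undefined rather than NaN or a negative span leaking into Sentry.
  if (Number.isNaN(start) || Number.isNaN(end)) return undefined;
  if (state.StartedAt.startsWith("0001-01-01")) return undefined;
  if (end < start) return undefined;
  return end - start;
}
```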
Tests:
* `tests/unit/router/active-workers.test.ts` — extended with 6 new
cases pinning the OOMKilled / exitReason fields onto the failOrphan
reason string (workItem path + fallback path).
* `tests/unit/router/container-manager-diagnostics.test.ts` (new) —
24 direct tests across three suites:
* `formatCrashReason` (8): bare/oom/reason permutations + grep-
stability regression for the `· ` separator and `OOMKilled=…`
marker. The format is now de-facto API for any future dashboard
parser; bumping it silently fails CI.
* `inspectExitedContainer` (12): OOMKilled true/false, exitReason
extraction, durationMs from real timestamps + Docker's
`0001-01-01` sentinel + malformed strings + missing-half +
inverted (negative-span) timestamps + inspect-rejection path.
Diagnostics are best-effort: rejection logs a warn and returns
all-undefined, never throws.
* `resolveSpawnSettings` (4): per-project watchdogTimeoutMs override
applied → 45+2 = 47 min; no project override → falls through to
global; null projectId → no log + no `loadProjectConfig` call;
Math.round on non-integer minutes.
* `tests/unit/router/snapshot-integration.test.ts` — `setupMockContainer`
now stubs `inspect()` so the post-exit pipeline runs through cleanly
in the snapshot tests (production code no longer needs to defend
against a test-mock gap).
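For illustration, a stub of the shape described. The helper name `setupMockContainer` is the PR's; the literal values here are made up:

```typescript
// Hypothetical stub shape: giving the mock container an inspect()
// lets the post-exit diagnostics pipeline run through in tests.
const mockContainer = {
  wait: async () => ({ StatusCode: 137 }),
  inspect: async () => ({
    State: {
      OOMKilled: true,
      Error: "",
      StartedAt: "2026-04-25T10:00:00Z",
      FinishedAt: "2026-04-25T10:00:30Z",
    },
  }),
};
```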
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Replaces the opaque `Worker crashed with exit code N` signal with a
structured, grep-stable diagnostic — captured from Docker's own
`container.inspect()` immediately after `wait()`, before AutoRemove
reaps the container.
Why
Investigating ucho's recent exit-137 runs required ssh access to the
bauer host + `grep oom-kill /var/log/syslog`, because cascade itself
had collapsed every kind of failure (cgroup OOM, internal SIGKILL,
codex CLI crash, runtime abort) into the same generic message. The
30-min consistency in those failures looked like a timer, the 8 GB
memory ceiling could plausibly OOM, but the run records told us
nothing — there was no signal in cascade to distinguish the cases.
This PR captures the signal at the source so the next failure
self-explains.
What changes
* `cascade runs show <id> --json` now returns a structured `error`:
  `OOMKilled=true` is the definitive cgroup-OOM signal (Docker's own
  `State.OOMKilled`). A 137 exit without it ⇒ the kill came from
  inside the container or from a non-cgroup signal — not memory.
* `[WorkerManager] Resolved spawn settings` is logged at every spawn
  with both `projectWatchdogTimeoutMs` and `globalWorkerTimeoutMs` so
  the "did the per-project override actually win?" question is one
  log query away. (This was a real load-bearing unknown during the
  ucho investigation — config said 45 min, behavior said 30 min.)
* `durationMs` is reported from Docker's `StartedAt → FinishedAt`
  clocks rather than cascade's wall clock — and defended against
  Docker's `0001-01-01` sentinel timestamps that would otherwise leak
  `NaN` durations into Sentry / dashboards.
Surfaces
* `src/router/active-workers.ts` — `ExitDetails` type + exported
  `formatCrashReason(exitCode, details?)`.
  `cleanupWorker(jobId, exitCode?, details?)` packs diagnostics into
  the `failOrphanedRun` reason. Backwards-compatible — callers
  without details produce the bare message.
* `src/router/container-manager.ts` — `inspectExitedContainer()`
  reads Docker `State` post-wait. `logWorkerTail()` and
  `onWorkerExit()` extracted from the `wait()` callback.
  `resolveSpawnSettings()` emits the structured spawn-settings log.
* `CLAUDE.md` — documents the new diagnostics format.
* `AGENTS.md` — symlinked to `CLAUDE.md`. The previous untracked copy
  was a stale, broken sibling (`Codex setup-token` / `~/.Codex.json`
  artefacts from a botched search-replace; missing concurrency-lock +
  post-completion-review-dispatch sections). One source of truth.
Tests
TDD discipline — all new behavior covered before / together with the
implementation. 30 new tests:
* `tests/unit/router/active-workers.test.ts` — +6 cases pinning
  OOMKilled / exitReason / combined into the `failOrphanedRun` and
  `failOrphanedRunFallback` reason strings; back-compat (legacy
  callers) preserved.
* `tests/unit/router/container-manager-diagnostics.test.ts` (new) —
  24 cases across three suites:
  * `formatCrashReason` (8) — every permutation + grep-stability
    regressions for the `·` separator and `OOMKilled=(true|false)`
    marker. The format is now de-facto API; future bumps fail CI.
  * `inspectExitedContainer` (12) — OOMKilled true/false, exitReason
    extraction, durationMs from real / sentinel / malformed / missing
    / inverted timestamps, inspect-rejection (best-effort: warns +
    returns all-undefined, never throws).
  * `resolveSpawnSettings` (4) — project watchdog override applied
    (45+2 = 47 min); no override falls through to the global default;
    null `projectId` skips the log + skips `loadProjectConfig`;
    `Math.round` on non-integer minutes.
* `tests/unit/router/snapshot-integration.test.ts` —
  `setupMockContainer` stubs `inspect()` so production code doesn't
  have to defend against a test-mock gap.
Checklist
* `npm run typecheck` clean.
* `npx biome check .` — 1226 files, 0 issues.
* `npx vitest run --project unit-api` — 1429/1429 passed (was 1399
  before, +30 new tests).
* `cleanupWorker(jobId, exitCode)` callers without details still
  produce the bare crash message.
* `cascade runs show <id> --json` returns the new structured error.
Test plan
* Format pinned by regression tests on `OOMKilled=…` and `·` so
  future format drift fails CI.
* `inspectExitedContainer` tested for daemon-socket-drop (rejection
  path) — still completes the post-exit pipeline.
* `cascade runs logs <id>` shows the new `Resolved spawn settings`
  line at spawn, and the `Worker exited` line includes `oomKilled` /
  `exitReason` / `durationMs`.
🤖 Generated with Claude Code