
feat(router): capture OOMKilled + exit reason on worker exits#1193

Merged
zbigniewsobiecki merged 1 commit into dev from feat/worker-exit-diagnostics
Apr 25, 2026
Conversation

@zbigniewsobiecki
Member

Summary

Replaces the opaque Worker crashed with exit code N signal with a
structured, grep-stable diagnostic — captured from Docker's own
container.inspect() immediately after wait(), before AutoRemove
reaps the container.

Why

Investigating ucho's recent exit-137 runs required ssh access to
the bauer host plus `grep oom-kill /var/log/syslog`, because cascade
itself had collapsed every kind of failure (cgroup OOM, internal
SIGKILL, codex CLI crash, runtime abort) into the same generic
message. The 30-minute consistency in those failures looked like a
timer, and the 8 GB memory ceiling could plausibly OOM, but the run
records told us nothing — there was no signal in cascade to
distinguish the cases. This PR captures the signal at the source so
the next failure self-explains.

What changes

  • cascade runs show <id> --json now returns a structured error:

    Worker crashed with exit code 137 · OOMKilled=true · reason="Out of memory"
    

    OOMKilled=true is the definitive cgroup-OOM signal (Docker's
    own State.OOMKilled). A 137 exit without it ⇒ the kill came
    from inside the container or from a non-cgroup signal — not memory.

  • [WorkerManager] Resolved spawn settings is logged at every spawn
    with both projectWatchdogTimeoutMs and globalWorkerTimeoutMs
    so the "did the per-project override actually win?" question is
    one log query away. (This was a real load-bearing unknown during
    the ucho investigation — config said 45 min, behavior said 30 min.)

  • durationMs reported from Docker's StartedAt → FinishedAt clocks
    rather than cascade's wall-clock — and defended against Docker's
    0001-01-01 sentinel timestamps that would otherwise leak NaN
    durations into Sentry / dashboards.
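The sentinel guard described above can be sketched roughly as follows. This is a minimal illustration under assumed names (`computeDurationMs` is hypothetical, not the helper from this PR); it only shows the idea of rejecting Docker's `0001-01-01` sentinel, malformed strings, and inverted spans so `NaN` never escapes:

```typescript
// Sketch: derive a duration from Docker State timestamps, guarding the
// "0001-01-01" sentinel Docker reports for containers that never ran.
const DOCKER_SENTINEL_PREFIX = "0001-01-01";

export function computeDurationMs(
  startedAt?: string,
  finishedAt?: string,
): number | undefined {
  if (!startedAt || !finishedAt) return undefined;
  if (
    startedAt.startsWith(DOCKER_SENTINEL_PREFIX) ||
    finishedAt.startsWith(DOCKER_SENTINEL_PREFIX)
  ) {
    return undefined;
  }
  const start = Date.parse(startedAt);
  const end = Date.parse(finishedAt);
  // Malformed timestamps parse to NaN; inverted spans are rejected too,
  // so neither NaN nor negative durations reach Sentry / dashboards.
  if (Number.isNaN(start) || Number.isNaN(end) || end < start) {
    return undefined;
  }
  return end - start;
}
```

Returning `undefined` rather than clamping keeps the absence of a trustworthy duration visible downstream.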

Surfaces

| File | Change |
| --- | --- |
| src/router/active-workers.ts | New ExitDetails type + exported formatCrashReason(exitCode, details?). cleanupWorker(jobId, exitCode?, details?) packs diagnostics into the failOrphanedRun reason. Backwards-compatible — callers without details produce the bare message. |
| src/router/container-manager.ts | inspectExitedContainer() reads Docker State post-wait. logWorkerTail() and onWorkerExit() extracted from the wait() callback. resolveSpawnSettings() emits the structured spawn-settings log. |
| CLAUDE.md | New "Worker exit diagnostics" paragraph documenting the format and load-bearing logs. |
| AGENTS.md | Symlinked to CLAUDE.md. The previous untracked copy was a stale, broken sibling (Codex setup-token, ~/.Codex.json artefacts from a botched search-replace; missing concurrency-lock + post-completion-review-dispatch sections). One source of truth. |

Tests

TDD discipline — all new behavior covered before / together with
the implementation. 30 new tests:

  • tests/unit/router/active-workers.test.ts — +6 cases pinning
    OOMKilled / exitReason / combined into the failOrphanedRun and
    failOrphanedRunFallback reason strings; back-compat (legacy
    callers) preserved.
  • tests/unit/router/container-manager-diagnostics.test.ts (new) — 24 cases across three suites:
    • formatCrashReason (8) — every permutation + grep-stability
      regressions for the · separator and OOMKilled=(true|false)
      marker. The format is now de-facto API; future bumps fail CI.
    • inspectExitedContainer (12) — OOMKilled true/false, exitReason
      extraction, durationMs from real / sentinel / malformed / missing
      / inverted timestamps, inspect-rejection (best-effort: warns +
      returns all-undefined, never throws).
    • resolveSpawnSettings (4) — project watchdog override applied
      (45+2=47 min); no override falls to global default; null
      projectId skips the log + skips loadProjectConfig;
      Math.round on non-integer minutes.
  • tests/unit/router/snapshot-integration.test.ts — setupMockContainer
    stubs inspect() so production code doesn't have to defend against
    a test-mock gap.

Checklist

  • npm run typecheck clean.
  • npx biome check . — 1226 files, 0 issues.
  • npx vitest run --project unit-api — 1429/1429 passed (was 1399 before, +30 new tests).
  • Lefthook pre-commit (lint + typecheck + auth-header provenance) and pre-push (test:fast = 156 tests) green.
  • Backwards-compat: cleanupWorker(jobId, exitCode) callers without details still produce the bare crash message.
  • Manual: trigger one ucho implementation run after merge, confirm cascade runs show <id> --json returns the new structured error.

Test plan

  • Format-stability regression tests pin OOMKilled=… and · so future format drift fails CI.
  • inspectExitedContainer tested for daemon-socket-drop (rejection path) — still completes the post-exit pipeline.
  • Smoke-test on a CASCADE worker post-merge: trigger a run that exits non-zero, confirm cascade runs logs <id> shows the new Resolved spawn settings line at spawn and the Worker exited line includes oomKilled / exitReason / durationMs.
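The grep-stability idea can be illustrated with a small check (illustrative only; `isGrepStable` is not a helper from this PR, and the regex is an assumption about the format pinned by the real tests):

```typescript
// Sketch: a format-stability check pinning the grep-stable markers —
// the " · " separator and the OOMKilled=(true|false) flag — so any
// silent drift in the crash-reason format would fail an assertion.
export function isGrepStable(reason: string): boolean {
  const pattern =
    /^Worker crashed with exit code \d+( · OOMKilled=(true|false))?( · reason=".*")?$/;
  return pattern.test(reason);
}
```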

🤖 Generated with Claude Code

When a worker container exits non-zero, the router previously logged
only `statusCode` and stamped the run record's `error` field with the
generic `Worker crashed with exit code N`. That collapsed every kind
of failure — cgroup OOM, internal SIGKILL, codex CLI crash, runtime
abort — into the same opaque message. Investigating ucho's recent
exit-137 runs required ssh + syslog grep on the bauer host because
cascade itself had discarded the diagnostic signal.

This commit captures the signal at the source: `container.inspect()`
runs immediately after `wait()` (before AutoRemove or our manual
`removeContainer` reaps the container) and pulls `State.OOMKilled`,
`State.Error`, and the actual `StartedAt → FinishedAt` duration from
Docker's own clocks. The crash reason on the run record is now a
structured, grep-stable string:

    Worker crashed with exit code 137 · OOMKilled=true · reason="Out of memory"

`OOMKilled=true` is the *definitive* cgroup-OOM signal; a 137 exit
*without* it means the kill came from inside the container or from a
non-cgroup signal, not memory. Future post-mortems get the answer
from `cascade runs show <id> --json` instead of the host's syslog.
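A minimal sketch of how such a string can be assembled. The real `formatCrashReason` lives in `src/router/active-workers.ts`; the `ExitDetails` shape and field names below are assumptions for illustration:

```typescript
// Sketch of the grep-stable crash-reason format. Fields are appended
// with a " · " separator; absent diagnostics keep the bare message.
interface ExitDetails {
  oomKilled?: boolean; // Docker State.OOMKilled
  exitReason?: string; // Docker State.Error
}

export function formatCrashReason(
  exitCode: number,
  details?: ExitDetails,
): string {
  let reason = `Worker crashed with exit code ${exitCode}`;
  if (details?.oomKilled !== undefined) {
    reason += ` · OOMKilled=${details.oomKilled}`;
  }
  if (details?.exitReason) {
    reason += ` · reason="${details.exitReason}"`;
  }
  return reason;
}
```

Keeping the bare message when no details are passed is what makes legacy `cleanupWorker(jobId, exitCode)` callers backwards-compatible.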

Also: `[WorkerManager] Resolved spawn settings` is now emitted at
every spawn with both `projectWatchdogTimeoutMs` and
`globalWorkerTimeoutMs` so the "did the per-project override actually
win?" question is one log query away. This was a real load-bearing
unknown during the ucho exit-137 investigation — the project config
said 45 min but production behavior matched the global 30 min env
default.
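The resolution logic can be sketched as below. All names are illustrative, and the +2-minute grace margin is inferred from the "45+2 = 47 min" test case, so it may not match the real implementation exactly:

```typescript
// Sketch: resolve the effective worker timeout. A per-project watchdog
// override (in minutes) wins over the global default; non-integer
// minutes are rounded, and an assumed 2-minute grace margin is added.
const GRACE_MINUTES = 2;

export function resolveWorkerTimeoutMs(
  globalWorkerTimeoutMs: number,
  projectWatchdogTimeoutMinutes?: number,
): number {
  if (projectWatchdogTimeoutMinutes === undefined) {
    // No project override: fall through to the global default.
    return globalWorkerTimeoutMs;
  }
  const minutes = Math.round(projectWatchdogTimeoutMinutes) + GRACE_MINUTES;
  return minutes * 60_000;
}
```

Logging both resolved values at spawn time is what turns "did the override actually win?" into a single log query.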

Surfaces:

* `src/router/active-workers.ts` — new `ExitDetails` type +
  `formatCrashReason(exitCode, details?)` helper. `cleanupWorker()`
  takes an optional third `details` arg and packs the diagnostics
  into `failOrphanedRun` / `failOrphanedRunFallback`'s reason string.
  Existing callers that don't pass details fall back to the bare
  message.
* `src/router/container-manager.ts` — `inspectExitedContainer()`
  reads Docker's `State` immediately post-wait. `logWorkerTail()` and
  `onWorkerExit()` extracted from the wait callback for complexity.
  `resolveSpawnSettings()` now emits the structured `Resolved spawn
  settings` log. Defensive: malformed/sentinel timestamps yield
  `durationMs: undefined` rather than `NaN` leaking into Sentry.
* `CLAUDE.md` — new "Worker exit diagnostics" paragraph documenting
  the format and the load-bearing logs.
* `AGENTS.md` — symlinked to `CLAUDE.md`. The previous untracked
  copy was stale (broken `Codex setup-token` / `~/.Codex.json`
  artefacts from a botched search-replace, missing the work-item
  concurrency lock + post-completion review dispatch sections, and
  outdated integration text). One source of truth.

Tests:

* `tests/unit/router/active-workers.test.ts` — extended with 6 new
  cases pinning the OOMKilled / exitReason fields onto the failOrphan
  reason string (workItem path + fallback path).
* `tests/unit/router/container-manager-diagnostics.test.ts` (new) —
  24 direct tests across three suites:
  * `formatCrashReason` (8): bare/oom/reason permutations + grep-
    stability regression for the `· ` separator and `OOMKilled=…`
    marker. The format is now de-facto API for any future dashboard
    parser; bumping it silently fails CI.
  * `inspectExitedContainer` (12): OOMKilled true/false, exitReason
    extraction, durationMs from real timestamps + Docker's
    `0001-01-01` sentinel + malformed strings + missing-half +
    inverted (negative-span) timestamps + inspect-rejection path.
    Diagnostics are best-effort: rejection logs a warn and returns
    all-undefined, never throws.
  * `resolveSpawnSettings` (4): per-project watchdogTimeoutMs override
    applied → 45+2 = 47 min; no project override → falls through to
    global; null projectId → no log + no `loadProjectConfig` call;
    Math.round on non-integer minutes.
* `tests/unit/router/snapshot-integration.test.ts` — `setupMockContainer`
  now stubs `inspect()` so the post-exit pipeline runs through cleanly
  in the snapshot tests (production code no longer needs to defend
  against a test-mock gap).
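The best-effort inspect behavior described in the test list can be sketched like this. `inspectBestEffort` is a hypothetical name, the real helper also derives `durationMs` (omitted here), and the shape of the inspect result is only the subset of Docker's response this sketch needs:

```typescript
// Sketch: best-effort post-exit diagnostics. If inspect() rejects
// (e.g. daemon socket drop), log a warning and return all-undefined
// fields instead of throwing, so the post-exit pipeline still runs.
interface ExitDiagnostics {
  oomKilled?: boolean;
  exitReason?: string;
}

export async function inspectBestEffort(
  inspect: () => Promise<{ State?: { OOMKilled?: boolean; Error?: string } }>,
  warn: (msg: string) => void = console.warn,
): Promise<ExitDiagnostics> {
  try {
    const info = await inspect();
    return {
      oomKilled: info.State?.OOMKilled,
      // Normalize Docker's empty-string Error to undefined.
      exitReason: info.State?.Error || undefined,
    };
  } catch (err) {
    warn(`post-exit inspect failed: ${String(err)}`);
    return {}; // all-undefined: diagnostics are best-effort, never fatal
  }
}
```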

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@zbigniewsobiecki zbigniewsobiecki merged commit 2bbf462 into dev Apr 25, 2026
8 checks passed
@zbigniewsobiecki zbigniewsobiecki deleted the feat/worker-exit-diagnostics branch April 25, 2026 10:09
@codecov

codecov Bot commented Apr 25, 2026

Codecov Report

❌ Patch coverage is 92.15686% with 8 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/router/container-manager.ts | 91.20% | 8 Missing ⚠️ |


