
feat(router): capture OOMKilled + exit reason on worker exits#1193

Merged
zbigniewsobiecki merged 1 commit into dev from feat/worker-exit-diagnostics
Apr 25, 2026
Conversation

@zbigniewsobiecki
Member

Summary

Replaces the opaque Worker crashed with exit code N signal with a
structured, grep-stable diagnostic — captured from Docker's own
container.inspect() immediately after wait(), before AutoRemove
reaps the container.

Why

Investigating ucho's recent exit-137 runs required ssh access to
the bauer host plus `grep oom-kill /var/log/syslog`, because cascade
itself had collapsed every kind of failure (cgroup OOM, internal
SIGKILL, codex CLI crash, runtime abort) into the same generic
message. The 30-minute consistency in those failures looked like a
timer, and the 8 GB memory ceiling could plausibly OOM, but the run
records told us nothing — there was no signal in cascade to
distinguish the cases. This PR captures the signal at the source so
the next failure self-explains.

What changes

  • cascade runs show <id> --json now returns a structured error:

    Worker crashed with exit code 137 · OOMKilled=true · reason="Out of memory"
    

    OOMKilled=true is the definitive cgroup-OOM signal (Docker's
    own State.OOMKilled). A 137 exit without it ⇒ the kill came
    from inside the container or from a non-cgroup signal — not memory.

  • [WorkerManager] Resolved spawn settings is logged at every spawn
    with both projectWatchdogTimeoutMs and globalWorkerTimeoutMs
    so the "did the per-project override actually win?" question is
    one log query away. (This was a real load-bearing unknown during
    the ucho investigation — config said 45 min, behavior said 30 min.)

  • durationMs reported from Docker's StartedAt → FinishedAt clocks
    rather than cascade's wall-clock — and defended against Docker's
    0001-01-01 sentinel timestamps that would otherwise leak NaN
    durations into Sentry / dashboards.
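The sentinel guard described above can be sketched roughly as follows. This is a minimal illustration under assumed names (`computeDurationMs` is hypothetical, not the helper from this PR); it only shows the idea of rejecting Docker's `0001-01-01` sentinel, malformed strings, and inverted spans so `NaN` never escapes:

```typescript
// Sketch: derive a duration from Docker State timestamps, guarding the
// "0001-01-01" sentinel Docker reports for containers that never ran.
const DOCKER_SENTINEL_PREFIX = "0001-01-01";

export function computeDurationMs(
  startedAt?: string,
  finishedAt?: string,
): number | undefined {
  if (!startedAt || !finishedAt) return undefined;
  if (
    startedAt.startsWith(DOCKER_SENTINEL_PREFIX) ||
    finishedAt.startsWith(DOCKER_SENTINEL_PREFIX)
  ) {
    return undefined;
  }
  const start = Date.parse(startedAt);
  const end = Date.parse(finishedAt);
  // Malformed timestamps parse to NaN; inverted spans are rejected too,
  // so neither NaN nor negative durations reach Sentry / dashboards.
  if (Number.isNaN(start) || Number.isNaN(end) || end < start) {
    return undefined;
  }
  return end - start;
}
```

Returning `undefined` rather than clamping keeps the absence of a trustworthy duration visible downstream.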

Surfaces

| File | Change |
| --- | --- |
| src/router/active-workers.ts | New ExitDetails type + exported formatCrashReason(exitCode, details?). cleanupWorker(jobId, exitCode?, details?) packs diagnostics into the failOrphanedRun reason. Backwards-compatible — callers without details produce the bare message. |
| src/router/container-manager.ts | inspectExitedContainer() reads Docker State post-wait. logWorkerTail() and onWorkerExit() extracted from the wait() callback. resolveSpawnSettings() emits the structured spawn-settings log. |
| CLAUDE.md | New "Worker exit diagnostics" paragraph documenting the format and load-bearing logs. |
| AGENTS.md | Symlinked to CLAUDE.md. The previous untracked copy was a stale, broken sibling (Codex setup-token, ~/.Codex.json artefacts from a botched search-replace; missing concurrency-lock + post-completion-review-dispatch sections). One source of truth. |

Tests

TDD discipline — all new behavior covered before / together with
the implementation. 30 new tests:

  • tests/unit/router/active-workers.test.ts — +6 cases pinning
    OOMKilled / exitReason / combined into the failOrphanedRun and
    failOrphanedRunFallback reason strings; back-compat (legacy
    callers) preserved.
  • tests/unit/router/container-manager-diagnostics.test.ts (new) — 24 cases across three suites:
    • formatCrashReason (8) — every permutation + grep-stability
      regressions for the · separator and OOMKilled=(true|false)
      marker. The format is now de-facto API; future bumps fail CI.
    • inspectExitedContainer (12) — OOMKilled true/false, exitReason
      extraction, durationMs from real / sentinel / malformed / missing
      / inverted timestamps, inspect-rejection (best-effort: warns +
      returns all-undefined, never throws).
    • resolveSpawnSettings (4) — project watchdog override applied
      (45+2=47 min); no override falls to global default; null
      projectId skips the log + skips loadProjectConfig;
      Math.round on non-integer minutes.
  • tests/unit/router/snapshot-integration.test.ts — setupMockContainer
    stubs inspect() so production code doesn't have to defend against
    a test-mock gap.

Checklist

  • npm run typecheck clean.
  • npx biome check . — 1226 files, 0 issues.
  • npx vitest run --project unit-api — 1429/1429 passed (was 1399 before, +30 new tests).
  • Lefthook pre-commit (lint + typecheck + auth-header provenance) and pre-push (test:fast = 156 tests) green.
  • Backwards-compat: cleanupWorker(jobId, exitCode) callers without details still produce the bare crash message.
  • Manual: trigger one ucho implementation run after merge, confirm cascade runs show <id> --json returns the new structured error.

Test plan

  • Format-stability regression tests pin OOMKilled=… and · so future format drift fails CI.
  • inspectExitedContainer tested for daemon-socket-drop (rejection path) — still completes the post-exit pipeline.
  • Smoke-test on a CASCADE worker post-merge: trigger a run that exits non-zero, confirm cascade runs logs <id> shows the new Resolved spawn settings line at spawn and the Worker exited line includes oomKilled / exitReason / durationMs.
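The grep-stability idea can be illustrated with a small check (illustrative only; `isGrepStable` is not a helper from this PR, and the regex is an assumption about the format pinned by the real tests):

```typescript
// Sketch: a format-stability check pinning the grep-stable markers —
// the " · " separator and the OOMKilled=(true|false) flag — so any
// silent drift in the crash-reason format would fail an assertion.
export function isGrepStable(reason: string): boolean {
  const pattern =
    /^Worker crashed with exit code \d+( · OOMKilled=(true|false))?( · reason=".*")?$/;
  return pattern.test(reason);
}
```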

🤖 Generated with Claude Code

When a worker container exits non-zero, the router previously logged
only `statusCode` and stamped the run record's `error` field with the
generic `Worker crashed with exit code N`. That collapsed every kind
of failure — cgroup OOM, internal SIGKILL, codex CLI crash, runtime
abort — into the same opaque message. Investigating ucho's recent
exit-137 runs required ssh + syslog grep on the bauer host because
cascade itself had discarded the diagnostic signal.

This commit captures the signal at the source: `container.inspect()`
runs immediately after `wait()` (before AutoRemove or our manual
`removeContainer` reaps the container) and pulls `State.OOMKilled`,
`State.Error`, and the actual `StartedAt → FinishedAt` duration from
Docker's own clocks. The crash reason on the run record is now a
structured, grep-stable string:

    Worker crashed with exit code 137 · OOMKilled=true · reason="Out of memory"

`OOMKilled=true` is the *definitive* cgroup-OOM signal; a 137 exit
*without* it means the kill came from inside the container or from a
non-cgroup signal, not memory. Future post-mortems get the answer
from `cascade runs show <id> --json` instead of the host's syslog.
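A minimal sketch of how such a string can be assembled. The real `formatCrashReason` lives in `src/router/active-workers.ts`; the `ExitDetails` shape and field names below are assumptions for illustration:

```typescript
// Sketch of the grep-stable crash-reason format. Fields are appended
// with a " · " separator; absent diagnostics keep the bare message.
interface ExitDetails {
  oomKilled?: boolean; // Docker State.OOMKilled
  exitReason?: string; // Docker State.Error
}

export function formatCrashReason(
  exitCode: number,
  details?: ExitDetails,
): string {
  let reason = `Worker crashed with exit code ${exitCode}`;
  if (details?.oomKilled !== undefined) {
    reason += ` · OOMKilled=${details.oomKilled}`;
  }
  if (details?.exitReason) {
    reason += ` · reason="${details.exitReason}"`;
  }
  return reason;
}
```

Keeping the bare message when no details are passed is what makes legacy `cleanupWorker(jobId, exitCode)` callers backwards-compatible.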

Also: `[WorkerManager] Resolved spawn settings` is now emitted at
every spawn with both `projectWatchdogTimeoutMs` and
`globalWorkerTimeoutMs` so the "did the per-project override actually
win?" question is one log query away. This was a real load-bearing
unknown during the ucho exit-137 investigation — the project config
said 45 min but production behavior matched the global 30 min env
default.
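The resolution logic can be sketched as below. All names are illustrative, and the +2-minute grace margin is inferred from the "45+2 = 47 min" test case, so it may not match the real implementation exactly:

```typescript
// Sketch: resolve the effective worker timeout. A per-project watchdog
// override (in minutes) wins over the global default; non-integer
// minutes are rounded, and an assumed 2-minute grace margin is added.
const GRACE_MINUTES = 2;

export function resolveWorkerTimeoutMs(
  globalWorkerTimeoutMs: number,
  projectWatchdogTimeoutMinutes?: number,
): number {
  if (projectWatchdogTimeoutMinutes === undefined) {
    // No project override: fall through to the global default.
    return globalWorkerTimeoutMs;
  }
  const minutes = Math.round(projectWatchdogTimeoutMinutes) + GRACE_MINUTES;
  return minutes * 60_000;
}
```

Logging both resolved values at spawn time is what turns "did the override actually win?" into a single log query.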

Surfaces:

* `src/router/active-workers.ts` — new `ExitDetails` type +
  `formatCrashReason(exitCode, details?)` helper. `cleanupWorker()`
  takes an optional third `details` arg and packs the diagnostics
  into `failOrphanedRun` / `failOrphanedRunFallback`'s reason string.
  Existing callers that don't pass details fall back to the bare
  message.
* `src/router/container-manager.ts` — `inspectExitedContainer()`
  reads Docker's `State` immediately post-wait. `logWorkerTail()` and
  `onWorkerExit()` extracted from the wait callback for complexity.
  `resolveSpawnSettings()` now emits the structured `Resolved spawn
  settings` log. Defensive: malformed/sentinel timestamps yield
  `durationMs: undefined` rather than `NaN` leaking into Sentry.
* `CLAUDE.md` — new "Worker exit diagnostics" paragraph documenting
  the format and the load-bearing logs.
* `AGENTS.md` — symlinked to `CLAUDE.md`. The previous untracked
  copy was stale (broken `Codex setup-token` / `~/.Codex.json`
  artefacts from a botched search-replace, missing the work-item
  concurrency lock + post-completion review dispatch sections, and
  outdated integration text). One source of truth.

Tests:

* `tests/unit/router/active-workers.test.ts` — extended with 6 new
  cases pinning the OOMKilled / exitReason fields onto the failOrphan
  reason string (workItem path + fallback path).
* `tests/unit/router/container-manager-diagnostics.test.ts` (new) —
  24 direct tests across three suites:
  * `formatCrashReason` (8): bare/oom/reason permutations + grep-
    stability regression for the `· ` separator and `OOMKilled=…`
    marker. The format is now de-facto API for any future dashboard
    parser; bumping it silently fails CI.
  * `inspectExitedContainer` (12): OOMKilled true/false, exitReason
    extraction, durationMs from real timestamps + Docker's
    `0001-01-01` sentinel + malformed strings + missing-half +
    inverted (negative-span) timestamps + inspect-rejection path.
    Diagnostics are best-effort: rejection logs a warn and returns
    all-undefined, never throws.
  * `resolveSpawnSettings` (4): per-project watchdogTimeoutMs override
    applied → 45+2 = 47 min; no project override → falls through to
    global; null projectId → no log + no `loadProjectConfig` call;
    Math.round on non-integer minutes.
* `tests/unit/router/snapshot-integration.test.ts` — `setupMockContainer`
  now stubs `inspect()` so the post-exit pipeline runs through cleanly
  in the snapshot tests (production code no longer needs to defend
  against a test-mock gap).
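The best-effort inspect behavior described in the test list can be sketched like this. `inspectBestEffort` is a hypothetical name, the real helper also derives `durationMs` (omitted here), and the shape of the inspect result is only the subset of Docker's response this sketch needs:

```typescript
// Sketch: best-effort post-exit diagnostics. If inspect() rejects
// (e.g. daemon socket drop), log a warning and return all-undefined
// fields instead of throwing, so the post-exit pipeline still runs.
interface ExitDiagnostics {
  oomKilled?: boolean;
  exitReason?: string;
}

export async function inspectBestEffort(
  inspect: () => Promise<{ State?: { OOMKilled?: boolean; Error?: string } }>,
  warn: (msg: string) => void = console.warn,
): Promise<ExitDiagnostics> {
  try {
    const info = await inspect();
    return {
      oomKilled: info.State?.OOMKilled,
      // Normalize Docker's empty-string Error to undefined.
      exitReason: info.State?.Error || undefined,
    };
  } catch (err) {
    warn(`post-exit inspect failed: ${String(err)}`);
    return {}; // all-undefined: diagnostics are best-effort, never fatal
  }
}
```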

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@zbigniewsobiecki zbigniewsobiecki merged commit 2bbf462 into dev Apr 25, 2026
8 checks passed
@zbigniewsobiecki zbigniewsobiecki deleted the feat/worker-exit-diagnostics branch April 25, 2026 10:09
@codecov

codecov Bot commented Apr 25, 2026

Codecov Report

❌ Patch coverage is 92.15686% with 8 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/router/container-manager.ts | 91.20% | 8 Missing ⚠️ |


