From 4a442aa0fe3957fa2afddc2a35fa831f7f9ff5c7 Mon Sep 17 00:00:00 2001 From: zbigniew sobiecki Date: Sat, 25 Apr 2026 12:04:33 +0200 Subject: [PATCH] feat(router): capture OOMKilled + exit reason on worker exits MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When a worker container exits non-zero, the router previously logged only `statusCode` and stamped the run record's `error` field with the generic `Worker crashed with exit code N`. That collapsed every kind of failure — cgroup OOM, internal SIGKILL, codex CLI crash, runtime abort — into the same opaque message. Investigating ucho's recent exit-137 runs required ssh + syslog grep on the bauer host because cascade itself had discarded the diagnostic signal. This commit captures the signal at the source: `container.inspect()` runs immediately after `wait()` (before AutoRemove or our manual `removeContainer` reaps the container) and pulls `State.OOMKilled`, `State.Error`, and the actual `StartedAt → FinishedAt` duration from Docker's own clocks. The crash reason on the run record is now a structured, grep-stable string: Worker crashed with exit code 137 · OOMKilled=true · reason="Out of memory" `OOMKilled=true` is the *definitive* cgroup-OOM signal; a 137 exit *without* it means the kill came from inside the container or from a non-cgroup signal, not memory. Future post-mortems get the answer from `cascade runs show --json` instead of the host's syslog. Also: `[WorkerManager] Resolved spawn settings` is now emitted at every project-scoped spawn with both `projectWatchdogTimeoutMs` and `globalWorkerTimeoutMs` so the "did the per-project override actually win?" question is one log query away. This was a real load-bearing unknown during the ucho exit-137 investigation — the project config said 45 min but production behavior matched the global 30 min env default. Surfaces: * `src/router/active-workers.ts` — new `ExitDetails` type + `formatCrashReason(exitCode, details?)` helper. `cleanupWorker()` takes an optional third `details` arg and packs the diagnostics into `failOrphanedRun` / `failOrphanedRunFallback`'s reason string. Existing callers that don't pass details fall back to the bare message. * `src/router/container-manager.ts` — `inspectExitedContainer()` reads Docker's `State` immediately post-wait. `logWorkerTail()` and `onWorkerExit()` extracted from the wait callback to keep its cyclomatic complexity within the lint budget. `resolveSpawnSettings()` now emits the structured `Resolved spawn settings` log. Defensive: malformed/sentinel timestamps yield `durationMs: undefined` rather than `NaN` leaking into Sentry. * `CLAUDE.md` — new "Worker exit diagnostics" paragraph documenting the format and the load-bearing logs. * `AGENTS.md` — symlinked to `CLAUDE.md`. The previous untracked copy was stale (broken `Codex setup-token` / `~/.Codex.json` artefacts from a botched search-replace, missing the work-item concurrency lock + post-completion review dispatch sections, and outdated integration text). One source of truth. Tests: * `tests/unit/router/active-workers.test.ts` — extended with 6 new cases pinning the OOMKilled / exitReason fields onto the failOrphan reason string (workItem path + fallback path). * `tests/unit/router/container-manager-diagnostics.test.ts` (new) — 24 direct tests across three suites: * `formatCrashReason` (8): bare/oom/reason permutations + grep-stability regression for the `· ` separator and `OOMKilled=…` marker. The format is now de-facto API for any future dashboard parser; bumping it silently fails CI.
* `inspectExitedContainer` (12): OOMKilled true/false, exitReason extraction, durationMs from real timestamps + Docker's `0001-01-01` sentinel + malformed strings + missing-half + inverted (negative-span) timestamps + inspect-rejection path. Diagnostics are best-effort: rejection logs a warn and returns all-undefined, never throws. * `resolveSpawnSettings` (4): per-project watchdogTimeoutMs override applied → 45+2 = 47 min; no project override → falls through to global; null projectId → no log + no `loadProjectConfig` call; Math.round on non-integer minutes. * `tests/unit/router/snapshot-integration.test.ts` — `setupMockContainer` now stubs `inspect()` so the post-exit pipeline runs through cleanly in the snapshot tests (production code no longer needs to defend against a test-mock gap). Co-Authored-By: Claude Opus 4.7 (1M context) --- AGENTS.md | 1 + CLAUDE.md | 2 + src/router/active-workers.ts | 39 +- src/router/container-manager.ts | 189 +++++++-- tests/unit/router/active-workers.test.ts | 104 +++++ .../container-manager-diagnostics.test.ts | 400 ++++++++++++++++++ .../unit/router/snapshot-integration.test.ts | 11 + 7 files changed, 690 insertions(+), 56 deletions(-) create mode 120000 AGENTS.md create mode 100644 tests/unit/router/container-manager-diagnostics.test.ts diff --git a/AGENTS.md b/AGENTS.md new file mode 120000 index 00000000..681311eb --- /dev/null +++ b/AGENTS.md @@ -0,0 +1 @@ +CLAUDE.md \ No newline at end of file diff --git a/CLAUDE.md b/CLAUDE.md index 39ceea29..a39e5e33 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -121,6 +121,8 @@ Some triggers take params (e.g. `review` + `scm:check-suite-success` accepts `{" **Post-completion review dispatch** — when an implementation agent succeeds with a PR, the execution pipeline checks CI status and fires the review agent deterministically (before the container exits). This guarantees review dispatch within seconds of implementation completion, regardless of GitHub webhook timing. Uses the same `claimReviewDispatch` dedup key as the `check-suite-success` trigger, so the two paths cannot double-enqueue. +**Worker exit diagnostics** — when a worker container exits non-zero, the router calls `container.inspect()` *before* AutoRemove reaps it and stamps the run record's `error` field with a structured, grep-stable string: `Worker crashed with exit code N · OOMKilled=<true|false> · reason="<State.Error>"`. The `OOMKilled=true` marker is the definitive cgroup-OOM signal (per Docker's own `State.OOMKilled`); a 137 exit *without* `OOMKilled=true` means the kill came from inside the container or from a non-cgroup signal — *not* memory. The `[WorkerManager] Resolved spawn settings` log emitted at every project-scoped spawn includes both `projectWatchdogTimeoutMs` and `globalWorkerTimeoutMs` so post-mortems can confirm whether the per-project override actually won. See `src/router/active-workers.ts:formatCrashReason` for the format and `tests/unit/router/container-manager-diagnostics.test.ts` for regression pins. + ## Review agent — context shape (debugging) Review agent receives a **compact per-file diff context**, not full file contents. Each changed file is a `### <file> (<status>, +N -M)` section with a unified diff hunk. Budget: `REVIEW_DIFF_CONTEXT_TOKEN_LIMIT` = 200k tokens, per-file cap 10%.
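A minimal sketch of the dashboard-side consumer this format anticipates — a TypeScript parser that inverts `formatCrashReason`, assuming only the `· ` separator and the `OOMKilled=` / `reason=` markers pinned by the tests in this patch. `parseCrashReason` and `ParsedCrashReason` are hypothetical names for illustration, not exports of this codebase:

interface ParsedCrashReason {
  exitCode: number;
  oomKilled?: boolean;
  exitReason?: string;
}

function parseCrashReason(reason: string): ParsedCrashReason | null {
  // Segments are joined with ' · ' — the separator the regression tests pin.
  const [head, ...rest] = reason.split(' · ');
  const exitMatch = /^Worker crashed with exit code (\d+)$/.exec(head);
  if (!exitMatch) return null; // not a crash-reason string
  const parsed: ParsedCrashReason = { exitCode: Number(exitMatch[1]) };
  for (const segment of rest) {
    if (segment === 'OOMKilled=true') parsed.oomKilled = true;
    else if (segment === 'OOMKilled=false') parsed.oomKilled = false;
    else {
      const reasonMatch = /^reason="(.*)"$/.exec(segment);
      if (reasonMatch) parsed.exitReason = reasonMatch[1];
    }
  }
  return parsed;
}

// parseCrashReason('Worker crashed with exit code 137 · OOMKilled=true · reason="Out of memory"')
// → { exitCode: 137, oomKilled: true, exitReason: 'Out of memory' }

One caveat for any real consumer: `formatCrashReason` does not escape the separator inside `State.Error`, so an exit reason that itself contains ` · ` would split into extra segments; a best-effort parser should tolerate unrecognized segments, as this sketch does.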
diff --git a/src/router/active-workers.ts b/src/router/active-workers.ts index 8e3b74eb..4836e6f9 100644 --- a/src/router/active-workers.ts +++ b/src/router/active-workers.ts @@ -25,6 +25,32 @@ export interface ActiveWorker { agentType?: string; } +/** + * Diagnostic facts about a worker exit, surfaced into the run record's `error` + * field so post-mortem investigations can answer "was this OOM?", "was it + * killed by Docker?" without ssh + syslog access. Sourced from + * `dockerode container.inspect()` after `wait()`. + */ +export interface ExitDetails { + /** `State.OOMKilled` from Docker — definitive cgroup-OOM signal. */ + oomKilled?: boolean; + /** `State.Error` from Docker — non-empty when the runtime aborted the container. */ + exitReason?: string; +} + +/** + * Format a worker-crash reason string with whatever diagnostic facts we have. + * Stable, grep-friendly format: `Worker crashed with exit code N · OOMKilled=… · reason="…"`. + * Empty / undefined fields are omitted. + */ +export function formatCrashReason(exitCode: number, details?: ExitDetails): string { + const parts: string[] = [`Worker crashed with exit code ${exitCode}`]; + if (details?.oomKilled === true) parts.push('OOMKilled=true'); + else if (details?.oomKilled === false) parts.push('OOMKilled=false'); + if (details?.exitReason) parts.push(`reason="${details.exitReason}"`); + return parts.join(' · '); +} + export const activeWorkers = new Map<string, ActiveWorker>(); /** @@ -50,7 +76,7 @@ export function getActiveWorkers(): Array<{ jobId: string; startedAt: Date }> { * The timeout path (killWorker) handles its own 'timed_out' DB update and calls * cleanupWorker without an exitCode so this block is skipped. */ -export function cleanupWorker(jobId: string, exitCode?: number): void { +export function cleanupWorker(jobId: string, exitCode?: number, details?: ExitDetails): void { const worker = activeWorkers.get(jobId); if (worker) { clearTimeout(worker.timeoutHandle); @@ -62,20 +88,15 @@ export function cleanupWorker(jobId: string, exitCode?: number): void { } if (exitCode !== undefined && exitCode !== 0 && worker.projectId) { const durationMs = Date.now() - worker.startedAt.getTime(); + const reason = formatCrashReason(exitCode, details); const updatePromise = worker.workItemId - ? failOrphanedRun( - worker.projectId, - worker.workItemId, - `Worker crashed with exit code ${exitCode}`, - 'failed', - durationMs, - ) + ? failOrphanedRun(worker.projectId, worker.workItemId, reason, 'failed', durationMs) : failOrphanedRunFallback( worker.projectId, worker.agentType, worker.startedAt, 'failed', - `Worker crashed with exit code ${exitCode}`, + reason, durationMs, ); updatePromise diff --git a/src/router/container-manager.ts b/src/router/container-manager.ts index cdb7894a..7ccd2636 100644 --- a/src/router/container-manager.ts +++ b/src/router/container-manager.ts @@ -138,6 +138,123 @@ async function removeContainer(containerId: string): Promise<void> { } } +/** + * Inspect a just-exited container and pull the diagnostic facts that explain + * its exit. `OOMKilled` and `State.Error` are only available before AutoRemove + * (or our manual `removeContainer`) reaps the container, so this MUST run + * before either path. Returns nullable fields when inspection fails — never + * throws. + * + * @internal Exported for unit testing only — call `onWorkerExit` from app code.
+ */ +export async function inspectExitedContainer( + container: ReturnType<typeof docker.getContainer>, + jobId: string, +): Promise<{ oomKilled?: boolean; exitReason?: string; durationMs?: number }> { + // Wrap in try/catch — `inspect()` can fail post-exit if the daemon socket + // drops, the container is reaped between `wait()` and here, or the API + // times out. Diagnostics are best-effort; falling back to undefined is + // safer than failing the whole post-exit pipeline. + let inspectResult: Awaited<ReturnType<typeof container.inspect>> | null = null; + try { + inspectResult = await container.inspect(); + } catch (err) { + logger.warn('[WorkerManager] container.inspect() after wait failed:', { + jobId, + error: String(err), + }); + } + const state = inspectResult?.State; + const oomKilled = state?.OOMKilled; + // Empty `State.Error` (the common case for clean exits) → undefined so the + // run-record reason string omits the `reason="…"` segment entirely. + const exitReason = state?.Error?.length ? state.Error : undefined; + const startedAtIso = state?.StartedAt; + const finishedAtIso = state?.FinishedAt; + // Docker can report sentinel timestamps (e.g. `0001-01-01T00:00:00Z` for a + // container that never fully started) that parse to NaN — drop those so + // downstream logs/Sentry don't ship `durationMs: NaN`. + const rawDurationMs = + startedAtIso && finishedAtIso + ? new Date(finishedAtIso).getTime() - new Date(startedAtIso).getTime() + : undefined; + const durationMs = + rawDurationMs !== undefined && Number.isFinite(rawDurationMs) && rawDurationMs >= 0 + ? rawDurationMs + : undefined; + return { oomKilled, exitReason, durationMs }; +} + +/** + * Tail-log the worker's stdout/stderr for at-a-glance debugging. Full + * per-worker logs are also indexed in Loki via promtail's per-container label + * (`{container="/cascade-worker-${jobId}"}`); this 50-line tail is a + * convenience and is not load-bearing. + */ +async function logWorkerTail(container: ReturnType<typeof docker.getContainer>): Promise<void> { + try { + const logs = await container.logs({ stdout: true, stderr: true, follow: false }); + const logText = logs.toString('utf-8'); + if (!logText.trim()) return; + const lines = logText.trim().split('\n'); + const tail = lines.slice(-50).join('\n'); + logger.info( + `[WorkerManager] Worker logs (last ${Math.min(lines.length, 50)} of ${lines.length} lines):\n${tail}`, + ); + } catch { + // Container may already be removed — expected with AutoRemove + } +} + +/** + * Post-exit handler for a worker container: pulls Docker diagnostics, tail-logs + * stdout, decides snapshot commit vs. skip, and runs cleanup. Extracted from + * the `wait()` callback to keep its cyclomatic complexity within the lint budget. + */ +async function onWorkerExit(opts: { + container: ReturnType<typeof docker.getContainer>; + result: { StatusCode: number }; + jobId: string; + jobType: string; + snapshotEnabled: boolean; + projectId: string | null; + workItemId: string | undefined; +}): Promise<void> { + const { container, result, jobId, jobType, snapshotEnabled, projectId, workItemId } = opts; + + const { oomKilled, exitReason, durationMs } = await inspectExitedContainer(container, jobId); + await logWorkerTail(container); + + if (result.StatusCode !== 0) { + captureException(new Error(`Worker exited with status ${result.StatusCode}`), { + tags: { source: 'worker_exit', jobType }, + extra: { jobId, statusCode: result.StatusCode, oomKilled, exitReason, durationMs }, + }); + } + logger.info('[WorkerManager] Worker exited:', { + jobId, + statusCode: result.StatusCode, + oomKilled: oomKilled ?? null, + exitReason: exitReason ??
null, + durationMs: durationMs ?? null, + }); + + if (snapshotEnabled) { + if (result.StatusCode === 0 && projectId && workItemId) { + await commitContainerToSnapshot(container.id, projectId, workItemId); + } else if (result.StatusCode !== 0) { + logger.info('[WorkerManager] Skipping snapshot commit after non-zero exit:', { + jobId, + statusCode: result.StatusCode, + }); + } + // Always remove manually since AutoRemove is disabled for snapshot runs. + await removeContainer(container.id); + } + + cleanupWorker(jobId, result.StatusCode, { oomKilled, exitReason }); +} + interface SpawnSettings { snapshotEnabled: boolean; workerImage: string; @@ -148,8 +265,10 @@ interface SpawnSettings { /** * Resolve per-project spawn settings (snapshot flag, image, timeout). * Centralises all loadProjectConfig() calls so spawnWorker stays simple. + * + * @internal Exported for unit testing only — call `spawnWorker` from app code. */ -async function resolveSpawnSettings( +export async function resolveSpawnSettings( projectId: string | null, workItemId: string | undefined, jobId: string, @@ -196,6 +315,21 @@ async function resolveSpawnSettings( containerTimeoutMs = projectCfg.watchdogTimeoutMs + ROUTER_KILL_BUFFER_MS; } + // Trace-log the actual values that will govern this worker's lifetime so a + // post-mortem can confirm whether the project's watchdogTimeoutMs override + // took effect or the global default leaked through. + logger.info('[WorkerManager] Resolved spawn settings:', { + jobId, + projectId, + workItemId, + workerImage, + snapshotEnabled, + containerTimeoutMs, + containerTimeoutMinutes: Math.round(containerTimeoutMs / 60_000), + projectWatchdogTimeoutMs: projectCfg?.watchdogTimeoutMs ?? null, + globalWorkerTimeoutMs: routerConfig.workerTimeoutMs, + }); + return { snapshotEnabled, workerImage, containerTimeoutMs, snapshotTtlMs }; } @@ -299,54 +433,15 @@ async function createAndMonitorContainer( container .wait() .then(async (result) => { - // Collect worker logs before removal - try { - const logs = await container.logs({ - stdout: true, - stderr: true, - follow: false, - }); - const logText = logs.toString('utf-8'); - if (logText.trim()) { - const lines = logText.trim().split('\n'); - const tail = lines.slice(-50).join('\n'); - logger.info( - `[WorkerManager] Worker logs (last ${Math.min(lines.length, 50)} of ${lines.length} lines):\n${tail}`, - ); - } - } catch { - // Container may already be removed — expected with AutoRemove - } - - if (result.StatusCode !== 0) { - captureException(new Error(`Worker exited with status ${result.StatusCode}`), { - tags: { source: 'worker_exit', jobType: job.data.type }, - extra: { jobId, statusCode: result.StatusCode }, - }); - } - logger.info('[WorkerManager] Worker exited:', { + await onWorkerExit({ + container, + result, jobId, - statusCode: result.StatusCode, + jobType: job.data.type, + snapshotEnabled, + projectId, + workItemId, }); - - // For snapshot-enabled runs, commit on clean exit and then remove the container. - // Failed or timed-out runs must NOT create a snapshot. - // Always remove — even when projectId/workItemId are absent — to avoid - // orphaning containers that ran with AutoRemove=false. 
- if (snapshotEnabled) { - if (result.StatusCode === 0 && projectId && workItemId) { - await commitContainerToSnapshot(container.id, projectId, workItemId); - } else if (result.StatusCode !== 0) { - logger.info('[WorkerManager] Skipping snapshot commit after non-zero exit:', { - jobId, - statusCode: result.StatusCode, - }); - } - // Always remove the container manually since AutoRemove is disabled - await removeContainer(container.id); - } - - cleanupWorker(jobId, result.StatusCode); }) .catch((err) => { logger.error('[WorkerManager] Error waiting for container:', err); diff --git a/tests/unit/router/active-workers.test.ts b/tests/unit/router/active-workers.test.ts index 0da4dc5e..8c396e23 100644 --- a/tests/unit/router/active-workers.test.ts +++ b/tests/unit/router/active-workers.test.ts @@ -294,5 +294,109 @@ describe('active-workers', () => { expect(mockFailOrphanedRun).not.toHaveBeenCalled(); expect(mockFailOrphanedRunFallback).not.toHaveBeenCalled(); }); + + describe('exit details (diagnostics)', () => { + it('packs OOMKilled=true into the error reason', () => { + activeWorkers.set( + 'job-oom', + makeActiveWorker({ + jobId: 'job-oom', + projectId: 'proj-1', + workItemId: 'card-1', + agentType: 'implementation', + }), + ); + + cleanupWorker('job-oom', 137, { oomKilled: true }); + const reason = mockFailOrphanedRun.mock.calls[0][2] as string; + expect(reason).toMatch(/exit code 137/); + expect(reason).toMatch(/OOMKilled=true/); + }); + + it('packs OOMKilled=false into the error reason', () => { + activeWorkers.set( + 'job-not-oom', + makeActiveWorker({ + jobId: 'job-not-oom', + projectId: 'proj-1', + workItemId: 'card-1', + agentType: 'implementation', + }), + ); + + cleanupWorker('job-not-oom', 137, { oomKilled: false }); + const reason = mockFailOrphanedRun.mock.calls[0][2] as string; + expect(reason).toMatch(/OOMKilled=false/); + }); + + it('includes Docker State.Error reason when present', () => { + activeWorkers.set( + 'job-state-err', + makeActiveWorker({ + jobId: 'job-state-err', + projectId: 'proj-1', + workItemId: 'card-1', + agentType: 'implementation', + }), + ); + + cleanupWorker('job-state-err', 1, { exitReason: 'OCI runtime error' }); + const reason = mockFailOrphanedRun.mock.calls[0][2] as string; + expect(reason).toMatch(/reason="OCI runtime error"/); + }); + + it('omits OOMKilled label when undefined (legacy callers)', () => { + activeWorkers.set( + 'job-legacy', + makeActiveWorker({ + jobId: 'job-legacy', + projectId: 'proj-1', + workItemId: 'card-1', + agentType: 'implementation', + }), + ); + + cleanupWorker('job-legacy', 1); + const reason = mockFailOrphanedRun.mock.calls[0][2] as string; + expect(reason).toBe('Worker crashed with exit code 1'); + }); + + it('combines OOMKilled and exit reason in one message', () => { + activeWorkers.set( + 'job-combo', + makeActiveWorker({ + jobId: 'job-combo', + projectId: 'proj-1', + workItemId: 'card-1', + agentType: 'implementation', + }), + ); + + cleanupWorker('job-combo', 137, { + oomKilled: true, + exitReason: 'Out of memory', + }); + const reason = mockFailOrphanedRun.mock.calls[0][2] as string; + expect(reason).toMatch(/exit code 137/); + expect(reason).toMatch(/OOMKilled=true/); + expect(reason).toMatch(/reason="Out of memory"/); + }); + + it('forwards diagnostics to failOrphanedRunFallback path too', () => { + activeWorkers.set( + 'job-fb', + makeActiveWorker({ + jobId: 'job-fb', + projectId: 'proj-1', + agentType: 'review', + // no workItemId → fallback path + }), + ); + + cleanupWorker('job-fb', 137, { oomKilled: true 
}); + const reason = mockFailOrphanedRunFallback.mock.calls[0][4] as string; + expect(reason).toMatch(/OOMKilled=true/); + }); + }); }); }); diff --git a/tests/unit/router/container-manager-diagnostics.test.ts b/tests/unit/router/container-manager-diagnostics.test.ts new file mode 100644 index 00000000..8b91e7de --- /dev/null +++ b/tests/unit/router/container-manager-diagnostics.test.ts @@ -0,0 +1,400 @@ +import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest'; + +// --------------------------------------------------------------------------- +// Hoisted mocks — the diagnostic helpers don't touch loadProjectConfig or +// credentials, but importing container-manager.ts pulls in its full dep +// graph; mocking the noisy subsystems keeps the test boot fast. +// --------------------------------------------------------------------------- + +const { mockLoggerInfo, mockLoggerWarn, mockLoadProjectConfig, mockGetSnapshot } = vi.hoisted( () => ({ + mockLoggerInfo: vi.fn(), + mockLoggerWarn: vi.fn(), + mockLoadProjectConfig: vi.fn(), + mockGetSnapshot: vi.fn(), + }), ); + +vi.mock('../../../src/utils/logging.js', () => ({ + logger: { + info: (...args: unknown[]) => mockLoggerInfo(...args), + warn: (...args: unknown[]) => mockLoggerWarn(...args), + error: vi.fn(), + debug: vi.fn(), + }, +})); + +vi.mock('../../../src/sentry.js', () => ({ + captureException: vi.fn(), + flush: vi.fn().mockResolvedValue(undefined), + setTag: vi.fn(), +})); + +vi.mock('../../../src/router/config.js', () => ({ + loadProjectConfig: (...args: unknown[]) => mockLoadProjectConfig(...args), + routerConfig: { + workerImage: 'base-worker:latest', + workerMemoryMb: 8192, + workerTimeoutMs: 30 * 60 * 1000, // 30 min — the production-config value + dockerNetwork: 'cascade-net', + snapshotEnabled: false, + snapshotDefaultTtlMs: 86_400_000, + snapshotMaxCount: 5, + snapshotMaxSizeBytes: 10_737_418_240, + maxWorkers: 5, + }, +})); + +vi.mock('../../../src/router/snapshot-manager.js', () => ({ + getSnapshot: (...args: unknown[]) => mockGetSnapshot(...args), + registerSnapshot: vi.fn(), + invalidateSnapshot: vi.fn(), +})); + +// --------------------------------------------------------------------------- +// Imports (after mocks) +// --------------------------------------------------------------------------- + +import { formatCrashReason } from '../../../src/router/active-workers.js'; +import { + inspectExitedContainer, + resolveSpawnSettings, +} from '../../../src/router/container-manager.js'; + +// --------------------------------------------------------------------------- +// Helpers +// --------------------------------------------------------------------------- + +type MockContainer = { + inspect: ReturnType<typeof vi.fn>; +}; + +function makeContainer( + state: Record<string, unknown> | null, + throwOnInspect = false, +): MockContainer { + return { + inspect: throwOnInspect + ? vi.fn().mockRejectedValue(new Error('socket hang up')) + : vi.fn().mockResolvedValue(state === null ? null : { State: state }), + }; +} + +// --------------------------------------------------------------------------- +// formatCrashReason — direct tests of the wire-format consumed by the run +// record's `error` field. The format is now de-facto API: any future +// `cascade runs show` filter or dashboard parser will key on it.
+// --------------------------------------------------------------------------- + +describe('formatCrashReason', () => { + it('returns the bare crash message when no details are provided', () => { + expect(formatCrashReason(137)).toBe('Worker crashed with exit code 137'); + }); + + it('appends OOMKilled=true when details.oomKilled is true', () => { + expect(formatCrashReason(137, { oomKilled: true })).toBe( + 'Worker crashed with exit code 137 · OOMKilled=true', + ); + }); + + it('appends OOMKilled=false when details.oomKilled is explicitly false', () => { + expect(formatCrashReason(137, { oomKilled: false })).toBe( + 'Worker crashed with exit code 137 · OOMKilled=false', + ); + }); + + it('omits the OOMKilled segment when oomKilled is undefined', () => { + expect(formatCrashReason(1, { exitReason: 'OCI runtime error' })).toBe( + 'Worker crashed with exit code 1 · reason="OCI runtime error"', + ); + }); + + it('appends reason="…" when details.exitReason is non-empty', () => { + expect(formatCrashReason(137, { exitReason: 'Out of memory' })).toBe( + 'Worker crashed with exit code 137 · reason="Out of memory"', + ); + }); + + it('chains OOMKilled and reason in stable order (OOMKilled · reason)', () => { + expect(formatCrashReason(137, { oomKilled: true, exitReason: 'Out of memory' })).toBe( + 'Worker crashed with exit code 137 · OOMKilled=true · reason="Out of memory"', + ); + }); + + it('uses · as the segment separator (grep stability)', () => { + const reason = formatCrashReason(137, { oomKilled: true, exitReason: 'X' }); + expect(reason.split(' · ')).toHaveLength(3); + }); + + // Format-stability regression: dashboards / `runs show` filters will grep + // these exact patterns. Bumping the format silently is a downstream break. + it('produces a grep-stable OOMKilled marker', () => { + expect(formatCrashReason(137, { oomKilled: true })).toMatch(/OOMKilled=(true|false)/); + expect(formatCrashReason(137, { oomKilled: false })).toMatch(/OOMKilled=(true|false)/); + }); +}); + +// --------------------------------------------------------------------------- +// inspectExitedContainer — direct unit tests of the Docker-state extraction. +// The whole change exists so post-mortems can read OOMKilled / exit reason +// without ssh; without these tests, a future refactor can silently drop +// fields and we'd never notice. 
+// --------------------------------------------------------------------------- + +describe('inspectExitedContainer', () => { + beforeEach(() => { + mockLoggerInfo.mockClear(); + mockLoggerWarn.mockClear(); + }); + + afterEach(() => { + vi.restoreAllMocks(); + }); + + it('extracts OOMKilled=true from State', async () => { + const container = makeContainer({ + OOMKilled: true, + Error: '', + StartedAt: '2026-04-25T08:00:00.000Z', + FinishedAt: '2026-04-25T08:00:30.000Z', + }); + const result = await inspectExitedContainer(container as never, 'job-1'); + expect(result.oomKilled).toBe(true); + }); + + it('extracts OOMKilled=false from State', async () => { + const container = makeContainer({ + OOMKilled: false, + Error: '', + StartedAt: '2026-04-25T08:00:00.000Z', + FinishedAt: '2026-04-25T08:00:30.000Z', + }); + const result = await inspectExitedContainer(container as never, 'job-1'); + expect(result.oomKilled).toBe(false); + }); + + it('extracts non-empty State.Error as exitReason', async () => { + const container = makeContainer({ + OOMKilled: false, + Error: 'OCI runtime error: exec failed', + StartedAt: '2026-04-25T08:00:00.000Z', + FinishedAt: '2026-04-25T08:00:30.000Z', + }); + const result = await inspectExitedContainer(container as never, 'job-1'); + expect(result.exitReason).toBe('OCI runtime error: exec failed'); + }); + + it('returns undefined exitReason when State.Error is empty (clean exits)', async () => { + const container = makeContainer({ + OOMKilled: false, + Error: '', + StartedAt: '2026-04-25T08:00:00.000Z', + FinishedAt: '2026-04-25T08:00:30.000Z', + }); + const result = await inspectExitedContainer(container as never, 'job-1'); + expect(result.exitReason).toBeUndefined(); + }); + + it('computes durationMs from StartedAt/FinishedAt', async () => { + const container = makeContainer({ + OOMKilled: false, + Error: '', + StartedAt: '2026-04-25T08:00:00.000Z', + FinishedAt: '2026-04-25T08:00:30.000Z', + }); + const result = await inspectExitedContainer(container as never, 'job-1'); + expect(result.durationMs).toBe(30_000); + }); + + it('computes a finite 0ms durationMs when both timestamps are the `0001-01-01` sentinel', async () => { + // Docker uses 0001-01-01T00:00:00Z when a phase never completed — + // e.g. FinishedAt on a still-running container, or StartedAt on a + // container that errored before start. + const container = makeContainer({ + OOMKilled: false, + Error: '', + StartedAt: '0001-01-01T00:00:00Z', + FinishedAt: '0001-01-01T00:00:00Z', + }); + const result = await inspectExitedContainer(container as never, 'job-1'); + // 0001 → 0001 = 0ms which IS finite; the regression we're guarding + // against is when only one is the sentinel and the other is real, + // producing a huge negative number. Test that case below too.
+ expect(Number.isFinite(result.durationMs)).toBe(true); + }); + + it('returns undefined durationMs when timestamps are malformed', async () => { + const container = makeContainer({ + OOMKilled: false, + Error: '', + StartedAt: 'not-a-date', + FinishedAt: '2026-04-25T08:00:30.000Z', + }); + const result = await inspectExitedContainer(container as never, 'job-1'); + expect(result.durationMs).toBeUndefined(); + }); + + it('returns undefined durationMs when FinishedAt is sentinel and StartedAt is real (negative span)', async () => { + const container = makeContainer({ + OOMKilled: false, + Error: '', + StartedAt: '2026-04-25T08:00:30.000Z', + FinishedAt: '0001-01-01T00:00:00Z', + }); + const result = await inspectExitedContainer(container as never, 'job-1'); + // FinishedAt - StartedAt is a huge negative number — drop it rather + // than leak a misleading value into the run record. + expect(result.durationMs).toBeUndefined(); + }); + + it('returns undefined durationMs when StartedAt or FinishedAt is missing', async () => { + const container = makeContainer({ + OOMKilled: false, + Error: '', + // StartedAt missing; FinishedAt present + FinishedAt: '2026-04-25T08:00:30.000Z', + }); + const result = await inspectExitedContainer(container as never, 'job-1'); + expect(result.durationMs).toBeUndefined(); + }); + + it('returns all-undefined when inspect rejects (daemon socket drop)', async () => { + const container = makeContainer(null, true); + const result = await inspectExitedContainer(container as never, 'job-1'); + expect(result.oomKilled).toBeUndefined(); + expect(result.exitReason).toBeUndefined(); + expect(result.durationMs).toBeUndefined(); + }); + + it('logs a warning when inspect rejects', async () => { + const container = makeContainer(null, true); + await inspectExitedContainer(container as never, 'job-warn'); + expect(mockLoggerWarn).toHaveBeenCalledWith( + '[WorkerManager] container.inspect() after wait failed:', + expect.objectContaining({ jobId: 'job-warn' }), + ); + }); + + it('does NOT throw even when inspect rejects — diagnostics are best-effort', async () => { + const container = makeContainer(null, true); + await expect(inspectExitedContainer(container as never, 'job-1')).resolves.toBeDefined(); + }); +}); + +// --------------------------------------------------------------------------- +// resolveSpawnSettings — pins the `Resolved spawn settings` diagnostic log +// against silent regression. This log is the only way to confirm in +// production whether `project.watchdogTimeoutMs` actually overrode the +// global 30-min default for a given worker — the load-bearing fact for the +// ucho exit-137 investigation. +// --------------------------------------------------------------------------- + +describe('resolveSpawnSettings', () => { + beforeEach(() => { + mockLoggerInfo.mockClear(); + mockLoggerWarn.mockClear(); + mockLoadProjectConfig.mockReset(); + mockGetSnapshot.mockReset(); + }); + + afterEach(() => { + vi.restoreAllMocks(); + }); + + it('emits "Resolved spawn settings" log with the project watchdogTimeoutMs override applied', async () => { + mockLoadProjectConfig.mockResolvedValue({ + projects: [], + fullProjects: [ + { + id: 'ucho', + watchdogTimeoutMs: 45 * 60 * 1000, // 45 min — ucho's actual config + snapshotEnabled: false, + snapshotTtlMs: undefined, + }, + ], + }); + + const settings = await resolveSpawnSettings('ucho', 'MNG-308', 'job-ucho-1'); + + // 45min watchdog + 2min router-kill buffer = 47 min container timeout.
+ expect(settings.containerTimeoutMs).toBe(47 * 60 * 1000); + + const spawnLog = mockLoggerInfo.mock.calls.find( + (call) => call[0] === '[WorkerManager] Resolved spawn settings:', + ); + expect(spawnLog).toBeDefined(); + expect(spawnLog?.[1]).toMatchObject({ + jobId: 'job-ucho-1', + projectId: 'ucho', + workItemId: 'MNG-308', + containerTimeoutMs: 47 * 60 * 1000, + containerTimeoutMinutes: 47, + projectWatchdogTimeoutMs: 45 * 60 * 1000, + globalWorkerTimeoutMs: 30 * 60 * 1000, + }); + }); + + it('falls back to global workerTimeoutMs when project has no watchdogTimeoutMs', async () => { + mockLoadProjectConfig.mockResolvedValue({ + projects: [], + fullProjects: [ + { + id: 'no-override', + // watchdogTimeoutMs absent + snapshotEnabled: false, + }, + ], + }); + + const settings = await resolveSpawnSettings('no-override', undefined, 'job-fallback'); + + expect(settings.containerTimeoutMs).toBe(30 * 60 * 1000); // global default + + const spawnLog = mockLoggerInfo.mock.calls.find( + (call) => call[0] === '[WorkerManager] Resolved spawn settings:', + ); + expect(spawnLog?.[1]).toMatchObject({ + projectId: 'no-override', + containerTimeoutMs: 30 * 60 * 1000, + containerTimeoutMinutes: 30, + projectWatchdogTimeoutMs: null, + globalWorkerTimeoutMs: 30 * 60 * 1000, + }); + }); + + it('returns global defaults without logging spawn-settings when projectId is null', async () => { + const settings = await resolveSpawnSettings(null, undefined, 'job-no-project'); + + expect(settings.containerTimeoutMs).toBe(30 * 60 * 1000); + + const spawnLog = mockLoggerInfo.mock.calls.find( + (call) => call[0] === '[WorkerManager] Resolved spawn settings:', + ); + // No project context → no log line. The diagnostic is project-scoped. + expect(spawnLog).toBeUndefined(); + // Also: loadProjectConfig must not be called for null projects (avoid + // pointless DB roundtrips on dashboard-job paths). + expect(mockLoadProjectConfig).not.toHaveBeenCalled(); + }); + + it('logs containerTimeoutMinutes correctly rounded for non-integer minute values', async () => { + mockLoadProjectConfig.mockResolvedValue({ + projects: [], + fullProjects: [ + { + id: 'odd-timeout', + watchdogTimeoutMs: 45 * 60 * 1000 + 30_000, // 45m30s + snapshotEnabled: false, + }, + ], + }); + + await resolveSpawnSettings('odd-timeout', undefined, 'job-rounded'); + + const spawnLog = mockLoggerInfo.mock.calls.find( + (call) => call[0] === '[WorkerManager] Resolved spawn settings:', + ); + // 45m30s + 2m buffer = 47m30s → Math.round to 48. + expect(spawnLog?.[1]).toMatchObject({ containerTimeoutMinutes: 48 }); + }); +}); diff --git a/tests/unit/router/snapshot-integration.test.ts b/tests/unit/router/snapshot-integration.test.ts index 66e78084..767ab759 100644 --- a/tests/unit/router/snapshot-integration.test.ts +++ b/tests/unit/router/snapshot-integration.test.ts @@ -151,6 +151,17 @@ function setupMockContainer(exitCode = 0) { id: 'container-snap-abc123', start: vi.fn().mockResolvedValue(undefined), wait: vi.fn().mockReturnValue(waitPromise), + // inspect() is called by the post-exit pipeline to read State.OOMKilled + + // State.Error before AutoRemove reaps the container. Stub a minimal, + // non-OOM, normally-exited shape — individual tests can override. 
+ inspect: vi.fn().mockResolvedValue({ + State: { + OOMKilled: false, + Error: '', + StartedAt: '2026-04-25T08:00:00.000Z', + FinishedAt: '2026-04-25T08:00:30.000Z', + }, + }), logs: vi.fn().mockResolvedValue(Buffer.from('')), stop: vi.fn().mockResolvedValue(undefined), commit: vi.fn().mockResolvedValue(undefined),