
fix(router): scope orphan-cleanup to this instance's own workers #1195

Merged

zbigniewsobiecki merged 1 commit into dev from fix/router-instance-orphan-scoping on Apr 25, 2026

Conversation

@zbigniewsobiecki
Member

Why — definitive root cause of the ucho exit-137 mystery

Two cascade-router instances share bauer (cascade-router prod + cascade-router-dev). Each one's periodic scanAndCleanupOrphans was silently docker stop-ing the other instance's healthy in-flight workers at the 30-min workerTimeoutMs mark — surfacing downstream as exit 137 · OOMKilled=false agent runs that everyone blamed on memory.

Smoking gun — the prod cascade-router log for run 2d0dc4b4 (MNG-308 implementation, killed today at 11:01:42 UTC after 34m12s) had zero kill events for container 9fc5fb3b…. The dev sibling's log had:

2026-04-25 11:01:43 WARN [cascade-router-dev]
  [WorkerManager] Stopped and removed orphaned container:
  { containerId: '9fc5fb3b7340', ... }

The 5-min scan grid + 30-min age threshold cleanly explains the consistent 30–34 min runtime across every recent ucho exit-137 run. Engine-agnostic (codex, claude-code) because the kill is external to the worker container.
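To make the arithmetic concrete (illustrative timestamps, not taken from a specific run): a worker spawned at 10:27 crosses the 30-min threshold at 10:57, the next tick of the 5-min scan grid at 11:00 flags it, SIGTERM lands, and SIGKILL follows 15 s later — so observed runtimes land anywhere in the 30–35 min band depending on where the spawn falls relative to the grid.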

Mechanism

scanAndCleanupOrphans (src/router/orphan-cleanup.ts:74–76) filtered by:

filters: { label: ['cascade.managed=true'] }

…then checked the resulting containers against THIS instance's in-process activeWorkers map. Sibling instances have independent maps, so each saw the other's containers as orphans and called container.stop({ t: 15 }) (SIGTERM, 15 s grace, then Docker SIGKILL → exit 137 with OOMKilled=false).
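For illustration, a minimal sketch of that pre-fix shape (dockerode-style client; activeWorkers, workerTimeoutMs, and the control flow are reconstructed from this description, not copied from the source):

  import Docker from 'dockerode';

  const docker = new Docker();
  // THIS instance's in-process tracking; a sibling router has its own map.
  const activeWorkers = new Map<string, unknown>();
  const workerTimeoutMs = 30 * 60_000; // the 30-min age threshold

  async function scanAndCleanupOrphans(): Promise<void> {
    const containers = await docker.listContainers({
      all: true,
      // Single-clause filter: matches EVERY router's workers on the host.
      filters: { label: ['cascade.managed=true'] },
    });
    for (const info of containers) {
      const ageMs = Date.now() - info.Created * 1000;
      // A sibling's worker is never in this map, so once it ages past the
      // threshold it looks like an orphan here…
      if (!activeWorkers.has(info.Id) && ageMs > workerTimeoutMs) {
        // …SIGTERM, 15 s grace, then Docker SIGKILL → exit 137, OOMKilled=false.
        await docker.getContainer(info.Id).stop({ t: 15 });
      }
    }
  }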

Fix

Tag every spawned container with the spawning router's instance id (cascade.router.instance Docker label). Scope the periodic scan's docker.listContainers({ filters }) to ONLY containers carrying THIS instance's id.

The dev instance now never even sees prod's containers in its scan — the filter is server-side, so there's no risk of a future bug in client-side checks leaking through.
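Sketched against the same dockerode-style client, the scoped listing becomes (label key from this PR; surrounding names illustrative):

  const containers = await docker.listContainers({
    all: true,
    filters: {
      label: [
        'cascade.managed=true',
        // Server-side scoping: siblings' workers never appear in the result.
        `cascade.router.instance=${ROUTER_INSTANCE_ID}`,
      ],
    },
  });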

Instance id resolution (src/router/instance-id.ts):

  1. process.env.CASCADE_ROUTER_INSTANCE (trimmed) — explicit override for hosts sharing a hostname.
  2. os.hostname() — Docker injects the container's short id by default; per-container unique. Normal path.

Throws fail-loud if both resolve empty.
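A sketch of that resolution order (signature as described above; defaults and error text are assumptions):

  import os from 'node:os';

  export function resolveRouterInstanceId(
    env: NodeJS.ProcessEnv = process.env,
    hostname: string = os.hostname(),
  ): string {
    const override = env.CASCADE_ROUTER_INSTANCE?.trim();
    if (override) return override; // 1. explicit override wins
    const fallback = hostname.trim();
    if (fallback) return fallback; // 2. container short id via os.hostname()
    // Fail loud rather than silently scoping every scan to an empty id.
    throw new Error('Unable to resolve router instance id');
  }

  // Module-load-memoised constant shared by the spawn and scan paths.
  export const ROUTER_INSTANCE_ID = resolveRouterInstanceId();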

Surfaces

| File | Change |
| --- | --- |
| `src/router/instance-id.ts` (new) | Pure `resolveRouterInstanceId(env, hostname)` + module-load-memoised `ROUTER_INSTANCE_ID` constant. |
| `src/router/container-manager.ts` | Adds `'cascade.router.instance': ROUTER_INSTANCE_ID` to `Labels:` at `docker.createContainer`. |
| `src/router/orphan-cleanup.ts` | Two-clause filter (`cascade.managed=true` + `cascade.router.instance=<id>`); the kill-confirmation log now includes `instanceId` so post-mortems unambiguously identify which instance acted. |
| `src/router/index.ts` | Logs `instanceId` at router startup. |
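As a sketch of the container-manager row above, the spawn-side wiring (workerImage and the remaining createContainer options are illustrative):

  const container = await docker.createContainer({
    Image: workerImage, // hypothetical name for the worker image
    Labels: {
      'cascade.managed': 'true',
      'cascade.router.instance': ROUTER_INSTANCE_ID,
    },
    // …remaining worker options elided.
  });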

Tests (TDD-first)

  • tests/unit/router/instance-id.test.ts (new, 7 cases) — env override wins; fallback to hostname; env-trimming; defensive throws on empty/whitespace-only inputs.
  • tests/unit/router/orphan-cleanup.test.ts — extended the existing lists containers with cascade.managed=true label test to cover the new filter clause, and added a test that pins the safety property with an inline comment explaining the historical context (so a future contributor doesn't undo the protection without understanding why it exists).
  • tests/unit/router/container-manager.test.ts — new case asserts spawn passes cascade.router.instance in Labels:.
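For flavour, a sketch of the sort of cases the instance-id suite pins (test names and assertions here are illustrative; the real file has 7 cases):

  import { describe, expect, it } from 'vitest';
  import { resolveRouterInstanceId } from '../../../src/router/instance-id';

  describe('resolveRouterInstanceId', () => {
    it('prefers the trimmed env override over the hostname', () => {
      expect(
        resolveRouterInstanceId({ CASCADE_ROUTER_INSTANCE: ' prod-1 ' }, 'abc123'),
      ).toBe('prod-1');
    });

    it('falls back to the hostname when the override is unset', () => {
      expect(resolveRouterInstanceId({}, '9fc5fb3b7340')).toBe('9fc5fb3b7340');
    });

    it('throws fail-loud when both resolve empty', () => {
      expect(() =>
        resolveRouterInstanceId({ CASCADE_ROUTER_INSTANCE: '   ' }, ''),
      ).toThrow();
    });
  });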

Verification

  • npx vitest run --project unit-api — 1438/1438 passed (was 1429 before: +7 instance-id, +1 spawn-label, +1 orphan-filter).
  • npm run typecheck clean.
  • npx biome check . — 1228 files, 0 issues.
  • Lefthook pre-commit + pre-push gates green.

Operator note for deploy

Pre-fix containers still running at deploy time carry no cascade.router.instance label. Neither router will see them in its scan filter post-deploy — they become invisible orphans (harmless: they exit naturally; only snapshot-enabled ones without AutoRemove leak, and those are usually rare).

One-shot manual sweep if needed after the deploy lands:

docker ps --filter label=cascade.managed=true --format '{{.ID}} {{.RunningFor}}' \
  | awk '/(hour|day)/ {print $1}' \
  | xargs -r docker stop -t 15
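To preview what the sweep would stop before committing to it, run the pipeline without the final xargs stage and eyeball the ids.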

Test plan

  • Unit tests pin the safety property and the spawn-label flow.
  • Live verification on bauer post-deploy:
    • ssh bauer "docker logs cascade-router 2>&1 | head -20 | grep instanceId" confirms the instance id at startup.
    • ssh bauer "docker logs cascade-router-dev 2>&1 | grep instanceId" confirms a different id.
    • Trigger a fresh ucho implementation run via cascade runs retry <id>; watch past the 30-min mark; expect completion (or a genuine watchdog timeout at 47 min for ucho), NOT a SIGTERM at 30–34 min.

🤖 Generated with Claude Code

When two cascade-router instances share a host (prod ↔ dev,
multi-replica deployments, two local-dev sandboxes), each one's
periodic orphan-cleanup scan would silently `docker stop` the OTHER
instance's healthy in-flight workers at the 30-min `workerTimeoutMs`
mark — surfacing downstream as `exit 137 · OOMKilled=false` agent
runs that everyone blamed on memory.

## The bug

`scanAndCleanupOrphans` filtered Docker by a single label
(`cascade.managed=true`) and checked the resulting containers
against THIS instance's in-process `activeWorkers` map. Sibling
instances each have their own independent map. So:

  cascade-router      spawns + tracks worker container 9fc5fb3b…
  cascade-router-dev  scans, sees 9fc5fb3b with `cascade.managed=true`,
                      NOT in dev's empty activeWorkers, age > 30 min,
                      → calls `container.stop({ t: 15 })`
                      → SIGTERM, 15 s grace, then Docker SIGKILL
                      → exit 137 with `OOMKilled=false`

Symmetric in the other direction. The 5-min scan grid + 30-min age
threshold is what produced the highly-consistent 30–34 min runtime
across every recent ucho exit-137 run.

The smoking gun was visible only in cascade-router-DEV's log:

  2026-04-25 11:01:43 WARN [cascade-router-dev]
    [WorkerManager] Stopped and removed orphaned container:
    { containerId: '9fc5fb3b7340', ... }

…paired with the prod cascade-router log showing zero kill events
for that container, because the prod instance's tracking was fine.

## The fix

Tag every spawned container with the spawning router's instance id
(`cascade.router.instance` Docker label). Scope the periodic scan's
`docker.listContainers({ filters })` call to ONLY containers
carrying THIS instance's id. The dev instance now never even *sees*
prod's containers in its scan — the filter is server-side, not
client-side, so there's no risk of a future bug in client-side
checks leaking through.

Instance id resolution (`src/router/instance-id.ts`):
  1. `process.env.CASCADE_ROUTER_INSTANCE` (trimmed) — explicit
     override for hosts that share a hostname.
  2. `os.hostname()` — Docker injects the container's short id here
     by default, so each container of cascade-router gets a unique
     value automatically. This is the normal path.

Throws fail-loud if both resolve empty (defensive).

## Files

- `src/router/instance-id.ts` (new) — pure resolver +
  module-load-memoised `ROUTER_INSTANCE_ID` constant.
- `src/router/container-manager.ts` — adds the new label at
  `docker.createContainer({ Labels: ... })`.
- `src/router/orphan-cleanup.ts` — adds the second clause to the
  `docker.listContainers({ filters: { label: [...] } })` call;
  includes `instanceId` in the kill-confirmation warn log so
  post-mortems unambiguously identify which instance acted.
- `src/router/index.ts` — logs `instanceId` at router startup.

## Tests (TDD-first)

- `tests/unit/router/instance-id.test.ts` (new, 7 cases) — env
  override wins, fallback to hostname, env-trimming, defensive
  throws on empty/whitespace-only inputs.
- `tests/unit/router/orphan-cleanup.test.ts` — extended the
  `lists containers with cascade.managed=true label` test to
  assert the new instance-scoping label is in the filter; new
  test pins the safety property explicitly with a comment
  explaining the historical context.
- `tests/unit/router/container-manager.test.ts` — new case
  asserts spawn passes `cascade.router.instance` in the Labels
  block.

All 1438/1438 unit-api tests pass. `npm run typecheck` clean.
`npx biome check .` clean (1228 files).

## Operator note for deploy

Pre-fix containers still running at deploy time carry no
`cascade.router.instance` label. Neither router will see them in
its scan filter post-deploy — they become invisible orphans
(harmless: they exit naturally; only snapshot-enabled ones
without AutoRemove leak). One-shot manual sweep if needed:

  docker ps --filter label=cascade.managed=true --format \
    '{{.ID}} {{.RunningFor}}' \
    | awk '/(hour|day)/ {print $1}' \
    | xargs -r docker stop -t 15

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
zbigniewsobiecki merged commit fe0a8f8 into dev on Apr 25, 2026
8 checks passed
zbigniewsobiecki deleted the fix/router-instance-orphan-scoping branch on April 25, 2026 at 11:31
codecov Bot commented Apr 25, 2026

Codecov Report

❌ Patch coverage is 91.66667% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/router/index.ts | 0.00% | 2 Missing ⚠️ |

