chore: promote dev → main (router instance scoping fixes ucho exit-137)#1196
Merged
zbigniewsobiecki merged 2 commits into main, Apr 25, 2026
Conversation
When two cascade-router instances share a host (prod ↔ dev,
multi-replica deployments, two local-dev sandboxes), each one's
periodic orphan-cleanup scan would silently `docker stop` the OTHER
instance's healthy in-flight workers at the 30-min `workerTimeoutMs`
mark — surfacing downstream as `exit 137 · OOMKilled=false` agent
runs that everyone blamed on memory.
## The bug
`scanAndCleanupOrphans` filtered Docker by a single label
(`cascade.managed=true`) and checked the resulting containers
against THIS instance's in-process `activeWorkers` map. Sibling
instances each have their own independent map. So:
cascade-router spawns + tracks worker container 9fc5fb3b…
cascade-router-dev scans, sees 9fc5fb3b with `cascade.managed=true`,
NOT in dev's empty activeWorkers, age > 30 min,
→ calls `container.stop({ t: 15 })`
→ SIGTERM, 15 s grace, then Docker SIGKILL
→ exit 137 with `OOMKilled=false`
Symmetric in the other direction. The 5-min scan grid + 30-min age
threshold is what produced the highly-consistent 30–34 min runtime
across every recent ucho exit-137 run.
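The pre-fix decision can be sketched as a pure predicate (illustrative names, not the actual source): both clauses pass for a sibling instance's healthy worker, because `activeWorkers` is per-process while the label filter was host-wide.

```typescript
// Sketch of the PRE-FIX orphan check, assuming the behaviour described
// above; `looksOrphaned` and its parameters are illustrative names.
function looksOrphaned(
  containerId: string,
  createdAtMs: number,
  activeWorkers: Set<string>, // THIS instance's in-process tracking only
  nowMs: number,
  workerTimeoutMs: number = 30 * 60 * 1000,
): boolean {
  const tracked = activeWorkers.has(containerId);
  const expired = nowMs - createdAtMs > workerTimeoutMs;
  // A sibling's worker is never `tracked` here, so age alone decides.
  return !tracked && expired;
}

// prod spawned and tracks the worker; dev's map is empty:
const now = Date.now();
const devView = looksOrphaned("9fc5fb3b", now - 31 * 60 * 1000, new Set(), now);
const prodView = looksOrphaned(
  "9fc5fb3b",
  now - 31 * 60 * 1000,
  new Set(["9fc5fb3b"]),
  now,
);
// devView === true (dev would stop prod's healthy worker); prodView === false
```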
The smoking gun was visible only in cascade-router-DEV's log:
2026-04-25 11:01:43 WARN [cascade-router-dev]
[WorkerManager] Stopped and removed orphaned container:
{ containerId: '9fc5fb3b7340', ... }
…paired with the prod cascade-router log showing zero kill events
for that container, because the prod instance's tracking was fine.
## The fix
Tag every spawned container with the spawning router's instance id
(`cascade.router.instance` Docker label). Scope the periodic scan's
`docker.listContainers({ filters })` call to ONLY containers
carrying THIS instance's id. The dev instance now never even *sees*
prod's containers in its scan — the filter is server-side, not
client-side, so there's no risk of a future bug in client-side
checks leaking through.
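A minimal sketch of the scoped scan call, assuming dockerode's `listContainers` filter syntax (the builder function is illustrative, not the real code; multiple `label` filters are ANDed server-side by the Docker daemon):

```typescript
// Illustrative helper: build the server-side filter for the orphan scan.
function buildOrphanScanFilters(instanceId: string): { label: string[] } {
  return {
    label: [
      "cascade.managed=true",
      // Second clause added by this PR: only THIS instance's containers.
      `cascade.router.instance=${instanceId}`,
    ],
  };
}

// Usage (dockerode):
// const containers = await docker.listContainers({
//   filters: buildOrphanScanFilters(ROUTER_INSTANCE_ID),
// });
```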
Instance id resolution (`src/router/instance-id.ts`):
1. `process.env.CASCADE_ROUTER_INSTANCE` (trimmed) — explicit
override for hosts that share a hostname.
2. `os.hostname()` — Docker injects the container's short id here
by default, so each container of cascade-router gets a unique
value automatically. This is the normal path.
Throws fail-loud if both resolve empty (defensive).
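The resolution order above can be sketched like this (a hypothetical reconstruction, parameterised for testability; the real `src/router/instance-id.ts` may differ in shape):

```typescript
import os from "node:os";

// Hypothetical reconstruction of the resolver described above.
export function resolveInstanceId(
  envValue: string | undefined = process.env.CASCADE_ROUTER_INSTANCE,
  hostname: string = os.hostname(),
): string {
  const fromEnv = envValue?.trim();
  if (fromEnv) return fromEnv; // explicit override wins
  const fromHost = hostname.trim();
  if (fromHost) return fromHost; // Docker sets hostname to the short container id
  // Defensive: refuse to run with an empty id rather than mis-scope cleanup.
  throw new Error("cascade-router: could not resolve instance id");
}
```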
## Files
- `src/router/instance-id.ts` (new) — pure resolver +
module-load-memoised `ROUTER_INSTANCE_ID` constant.
- `src/router/container-manager.ts` — adds the new label at
`docker.createContainer({ Labels: ... })`.
- `src/router/orphan-cleanup.ts` — adds the second clause to the
`docker.listContainers({ filters: { label: [...] } })` call;
includes `instanceId` in the kill-confirmation warn log so
post-mortems unambiguously identify which instance acted.
- `src/router/index.ts` — logs `instanceId` at router startup.
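The spawn-side half of the fix can be sketched as follows (a hypothetical `buildWorkerLabels` helper; the real `container-manager.ts` may inline the labels):

```typescript
// Illustrative: the Labels block passed to docker.createContainer().
function buildWorkerLabels(instanceId: string): Record<string, string> {
  return {
    "cascade.managed": "true", // pre-existing label
    "cascade.router.instance": instanceId, // new: records who spawned this worker
  };
}

// Usage (dockerode; workerImage is illustrative):
// await docker.createContainer({
//   Image: workerImage,
//   Labels: buildWorkerLabels(ROUTER_INSTANCE_ID),
// });
```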
## Tests (TDD-first)
- `tests/unit/router/instance-id.test.ts` (new, 7 cases) — env
override wins, fallback to hostname, env-trimming, defensive
throws on empty/whitespace-only inputs.
- `tests/unit/router/orphan-cleanup.test.ts` — extended the
`lists containers with cascade.managed=true label` test to
assert the new instance-scoping label is in the filter; new
test pins the safety property explicitly with a comment
explaining the historical context.
- `tests/unit/router/container-manager.test.ts` — new case
asserts spawn passes `cascade.router.instance` in the Labels
block.
All 1438/1438 unit-api tests pass. `npm run typecheck` clean.
`npx biome check .` clean (1228 files).
## Operator note for deploy
Pre-fix containers still running at deploy time carry no
`cascade.router.instance` label. Neither router will see them in
its scan filter post-deploy — they become invisible orphans
(harmless: they exit naturally; only snapshot-enabled ones
without AutoRemove leak). One-shot manual sweep if needed:
docker ps --filter label=cascade.managed=true --format \
'{{.ID}} {{.RunningFor}}' \
| awk '/(hour|day)/ {print $1}' \
| xargs -r docker stop -t 15
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(router): scope orphan-cleanup to this instance's own workers (branch: …e-orphan-scoping)
Summary
Promotes the definitive fix for the ucho exit-137 mystery to prod:
`fix(router): scope orphan-cleanup to this instance's own workers`, the root cause of the consistent 30–34 min kills. Two cascade-router instances on bauer (prod + dev) were silently `docker stop`-ing each other's healthy in-flight workers via orphan-cleanup. The fix tags spawned containers with the spawning router's instance id and scopes the periodic scan filter to match only that id. `OOMKilled=false` was confirmed via the diagnostic shipped in feat(router): capture OOMKilled + exit reason on worker exits (#1193); it was never memory. If any other commits from dev are bundled implicitly (PRs already merged that haven't been promoted yet), they're included.
Test plan
- `ssh bauer "docker logs cascade-router 2>&1 | head -20 | grep instanceId"` confirms the instance id at startup.
- `ssh bauer "docker logs cascade-router-dev 2>&1 | grep instanceId"` confirms a different id.
- `cascade runs retry <id>`, then watch past the 30-min mark: expect completion or the real 47-min watchdog, not a SIGTERM at 30–34 min from the sibling instance.
🤖 Generated with Claude Code