chore: promote dev → main (router instance scoping fixes ucho exit-137)#1196
Merged
zbigniewsobiecki merged 2 commits into main, Apr 25, 2026
Conversation
When two cascade-router instances share a host (prod ↔ dev,
multi-replica deployments, two local-dev sandboxes), each one's
periodic orphan-cleanup scan would silently `docker stop` the OTHER
instance's healthy in-flight workers at the 30-min `workerTimeoutMs`
mark — surfacing downstream as `exit 137 · OOMKilled=false` agent
runs that everyone blamed on memory.
## The bug
`scanAndCleanupOrphans` filtered Docker by a single label
(`cascade.managed=true`) and checked the resulting containers
against THIS instance's in-process `activeWorkers` map. Sibling
instances each have their own independent map. So:
cascade-router spawns + tracks worker container 9fc5fb3b…
cascade-router-dev scans, sees 9fc5fb3b with `cascade.managed=true`,
NOT in dev's empty activeWorkers, age > 30 min,
→ calls `container.stop({ t: 15 })`
→ SIGTERM, 15 s grace, then Docker SIGKILL
→ exit 137 with `OOMKilled=false`
Symmetric in the other direction. The 5-min scan grid + 30-min age
threshold is what produced the highly-consistent 30–34 min runtime
across every recent ucho exit-137 run.
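The pre-fix decision can be sketched as a pure predicate (illustrative names, not the actual source): both clauses pass for a sibling instance's healthy worker, because `activeWorkers` is per-process while the label filter was host-wide.

```typescript
// Sketch of the PRE-FIX orphan check, assuming the behaviour described
// above; `looksOrphaned` and its parameters are illustrative names.
function looksOrphaned(
  containerId: string,
  createdAtMs: number,
  activeWorkers: Set<string>, // THIS instance's in-process tracking only
  nowMs: number,
  workerTimeoutMs: number = 30 * 60 * 1000,
): boolean {
  const tracked = activeWorkers.has(containerId);
  const expired = nowMs - createdAtMs > workerTimeoutMs;
  // A sibling's worker is never `tracked` here, so age alone decides.
  return !tracked && expired;
}

// prod spawned and tracks the worker; dev's map is empty:
const now = Date.now();
const devView = looksOrphaned("9fc5fb3b", now - 31 * 60 * 1000, new Set(), now);
const prodView = looksOrphaned(
  "9fc5fb3b",
  now - 31 * 60 * 1000,
  new Set(["9fc5fb3b"]),
  now,
);
// devView === true (dev would stop prod's healthy worker); prodView === false
```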
The smoking gun was visible only in cascade-router-DEV's log:
2026-04-25 11:01:43 WARN [cascade-router-dev]
[WorkerManager] Stopped and removed orphaned container:
{ containerId: '9fc5fb3b7340', ... }
…paired with the prod cascade-router log showing zero kill events
for that container, because the prod instance's tracking was fine.
## The fix
Tag every spawned container with the spawning router's instance id
(`cascade.router.instance` Docker label). Scope the periodic scan's
`docker.listContainers({ filters })` call to ONLY containers
carrying THIS instance's id. The dev instance now never even *sees*
prod's containers in its scan — the filter is server-side, not
client-side, so there's no risk of a future bug in client-side
checks leaking through.
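A minimal sketch of the scoped scan call, assuming dockerode's `listContainers` filter syntax (the builder function is illustrative, not the real code; multiple `label` filters are ANDed server-side by the Docker daemon):

```typescript
// Illustrative helper: build the server-side filter for the orphan scan.
function buildOrphanScanFilters(instanceId: string): { label: string[] } {
  return {
    label: [
      "cascade.managed=true",
      // Second clause added by this PR: only THIS instance's containers.
      `cascade.router.instance=${instanceId}`,
    ],
  };
}

// Usage (dockerode):
// const containers = await docker.listContainers({
//   filters: buildOrphanScanFilters(ROUTER_INSTANCE_ID),
// });
```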
Instance id resolution (`src/router/instance-id.ts`):
1. `process.env.CASCADE_ROUTER_INSTANCE` (trimmed) — explicit
override for hosts that share a hostname.
2. `os.hostname()` — Docker injects the container's short id here
by default, so each container of cascade-router gets a unique
value automatically. This is the normal path.
Throws fail-loud if both resolve empty (defensive).
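The resolution order above can be sketched like this (a hypothetical reconstruction, parameterised for testability; the real `src/router/instance-id.ts` may differ in shape):

```typescript
import os from "node:os";

// Hypothetical reconstruction of the resolver described above.
export function resolveInstanceId(
  envValue: string | undefined = process.env.CASCADE_ROUTER_INSTANCE,
  hostname: string = os.hostname(),
): string {
  const fromEnv = envValue?.trim();
  if (fromEnv) return fromEnv; // explicit override wins
  const fromHost = hostname.trim();
  if (fromHost) return fromHost; // Docker sets hostname to the short container id
  // Defensive: refuse to run with an empty id rather than mis-scope cleanup.
  throw new Error("cascade-router: could not resolve instance id");
}
```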
## Files
- `src/router/instance-id.ts` (new) — pure resolver +
module-load-memoised `ROUTER_INSTANCE_ID` constant.
- `src/router/container-manager.ts` — adds the new label at
`docker.createContainer({ Labels: ... })`.
- `src/router/orphan-cleanup.ts` — adds the second clause to the
`docker.listContainers({ filters: { label: [...] } })` call;
includes `instanceId` in the kill-confirmation warn log so
post-mortems unambiguously identify which instance acted.
- `src/router/index.ts` — logs `instanceId` at router startup.
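The spawn-side half of the fix can be sketched as follows (a hypothetical `buildWorkerLabels` helper; the real `container-manager.ts` may inline the labels):

```typescript
// Illustrative: the Labels block passed to docker.createContainer().
function buildWorkerLabels(instanceId: string): Record<string, string> {
  return {
    "cascade.managed": "true", // pre-existing label
    "cascade.router.instance": instanceId, // new: records who spawned this worker
  };
}

// Usage (dockerode; workerImage is illustrative):
// await docker.createContainer({
//   Image: workerImage,
//   Labels: buildWorkerLabels(ROUTER_INSTANCE_ID),
// });
```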
## Tests (TDD-first)
- `tests/unit/router/instance-id.test.ts` (new, 7 cases) — env
override wins, fallback to hostname, env-trimming, defensive
throws on empty/whitespace-only inputs.
- `tests/unit/router/orphan-cleanup.test.ts` — extended the
`lists containers with cascade.managed=true label` test to
assert the new instance-scoping label is in the filter; new
test pins the safety property explicitly with a comment
explaining the historical context.
- `tests/unit/router/container-manager.test.ts` — new case
asserts spawn passes `cascade.router.instance` in the Labels
block.
All 1438/1438 unit-api tests pass. `npm run typecheck` clean.
`npx biome check .` clean (1228 files).
## Operator note for deploy
Pre-fix containers still running at deploy time carry no
`cascade.router.instance` label. Neither router will see them in
its scan filter post-deploy — they become invisible orphans
(harmless: they exit naturally; only snapshot-enabled ones
without AutoRemove leak). One-shot manual sweep if needed:
docker ps --filter label=cascade.managed=true --format \
'{{.ID}} {{.RunningFor}}' \
| awk '/(hour|day)/ {print $1}' \
| xargs -r docker stop -t 15
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(router): scope orphan-cleanup to this instance's own workers (branch: …e-orphan-scoping)
Summary
Promotes the definitive fix for the ucho exit-137 mystery to prod:
`fix(router): scope orphan-cleanup to this instance's own workers`, the root cause of the consistent 30–34 min kills. Two cascade-router instances on bauer (prod + dev) were silently `docker stop`-ing each other's healthy in-flight workers via orphan-cleanup. The fix tags spawned containers with the spawning router's instance id and scopes the periodic scan filter to match only that id. `OOMKilled=false` was confirmed via the diagnostic shipped in feat(router): capture OOMKilled + exit reason on worker exits (#1193); it was never memory. If any other commits from dev are bundled implicitly (PRs already merged that haven't been promoted yet), they're included.
Test plan
- `ssh bauer "docker logs cascade-router 2>&1 | head -20 | grep instanceId"` confirms the instance id at startup.
- `ssh bauer "docker logs cascade-router-dev 2>&1 | grep instanceId"` confirms a different id.
- `cascade runs retry <id>`, then watch past the 30-min mark: expect completion or the real 47-min watchdog, not a SIGTERM at 30–34 min from the sibling instance.
🤖 Generated with Claude Code