fix(router): scope orphan-cleanup to this instance's own workers #1195
Merged
zbigniewsobiecki merged 1 commit into dev on Apr 25, 2026
When two cascade-router instances share a host (prod ↔ dev,
multi-replica deployments, two local-dev sandboxes), each one's
periodic orphan-cleanup scan would silently `docker stop` the OTHER
instance's healthy in-flight workers at the 30-min `workerTimeoutMs`
mark — surfacing downstream as `exit 137 · OOMKilled=false` agent
runs that everyone blamed on memory.
## The bug
`scanAndCleanupOrphans` filtered Docker by a single label
(`cascade.managed=true`) and checked the resulting containers
against THIS instance's in-process `activeWorkers` map. Sibling
instances each have their own independent map. So:
```text
cascade-router      spawns + tracks worker container 9fc5fb3b…
cascade-router-dev  scans, sees 9fc5fb3b with cascade.managed=true,
                    NOT in dev's empty activeWorkers, age > 30 min
                    → calls container.stop({ t: 15 })
                    → SIGTERM, 15 s grace, then Docker SIGKILL
                    → exit 137 with OOMKilled=false
```
Symmetric in the other direction. The 5-min scan grid + 30-min age
threshold is what produced the highly consistent 30–34 min runtime
across every recent ucho exit-137 run.
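For concreteness, a minimal sketch of the pre-fix scan, assuming a dockerode client; only the names this PR mentions (`scanAndCleanupOrphans`, `activeWorkers`, the label, `workerTimeoutMs`, `stop({ t: 15 })`) are real, the rest of the plumbing is illustrative:

```ts
import Docker from 'dockerode';

// Pre-fix behaviour (sketch). Signature and surrounding plumbing are illustrative.
const WORKER_TIMEOUT_MS = 30 * 60 * 1000; // workerTimeoutMs

async function scanAndCleanupOrphans(
  docker: Docker,
  activeWorkers: Map<string, unknown>, // THIS instance's in-process tracking only
): Promise<void> {
  // Single shared label: matches EVERY instance's workers on the host.
  // (dockerode also accepts this filters map JSON-encoded as a string.)
  const containers = await docker.listContainers({
    all: true,
    filters: { label: ['cascade.managed=true'] },
  });

  const now = Date.now();
  for (const info of containers) {
    const ageMs = now - info.Created * 1000; // Created is a unix timestamp in seconds
    if (!activeWorkers.has(info.Id) && ageMs > WORKER_TIMEOUT_MS) {
      // A sibling router's healthy worker is indistinguishable from a true orphan here.
      await docker.getContainer(info.Id).stop({ t: 15 }); // SIGTERM, 15 s grace, SIGKILL → 137
    }
  }
}
```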
The smoking gun was visible only in cascade-router-DEV's log:
```text
2026-04-25 11:01:43 WARN [cascade-router-dev]
  [WorkerManager] Stopped and removed orphaned container:
  { containerId: '9fc5fb3b7340', ... }
```
…paired with the prod cascade-router log showing zero kill events
for that container, because the prod instance's tracking was fine.
## The fix
Tag every spawned container with the spawning router's instance id
(`cascade.router.instance` Docker label). Scope the periodic scan's
`docker.listContainers({ filters })` call to ONLY containers
carrying THIS instance's id. The dev instance now never even *sees*
prod's containers in its scan — the filter is server-side, not
client-side, so there's no risk of a future bug in client-side
checks leaking through.
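A minimal sketch of both halves of the fix, assuming dockerode; `spawnWorker` and `listOwnManagedContainers` are illustrative names, while the labels and `ROUTER_INSTANCE_ID` come from this PR:

```ts
import Docker from 'dockerode';
import { ROUTER_INSTANCE_ID } from './instance-id';

// Spawn side (container-manager.ts): tie the worker to its spawning router.
async function spawnWorker(docker: Docker, image: string) {
  return docker.createContainer({
    Image: image,
    Labels: {
      'cascade.managed': 'true',
      'cascade.router.instance': ROUTER_INSTANCE_ID,
    },
    // ...remaining worker options elided
  });
}

// Scan side (orphan-cleanup.ts): the daemon filters server-side, so sibling
// routers' containers never reach the client-side orphan check at all.
async function listOwnManagedContainers(docker: Docker) {
  return docker.listContainers({
    all: true,
    filters: {
      label: [
        'cascade.managed=true',
        `cascade.router.instance=${ROUTER_INSTANCE_ID}`,
      ],
    },
  });
}
```

Putting the scoping in the daemon-side filter, rather than in another client-side check, is what makes the protection hard to regress.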
Instance id resolution (`src/router/instance-id.ts`):
1. `process.env.CASCADE_ROUTER_INSTANCE` (trimmed) — explicit
override for hosts that share a hostname.
2. `os.hostname()` — Docker injects the container's short id here
by default, so each container of cascade-router gets a unique
value automatically. This is the normal path.
Fails loudly (throws) if both resolve empty (defensive).
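A minimal sketch of that resolution order; the exported names follow this PR's description, but the parameter defaults and error message are assumptions:

```ts
import os from 'node:os';

// Sketch of src/router/instance-id.ts; resolution order is as described above.
export function resolveRouterInstanceId(
  env: NodeJS.ProcessEnv = process.env,
  hostname: string = os.hostname(),
): string {
  // 1. Explicit override for hosts where several routers share a hostname.
  const fromEnv = env.CASCADE_ROUTER_INSTANCE?.trim();
  if (fromEnv) return fromEnv;

  // 2. Docker sets the container's short id as the hostname by default, so
  //    each containerised router gets a unique value with no extra config.
  const fromHost = hostname.trim();
  if (fromHost) return fromHost;

  // Defensive fail-loud: refuse to run with an ambiguous identity.
  throw new Error(
    'Could not resolve a router instance id; set CASCADE_ROUTER_INSTANCE explicitly',
  );
}

// Memoised once at module load, as described in the Files section below.
export const ROUTER_INSTANCE_ID = resolveRouterInstanceId();
```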
## Files
- `src/router/instance-id.ts` (new) — pure resolver +
module-load-memoised `ROUTER_INSTANCE_ID` constant.
- `src/router/container-manager.ts` — adds the new label at
`docker.createContainer({ Labels: ... })`.
- `src/router/orphan-cleanup.ts` — adds the second clause to the
`docker.listContainers({ filters: { label: [...] } })` call;
includes `instanceId` in the kill-confirmation warn log so
post-mortems unambiguously identify which instance acted.
- `src/router/index.ts` — logs `instanceId` at router startup.
## Tests (TDD-first)
- `tests/unit/router/instance-id.test.ts` (new, 7 cases) — env
override wins, fallback to hostname, env-trimming, defensive
throws on empty/whitespace-only inputs.
- `tests/unit/router/orphan-cleanup.test.ts` — extended the existing
  `lists containers with cascade.managed=true label` test to assert the
  new instance-scoping label is present in the filter; a new test pins
  the safety property explicitly, with a comment explaining the
  historical context (see the sketch after this list).
- `tests/unit/router/container-manager.test.ts` — new case
asserts spawn passes `cascade.router.instance` in the Labels
block.
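For illustration, a rough vitest sketch of the filter-scoping assertion; the mock shape, import path, and the `scanAndCleanupOrphans` signature are assumptions, not the repo's actual test scaffolding:

```ts
import type Docker from 'dockerode';
import { describe, expect, it, vi } from 'vitest';
import { scanAndCleanupOrphans } from '../../../src/router/orphan-cleanup';

// Illustrative only: assumes the cleanup function accepts an injected Docker
// client plus this instance's activeWorkers map; the real signature may differ.
describe('orphan cleanup scoping', () => {
  it("scopes the Docker query to this instance's own label", async () => {
    const listContainers = vi.fn().mockResolvedValue([]);
    const docker = { listContainers } as unknown as Docker;

    await scanAndCleanupOrphans(docker, new Map());

    // The filter must carry BOTH labels, so siblings' containers never appear.
    expect(listContainers).toHaveBeenCalledWith(
      expect.objectContaining({
        filters: expect.objectContaining({
          label: expect.arrayContaining([
            'cascade.managed=true',
            expect.stringMatching(/^cascade\.router\.instance=/),
          ]),
        }),
      }),
    );
  });
});
```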
All 1438/1438 unit-api tests pass. `npm run typecheck` clean.
`npx biome check .` clean (1228 files).
## Operator note for deploy
Pre-fix containers still running at deploy time carry no
`cascade.router.instance` label. Neither router will see them in
its scan filter post-deploy — they become invisible orphans
(harmless: they exit naturally; only snapshot-enabled ones
without AutoRemove leak). One-shot manual sweep if needed:
```sh
docker ps --filter label=cascade.managed=true --format \
  '{{.ID}} {{.RunningFor}}' \
  | awk '/(hour|day)/ {print $1}' \
  | xargs -r docker stop -t 15
```
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Why — definitive root cause of the ucho exit-137 mystery

Two cascade-router instances share bauer (`cascade-router` prod + `cascade-router-dev`). Each one's periodic `scanAndCleanupOrphans` was silently `docker stop`-ing the other instance's healthy in-flight workers at the 30-min `workerTimeoutMs` mark — surfacing downstream as `exit 137 · OOMKilled=false` agent runs that everyone blamed on memory.

Smoking gun — the prod cascade-router log for run `2d0dc4b4` (MNG-308 implementation, killed today at 11:01:42 UTC after 34m12s) had zero kill events for container `9fc5fb3b…`. The dev sibling's log had the orphan-kill WARN quoted above. The 5-min scan grid + 30-min age threshold cleanly explains the consistent 30–34 min runtime across every recent ucho exit-137 run. Engine-agnostic (codex, claude-code), because the kill is external to the worker container.

## Mechanic

`scanAndCleanupOrphans` (`src/router/orphan-cleanup.ts:74–76`) filtered by the single `cascade.managed=true` label, then checked the resulting containers against THIS instance's in-process `activeWorkers` map. Sibling instances have independent maps, so each saw the other's containers as orphans and called `container.stop({ t: 15 })` (SIGTERM, 15 s grace, then Docker SIGKILL → exit 137 with `OOMKilled=false`).

## Fix

Tag every spawned container with the spawning router's instance id (`cascade.router.instance` Docker label). Scope the periodic scan's `docker.listContainers({ filters })` to ONLY containers carrying THIS instance's id. The dev instance now never even sees prod's containers in its scan — the filter is server-side, so there's no risk of a future bug in client-side checks leaking through.

Instance id resolution (`src/router/instance-id.ts`):

1. `process.env.CASCADE_ROUTER_INSTANCE` (trimmed) — explicit override for hosts sharing a hostname.
2. `os.hostname()` — Docker injects the container's short id by default; per-container unique. Normal path.

Fails loudly (throws) if both resolve empty.

## Surfaces

- `src/router/instance-id.ts` (new) — `resolveRouterInstanceId(env, hostname)` + module-load-memoised `ROUTER_INSTANCE_ID` constant.
- `src/router/container-manager.ts` — adds `'cascade.router.instance': ROUTER_INSTANCE_ID` to `Labels:` at `docker.createContainer`.
- `src/router/orphan-cleanup.ts` — scan filter is now `cascade.managed=true` + `cascade.router.instance=<id>`; the kill-confirmation log now includes `instanceId` so post-mortems unambiguously identify which instance acted.
- `src/router/index.ts` — logs `instanceId` at router startup.

## Tests (TDD-first)

- `tests/unit/router/instance-id.test.ts` (new, 7 cases) — env override wins; fallback to hostname; env-trimming; defensive throws on empty/whitespace-only inputs.
- `tests/unit/router/orphan-cleanup.test.ts` — extended the existing `lists containers with cascade.managed=true label` test, plus a new test that pins the safety property with an inline comment explaining the historical context (so a future contributor doesn't undo the protection without understanding why it exists).
- `tests/unit/router/container-manager.test.ts` — new case asserts spawn passes `cascade.router.instance` in `Labels:`.

## Verification

- `npx vitest run --project unit-api` — 1438/1438 passed (was 1429 before; +7 instance-id, +1 spawn-label, +1 orphan-filter).
- `npm run typecheck` clean.
- `npx biome check .` — 1228 files, 0 issues.

## Operator note for deploy

Pre-fix containers still running at deploy time carry no `cascade.router.instance` label. Neither router will see them in its scan filter post-deploy — they become invisible orphans (harmless: they exit naturally; only snapshot-enabled ones without `AutoRemove` leak, and those are usually rare). One-shot manual sweep if needed after the deploy lands: see the command above.

## Test plan

- `ssh bauer "docker logs cascade-router 2>&1 | head -20 | grep instanceId"` confirms the instance id at startup.
- `ssh bauer "docker logs cascade-router-dev 2>&1 | grep instanceId"` confirms a different id.
- `cascade runs retry <id>`; watch past the 30-min mark; expect completion (or a real watchdog at 47 min for ucho), NOT a SIGTERM at 30–34 min.

🤖 Generated with Claude Code