
fix(router): scope orphan-cleanup to this instance's own workers #1195

Merged

zbigniewsobiecki merged 1 commit into dev from fix/router-instance-orphan-scoping on Apr 25, 2026

Conversation

@zbigniewsobiecki
Member

Why — definitive root cause of the ucho exit-137 mystery

Two cascade-router instances share bauer (cascade-router prod + cascade-router-dev). Each one's periodic scanAndCleanupOrphans was silently docker stop-ing the other instance's healthy in-flight workers at the 30-min workerTimeoutMs mark — surfacing downstream as exit 137 · OOMKilled=false agent runs that everyone blamed on memory.

Smoking gun — the prod cascade-router log for run 2d0dc4b4 (MNG-308 implementation, killed today at 11:01:42 UTC after 34m12s) had zero kill events for container 9fc5fb3b…. The dev sibling's log had:

2026-04-25 11:01:43 WARN [cascade-router-dev]
  [WorkerManager] Stopped and removed orphaned container:
  { containerId: '9fc5fb3b7340', ... }

The 5-min scan grid + 30-min age threshold cleanly explains the consistent 30–34 min runtime across every recent ucho exit-137 run. Engine-agnostic (codex, claude-code) because the kill is external to the worker container.
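To make the arithmetic concrete (illustrative timestamps, not taken from a specific run): a worker spawned at 10:27 crosses the 30-min threshold at 10:57, the next tick of the 5-min scan grid at 11:00 flags it, SIGTERM lands, and SIGKILL follows 15 s later — so observed runtimes land anywhere in the 30–35 min band depending on where the spawn falls relative to the grid.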

Mechanism

scanAndCleanupOrphans (src/router/orphan-cleanup.ts:74–76) filtered by:

filters: { label: ['cascade.managed=true'] }

…then checked the resulting containers against THIS instance's in-process activeWorkers map. Sibling instances have independent maps, so each saw the other's containers as orphans and called container.stop({ t: 15 }) (SIGTERM, 15 s grace, then Docker SIGKILL → exit 137 with OOMKilled=false).
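For illustration, a minimal sketch of that pre-fix shape (dockerode-style client; activeWorkers, workerTimeoutMs, and the control flow are reconstructed from this description, not copied from the source):

  import Docker from 'dockerode';

  const docker = new Docker();
  // THIS instance's in-process tracking; a sibling router has its own map.
  const activeWorkers = new Map<string, unknown>();
  const workerTimeoutMs = 30 * 60_000; // the 30-min age threshold

  async function scanAndCleanupOrphans(): Promise<void> {
    const containers = await docker.listContainers({
      all: true,
      // Single-clause filter: matches EVERY router's workers on the host.
      filters: { label: ['cascade.managed=true'] },
    });
    for (const info of containers) {
      const ageMs = Date.now() - info.Created * 1000;
      // A sibling's worker is never in this map, so once it ages past the
      // threshold it looks like an orphan here…
      if (!activeWorkers.has(info.Id) && ageMs > workerTimeoutMs) {
        // …SIGTERM, 15 s grace, then Docker SIGKILL → exit 137, OOMKilled=false.
        await docker.getContainer(info.Id).stop({ t: 15 });
      }
    }
  }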

Fix

Tag every spawned container with the spawning router's instance id (cascade.router.instance Docker label). Scope the periodic scan's docker.listContainers({ filters }) to ONLY containers carrying THIS instance's id.

The dev instance now never even sees prod's containers in its scan — the filter is server-side, so there's no risk of a future bug in client-side checks leaking through.
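Sketched against the same dockerode-style client, the scoped listing becomes (label key from this PR; surrounding names illustrative):

  const containers = await docker.listContainers({
    all: true,
    filters: {
      label: [
        'cascade.managed=true',
        // Server-side scoping: siblings' workers never appear in the result.
        `cascade.router.instance=${ROUTER_INSTANCE_ID}`,
      ],
    },
  });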

Instance id resolution (src/router/instance-id.ts):

  1. process.env.CASCADE_ROUTER_INSTANCE (trimmed) — explicit override for hosts sharing a hostname.
  2. os.hostname() — Docker injects the container's short id by default; per-container unique. Normal path.

Throws fail-loud if both resolve empty.
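A sketch of that resolution order (signature as described above; defaults and error text are assumptions):

  import os from 'node:os';

  export function resolveRouterInstanceId(
    env: NodeJS.ProcessEnv = process.env,
    hostname: string = os.hostname(),
  ): string {
    const override = env.CASCADE_ROUTER_INSTANCE?.trim();
    if (override) return override; // 1. explicit override wins
    const fallback = hostname.trim();
    if (fallback) return fallback; // 2. container short id via os.hostname()
    // Fail loud rather than silently scoping every scan to an empty id.
    throw new Error('Unable to resolve router instance id');
  }

  // Module-load-memoised constant shared by the spawn and scan paths.
  export const ROUTER_INSTANCE_ID = resolveRouterInstanceId();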

Surfaces

| File | Change |
| --- | --- |
| `src/router/instance-id.ts` (new) | Pure `resolveRouterInstanceId(env, hostname)` + module-load-memoised `ROUTER_INSTANCE_ID` constant. |
| `src/router/container-manager.ts` | Adds `'cascade.router.instance': ROUTER_INSTANCE_ID` to `Labels:` at `docker.createContainer`. |
| `src/router/orphan-cleanup.ts` | Two-clause filter (`cascade.managed=true` + `cascade.router.instance=<id>`); the kill-confirmation log now includes `instanceId` so post-mortems unambiguously identify which instance acted. |
| `src/router/index.ts` | Logs `instanceId` at router startup. |
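As a sketch of the container-manager row above, the spawn-side wiring (workerImage and the remaining createContainer options are illustrative):

  const container = await docker.createContainer({
    Image: workerImage, // hypothetical name for the worker image
    Labels: {
      'cascade.managed': 'true',
      'cascade.router.instance': ROUTER_INSTANCE_ID,
    },
    // …remaining worker options elided.
  });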

Tests (TDD-first)

  • tests/unit/router/instance-id.test.ts (new, 7 cases) — env override wins; fallback to hostname; env-trimming; defensive throws on empty/whitespace-only inputs.
  • tests/unit/router/orphan-cleanup.test.ts — extended the existing lists containers with cascade.managed=true label test to cover the new filter clause, and added a test that pins the safety property with an inline comment explaining the historical context (so a future contributor doesn't undo the protection without understanding why it exists).
  • tests/unit/router/container-manager.test.ts — new case asserts spawn passes cascade.router.instance in Labels:.
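For flavour, a sketch of the sort of cases the instance-id suite pins (test names and assertions here are illustrative; the real file has 7 cases):

  import { describe, expect, it } from 'vitest';
  import { resolveRouterInstanceId } from '../../../src/router/instance-id';

  describe('resolveRouterInstanceId', () => {
    it('prefers the trimmed env override over the hostname', () => {
      expect(
        resolveRouterInstanceId({ CASCADE_ROUTER_INSTANCE: ' prod-1 ' }, 'abc123'),
      ).toBe('prod-1');
    });

    it('falls back to the hostname when the override is unset', () => {
      expect(resolveRouterInstanceId({}, '9fc5fb3b7340')).toBe('9fc5fb3b7340');
    });

    it('throws fail-loud when both resolve empty', () => {
      expect(() =>
        resolveRouterInstanceId({ CASCADE_ROUTER_INSTANCE: '   ' }, ''),
      ).toThrow();
    });
  });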

Verification

  • npx vitest run --project unit-api — 1438/1438 passed (was 1429 before: +7 instance-id, +1 spawn-label, +1 orphan-filter).
  • npm run typecheck clean.
  • npx biome check . — 1228 files, 0 issues.
  • Lefthook pre-commit + pre-push gates green.

Operator note for deploy

Pre-fix containers still running at deploy time carry no cascade.router.instance label. Neither router will see them in its scan filter post-deploy — they become invisible orphans (harmless: they exit naturally; only snapshot-enabled ones without AutoRemove leak, and those are usually rare).

One-shot manual sweep if needed after the deploy lands:

docker ps --filter label=cascade.managed=true --format '{{.ID}} {{.RunningFor}}' \
  | awk '/(hour|day)/ {print $1}' \
  | xargs -r docker stop -t 15
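To preview what the sweep would stop before committing to it, run the pipeline without the final xargs stage and eyeball the ids.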

Test plan

  • Unit tests pin the safety property and the spawn-label flow.
  • Live verification on bauer post-deploy:
    • ssh bauer "docker logs cascade-router 2>&1 | head -20 | grep instanceId" confirms the instance id at startup.
    • ssh bauer "docker logs cascade-router-dev 2>&1 | grep instanceId" confirms a different id.
    • Trigger a fresh ucho implementation run via cascade runs retry <id>; watch past the 30-min mark; expect completion (or a genuine watchdog timeout at 47 min for ucho), NOT a SIGTERM at 30–34 min.

🤖 Generated with Claude Code

When two cascade-router instances share a host (prod ↔ dev,
multi-replica deployments, two local-dev sandboxes), each one's
periodic orphan-cleanup scan would silently `docker stop` the OTHER
instance's healthy in-flight workers at the 30-min `workerTimeoutMs`
mark — surfacing downstream as `exit 137 · OOMKilled=false` agent
runs that everyone blamed on memory.

## The bug

`scanAndCleanupOrphans` filtered Docker by a single label
(`cascade.managed=true`) and checked the resulting containers
against THIS instance's in-process `activeWorkers` map. Sibling
instances each have their own independent map. So:

  cascade-router      spawns + tracks worker container 9fc5fb3b…
  cascade-router-dev  scans, sees 9fc5fb3b with `cascade.managed=true`,
                      NOT in dev's empty activeWorkers, age > 30 min,
                      → calls `container.stop({ t: 15 })`
                      → SIGTERM, 15 s grace, then Docker SIGKILL
                      → exit 137 with `OOMKilled=false`

Symmetric in the other direction. The 5-min scan grid + 30-min age
threshold is what produced the highly-consistent 30–34 min runtime
across every recent ucho exit-137 run.

The smoking gun was visible only in cascade-router-DEV's log:

  2026-04-25 11:01:43 WARN [cascade-router-dev]
    [WorkerManager] Stopped and removed orphaned container:
    { containerId: '9fc5fb3b7340', ... }

…paired with the prod cascade-router log showing zero kill events
for that container, because the prod instance's tracking was fine.

## The fix

Tag every spawned container with the spawning router's instance id
(`cascade.router.instance` Docker label). Scope the periodic scan's
`docker.listContainers({ filters })` call to ONLY containers
carrying THIS instance's id. The dev instance now never even *sees*
prod's containers in its scan — the filter is server-side, not
client-side, so there's no risk of a future bug in client-side
checks leaking through.

Instance id resolution (`src/router/instance-id.ts`):
  1. `process.env.CASCADE_ROUTER_INSTANCE` (trimmed) — explicit
     override for hosts that share a hostname.
  2. `os.hostname()` — Docker injects the container's short id here
     by default, so each container of cascade-router gets a unique
     value automatically. This is the normal path.

Throws fail-loud if both resolve empty (defensive).

## Files

- `src/router/instance-id.ts` (new) — pure resolver +
  module-load-memoised `ROUTER_INSTANCE_ID` constant.
- `src/router/container-manager.ts` — adds the new label at
  `docker.createContainer({ Labels: ... })`.
- `src/router/orphan-cleanup.ts` — adds the second clause to the
  `docker.listContainers({ filters: { label: [...] } })` call;
  includes `instanceId` in the kill-confirmation warn log so
  post-mortems unambiguously identify which instance acted.
- `src/router/index.ts` — logs `instanceId` at router startup.

## Tests (TDD-first)

- `tests/unit/router/instance-id.test.ts` (new, 7 cases) — env
  override wins, fallback to hostname, env-trimming, defensive
  throws on empty/whitespace-only inputs.
- `tests/unit/router/orphan-cleanup.test.ts` — extended the
  `lists containers with cascade.managed=true label` test to
  assert the new instance-scoping label is in the filter; new
  test pins the safety property explicitly with a comment
  explaining the historical context.
- `tests/unit/router/container-manager.test.ts` — new case
  asserts spawn passes `cascade.router.instance` in the Labels
  block.

All 1438/1438 unit-api tests pass. `npm run typecheck` clean.
`npx biome check .` clean (1228 files).

## Operator note for deploy

Pre-fix containers still running at deploy time carry no
`cascade.router.instance` label. Neither router will see them in
its scan filter post-deploy — they become invisible orphans
(harmless: they exit naturally; only snapshot-enabled ones
without AutoRemove leak). One-shot manual sweep if needed:

  docker ps --filter label=cascade.managed=true --format \
    '{{.ID}} {{.RunningFor}}' \
    | awk '/(hour|day)/ {print $1}' \
    | xargs -r docker stop -t 15

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
zbigniewsobiecki merged commit fe0a8f8 into dev on Apr 25, 2026
8 checks passed
zbigniewsobiecki deleted the fix/router-instance-orphan-scoping branch on April 25, 2026 at 11:31
codecov Bot commented Apr 25, 2026

Codecov Report

❌ Patch coverage is 91.66667% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/router/index.ts | 0.00% | 2 Missing ⚠️ |

