Skip to content

chore: promote dev → main (router instance scoping fixes ucho exit-137)#1196

Merged
zbigniewsobiecki merged 2 commits intomainfrom
dev
Apr 25, 2026
Merged

chore: promote dev → main (router instance scoping fixes ucho exit-137)#1196
zbigniewsobiecki merged 2 commits intomainfrom
dev

Conversation

@zbigniewsobiecki
Copy link
Copy Markdown
Member

Summary

Promotes the definitive fix for the ucho exit-137 mystery to prod:

If any other commits from dev are bundled implicitly (PRs already merged that haven't been promoted yet), they're included.

Test plan

  • CI green on fix(router): scope orphan-cleanup to this instance's own workers #1195 (lint-and-test 3m46s, integration 1m46s, Validate Docker 2m8s, CodeQL).
  • CI green on dev post-merge.
  • Build and Deploy (Prod) green after this merges to main.
  • Live verification on bauer:
    • ssh bauer "docker logs cascade-router 2>&1 | head -20 | grep instanceId" confirms instance id at startup.
    • ssh bauer "docker logs cascade-router-dev 2>&1 | grep instanceId" confirms a different id.
    • Trigger a fresh ucho implementation run via cascade runs retry <id>. Watch past the 30-min mark — expect completion or the real 47-min watchdog, not a SIGTERM at 30-34 min from the sibling instance.

Operator note for deploy

Pre-fix containers still running at deploy time carry no cascade.router.instance label. Neither router will see them in its scan filter post-deploy — they become invisible orphans (harmless: they exit naturally; only snapshot-enabled ones without AutoRemove leak).

One-shot manual sweep if needed:

docker ps --filter label=cascade.managed=true --format '{{.ID}} {{.RunningFor}}' \
  | awk '/(hour|day)/ {print \$1}' \
  | xargs -r docker stop -t 15

🤖 Generated with Claude Code

zbigniewsobiecki and others added 2 commits April 25, 2026 13:26
When two cascade-router instances share a host (prod ↔ dev,
multi-replica deployments, two local-dev sandboxes), each one's
periodic orphan-cleanup scan would silently `docker stop` the OTHER
instance's healthy in-flight workers at the 30-min `workerTimeoutMs`
mark — surfacing downstream as `exit 137 · OOMKilled=false` agent
runs that everyone blamed on memory.

## The bug

`scanAndCleanupOrphans` filtered Docker by a single label
(`cascade.managed=true`) and checked the resulting containers
against THIS instance's in-process `activeWorkers` map. Sibling
instances each have their own independent map. So:

  cascade-router      spawns + tracks worker container 9fc5fb3b…
  cascade-router-dev  scans, sees 9fc5fb3b with `cascade.managed=true`,
                      NOT in dev's empty activeWorkers, age > 30 min,
                      → calls `container.stop({ t: 15 })`
                      → SIGTERM, 15 s grace, then Docker SIGKILL
                      → exit 137 with `OOMKilled=false`

Symmetric in the other direction. The 5-min scan grid + 30-min age
threshold is what produced the highly-consistent 30–34 min runtime
across every recent ucho exit-137 run.

The smoking gun was visible only in cascade-router-DEV's log:

  2026-04-25 11:01:43 WARN [cascade-router-dev]
    [WorkerManager] Stopped and removed orphaned container:
    { containerId: '9fc5fb3b7340', ... }

…paired with the prod cascade-router log showing zero kill events
for that container, because the prod instance's tracking was fine.

## The fix

Tag every spawned container with the spawning router's instance id
(`cascade.router.instance` Docker label). Scope the periodic scan's
`docker.listContainers({ filters })` call to ONLY containers
carrying THIS instance's id. The dev instance now never even *sees*
prod's containers in its scan — the filter is server-side, not
client-side, so there's no risk of a future bug in client-side
checks leaking through.

Instance id resolution (`src/router/instance-id.ts`):
  1. `process.env.CASCADE_ROUTER_INSTANCE` (trimmed) — explicit
     override for hosts that share a hostname.
  2. `os.hostname()` — Docker injects the container's short id here
     by default, so each container of cascade-router gets a unique
     value automatically. This is the normal path.

Throws fail-loud if both resolve empty (defensive).

## Files

- `src/router/instance-id.ts` (new) — pure resolver +
  module-load-memoised `ROUTER_INSTANCE_ID` constant.
- `src/router/container-manager.ts` — adds the new label at
  `docker.createContainer({ Labels: ... })`.
- `src/router/orphan-cleanup.ts` — adds the second clause to the
  `docker.listContainers({ filters: { label: [...] } })` call;
  includes `instanceId` in the kill-confirmation warn log so
  post-mortems unambiguously identify which instance acted.
- `src/router/index.ts` — logs `instanceId` at router startup.

## Tests (TDD-first)

- `tests/unit/router/instance-id.test.ts` (new, 7 cases) — env
  override wins, fallback to hostname, env-trimming, defensive
  throws on empty/whitespace-only inputs.
- `tests/unit/router/orphan-cleanup.test.ts` — extended the
  `lists containers with cascade.managed=true label` test to
  assert the new instance-scoping label is in the filter; new
  test pins the safety property explicitly with a comment
  explaining the historical context.
- `tests/unit/router/container-manager.test.ts` — new case
  asserts spawn passes `cascade.router.instance` in the Labels
  block.

All 1438/1438 unit-api tests pass. `npm run typecheck` clean.
`npx biome check .` clean (1228 files).

## Operator note for deploy

Pre-fix containers still running at deploy time carry no
`cascade.router.instance` label. Neither router will see them in
its scan filter post-deploy — they become invisible orphans
(harmless: they exit naturally; only snapshot-enabled ones
without AutoRemove leak). One-shot manual sweep if needed:

  docker ps --filter label=cascade.managed=true --format \
    '{{.ID}} {{.RunningFor}}' \
    | awk '/(hour|day)/ {print $1}' \
    | xargs -r docker stop -t 15

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e-orphan-scoping

fix(router): scope orphan-cleanup to this instance's own workers
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 25, 2026

Codecov Report

❌ Patch coverage is 91.66667% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/router/index.ts 0.00% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

@zbigniewsobiecki zbigniewsobiecki merged commit e774600 into main Apr 25, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant