
feat(router): add periodic orphaned container cleanup #779

Merged

aaight merged 1 commit into dev from feature/orphan-container-cleanup on Mar 13, 2026

Conversation

@aaight (Collaborator) commented on Mar 13, 2026

Summary

Implement periodic orphaned container cleanup in the CASCADE router, so that containers which survive router restarts and are no longer tracked are eventually stopped.

What Changed

New Functions in src/router/container-manager.ts

  • startOrphanCleanup() - Starts a periodic scan every 5 minutes for orphaned Docker containers with the cascade.managed=true label that are:

    • NOT in the Router's activeWorkers map (not tracked)
    • Older than workerTimeoutMs (avoids killing recently-spawned workers not yet registered)
  • stopOrphanCleanup() - Stops the periodic scan timer

  • scanAndCleanupOrphans() (internal, exported for testing) - Performs the actual scan (sketched after this list):

    • Lists Docker containers with cascade.managed=true label
    • Filters out tracked containers (in activeWorkers map)
    • Filters out young containers (< workerTimeoutMs old)
    • Stops remaining orphans with graceful 15-second shutdown
    • Logs each orphan stopped at warn level with container ID and age
    • Handles Docker errors gracefully
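
To make the behavior concrete, here is a minimal sketch of the scan, assuming dockerode as the Docker client; the signature and logger shape are illustrative, not the exact ones in container-manager.ts:

```typescript
// Hypothetical sketch of scanAndCleanupOrphans(), assuming dockerode;
// the real signature in container-manager.ts may differ.
import Docker from 'dockerode';

const GRACEFUL_STOP_SECONDS = 15; // Docker force-kills after this

export async function scanAndCleanupOrphans(
  docker: Docker,
  activeWorkers: Map<string, unknown>,
  workerTimeoutMs: number,
  logger: { warn: (msg: string) => void },
): Promise<void> {
  // A list failure propagates to the caller: it usually means Docker is down.
  const containers = await docker.listContainers({
    filters: { label: ['cascade.managed=true'] },
  });

  const now = Date.now();
  for (const info of containers) {
    // Tracked containers are never touched.
    if (activeWorkers.has(info.Id)) continue;

    // Young containers are skipped: a freshly spawned worker may not be
    // registered in activeWorkers yet.
    const ageMs = now - info.Created * 1000; // Created is in Unix seconds
    if (ageMs < workerTimeoutMs) continue;

    try {
      // Graceful stop: Docker sends SIGTERM, then SIGKILL after `t` seconds.
      await docker.getContainer(info.Id).stop({ t: GRACEFUL_STOP_SECONDS });
      logger.warn(`stopped orphaned container ${info.Id}, age ${Math.round(ageMs / 1000)}s`);
    } catch (err) {
      // An individual stop failure does not abort the rest of the scan.
      logger.warn(`failed to stop orphan ${info.Id}: ${String(err)}`);
    }
  }
}
```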

Integration Points

  • startOrphanCleanup() called in startWorkerProcessor() after queue workers are initialized
  • stopOrphanCleanup() called in stopWorkerProcessor() before closing workers
  • detachAll() now calls stopOrphanCleanup() to clean up the timer on shutdown (timer lifecycle sketched below)
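
The timer lifecycle itself is small; a hedged sketch follows (the real functions presumably close over their dependencies rather than taking a runScan callback):

```typescript
// Hypothetical timer lifecycle for startOrphanCleanup()/stopOrphanCleanup().
const ORPHAN_SCAN_INTERVAL_MS = 5 * 60 * 1000;

let orphanCleanupTimer: NodeJS.Timeout | null = null;

export function startOrphanCleanup(runScan: () => Promise<void>): void {
  if (orphanCleanupTimer) return; // idempotent: never arm a second timer
  orphanCleanupTimer = setInterval(() => {
    runScan().catch((err) => console.error('orphan scan failed', err));
  }, ORPHAN_SCAN_INTERVAL_MS);
}

export function stopOrphanCleanup(): void {
  // The null-check guard makes the double call from stopWorkerProcessor()
  // and detachAll() harmless.
  if (orphanCleanupTimer) {
    clearInterval(orphanCleanupTimer);
    orphanCleanupTimer = null;
  }
}
```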

Re-exports in src/router/worker-manager.ts

  • Exported startOrphanCleanup and stopOrphanCleanup for use by the router (see the snippet below)
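
Presumably a plain re-export along these lines (path assumed):

```typescript
// src/router/worker-manager.ts
export { startOrphanCleanup, stopOrphanCleanup } from './container-manager';
```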

Testing

  • Added 18 comprehensive unit tests covering:
    • Start/stop lifecycle
    • Idempotency and multiple cycles
    • Orphan detection and stopping
    • Tracked container preservation
    • Young container preservation
    • Docker error handling
    • Container stop error handling
    • Multi-container scenarios

All tests use a mocked Docker client to verify behavior without real containers; a representative test is sketched below.
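
A representative test might look like the following; the framework (vitest here) and import path are assumptions, and the mock shape matches the scan sketch above:

```typescript
// Hypothetical test sketch; suite layout is illustrative, not the repo's.
import { describe, expect, it, vi } from 'vitest';
import { scanAndCleanupOrphans } from '../src/router/container-manager';

describe('scanAndCleanupOrphans', () => {
  it('stops untracked containers older than workerTimeoutMs', async () => {
    const stop = vi.fn().mockResolvedValue(undefined);
    const docker = {
      // One container, untracked and one hour old.
      listContainers: vi.fn().mockResolvedValue([
        { Id: 'orphan-1', Created: Math.floor(Date.now() / 1000) - 3600 },
      ]),
      getContainer: vi.fn().mockReturnValue({ stop }),
    };

    const logger = { warn: vi.fn() };
    await scanAndCleanupOrphans(docker as never, new Map(), 60_000, logger);

    expect(stop).toHaveBeenCalledWith({ t: 15 }); // graceful 15-second stop
    expect(logger.warn).toHaveBeenCalledTimes(1); // orphan logged at warn level
  });
});
```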

Key Design Decisions

  1. 5-minute scan interval - balances responsiveness with resource usage
  2. Age threshold - uses workerTimeoutMs to avoid killing recently-spawned workers that haven't been registered yet
  3. Graceful shutdown - gives containers 15 seconds to shut down before Docker force-kills
  4. Error resilience - continues scanning even if individual container stops fail; Docker list failures are propagated as they indicate a critical issue
  5. Logging - each orphan is logged at warn level with container ID and age for observability

Acceptance Criteria Met

✅ New functions startOrphanCleanup() and stopOrphanCleanup() exported from src/router/container-manager.ts
✅ Periodic scan runs every 5 minutes using setInterval, listing Docker containers with cascade.managed=true label
✅ Containers not in the activeWorkers map AND older than workerTimeoutMs are stopped via container.stop()
✅ Containers under the age threshold are left alone (avoids killing newly-spawned workers not yet registered)
✅ Tracked containers (in activeWorkers) are never stopped by the orphan scan
✅ startOrphanCleanup() called during router startup in src/router/worker-manager.ts
✅ stopOrphanCleanup() called during router shutdown, clearInterval cleans up the timer
✅ Each orphan stopped is logged at warn level with container ID and age
✅ Unit tests with mocked Docker client verify: orphans stopped, tracked containers left, young containers left, Docker errors handled gracefully
✅ All existing tests pass, typecheck and lint are clean

Related Issue

https://trello.com/c/Tfp40KUo/306-as-a-developer-i-want-periodic-orphaned-container-cleanup-so-that-containers-surviving-router-restarts-are-eventually-stopped

@nhopeatall (Collaborator) left a comment

Summary

Clean, well-structured implementation of periodic orphaned container cleanup. The code is correct, properly integrated, and thoroughly tested. No blocking issues found.

Design Notes

The implementation makes good design choices:

  • Age threshold using workerTimeoutMs: Correctly prevents killing recently-spawned containers that haven't been registered yet. The synchronous path between container.start() and activeWorkers.set() in spawnWorker means no setInterval callback can fire in between, but the age threshold adds defense-in-depth.
  • Error resilience: Individual container stop failures are caught and logged without aborting the scan, while Docker list failures are propagated (correct — a list failure likely means Docker is down).
  • Graceful shutdown: Timer is properly cleaned up via both stopOrphanCleanup() in stopWorkerProcessor() and detachAll(). The double-call is redundant but harmless due to the null-check guard.
  • setInterval without unref(): Not an issue since the shutdown handler calls process.exit(0) explicitly (the unref() pattern is illustrated below for reference).
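
For reference, the unref() pattern this note alludes to looks like the following (hypothetical; not needed in this PR):

```typescript
// Hypothetical illustration: unref() stops the timer from holding the
// Node.js event loop open, so the process can exit naturally even if
// clearInterval() is never called.
const timer = setInterval(() => console.log('periodic scan tick'), 5 * 60 * 1000);
timer.unref();
```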

Minor Observations (non-blocking)

  • The orphan scan interval (5 min) is hardcoded as a local constant rather than being in routerConfig, unlike the similar emailScheduleIntervalMs. Fine for now since it's unlikely to need runtime customization.
  • The first scan fires 5 minutes after startup (not immediately), so orphaned containers from a previous router instance can survive for up to 5 extra minutes after a restart. Acceptable given the use case, but worth noting if faster cleanup is ever desired (a hypothetical variant follows).
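
If faster post-restart cleanup were ever wanted, a hypothetical tweak to the lifecycle sketch above would run one scan before arming the interval:

```typescript
// Hypothetical variant: one immediate scan on startup, then the usual
// 5-minute interval (orphanCleanupTimer and ORPHAN_SCAN_INTERVAL_MS as in
// the earlier sketch).
export function startOrphanCleanup(runScan: () => Promise<void>): void {
  if (orphanCleanupTimer) return; // idempotent guard, as in the PR
  runScan().catch((err) => console.error('initial orphan scan failed', err));
  orphanCleanupTimer = setInterval(() => {
    runScan().catch((err) => console.error('orphan scan failed', err));
  }, ORPHAN_SCAN_INTERVAL_MS);
}
```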

LGTM — ready to merge.

@aaight aaight merged commit c021618 into dev Mar 13, 2026
6 checks passed