feat(router): add periodic orphaned container cleanup#779
Merged
Conversation
nhopeatall
approved these changes
Mar 13, 2026
Collaborator
nhopeatall
left a comment
There was a problem hiding this comment.
Summary
Clean, well-structured implementation of periodic orphaned container cleanup. The code is correct, properly integrated, and thoroughly tested. No blocking issues found.
Design Notes
The implementation makes good design choices:
- Age threshold using
workerTimeoutMs: Correctly prevents killing recently-spawned containers that haven't been registered yet. The synchronous path betweencontainer.start()andactiveWorkers.set()inspawnWorkermeans nosetIntervalcallback can fire in between, but the age threshold adds defense-in-depth. - Error resilience: Individual container stop failures are caught and logged without aborting the scan, while Docker list failures are propagated (correct — a list failure likely means Docker is down).
- Graceful shutdown: Timer is properly cleaned up via both
stopOrphanCleanup()instopWorkerProcessor()anddetachAll(). The double-call is redundant but harmless due to the null-check guard. setIntervalwithoutunref(): Not an issue since the shutdown handler callsprocess.exit(0)explicitly.
Minor Observations (non-blocking)
- The orphan scan interval (5 min) is hardcoded as a local constant rather than being in
routerConfig, unlike the similaremailScheduleIntervalMs. Fine for now since it's unlikely to need runtime customization. - The first scan fires 5 minutes after startup (not immediately). This means orphaned containers from a previous router instance survive for up to 5 additional minutes post-restart. Acceptable given the use case, but worth noting if faster cleanup is ever desired.
LGTM — ready to merge.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Implement periodic orphaned container cleanup in the CASCADE router to handle containers that survive router restarts by being stopped if they're no longer tracked.
What Changed
New Functions in
src/router/container-manager.tsstartOrphanCleanup()- Starts a periodic scan every 5 minutes for orphaned Docker containers with thecascade.managed=truelabel that are:activeWorkersmap (not tracked)workerTimeoutMs(avoids killing recently-spawned workers not yet registered)stopOrphanCleanup()- Stops the periodic scan timerscanAndCleanupOrphans()(internal, exported for testing) - Performs the actual scan:cascade.managed=truelabelactiveWorkersmap)workerTimeoutMsold)Integration Points
startOrphanCleanup()called instartWorkerProcessor()after queue workers are initializedstopOrphanCleanup()called instopWorkerProcessor()before closing workersdetachAll()now callsstopOrphanCleanup()to clean up timer on shutdownRe-exports in
src/router/worker-manager.tsstartOrphanCleanupandstopOrphanCleanupfor use by routerTesting
All tests use mocked Docker client to verify behavior without actual containers.
Key Design Decisions
workerTimeoutMsto avoid killing recently-spawned workers that haven't been registered yetAcceptance Criteria Met
✅ New functions startOrphanCleanup() and stopOrphanCleanup() exported from src/router/container-manager.ts
✅ Periodic scan runs every 5 minutes using setInterval, listing Docker containers with cascade.managed=true label
✅ Containers not in the activeWorkers map AND older than workerTimeoutMs are stopped via container.stop()
✅ Containers under the age threshold are left alone (avoids killing newly-spawned workers not yet registered)
✅ Tracked containers (in activeWorkers) are never stopped by the orphan scan
✅ startOrphanCleanup() called during router startup in src/router/worker-manager.ts
✅ stopOrphanCleanup() called during router shutdown, clearInterval cleans up the timer
✅ Each orphan stopped is logged at warn level with container ID and age
✅ Unit tests with mocked Docker client verify: orphans stopped, tracked containers left, young containers left, Docker errors handled gracefully
✅ All existing tests pass, typecheck and lint are clean
Related Issue
https://trello.com/c/Tfp40KUo/306-as-a-developer-i-want-periodic-orphaned-container-cleanup-so-that-containers-surviving-router-restarts-are-eventually-stopped