
feat(router): add periodic orphaned container cleanup #779

Merged

aaight merged 1 commit into dev from feature/orphan-container-cleanup on Mar 13, 2026

Conversation

@aaight (Collaborator) commented on Mar 13, 2026

Summary

Implement periodic orphaned container cleanup in the CASCADE router, so that containers which survive router restarts and are no longer tracked are eventually stopped.

What Changed

New Functions in src/router/container-manager.ts

  • startOrphanCleanup() - Starts a periodic scan every 5 minutes for orphaned Docker containers with the cascade.managed=true label that are:

    • NOT in the Router's activeWorkers map (not tracked)
    • Older than workerTimeoutMs (avoids killing recently-spawned workers not yet registered)
  • stopOrphanCleanup() - Stops the periodic scan timer

  • scanAndCleanupOrphans() (internal, exported for testing) - Performs the actual scan (sketched after this list):

    • Lists Docker containers with cascade.managed=true label
    • Filters out tracked containers (in activeWorkers map)
    • Filters out young containers (< workerTimeoutMs old)
    • Stops remaining orphans with graceful 15-second shutdown
    • Logs each orphan stopped at warn level with container ID and age
    • Handles Docker errors gracefully
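
To make the behavior concrete, here is a minimal sketch of the scan, assuming dockerode as the Docker client; the signature and logger shape are illustrative, not the exact ones in container-manager.ts:

```typescript
// Hypothetical sketch of scanAndCleanupOrphans(), assuming dockerode;
// the real signature in container-manager.ts may differ.
import Docker from 'dockerode';

const GRACEFUL_STOP_SECONDS = 15; // Docker force-kills after this

export async function scanAndCleanupOrphans(
  docker: Docker,
  activeWorkers: Map<string, unknown>,
  workerTimeoutMs: number,
  logger: { warn: (msg: string) => void },
): Promise<void> {
  // A list failure propagates to the caller: it usually means Docker is down.
  const containers = await docker.listContainers({
    filters: { label: ['cascade.managed=true'] },
  });

  const now = Date.now();
  for (const info of containers) {
    // Tracked containers are never touched.
    if (activeWorkers.has(info.Id)) continue;

    // Young containers are skipped: a freshly spawned worker may not be
    // registered in activeWorkers yet.
    const ageMs = now - info.Created * 1000; // Created is in Unix seconds
    if (ageMs < workerTimeoutMs) continue;

    try {
      // Graceful stop: Docker sends SIGTERM, then SIGKILL after `t` seconds.
      await docker.getContainer(info.Id).stop({ t: GRACEFUL_STOP_SECONDS });
      logger.warn(`stopped orphaned container ${info.Id}, age ${Math.round(ageMs / 1000)}s`);
    } catch (err) {
      // An individual stop failure does not abort the rest of the scan.
      logger.warn(`failed to stop orphan ${info.Id}: ${String(err)}`);
    }
  }
}
```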

Integration Points

  • startOrphanCleanup() called in startWorkerProcessor() after queue workers are initialized
  • stopOrphanCleanup() called in stopWorkerProcessor() before closing workers
  • detachAll() now calls stopOrphanCleanup() to clean up the timer on shutdown (timer lifecycle sketched below)
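
The timer lifecycle itself is small; a hedged sketch follows (the real functions presumably close over their dependencies rather than taking a runScan callback):

```typescript
// Hypothetical timer lifecycle for startOrphanCleanup()/stopOrphanCleanup().
const ORPHAN_SCAN_INTERVAL_MS = 5 * 60 * 1000;

let orphanCleanupTimer: NodeJS.Timeout | null = null;

export function startOrphanCleanup(runScan: () => Promise<void>): void {
  if (orphanCleanupTimer) return; // idempotent: never arm a second timer
  orphanCleanupTimer = setInterval(() => {
    runScan().catch((err) => console.error('orphan scan failed', err));
  }, ORPHAN_SCAN_INTERVAL_MS);
}

export function stopOrphanCleanup(): void {
  // The null-check guard makes the double call from stopWorkerProcessor()
  // and detachAll() harmless.
  if (orphanCleanupTimer) {
    clearInterval(orphanCleanupTimer);
    orphanCleanupTimer = null;
  }
}
```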

Re-exports in src/router/worker-manager.ts

  • Exported startOrphanCleanup and stopOrphanCleanup for use by the router (see the snippet below)
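
Presumably a plain re-export along these lines (path assumed):

```typescript
// src/router/worker-manager.ts
export { startOrphanCleanup, stopOrphanCleanup } from './container-manager';
```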

Testing

  • Added 18 comprehensive unit tests covering:
    • Start/stop lifecycle
    • Idempotency and multiple cycles
    • Orphan detection and stopping
    • Tracked container preservation
    • Young container preservation
    • Docker error handling
    • Container stop error handling
    • Multi-container scenarios

All tests use a mocked Docker client to verify behavior without real containers; a representative test is sketched below.
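
A representative test might look like the following; the framework (vitest here) and import path are assumptions, and the mock shape matches the scan sketch above:

```typescript
// Hypothetical test sketch; suite layout is illustrative, not the repo's.
import { describe, expect, it, vi } from 'vitest';
import { scanAndCleanupOrphans } from '../src/router/container-manager';

describe('scanAndCleanupOrphans', () => {
  it('stops untracked containers older than workerTimeoutMs', async () => {
    const stop = vi.fn().mockResolvedValue(undefined);
    const docker = {
      // One container, untracked and one hour old.
      listContainers: vi.fn().mockResolvedValue([
        { Id: 'orphan-1', Created: Math.floor(Date.now() / 1000) - 3600 },
      ]),
      getContainer: vi.fn().mockReturnValue({ stop }),
    };

    const logger = { warn: vi.fn() };
    await scanAndCleanupOrphans(docker as never, new Map(), 60_000, logger);

    expect(stop).toHaveBeenCalledWith({ t: 15 }); // graceful 15-second stop
    expect(logger.warn).toHaveBeenCalledTimes(1); // orphan logged at warn level
  });
});
```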

Key Design Decisions

  1. 5-minute scan interval - balances responsiveness with resource usage
  2. Age threshold - uses workerTimeoutMs to avoid killing recently-spawned workers that haven't been registered yet
  3. Graceful shutdown - gives containers 15 seconds to shut down before Docker force-kills
  4. Error resilience - continues scanning even if individual container stops fail; Docker list failures are propagated as they indicate a critical issue
  5. Logging - each orphan is logged at warn level with container ID and age for observability

Acceptance Criteria Met

✅ New functions startOrphanCleanup() and stopOrphanCleanup() exported from src/router/container-manager.ts
✅ Periodic scan runs every 5 minutes using setInterval, listing Docker containers with cascade.managed=true label
✅ Containers not in the activeWorkers map AND older than workerTimeoutMs are stopped via container.stop()
✅ Containers under the age threshold are left alone (avoids killing newly-spawned workers not yet registered)
✅ Tracked containers (in activeWorkers) are never stopped by the orphan scan
✅ startOrphanCleanup() called during router startup in src/router/worker-manager.ts
✅ stopOrphanCleanup() called during router shutdown, clearInterval cleans up the timer
✅ Each orphan stopped is logged at warn level with container ID and age
✅ Unit tests with mocked Docker client verify: orphans stopped, tracked containers left, young containers left, Docker errors handled gracefully
✅ All existing tests pass, typecheck and lint are clean

Related Issue

https://trello.com/c/Tfp40KUo/306-as-a-developer-i-want-periodic-orphaned-container-cleanup-so-that-containers-surviving-router-restarts-are-eventually-stopped

@nhopeatall (Collaborator) left a comment

Summary

Clean, well-structured implementation of periodic orphaned container cleanup. The code is correct, properly integrated, and thoroughly tested. No blocking issues found.

Design Notes

The implementation makes good design choices:

  • Age threshold using workerTimeoutMs: Correctly prevents killing recently-spawned containers that haven't been registered yet. The synchronous path between container.start() and activeWorkers.set() in spawnWorker means no setInterval callback can fire in between, but the age threshold adds defense-in-depth.
  • Error resilience: Individual container stop failures are caught and logged without aborting the scan, while Docker list failures are propagated (correct — a list failure likely means Docker is down).
  • Graceful shutdown: Timer is properly cleaned up via both stopOrphanCleanup() in stopWorkerProcessor() and detachAll(). The double-call is redundant but harmless due to the null-check guard.
  • setInterval without unref(): Not an issue since the shutdown handler calls process.exit(0) explicitly (the unref() pattern is illustrated below for reference).
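
For reference, the unref() pattern this note alludes to looks like the following (hypothetical; not needed in this PR):

```typescript
// Hypothetical illustration: unref() stops the timer from holding the
// Node.js event loop open, so the process can exit naturally even if
// clearInterval() is never called.
const timer = setInterval(() => console.log('periodic scan tick'), 5 * 60 * 1000);
timer.unref();
```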

Minor Observations (non-blocking)

  • The orphan scan interval (5 min) is hardcoded as a local constant rather than being in routerConfig, unlike the similar emailScheduleIntervalMs. Fine for now since it's unlikely to need runtime customization.
  • The first scan fires 5 minutes after startup (not immediately), so orphaned containers from a previous router instance can survive for up to 5 extra minutes after a restart. Acceptable given the use case, but worth noting if faster cleanup is ever desired (a hypothetical variant follows).
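
If faster post-restart cleanup were ever wanted, a hypothetical tweak to the lifecycle sketch above would run one scan before arming the interval:

```typescript
// Hypothetical variant: one immediate scan on startup, then the usual
// 5-minute interval (orphanCleanupTimer and ORPHAN_SCAN_INTERVAL_MS as in
// the earlier sketch).
export function startOrphanCleanup(runScan: () => Promise<void>): void {
  if (orphanCleanupTimer) return; // idempotent guard, as in the PR
  runScan().catch((err) => console.error('initial orphan scan failed', err));
  orphanCleanupTimer = setInterval(() => {
    runScan().catch((err) => console.error('orphan scan failed', err));
  }, ORPHAN_SCAN_INTERVAL_MS);
}
```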

LGTM — ready to merge.

@aaight aaight merged commit c021618 into dev Mar 13, 2026
6 checks passed