feat(router): add snapshot manager and capture/reuse worker container snapshots#1046
feat(router): add snapshot manager and capture/reuse worker container snapshots#1046
Conversation
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
nhopeatall
left a comment
There was a problem hiding this comment.
Summary
Container leak when snapshotEnabled is true but workItemId is undefined. Otherwise a clean, well-structured feature addition.
Code Issues
Blocking
-
src/router/container-manager.ts:317 — Container leak when
snapshotEnabled=trueandworkItemIdis undefined. TheAutoRemoveflag is set tofalsebased solely onsnapshotEnabled(line 232), but the post-exit cleanup block on line 317 requiressnapshotEnabled && projectId && workItemId— all three. WhenworkItemIdisundefined(possible for some job types), the container runs withAutoRemove=false, exits normally, but theremoveContainer()call on line 327 is never reached. The container is orphaned on disk. Note the error-path handler (line 339) correctly checks onlyif (snapshotEnabled), which shows awareness of this edge case — but the success path doesn't match.Fix: Change the guard on line 317 to
if (snapshotEnabled), and make the commit conditional onprojectId && workItemIdinside the block:if (snapshotEnabled) { if (result.StatusCode === 0 && projectId && workItemId) { await commitContainerToSnapshot(container.id, projectId, workItemId); } else if (result.StatusCode !== 0) { logger.info('[WorkerManager] Skipping snapshot commit after non-zero exit:', { ... }); } // Always remove — consistent with the .catch() handler await removeContainer(container.id); }
Should Fix
-
src/router/container-manager.ts:150 —
resolveSpawnSettingsreturnssnapshotEnabled: trueeven whenworkItemIdisundefined. This is what enables the container leak —AutoRemovegets disabled for a container that can never be snapshot-committed. Either gatesnapshotEnabledonworkItemIdpresence inresolveSpawnSettings, or (preferred) ensure the cleanup site always handles it (the fix above). -
src/router/snapshot-manager.ts:30 — The in-memory snapshot Map has no TTL eviction or max-count enforcement. The router config declares
snapshotDefaultTtlMs(24h),snapshotMaxCount(5), andsnapshotMaxSizeBytes(10GB) but none of these are referenced anywhere in the implementation. For a long-running router with many projects/work items, this Map grows unboundedly. At minimum,getSnapshot()could checkcreatedAtagainst the TTL and returnundefined(+ delete the entry) for expired snapshots.
🕵️ claude-code · claude-opus-4-6 · run details
…orkItemId is undefined - Fix container leak: broaden success-path guard from `snapshotEnabled && projectId && workItemId` to just `snapshotEnabled`, so removeContainer() is always reached for snapshot-enabled runs regardless of whether workItemId is present — consistent with the .catch() error path. Commit is still gated on `projectId && workItemId` inside the block. - Add TTL eviction to getSnapshot(): expired entries (age > snapshotDefaultTtlMs) are eagerly deleted and undefined is returned, preventing unbounded Map growth. - Mock routerConfig in snapshot-manager unit tests and add TTL eviction tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
nhopeatall
left a comment
There was a problem hiding this comment.
Summary
Well-structured feature addition. The separation of pure state management (snapshot-manager.ts) from Docker operations (container-manager.ts) follows the existing module decomposition pattern established by active-workers.ts and orphan-cleanup.ts. The code is correct for the happy path and all key edge cases (failed exit, missing workItemId, commit failure). Two issues worth addressing.
Should Fix
-
src/router/snapshot-manager.ts:66 — Per-project
snapshotTtlMsis ignored:getSnapshot()always usesrouterConfig.snapshotDefaultTtlMsfor TTL expiry, but each project can configure its ownsnapshotTtlMsvia the DB/dashboard (seeconfigMapper.ts:92andproject-general-form.tsx). This means per-project TTL overrides have no effect — every snapshot expires at the global 24h default regardless of project configuration. Consider either: (1) adding attlMsparameter togetSnapshot()and passingprojectCfg?.snapshotTtlMs ?? routerConfig.snapshotDefaultTtlMsfromresolveSpawnSettings(), or (2) storing the effective TTL inSnapshotMetadataat registration time and checking against it ingetSnapshot(). -
src/router/container-manager.ts:232 — Orphan cleanup leaves stopped snapshot containers on disk:
orphan-cleanup.tscallscontainer.stop()but nevercontainer.remove()— it relies on Docker'sAutoRemoveto clean up. Snapshot-enabled containers haveAutoRemove=false, so when orphan cleanup stops them, they'll accumulate as stopped containers consuming disk. The orphan cleanup should callremove()afterstop()for containers with thecascade.snapshot.enabled=truelabel (or unconditionally, sinceremove()on an AutoRemove container is a harmless no-op).
Questions
- The router config defines
snapshotMaxCountandsnapshotMaxSizeBytesbut neither is enforced in this PR. Is there a follow-up planned to cap the number of in-memory snapshots and/or the on-disk image size? Unbounded growth of the snapshot map is low-risk (it's just metadata), but Docker images accumulating without a cap could consume significant disk.
🕵️ claude-code · claude-opus-4-6 · run details
…pshot containers - Add `ttlMs` parameter to `getSnapshot()` so callers can pass a per-project TTL override instead of always using the global default. `resolveSpawnSettings()` now reads `projectCfg?.snapshotTtlMs` and forwards it to `getSnapshot()`, so per-project TTL configuration via the DB/dashboard takes effect. - Call `container.remove()` after `container.stop()` in orphan cleanup so stopped snapshot containers (AutoRemove=false) are not left on disk. `remove()` failure is swallowed; on non-snapshot containers where AutoRemove=true Docker may have already removed them, making the no-op safe. - Update tests: existing orphan-cleanup mocks gain a `remove` mock; new tests verify remove behaviour and that per-project TTL is passed through the call chain. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
nhopeatall
left a comment
There was a problem hiding this comment.
Summary
LGTM — well-structured feature addition with clean separation of concerns and thorough test coverage.
The snapshot manager is a pure state module (no Docker API calls), the container-manager changes are well-guarded (commit only on exit code 0, always remove snapshot-enabled containers), and the orphan cleanup was correctly updated to remove stopped containers to prevent disk accumulation from AutoRemove=false snapshot runs.
Notes (non-blocking)
-
snapshotMaxCount/snapshotMaxSizeBytesare declared inrouterConfigbut not enforced anywhere in this PR. The in-memory snapshot registry and Docker images can grow without bound. This is fine for a first iteration, but worth tracking as follow-up work — especially the Docker image side, since metadata is ephemeral (lost on process restart) but committed images persist on the host. -
Image name collisions:
buildSnapshotImageNamesanitises IDs to[a-z0-9-], which could theoretically collide for project/work-item IDs that differ only in non-alphanumeric characters (e.g.,proj_1vsproj.1→ bothproj-1). Extremely unlikely in practice given the ID formats used, but worth being aware of.
🕵️ claude-code · claude-opus-4-6 · run details
Summary
src/router/snapshot-manager.ts— a pure state-management module that registers, looks up, and invalidates snapshot metadata keyed byprojectId:workItemIdsrc/router/container-manager.tsto prefer a snapshot image when spawning workers for snapshot-enabled projects (falls back torouterConfig.workerImagewhen no snapshot exists)AutoRemovefor snapshot-enabled containers so the container remains available fordocker commit; removes the container manually after the commit (or non-zero exit) completesresolveSpawnSettings()helper to deduplicateloadProjectConfig()calls and keepspawnWorker()within complexity limitsgetSnapshot,registerSnapshot,invalidateSnapshot) fromcontainer-manager.tsfor existing callerscascade.snapshot.enabledcontainer label for observabilitysnapshot-manager.test.ts(unit tests for the registry) andsnapshot-integration.test.ts(spawn integration tests for disabled, miss, hit, and commit behaviour)Card
https://trello.com/c/69c2920335fd24624e3a0bba
Test plan
snapshot-manager.test.ts: register, get, invalidate, and count snapshot metadatasnapshot-integration.test.ts: snapshot-disabled uses base image + AutoRemove=true; snapshot miss falls back to base image + AutoRemove=false; snapshot hit uses snapshot image; successful exit commits + removes container; non-zero exit skips commit but still removes container🤖 Generated with Claude Code
🕵️ claude-code · claude-sonnet-4-6 · run details