Skip to content

feat(router): add snapshot manager and capture/reuse worker container snapshots#1046

Merged
aaight merged 3 commits intodevfrom
feature/snapshot-manager
Mar 24, 2026
Merged

feat(router): add snapshot manager and capture/reuse worker container snapshots#1046
aaight merged 3 commits intodevfrom
feature/snapshot-manager

Conversation

@aaight
Copy link
Copy Markdown
Collaborator

@aaight aaight commented Mar 24, 2026

Summary

  • Introduces src/router/snapshot-manager.ts — a pure state-management module that registers, looks up, and invalidates snapshot metadata keyed by projectId:workItemId
  • Updates src/router/container-manager.ts to prefer a snapshot image when spawning workers for snapshot-enabled projects (falls back to routerConfig.workerImage when no snapshot exists)
  • Commits containers to a snapshot image only after a successful (exit code 0) worker run; failed or timed-out runs never create or refresh snapshots
  • Disables AutoRemove for snapshot-enabled containers so the container remains available for docker commit; removes the container manually after the commit (or non-zero exit) completes
  • Extracts resolveSpawnSettings() helper to deduplicate loadProjectConfig() calls and keep spawnWorker() within complexity limits
  • Re-exports snapshot-manager public API (getSnapshot, registerSnapshot, invalidateSnapshot) from container-manager.ts for existing callers
  • Adds cascade.snapshot.enabled container label for observability
  • Full test coverage: snapshot-manager.test.ts (unit tests for the registry) and snapshot-integration.test.ts (spawn integration tests for disabled, miss, hit, and commit behaviour)

Card

https://trello.com/c/69c2920335fd24624e3a0bba

Test plan

  • All 6914 existing unit tests pass
  • New snapshot-manager.test.ts: register, get, invalidate, and count snapshot metadata
  • New snapshot-integration.test.ts: snapshot-disabled uses base image + AutoRemove=true; snapshot miss falls back to base image + AutoRemove=false; snapshot hit uses snapshot image; successful exit commits + removes container; non-zero exit skips commit but still removes container
  • TypeScript type check passes (zero errors)
  • Lint passes (zero warnings/errors)

🤖 Generated with Claude Code

🕵️ claude-code · claude-sonnet-4-6 · run details

@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 91.42857% with 15 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/router/container-manager.ts 86.23% 15 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown
Collaborator

@nhopeatall nhopeatall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

Container leak when snapshotEnabled is true but workItemId is undefined. Otherwise a clean, well-structured feature addition.

Code Issues

Blocking

  • src/router/container-manager.ts:317 — Container leak when snapshotEnabled=true and workItemId is undefined. The AutoRemove flag is set to false based solely on snapshotEnabled (line 232), but the post-exit cleanup block on line 317 requires snapshotEnabled && projectId && workItemId — all three. When workItemId is undefined (possible for some job types), the container runs with AutoRemove=false, exits normally, but the removeContainer() call on line 327 is never reached. The container is orphaned on disk. Note the error-path handler (line 339) correctly checks only if (snapshotEnabled), which shows awareness of this edge case — but the success path doesn't match.

    Fix: Change the guard on line 317 to if (snapshotEnabled), and make the commit conditional on projectId && workItemId inside the block:

    if (snapshotEnabled) {
      if (result.StatusCode === 0 && projectId && workItemId) {
        await commitContainerToSnapshot(container.id, projectId, workItemId);
      } else if (result.StatusCode !== 0) {
        logger.info('[WorkerManager] Skipping snapshot commit after non-zero exit:', { ... });
      }
      // Always remove — consistent with the .catch() handler
      await removeContainer(container.id);
    }

Should Fix

  • src/router/container-manager.ts:150resolveSpawnSettings returns snapshotEnabled: true even when workItemId is undefined. This is what enables the container leak — AutoRemove gets disabled for a container that can never be snapshot-committed. Either gate snapshotEnabled on workItemId presence in resolveSpawnSettings, or (preferred) ensure the cleanup site always handles it (the fix above).

  • src/router/snapshot-manager.ts:30 — The in-memory snapshot Map has no TTL eviction or max-count enforcement. The router config declares snapshotDefaultTtlMs (24h), snapshotMaxCount (5), and snapshotMaxSizeBytes (10GB) but none of these are referenced anywhere in the implementation. For a long-running router with many projects/work items, this Map grows unboundedly. At minimum, getSnapshot() could check createdAt against the TTL and return undefined (+ delete the entry) for expired snapshots.

🕵️ claude-code · claude-opus-4-6 · run details

…orkItemId is undefined

- Fix container leak: broaden success-path guard from
  `snapshotEnabled && projectId && workItemId` to just `snapshotEnabled`,
  so removeContainer() is always reached for snapshot-enabled runs regardless
  of whether workItemId is present — consistent with the .catch() error path.
  Commit is still gated on `projectId && workItemId` inside the block.
- Add TTL eviction to getSnapshot(): expired entries (age > snapshotDefaultTtlMs)
  are eagerly deleted and undefined is returned, preventing unbounded Map growth.
- Mock routerConfig in snapshot-manager unit tests and add TTL eviction tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@nhopeatall nhopeatall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

Well-structured feature addition. The separation of pure state management (snapshot-manager.ts) from Docker operations (container-manager.ts) follows the existing module decomposition pattern established by active-workers.ts and orphan-cleanup.ts. The code is correct for the happy path and all key edge cases (failed exit, missing workItemId, commit failure). Two issues worth addressing.

Should Fix

  • src/router/snapshot-manager.ts:66 — Per-project snapshotTtlMs is ignored: getSnapshot() always uses routerConfig.snapshotDefaultTtlMs for TTL expiry, but each project can configure its own snapshotTtlMs via the DB/dashboard (see configMapper.ts:92 and project-general-form.tsx). This means per-project TTL overrides have no effect — every snapshot expires at the global 24h default regardless of project configuration. Consider either: (1) adding a ttlMs parameter to getSnapshot() and passing projectCfg?.snapshotTtlMs ?? routerConfig.snapshotDefaultTtlMs from resolveSpawnSettings(), or (2) storing the effective TTL in SnapshotMetadata at registration time and checking against it in getSnapshot().

  • src/router/container-manager.ts:232 — Orphan cleanup leaves stopped snapshot containers on disk: orphan-cleanup.ts calls container.stop() but never container.remove() — it relies on Docker's AutoRemove to clean up. Snapshot-enabled containers have AutoRemove=false, so when orphan cleanup stops them, they'll accumulate as stopped containers consuming disk. The orphan cleanup should call remove() after stop() for containers with the cascade.snapshot.enabled=true label (or unconditionally, since remove() on an AutoRemove container is a harmless no-op).

Questions

  • The router config defines snapshotMaxCount and snapshotMaxSizeBytes but neither is enforced in this PR. Is there a follow-up planned to cap the number of in-memory snapshots and/or the on-disk image size? Unbounded growth of the snapshot map is low-risk (it's just metadata), but Docker images accumulating without a cap could consume significant disk.

🕵️ claude-code · claude-opus-4-6 · run details

…pshot containers

- Add `ttlMs` parameter to `getSnapshot()` so callers can pass a
  per-project TTL override instead of always using the global default.
  `resolveSpawnSettings()` now reads `projectCfg?.snapshotTtlMs` and
  forwards it to `getSnapshot()`, so per-project TTL configuration
  via the DB/dashboard takes effect.

- Call `container.remove()` after `container.stop()` in orphan cleanup
  so stopped snapshot containers (AutoRemove=false) are not left on disk.
  `remove()` failure is swallowed; on non-snapshot containers where
  AutoRemove=true Docker may have already removed them, making the
  no-op safe.

- Update tests: existing orphan-cleanup mocks gain a `remove` mock;
  new tests verify remove behaviour and that per-project TTL is passed
  through the call chain.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@nhopeatall nhopeatall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

LGTM — well-structured feature addition with clean separation of concerns and thorough test coverage.

The snapshot manager is a pure state module (no Docker API calls), the container-manager changes are well-guarded (commit only on exit code 0, always remove snapshot-enabled containers), and the orphan cleanup was correctly updated to remove stopped containers to prevent disk accumulation from AutoRemove=false snapshot runs.

Notes (non-blocking)

  • snapshotMaxCount / snapshotMaxSizeBytes are declared in routerConfig but not enforced anywhere in this PR. The in-memory snapshot registry and Docker images can grow without bound. This is fine for a first iteration, but worth tracking as follow-up work — especially the Docker image side, since metadata is ephemeral (lost on process restart) but committed images persist on the host.

  • Image name collisions: buildSnapshotImageName sanitises IDs to [a-z0-9-], which could theoretically collide for project/work-item IDs that differ only in non-alphanumeric characters (e.g., proj_1 vs proj.1 → both proj-1). Extremely unlikely in practice given the ID formats used, but worth being aware of.

🕵️ claude-code · claude-opus-4-6 · run details

@aaight aaight merged commit 414fbce into dev Mar 24, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants