fix(snapshots): actually rmi evicted images + reconcile orphans on startup#1132

Merged
zbigniewsobiecki merged 1 commit into dev from fix/snapshot-disk-leak on Apr 16, 2026

Conversation

@zbigniewsobiecki
Member

Summary

The router's [SnapshotCleanup] loop has been a sham since it shipped: evictSnapshots() removes entries from an in-memory Map and that's it. No docker rmi anywhere in the codebase (verified — whole-source grep returns zero matches). The Map dies on every router restart, so every snapshot image for a work item that never re-runs is orphaned forever.

Symptom that surfaced: dev disk filled to 100% (50MB free of 436GB), with 13 cascade-snapshot-llmist-* images on disk totaling ~40GB. Six of those dated to March 25 — three weeks past the 24-hour TTL the "cleanup" loop had been pretending to enforce.

A second bug compounds it: snapshotMaxSizeBytes (10GB) eviction has never fired, because no caller passed imageSizeBytes to registerSnapshot. container-manager:91 called it with three args, leaving size at 0 in the Map, so the size phase always thought the registry was empty.

Three narrow fixes

  1. evictSnapshots now returns the array of evicted SnapshotMetadata instead of a count. snapshot-cleanup.runSnapshotCleanup() iterates that list and calls docker.getImage(name).remove({ force: false }). With force: false, an in-use image survives (Docker returns 409, which is swallowed); a 404 means the image is already gone and is also swallowed. Anything else logs a warning and is captured to Sentry.

  2. New snapshot-startup-sync module: on router boot, list every cascade-snapshot-* image via docker.listImages, register each as a "discovered" entry in the in-memory Map (synthetic project key __discovered__ to avoid colliding with real registrations), then immediately runSnapshotCleanup so TTL/max-count/max-size apply right away. Wired from worker-manager.startWorkerProcessor() as a best-effort post-startup step. This is what would have caught the six March images after the post-Linear-migration router restart.

  3. commitContainerToSnapshot now inspects the freshly committed image and passes Size to registerSnapshot, so the max-size eviction phase actually has data to work with. Inspect failures don't block registration — they just leave imageSizeBytes undefined, which falls back to TTL/max-count enforcement.
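
The removal path in fix 1 can be sketched roughly as below. This is a hedged illustration, not the PR's actual code: `removeEvictedImages`, the `DockerImage`/`DockerClient` interfaces, and the `warn` callback are stand-ins for dockerode's `Image.remove` and the router's logger, and the Sentry capture is elided.

```typescript
// Hypothetical minimal shapes standing in for dockerode's Image and the
// registry's SnapshotMetadata; names here are illustrative, not the PR's API.
interface SnapshotMetadata {
  imageName: string;
}

interface DockerImage {
  remove(opts: { force: boolean }): Promise<void>;
}

interface DockerClient {
  getImage(name: string): DockerImage;
}

// Mirrors the described control flow: force: false so an in-use image
// survives, swallow 409 (in use) and 404 (already gone), and hand anything
// else to the caller's warn hook (where the real code also captures to Sentry).
async function removeEvictedImages(
  docker: DockerClient,
  evicted: SnapshotMetadata[],
  warn: (msg: string) => void,
): Promise<number> {
  let removed = 0;
  for (const snap of evicted) {
    try {
      await docker.getImage(snap.imageName).remove({ force: false });
      removed++;
    } catch (err) {
      const status = (err as { statusCode?: number }).statusCode;
      if (status === 409 || status === 404) continue; // in use / already gone
      warn(`Failed to remove snapshot image ${snap.imageName}: ${err}`);
    }
  }
  return removed;
}
```

Keeping `force: false` is the load-bearing choice here: it lets the Docker daemon, not the router, decide whether an image is still referenced by a running container.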
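
The startup reconciliation in fix 2 boils down to a filter-and-register pass over `docker.listImages`. The sketch below is an assumption-laden paraphrase: the `ImageInfo`/`Snapshot` shapes, the `reconcileDiscoveredSnapshots` name, and the dedup rule are illustrative; only the `cascade-snapshot-` prefix and the `__discovered__` key come from the PR text.

```typescript
// Hypothetical shapes: ImageInfo approximates one entry from
// docker.listImages(); Snapshot approximates the in-memory registry value.
interface ImageInfo {
  RepoTags?: string[];
  Created: number; // unix seconds; used as the TTL clock for discovered entries
}

interface Snapshot {
  projectKey: string;
  imageName: string;
  createdAt: number;
}

const DISCOVERED_KEY = "__discovered__";

// Register every cascade-snapshot-* tag not already in the registry under the
// synthetic project key, so a later runSnapshotCleanup() pass can apply
// TTL/max-count/max-size to images that predate this router process.
function reconcileDiscoveredSnapshots(
  images: ImageInfo[],
  registry: Map<string, Snapshot>,
): number {
  let added = 0;
  for (const img of images) {
    for (const tag of img.RepoTags ?? []) {
      if (!tag.startsWith("cascade-snapshot-")) continue;
      if (registry.has(tag)) continue; // a real registration wins over discovery
      registry.set(tag, {
        projectKey: DISCOVERED_KEY,
        imageName: tag,
        createdAt: img.Created * 1000,
      });
      added++;
    }
  }
  return added;
}
```

Iterating over every tag (rather than the first) is what the multi-tag test case in snapshot-startup-sync.test.ts would exercise, assuming the real module does the same.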
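
Fix 3's "inspect failures don't block registration" behavior is a small degrade-gracefully pattern; a minimal sketch, assuming hypothetical `inspect` and `register` callbacks in place of the real dockerode inspect and registerSnapshot calls:

```typescript
// Illustrative only: size lookup is best-effort. If inspect fails, the
// snapshot is still registered, just without imageSizeBytes, so size-based
// eviction skips it while TTL/max-count enforcement still applies.
async function commitAndRegister(
  inspect: () => Promise<{ Size: number }>,
  register: (sizeBytes?: number) => void,
): Promise<void> {
  let size: number | undefined;
  try {
    size = (await inspect()).Size;
  } catch {
    // Swallowed: a missing size must never block registration.
  }
  register(size);
}
```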

Tests

  • snapshot-manager.test.ts — extended (return-type change to .toHaveLength(N), new registerDiscoveredSnapshot dedup tests, new "evicted metadata is returned" test).
  • snapshot-cleanup.test.ts — rewrote to mock dockerode and assert image.remove per evicted entry, 409/404 swallowed, Sentry on unexpected error.
  • snapshot-startup-sync.test.ts (NEW) — listImages registration, register-before-cleanup ordering, multi-tag support, Docker-down resilience.
  • snapshot-integration.test.ts — added file-wide beforeEach to re-arm mockDockerGetImage after vi.restoreAllMocks() wipes it; updated assertion to expect the size 4th arg.

All 7830 unit + 524 integration tests pass. npm run typecheck, npm run lint, npm run build clean. dist/router/snapshot-startup-sync.js produced.

Out of scope

  • ❌ Backfilling the 40GB on disk (cleaned manually with docker prune while filing this PR; the next router restart will keep on-disk state under SNAPSHOT_MAX_COUNT / SNAPSHOT_MAX_SIZE_BYTES going forward).
  • ❌ The 119GB build cache hoarding (separate operational concern — BuildKit GC config, not a Cascade-level bug).
  • ❌ Persisting snapshot metadata to the DB (startup reconciliation from Docker itself is sufficient and avoids new schema).
  • ❌ Refactoring buildPipelineLists shape so a single per-key ?? chain isn't needed (separate cleanup, do only when a 4th provider lands).

Test plan

  • npm test — 7830 / 7830 ✓
  • npm run test:integration — 524 / 524 ✓ (~5.5 min)
  • npm run typecheck, npm run lint clean
  • npm run build produces dist/router/snapshot-startup-sync.js
  • After deploy: docker images | grep cascade-snapshot | wc -l stays ≤ SNAPSHOT_MAX_COUNT (5). Watch docker logs cascade-router-dev | grep SnapshotCleanup for Removed snapshot image lines and [SnapshotStartupSync] Reconciled snapshot images from Docker at boot.

🤖 Generated with Claude Code

@zbigniewsobiecki zbigniewsobiecki merged commit 23dd458 into dev Apr 16, 2026
9 checks passed

codecov Bot commented Apr 16, 2026

Codecov Report

❌ Patch coverage is 92.42424% with 10 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/router/container-manager.ts | 69.23% | 4 Missing ⚠️ |
| src/router/worker-manager.ts | 50.00% | 3 Missing ⚠️ |
| src/router/snapshot-manager.ts | 94.28% | 2 Missing ⚠️ |
| src/router/snapshot-cleanup.ts | 97.72% | 1 Missing ⚠️ |


@zbigniewsobiecki zbigniewsobiecki deleted the fix/snapshot-disk-leak branch April 16, 2026 18:21