fix(snapshots): actually rmi evicted images + reconcile orphans on startup#1132

Merged
zbigniewsobiecki merged 1 commit into dev from fix/snapshot-disk-leak on Apr 16, 2026

Conversation

@zbigniewsobiecki
Member

Summary

The router's [SnapshotCleanup] loop has been a sham since it shipped: evictSnapshots() removes entries from an in-memory Map and that's it. No docker rmi anywhere in the codebase (verified — whole-source grep returns zero matches). The Map dies on every router restart, so every snapshot image for a work item that never re-runs is orphaned forever.

Symptom that surfaced: dev disk filled to 100% (50MB free of 436GB), with 13 cascade-snapshot-llmist-* images on disk totaling ~40GB. Six of those dated to March 25 — three weeks past the 24-hour TTL the "cleanup" loop had been pretending to enforce.

A second bug compounds it: snapshotMaxSizeBytes (10GB) eviction has never fired, because no caller passed imageSizeBytes to registerSnapshot. container-manager:91 called it with three args, leaving size at 0 in the Map, so the size phase always thought the registry was empty.

Three narrow fixes

  1. evictSnapshots now returns the array of evicted SnapshotMetadata instead of a count. snapshot-cleanup.runSnapshotCleanup() iterates that list and calls docker.getImage(name).remove({ force: false }). With force: false, an in-use image survives (Docker returns 409, which is swallowed); a 404 means the image is already gone and is also swallowed. Anything else logs a warning and is captured to Sentry.

  2. New snapshot-startup-sync module: on router boot, list every cascade-snapshot-* image via docker.listImages, register each as a "discovered" entry in the in-memory Map (synthetic project key __discovered__ to avoid colliding with real registrations), then immediately runSnapshotCleanup so TTL/max-count/max-size apply right away. Wired from worker-manager.startWorkerProcessor() as a best-effort post-startup step. This is what would have caught the six March images after the post-Linear-migration router restart.

  3. commitContainerToSnapshot now inspects the freshly committed image and passes Size to registerSnapshot, so the max-size eviction phase actually has data to work with. Inspect failures don't block registration — they just leave imageSizeBytes undefined, which falls back to TTL/max-count enforcement.
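
The removal path in fix 1 can be sketched roughly as below. This is a hedged illustration, not the PR's actual code: `removeEvictedImages`, the `DockerImage`/`DockerClient` interfaces, and the `warn` callback are stand-ins for dockerode's `Image.remove` and the router's logger, and the Sentry capture is elided.

```typescript
// Hypothetical minimal shapes standing in for dockerode's Image and the
// registry's SnapshotMetadata; names here are illustrative, not the PR's API.
interface SnapshotMetadata {
  imageName: string;
}

interface DockerImage {
  remove(opts: { force: boolean }): Promise<void>;
}

interface DockerClient {
  getImage(name: string): DockerImage;
}

// Mirrors the described control flow: force: false so an in-use image
// survives, swallow 409 (in use) and 404 (already gone), and hand anything
// else to the caller's warn hook (where the real code also captures to Sentry).
async function removeEvictedImages(
  docker: DockerClient,
  evicted: SnapshotMetadata[],
  warn: (msg: string) => void,
): Promise<number> {
  let removed = 0;
  for (const snap of evicted) {
    try {
      await docker.getImage(snap.imageName).remove({ force: false });
      removed++;
    } catch (err) {
      const status = (err as { statusCode?: number }).statusCode;
      if (status === 409 || status === 404) continue; // in use / already gone
      warn(`Failed to remove snapshot image ${snap.imageName}: ${err}`);
    }
  }
  return removed;
}
```

Keeping `force: false` is the load-bearing choice here: it lets the Docker daemon, not the router, decide whether an image is still referenced by a running container.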
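
The startup reconciliation in fix 2 boils down to a filter-and-register pass over `docker.listImages`. The sketch below is an assumption-laden paraphrase: the `ImageInfo`/`Snapshot` shapes, the `reconcileDiscoveredSnapshots` name, and the dedup rule are illustrative; only the `cascade-snapshot-` prefix and the `__discovered__` key come from the PR text.

```typescript
// Hypothetical shapes: ImageInfo approximates one entry from
// docker.listImages(); Snapshot approximates the in-memory registry value.
interface ImageInfo {
  RepoTags?: string[];
  Created: number; // unix seconds; used as the TTL clock for discovered entries
}

interface Snapshot {
  projectKey: string;
  imageName: string;
  createdAt: number;
}

const DISCOVERED_KEY = "__discovered__";

// Register every cascade-snapshot-* tag not already in the registry under the
// synthetic project key, so a later runSnapshotCleanup() pass can apply
// TTL/max-count/max-size to images that predate this router process.
function reconcileDiscoveredSnapshots(
  images: ImageInfo[],
  registry: Map<string, Snapshot>,
): number {
  let added = 0;
  for (const img of images) {
    for (const tag of img.RepoTags ?? []) {
      if (!tag.startsWith("cascade-snapshot-")) continue;
      if (registry.has(tag)) continue; // a real registration wins over discovery
      registry.set(tag, {
        projectKey: DISCOVERED_KEY,
        imageName: tag,
        createdAt: img.Created * 1000,
      });
      added++;
    }
  }
  return added;
}
```

Iterating over every tag (rather than the first) is what the multi-tag test case in snapshot-startup-sync.test.ts would exercise, assuming the real module does the same.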
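
Fix 3's "inspect failures don't block registration" behavior is a small degrade-gracefully pattern; a minimal sketch, assuming hypothetical `inspect` and `register` callbacks in place of the real dockerode inspect and registerSnapshot calls:

```typescript
// Illustrative only: size lookup is best-effort. If inspect fails, the
// snapshot is still registered, just without imageSizeBytes, so size-based
// eviction skips it while TTL/max-count enforcement still applies.
async function commitAndRegister(
  inspect: () => Promise<{ Size: number }>,
  register: (sizeBytes?: number) => void,
): Promise<void> {
  let size: number | undefined;
  try {
    size = (await inspect()).Size;
  } catch {
    // Swallowed: a missing size must never block registration.
  }
  register(size);
}
```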

Tests

  • snapshot-manager.test.ts — extended (return-type change to .toHaveLength(N), new registerDiscoveredSnapshot dedup tests, new "evicted metadata is returned" test).
  • snapshot-cleanup.test.ts — rewrote to mock dockerode and assert image.remove per evicted entry, 409/404 swallowed, Sentry on unexpected error.
  • snapshot-startup-sync.test.ts (NEW) — listImages registration, register-before-cleanup ordering, multi-tag support, Docker-down resilience.
  • snapshot-integration.test.ts — added file-wide beforeEach to re-arm mockDockerGetImage after vi.restoreAllMocks() wipes it; updated assertion to expect the size 4th arg.

All 7830 unit + 524 integration tests pass. npm run typecheck, npm run lint, npm run build clean. dist/router/snapshot-startup-sync.js produced.

Out of scope

  • ❌ Backfilling the 40GB on disk (cleaned manually with docker prune while filing this PR; the next router restart will keep on-disk state under SNAPSHOT_MAX_COUNT / SNAPSHOT_MAX_SIZE_BYTES going forward).
  • ❌ The 119GB build cache hoarding (separate operational concern — BuildKit GC config, not a Cascade-level bug).
  • ❌ Persisting snapshot metadata to the DB (startup reconciliation from Docker itself is sufficient and avoids new schema).
  • ❌ Refactoring buildPipelineLists shape so a single per-key ?? chain isn't needed (separate cleanup, do only when a 4th provider lands).

Test plan

  • npm test — 7830 / 7830 ✓
  • npm run test:integration — 524 / 524 ✓ (~5.5 min)
  • npm run typecheck, npm run lint clean
  • npm run build produces dist/router/snapshot-startup-sync.js
  • After deploy: docker images | grep cascade-snapshot | wc -l stays ≤ SNAPSHOT_MAX_COUNT (5). Watch docker logs cascade-router-dev | grep SnapshotCleanup for Removed snapshot image lines and [SnapshotStartupSync] Reconciled snapshot images from Docker at boot.

🤖 Generated with Claude Code

@zbigniewsobiecki zbigniewsobiecki merged commit 23dd458 into dev Apr 16, 2026
9 checks passed

codecov Bot commented Apr 16, 2026

Codecov Report

❌ Patch coverage is 92.42424% with 10 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/router/container-manager.ts | 69.23% | 4 Missing ⚠️ |
| src/router/worker-manager.ts | 50.00% | 3 Missing ⚠️ |
| src/router/snapshot-manager.ts | 94.28% | 2 Missing ⚠️ |
| src/router/snapshot-cleanup.ts | 97.72% | 1 Missing ⚠️ |


@zbigniewsobiecki zbigniewsobiecki deleted the fix/snapshot-disk-leak branch April 16, 2026 18:21