fix(snapshots): actually rmi evicted images + reconcile orphans on startup #1132
Merged
zbigniewsobiecki merged 1 commit into dev on Apr 16, 2026
Conversation
fix(snapshots): actually rmi evicted images + reconcile orphans on startup
The router's [SnapshotCleanup] loop has been a sham since it shipped:
evictSnapshots() removes entries from an in-memory Map and that's it. No
docker rmi anywhere in the codebase (verified — whole-source grep
returns zero matches). The Map dies on every router restart, so every
snapshot image for a work item that never re-runs is orphaned forever.
Symptom that surfaced: dev disk filled to 100% (50MB free of 436GB),
with 13 cascade-snapshot-llmist-* images on disk totaling ~40GB. Six
of those dated to March 25 — three weeks past the 24-hour TTL the
"cleanup" loop had been pretending to enforce.
Also: snapshotMaxSizeBytes (10GB) eviction has never fired because no
caller passed imageSizeBytes to registerSnapshot — container-manager:91
called it with three args, leaving size at 0 in the Map so the size
phase always thought the registry was empty.
Three narrow fixes that together make the configured TTL/max-count/max-size
limits actually mean something on disk (sketches of each follow the list):
1. evictSnapshots now returns the array of evicted SnapshotMetadata
instead of a count. snapshot-cleanup.runSnapshotCleanup() iterates
that list and calls docker.getImage(name).remove({ force: false }).
`force: false` so an in-use image survives (Docker returns 409 — we
swallow it). 404 means already gone — also swallowed. Anything else
logs a warning and a Sentry capture.
2. New snapshot-startup-sync module: on router boot, list every
cascade-snapshot-* image via docker.listImages, register each as a
"discovered" entry in the in-memory Map (synthetic project key
`__discovered__` to avoid colliding with real registrations), then
immediately runSnapshotCleanup so TTL/max-count/max-size apply right
away. Wired from worker-manager.startWorkerProcessor() as a
best-effort post-startup step. This is what would have caught the
six March images after the post-Linear-migration router restart.
3. commitContainerToSnapshot now inspects the freshly committed image
and passes Size to registerSnapshot, so the max-size eviction phase
actually has data to work with. Inspect failures don't block the
registration — they just leave imageSizeBytes undefined, so the entry
falls back to TTL/max-count enforcement.
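
To make the mechanics concrete, here is a minimal TypeScript sketch of fixes 1 and 2 against dockerode's promise API. Only the names quoted above (runSnapshotCleanup, evictSnapshots, registerDiscoveredSnapshot, the `__discovered__` key, the 409/404 swallow) come from this PR; the signatures, logging, and module layout are assumptions, not the shipped code.

```typescript
import Docker from "dockerode";
import * as Sentry from "@sentry/node";

interface SnapshotMetadata {
  imageName: string; // e.g. "cascade-snapshot-llmist-<id>" (assumed shape)
  imageSizeBytes?: number; // undefined when inspect failed (see fix 3)
}

// Assumed registry hooks, per the PR text: evictSnapshots now returns the
// evicted entries instead of a count.
declare function evictSnapshots(): SnapshotMetadata[];
declare function registerDiscoveredSnapshot(
  projectKey: string,
  imageName: string,
  imageSizeBytes?: number,
): void;

const docker = new Docker();

// Fix 1: actually remove evicted images from disk.
export async function runSnapshotCleanup(): Promise<void> {
  for (const snapshot of evictSnapshots()) {
    try {
      // force: false, so Docker answers 409 for an in-use image and the
      // image survives; only unreferenced images are reaped.
      await docker.getImage(snapshot.imageName).remove({ force: false });
      console.log(`[SnapshotCleanup] Removed snapshot image ${snapshot.imageName}`);
    } catch (err) {
      const status = (err as { statusCode?: number }).statusCode;
      if (status === 409 || status === 404) continue; // in use / already gone
      console.warn(`[SnapshotCleanup] rmi failed for ${snapshot.imageName}`, err);
      Sentry.captureException(err);
    }
  }
}

// Fix 2: rebuild the in-memory registry from Docker on boot, then enforce
// TTL/max-count/max-size immediately instead of on the next cleanup tick.
export async function syncSnapshotsOnStartup(): Promise<void> {
  for (const image of await docker.listImages()) {
    for (const tag of image.RepoTags ?? []) {
      if (!tag.startsWith("cascade-snapshot-")) continue;
      // Synthetic project key so discovered entries never collide with
      // registrations made by live work items.
      registerDiscoveredSnapshot("__discovered__", tag, image.Size);
    }
  }
  await runSnapshotCleanup();
}
```

Fix 3 reduces to an inspect-then-register step at commit time. Continuing from the sketch above, and noting that every name here other than "size as the 4th arg to registerSnapshot" is a placeholder:

```typescript
declare function registerSnapshot(
  projectKey: string,
  imageName: string,
  workItemId: string,
  imageSizeBytes?: number,
): void;

async function registerCommittedSnapshot(
  projectKey: string,
  imageName: string,
  workItemId: string,
): Promise<void> {
  let imageSizeBytes: number | undefined;
  try {
    imageSizeBytes = (await docker.getImage(imageName).inspect()).Size; // bytes
  } catch {
    // Inspect failure must not block registration: size stays undefined and
    // the entry is governed by TTL/max-count only.
  }
  registerSnapshot(projectKey, imageName, workItemId, imageSizeBytes);
}
```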
Tests: extended snapshot-manager.test.ts (return-type change + new
registerDiscoveredSnapshot dedup tests), rewrote snapshot-cleanup.test.ts
to mock dockerode and assert image.remove + 409/404 swallow + sentry on
unexpected error, added snapshot-startup-sync.test.ts (NEW), updated
snapshot-integration.test.ts to expect the size 4th arg on
registerSnapshot. All 7830 unit + 524 integration tests pass.
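
The rewritten snapshot-cleanup.test.ts is easiest to picture as a dockerode module mock plus per-test remove() behavior. A compressed, hypothetical vitest sketch (the import path and the seeding of the eviction registry are assumed, not the PR's actual test code):

```typescript
import { expect, it, vi } from "vitest";
import { runSnapshotCleanup } from "./snapshot-cleanup"; // hypothetical path

// vi.mock factories are hoisted above imports, so the shared mock fn
// must be hoisted too.
const { mockRemove } = vi.hoisted(() => ({ mockRemove: vi.fn() }));

vi.mock("dockerode", () => ({
  // Mock the default-export class: new Docker() yields an object whose
  // getImage() always returns our instrumented image handle.
  default: vi.fn(() => ({
    getImage: vi.fn(() => ({ remove: mockRemove })),
  })),
}));

it("swallows a 409 (image still in use) instead of failing the sweep", async () => {
  // Assumes the registry was seeded with one evicted entry (elided here).
  mockRemove.mockRejectedValueOnce(
    Object.assign(new Error("conflict"), { statusCode: 409 }),
  );
  await expect(runSnapshotCleanup()).resolves.toBeUndefined();
});
```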
Out of scope: backfilling the 40GB on disk (cleaned manually with
docker prune; the next router restart will keep on-disk state under
SNAPSHOT_MAX_COUNT/SNAPSHOT_MAX_SIZE_BYTES going forward). The 119GB
build cache hoarding is a separate operational concern (BuildKit GC
config, not a Cascade-level bug).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests

- snapshot-manager.test.ts — extended (return-type change to .toHaveLength(N), new registerDiscoveredSnapshot dedup tests, new "evicted metadata is returned" test).
- snapshot-cleanup.test.ts — rewritten to mock dockerode and assert image.remove per evicted entry, 409/404 swallowed, Sentry on unexpected error.
- snapshot-startup-sync.test.ts (NEW) — listImages registration, register-before-cleanup ordering, multi-tag support, Docker-down resilience.
- snapshot-integration.test.ts — added a file-wide beforeEach to re-arm mockDockerGetImage after vi.restoreAllMocks() wipes it; updated the assertion to expect size as the 4th arg to registerSnapshot.

All 7830 unit + 524 integration tests pass.
npm run typecheck, npm run lint, and npm run build are all clean; dist/router/snapshot-startup-sync.js is produced.

Out of scope
- Backfilling the 40GB already on disk (cleaned manually with docker prune while filing this PR; the next router restart will keep on-disk state under SNAPSHOT_MAX_COUNT/SNAPSHOT_MAX_SIZE_BYTES going forward).
- Reworking the buildPipelineLists shape so a single per-key ?? chain isn't needed (separate cleanup; do it only when a 4th provider lands).

Test plan
- npm test — 7830 / 7830 ✓
- npm run test:integration — 524 / 524 ✓ (~5.5 min)
- npm run typecheck and npm run lint clean
- npm run build produces dist/router/snapshot-startup-sync.js
- On dev: docker images | grep cascade-snapshot | wc -l stays ≤ SNAPSHOT_MAX_COUNT (5). Watch docker logs cascade-router-dev | grep SnapshotCleanup for "Removed snapshot image" lines and "[SnapshotStartupSync] Reconciled snapshot images from Docker" at boot.

🤖 Generated with Claude Code