Skip to content

fix(snapshots): gracefully recover when snapshot image is missing#1054

Merged
zbigniewsobiecki merged 1 commit intodevfrom
fix/snapshot-stale-image-fallback
Mar 25, 2026
Merged

fix(snapshots): gracefully recover when snapshot image is missing#1054
zbigniewsobiecki merged 1 commit intodevfrom
fix/snapshot-stale-image-fallback

Conversation

@zbigniewsobiecki
Copy link
Copy Markdown
Member

Summary

  • When a snapshot Docker image is deleted externally while the in-memory SnapshotManager still holds a reference, the next job dispatch crashed with a Docker 404 and left the card permanently stuck in "failed" state (every re-trigger hit the same 404 until the router was restarted)
  • Fix: detect the missing-image 404, invalidate the stale registry entry, and transparently retry with the base worker image — no user intervention required
  • Refactor spawnWorker internals to support the retry cleanly and address several code quality issues found during review

Changes

src/router/container-manager.ts

  • isImageNotFoundError now uses dockerode's typed statusCode === 404 as primary signal rather than fragile substring matching; secondary guard prevents false positives from other 404s
  • Extract docker create/start/monitor logic into createAndMonitorContainer so the fallback can reuse it without duplicating the spawn setup
  • Introduce ContainerLaunchConfig interface — reduces the helper from 11 positional params to 7, and makes the two call sites in spawnWorker explicitly show what differs (only workerImage and workerEnv)
  • Remove dead snapshotReuse parameter from createAndMonitorContainer (never referenced in the body; env already has CASCADE_SNAPSHOT_REUSE baked in before the call)
  • Include staleImage in captureException extra context on fallback failure for Sentry correlation

tests/unit/router/snapshot-integration.test.ts

  • Hoist mockInvalidateSnapshot so it is properly reset between tests (was an anonymous vi.fn() in the mock factory — not clearable, could accumulate calls across tests)
  • Shared staleImageError fixture with statusCode: 404 to match the strengthened predicate
  • Assert invalidateSnapshot is called in both the success path and the fallback-failure path
  • Add test: fallback container preserves snapshotEnabled (AutoRemove=false, will commit on success)
  • Add test: 404 on base image when snapshotReuse=false propagates without retry or snapshot invalidation

Test plan

  • npm test — 7152/7152 unit tests pass
  • npm run lint — clean
  • npm run typecheck — clean
  • Pre-push hook ran changed tests, all green

🤖 Generated with Claude Code

When a snapshot Docker image is deleted externally while the in-memory
SnapshotManager still holds a reference to it, the next job dispatch
crashed with a Docker 404 and left the card in "failed" state. Every
subsequent trigger attempt hit the same 404 until the router was restarted.

Fix: detect the missing-image error in spawnWorker, invalidate the stale
registry entry, and transparently retry with the base worker image so the
run proceeds without any user intervention.

Changes:
- isImageNotFoundError uses Docker's typed statusCode property (statusCode
  === 404) as the primary check rather than fragile substring matching
- Extract docker create/start/monitor logic into createAndMonitorContainer
  so the fallback can retry without duplicating the spawn setup
- Introduce ContainerLaunchConfig interface to reduce the helper from 11
  positional params to 7, making the two call sites clearly show what
  differs between primary and fallback (only workerImage and workerEnv)
- Remove dead snapshotReuse param from createAndMonitorContainer (it was
  never referenced inside the function; the env already has
  CASCADE_SNAPSHOT_REUSE baked in before the call)
- Include staleImage in captureException extra context on fallback failure
  so both errors can be correlated in Sentry

Tests:
- Hoist mockInvalidateSnapshot so it is properly reset between tests
- Verify invalidateSnapshot is called in both the success and failure paths
- Add test: fallback container preserves snapshotEnabled (AutoRemove=false)
- Add test: 404 on base image (snapshotReuse=false) propagates without
  retry and without invalidating any snapshot

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@zbigniewsobiecki zbigniewsobiecki merged commit f1c6bb0 into dev Mar 25, 2026
8 checks passed
@zbigniewsobiecki zbigniewsobiecki deleted the fix/snapshot-stale-image-fallback branch March 25, 2026 15:37
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 25, 2026

Codecov Report

❌ Patch coverage is 93.52941% with 11 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/router/container-manager.ts 93.52% 11 Missing ⚠️

📢 Thoughts on this report? Let us know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant