fix(snapshots): gracefully recover when snapshot image is missing #1054
Merged
zbigniewsobiecki merged 1 commit into dev on Mar 25, 2026
When a snapshot Docker image is deleted externally while the in-memory SnapshotManager still holds a reference to it, the next job dispatch crashed with a Docker 404 and left the card in "failed" state. Every subsequent trigger attempt hit the same 404 until the router was restarted.

Fix: detect the missing-image error in spawnWorker, invalidate the stale registry entry, and transparently retry with the base worker image so the run proceeds without any user intervention.

Changes:
- isImageNotFoundError uses Docker's typed statusCode property (statusCode === 404) as the primary check rather than fragile substring matching
- Extract docker create/start/monitor logic into createAndMonitorContainer so the fallback can retry without duplicating the spawn setup
- Introduce ContainerLaunchConfig interface to reduce the helper from 11 positional params to 7, making the two call sites clearly show what differs between primary and fallback (only workerImage and workerEnv)
- Remove dead snapshotReuse param from createAndMonitorContainer (it was never referenced inside the function; the env already has CASCADE_SNAPSHOT_REUSE baked in before the call)
- Include staleImage in captureException extra context on fallback failure so both errors can be correlated in Sentry

Tests:
- Hoist mockInvalidateSnapshot so it is properly reset between tests
- Verify invalidateSnapshot is called in both the success and failure paths
- Add test: fallback container preserves snapshotEnabled (AutoRemove=false)
- Add test: 404 on base image (snapshotReuse=false) propagates without retry and without invalidating any snapshot

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
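The detect, invalidate, and retry flow described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from src/router/container-manager.ts: the names DockerLikeError, LaunchDeps, and launchWithFallback are hypothetical, and the message check standing in for the secondary guard is an assumption about its form.

```typescript
// Hypothetical error shape: dockerode-style errors carry a numeric statusCode.
interface DockerLikeError extends Error {
  statusCode?: number;
}

// Primary signal is the typed statusCode; the message check is an ASSUMED
// form of a secondary guard that filters out unrelated 404s.
function isImageNotFoundError(err: unknown): boolean {
  if (!(err instanceof Error)) return false;
  if ((err as DockerLikeError).statusCode !== 404) return false;
  return /no such image/i.test(err.message);
}

// Hypothetical dependencies, injected so the flow can be exercised in isolation.
interface LaunchDeps {
  launch: (image: string) => Promise<string>;
  invalidateSnapshot: (image: string) => void;
}

async function launchWithFallback(
  deps: LaunchDeps,
  snapshotImage: string | undefined,
  baseImage: string,
): Promise<string> {
  if (snapshotImage === undefined) {
    // No snapshot in play: a 404 on the base image propagates unchanged.
    return deps.launch(baseImage);
  }
  try {
    return await deps.launch(snapshotImage);
  } catch (err) {
    if (!isImageNotFoundError(err)) throw err;
    // Drop the stale registry entry first, so later dispatches skip it
    // even if the fallback launch below also fails.
    deps.invalidateSnapshot(snapshotImage);
    return deps.launch(baseImage);
  }
}
```

Invalidating before the retry mirrors the test expectation that invalidateSnapshot is called in both the success and the failure paths of the fallback.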
Summary
- When a snapshot Docker image is deleted externally while the in-memory SnapshotManager still holds a reference, the next job dispatch crashed with a Docker 404 and left the card permanently stuck in "failed" state (every re-trigger hit the same 404 until the router was restarted)
- Refactor spawnWorker internals to support the retry cleanly and address several code quality issues found during review

Changes
src/router/container-manager.ts
- isImageNotFoundError now uses dockerode's typed statusCode === 404 as the primary signal rather than fragile substring matching; a secondary guard prevents false positives from other 404s
- Extract docker create/start/monitor logic into createAndMonitorContainer so the fallback can reuse it without duplicating the spawn setup
- Introduce ContainerLaunchConfig interface: reduces the helper from 11 positional params to 7, and makes the two call sites in spawnWorker explicitly show what differs (only workerImage and workerEnv)
- Remove dead snapshotReuse parameter from createAndMonitorContainer (never referenced in the body; env already has CASCADE_SNAPSHOT_REUSE baked in before the call)
- Include staleImage in captureException extra context on fallback failure for Sentry correlation

tests/unit/router/snapshot-integration.test.ts
- Hoist mockInvalidateSnapshot so it is properly reset between tests (it was an anonymous vi.fn() in the mock factory: not clearable, could accumulate calls across tests)
- staleImageError fixture with statusCode: 404 to match the strengthened predicate
- Verify invalidateSnapshot is called in both the success path and the fallback-failure path
- Add test: fallback container preserves snapshotEnabled (AutoRemove=false, will commit on success)
- Add test: 404 on base image with snapshotReuse=false propagates without retry or snapshot invalidation

Test plan
- npm test: 7152/7152 unit tests pass
- npm run lint: clean
- npm run typecheck: clean

🤖 Generated with Claude Code
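As a rough illustration of the ContainerLaunchConfig change listed under Changes, the sketch below shows how a config object can make the primary and fallback call sites differ only in workerImage and workerEnv. Only workerImage, workerEnv, and snapshotEnabled are named in this PR; the containerName field, the toFallbackConfig and hostConfigFor helpers, and the "KEY=value" handling of CASCADE_SNAPSHOT_REUSE are assumptions made for the example.

```typescript
// Hypothetical field set; the real interface in container-manager.ts may differ.
interface ContainerLaunchConfig {
  workerImage: string;
  workerEnv: string[]; // "KEY=value" entries, the form Docker expects for Env
  snapshotEnabled: boolean;
  containerName: string; // assumed field, for illustration only
}

// Derive the fallback config from the primary one. Only the image and the
// env change; every other field is carried over untouched.
function toFallbackConfig(
  primary: ContainerLaunchConfig,
  baseImage: string,
): ContainerLaunchConfig {
  return {
    ...primary,
    workerImage: baseImage,
    // ASSUMPTION: the fallback simply omits the snapshot-reuse flag.
    workerEnv: primary.workerEnv.filter(
      (entry) => !entry.startsWith("CASCADE_SNAPSHOT_REUSE="),
    ),
  };
}

// snapshotEnabled maps to AutoRemove=false so the finished container
// remains available to be committed as a new snapshot.
function hostConfigFor(config: ContainerLaunchConfig): { AutoRemove: boolean } {
  return { AutoRemove: !config.snapshotEnabled };
}

const primary: ContainerLaunchConfig = {
  workerImage: "worker-snapshot:abc",
  workerEnv: ["JOB_ID=42", "CASCADE_SNAPSHOT_REUSE=true"],
  snapshotEnabled: true,
  containerName: "worker-42",
};
const fallback = toFallbackConfig(primary, "worker-base:latest");
```

Because the fallback is built by spreading the primary config, snapshotEnabled is preserved automatically, which is the property the "fallback container preserves snapshotEnabled" test checks.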