[PPSC-678] fix: graceful degradation when result fetching fails#141
Merged
[PPSC-678] fix: graceful degradation when result fetching fails#141
Conversation
The per-request HTTP client timeout was hardcoded at 60 seconds, causing "context deadline exceeded" errors when fetching normalized results under network latency or load. Increased to 180 seconds to match the ALB idle timeout and provide more headroom for result retrieval. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Test Coverage Reporttotal: (statements) 80.9% Coverage by function |
Instead of failing the entire scan when normalized result retrieval times out, retry up to 5 times with 10s intervals (total ~50s window). If all retries fail, print a warning with the scan ID so the user can retrieve results later, rather than reporting "scan failed" when the scan actually completed successfully on the server. Also reverts the timeout bump from 60s back to the original value since the retry + graceful degradation is the proper fix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Exit code 2 on incomplete results: Instead of returning nil error (exit 0) when results can't be fetched, return ErrResultsIncomplete which main.go handles with exit code 2. CI pipelines will not silently pass when results are unavailable. 2. Error discrimination: Only retry on transient errors (5xx, timeouts, network errors). Permanent errors (4xx, decode failures) break immediately instead of wasting ~50s retrying pointlessly. Added APIError type to the api package for structured status code checks. 3. Repo scanner test: Added TestScan/returns_ErrResultsIncomplete test to match the existing image scanner test coverage. 4. Image scan retry feedback: Added spinner with retry progress messages to image scans, matching the repo scan behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ests Fixes goconst lint failure by extracting "scan-123" into a testScanID constant. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Problem
The CLI reported "scan failed" (exit 1) when the
/api/v1/ingest/normalized-resultsHTTP request timed out, even though the scan completed successfully on the server. The scan ID was lost in the error, leaving users with no way to retrieve their results. In CI, this was indistinguishable from an actual scan failure.This is distinct from slow scans (e.g., due to Cerebras LLM latency in the AI verifier). Slow server-side processing is handled correctly by the polling loop in
WaitForIngest, which polls every 5s for up to 60 minutes. The failure only occurred in the result-fetching phase after the scan was already markedCOMPLETED.Solution
Retry with error discrimination: Retry result fetching up to 5 times with 10s intervals, but only for transient errors (5xx, timeouts, network errors). Permanent errors (4xx, decode failures) fail immediately. Each attempt also has 3 internal HTTP retries with 60s timeout, giving ~15 min total window for transient issues.
Non-zero exit code: On total failure, return
ErrResultsIncomplete(exit code 2) instead of nil (exit 0). CI pipelines will not silently pass when results are unavailable. Exit codes: 0 = success, 1 = findings exceeded--fail-onthreshold, 2 = results incomplete, 130 = cancelled.User feedback: Print warnings with the scan ID so users can retrieve results later. Both repo and image scans show retry progress via spinner.
Changes
internal/api/client.go: AddedAPIErrortype for structured HTTP status code checksinternal/output/output.go: AddedErrResultsIncompleteerror typecmd/armis-cli/main.go: HandleErrResultsIncompletewith exit code 2internal/scan/repo/repo.go: Retry loop with error discrimination and graceful degradationinternal/scan/image/image.go: Same retry logic + spinner feedback (was missing)Test plan
go build ./...compiles successfullygo test ./internal/scan/repo/...passes (new test for ErrResultsIncomplete)go test ./internal/scan/image/...passes (updated test)go test ./internal/api/...passesgo test ./internal/output/...passes🤖 Generated with Claude Code