🔍 Smoke Test Investigation - Run #63
Summary
The Smoke Claude workflow failed in its detection job because of a transient Anthropic API 500 error ("Overloaded"). The main agent job completed successfully and created the expected issue output, but the secondary threat detection analysis failed while the API service was temporarily overloaded.
Failure Details
- Run: 18907236182
- Workflow: Smoke Claude
- Commit: 8f8cfbd
- Trigger: schedule
- Duration: 4.7 minutes
- Date: 2025-10-29 12:09:23 UTC
Root Cause Analysis
Primary Issue
The detection job failed with an Anthropic API 500 error indicating the service was overloaded:
API Error: 500 {"type":"error","error":{"type":"api_error","message":"Overloaded"},"request_id":null}
This occurred when the detection agent attempted to read /tmp/gh-aw/threat-detection/prompt.txt using the Read tool. The API call failed with "Streaming fallback triggered" before returning the overload error.
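As a rough illustration of how a log line like the one above could be triaged automatically, the sketch below splits it into status, error type, and message, then flags it as transient. The function name `classify_api_error` and the set of retryable status codes are assumptions made for this example, not part of the workflow.

```python
import json

# Assumed policy for illustration: HTTP statuses treated as transient.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504, 529}

def classify_api_error(log_line: str) -> dict:
    """Parse 'API Error: <status> <json>' into status, type, message, and a transient flag."""
    prefix = "API Error: "
    if not log_line.startswith(prefix):
        raise ValueError("not an API error line")
    # The status code and the JSON payload are separated by the first space.
    status_text, _, payload = log_line[len(prefix):].partition(" ")
    status = int(status_text)
    error = json.loads(payload)["error"]
    return {
        "status": status,
        "type": error["type"],
        "message": error["message"],
        "transient": status in RETRYABLE_STATUSES,
    }

line = 'API Error: 500 {"type":"error","error":{"type":"api_error","message":"Overloaded"},"request_id":null}'
print(classify_api_error(line))
```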
Why This Matters
- The main agent job succeeded (created issue output, 16 turns, 162k tokens, ~$0.22)
- The create_issue job succeeded
- Only the detection job failed due to this transient API error
- This is a new failure pattern - first occurrence of "Overloaded" error in our investigation history
Is This a Real Problem?
This is a transient infrastructure issue with the Anthropic API, not a code bug. However, it represents a workflow robustness issue:
- The detection job is a secondary analysis step
- Its failure caused the entire smoke test to fail
- This could mask real issues in future runs
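One way to picture the robustness concern is to treat the secondary job as advisory when computing the overall result. The sketch below is a hypothetical gating policy, not how the workflow currently behaves; `ADVISORY_JOBS` and `overall_status` are names invented for this example, and the job results mirror this run.

```python
# Hypothetical policy: jobs that may fail without failing the whole run.
ADVISORY_JOBS = {"detection"}

def overall_status(job_results: dict[str, str]) -> str:
    """Return 'failure' only when a required (non-advisory) job failed."""
    failed = {name for name, result in job_results.items() if result != "success"}
    blocking = failed - ADVISORY_JOBS
    if blocking:
        return "failure"
    # Advisory failures are surfaced as warnings rather than hidden.
    return "success_with_warnings" if failed else "success"

# Job results as reported for this run.
run_63 = {
    "pre_activation": "success",
    "activation": "success",
    "agent": "success",
    "create_issue": "success",
    "detection": "failure",
}
print(overall_status(run_63))
```

Under this assumed policy the run would be reported with a warning instead of a hard failure, while a failure in the agent or create_issue job would still fail the run.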
Failed Jobs and Errors
Failed: detection (2.7m)
Error Type: api_error - Anthropic API Overloaded
HTTP Status: 500
Message: "Overloaded"
Context: Failed while attempting to read threat detection prompt file
Is Transient: Yes ✅
Succeeded Jobs
- ✅ pre_activation (4s)
- ✅ activation (4s)
- ✅ agent (1.4m) - Main job completed successfully
- ✅ create_issue (7s) - Issue created successfully
Investigation Findings
Agent Performance
- Turns: 16
- Tokens Used: 162,803
- Estimated Cost: $0.22
- Task: Review last 5 merged PRs and create summary issue
- Result: Task completed successfully ✅
Detection Job Analysis
- Started threat detection analysis
- Attempted to read /tmp/gh-aw/threat-detection/prompt.txt
- Hit Anthropic API overload before completing first read
- Duration: 2.7 minutes (includes wait time for API)
- 4 turns attempted before failure
Pattern Analysis
Pattern ID: ANTHROPIC_API_OVERLOADED (New Pattern)
- First Occurrence: This run (2025-10-29)
- Severity: Medium (transient, not code-related)
- Category: AI Engine - API Failure
- Flaky: Yes - depends on Anthropic API load
Related Patterns Found:
- Similar to OPENCODE_ANTHROPIC_API_ERROR (AI_APICallError) seen in run 18893290104
- Different from typical OpenCode "no safe-outputs" patterns
Recommended Actions
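🔴 High Priority
🟡 Medium Priority
- Make detection job non-blocking: Add continue-on-error: true to the detection job or make it optional
- Add rate limit handling: Implement monitoring and graceful handling of Anthropic API rate limits
🟢 Low Priority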
Prevention Strategies
Immediate Actions
- ✅ Pattern documented: Created /tmp/gh-aw/cache-memory/patterns/anthropic_api_overloaded.json
- ✅ Investigation saved: Created /tmp/gh-aw/cache-memory/investigations/2025-10-29-18907236182.json
- Next: Implement retry logic for detection job
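The retry step named above could look roughly like the following: a minimal sketch of exponential backoff with jitter. `TransientAPIError`, `call_with_retry`, and `flaky_detection` are hypothetical names for illustration; the real detection job would need its own way to recognize a retryable error.

```python
import random
import time

class TransientAPIError(Exception):
    """Hypothetical marker for retryable failures such as a 500 "Overloaded"."""

def call_with_retry(fn, max_attempts=4, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn(); on TransientAPIError, wait with exponential backoff plus jitter and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientAPIError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter spreads out retries from jobs that failed at the same time.
            sleep(delay + random.uniform(0, delay / 2))

# Simulated detection call that is overloaded twice, then succeeds.
calls = {"n": 0}

def flaky_detection():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientAPIError("Overloaded")
    return "detection complete"

waits = []  # record the delays instead of actually sleeping
result = call_with_retry(flaky_detection, sleep=waits.append)
print(result, "after", calls["n"], "attempts")
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; in a workflow the default `time.sleep` would apply.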
Long-term Improvements
- Consider scheduled workflow staggering to reduce concurrent API load
- Monitor Anthropic API status before running detection jobs
- Implement circuit breaker pattern for repeated API failures
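For the circuit breaker idea in the last item, a minimal sketch might track consecutive failures and skip API-dependent work for a cooldown period. The class below, along with its `threshold` and `cooldown` values, is an assumption for illustration rather than an existing component of the workflow.

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow calls again after `cooldown` seconds."""

    def __init__(self, threshold=3, cooldown=300.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: permit trial calls once the cooldown has elapsed;
        # a further failure re-opens the breaker.
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()

# Usage with a fake clock so the example runs instantly.
now = [0.0]
breaker = CircuitBreaker(threshold=3, cooldown=300.0, clock=lambda: now[0])
for _ in range(3):
    breaker.record_failure()
blocked = breaker.allow()      # breaker is open right after the third failure
now[0] = 301.0
recovered = breaker.allow()    # cooldown elapsed, a trial call is permitted
print(blocked, recovered)
```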
Historical Context
Similar Past Failures
- Run 18893290104 (2025-10-29): OpenCode with AI_APICallError calling Anthropic API
- Different error: 401 vs 500
- Different engine: OpenCode vs Claude
- Same root: Anthropic API issues
Pattern Comparison
This is the first occurrence of the "Overloaded" error specifically. Previous Anthropic API failures were:
- Authentication errors (401)
- Generic API call errors
- Not "Overloaded" capacity issues
Frequency
- Anthropic API failures: Rare but recurring (3 occurrences in past week across different engines)
- This specific error: First occurrence
Related PR Context
PR #2717: "docs: document zizmor URL links and verbose Docker command output"
- Merged: Just before workflow run
- Content: Documentation updates
- Relevance: ❌ Not related to failure
- Assessment: Failure was due to external Anthropic API overload, not code changes
Investigation Metadata
- Pattern: ANTHROPIC_API_OVERLOADED (New)
- Investigator: smoke-detector
- Investigation Run: 18907360485
- Investigation Files:
  - /tmp/gh-aw/cache-memory/investigations/2025-10-29-18907236182.json
  - /tmp/gh-aw/cache-memory/patterns/anthropic_api_overloaded.json
AI generated by Smoke Detector - Smoke Test Failure Investigator