[smoke-detector] 🔍 Smoke Test Investigation - Run #63: Anthropic API Overload in Detection Job

# 🔍 Smoke Test Investigation - Run #63

## Summary
The **Smoke Claude** workflow failed in the `detection` job because of a transient **Anthropic API 500 error ("Overloaded")**. The main `agent` job completed successfully and created the expected issue output, but the secondary threat detection analysis failed while the API was temporarily overloaded.

## Failure Details
- **Run**: [18907236182]((redacted))
- **Workflow**: Smoke Claude
- **Commit**: 8f8cfbd6faa2a0565176326b3a51a5351c024932
- **Trigger**: schedule
- **Duration**: 4.7 minutes
- **Date**: 2025-10-29 12:09:23 UTC

## Root Cause Analysis

### Primary Issue
The `detection` job failed with an **Anthropic API 500 error** indicating the service was overloaded:

```
API Error: 500 {"type":"error","error":{"type":"api_error","message":"Overloaded"},"request_id":null}
```

This occurred when the detection agent attempted to read `/tmp/gh-aw/threat-detection/prompt.txt` using the Read tool. The log shows "Streaming fallback triggered" immediately before the API call returned the overload error.
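
To make the transient-versus-permanent distinction concrete, here is a minimal sketch that parses a log line of the shape shown above and flags it as transient. The function name, the return shape, and the rule "5xx or an `Overloaded` message means transient" are assumptions for illustration, not part of the workflow.

```python
import json
import re
from typing import Optional

# Matches log lines of the shape "API Error: <status> <json body>",
# as seen in the detection job output above.
_API_ERROR_RE = re.compile(r"API Error:\s*(\d{3})\s*(\{.*\})")


def classify_api_error(log_line: str) -> Optional[dict]:
    """Hypothetical helper: extract status and message, flag transient errors."""
    match = _API_ERROR_RE.search(log_line)
    if match is None:
        return None
    status = int(match.group(1))
    try:
        payload = json.loads(match.group(2))
    except json.JSONDecodeError:
        return None
    error = payload.get("error") or {}
    message = error.get("message", "")
    # Assumption: server-side (5xx) errors and "Overloaded" messages are retryable.
    transient = status >= 500 or message == "Overloaded"
    return {
        "status": status,
        "type": error.get("type"),
        "message": message,
        "transient": transient,
    }
```

Under this rule, the error from this run is classified as transient, while a 401 authentication error would not be.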

### Why This Matters
- The main `agent` job **succeeded** (created issue output, 16 turns, 162k tokens, ~$0.22)
- The `create_issue` job **succeeded** 
- Only the `detection` job failed due to this transient API error
- This is a **new failure pattern** - first occurrence of "Overloaded" error in our investigation history

### Is This a Real Problem?
This is a **transient infrastructure issue** with the Anthropic API, not a code bug. However, it represents a workflow robustness issue:
1. The detection job is a secondary analysis step
2. Its failure caused the entire smoke test to fail
3. This could mask real issues in future runs

## Failed Jobs and Errors

### Failed: detection (2.7m)
**Error Type**: `api_error` - Anthropic API Overloaded  
**HTTP Status**: 500  
**Message**: "Overloaded"  
**Context**: Failed while attempting to read threat detection prompt file  
**Is Transient**: Yes ✅

## Succeeded Jobs
- ✅ pre_activation (4s)
- ✅ activation (4s)  
- ✅ agent (1.4m) - **Main job completed successfully**
- ✅ create_issue (7s) - **Issue created successfully**

## Investigation Findings

### Agent Performance
- **Turns**: 16
- **Tokens Used**: 162,803
- **Estimated Cost**: $0.22
- **Task**: Review last 5 merged PRs and create summary issue
- **Result**: Task completed successfully ✅

### Detection Job Analysis
- Started threat detection analysis
- Attempted to read `/tmp/gh-aw/threat-detection/prompt.txt`
- Hit Anthropic API overload before completing first read
- Duration: 2.7 minutes (includes wait time for API)
- 4 turns attempted before failure

### Pattern Analysis
**Pattern ID**: `ANTHROPIC_API_OVERLOADED` (New Pattern)
- **First Occurrence**: This run (2025-10-29)
- **Severity**: Medium (transient, not code-related)
- **Category**: AI Engine - API Failure
- **Flaky**: Yes - depends on Anthropic API load

**Related Patterns Found**:
- Similar to `OPENCODE_ANTHROPIC_API_ERROR` (AI_APICallError) seen in run 18893290104
- Different from typical OpenCode "no safe-outputs" patterns

## Recommended Actions

### 🔴 High Priority
- [ ] **Add retry logic to detection job**: Implement exponential backoff retry for Anthropic API 500 errors
  - Initial delay: 30 seconds
  - Max retries: 3
  - Delay doubles on each retry (30s, 60s, 120s)
  - **Prevents**: Transient API failures causing workflow failure
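
The retry policy above could be sketched roughly as follows. This is an illustration under stated assumptions, not the workflow's implementation: `TransientApiError` and `call_with_retry` are hypothetical names, and the `sleep` parameter exists only so the wait can be replaced in tests.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class TransientApiError(Exception):
    """Hypothetical exception standing in for a retryable API 500 error."""


def backoff_delays(initial_delay: float = 30.0, max_retries: int = 3) -> List[float]:
    """Return the wait before each retry; the delay doubles every time."""
    return [initial_delay * (2 ** attempt) for attempt in range(max_retries)]


def call_with_retry(
    operation: Callable[[], T],
    initial_delay: float = 30.0,
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on TransientApiError with exponential backoff."""
    delays = backoff_delays(initial_delay, max_retries)
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientApiError:
            # Out of retries: surface the original error to the caller.
            if attempt == max_retries:
                raise
            sleep(delays[attempt])
    raise RuntimeError("unreachable")  # loop always returns or raises
```

With the defaults, a call that keeps failing waits 30, 60, and 120 seconds (3.5 minutes in total) before giving up, so the job timeout would need to allow for that.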

### 🟡 Medium Priority  
- [ ] **Make detection job non-blocking**: Add `continue-on-error: true` to detection job or make it optional
  - The detection job is a secondary analysis
  - Main agent success should be the primary signal
  - **Prevents**: Spurious smoke test failures (false positives) that can mask real issues
  
- [ ] **Add overload and rate limit handling**: Implement monitoring and graceful handling of Anthropic API overload and rate limit errors
  - Track frequency of overload errors
  - Identify peak load times (scheduled runs may create spikes)
  - **Prevents**: Cascading failures during high load periods
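
As a rough sketch of the frequency tracking suggested above, overload errors could be bucketed by UTC hour to spot peak load times. The function name is hypothetical, and the timestamp format is assumed to match the `Date` field in this report (`2025-10-29 12:09:23 UTC`).

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Iterable


def overload_counts_by_hour(timestamps: Iterable[str]) -> Counter:
    """Count overload errors per UTC hour of day (0-23)."""
    counts: Counter = Counter()
    for ts in timestamps:
        # Assumed format: "YYYY-MM-DD HH:MM:SS UTC", as used in this report.
        parsed = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S UTC").replace(
            tzinfo=timezone.utc
        )
        counts[parsed.hour] += 1
    return counts
```

Calling `overload_counts_by_hour(...).most_common(3)` on the recorded failure times would then list the busiest hours, which could inform how scheduled runs are spaced.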

### 🟢 Low Priority
- [ ] **Implement fallback detection**: Consider simpler static analysis when API unavailable
  - **Prevents**: Complete detection failure

## Prevention Strategies

### Immediate Actions
1. ✅ **Pattern documented**: Created `/tmp/gh-aw/cache-memory/patterns/anthropic_api_overloaded.json`
2. ✅ **Investigation saved**: Created `/tmp/gh-aw/cache-memory/investigations/2025-10-29-18907236182.json`
3. **Next**: Implement retry logic for detection job

### Long-term Improvements
- Consider scheduled workflow staggering to reduce concurrent API load
- Monitor Anthropic API status before running detection jobs
- Implement circuit breaker pattern for repeated API failures
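
The circuit breaker idea can be illustrated with a minimal sketch: after a number of consecutive failures, calls are skipped until a cool-down has elapsed. The class name, thresholds, and the injectable `clock` are all assumptions chosen for illustration.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Minimal circuit breaker: opens after N consecutive failures."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        """Return True if a call may be attempted right now."""
        if self._opened_at is None:
            return True  # closed: calls flow normally
        # Open: allow a trial call only once the cool-down has elapsed.
        return self._clock() - self._opened_at >= self.reset_timeout

    def record_success(self) -> None:
        """Close the circuit and reset the failure count."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        """Count a failure; open (or re-open) the circuit at the threshold."""
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

A caller would check `allow_request()` before each API call and report the outcome with `record_success()` or `record_failure()`. Because each workflow run is a separate process, the breaker state would have to be persisted somewhere shared between runs to be useful across them.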

## Historical Context

### Similar Past Failures
- **Run 18893290104** (2025-10-29): OpenCode with `AI_APICallError` calling Anthropic API
  - Different error: 401 vs 500
  - Different engine: OpenCode vs Claude
  - Common factor: both failures occurred in calls to the Anthropic API

### Pattern Comparison
This is the **first occurrence** of the "Overloaded" error specifically. Previous Anthropic API failures were:
- Authentication errors (401)
- Generic API call errors
- Not "Overloaded" capacity issues

### Frequency
- **Anthropic API failures**: Rare but recurring (3 occurrences in past week across different engines)
- **This specific error**: First occurrence

## Related PR Context

**PR #2717**: "docs: document zizmor URL links and verbose Docker command output"
- **Merged**: Just before workflow run
- **Content**: Documentation updates
- **Relevance**: ❌ Not related to failure
- **Assessment**: Failure was due to external Anthropic API overload, not code changes

---

**Investigation Metadata**
- Pattern: `ANTHROPIC_API_OVERLOADED` (New)
- Investigator: smoke-detector
- Investigation Run: [18907360485]((redacted))
- Investigation Files:
  - `/tmp/gh-aw/cache-memory/investigations/2025-10-29-18907236182.json`
  - `/tmp/gh-aw/cache-memory/patterns/anthropic_api_overloaded.json`




> AI generated by [Smoke Detector - Smoke Test Failure Investigator](https://github.com/githubnext/gh-aw/actions/runs/18907360485)
