
Infinite retry loop when StreamIdleTimeoutError occurs during tool input generation #12234

@dzianisv

Description


Summary

When a model attempts to generate a large tool input (e.g., writing a full page of content), the stream can stall and trigger a StreamIdleTimeoutError. This error is marked as retryable, causing an infinite loop where the model repeatedly attempts the same failing operation with exponential backoff delays.

User Experience: The UI shows "Preparing write..." indefinitely, with the agent stuck in a loop. The only way to exit is to manually abort (Escape key).

Detailed Timeline from Real Session

Session ses_3d5454748ffeA0QZlbOzaY4s4q on project VibeBrowserProductPage:

| Time     | Event                                 | Delay Since Last |
|----------|---------------------------------------|------------------|
| 02:50:05 | Stream started (claude-opus-4.5)      | -                |
| 02:51:10 | StreamIdleTimeoutError (60s timeout)  | 65s              |
| 02:51:12 | Retry #1 started                      | 2s backoff       |
| 02:52:16 | StreamIdleTimeoutError                | 64s              |
| 02:52:20 | Retry #2 started                      | 4s backoff       |
| 02:53:23 | StreamIdleTimeoutError                | 63s              |
| 02:53:31 | Retry #3 started                      | 8s backoff       |
| 02:54:34 | StreamIdleTimeoutError                | 63s              |
| 02:54:50 | Retry #4 started                      | 16s backoff      |
| 02:55:53 | StreamIdleTimeoutError                | 63s              |
| 02:56:23 | Retry #5 started                      | 30s backoff      |
| 02:57:27 | StreamIdleTimeoutError                | 64s              |
| 02:57:52 | User manually aborted                 | -                |

Total time stuck: ~8 minutes before user intervention (six ~63s idle timeouts plus 2+4+8+16+30 = 60s of cumulative backoff).

Raw Log Evidence

StreamIdleTimeoutError Sequence

ERROR 2026-02-05T02:51:10 +60107ms service=session.processor error=Stream idle timeout: no data received for 60000ms stack="StreamIdleTimeoutError: Stream idle timeout: no data received for 60000ms\n    at <anonymous> (src/session/processor.ts:44:20)" process
ERROR 2026-02-05T02:52:16 +60096ms service=session.processor error=Stream idle timeout: no data received for 60000ms stack="StreamIdleTimeoutError: Stream idle timeout: no data received for 60000ms\n    at <anonymous> (src/session/processor.ts:44:20)" process
ERROR 2026-02-05T02:53:23 +60218ms service=session.processor error=Stream idle timeout: no data received for 60000ms stack="StreamIdleTimeoutError: Stream idle timeout: no data received for 60000ms\n    at <anonymous> (src/session/processor.ts:44:20)" process
ERROR 2026-02-05T02:54:34 +60211ms service=session.processor error=Stream idle timeout: no data received for 60000ms stack="StreamIdleTimeoutError: Stream idle timeout: no data received for 60000ms\n    at <anonymous> (src/session/processor.ts:44:20)" process
ERROR 2026-02-05T02:55:53 +60214ms service=session.processor error=Stream idle timeout: no data received for 60000ms stack="StreamIdleTimeoutError: Stream idle timeout: no data received for 60000ms\n    at <anonymous> (src/session/processor.ts:44:20)" process
ERROR 2026-02-05T02:57:27 +60133ms service=session.processor error=Stream idle timeout: no data received for 60000ms stack="StreamIdleTimeoutError: Stream idle timeout: no data received for 60000ms\n    at <anonymous> (src/session/processor.ts:44:20)" process

Retry Pattern with Exponential Backoff

INFO  2026-02-05T02:50:05 service=llm modelID=claude-opus-4.5 sessionID=ses_3d5454748ffeA0QZlbOzaY4s4q stream
INFO  2026-02-05T02:51:12 +2002ms service=llm modelID=claude-opus-4.5 sessionID=ses_3d5454748ffeA0QZlbOzaY4s4q stream
INFO  2026-02-05T02:52:20 +4002ms service=llm modelID=claude-opus-4.5 sessionID=ses_3d5454748ffeA0QZlbOzaY4s4q stream
INFO  2026-02-05T02:53:31 +8003ms service=llm modelID=claude-opus-4.5 sessionID=ses_3d5454748ffeA0QZlbOzaY4s4q stream
INFO  2026-02-05T02:54:50 +16003ms service=llm modelID=claude-opus-4.5 sessionID=ses_3d5454748ffeA0QZlbOzaY4s4q stream
INFO  2026-02-05T02:56:23 +30002ms service=llm modelID=claude-opus-4.5 sessionID=ses_3d5454748ffeA0QZlbOzaY4s4q stream

Note the delays: 2s → 4s → 8s → 16s → 30s (capped at RETRY_MAX_DELAY_NO_HEADERS)
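
For reference, those delays match a textbook doubling schedule with a cap. A minimal sketch of the presumed calculation - the actual retry.ts logic and the real RETRY_MAX_DELAY_NO_HEADERS value may differ:

// Hypothetical reconstruction of the observed backoff schedule; the real
// implementation lives in retry.ts and may differ in detail.
const BASE_DELAY_MS = 1_000
const RETRY_MAX_DELAY_NO_HEADERS = 30_000 // assumed cap, matches the 30s plateau

function backoffDelay(attempt: number): number {
  // attempt 1 -> 2s, 2 -> 4s, 3 -> 8s, 4 -> 16s, 5+ -> capped at 30s
  return Math.min(BASE_DELAY_MS * 2 ** attempt, RETRY_MAX_DELAY_NO_HEADERS)
}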

Task Verification Shows 10 Empty Write Attempts

From the reflection/task verification system:

## Tools Used
write: {}
write: {}
write: {}
write: {}
write: {}
write: {}
write: {}
write: {}
write: {}
write: {}

## Agent's Response
You're right, I was overthinking. Let me just write the full page:

The Write tool was called 10 times with empty input {} because the stream died during the tool-input-start phase, before the tool's JSON input had been received and parsed.
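
For context, a streamed tool call is assembled from incremental events; the sketch below (event names and shapes are illustrative, not the actual provider SDK types) shows why the input never gets past {} when the stall happens early:

// Illustrative sketch of streamed tool-call assembly; event names and shapes
// are hypothetical, not the actual SDK API.
type ToolStreamEvent =
  | { type: "tool-input-start"; toolCallId: string; toolName: string }
  | { type: "tool-input-delta"; toolCallId: string; delta: string } // JSON fragment
  | { type: "tool-call"; toolCallId: string } // input complete, tool may execute

let inputJson = "" // accumulates JSON fragments for the pending tool call

function onToolEvent(event: ToolStreamEvent) {
  switch (event.type) {
    case "tool-input-start":
      inputJson = "" // part created in "pending" state with input {}
      break
    case "tool-input-delta":
      inputJson += event.delta // never arrives if the provider stalls here
      break
    case "tool-call": {
      const input = JSON.parse(inputJson) // only now is the input non-empty
      void input // hand-off to tool execution (elided)
      break
    }
  }
}
// If the idle timeout fires right after tool-input-start, cleanup marks the
// part as "error" with input {} - exactly what the verification log shows.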

Root Cause Analysis

The Retry Loop Flow

User asks: "continue working on full product page"
    ↓
Model starts generating Write tool call with large content
    ↓
Provider API stalls (rate limit, internal processing, or output token exhaustion)
    ↓
60 seconds pass with no stream data chunks
    ↓
StreamIdleTimeoutError thrown (processor.ts:44)
    ↓
Error converted to APIError with isRetryable: true (message-v2.ts:715)
    ↓
SessionRetry.retryable() returns a message string (retry.ts:62-64)
    ↓
processor.ts catches the error, increments attempt, waits with backoff (processor.ts:403-420)
    ↓
New LLM.stream() call starts from scratch
    ↓
Model sees previous failed attempt with "Tool execution aborted" error
    ↓
Model tries THE SAME approach again
    ↓
REPEAT FOREVER (or until user aborts)

Why Doom Loop Detection Doesn't Trigger

The existing doom loop detection in processor.ts:207-232 checks:

if (part.state.status === "running" && part.state.input) {
  // Track same tool + same input called 3 times
}

This fails because:

  1. The stream dies during the tool-input-start phase (before the tool-call event)
  2. The tool never reaches "running" status - it stays in "pending"
  3. The input is always {} (empty) - the JSON was never fully received
  4. Cleanup marks the tool as "error" with an empty input
  5. Each retry gets a different tool call ID
  6. Empty {} inputs are never detected as "same input"
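
Any dedupe key derived from these parts therefore never repeats. A hedged sketch of the mismatch - the key construction is assumed; the real check at processor.ts:207-232 may build it differently:

// Assumed shape of the doom-loop key; illustrative only.
function doomLoopKey(part: { tool: string; state: { status: string; input?: object } }): string | undefined {
  // Only parts that reached "running" with a parsed input are counted.
  if (part.state.status !== "running" || !part.state.input) return undefined
  return `${part.tool}:${JSON.stringify(part.state.input)}`
}

// Each failed attempt leaves a part like this, which the check skips entirely:
const failedAttempt = { tool: "write", state: { status: "error", input: {} } }
console.log(doomLoopKey(failedAttempt)) // undefined -> never counted toward the threshold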

Code Path Evidence

message-v2.ts:711-720 - StreamIdleTimeoutError marked as retryable:

case e instanceof StreamIdleTimeoutError:
  return new MessageV2.APIError(
    {
      message: e.message,
      isRetryable: true,  // <-- This causes infinite retries
      metadata: {
        timeoutMs: String(e.timeoutMs),
      },
    },
    { cause: e },
  ).toObject()

processor.ts:403-420 - Retry logic with no max attempts:

} catch (e: any) {
  log.error("process", { error: e, stack: JSON.stringify(e.stack) })
  const error = MessageV2.fromError(e, { providerID: input.model.providerID })
  const retry = SessionRetry.retryable(error)
  if (retry !== undefined) {
    attempt++
    const delay = SessionRetry.delay(attempt, error.name === "APIError" ? error : undefined)
    SessionStatus.set(input.sessionID, {
      type: "retry",
      attempt,
      message: retry,
      next: Date.now() + delay,
    })
    await SessionRetry.sleep(delay, input.abort).catch(() => {})
    continue  // <-- No max retry check for StreamIdleTimeoutError
  }
  // ...
}

processor.ts:442-458 - Cleanup marks incomplete tools as aborted:

for (const part of p) {
  if (part.type === "tool" && part.state.status !== "completed" && part.state.status !== "error") {
    await Session.updatePart({
      ...part,
      state: {
        ...part.state,
        status: "error",
        error: "Tool execution aborted",  // <-- Generic message, no actionable guidance
        // ...
      },
    })
  }
}

Environment

  • Provider: github-copilot
  • Model: claude-opus-4.5
  • Stream idle timeout: 60000ms (default)
  • Tool: write
  • User task: "continue working on full product page, target financial sectors"

Suggested Fixes

Option 1: Add max retries for StreamIdleTimeoutError (Recommended)

// In processor.ts
let idleTimeoutRetries = 0
const MAX_IDLE_TIMEOUT_RETRIES = 3

// In catch block, before retry logic:
if (e instanceof StreamIdleTimeoutError) {
  idleTimeoutRetries++
  if (idleTimeoutRetries >= MAX_IDLE_TIMEOUT_RETRIES) {
    input.assistantMessage.error = MessageV2.fromError(
      new Error(`Stream repeatedly timed out (${MAX_IDLE_TIMEOUT_RETRIES} attempts). The model may be trying to generate content that exceeds output limits. Try breaking the task into smaller pieces.`),
      { providerID: input.model.providerID }
    )
    Bus.publish(Session.Event.Error, {
      sessionID: input.assistantMessage.sessionID,
      error: input.assistantMessage.error,
    })
    break // Exit the retry loop
  }
}
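
One caveat with this sketch: idleTimeoutRetries should reset whenever a stream completes successfully, so that unrelated timeouts scattered across a long session don't accumulate toward the cap.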

Option 2: Detect repeated incomplete tool calls

Track tools that fail during input generation (empty inputs):

// In processor.ts
const incompleteToolAttempts: Record<string, number> = {}

// In cleanup section (line 442-458):
for (const part of p) {
  if (part.type === "tool" && part.state.status !== "completed" && part.state.status !== "error") {
    // Track incomplete tool attempts
    const inputSize = JSON.stringify(part.state.input || {}).length
    if (inputSize <= 2) { // Empty object "{}"
      incompleteToolAttempts[part.tool] = (incompleteToolAttempts[part.tool] || 0) + 1
      if (incompleteToolAttempts[part.tool] >= DOOM_LOOP_THRESHOLD) {
        blocked = true
        // Add guidance to error message
      }
    }
    // ... rest of cleanup
  }
}

Option 3: Better error message with actionable guidance

Instead of generic "Tool execution aborted":

error: `Tool execution aborted: stream timed out after ${timeoutMs/1000}s while generating tool input. This often happens when attempting to write very large content. Consider breaking the write operation into smaller chunks.`
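
Wired into the cleanup block from processor.ts:442-458, that might look like the following (sketch only; it assumes the idle-timeout duration can be threaded into the cleanup path, which it currently is not):

// Sketch: cleanup with an actionable error message. Assumes timeoutMs is
// available in scope; plumbing it from the StreamIdleTimeoutError is elided.
const timeoutMs = 60_000
await Session.updatePart({
  ...part,
  state: {
    ...part.state,
    status: "error",
    error:
      `Tool execution aborted: stream timed out after ${timeoutMs / 1000}s while generating tool input. ` +
      `This often happens when attempting to write very large content. ` +
      `Consider breaking the write operation into smaller chunks.`,
  },
})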

Option 4: Make StreamIdleTimeoutError non-retryable (simplest)

// In message-v2.ts:711-720
case e instanceof StreamIdleTimeoutError:
  return new MessageV2.APIError(
    {
      message: e.message,
      isRetryable: false,  // <-- Stop automatic retries
      metadata: {
        timeoutMs: String(e.timeoutMs),
      },
    },
    { cause: e },
  ).toObject()

This surfaces the error to the user immediately; they can then choose to retry or modify the request.

Related Files

  • packages/opencode/src/session/processor.ts - Stream processing, idle timeout, doom loop detection, cleanup
  • packages/opencode/src/session/message-v2.ts - StreamIdleTimeoutError class, error conversion, isRetryable flag
  • packages/opencode/src/session/retry.ts - Retry logic, backoff calculation
  • packages/opencode/src/session/prompt.ts - Main agentic loop

Additional Context

This issue can occur with any provider when:

  1. The model tries to generate very large tool inputs (like writing full files)
  2. The provider has internal rate limiting or processing delays
  3. The model hits output token limits during tool input generation
  4. Network issues cause intermittent stalls

The exponential backoff makes this particularly frustrating - after a few retries, the user is waiting 30+ seconds between each failed attempt, with no indication that the same error will keep occurring.

Labels

perf - Indicates a performance issue or need for optimization