
Infinite retry loop when StreamIdleTimeoutError occurs during tool input generation #12234

@dzianisv

Description


Summary

When a model attempts to generate a large tool input (e.g., writing a full page of content), the stream can stall and trigger a StreamIdleTimeoutError. This error is marked as retryable, causing an infinite loop where the model repeatedly attempts the same failing operation with exponential backoff delays.

User Experience: The UI shows "Preparing write..." indefinitely, with the agent stuck in a loop. The only way to exit is to manually abort (Escape key).

Detailed Timeline from Real Session

Session ses_3d5454748ffeA0QZlbOzaY4s4q on project VibeBrowserProductPage:

| Time     | Event                                 | Delay Since Last |
|----------|---------------------------------------|------------------|
| 02:50:05 | Stream started (claude-opus-4.5)      | -                |
| 02:51:10 | StreamIdleTimeoutError (60s timeout)  | 65s              |
| 02:51:12 | Retry #1 started                      | 2s backoff       |
| 02:52:16 | StreamIdleTimeoutError                | 64s              |
| 02:52:20 | Retry #2 started                      | 4s backoff       |
| 02:53:23 | StreamIdleTimeoutError                | 63s              |
| 02:53:31 | Retry #3 started                      | 8s backoff       |
| 02:54:34 | StreamIdleTimeoutError                | 63s              |
| 02:54:50 | Retry #4 started                      | 16s backoff      |
| 02:55:53 | StreamIdleTimeoutError                | 63s              |
| 02:56:23 | Retry #5 started                      | 30s backoff      |
| 02:57:27 | StreamIdleTimeoutError                | 64s              |
| 02:57:52 | User manually aborted                 | -                |

Total time stuck: ~8 minutes before user intervention (six ~63s idle timeouts plus 2+4+8+16+30 = 60s of cumulative backoff).

Raw Log Evidence

StreamIdleTimeoutError Sequence

ERROR 2026-02-05T02:51:10 +60107ms service=session.processor error=Stream idle timeout: no data received for 60000ms stack="StreamIdleTimeoutError: Stream idle timeout: no data received for 60000ms\n    at <anonymous> (src/session/processor.ts:44:20)" process
ERROR 2026-02-05T02:52:16 +60096ms service=session.processor error=Stream idle timeout: no data received for 60000ms stack="StreamIdleTimeoutError: Stream idle timeout: no data received for 60000ms\n    at <anonymous> (src/session/processor.ts:44:20)" process
ERROR 2026-02-05T02:53:23 +60218ms service=session.processor error=Stream idle timeout: no data received for 60000ms stack="StreamIdleTimeoutError: Stream idle timeout: no data received for 60000ms\n    at <anonymous> (src/session/processor.ts:44:20)" process
ERROR 2026-02-05T02:54:34 +60211ms service=session.processor error=Stream idle timeout: no data received for 60000ms stack="StreamIdleTimeoutError: Stream idle timeout: no data received for 60000ms\n    at <anonymous> (src/session/processor.ts:44:20)" process
ERROR 2026-02-05T02:55:53 +60214ms service=session.processor error=Stream idle timeout: no data received for 60000ms stack="StreamIdleTimeoutError: Stream idle timeout: no data received for 60000ms\n    at <anonymous> (src/session/processor.ts:44:20)" process
ERROR 2026-02-05T02:57:27 +60133ms service=session.processor error=Stream idle timeout: no data received for 60000ms stack="StreamIdleTimeoutError: Stream idle timeout: no data received for 60000ms\n    at <anonymous> (src/session/processor.ts:44:20)" process

Retry Pattern with Exponential Backoff

INFO  2026-02-05T02:50:05 service=llm modelID=claude-opus-4.5 sessionID=ses_3d5454748ffeA0QZlbOzaY4s4q stream
INFO  2026-02-05T02:51:12 +2002ms service=llm modelID=claude-opus-4.5 sessionID=ses_3d5454748ffeA0QZlbOzaY4s4q stream
INFO  2026-02-05T02:52:20 +4002ms service=llm modelID=claude-opus-4.5 sessionID=ses_3d5454748ffeA0QZlbOzaY4s4q stream
INFO  2026-02-05T02:53:31 +8003ms service=llm modelID=claude-opus-4.5 sessionID=ses_3d5454748ffeA0QZlbOzaY4s4q stream
INFO  2026-02-05T02:54:50 +16003ms service=llm modelID=claude-opus-4.5 sessionID=ses_3d5454748ffeA0QZlbOzaY4s4q stream
INFO  2026-02-05T02:56:23 +30002ms service=llm modelID=claude-opus-4.5 sessionID=ses_3d5454748ffeA0QZlbOzaY4s4q stream

Note the delays: 2s → 4s → 8s → 16s → 30s (capped at RETRY_MAX_DELAY_NO_HEADERS)
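
For reference, those delays match a textbook doubling schedule with a cap. A minimal sketch of the presumed calculation - the actual retry.ts logic and the real RETRY_MAX_DELAY_NO_HEADERS value may differ:

// Hypothetical reconstruction of the observed backoff schedule; the real
// implementation lives in retry.ts and may differ in detail.
const BASE_DELAY_MS = 1_000
const RETRY_MAX_DELAY_NO_HEADERS = 30_000 // assumed cap, matches the 30s plateau

function backoffDelay(attempt: number): number {
  // attempt 1 -> 2s, 2 -> 4s, 3 -> 8s, 4 -> 16s, 5+ -> capped at 30s
  return Math.min(BASE_DELAY_MS * 2 ** attempt, RETRY_MAX_DELAY_NO_HEADERS)
}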

Task Verification Shows 10 Empty Write Attempts

From the reflection/task verification system:

## Tools Used
write: {}
write: {}
write: {}
write: {}
write: {}
write: {}
write: {}
write: {}
write: {}
write: {}

## Agent's Response
You're right, I was overthinking. Let me just write the full page:

The Write tool was called 10 times with empty input {} because the stream died during the tool-input-start phase, before the tool's JSON input had been received and parsed.
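
For context, a streamed tool call is assembled from incremental events; the sketch below (event names and shapes are illustrative, not the actual provider SDK types) shows why the input never gets past {} when the stall happens early:

// Illustrative sketch of streamed tool-call assembly; event names and shapes
// are hypothetical, not the actual SDK API.
type ToolStreamEvent =
  | { type: "tool-input-start"; toolCallId: string; toolName: string }
  | { type: "tool-input-delta"; toolCallId: string; delta: string } // JSON fragment
  | { type: "tool-call"; toolCallId: string } // input complete, tool may execute

let inputJson = "" // accumulates JSON fragments for the pending tool call

function onToolEvent(event: ToolStreamEvent) {
  switch (event.type) {
    case "tool-input-start":
      inputJson = "" // part created in "pending" state with input {}
      break
    case "tool-input-delta":
      inputJson += event.delta // never arrives if the provider stalls here
      break
    case "tool-call": {
      const input = JSON.parse(inputJson) // only now is the input non-empty
      void input // hand-off to tool execution (elided)
      break
    }
  }
}
// If the idle timeout fires right after tool-input-start, cleanup marks the
// part as "error" with input {} - exactly what the verification log shows.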

Root Cause Analysis

The Retry Loop Flow

User asks: "continue working on full product page"
    ↓
Model starts generating Write tool call with large content
    ↓
Provider API stalls (rate limit, internal processing, or output token exhaustion)
    ↓
60 seconds pass with no stream data chunks
    ↓
StreamIdleTimeoutError thrown (processor.ts:44)
    ↓
Error converted to APIError with isRetryable: true (message-v2.ts:715)
    ↓
SessionRetry.retryable() returns a message string (retry.ts:62-64)
    ↓
processor.ts catches the error, increments attempt, waits with backoff (processor.ts:403-420)
    ↓
New LLM.stream() call starts from scratch
    ↓
Model sees previous failed attempt with "Tool execution aborted" error
    ↓
Model tries THE SAME approach again
    ↓
REPEAT FOREVER (or until user aborts)

Why Doom Loop Detection Doesn't Trigger

The existing doom loop detection in processor.ts:207-232 checks:

if (part.state.status === "running" && part.state.input) {
  // Track same tool + same input called 3 times
}

This fails because:

  1. The stream dies during the tool-input-start phase (before the tool-call event)
  2. The tool never reaches "running" status - it stays in "pending"
  3. The input is always {} (empty) - the JSON was never fully received
  4. Cleanup marks the tool as "error" with an empty input
  5. Each retry gets a different tool call ID
  6. Empty {} inputs are never detected as "same input"
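
Any dedupe key derived from these parts therefore never repeats. A hedged sketch of the mismatch - the key construction is assumed; the real check at processor.ts:207-232 may build it differently:

// Assumed shape of the doom-loop key; illustrative only.
function doomLoopKey(part: { tool: string; state: { status: string; input?: object } }): string | undefined {
  // Only parts that reached "running" with a parsed input are counted.
  if (part.state.status !== "running" || !part.state.input) return undefined
  return `${part.tool}:${JSON.stringify(part.state.input)}`
}

// Each failed attempt leaves a part like this, which the check skips entirely:
const failedAttempt = { tool: "write", state: { status: "error", input: {} } }
console.log(doomLoopKey(failedAttempt)) // undefined -> never counted toward the threshold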

Code Path Evidence

message-v2.ts:711-720 - StreamIdleTimeoutError marked as retryable:

case e instanceof StreamIdleTimeoutError:
  return new MessageV2.APIError(
    {
      message: e.message,
      isRetryable: true,  // <-- This causes infinite retries
      metadata: {
        timeoutMs: String(e.timeoutMs),
      },
    },
    { cause: e },
  ).toObject()

processor.ts:403-420 - Retry logic with no max attempts:

} catch (e: any) {
  log.error("process", { error: e, stack: JSON.stringify(e.stack) })
  const error = MessageV2.fromError(e, { providerID: input.model.providerID })
  const retry = SessionRetry.retryable(error)
  if (retry !== undefined) {
    attempt++
    const delay = SessionRetry.delay(attempt, error.name === "APIError" ? error : undefined)
    SessionStatus.set(input.sessionID, {
      type: "retry",
      attempt,
      message: retry,
      next: Date.now() + delay,
    })
    await SessionRetry.sleep(delay, input.abort).catch(() => {})
    continue  // <-- No max retry check for StreamIdleTimeoutError
  }
  // ...
}

processor.ts:442-458 - Cleanup marks incomplete tools as aborted:

for (const part of p) {
  if (part.type === "tool" && part.state.status !== "completed" && part.state.status !== "error") {
    await Session.updatePart({
      ...part,
      state: {
        ...part.state,
        status: "error",
        error: "Tool execution aborted",  // <-- Generic message, no actionable guidance
        // ...
      },
    })
  }
}

Environment

  • Provider: github-copilot
  • Model: claude-opus-4.5
  • Stream idle timeout: 60000ms (default)
  • Tool: write
  • User task: "continue working on full product page, target financial sectors"

Suggested Fixes

Option 1: Add max retries for StreamIdleTimeoutError (Recommended)

// In processor.ts
let idleTimeoutRetries = 0
const MAX_IDLE_TIMEOUT_RETRIES = 3

// In catch block, before retry logic:
if (e instanceof StreamIdleTimeoutError) {
  idleTimeoutRetries++
  if (idleTimeoutRetries >= MAX_IDLE_TIMEOUT_RETRIES) {
    input.assistantMessage.error = MessageV2.fromError(
      new Error(`Stream repeatedly timed out (${MAX_IDLE_TIMEOUT_RETRIES} attempts). The model may be trying to generate content that exceeds output limits. Try breaking the task into smaller pieces.`),
      { providerID: input.model.providerID }
    )
    Bus.publish(Session.Event.Error, {
      sessionID: input.assistantMessage.sessionID,
      error: input.assistantMessage.error,
    })
    break // Exit the retry loop
  }
}
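
One caveat with this sketch: idleTimeoutRetries should reset whenever a stream completes successfully, so that unrelated timeouts scattered across a long session don't accumulate toward the cap.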

Option 2: Detect repeated incomplete tool calls

Track tools that fail during input generation (empty inputs):

// In processor.ts
const incompleteToolAttempts: Record<string, number> = {}

// In cleanup section (line 442-458):
for (const part of p) {
  if (part.type === "tool" && part.state.status !== "completed" && part.state.status !== "error") {
    // Track incomplete tool attempts
    const inputSize = JSON.stringify(part.state.input || {}).length
    if (inputSize <= 2) { // Empty object "{}"
      incompleteToolAttempts[part.tool] = (incompleteToolAttempts[part.tool] || 0) + 1
      if (incompleteToolAttempts[part.tool] >= DOOM_LOOP_THRESHOLD) {
        blocked = true
        // Add guidance to error message
      }
    }
    // ... rest of cleanup
  }
}

Option 3: Better error message with actionable guidance

Instead of generic "Tool execution aborted":

error: `Tool execution aborted: stream timed out after ${timeoutMs/1000}s while generating tool input. This often happens when attempting to write very large content. Consider breaking the write operation into smaller chunks.`
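
Wired into the cleanup block from processor.ts:442-458, that might look like the following (sketch only; it assumes the idle-timeout duration can be threaded into the cleanup path, which it currently is not):

// Sketch: cleanup with an actionable error message. Assumes timeoutMs is
// available in scope; plumbing it from the StreamIdleTimeoutError is elided.
const timeoutMs = 60_000
await Session.updatePart({
  ...part,
  state: {
    ...part.state,
    status: "error",
    error:
      `Tool execution aborted: stream timed out after ${timeoutMs / 1000}s while generating tool input. ` +
      `This often happens when attempting to write very large content. ` +
      `Consider breaking the write operation into smaller chunks.`,
  },
})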

Option 4: Make StreamIdleTimeoutError non-retryable (simplest)

// In message-v2.ts:711-720
case e instanceof StreamIdleTimeoutError:
  return new MessageV2.APIError(
    {
      message: e.message,
      isRetryable: false,  // <-- Stop automatic retries
      metadata: {
        timeoutMs: String(e.timeoutMs),
      },
    },
    { cause: e },
  ).toObject()

This surfaces the error to the user immediately; they can then choose to retry or modify the request.

Related Files

  • packages/opencode/src/session/processor.ts - Stream processing, idle timeout, doom loop detection, cleanup
  • packages/opencode/src/session/message-v2.ts - StreamIdleTimeoutError class, error conversion, isRetryable flag
  • packages/opencode/src/session/retry.ts - Retry logic, backoff calculation
  • packages/opencode/src/session/prompt.ts - Main agentic loop

Additional Context

This issue can occur with any provider when:

  1. The model tries to generate very large tool inputs (like writing full files)
  2. The provider has internal rate limiting or processing delays
  3. The model hits output token limits during tool input generation
  4. Network issues cause intermittent stalls

The exponential backoff makes this particularly frustrating - after a few retries, the user is waiting 30+ seconds between each failed attempt, with no indication that the same error will keep occurring.

Labels

perf - Indicates a performance issue or need for optimization