Idle recovery: cumulative counter fails long-running agents — add per-workflow config and reset-on-progress #5

@e-s-gh

Problem

The idle recovery mechanism in _wait_with_idle_detection() uses a cumulative counter that never resets within a single agent session. For long-running agents (10+ minutes with hundreds of tool calls), intermittent SDK stalls can exhaust the recovery budget even though each individual stall recovers successfully.

The counter never resets after successful recovery

In copilot.py L922-L935:

recovery_attempts = 0  # Never resets within the session

while True:
    try:
        await asyncio.wait_for(done.wait(), timeout=...)
        return
    except TimeoutError as e:
        recovery_attempts += 1  # Cumulative — never reset

This means max_recovery_attempts (currently defaulting to 3) represents max total stalls over the entire session, not max consecutive stalls without progress.

Example scenario

A reviewer agent processing a large repo runs for ~25 minutes:

  1. Minute 5: SDK stalls during a view call → idle recovery 1/3 → resumes fine
  2. Minute 12: SDK stalls during a grep call → idle recovery 2/3 → resumes fine
  3. Minute 20: SDK stalls during a powershell call → idle recovery 3/3 → fails permanently

The agent was making steady progress between each stall. Each recovery worked. But because the counter is cumulative, the third stall — which would have recovered just like the first two — kills the session.
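The difference between the two counting rules can be sketched as a small simulation. This is illustrative only: `first_fatal_stall` is a hypothetical helper, not code from copilot.py, and it assumes the session fails once the counter reaches `max_attempts` (matching the 3/3 scenario above; the actual comparison in copilot.py is not shown in this issue).

```python
from typing import List, Optional


def first_fatal_stall(
    progress_before_stall: List[bool],
    max_attempts: int,
    reset_on_progress: bool,
) -> Optional[int]:
    """Return the 1-based index of the stall that kills the session, or None.

    progress_before_stall[i] is True when the agent did useful work between
    the previous stall (or session start) and stall i.
    ASSUMPTION: the session fails when the counter reaches max_attempts.
    """
    attempts = 0
    for index, made_progress in enumerate(progress_before_stall, start=1):
        if reset_on_progress and made_progress:
            attempts = 0  # progress since the last stall: start counting again
        attempts += 1
        if attempts >= max_attempts:
            return index
    return None


# Three stalls, each preceded by real progress (the reviewer scenario).
scenario = [True, True, True]
print(first_fatal_stall(scenario, 3, reset_on_progress=False))  # 3
print(first_fatal_stall(scenario, 3, reset_on_progress=True))   # None

# A genuinely stuck session (no progress after the first stall) is still caught.
print(first_fatal_stall([True, False, False], 3, reset_on_progress=True))  # 3
```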

No workflow-level configuration

IdleRecoveryConfig is a provider-level dataclass, but it's never exposed through the workflow YAML schema. The factory creates CopilotProvider without passing any idle recovery config, and RuntimeConfig has no idle recovery fields. Workflows can't tune this for their use case.

Proposed Solution

1. Add idle_recovery to RuntimeConfig in the workflow schema

Add an IdleRecoveryDef Pydantic model to schema.py, mirroring the provider-level IdleRecoveryConfig dataclass, and wire it into RuntimeConfig:

workflow:
  runtime:
    provider: copilot
    idle_recovery:
      timeout_seconds: 300       # Default: 300 (5 min)
      max_attempts: 5            # Default: 3
      reset_on_progress: true    # Default: true

This exposes the existing IdleRecoveryConfig knobs through the YAML without breaking any defaults.

2. Reset counter on resumed activity (key behavior change)

Modify _wait_with_idle_detection() to track whether the agent made progress (tool calls, events) between recovery attempts. If it did, reset the counter:

recovery_attempts = 0
last_recovery_activity_count = 0

while True:
    try:
        await asyncio.wait_for(done.wait(), timeout=idle_timeout_seconds)
        return
    except TimeoutError as e:
        current_activity = get_activity_count(last_activity_ref)

        # Only reset when the workflow enables reset_on_progress
        if reset_on_progress and current_activity > last_recovery_activity_count:
            # Agent made progress since last recovery: reset counter
            recovery_attempts = 0
            last_recovery_activity_count = current_activity

        recovery_attempts += 1
        if recovery_attempts > max_recovery_attempts:
            raise ProviderError("Session appears stuck...")
        await session.send(recovery_prompt)
        done.clear()

With this change, max_recovery_attempts means "max consecutive stalls without progress", which is the condition worth detecting: a truly stuck session rather than intermittent pauses.

The last_activity_ref is already tracked and passed into the method — it just needs an activity counter added alongside the existing event type / tool name tracking.
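A minimal runnable sketch of the loop with an activity counter, under stated assumptions: `ActivityRef`, `SessionStuckError`, and the `send_recovery` callback are hypothetical stand-ins (for last_activity_ref, ProviderError, and session.send(recovery_prompt) respectively), and the real shapes in copilot.py may differ. The fake session below records one event after every recovery prompt, so it models an agent that keeps making progress between stalls.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class ActivityRef:
    """Hypothetical shape for last_activity_ref with a counter added."""
    event_type: str = ""
    tool_name: str = ""
    activity_count: int = 0  # incremented on every observed event

    def record(self, event_type: str, tool_name: str = "") -> None:
        self.event_type = event_type
        self.tool_name = tool_name
        self.activity_count += 1


class SessionStuckError(RuntimeError):
    """Stand-in for the ProviderError raised in the snippet above."""


async def wait_with_idle_detection(
    done: asyncio.Event,
    activity: ActivityRef,
    send_recovery: Callable[[], Awaitable[None]],
    idle_timeout_seconds: float,
    max_recovery_attempts: int,
    reset_on_progress: bool = True,
) -> None:
    recovery_attempts = 0
    last_recovery_activity_count = 0
    while True:
        try:
            await asyncio.wait_for(done.wait(), timeout=idle_timeout_seconds)
            return
        except asyncio.TimeoutError:
            if reset_on_progress and activity.activity_count > last_recovery_activity_count:
                recovery_attempts = 0  # progress since the last recovery
            last_recovery_activity_count = activity.activity_count
            recovery_attempts += 1
            if recovery_attempts > max_recovery_attempts:
                raise SessionStuckError("Session appears stuck")
            await send_recovery()
            done.clear()


async def demo(stalls_before_done: int, reset_on_progress: bool) -> str:
    done = asyncio.Event()
    activity = ActivityRef()
    recoveries = 0

    async def send_recovery() -> None:
        nonlocal recoveries
        recoveries += 1
        activity.record("tool_call", "grep")  # the agent resumes work
        if recoveries == stalls_before_done:
            # Finish on the next loop iteration, after done.clear() has run.
            asyncio.get_running_loop().call_soon(done.set)

    try:
        await wait_with_idle_detection(
            done, activity, send_recovery, 0.01, 3, reset_on_progress
        )
        return "completed"
    except SessionStuckError:
        return "stuck"


if __name__ == "__main__":
    print(asyncio.run(demo(5, reset_on_progress=True)))   # completed
    print(asyncio.run(demo(5, reset_on_progress=False)))  # stuck
```

One design point this sketch leaves open: if the agent's reply to the recovery prompt itself counts as activity, every stall will look like progress. The real counter may need to ignore events caused by the recovery prompt.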

3. Wire through the factory

Update create_provider() and ProviderFactory.create_provider() to pass the idle recovery config from RuntimeConfig to CopilotProvider.
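The wiring could look roughly like the sketch below. It uses stdlib dataclasses as stand-ins for the Pydantic models so it runs without dependencies. The field names on IdleRecoveryConfig, the CopilotProvider constructor argument, and the create_provider() signature are all assumptions, since none of them are shown in this issue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IdleRecoveryDef:
    """Schema-level settings (a Pydantic model in the proposal; a dataclass here)."""
    timeout_seconds: float = 300.0
    max_attempts: int = 3
    reset_on_progress: bool = True


@dataclass
class RuntimeConfig:
    provider: str = "copilot"
    # Optional, so existing YAML without idle_recovery still loads.
    idle_recovery: Optional[IdleRecoveryDef] = None


@dataclass
class IdleRecoveryConfig:
    """Provider-level config. ASSUMPTION: these field names are illustrative."""
    idle_timeout_seconds: float = 300.0
    max_recovery_attempts: int = 3
    reset_on_progress: bool = True


class CopilotProvider:
    """Stand-in provider that only stores its idle recovery config."""

    def __init__(self, idle_recovery: Optional[IdleRecoveryConfig] = None) -> None:
        self.idle_recovery = idle_recovery or IdleRecoveryConfig()


def create_provider(runtime: RuntimeConfig) -> CopilotProvider:
    """Translate the schema-level definition into the provider-level config."""
    if runtime.provider != "copilot":
        raise ValueError(f"unsupported provider in this sketch: {runtime.provider}")
    definition = runtime.idle_recovery
    if definition is None:
        return CopilotProvider()  # provider defaults apply
    return CopilotProvider(
        IdleRecoveryConfig(
            idle_timeout_seconds=definition.timeout_seconds,
            max_recovery_attempts=definition.max_attempts,
            reset_on_progress=definition.reset_on_progress,
        )
    )


tuned = create_provider(RuntimeConfig(idle_recovery=IdleRecoveryDef(max_attempts=5)))
print(tuned.idle_recovery.max_recovery_attempts)  # 5
```

Keeping the schema model and the provider dataclass separate means the YAML field names can stay short (max_attempts) without renaming anything inside the provider.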

Implementation scope

| File | Change |
| --- | --- |
| src/conductor/config/schema.py | Add IdleRecoveryDef model, add idle_recovery field to RuntimeConfig |
| src/conductor/providers/copilot.py | Add activity counter to last_activity_ref, reset recovery_attempts on progress in _wait_with_idle_detection() |
| src/conductor/providers/factory.py | Pass idle recovery config from RuntimeConfig to CopilotProvider |
| tests/test_providers/test_idle_recovery.py | Add tests for reset-on-progress behavior |

Backward compatibility

  • reset_on_progress defaults to true (this is the safer default — existing low max_attempts values become less likely to cause false failures)
  • All existing YAML workflows continue to work unchanged (new fields are optional with defaults)
  • Existing IdleRecoveryConfig dataclass stays as the internal representation
