## Problem
The idle recovery mechanism in _wait_with_idle_detection() uses a cumulative counter that never resets within a single agent session. For long-running agents (10+ minutes with hundreds of tool calls), intermittent SDK stalls can exhaust the recovery budget even though each individual stall recovers successfully.
### The counter never resets after successful recovery
In `copilot.py` L922-L935:

```python
recovery_attempts = 0  # Never resets within the session
while True:
    try:
        await asyncio.wait_for(done.wait(), timeout=...)
        return
    except TimeoutError:
        recovery_attempts += 1  # Cumulative — never reset
```
This means `max_recovery_attempts` (currently defaulting to 3) caps the total number of stalls over the entire session, not the number of consecutive stalls without progress.
### Example scenario
A reviewer agent processing a large repo runs for ~25 minutes:

- Minute 5: SDK stalls during a `view` call → idle recovery 1/3 → resumes fine
- Minute 12: SDK stalls during a `grep` call → idle recovery 2/3 → resumes fine
- Minute 20: SDK stalls during a `powershell` call → idle recovery 3/3 → fails permanently
The agent was making steady progress between each stall. Each recovery worked. But because the counter is cumulative, the third stall — which would have recovered just like the first two — kills the session.
### No workflow-level configuration
`IdleRecoveryConfig` is a provider-level dataclass, but it's never exposed through the workflow YAML schema. The factory creates `CopilotProvider` without passing any idle recovery config, and `RuntimeConfig` has no idle recovery fields. Workflows can't tune this for their use case.
## Proposed Solution
### 1. Add `idle_recovery` to `RuntimeConfig` in the workflow schema

Add an `IdleRecoveryDef` Pydantic model to `schema.py` (kept distinct from the internal `IdleRecoveryConfig` dataclass) and wire it into `RuntimeConfig`:
```yaml
workflow:
  runtime:
    provider: copilot
    idle_recovery:
      timeout_seconds: 300     # Default: 300 (5 min)
      max_attempts: 5          # Default: 3
      reset_on_progress: true  # Default: true
```
This exposes the existing `IdleRecoveryConfig` knobs through the YAML without breaking any defaults.
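A minimal sketch of the schema model, using the `IdleRecoveryDef` name from the implementation-scope table below; the field names and defaults are assumptions taken from the YAML example above, not the final definition:

```python
from pydantic import BaseModel


class IdleRecoveryDef(BaseModel):
    """Workflow-level idle recovery settings (hypothetical shape)."""

    timeout_seconds: float = 300.0   # seconds without events before recovery fires
    max_attempts: int = 3            # recoveries allowed before giving up
    reset_on_progress: bool = True   # reset the attempt counter when activity resumes
```

`RuntimeConfig` would then carry an optional `idle_recovery` field defaulting to `None`, so workflows that omit the block keep today's behavior.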
### 2. Reset counter on resumed activity (key behavior change)
Modify `_wait_with_idle_detection()` to track whether the agent made progress (tool calls, events) between recovery attempts. If it did, reset the counter:
```python
recovery_attempts = 0
last_recovery_activity_count = 0
while True:
    try:
        await asyncio.wait_for(done.wait(), timeout=idle_timeout_seconds)
        return
    except TimeoutError:
        current_activity = get_activity_count(last_activity_ref)
        if current_activity > last_recovery_activity_count:
            # Agent made progress since last recovery — reset counter
            recovery_attempts = 0
            last_recovery_activity_count = current_activity
        recovery_attempts += 1
        if recovery_attempts > max_recovery_attempts:
            raise ProviderError("Session appears stuck...")
        await session.send(recovery_prompt)
        done.clear()
```
This way `max_recovery_attempts` means "max consecutive stalls without progress" — which is what you actually want to detect (a truly stuck session vs intermittent pauses).
The `last_activity_ref` is already tracked and passed into the method — it just needs an activity counter added alongside the existing event type / tool name tracking.
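One possible shape for that counter — the actual `last_activity_ref` structure in `copilot.py` may differ, so this is a sketch under that assumption:

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ActivityRef:
    """Mutable activity tracker shared between the event loop and the idle watcher."""

    last_event_time: float = field(default_factory=time.monotonic)
    last_event_type: Optional[str] = None
    last_tool_name: Optional[str] = None
    event_count: int = 0  # new: monotonic counter, incremented on every event

    def record(self, event_type: str, tool_name: Optional[str] = None) -> None:
        self.last_event_time = time.monotonic()
        self.last_event_type = event_type
        self.last_tool_name = tool_name
        self.event_count += 1
```

The idle watcher then only needs to snapshot `event_count` at each recovery and compare against it on the next timeout.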
### 3. Wire through the factory
Update `create_provider()` and `ProviderFactory.create_provider()` to pass the idle recovery config from `RuntimeConfig` to `CopilotProvider`.
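The mapping itself is small. A sketch of the translation step — the internal `IdleRecoveryConfig` field names here are assumptions for illustration, not the actual dataclass definition:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IdleRecoveryConfig:
    """Internal provider-level config (field names assumed for illustration)."""

    idle_timeout_seconds: float = 300.0
    max_recovery_attempts: int = 3
    reset_on_progress: bool = True


def idle_config_from_runtime(idle_recovery: Optional[dict]) -> IdleRecoveryConfig:
    """Translate the optional YAML idle_recovery mapping into the internal dataclass."""
    if not idle_recovery:
        return IdleRecoveryConfig()  # absent block: keep current defaults
    return IdleRecoveryConfig(
        idle_timeout_seconds=idle_recovery.get("timeout_seconds", 300.0),
        max_recovery_attempts=idle_recovery.get("max_attempts", 3),
        reset_on_progress=idle_recovery.get("reset_on_progress", True),
    )
```

The factory would call this once and hand the result to `CopilotProvider`, so the provider never sees the YAML-level model.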
## Implementation scope
| File | Change |
|------|--------|
| `src/conductor/config/schema.py` | Add `IdleRecoveryDef` model, add `idle_recovery` field to `RuntimeConfig` |
| `src/conductor/providers/copilot.py` | Add activity counter to `last_activity_ref`, reset `recovery_attempts` on progress in `_wait_with_idle_detection()` |
| `src/conductor/providers/factory.py` | Pass idle recovery config from `RuntimeConfig` → `CopilotProvider` |
| `tests/test_providers/test_idle_recovery.py` | Add tests for reset-on-progress behavior |
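The reset-on-progress behavior is straightforward to unit-test if the counter bookkeeping is extracted into a pure helper. The `next_attempt_count` helper below is hypothetical (not existing code), but it shows the two cases the tests need to cover:

```python
def next_attempt_count(attempts: int, last_seen: int, current: int,
                       reset_on_progress: bool) -> tuple[int, int]:
    """Advance the recovery counter after one idle timeout fires.

    Returns (new_attempts, new_last_seen_activity_count).
    """
    if reset_on_progress and current > last_seen:
        attempts = 0        # progress since the last recovery, so start over
        last_seen = current
    return attempts + 1, last_seen


# Intermittent stalls with progress in between never accumulate:
attempts, seen = 0, 0
for activity in (10, 20, 30):   # activity counter keeps growing between stalls
    attempts, seen = next_attempt_count(attempts, seen, activity, True)
assert attempts == 1            # each stall is treated as a fresh attempt

# A truly stuck session (no new activity) accumulates as before:
attempts, seen = 0, 0
for _ in range(3):
    attempts, seen = next_attempt_count(attempts, seen, 0, True)
assert attempts == 3
```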
## Backward compatibility
- `reset_on_progress` defaults to `true` (this is the safer default — existing low `max_attempts` values become less likely to cause false failures)
- All existing YAML workflows continue to work unchanged (new fields are optional with defaults)
- Existing `IdleRecoveryConfig` dataclass stays as the internal representation