Idle recovery: cumulative counter fails long-running agents — add per-workflow config and reset-on-progress

## Problem

The idle recovery mechanism in [`_wait_with_idle_detection()`](https://github.com/microsoft/conductor/blob/87476b74dd1eb365bbb13633d04b71f45231ce00/src/conductor/providers/copilot.py#L896-L967) uses a **cumulative counter** that never resets within a single agent session. For long-running agents (10+ minutes with hundreds of tool calls), intermittent SDK stalls can exhaust the recovery budget even though each individual stall recovers successfully.

### The counter never resets after successful recovery

In [`copilot.py` L922-L935](https://github.com/microsoft/conductor/blob/87476b74dd1eb365bbb13633d04b71f45231ce00/src/conductor/providers/copilot.py#L922-L935):

```python
recovery_attempts = 0  # Never resets within the session

while True:
    try:
        await asyncio.wait_for(done.wait(), timeout=...)
        return
    except TimeoutError as e:
        recovery_attempts += 1  # Cumulative — never reset
```

This means `max_recovery_attempts` (currently [defaulting to 3](https://github.com/microsoft/conductor/blob/87476b74dd1eb365bbb13633d04b71f45231ce00/src/conductor/providers/copilot.py#L74)) caps the **total** number of stalls over the entire session, not the number of **consecutive** stalls without progress.

### Example scenario

A reviewer agent processing a large repo runs for ~25 minutes:
1. **Minute 5**: SDK stalls during a `view` call → idle recovery 1/3 → resumes fine
2. **Minute 12**: SDK stalls during a `grep` call → idle recovery 2/3 → resumes fine
3. **Minute 20**: SDK stalls during a `powershell` call → idle recovery 3/3 → **fails permanently**

The agent was making steady progress between each stall. Each recovery worked. But because the counter is cumulative, the third stall — which would have recovered just like the first two — kills the session.
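To make the difference concrete, the scenario can be modelled in a few lines of Python. This is only an illustrative sketch, not code from the repository: `fatal_stall` and its parameters are hypothetical, and it assumes (as the scenario describes) that the stall which brings the counter up to the limit is the fatal one.

```python
def fatal_stall(progress_before_stall, max_attempts=3, reset_on_progress=False):
    """Return the 1-based index of the stall that ends the session, or None.

    progress_before_stall[i] is True when the agent did useful work
    (tool calls, events) between the previous recovery and stall i.
    """
    attempts = 0
    for number, progressed in enumerate(progress_before_stall, start=1):
        if reset_on_progress and progressed:
            attempts = 0  # only consecutive stalls are counted
        attempts += 1
        # Assumption taken from the scenario: reaching the limit is fatal.
        if attempts >= max_attempts:
            return number
    return None


# The 25-minute reviewer run: three stalls, steady progress before each one.
reviewer_run = [True, True, True]
# A genuinely stuck session: progress before the first stall, none afterwards.
stuck_run = [True, False, False]

cumulative = fatal_stall(reviewer_run)                            # today's behavior
consecutive = fatal_stall(reviewer_run, reset_on_progress=True)   # proposed behavior
stuck = fatal_stall(stuck_run, reset_on_progress=True)
print(cumulative, consecutive, stuck)
```

Under these assumptions the cumulative counter ends the healthy run at its third stall, while counting only consecutive stalls lets it continue and still catches the stuck session.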

### No workflow-level configuration

[`IdleRecoveryConfig`](https://github.com/microsoft/conductor/blob/87476b74dd1eb365bbb13633d04b71f45231ce00/src/conductor/providers/copilot.py#L59-L80) is a provider-level dataclass, but it's never exposed through the workflow YAML schema. The [factory](https://github.com/microsoft/conductor/blob/87476b74dd1eb365bbb13633d04b71f45231ce00/src/conductor/providers/factory.py#L56-L60) creates `CopilotProvider` without passing any idle recovery config, and [`RuntimeConfig`](https://github.com/microsoft/conductor/blob/87476b74dd1eb365bbb13633d04b71f45231ce00/src/conductor/config/schema.py#L469) has no idle recovery fields. Workflows can't tune this for their use case.

## Proposed Solution

### 1. Add `idle_recovery` to `RuntimeConfig` in the workflow schema

Add an `IdleRecoveryDef` Pydantic model to [`schema.py`](https://github.com/microsoft/conductor/blob/87476b74dd1eb365bbb13633d04b71f45231ce00/src/conductor/config/schema.py) and wire it into `RuntimeConfig` as an optional `idle_recovery` field:

```yaml
workflow:
  runtime:
    provider: copilot
    idle_recovery:
      timeout_seconds: 300       # Default: 300 (5 min)
      max_attempts: 5            # Default: 3
      reset_on_progress: true    # Default: true
```

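This exposes the existing `IdleRecoveryConfig` knobs through the YAML (`max_attempts` corresponds to `max_recovery_attempts`) and adds the new `reset_on_progress` flag, without breaking any existing defaults.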
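The defaulting and validation behavior could look roughly like the sketch below. It uses a stdlib dataclass as a stand-in so it runs without dependencies; the actual change would be a Pydantic model in `schema.py`. The bounds checks, the rejection of unknown keys, and the `parse_idle_recovery` helper are assumptions for illustration, not part of the current schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdleRecoveryDef:
    """Stand-in for the proposed schema model (the real one would be Pydantic)."""

    timeout_seconds: float = 300.0
    max_attempts: int = 3
    reset_on_progress: bool = True

    def __post_init__(self):
        # Assumed constraints; the issue does not specify bounds.
        if self.timeout_seconds <= 0:
            raise ValueError("timeout_seconds must be positive")
        if self.max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")


def parse_idle_recovery(runtime: dict) -> IdleRecoveryDef:
    """Build the config from the `runtime` mapping of a parsed workflow."""
    raw = runtime.get("idle_recovery") or {}
    unknown = set(raw) - set(IdleRecoveryDef.__dataclass_fields__)
    if unknown:
        raise ValueError(f"unknown idle_recovery keys: {sorted(unknown)}")
    return IdleRecoveryDef(**raw)


# A workflow without the block gets the defaults; a partial block overrides
# only the keys it names.
legacy = parse_idle_recovery({"provider": "copilot"})
tuned = parse_idle_recovery(
    {"provider": "copilot", "idle_recovery": {"max_attempts": 5}}
)
print(legacy)
print(tuned)
```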

### 2. Reset counter on resumed activity (key behavior change)

Modify [`_wait_with_idle_detection()`](https://github.com/microsoft/conductor/blob/87476b74dd1eb365bbb13633d04b71f45231ce00/src/conductor/providers/copilot.py#L896-L967) to track whether the agent made progress (tool calls, events) between recovery attempts. If it did, reset the counter:

```python
recovery_attempts = 0
last_recovery_activity_count = 0

while True:
    try:
        await asyncio.wait_for(done.wait(), timeout=idle_timeout_seconds)
        return
    except TimeoutError as e:
        # get_activity_count() is a proposed helper that reads the new counter
        current_activity = get_activity_count(last_activity_ref)

        # Honor the reset_on_progress setting from the idle recovery config
        if reset_on_progress and current_activity > last_recovery_activity_count:
            # Agent made progress since last recovery: reset counter
            recovery_attempts = 0
            last_recovery_activity_count = current_activity

        recovery_attempts += 1
        if recovery_attempts > max_recovery_attempts:
            raise ProviderError("Session appears stuck...") from e
        await session.send(recovery_prompt)
        done.clear()
```

This way `max_recovery_attempts` means "max **consecutive** stalls without progress" — which is what you actually want to detect (a truly stuck session vs intermittent pauses).

The `last_activity_ref` is [already tracked](https://github.com/microsoft/conductor/blob/87476b74dd1eb365bbb13633d04b71f45231ce00/src/conductor/providers/copilot.py#L575) and passed into the method — it just needs an activity counter added alongside the existing event type / tool name tracking.
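A self-contained sketch of the loop with such a counter is shown below, so the behavior can be tried outside the provider. Everything here is hypothetical scaffolding: `ActivityTracker` stands in for `last_activity_ref` plus the new counter, `FakeSession` is a test double, and `RuntimeError` replaces `ProviderError`. One deliberate difference from the snippet above: the sketch clears `done` before sending the recovery prompt, so that a completion arriving during `send()` is not erased.

```python
import asyncio


class ActivityTracker:
    """Hypothetical stand-in for `last_activity_ref` plus an event counter."""

    def __init__(self):
        self.count = 0

    def record(self):
        self.count += 1


class FakeSession:
    """Test double: each recovery prompt triggers the next scripted action."""

    def __init__(self, tracker, done, script):
        self.tracker, self.done, self.script = tracker, done, list(script)
        self.prompts_sent = 0

    async def send(self, prompt):
        self.prompts_sent += 1
        action = self.script.pop(0) if self.script else "hang"
        if action == "progress":    # does some work, then stalls again
            self.tracker.record()
        elif action == "finish":    # does some work and completes
            self.tracker.record()
            self.done.set()


async def wait_with_idle_detection(session, done, tracker, timeout,
                                   max_attempts, reset_on_progress=True):
    attempts = 0
    seen = 0  # activity count observed at the previous stall
    while True:
        try:
            await asyncio.wait_for(done.wait(), timeout=timeout)
            return
        except asyncio.TimeoutError:  # alias of TimeoutError on Python 3.11+
            if reset_on_progress and tracker.count > seen:
                attempts = 0  # progress since the last stall: start over
            seen = tracker.count
            attempts += 1
            if attempts > max_attempts:
                raise RuntimeError("Session appears stuck")
            done.clear()
            await session.send("please continue")


async def run_scenario(script, reset_on_progress):
    done = asyncio.Event()
    tracker = ActivityTracker()
    session = FakeSession(tracker, done, script)
    try:
        await wait_with_idle_detection(session, done, tracker, timeout=0.01,
                                       max_attempts=2,
                                       reset_on_progress=reset_on_progress)
        return "completed", session.prompts_sent
    except RuntimeError:
        return "stuck", session.prompts_sent


# Five stalls in total, with progress after each recovery prompt.
flaky = ["progress"] * 4 + ["finish"]
with_reset = asyncio.run(run_scenario(flaky, True))
without_reset = asyncio.run(run_scenario(flaky, False))
always_hung = asyncio.run(run_scenario([], True))
print(with_reset, without_reset, always_hung)
```

With a budget of 2, the flaky-but-progressing session completes only when the counter resets, and a session that never produces activity is still reported as stuck after two prompts.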

### 3. Wire through the factory

Update [`create_provider()`](https://github.com/microsoft/conductor/blob/87476b74dd1eb365bbb13633d04b71f45231ce00/src/conductor/providers/factory.py#L56-L60) and `ProviderFactory.create_provider()` to pass the idle recovery config from `RuntimeConfig` to `CopilotProvider`.
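The wiring might look something like the following sketch. All classes here are simplified stand-ins so the example runs on its own; in particular the `IdleRecoveryConfig` field names other than `max_recovery_attempts`, the `idle_recovery_config` keyword on `CopilotProvider`, and the `to_provider_config` helper are assumptions, not the repository's actual signatures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IdleRecoveryConfig:
    """Stand-in for the provider-level dataclass (field names assumed)."""
    idle_timeout_seconds: float = 300.0
    max_recovery_attempts: int = 3
    reset_on_progress: bool = True


@dataclass
class IdleRecoveryDef:
    """Stand-in for the proposed schema model."""
    timeout_seconds: float = 300.0
    max_attempts: int = 3
    reset_on_progress: bool = True


@dataclass
class RuntimeConfig:
    """Stand-in with only the fields this sketch needs."""
    provider: str = "copilot"
    idle_recovery: Optional[IdleRecoveryDef] = None


class CopilotProvider:
    """Stand-in provider that falls back to defaults when no config is given."""

    def __init__(self, idle_recovery_config: Optional[IdleRecoveryConfig] = None):
        self.idle_recovery_config = idle_recovery_config or IdleRecoveryConfig()


def to_provider_config(definition: Optional[IdleRecoveryDef]) -> Optional[IdleRecoveryConfig]:
    """Translate the YAML-facing names into the internal dataclass."""
    if definition is None:
        return None  # let the provider apply its own defaults
    return IdleRecoveryConfig(
        idle_timeout_seconds=definition.timeout_seconds,
        max_recovery_attempts=definition.max_attempts,
        reset_on_progress=definition.reset_on_progress,
    )


def create_provider(runtime: RuntimeConfig) -> CopilotProvider:
    if runtime.provider != "copilot":
        raise ValueError(f"unsupported provider in this sketch: {runtime.provider}")
    return CopilotProvider(
        idle_recovery_config=to_provider_config(runtime.idle_recovery)
    )


default_provider = create_provider(RuntimeConfig())
tuned_provider = create_provider(
    RuntimeConfig(idle_recovery=IdleRecoveryDef(max_attempts=5))
)
print(default_provider.idle_recovery_config)
print(tuned_provider.idle_recovery_config)
```

Keeping the translation in one small function means the YAML names and the internal dataclass can evolve separately.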

### Implementation scope

| File | Change |
|------|--------|
| `src/conductor/config/schema.py` | Add `IdleRecoveryDef` model, add `idle_recovery` field to `RuntimeConfig` |
| `src/conductor/providers/copilot.py` | Add activity counter to `last_activity_ref`, reset `recovery_attempts` on progress in `_wait_with_idle_detection()` |
| `src/conductor/providers/factory.py` | Pass idle recovery config from `RuntimeConfig` → `CopilotProvider` |
| `tests/test_providers/test_idle_recovery.py` | Add tests for reset-on-progress behavior |

### Backward compatibility

- `reset_on_progress` defaults to `true` (this is the safer default — existing low `max_attempts` values become less likely to cause false failures)
- All existing YAML workflows continue to work unchanged (new fields are optional with defaults)
- Existing `IdleRecoveryConfig` dataclass stays as the internal representation
