Summary
When a Copilot session gets stuck (e.g., due to the race condition in #27, or any other permission/connectivity issue), agents spin uselessly for the full 1800s (30 min) hardcoded timeout before failing. In a for_each group with 10 items, a single stuck agent can turn a 13-minute workflow into a 60-minute timeout.
Two mitigations would dramatically reduce blast radius:
1. Expose max_session_seconds in workflow YAML
IdleRecoveryConfig.max_session_seconds currently defaults to 1800s and can only be changed through the Python constructor, so from a workflow author's point of view it is effectively hardcoded. Workflow authors should be able to tune this per-workflow or per-agent.
Proposed YAML schema:
workflow:
  runtime:
    provider: copilot
    max_session_seconds: 300  # workflow-level default
Or per-agent:
agents:
  - name: source_gatherer
    max_session_seconds: 120  # this agent should finish in ~60s
For for_each groups, the per-item agent timeout is especially important — a source-gathering agent that takes 30 minutes is certainly stuck, not working.
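One way the two levels could combine is sketched below, assuming a per-agent value overrides the workflow-level default, which in turn overrides the current 1800s. The function name `resolve_max_session_seconds` and the dict shapes are hypothetical; they mirror the proposed YAML rather than Conductor's actual config models.

```python
from typing import Any, Mapping, Optional

DEFAULT_MAX_SESSION_SECONDS = 1800.0  # current default according to this issue


def resolve_max_session_seconds(
    workflow: Mapping[str, Any],
    agent: Optional[Mapping[str, Any]] = None,
) -> float:
    """Pick the effective timeout: agent override, then workflow default, then 1800s."""
    candidates = [
        (agent or {}).get("max_session_seconds"),
        workflow.get("runtime", {}).get("max_session_seconds"),
        DEFAULT_MAX_SESSION_SECONDS,
    ]
    # The built-in default is never None, so next() always finds a value.
    value = next(v for v in candidates if v is not None)
    if isinstance(value, bool) or not isinstance(value, (int, float)) or value <= 0:
        raise ValueError(f"max_session_seconds must be a positive number, got {value!r}")
    return float(value)
```

Rejecting zero and negative values at parse time is a design choice in this sketch: a typo in the YAML would then surface immediately instead of producing a session that times out at once.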
Implementation: Plumb the value through create_provider() in factory.py → CopilotProvider.__init__() → IdleRecoveryConfig(max_session_seconds=...).
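The plumbing path could look roughly like the sketch below. The classes here are simplified stand-ins, not the real ones: only the names create_provider, CopilotProvider, and IdleRecoveryConfig(max_session_seconds=...) come from this issue, and the create_provider signature (taking the parsed `runtime` mapping) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IdleRecoveryConfig:
    # Stand-in for the real class; only the field named in this issue is modeled.
    max_session_seconds: float = 1800.0


class CopilotProvider:
    # Stand-in for the real provider; the keyword argument is the proposed addition.
    def __init__(self, max_session_seconds: Optional[float] = None) -> None:
        if max_session_seconds is None:
            # Nothing configured in YAML: keep the existing default.
            self.idle_recovery = IdleRecoveryConfig()
        else:
            self.idle_recovery = IdleRecoveryConfig(
                max_session_seconds=max_session_seconds
            )


def create_provider(runtime: dict) -> CopilotProvider:
    # Hypothetical signature: receives the parsed `runtime` mapping from the workflow YAML.
    provider = runtime.get("provider")
    if provider != "copilot":
        raise ValueError(f"unsupported provider in this sketch: {provider!r}")
    return CopilotProvider(max_session_seconds=runtime.get("max_session_seconds"))
```

Passing None through when the key is absent keeps the default in one place (IdleRecoveryConfig), so existing workflows would behave as they do today.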
2. Detect permission-denied loops and fail fast
When every tool call returns "Permission denied", the agent is in an unrecoverable state — no amount of retrying will fix a missing session registration or a policy denial. Currently the agent keeps trying different tools, spawning sub-agents, and rephrasing requests for the full session timeout.
Proposed behavior: If an agent receives "Permission denied" (or the full string "Permission denied and could not request permission from user") on N consecutive tool results (e.g., N=5), Conductor should kill the session immediately with a clear ProviderError rather than waiting for max_session_seconds.
Implementation options:
- In _send_and_wait() or the event callback in copilot.py, track consecutive tool results containing the permission-denied string
- After N consecutive denials, raise ProviderError("All tool calls denied — possible permission configuration issue") with retryable=False
- This could be an IdleRecoveryConfig option: max_consecutive_denials: int = 5
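A minimal sketch of the counting logic follows, assuming tool results are available as plain strings at the point where the callback sees them. DenialTracker and its observe() method are hypothetical names, and ProviderError here is a stand-in that models only the retryable flag.

```python
from dataclasses import dataclass

# Substring shared by the short and the full denial message quoted in this issue.
PERMISSION_DENIED_MARKER = "Permission denied"


class ProviderError(Exception):
    # Stand-in for Conductor's ProviderError; only `retryable` is modeled.
    def __init__(self, message: str, retryable: bool = True) -> None:
        super().__init__(message)
        self.retryable = retryable


@dataclass
class DenialTracker:
    max_consecutive_denials: int = 5
    consecutive: int = 0

    def observe(self, tool_result: str) -> None:
        """Record one tool result; raise once the denial streak reaches the limit."""
        if PERMISSION_DENIED_MARKER in tool_result:
            self.consecutive += 1
        else:
            # Any result without the marker breaks the streak.
            self.consecutive = 0
        if self.consecutive >= self.max_consecutive_denials:
            raise ProviderError(
                "All tool calls denied — possible permission configuration issue",
                retryable=False,
            )
```

Resetting the counter on any non-denied result is deliberate: an agent that hits an occasional denial but still makes progress would not be killed, while one that is denied on every call fails after max_consecutive_denials results.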
Impact
With both changes, a stuck agent in a for_each group would fail in ~30s instead of ~1800s, keeping total workflow runtime close to the healthy baseline even when the race condition in #27 is hit.
Related