Expose max_session_seconds in workflow YAML and add permission-denied fast-fail #28

@jrob5756

Summary

When a Copilot session gets stuck (e.g., due to the race condition in #27, or any other permission/connectivity issue), agents spin uselessly for the full 1800s (30 min) hardcoded timeout before failing. In a for_each group with 10 items, a single stuck agent can turn a 13-minute workflow into a 60-minute timeout.

Two mitigations would dramatically reduce blast radius:

1. Expose max_session_seconds in workflow YAML

IdleRecoveryConfig.max_session_seconds currently defaults to 1800s and can only be changed through the Python constructor. Workflow authors should be able to tune this per-workflow or per-agent.

Proposed YAML schema:

workflow:
  runtime:
    provider: copilot
    max_session_seconds: 300  # workflow-level default

Or per-agent:

agents:
  - name: source_gatherer
    max_session_seconds: 120  # this agent should finish in ~60s

For for_each groups, the per-item agent timeout is especially important — a source-gathering agent that takes 30 minutes is certainly stuck, not working.

Implementation: Plumb the value through create_provider() in factory.py → CopilotProvider.__init__() → IdleRecoveryConfig(max_session_seconds=...).
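A minimal sketch of how the two YAML levels could be resolved before the value reaches IdleRecoveryConfig. The precedence (agent over workflow over the 1800s default) is an assumption rather than something decided in this issue; resolve_max_session_seconds is a hypothetical helper, and the IdleRecoveryConfig here is a stand-in that models only the one field discussed.

```python
from dataclasses import dataclass
from typing import Any, Mapping

# Current hardcoded value, per this issue.
DEFAULT_MAX_SESSION_SECONDS = 1800.0


@dataclass
class IdleRecoveryConfig:
    """Stand-in for the real class; only the field discussed here is modeled."""

    max_session_seconds: float = DEFAULT_MAX_SESSION_SECONDS


def resolve_max_session_seconds(
    workflow_runtime: Mapping[str, Any],
    agent: Mapping[str, Any],
) -> float:
    """Hypothetical helper: pick the most specific timeout that is set.

    Assumed precedence: agent-level, then workflow.runtime, then the default.
    """
    for source in (agent, workflow_runtime):
        value = source.get("max_session_seconds")
        if value is not None:
            seconds = float(value)
            if seconds <= 0:
                raise ValueError("max_session_seconds must be positive")
            return seconds
    return DEFAULT_MAX_SESSION_SECONDS


# Parsed YAML from the proposed schema above, written out as plain dicts.
runtime = {"provider": "copilot", "max_session_seconds": 300}
source_gatherer = {"name": "source_gatherer", "max_session_seconds": 120}
other_agent = {"name": "summarizer"}

gatherer_cfg = IdleRecoveryConfig(
    max_session_seconds=resolve_max_session_seconds(runtime, source_gatherer)
)
other_cfg = IdleRecoveryConfig(
    max_session_seconds=resolve_max_session_seconds(runtime, other_agent)
)
```

Keeping the resolution in one function would let create_provider() stay a thin pass-through, and gives a single place to reject zero or negative values.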

2. Detect permission-denied loops and fail fast

When every tool call returns "Permission denied", the agent is in an unrecoverable state — no amount of retrying will fix a missing session registration or a policy denial. Currently the agent keeps trying different tools, spawning sub-agents, and rephrasing requests for the full session timeout.

Proposed behavior: If an agent receives "Permission denied" (or the full string "Permission denied and could not request permission from user") on N consecutive tool results (e.g., N=5), Conductor should kill the session immediately with a clear ProviderError rather than waiting for max_session_seconds.

Implementation options:

  • In _send_and_wait() or the event callback in copilot.py, track consecutive tool results containing the permission-denied string
  • After N consecutive denials, raise ProviderError("All tool calls denied — possible permission configuration issue") with retryable=False
  • This could be an IdleRecoveryConfig option: max_consecutive_denials: int = 5
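The counting logic in the options above could look roughly like the sketch below. DenialTracker and its observe() method are hypothetical names, ProviderError is redefined here as a stand-in with only the retryable flag, and the substring match on "Permission denied" assumes tool results arrive as plain text; none of this reflects the actual copilot.py internals.

```python
from dataclasses import dataclass

# Substring shared by both denial messages quoted in this issue.
PERMISSION_DENIED_MARKER = "Permission denied"


class ProviderError(Exception):
    """Stand-in for Conductor's ProviderError; only `retryable` is modeled."""

    def __init__(self, message: str, retryable: bool = True) -> None:
        super().__init__(message)
        self.retryable = retryable


@dataclass
class DenialTracker:
    """Hypothetical helper that counts consecutive permission-denied tool results."""

    max_consecutive_denials: int = 5
    consecutive: int = 0

    def observe(self, tool_result_text: str) -> None:
        """Record one tool result; raise once the denial streak hits the limit."""
        if PERMISSION_DENIED_MARKER in tool_result_text:
            self.consecutive += 1
        else:
            # Any non-denied result breaks the streak.
            self.consecutive = 0
        if self.consecutive >= self.max_consecutive_denials:
            raise ProviderError(
                "All tool calls denied - possible permission configuration issue",
                retryable=False,
            )


# Usage: feed every tool result from the event callback into the tracker.
tracker = DenialTracker(max_consecutive_denials=5)
tracker.observe("Permission denied")
tracker.observe("file contents: ...")  # a successful call resets the counter
```

Resetting on any successful result keeps an agent that hits one or two isolated denials from being killed; only an unbroken run of N denials trips the fast-fail.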

Impact

With both changes, a stuck agent in a for_each group would fail in ~30s instead of ~1800s, keeping total workflow runtime close to the healthy baseline even when the race condition in #27 is hit.
