Race condition in SDK causes Permission denied in for-each groups under concurrency #27

@jrob5756

Description

Summary

When running for_each groups with max_concurrent > 1, Copilot sessions intermittently receive "Permission denied and could not request permission from user" on all tool calls. This causes agents to spin uselessly for the full 30-minute max_session_seconds timeout before failing, turning a 13-minute workflow into a 60-minute timeout.

Root Cause

There is a race condition in copilot-sdk's CopilotClient.create_session() (client.py lines 442–451):

response = await self._client.request("session.create", payload)  # 1. CLI creates session
session_id = response["sessionId"]
session = CopilotSession(session_id, self._client, workspace_path)
session._register_tools(tools)
if on_permission_request:
    session._register_permission_handler(on_permission_request)   # 2. Handler registered
with self._sessions_lock:
    self._sessions[session_id] = session                          # 3. Session added to lookup dict

The CLI process starts the session at step 1 and can immediately begin sending permission.request JSON-RPC messages. But the Python SDK doesn't register the session in _sessions until step 3. If a permission.request arrives between steps 1 and 3:

  1. _handle_permission_request() looks up the session in _sessions → not found
  2. Raises ValueError("unknown session {session_id}")
  3. _dispatch_request() catches the exception and sends a JSON-RPC error response (-32603)
  4. The CLI interprets this as a permission denial and returns "Permission denied" to the model

With 5 concurrent create_session calls (from max_concurrent: 5 in a for-each group), the chance that at least one session hits the race window rises significantly. Once a session's first permission request is denied, the model retries every tool it has, each retry is denied as well, and the agent burns its entire 1800s session timeout on futile retries.
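The ordering problem can be illustrated with a small, self-contained simulation. This is only a sketch: FakeClient, fake_cli_create, and the pre_register flag are hypothetical names, not part of copilot-sdk, and the fake CLI forces the worst case (a permission request arrives before the session is registered) so the outcome is deterministic rather than intermittent. The pre_register=True path assumes the session ID is known before the create call, which the real SDK may not guarantee.

```python
import asyncio


class FakeClient:
    """Hypothetical stand-in for the SDK client; not the real CopilotClient."""

    def __init__(self):
        self.sessions = {}  # plays the role of _sessions

    def handle_permission_request(self, session_id):
        # Mirrors the lookup that fails when the session is not registered yet.
        return "approved" if session_id in self.sessions else "denied"

    async def fake_cli_create(self, session_id):
        # The fake CLI fires a permission request as soon as the session exists,
        # before the create call has returned control to the caller.
        first_permission = self.handle_permission_request(session_id)
        await asyncio.sleep(0)  # yield, as a real RPC round trip would
        return {"sessionId": session_id, "firstPermission": first_permission}

    async def create_session(self, session_id, pre_register):
        if pre_register:
            # Assumed fix: make the session findable before step 1.
            self.sessions[session_id] = object()
        response = await self.fake_cli_create(session_id)      # step 1
        self.sessions.setdefault(response["sessionId"], object())  # step 3
        return response["firstPermission"]


async def run(pre_register):
    client = FakeClient()
    ids = [f"s{i}" for i in range(5)]  # five concurrent creates
    return await asyncio.gather(
        *(client.create_session(sid, pre_register) for sid in ids)
    )


racy = asyncio.run(run(pre_register=False))
fixed = asyncio.run(run(pre_register=True))
print(racy)   # every first permission request misses the lookup
print(fixed)  # every first permission request finds its session
```

In the real SDK the request only sometimes arrives inside the window, which is why some agents in a batch succeed while others fail.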

Evidence

Comparing CI runs of the same workflow with the same Copilot CLI version (1.0.2):

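| Date  | gather_sources duration                           | "Permission denied" count | Total run           |
|-------|---------------------------------------------------|---------------------------|---------------------|
| Mar 7 | 396s (10/10 succeeded)                            | 0                         | 13 min ✅           |
| Mar 8 | 3,050s (9/10 succeeded, 1 timed out at 1800s)     | hundreds                  | 59 min ⚠️           |
| Mar 9 | never completed (agents stuck in retry loops)     | 83+                       | 60 min ❌ cancelled |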

The pattern is consistent: agents get "Permission denied" on their very first tool call, then every subsequent tool call also fails. Other agents in the same batch work fine — they won the race.

Suggested Fixes

In copilot-sdk (root cause)

Register the session in _sessions before sending session.create to the CLI, or use a placeholder entry:

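# Build the session and attach its handlers before the CLI can send requests
session = CopilotSession(None, self._client, None)
session._register_tools(tools)
if on_permission_request:
    session._register_permission_handler(on_permission_request)

# Now create on CLI side
response = await self._client.request("session.create", payload)
session_id = response["sessionId"]
session._session_id = session_id

# Caveat: the ID is only known here, so the lookup entry is still added after
# session.create returns. Closing the window fully needs an ID known up front
# or a fallback for requests that target not-yet-registered sessions.
with self._sessions_lock:
    self._sessions[session_id] = session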

Alternatively, queue incoming permission.request messages for unknown sessions and replay them once the session is registered.
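A minimal sketch of the queue-and-replay idea, assuming a router that sits in front of the per-session permission handlers. PendingPermissionRouter, on_request, register, and the reply callback are all hypothetical names for illustration; the SDK's actual dispatch path is not shown in this issue and may look quite different.

```python
import threading
from collections import defaultdict


class PendingPermissionRouter:
    """Hypothetical router; names here are illustrative, not SDK APIs."""

    def __init__(self):
        self._lock = threading.Lock()
        self._handlers = {}                # session_id -> permission handler
        self._pending = defaultdict(list)  # session_id -> [(request, reply)]

    def on_request(self, session_id, request, reply):
        """Answer now if the session is known, otherwise park the request."""
        with self._lock:
            handler = self._handlers.get(session_id)
            if handler is None:
                self._pending[session_id].append((request, reply))
                return
        reply(handler(request))

    def register(self, session_id, handler):
        """Register the session, then replay anything that arrived early."""
        with self._lock:
            self._handlers[session_id] = handler
            parked = self._pending.pop(session_id, [])
        for request, reply in parked:
            reply(handler(request))


router = PendingPermissionRouter()
answers = []

# A request that arrives before the session is registered is parked, not denied.
router.on_request("s1", {"tool": "read_file"}, answers.append)
early_answers = list(answers)

# Registration replays the parked request; later requests are answered directly.
router.register("s1", lambda req: {"approved": True, "tool": req["tool"]})
router.on_request("s1", {"tool": "grep"}, answers.append)
print(answers)
```

A real implementation would also need a timeout or cap for parked requests, so that a request for a session that never registers still receives an error instead of waiting forever.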

In Conductor (mitigation)

  1. Expose max_session_seconds in workflow YAML — the hardcoded 1800s is far too long for a for-each item that should take ~60s. A 5-minute cap would limit damage to 5 min instead of 30 min per stuck agent.

  2. Detect permission-denied loops — if an agent receives "Permission denied" on N consecutive tool calls, fail fast instead of waiting for the session timeout.
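The fail-fast mitigation in item 2 could be sketched as a small counter over tool results. DenialTracker and PermissionDeniedLoop are hypothetical names, the threshold of 3 is an arbitrary stand-in for N, and matching on the "Permission denied" substring is an assumption about how the denial surfaces to Conductor.

```python
class PermissionDeniedLoop(RuntimeError):
    """Raised when too many consecutive tool calls are denied."""


class DenialTracker:
    """Hypothetical helper; Conductor's real hook points may differ."""

    def __init__(self, max_consecutive=3, marker="Permission denied"):
        self.max_consecutive = max_consecutive
        self.marker = marker
        self.streak = 0  # number of denials seen in a row

    def observe(self, tool_result):
        """Record one tool result; raise once the denial streak hits the limit."""
        if self.marker in tool_result:
            self.streak += 1
            if self.streak >= self.max_consecutive:
                raise PermissionDeniedLoop(
                    f"{self.streak} consecutive tool calls were denied"
                )
        else:
            self.streak = 0  # any successful call resets the streak


tracker = DenialTracker(max_consecutive=3)
denial = "Permission denied and could not request permission from user"
calls_made = 0
failure = None
try:
    for result in [denial] * 5:
        calls_made += 1
        tracker.observe(result)
except PermissionDeniedLoop as exc:
    failure = str(exc)

print(calls_made, failure)
```

With this in place, a stuck agent stops after N denied calls instead of running until max_session_seconds expires.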

Reproduction

Any workflow with a for_each group using max_concurrent >= 2 and the Copilot provider can hit this intermittently. Higher concurrency = higher probability.

Environment

  • Conductor: installed from main
  • github-copilot-sdk: 0.1.18
  • Copilot CLI: 1.0.2
  • Runtime: GitHub Actions ubuntu-latest
