Skip to content

fix(copilot): add permission handler, event logging, and idle detection race fix#15

Merged
jrob5756 merged 2 commits intomicrosoft:mainfrom
lucioctinoco:fix/copilot-sdk-integration
Mar 3, 2026
Merged

fix(copilot): add permission handler, event logging, and idle detection race fix#15
jrob5756 merged 2 commits intomicrosoft:mainfrom
lucioctinoco:fix/copilot-sdk-integration

Conversation

@lucioctinoco
Copy link
Copy Markdown
Contributor

Three fixes for the Copilot SDK provider:

  1. Add on_permission_request auto-approve handler - The Copilot SDK now requires this handler when creating sessions. Without it, every workflow fails. Auto-approve is correct for non-interactive conductor workflows.

  2. Add SDK event logging via logger.debug() - Every SDK event is logged at DEBUG level, visible when --log-file is passed. Enables post-mortem debugging of session stalls. Works with --web-bg.

  3. Fix done.clear() race condition - If session.idle arrives between a timeout check and done.clear(), the completion signal is lost, causing the loop to wait another full idle timeout (5 min). Added done.is_set() guards at three locations.

…on race fix

Three fixes for the Copilot SDK provider:

1. Add on_permission_request auto-approve handler to session config.
   The Copilot SDK now requires this handler when creating sessions.
   Without it, every workflow fails with 'An on_permission_request
   handler is required'. Auto-approve is correct for non-interactive
   conductor workflows.

2. Add SDK event logging via logger.debug() in the on_event callback.
   Every SDK event (including tool execution starts) is logged at
   DEBUG level, visible when --log-file is passed. This enables
   post-mortem debugging of session stalls. Works with --web-bg
   because bg_runner.py already forwards --log-file to the
   background process.

3. Fix done.clear() race condition in _wait_with_idle_detection.
   If session.idle arrives between a timeout check and done.clear(),
   the completion signal is lost, causing the loop to wait another
   full idle timeout (5 minutes) before checking again. Added
   done.is_set() guards at three locations before done.clear()
   and an early-return check at the top of the loop.
@lucioctinoco lucioctinoco force-pushed the fix/copilot-sdk-integration branch from 62b33b8 to 2426156 Compare March 3, 2026 23:02
Comment thread src/conductor/providers/copilot.py Outdated
"model": model,
# Auto-approve all permission requests (shell, write, mcp, read, url)
# since Conductor workflows run non-interactively.
"on_permission_request": lambda _req, _ctx: {"kind": "approved"},
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Lint failure (CI blocker): This inline lambda is causing the ruff format check to fail in CI. Extract it to a named static method with proper type hints and a docstring.

Suggestion:

@staticmethod
def _default_permission_handler(
    request: dict[str, Any],
    invocation: dict[str, str],
) -> dict[str, Any]:
    """Default permission handler that approves all requests.

    SDK v0.1.28+ requires a permission handler on session creation.
    In orchestration mode, we approve all tool permissions since the
    workflow author controls which tools are available to each agent.
    """
    return {"kind": "approved"}

Then reference it here as self._default_permission_handler.

Comment thread src/conductor/providers/copilot.py Outdated
# Build session config with MCP servers from workflow configuration
session_config: dict[str, Any] = {
"model": model,
# Auto-approve all permission requests (shell, write, mcp, read, url)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Missing resume_session handler: SDK v0.1.28+ also requires on_permission_request when calling resume_session(). Without it, resuming workflows will still raise the permission denied error.

Around line 429 where resume_session is called, pass a config dict:

session = await self._client.resume_session(
    resume_sid,
    {"on_permission_request": self._default_permission_handler},
)

Comment thread src/conductor/providers/copilot.py Outdated
session_config: dict[str, Any] = {
"model": model,
# Auto-approve all permission requests (shell, write, mcp, read, url)
# since Conductor workflows run non-interactively.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 No audit logging on permission approvals. Since this handler silently auto-approves dangerous categories (shell, write, mcp), consider adding a logger.debug so approvals are visible in --log-file output:

logger.debug("auto-approved permission request: %s", request)
return {"kind": "approved"}

Comment thread src/conductor/providers/copilot.py Outdated
tool_info = ""
if event_type == "tool.execution_start":
tn = getattr(event.data, "tool_name", None) or getattr(
event.data, "name", "?"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Minor: The getattr(event.data, ...) chain will raise AttributeError if the event object itself has no data attribute (e.g., a malformed SDK event). Consider adding a defensive guard:

if event_type == "tool.execution_start" and hasattr(event, "data") and event.data is not None:

Not a regression (existing code has the same pattern), but since this is new code it's worth hardening.

@@ -1267,6 +1280,11 @@
idle_timeout = self._idle_recovery_config.idle_timeout_seconds

while True:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good fix. This early-return guard closes a real race window where session.idle could arrive between a previous done.clear() and the await wait_for(), causing the completion signal to be lost and an unnecessary full idle-timeout wait.

@@ -1288,11 +1306,15 @@
# just hasn't finished yet. Reset recovery counter (new task)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good fix. Without this guard, if wait_for raises TimeoutError but done was set concurrently (e.g., session.idle arrived during the timeout window), the unconditional clear() would erase the legitimate completion signal. The check-then-act is safe here since there's no await between is_set() and clear().

@@ -1321,9 +1343,10 @@
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good fix. Between await session.send() and done.clear(), the SDK could process the recovery prompt fast enough that session.idle fires and sets done before we reach clear(). Without this guard, we'd erase that signal and loop again, sending a redundant recovery prompt.

…ume_session

- Extract inline lambda to static `_default_permission_handler` with
  proper type hints and docstring (fixes ruff format CI failure)
- Pass `on_permission_request` to `resume_session()` (was only on
  `create_session`, causing permission denied on resumed workflows)
- Add `logger.debug` audit logging on permission approvals
- Add defensive `hasattr(event, "data")` guard in event logging

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jrob5756 jrob5756 merged commit bf5dffd into microsoft:main Mar 3, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants