
feat: Intercept SDK task tool for multi-agent orchestration (#229) #318

Closed
PureWeen wants to merge 34 commits into main from feature/intercept-task-tool-229

Conversation


@PureWeen PureWeen commented Mar 9, 2026

Closes #229

What

Replaces fragile @worker: text-parsing with structured tool-call dispatch for multi-agent orchestration.

Changes

  • Block SDK built-in task tool via ExcludedTools=["task"]
  • Register custom task AIFunction that routes to workers via ExecuteWorkerAsync
  • Thread-safe result collection (WorkerDelegationContext)
  • Tool-dispatch path with @worker: text fallback
  • Reconnect handling (clear _reflectToolConfigured)
  • Codespace client support (GetClientForGroup)
  • 14 unit tests

Testing

  • 2114+ tests passing
  • Use Implement & Challenge preset (OrchestratorReflect mode) to verify tool-dispatch
  • Fallback to @worker: text parsing works when tool setup fails

@PureWeen PureWeen force-pushed the feature/intercept-task-tool-229 branch 2 times, most recently from b5ba48c to c8441b6 on March 9, 2026 at 23:22
PureWeen and others added 27 commits March 9, 2026 20:26
- Add WorkerDelegationContext and WorkerDelegationTool to intercept
  the model built-in task sub-agent calls in OrchestratorReflect mode
- Resume orchestrator session with ExcludedTools + custom
  delegation AIFunction that routes calls to PolyPilot workers
- Add BuildOrchestratorReflectToolPrompt for tool-based planning prompts
- Add EnsureOrchestratorReflectToolsAsync to configure orchestrators
- Detect tool-dispatch results and skip ParseTaskAssignments fallback
- Add WorkerDelegationToolTests covering context and tool behavior

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When an orchestrator session reconnects, SendPromptAsync resumes with
only ShowImageTool -- no ExcludedTools or custom delegation function.
Previously _reflectToolConfigured stayed set, so EnsureOrchestratorReflectToolsAsync
would skip re-configuration, leaving the orchestrator without the delegation
tool after reconnect.

Fix: TryRemove the session from _reflectToolConfigured after the reconnect
so EnsureOrchestratorReflectToolsAsync re-runs on the next reflect iteration.

Adds regression test ReconnectState_ShouldClearReflectToolConfigured.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Bug 1 - Thread safety: WorkerDelegationContext._results was a plain List<T>
that the SDK could mutate concurrently via parallel tool calls. Added
_resultsLock; AddResult and Reset lock on it; DispatchedResults returns a
ToList() snapshot so callers see a consistent view.
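The Bug 1 locking pattern can be sketched as follows. This is a minimal illustration of the described fix, not the actual PolyPilot source; the class shape and member types here are assumptions:

```csharp
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch of the thread-safety fix: a private lock guards the
// results list, and DispatchedResults hands out a snapshot copy so callers
// iterate a consistent view while parallel tool calls keep appending.
public sealed class WorkerDelegationContext
{
    private readonly object _resultsLock = new();
    private readonly List<string> _results = new();

    public void AddResult(string result)
    {
        lock (_resultsLock) { _results.Add(result); }
    }

    public void Reset()
    {
        lock (_resultsLock) { _results.Clear(); }
    }

    // ToList() inside the lock: the returned list is a point-in-time copy,
    // never the live collection.
    public IReadOnlyList<string> DispatchedResults
    {
        get { lock (_resultsLock) { return _results.ToList(); } }
    }
}
```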

Bug 2 - Wrong nudge format: with usingToolDispatch=true, when the model
skipped the task tool the code fell through to ParseTaskAssignments and sent
a @worker: text-format nudge. Added BuildToolDispatchNudgePrompt and now
handle the no-tool-call case inside the tool-dispatch if-block.

Bug 3 - False GoalMet: at iteration > 1 with usingToolDispatch=true and
no tool calls, the text-path declared GoalMet with no workers dispatched.
The new tool-dispatch branch handles this by nudging the model up to 3
times before stalling, never declaring GoalMet without actual dispatches.

Bug 4 - Stale usingToolDispatch: computed once before the loop; a
mid-loop reconnect cleared _reflectToolConfigured but usingToolDispatch
stayed true. Now re-checked at the top of each iteration: if the flag was
cleared, EnsureOrchestratorReflectToolsAsync re-registers the tool on the
new session handle and refreshes usingToolDispatch.

Added 5 regression tests: Context_ConcurrentAddResult_DoesNotCorrupt,
Context_DispatchedResults_ReturnsConcurrentSnapshot,
WorkerDelegationContext_ResultsList_IsThreadSafe,
ToolDispatch_ShouldNudgeWhenModelSkipsTaskTool,
ToolDispatch_ShouldRefreshUsingToolDispatchEachIteration.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… codespace support (#229)

Now that PR #308 (codespace support) is merged, an OrchestratorReflect group
can have its sessions running in a codespace via a tunnel client stored in
_codespaceClients[groupId] — not in the local _client field.

EnsureOrchestratorReflectToolsAsync was calling _client.ResumeSessionAsync
directly, which for a codespace orchestrator would use the wrong (local) client,
causing ResumeSessionAsync to fail and silently fall back to @worker: text
dispatch — negating the entire #229 feature for codespace groups.

Fix: resolve the correct CopilotClient via GetClientForGroup(orchestratorGroupId),
matching the pattern used by CreateSessionAsync, SendPromptAsync, and all other
session operations throughout CopilotService.

Regression test added: EnsureOrchestratorReflectTools_UsesGetClientForGroup
verifies the source uses GetClientForGroup and not _client.ResumeSessionAsync.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…estratorToolsAsync (#229)

- EnsureOrchestratorReflectToolsAsync renamed to EnsureOrchestratorToolsAsync (not reflect-specific)
- _reflectToolConfigured renamed to _toolDispatchConfigured throughout
- SendViaOrchestratorAsync now calls EnsureOrchestratorToolsAsync and handles tool-dispatch results,
  letting regular Orchestrator groups use task tool calls instead of @worker: text blocks
- BuildOrchestratorPlanningPrompt gains useToolDispatch param: emits task-tool instructions
  instead of @worker:/@EnD blocks when tool-dispatch is active
- All regression tests updated to reflect new names and expanded test window

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The SDK validates that external tools with names matching built-in tools must
explicitly declare is_override=true in AdditionalProperties. Use the
AIFunctionFactoryOptions overload of AIFunctionFactory.Create to set Name,
Description, and AdditionalProperties together.
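The registration described above can be sketched with the Microsoft.Extensions.AI factory API the commit names. The delegate body, class name, and parameter shapes are illustrative assumptions; only AIFunctionFactory.Create, AIFunctionFactoryOptions, and the is_override key come from the commit text:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;

// Sketch of creating a custom "task" tool that overrides the SDK built-in.
public static class DelegationToolFactory
{
    public static AIFunction CreateTaskTool(
        Func<string, string, Task<string>> executeWorkerAsync)
    {
        return AIFunctionFactory.Create(
            executeWorkerAsync,
            new AIFunctionFactoryOptions
            {
                Name = "task",          // same name as the SDK built-in tool
                Description = "Delegate a task to a PolyPilot worker session.",
                AdditionalProperties = new Dictionary<string, object?>
                {
                    // Required when shadowing a built-in tool name.
                    ["is_override"] = true,
                },
            });
    }
}
```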

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a reflect loop is cancelled (e.g., user sends a new message) while a
worker is still running its task tool invocation, the new loop's dispatch
immediately fails with 'Session is already processing a request'.

ExecuteWorkerAsync now checks if the target worker is still processing and
waits on its ResponseCompletion TCS (up to 10 min) before dispatching the
new task. This mirrors the busy-wait pattern used elsewhere and prevents
the 'elapsed=0.0s FAILED' pattern seen in production logs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tool dispatch (#229)

When the orchestrator calls our custom task tool, the SDK may fire
SessionIdleEvent before the long-running worker callback completes.
Fix: increment ActiveToolCallCount on tool dispatch start, decrement
on completion (in finally block). The existing guard in CompleteResponse
(line 476) already skips completion when ActiveToolCallCount > 0.

Also removes ExcludedTools=[task] from ResumeSessionConfig since
is_override=true on the custom tool handles the replacement, and
updates stale log messages.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…stom tool

The SDK (0.1.32) fires ToolExecutionStartEvent for is_override custom tools
but never fires ToolExecutionCompleteEvent or SessionIdleEvent after the
callback returns. This leaves ActiveToolCallCount permanently elevated and
the orchestrator stuck until the 600s watchdog fires.

Two-part fix:
1. SessionIdleEvent handler: guard against premature completion when
   ActiveToolCallCount > 0 (tools still in-flight)
2. OnToolDispatchEnd: after 8s grace period, if SDK hasn't produced events,
   manually reset ActiveToolCallCount and trigger CompleteResponse

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…losure

EnsureOrchestratorToolsAsync replaces the SessionState during ResumeSessionAsync.
The OnToolDispatchStart/End closures captured the old state reference, so the
manual CompleteResponse workaround was operating on a stale object — its
ProcessingGeneration was 0 and its TCS had no awaiters.

Fix: look up the current state from _sessions dictionary at callback time.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Cache the delegation AIFunction in _delegationFunctions dictionary so
the reconnect handler in SendPromptAsync can include it in the
ResumeSessionConfig.Tools list. This prevents iteration 2+ from
stalling because the reconnected session only had ShowImageTool.

Also handles the fresh-session fallback path (session expired).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ning concurrent tools

The unconditional Exchange(ref ActiveToolCallCount, 0) could corrupt
tool accounting if a new tool started during the 8s grace period.
Now uses CompareExchange to only reset when exactly 1 tool remains,
and skips manual CompleteResponse if other tools are still in-flight.
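The CompareExchange guard described above can be sketched like this (field and method names are illustrative): the reset succeeds only when exactly one tool call remains, so a tool that started during the grace period leaves the count untouched.

```csharp
using System.Threading;

// Sketch of the guarded force-reset: store 0 only if the current value is
// exactly 1; otherwise another tool is in-flight and we must not complete.
public sealed class ToolAccounting
{
    public int ActiveToolCallCount;

    // Returns true if this call performed the 1 -> 0 reset, meaning the
    // caller may safely trigger the manual CompleteResponse.
    public bool TryResetLastTool()
        => Interlocked.CompareExchange(ref ActiveToolCallCount, 0, 1) == 1;
}
```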

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…1600ms)

The SDK either fires events within milliseconds or not at all.
Polling with early bail-out reduces worst-case latency from 8s to ~3s
while still giving the SDK a fair chance at each interval.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…aching

ResumeSessionAsync on a session stuck in 'tool executing' state produces
a dead session (no events). Instead of including the delegation tool in
the reconnect config (which doesn't help), always clear the flag and
add a post-send re-ensure check: after SendPromptAndWaitAsync returns,
if tools were lost (reconnect cleared the flag), immediately recreate
a clean session via EnsureOrchestratorToolsAsync and retry the prompt.
This avoids the 600s watchdog wait on each reconnect.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ResumeSessionAsync on a session stuck in 'tool executing' state produces
a dead session that never fires events. For orchestrator sessions with
tool dispatch, skip the retry entirely and complete with empty response.
The reflection loop's post-send re-ensure detects tools were lost,
creates a clean session via EnsureOrchestratorToolsAsync, and retries
immediately — no 600s watchdog wait.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…sh session fallback

When ResumeSessionAsync throws 'Session not found' (expired server-side),
create a fresh session via CreateSessionAsync with the delegation tool.
Also moved _delegationFunctions storage to after successful session
creation and clean up on failure to prevent stale entries that cause
the reconnect handler to incorrectly skip retry.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The SDK fires ToolExecutionStartEvent for ALL parallel tool calls but
only invokes our is_override callback for ONE of them. The un-invoked
calls leave phantom ActiveToolCallCount entries that prevent manual
CompleteResponse from firing. Force-reset to 0 since our
OnToolDispatchStart/End callbacks are the real source of truth.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ator mode

- Add post-send re-ensure after planning prompt (same as OrchestratorReflect)
- Clear delegation state before synthesis prompt so reconnect retries normally
  instead of skip-retry (synthesis doesn't use the task tool)
- Fixes dead-session hang during PR Review Squad synthesis phase

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After manual CompleteResponse for is_override tool dispatch, the server-side
session is stuck in 'tool executing' state. ResumeSessionAsync on it creates
a dead session. Instead, create a completely fresh session via CreateSessionAsync
for the synthesis phase (which doesn't need the delegation tool).

Also adds post-send re-ensure for the planning prompt in Orchestrator mode,
matching the pattern already used in OrchestratorReflect.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…allel

The SDK (0.1.32) fires ToolExecutionStartEvent for all parallel is_override
tool calls but only invokes the callback for a subset. This causes phantom
dispatches where some workers never execute. Instructing the model to call
task one-at-a-time works around this SDK limitation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The SDK only executes one is_override tool callback per turn, then our
manual CompleteResponse ends the turn. For multi-worker scenarios (e.g.,
review 3 PRs), Orchestrator mode now loops: each iteration dispatches
one worker via a fresh session with continuation prompt, accumulating
results until all workers are dispatched. Then a final synthesis prompt
combines all results. Capped at 10 iterations for safety.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…urns

After manual CompleteResponse, the server-side session is stuck in
'tool executing' state. ResumeSessionAsync creates a dead session.
Added forceCreate parameter to EnsureOrchestratorToolsAsync that uses
CreateSessionAsync instead, giving each continuation turn a clean session.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-robin

Reset() with all workers + round-robin always picked worker-1.
Now passes only un-dispatched workers so each continuation iteration
dispatches to a different worker.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
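The un-dispatched-workers fix above can be sketched as a queue-backed pool (illustrative names, not the actual PolyPilot types): seeding each continuation iteration with only the workers not yet dispatched guarantees successive picks rotate through every worker instead of always resolving to index 0.

```csharp
using System.Collections.Generic;

// Sketch: a pool holding only un-dispatched workers. Each pick removes the
// worker, so round-robin degenerates into a simple dequeue.
public sealed class WorkerPool
{
    private readonly Queue<string> _undispatched;

    public WorkerPool(IEnumerable<string> undispatchedWorkers)
        => _undispatched = new Queue<string>(undispatchedWorkers);

    public bool TryPickNext(out string worker)
        => _undispatched.TryDequeue(out worker!);
}
```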
Check ActiveToolCallCount before completing on reconnect. If > 0, a tool
callback is still executing (worker is running). Let it finish naturally —
OnToolDispatchEnd will trigger CompleteResponse. Prevents premature empty
response when worker takes longer than the SDK reconnect interval.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The continuation prompt told the model 'workers still need tasks' which caused
it to dispatch ALL available workers even when the user's request was fulfilled.
Now explicitly tells the model to stop dispatching when all requested work is
done, and that it does NOT need to use all available workers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When RestorePreviousSessionsAsync encounters a 'session file corrupted'
error (e.g., events.jsonl locked by another copilot process), fall back
to CreateSessionAsync instead of silently dropping the session. This
prevents multi-agent orchestrator sessions from vanishing after restart.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a session's events.jsonl is locked by another copilot process (e.g.,
Copilot CLI terminal), the SDK reports 'session file corrupted'. Updated
the UI error message to explain the actual cause (CLI lock) and suggest
closing the CLI session first.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Changed the dispatch loop from sequential (await each worker before
dispatching the next) to parallel (fire-and-forget each worker, await
all at the end). The SDK's one-callback-per-turn limitation means we
still loop to dispatch (~3s per worker), but workers now run in parallel
instead of serially.

Before: dispatch W1 (wait 43s) → W2 (wait 199s) → W3 (wait 531s) = 773s
After:  dispatch all 3 (~9s) → await longest (531s) = ~540s

Key changes:
- WorkerDelegationTool.InvokeAsync is now non-blocking: starts worker
  in background via Task.Run, returns 'dispatched' immediately
- WorkerDelegationContext gains PendingTasks dict and AwaitAllPendingAsync()
- FullReset() clears pending tasks; Reset() preserves them across iterations
- Dispatch loop collects dispatched names (not results), sends lightweight
  continuation prompts, then awaits all pending tasks after loop completes
- Updated prompts to explain parallel execution to the model

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
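The fire-and-forget dispatch described above can be sketched as follows. Names mirror the commit message (PendingTasks, AwaitAllPendingAsync), but the shapes are illustrative assumptions, not the real implementation:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Sketch: Dispatch starts the worker in the background and returns
// immediately; the dispatch loop awaits all pending tasks once every
// worker has been started, so workers run in parallel.
public sealed class PendingDispatches
{
    private readonly ConcurrentDictionary<string, Task<string>> _pendingTasks = new();

    public string Dispatch(string workerName, Func<Task<string>> runWorker)
    {
        _pendingTasks[workerName] = Task.Run(runWorker); // fire-and-forget
        return $"dispatched {workerName}";
    }

    public async Task<IReadOnlyList<string>> AwaitAllPendingAsync()
    {
        var results = await Task.WhenAll(_pendingTasks.Values);
        _pendingTasks.Clear();
        return results;
    }
}
```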
@PureWeen PureWeen force-pushed the feature/intercept-task-tool-229 branch from 975d8de to a6e8b9f on March 10, 2026 at 01:27
PureWeen added a commit that referenced this pull request Mar 10, 2026
- Fix OrchestratorReflect round-robin: add ResetResults() that preserves
  round-robin index across iterations so workers alternate correctly
  (worker-1 → worker-2 → worker-1) instead of always picking index 0
- Fix critical processing-state cleanup violation: reconnect skip-retry
  path now clears all 9 companion fields matching CompleteResponse pattern
  (was only clearing IsProcessing + watchdog, leaving stale API time,
  tool counts, phase, and resumed flag)
- Fix placeholder Success=true: add Dispatched flag to ToolDispatchedResult
  so placeholders are semantically distinct from completed results
- Add ObserveAllPending() for safe cleanup of unobserved task exceptions
  when dispatch loop crashes before AwaitAllPendingAsync()
- Replace Interlocked.CompareExchange read-only checks with Volatile.Read
  for ActiveToolCallCount (idiomatic and clearer intent)
- Normalize MSBuild path separators (backslash → forward slash)
- Improve session error message to cover both locked and corrupted cases
- Add 3 new tests: AwaitAllPendingAsync with throwing task,
  ResetResults preserves round-robin, ObserveAllPending suppresses exceptions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen force-pushed the feature/intercept-task-tool-229 branch from d5fd38a to 45b34b3 on March 10, 2026 at 03:08
PureWeen and others added 3 commits March 9, 2026 22:58
When the model responds with text (planning) instead of calling the task
tool on the first turn, the Orchestrator mode dispatch branch was skipped
entirely — falling through to text-parsing and never dispatching workers
in parallel.

Fix: mirror the OrchestratorReflect nudge pattern (lines 2604-2630).
If DispatchedResults is empty after the first turn, send a
BuildToolDispatchNudgePrompt on a fresh session before falling through
to text-parsing. This gives the model a second chance to use the task
tool.

Also adds 7 new unit tests covering:
- ResetResults/Reset preserve pending tasks across iterations
- FullReset clears pending tasks
- PendingTasks accumulate across context resets
- Multiple dispatches round-robin workers
- Nudge scenario (empty results then dispatch)
- ToolDispatchedResult.Dispatched flag defaults

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ndingAsync

Two fixes for zombie worker sessions after SDK reconnect:

1. Worker revival: when SendPromptAndWaitAsync returns an empty response
   (dead session — SDK lost event stream after reconnect), automatically
   send a 'please continue' steering message. The server-side session may
   still be alive; a new SendAsync can kick it back into action.

2. Per-worker timeout on AwaitAllPendingAsync: if a worker task doesn't
   complete within 5 minutes (configurable), return a timeout error instead
   of blocking the entire dispatch forever. Previously, a single dead worker
   with a stuck watchdog (600s tool timeout) blocked synthesis for 10+ min.

Root cause analysis: workers that reconnect during SDK event streaming get
a new session object, but the server-side session sometimes goes silent
after 1-2 turns. The watchdog eventually fires but uses the 600s tool
timeout (isMultiAgentSession=true triggers Case B at 600s instead of 120s).
This is a pre-existing SDK issue, not caused by our changes.

Also adds test: Context_AwaitAllPendingAsync_TimesOutDeadWorker

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
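The per-worker timeout described above can be sketched with a Task.WhenAny race. The 5-minute default is the value quoted in the commit; the method name and error string are illustrative assumptions:

```csharp
using System;
using System.Threading.Tasks;

// Sketch: race the worker task against a delay so a single dead worker
// yields a timeout marker instead of blocking the whole dispatch.
public static class WorkerAwaiter
{
    public static async Task<string> AwaitWithTimeoutAsync(
        Task<string> workerTask,
        TimeSpan? timeout = null)
    {
        var limit = timeout ?? TimeSpan.FromMinutes(5);
        var completed = await Task.WhenAny(workerTask, Task.Delay(limit));
        return completed == workerTask
            ? await workerTask                 // propagate result or fault
            : "ERROR: worker timed out";       // dispatch continues without it
    }
}
```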
When the initial ResumeSession produces a dead session, the nudge code
creates a fresh session via CreateSessionAsync. The fresh session has
no conversation history, so sending only the nudge text ('use the task
tool') left the model without context about being an orchestrator.

Fix: re-send the full planningPrompt (with appended tool-use instruction)
on the fresh session so the model has all the context it needs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen force-pushed the feature/intercept-task-tool-229 branch from 618b920 to 7ebf792 on March 10, 2026 at 04:58
Two fixes for zombie worker sessions:

1. Serialize client reconnect (thundering herd fix): When multiple
   workers hit IsConnectionError concurrently, each one would dispose
   and recreate _client, destroying the previous worker's SSE stream.
   Only the last worker to reconnect survived. Fix: add
   _clientReconnectLock (SemaphoreSlim) to serialize the path. The
   first worker recreates _client; subsequent workers reuse it.

2. Fresh-session revival: When a worker returns empty (dead SSE stream),
   instead of sending a steering message to the same dead session,
   create a brand new session via CreateSessionAsync and replay the
   full worker prompt. The old session is disposed, the new SessionId
   is stored in Info, and _sessions[name] is replaced with fresh state.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
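The thundering-herd fix above can be sketched as follows. The client type, generation counter, and method names are illustrative stand-ins for the real CopilotClient handling; only the SemaphoreSlim serialization idea comes from the commit:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Sketch: a SemaphoreSlim serializes the reconnect path. Each waiter
// re-checks whether an earlier waiter already recreated the client and,
// if so, reuses it instead of disposing and destroying its stream.
public sealed class ReconnectingClientHolder
{
    private readonly SemaphoreSlim _clientReconnectLock = new(1, 1);
    private object? _client;
    private int _generation;

    public async Task<object> ReconnectAsync(int observedGeneration, Func<object> createClient)
    {
        await _clientReconnectLock.WaitAsync();
        try
        {
            // Recreate only if nobody reconnected while we waited.
            if (_generation == observedGeneration || _client is null)
            {
                _client = createClient();
                _generation++;
            }
            return _client;
        }
        finally { _clientReconnectLock.Release(); }
    }
}
```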
@PureWeen PureWeen force-pushed the feature/intercept-task-tool-229 branch from 7ebf792 to b457532 on March 10, 2026 at 05:00
@PureWeen (Owner, Author) commented:

PR #318 — 5-Model Consensus Review

Title: feat: Intercept SDK task tool for multi-agent orchestration
CI: ⚠️ No checks configured on feature/intercept-task-tool-229
Prior reviews: None
Models: claude-opus-4.6 ×2, gemini-3-pro-preview, gpt-5.3-codex (4/5 complete; 1 timed out — findings below reflect 4-model consensus)


🔴 CRITICAL — Reconnect-skip path leaves SendingFlag set → session deadlock (4/4 consensus)

PolyPilot/Services/CopilotService.cs — reconnect-skip block

The new orchestrator reconnect-skip path sets IsProcessing = false but does not call Interlocked.Exchange(ref state.SendingFlag, 0).

SendPromptAsync sets SendingFlag = 1 at entry. All other IsProcessing = false paths release it (either via CompleteResponse or explicit reset). This path does neither. The next call to SendPromptAsync for this session will fail the CompareExchange(ref state.SendingFlag, 1, 0) check and throw "Session is already processing a request" — permanently deadlocking the session until the app restarts.

Additional missing INV-1 cleanup in the same block (4/4 consensus):

  • FlushCurrentResponse(state) not called before IsProcessing = false → any streaming content accumulated before reconnect is silently dropped
  • ClearPermissionDenials() not called → stale permission denials persist into next turn
  • OnSessionComplete not fired → external subscribers never unblocked
// Missing from the reconnect-skip block:
FlushCurrentResponse(state);                             // before IsProcessing = false
Interlocked.Exchange(ref state.SendingFlag, 0);          // ← CRITICAL: prevents deadlock
state.Info.ClearPermissionDenials();
OnSessionComplete?.Invoke(sessionName);
// Also missing diagnostic log tag: [RECONNECT-SKIP] or similar

🟡 MODERATE — OnStateChanged?.Invoke() in reconnect-skip path fires on background thread (4/4 consensus)

PolyPilot/Services/CopilotService.cs — same reconnect-skip block

OnStateChanged?.Invoke() is called directly without InvokeOnUI(). This code runs in SendPromptAsync's catch block, which for worker sessions executes on a Task.Run background thread. Project invariant INV-2 requires ALL IsProcessing state mutations and OnStateChanged notifications to be marshaled to the UI thread. Every other termination path uses InvokeOnUI() or _syncContext.Post(). This will cause cross-thread Blazor render exceptions under load.

Fix: Wrap the entire block in InvokeOnUI(() => { ... }).


🟡 MODERATE — Escalating poll calls CompleteResponse on background thread when _syncContext == null (4/4 consensus)

PolyPilot/Services/CopilotService.Organization.cs — EnsureOrchestratorToolsAsync, OnToolDispatchEnd lambda

if (_syncContext != null)
    _syncContext.Post(_ => CompleteResponse(cur, endGen), null);
else
    CompleteResponse(cur, endGen);  // ← called on Task.Run thread

The else branch calls CompleteResponse directly on the thread pool. CompleteResponse mutates IsProcessing, fires OnStateChanged, and clears SendingFlag — all must run on the UI thread per INV-2. The watchdog uses InvokeOnUI() (which handles null _syncContext safely); this code should too. In production MAUI _syncContext is typically non-null, but the fallback path is a correctness bug.

Fix: Replace else CompleteResponse(cur, endGen) with else InvokeOnUI(() => CompleteResponse(cur, endGen)).


🟡 MODERATE — AddPendingTask silently overwrites prior task for same worker name (3/4 consensus)

PolyPilot/Services/WorkerDelegationTool.cs — AddPendingTask

internal void AddPendingTask(string workerName, Task<ToolDispatchedResult> task)
{
    _pendingTasks[workerName] = task;  // overwrites silently
}

The round-robin index wraps around when all workers have been dispatched. With 2 workers and 3 dispatches, the sequence is w1→w2→w1. The second w1 dispatch overwrites the first task in _pendingTasks. The first w1 worker continues running in the background but its result is permanently lost — AwaitAllPendingAsync only collects the second task. This also produces an unobserved task exception if the first task faults.

Fix: Use a list-of-tasks keyed by worker name, or append a dispatch counter to the key.
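The dispatch-counter variant of the suggested fix can be sketched like this (names are illustrative; this is one possible shape, not the PR's code):

```csharp
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Sketch: append a monotonic counter to the key so a second dispatch to the
// same worker cannot overwrite (and orphan) the first pending task.
public sealed class PendingTaskMap
{
    private readonly ConcurrentDictionary<string, Task<string>> _pendingTasks = new();
    private int _dispatchCounter;

    public void AddPendingTask(string workerName, Task<string> task)
    {
        var key = $"{workerName}#{Interlocked.Increment(ref _dispatchCounter)}";
        _pendingTasks[key] = task; // unique key: no silent overwrite
    }

    public Task<string[]> AwaitAllPendingAsync()
        => Task.WhenAll(_pendingTasks.Values.ToArray());
}
```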


🟡 MODERATE — Stale pending tasks from interrupted dispatch survive into new context (3/4 consensus)

PolyPilot/Services/CopilotService.Organization.cs — EnsureOrchestratorToolsAsync

var context = _delegationContexts.GetOrAdd(orchestratorName, _ => new WorkerDelegationContext());

GetOrAdd reuses the existing WorkerDelegationContext on reconnect. The orchestration loops call FullReset at the start of a new cycle (which clears _pendingTasks), but the reflect loop only calls Reset/ResetResults — which deliberately preserve pending tasks. After a reconnect mid-reflect-loop, the new iteration's EnsureOrchestratorToolsAsync reuses the context with surviving pending tasks from the interrupted dispatch. Those tasks are then awaited by AwaitAllPendingAsync in the new iteration, mixing stale results from the old dispatch with new ones.


🟡 MODERATE — NRE in synthesis session creation when _client is null (2/4 consensus)

PolyPilot/Services/CopilotService.Organization.cs — synthesis path

GitHub.Copilot.SDK.CopilotClient synthClient;
try { synthClient = GetClientForGroup(orchestratorGroupId2); }
catch { synthClient = _client; }
var freshSession = await synthClient.CreateSessionAsync(...);  // NRE if _client is null

If GetClientForGroup throws (codespace tunnel disconnected) AND _client is null (after a failed reconnect set _client = null; IsInitialized = false), synthClient is null and the next line throws NullReferenceException, crashing the synthesis phase with an opaque error.

Fix: catch { synthClient = _client ?? throw new InvalidOperationException("No client available for synthesis session"); }


Test Coverage Assessment

The 14 new tests in WorkerDelegationToolTests.cs are comprehensive for WorkerDelegationContext behavior (concurrency, timeouts, round-robin, nudge scenarios). The 6 new tests in MultiAgentRegressionTests.cs use the source-text-as-spec pattern (read source file, assert string patterns exist) which is fragile but consistent with the existing regression test style.

Missing test coverage for new code paths:

  • Reconnect-skip path: no test verifying SendingFlag is cleared, FlushCurrentResponse called, and OnStateChanged fires on UI thread
  • AddPendingTask overwrite: the existing Context_ConcurrentAddResult_DoesNotCorrupt test doesn't cover the overwrite scenario (same worker dispatched twice → second overwrites first)
  • Stale pending task contamination after reconnect: no test for GetOrAdd reuse with surviving tasks

Suggested test cases:

  1. ReconnectSkipPath_ClearsSendingFlag — verify SendingFlag == 0 after reconnect-skip fires
  2. AddPendingTask_SameWorkerTwice_BothTasksAwaited — currently only the second task is collected
  3. AwaitAllPendingAsync_StaleTasksFromPreviousCycle_AreNotIncluded

Summary

The thundering-herd fix (_clientReconnectLock) and GetClientForGroup usage in EnsureOrchestratorToolsAsync are correct. The WorkerDelegationContext thread-safety model is sound. The core issue is that the new reconnect-skip path introduces a new IsProcessing = false path that violates 4 of the project's 12 required cleanup steps — most critically, SendingFlag is not released, which will deadlock sessions.

⚠️ Request changes — address the CRITICAL SendingFlag issue + the two threading violations before merge.


4-model review: claude-opus-4.6 ×2, gemini-3-pro-preview, gpt-5.3-codex

@PureWeen (Owner, Author) commented:

PR #318 Review — feat: Intercept SDK task tool for multi-agent orchestration

CI Status: ⚠️ No CI checks configured on this branch
Existing reviews: None
Models: claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex (5/5)
Consensus threshold: ≥2 models


🟡 MODERATE — Missing McpServers, SkillDirectories, SystemMessage in all fresh session configs (5/5 consensus)

Every SessionConfig created fresh in this PR omits MCP servers, skill directories, and system message. This affects five code paths:

Location — Affected path
CopilotService.Organization.cs — EnsureOrchestratorToolsAsync, forceCreate=true
CopilotService.Organization.cs — EnsureOrchestratorToolsAsync, resume path
CopilotService.Organization.cs — EnsureOrchestratorToolsAsync, "session not found" fallback
CopilotService.Organization.cs — SendViaOrchestratorAsync, synthesis session
CopilotService.Organization.cs — ExecuteWorkerAsync, dead-worker revival

Compare with the existing reconnect handler (CopilotService.cs:~2561) which correctly calls LoadMcpServers() and LoadSkillDirectories(). PR #330 fixed this exact gap for the standard reconnect path; this PR reintroduces it for five new paths.

The most impactful case is worker revival: a revived worker loses all external tool access — file editing, running tests, reading code — making the revival useless for any real task. The orchestrator forceCreate paths are less critical (orchestrators primarily call the task tool) but still lose custom system prompts and MCP context.

Fix: Copy the MCP/skills/system-message restoration pattern from the existing reconnect handler into each fresh SessionConfig construction.


🟡 MODERATE (disputed 2/5) — CompleteResponse called on threadpool when _syncContext is null

In OnToolDispatchEnd's background Task.Run, the fallback branch:

if (_syncContext != null)
    _syncContext.Post(_ => CompleteResponse(cur, endGen), null);
else
    CompleteResponse(cur, endGen);  // runs on threadpool

Per INV-2, all IsProcessing mutations must run on the UI thread. Two models (opus-4.6, gemini) flag this as an INV-2 violation. One model (sonnet) disagrees, arguing this is functionally identical to InvokeOnUI's null-context fallback behavior.

Additionally, the _syncContext.Post path dispatches CompleteResponse without a try-catch; all other dispatch sites use InvokeOnUI() or Invoke() which wrap the callback. An exception from CompleteResponse would propagate unhandled into the SynchronizationContext.

Suggestion: Use InvokeOnUI(() => CompleteResponse(cur, endGen)) to consolidate the dispatch logic and get the try-catch wrapping for free.
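
A sketch of the pattern the suggestion implies, assuming `InvokeOnUI` follows the null-context fallback described above (the helper name comes from the review; this body is an illustration, not the real implementation). The point is that both dispatch paths go through one guarded callback:

```csharp
using System;
using System.Threading;

void InvokeOnUI(SynchronizationContext? syncContext, Action action)
{
    void Guarded()
    {
        try { action(); }
        catch (Exception ex) { Console.Error.WriteLine($"UI callback failed: {ex.Message}"); }
    }

    if (syncContext != null)
        syncContext.Post(_ => Guarded(), null); // posted callback is still wrapped
    else
        Guarded(); // no captured context: run inline, still inside the try-catch
}

bool completed = false;
InvokeOnUI(null, () => completed = true);             // null-context fallback path
InvokeOnUI(null, () => throw new Exception("boom"));  // caught and logged, not rethrown
Console.WriteLine(completed);
```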


🟡 MODERATE (disputed 2/5) — Interlocked.Exchange(ref cur.ActiveToolCallCount, 0) races with concurrent ToolExecutionStartEvent

The polling loop unconditionally zeroes the count after ~3s:

```csharp
Interlocked.Exchange(ref cur.ActiveToolCallCount, 0); // Force-reset
_syncContext.Post(_ => CompleteResponse(cur, endGen), null);
```

Two models (opus-4.6, gemini) flag that this races with a concurrent ToolExecutionStartEvent increment for a second parallel task call, potentially leaving ActiveToolCallCount at 0 while a second tool is in-flight.

One model (sonnet) disagrees: "The SDK invokes at most one is_override callback per turn, so OnToolDispatchStart/End fire exactly once before the 3-second poll runs." The comment in the code itself acknowledges this: "The SDK fires ToolExecutionStartEvent for ALL parallel tool calls but may only invoke our callback for ONE of them."

Given the competing analysis, this warrants confirmation of the SDK behavior but may not be a bug in practice.
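
If the race turns out to be real, one defensive option is `Interlocked.CompareExchange`: only reset the counter if it still holds the value the poller observed, so a concurrent increment wins. A self-contained sketch of the difference (variable names mirror the review; this is an illustration, not the PR's code):

```csharp
using System;
using System.Threading;

int activeToolCallCount = 1;

// Poller observes the count, then a second parallel tool starts:
int observed = Volatile.Read(ref activeToolCallCount);
Interlocked.Increment(ref activeToolCallCount); // concurrent ToolExecutionStartEvent

// Interlocked.Exchange(ref activeToolCallCount, 0) would zero the counter
// here and orphan the in-flight second tool. CompareExchange only resets if
// the value is still what the poller saw, so the concurrent increment wins:
int prior = Interlocked.CompareExchange(ref activeToolCallCount, 0, observed);
Console.WriteLine($"prior={prior} count={Volatile.Read(ref activeToolCallCount)}");
```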


Test Coverage Assessment

The PR includes 452-line WorkerDelegationToolTests.cs and 14 new regression tests in MultiAgentRegressionTests.cs. Strong coverage overall. Missing test cases:

  1. MCP server propagation in revival/forceCreate paths: Verify that EnsureOrchestratorToolsAsync (forceCreate=true) and ExecuteWorkerAsync revival restore MCP servers in the new session config.
  2. _syncContext == null dispatch path: Test that CompleteResponse is invoked correctly when _syncContext is null in the OnToolDispatchEnd fallback.

Below consensus (single-model findings, for author awareness)

  • (sonnet) CopilotService.Events.cs:528 — CancelTurnEndFallback fires unconditionally before the ActiveToolCallCount guard. When the guard triggers (tools in-flight), both the SessionIdleEvent's CompleteResponse path AND the TurnEnd-fallback timer are gone — only the 3s OnToolDispatchEnd poll remains. Consider only cancelling CancelTurnEndFallback after the guard passes.
  • (opus-1) CopilotService.Organization.cs:1182 — continuation retry calls retryCtx2.Reset(prompt, workerNames, ct) with the full workerNames list instead of remaining workers; round-robin restarts at worker-0 regardless of dispatch history.
  • (opus-2) CopilotService.cs:2682 — non-volatile state.ActiveToolCallCount direct read in the reconnect path; all other reads use Volatile.Read.
  • (opus-2) CopilotService.cs:2683-2698 — orchestrator reconnect-skip path sets IsProcessing=false without calling FlushCurrentResponse(state) first (per INV-1).
  • (codex) WorkerDelegationContext.cs — _pendingTasks keyed only by workerName; dispatching the same worker twice in one iteration overwrites the first task handle, orphaning it.

Recommended Action

⚠️ Request changes

The MCP/skills regression (5/5 consensus) is the primary blocker: worker revival loses all external tool access, making the revival feature non-functional for any task requiring tools. The fix is mechanical — copy the MCP/skills restoration pattern from the existing reconnect handler into the five new SessionConfig construction sites.

Phase 1 (Critical/P0):
- Fix reconnect-skip SendingFlag deadlock (FlushCurrentResponse, ClearPermissionDenials, InvokeOnUI)
- Split watchdog tiers: active-tool=600s, used-tools-idle=180s, default=120s
- Reset HasUsedToolsThisTurn before reconnect retry

Phase 2 (Fresh Session Config):
- Extract BuildFreshSessionConfig helper with MCP/skills/system-message
- Applied to all 5 fresh SessionConfig sites + reconnect handler

Phase 3 (Moderate):
- Fix OnToolDispatchEnd threading (InvokeOnUI)
- Counter-keyed AddPendingTask prevents task overwrite
- Reset clears stale pending tasks
- Synthesis NRE guard when _client is null
- Wire CTS to ResponseCompletion TCS for real timeout

Phase 4 (Dispatch Coverage):
- Continuation prompt includes full planningPrompt + dispatch status
- Removed discouraging dispatch language
- Added duplicate-task prevention instruction

Bug fixes from multi-model review:
- Bug 1: Clean delegation dicts in reflect finally + CloseSessionCoreAsync
- Bug 2: Cancel watchdog/fallback timers before session replacement in EnsureOrchestratorToolsAsync

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Owner Author

Superseded by #339. The tool-intercept approach fought the SDK event model — reverting to enhanced text-parsing dispatch with JSON mode, retry loop, and ported reliability fixes.

@PureWeen PureWeen closed this Mar 10, 2026
PureWeen added a commit that referenced this pull request Mar 10, 2026
)

## Summary

Replaces the tool-intercept approach (PR #318) with enhanced
text-parsing dispatch. The tool-intercept approach fought the SDK's
event model — 22/33 commits were workarounds for the impedance mismatch,
and it *caused* the 'single worker dispatched' and 'identical tasks'
production bugs.

This PR keeps the text-parsing path that was working reliably, and
enhances it:

### Text-Parsing Enhancements
- **JSON mode parsing**: Orchestrator can return `[{worker, task}]` JSON
arrays, parsed with System.Text.Json. Falls back to `@worker:...@EnD`
regex on parse failure.
- **Exact match only** for worker name resolution — removed
bidirectional `Contains` fallback that caused misroutes when names are
substrings of each other.
- **Backtick/quote stripping** from worker names for robustness.
- **Differentiated task instruction**: 'Each worker MUST receive a
DIFFERENT sub-task'
- **Retry loop** (up to 3 iterations) with conversation history when not
all workers dispatched. Model remembers what it already assigned — no
fresh sessions, no amnesia.
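
A minimal sketch of the JSON-first / regex-fallback parser described above. The `[{worker, task}]` array shape and `@worker:...@EnD` terminator come from this description; the method name `ParseAssignments` and the exact regex are assumptions:

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Text.RegularExpressions;

List<(string Worker, string Task)> ParseAssignments(string text)
{
    var result = new List<(string, string)>();
    try
    {
        // JSON mode: [{"worker": "...", "task": "..."}]
        using var doc = JsonDocument.Parse(text);
        foreach (var el in doc.RootElement.EnumerateArray())
            result.Add((el.GetProperty("worker").GetString()!,
                        el.GetProperty("task").GetString()!));
    }
    catch (JsonException)
    {
        // Fall back to the @worker:...@EnD text protocol on parse failure.
        foreach (Match m in Regex.Matches(text, @"@(?<w>[\w-]+):\s*(?<t>.*?)@EnD",
                                          RegexOptions.Singleline))
            result.Add((m.Groups["w"].Value, m.Groups["t"].Value.Trim()));
    }
    return result;
}

var fromJson = ParseAssignments("[{\"worker\":\"alpha\",\"task\":\"write tests\"}]");
var fromText = ParseAssignments("@beta: fix the bug @EnD");
Console.WriteLine($"json={fromJson.Count} text={fromText.Count}");
```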

### Reliability Fixes (ported from tool-dispatch branch)
- `_clientReconnectLock`: SemaphoreSlim thundering-herd fix for
concurrent workers hitting connection errors
- **Watchdog tier split**: active-tool=600s, used-tools-idle=180s,
default=120s (cuts zombie detection from 10min to 3min)
- **BuildFreshSessionConfig** helper: MCP servers, skills, system
message in one place
- **CTS-to-TCS wiring**: 10-minute timeout in SendPromptAndWaitAsync
actually cancels the TCS
- **HasUsedToolsThisTurn reset** before reconnect retry
- **Worker revival**: detect empty response → fresh session → retry once
(~20 lines)
- **Volatile.Read** cleanup for ActiveToolCallCount
- **Corrupt/locked session** restore fallback

### Tests
- 6 new JSON parsing tests
- Updated fuzzy match tests for exact-match-only behavior
- Updated structural regression guards for BuildFreshSessionConfig

### Why Not Tool-Intercept?
Multi-model review (Opus 4.6, Sonnet 4, Gemini 3 Pro, GPT-5.2)
unanimously recommended against the tool-intercept approach. The core
issue: `is_override` on a custom task AIFunction returns a fake
placeholder while running real work in background Task.Run. The SDK
doesn't handle this — events don't fire correctly, sessions get stuck,
and fresh sessions lose conversation history.

Closes #229
Supersedes #318

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Owner Author

PR #318 Review: "feat: Intercept SDK task tool for multi-agent orchestration"

CI Status: ⚠️ No CI checks configured

5 models reviewed (claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex). 4 consensus findings.


🔴 CRITICAL — [[GROUP_REFLECT_COMPLETE]] checked on wrong string (5/5 models)

CopilotService.Organization.cs — SendViaOrchestratorReflectAsync

The reflect loop termination check is performed on the concatenated WORKER responses, not on the orchestrator's synthesizing response. [[GROUP_REFLECT_COMPLETE]] is the signal the orchestrator sends when it decides the multi-agent loop is done — but the code checks the string that workers returned, not the orchestrator's reply. As a result, the reflect loop never self-terminates. It will run until the max iteration cap regardless of whether the orchestrator signals completion. This causes unbounded cost/latency on every multi-agent reflect cycle.
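
The fix reduces to checking the marker on the orchestrator's reply only. A minimal sketch (the marker string is from the finding; the helper name and surrounding shape are assumptions):

```csharp
using System;

const string CompletionMarker = "[[GROUP_REFLECT_COMPLETE]]";

// Termination must be decided from the orchestrator's own reply; the
// concatenated worker output is deliberately not consulted.
bool ShouldTerminate(string orchestratorReply) =>
    orchestratorReply.Contains(CompletionMarker, StringComparison.Ordinal);

string orchestratorReply = "Iterate again: worker-2 has not finished.";
string finalReply = "All tasks verified. [[GROUP_REFLECT_COMPLETE]]";
Console.WriteLine($"{ShouldTerminate(orchestratorReply)} {ShouldTerminate(finalReply)}");
```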


🔴 CRITICAL — OnToolDispatchEnd poll fires on wrong session after reconnect (4/5 models)

CopilotService.Organization.cs — EnsureOrchestratorToolsAsync / OnToolDispatchEnd

The poll callback captures endGen from the session's ProcessingGeneration at dispatch time. After a reconnect, the new session object starts at generation 0. If endGen was also 0 (first prompt after init), the endGen == state.Info.ProcessingGeneration guard is satisfied on the wrong session. CompleteResponse fires on the new/wrong session, incorrectly ending a prompt that hasn't finished.
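
One way to make the guard robust is to capture the session object alongside its generation, so a stale poll can never match a replacement session that also starts at generation 0. A self-contained sketch of the two guard variants (names follow the finding; the session objects here are placeholders):

```csharp
using System;

var oldSession = new object();   // session the poll was armed against
var newSession = new object();   // replacement session after reconnect
int endGen = 0;                  // generation captured at dispatch time
int newGeneration = 0;           // fresh session also starts at 0

object capturedSession = oldSession;

// Generation-only guard: satisfied on the wrong session (the bug).
bool generationOnlyGuard = endGen == newGeneration;

// Identity + generation guard: the stale poll is discarded (the fix).
bool identityPlusGeneration = ReferenceEquals(capturedSession, newSession)
                              && endGen == newGeneration;
Console.WriteLine($"genOnly={generationOnlyGuard} withIdentity={identityPlusGeneration}");
```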


🟡 MODERATE — FullReset() creates zombie tasks (4/5 models)

WorkerDelegationTool.cs — FullReset()

FullReset() clears _pendingTasks without first calling ObserveAllPending(). Any in-flight worker Task objects are abandoned with no observer. When they eventually complete or fault, they inject their (now-stale) results into the next orchestration cycle's task dictionary via the shared completion callback. This causes stale results from a previous cycle to corrupt the next cycle.

Fix: Call ObserveAllPending() (or equivalent drain) before Reset().
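
A sketch of that drain-before-clear ordering, assuming `_pendingTasks` is a name-to-`Task<string>` map as described above (the observer body is an illustration; the real `ObserveAllPending` may route results elsewhere):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

var pendingTasks = new Dictionary<string, Task<string>>();

void ObserveAllPending()
{
    // Attach an observer to every in-flight task so late completions (or
    // faults) are consumed here instead of leaking into the next cycle.
    foreach (var t in pendingTasks.Values)
        _ = t.ContinueWith(done => { _ = done.Exception; }, TaskScheduler.Default);
}

void FullReset()
{
    ObserveAllPending(); // drain before clearing, per the fix above
    pendingTasks.Clear();
}

pendingTasks["worker-1"] = Task.FromResult("stale result from a previous cycle");
FullReset();
Console.WriteLine(pendingTasks.Count); // next cycle starts with an empty map
```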


🟡 MODERATE — AddPendingTask duplicate key overwrites first task (3/5 models)

WorkerDelegationTool.cs — AddPendingTask()

Worker name is used as the dictionary key. In a reflect cycle where the same worker is dispatched more than once across iterations, the second AddPendingTask call for that worker silently overwrites the first task. The first worker's result is lost; the orchestrator never sees it. Combined with the FullReset() zombie issue, this creates unpredictable result sets in multi-iteration reflect cycles.
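
A counter-keyed variant avoids the overwrite: key by (worker, sequence) so re-dispatching the same worker produces a distinct entry. The key format and sequence counter below are assumptions for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

var pendingTasks = new Dictionary<string, Task<string>>();
int dispatchSeq = 0;

// Key by worker name plus a monotonically increasing sequence, so a second
// dispatch of the same worker never overwrites the first task handle.
string AddPendingTask(string workerName, Task<string> task)
{
    string key = $"{workerName}#{Interlocked.Increment(ref dispatchSeq)}";
    pendingTasks[key] = task;
    return key;
}

var k1 = AddPendingTask("alpha", Task.FromResult("first result"));
var k2 = AddPendingTask("alpha", Task.FromResult("second result"));
Console.WriteLine($"count={pendingTasks.Count} keys={k1},{k2}");
```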


Test Coverage

No tests were added for the new WorkerDelegationTool, WorkerDelegationContext, or the orchestration interception path. Given the severity of the generation-guard race and the FullReset zombie issue, these are critical paths that need unit test coverage. Suggested:

  • FullReset_DoesNotLeakPendingTaskResults
  • AddPendingTask_DuplicateWorkerName_ThrowsOrDeduplicates
  • OnToolDispatchEnd_DoesNotFireOnWrongGeneration

⚠️ Request changes — the [[GROUP_REFLECT_COMPLETE]] string check and OnToolDispatchEnd generation guard are both logic bugs that affect every multi-agent reflect cycle.

arisng pushed a commit to arisng/PolyPilot that referenced this pull request Apr 4, 2026
…#229) (PureWeen#339)

Development

Successfully merging this pull request may close these issues.

Rework multi-agent orchestrator: intercept SDK task tool to surface sub-agents as PolyPilot sessions
