feat: Intercept SDK task tool for multi-agent orchestration (#229) #318
- Add WorkerDelegationContext and WorkerDelegationTool to intercept the model's built-in task sub-agent calls in OrchestratorReflect mode
- Resume the orchestrator session with ExcludedTools + a custom delegation AIFunction that routes calls to PolyPilot workers
- Add BuildOrchestratorReflectToolPrompt for tool-based planning prompts
- Add EnsureOrchestratorReflectToolsAsync to configure orchestrators
- Detect tool-dispatch results and skip the ParseTaskAssignments fallback
- Add WorkerDelegationToolTests covering context and tool behavior

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When an orchestrator session reconnects, SendPromptAsync resumes with only ShowImageTool -- no ExcludedTools or custom delegation function. Previously _reflectToolConfigured stayed set, so EnsureOrchestratorReflectToolsAsync would skip re-configuration, leaving the orchestrator without the delegation tool after reconnect.

Fix: TryRemove the session from _reflectToolConfigured after the reconnect so EnsureOrchestratorReflectToolsAsync re-runs on the next reflect iteration.

Adds regression test ReconnectState_ShouldClearReflectToolConfigured.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Bug 1 - Thread safety: WorkerDelegationContext._results was a plain List<T> that the SDK could mutate concurrently via parallel tool calls. Added _resultsLock; AddResult and Reset lock on it; DispatchedResults returns a ToList() snapshot so callers see a consistent view.

Bug 2 - Wrong nudge format: when usingToolDispatch=true but the model skips the task tool, code fell through to ParseTaskAssignments and sent a @worker: text-format nudge. Added BuildToolDispatchNudgePrompt and handle the no-tool-call case inside the tool-dispatch if-block.

Bug 3 - False GoalMet: at iteration > 1 with usingToolDispatch=true and no tool calls, the text path declared GoalMet with no workers dispatched. The new tool-dispatch branch handles this by nudging the model up to 3 times before stalling, never declaring GoalMet without actual dispatches.

Bug 4 - Stale usingToolDispatch: computed once before the loop; a mid-loop reconnect cleared _reflectToolConfigured but usingToolDispatch stayed true. Now re-checked at the top of each iteration: if the flag was cleared, EnsureOrchestratorReflectToolsAsync re-registers the tool on the new session handle and refreshes usingToolDispatch.

Added 5 regression tests: Context_ConcurrentAddResult_DoesNotCorrupt, Context_DispatchedResults_ReturnsConcurrentSnapshot, WorkerDelegationContext_ResultsList_IsThreadSafe, ToolDispatch_ShouldNudgeWhenModelSkipsTaskTool, ToolDispatch_ShouldRefreshUsingToolDispatchEachIteration.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
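The Bug 1 fix can be sketched as a lock-plus-snapshot pattern. This is a minimal illustration, assuming the member names described above (`_results`, `_resultsLock`, `DispatchedResults`) and a hypothetical `ToolDispatchedResult` shape; the real class carries more state.

```csharp
using System.Collections.Generic;
using System.Linq;

public sealed record ToolDispatchedResult(string Worker, bool Success);

public sealed class WorkerDelegationContext
{
    private readonly List<ToolDispatchedResult> _results = new();
    private readonly object _resultsLock = new();

    public void AddResult(ToolDispatchedResult result)
    {
        lock (_resultsLock) { _results.Add(result); }
    }

    public void Reset()
    {
        lock (_resultsLock) { _results.Clear(); }
    }

    // Returns a snapshot (ToList) so callers iterate a consistent view
    // even while parallel tool calls keep appending to _results.
    public IReadOnlyList<ToolDispatchedResult> DispatchedResults
    {
        get { lock (_resultsLock) { return _results.ToList(); } }
    }
}
```

The snapshot matters more than the lock itself: returning the live list under a lock would still let callers enumerate it unsynchronized after the getter returns.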
… codespace support (#229)

Now that PR #308 (codespace support) is merged, an OrchestratorReflect group can have its sessions running in a codespace via a tunnel client stored in _codespaceClients[groupId] — not in the local _client field. EnsureOrchestratorReflectToolsAsync was calling _client.ResumeSessionAsync directly, which for a codespace orchestrator would use the wrong (local) client, causing ResumeSessionAsync to fail and silently fall back to @worker: text dispatch — negating the entire #229 feature for codespace groups.

Fix: resolve the correct CopilotClient via GetClientForGroup(orchestratorGroupId), matching the pattern used by CreateSessionAsync, SendPromptAsync, and all other session operations throughout CopilotService.

Regression test added: EnsureOrchestratorReflectTools_UsesGetClientForGroup verifies the source uses GetClientForGroup and not _client.ResumeSessionAsync.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…estratorToolsAsync (#229)

- EnsureOrchestratorReflectToolsAsync renamed to EnsureOrchestratorToolsAsync (not reflect-specific)
- _reflectToolConfigured renamed to _toolDispatchConfigured throughout
- SendViaOrchestratorAsync now calls EnsureOrchestratorToolsAsync and handles tool-dispatch results, letting regular Orchestrator groups use task tool calls instead of @worker: text blocks
- BuildOrchestratorPlanningPrompt gains a useToolDispatch param: emits task-tool instructions instead of @worker:/@EnD blocks when tool-dispatch is active
- All regression tests updated to reflect new names and expanded test window

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The SDK validates that external tools with names matching built-in tools must explicitly declare is_override=true in AdditionalProperties. Use the AIFunctionFactoryOptions overload of AIFunctionFactory.Create to set Name, Description, and AdditionalProperties together. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
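Registering such an override might look like the following sketch, using the `AIFunctionFactory.Create(Delegate, AIFunctionFactoryOptions)` overload from Microsoft.Extensions.AI. The delegate body and the exact shape of the SDK's `is_override` check are assumptions based on the description above, not confirmed API behavior.

```csharp
using System.Collections.Generic;
using Microsoft.Extensions.AI;

// A custom "task" tool that shadows the SDK's built-in task tool.
AIFunction delegation = AIFunctionFactory.Create(
    (string worker, string task) => $"dispatched '{task}' to {worker}", // placeholder body
    new AIFunctionFactoryOptions
    {
        Name = "task", // same name as the built-in tool being overridden
        Description = "Delegate a sub-task to a named worker agent.",
        AdditionalProperties = new Dictionary<string, object?>
        {
            ["is_override"] = true // required when the name collides with a built-in
        }
    });
```

Using the options overload keeps Name, Description, and AdditionalProperties in one call instead of mutating the function after creation.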
When a reflect loop is cancelled (e.g., user sends a new message) while a worker is still running its task tool invocation, the new loop's dispatch immediately fails with 'Session is already processing a request'. ExecuteWorkerAsync now checks if the target worker is still processing and waits on its ResponseCompletion TCS (up to 10 min) before dispatching the new task. This mirrors the busy-wait pattern used elsewhere and prevents the 'elapsed=0.0s FAILED' pattern seen in production logs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
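The busy-wait described above can be sketched with `Task.WhenAny` against the worker's completion source. `workerState`, `IsProcessing`, and `ResponseCompletion` are assumed names taken from the commit message.

```csharp
// Before dispatching a new task, wait (up to 10 min) for the worker's
// previous turn to finish instead of failing with
// "Session is already processing a request".
if (workerState.IsProcessing)
{
    var completed = await Task.WhenAny(
        workerState.ResponseCompletion.Task,
        Task.Delay(TimeSpan.FromMinutes(10), ct));

    if (completed != workerState.ResponseCompletion.Task)
        throw new TimeoutException(
            $"Worker '{workerName}' still busy after 10 minutes.");
}
// Previous turn completed; safe to dispatch the new task now.
```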
…tool dispatch (#229) When the orchestrator calls our custom task tool, the SDK may fire SessionIdleEvent before the long-running worker callback completes. Fix: increment ActiveToolCallCount on tool dispatch start, decrement on completion (in finally block). The existing guard in CompleteResponse (line 476) already skips completion when ActiveToolCallCount > 0. Also removes ExcludedTools=[task] from ResumeSessionConfig since is_override=true on the custom tool handles the replacement, and updates stale log messages. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…stom tool

The SDK (0.1.32) fires ToolExecutionStartEvent for is_override custom tools but never fires ToolExecutionCompleteEvent or SessionIdleEvent after the callback returns. This leaves ActiveToolCallCount permanently elevated and the orchestrator stuck until the 600s watchdog fires.

Two-part fix:
1. SessionIdleEvent handler: guard against premature completion when ActiveToolCallCount > 0 (tools still in-flight)
2. OnToolDispatchEnd: after an 8s grace period, if the SDK hasn't produced events, manually reset ActiveToolCallCount and trigger CompleteResponse

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…losure EnsureOrchestratorToolsAsync replaces the SessionState during ResumeSessionAsync. The OnToolDispatchStart/End closures captured the old state reference, so the manual CompleteResponse workaround was operating on a stale object — its ProcessingGeneration was 0 and its TCS had no awaiters. Fix: look up the current state from _sessions dictionary at callback time. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Cache the delegation AIFunction in _delegationFunctions dictionary so the reconnect handler in SendPromptAsync can include it in the ResumeSessionConfig.Tools list. This prevents iteration 2+ from stalling because the reconnected session only had ShowImageTool. Also handles the fresh-session fallback path (session expired). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ning concurrent tools The unconditional Exchange(ref ActiveToolCallCount, 0) could corrupt tool accounting if a new tool started during the 8s grace period. Now uses CompareExchange to only reset when exactly 1 tool remains, and skips manual CompleteResponse if other tools are still in-flight. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
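The guarded reset described above is a standard `Interlocked.CompareExchange` pattern. A minimal sketch, assuming `state.ActiveToolCallCount` is an `int` field and `CompleteResponse`/`generation` exist as described:

```csharp
// Only zero the counter if exactly one tool remains in flight.
// CompareExchange returns the value observed before the swap.
int observed = Interlocked.CompareExchange(
    ref state.ActiveToolCallCount, 0, 1);

if (observed == 1)
{
    // We were the last in-flight tool; safe to complete the response.
    CompleteResponse(state, generation);
}
// observed != 1: another tool started during the grace period, so the
// swap did not happen. Skip the manual CompleteResponse and let that
// tool's own completion path finish the turn.
```

Unlike the earlier unconditional `Exchange(ref ..., 0)`, this cannot erase the count of a tool that started concurrently.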
…1600ms) The SDK either fires events within milliseconds or not at all. Polling with early bail-out reduces worst-case latency from 8s to ~3s while still giving the SDK a fair chance at each interval. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
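The polling replacement for the fixed 8s grace period might look like this sketch; the interval sequence (200/400/800/1600 ms, ~3s worst case) follows the commit title, and `sdkCompleted` stands in for whatever check detects that the SDK's own events fired.

```csharp
// Poll at increasing intervals and bail out as soon as the SDK has
// produced its own completion events, instead of always waiting 8s.
static async Task<bool> WaitForSdkEventsAsync(
    Func<bool> sdkCompleted, CancellationToken ct)
{
    int[] delaysMs = { 200, 400, 800, 1600 }; // sums to ~3s worst case
    foreach (int delay in delaysMs)
    {
        await Task.Delay(delay, ct);
        if (sdkCompleted())
            return true; // SDK fired events; no manual reset needed
    }
    return false; // fall back to the manual CompleteResponse path
}
```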
…aching ResumeSessionAsync on a session stuck in 'tool executing' state produces a dead session (no events). Instead of including the delegation tool in the reconnect config (which doesn't help), always clear the flag and add a post-send re-ensure check: after SendPromptAndWaitAsync returns, if tools were lost (reconnect cleared the flag), immediately recreate a clean session via EnsureOrchestratorToolsAsync and retry the prompt. This avoids the 600s watchdog wait on each reconnect. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ResumeSessionAsync on a session stuck in 'tool executing' state produces a dead session that never fires events. For orchestrator sessions with tool dispatch, skip the retry entirely and complete with empty response. The reflection loop's post-send re-ensure detects tools were lost, creates a clean session via EnsureOrchestratorToolsAsync, and retries immediately — no 600s watchdog wait. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…sh session fallback When ResumeSessionAsync throws 'Session not found' (expired server-side), create a fresh session via CreateSessionAsync with the delegation tool. Also moved _delegationFunctions storage to after successful session creation and clean up on failure to prevent stale entries that cause the reconnect handler to incorrectly skip retry. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The SDK fires ToolExecutionStartEvent for ALL parallel tool calls but only invokes our is_override callback for ONE of them. The un-invoked calls leave phantom ActiveToolCallCount entries that prevent manual CompleteResponse from firing. Force-reset to 0 since our OnToolDispatchStart/End callbacks are the real source of truth. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ator mode - Add post-send re-ensure after planning prompt (same as OrchestratorReflect) - Clear delegation state before synthesis prompt so reconnect retries normally instead of skip-retry (synthesis doesn't use the task tool) - Fixes dead-session hang during PR Review Squad synthesis phase Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After manual CompleteResponse for is_override tool dispatch, the server-side session is stuck in 'tool executing' state. ResumeSessionAsync on it creates a dead session. Instead, create a completely fresh session via CreateSessionAsync for the synthesis phase (which doesn't need the delegation tool). Also adds post-send re-ensure for the planning prompt in Orchestrator mode, matching the pattern already used in OrchestratorReflect. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…allel The SDK (0.1.32) fires ToolExecutionStartEvent for all parallel is_override tool calls but only invokes the callback for a subset. This causes phantom dispatches where some workers never execute. Instructing the model to call task one-at-a-time works around this SDK limitation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The SDK only executes one is_override tool callback per turn, then our manual CompleteResponse ends the turn. For multi-worker scenarios (e.g., review 3 PRs), Orchestrator mode now loops: each iteration dispatches one worker via a fresh session with continuation prompt, accumulating results until all workers are dispatched. Then a final synthesis prompt combines all results. Capped at 10 iterations for safety. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…urns After manual CompleteResponse, the server-side session is stuck in 'tool executing' state. ResumeSessionAsync creates a dead session. Added forceCreate parameter to EnsureOrchestratorToolsAsync that uses CreateSessionAsync instead, giving each continuation turn a clean session. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-robin Reset() with all workers + round-robin always picked worker-1. Now passes only un-dispatched workers so each continuation iteration dispatches to a different worker. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Check ActiveToolCallCount before completing on reconnect. If > 0, a tool callback is still executing (worker is running). Let it finish naturally — OnToolDispatchEnd will trigger CompleteResponse. Prevents premature empty response when worker takes longer than the SDK reconnect interval. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The continuation prompt told the model 'workers still need tasks' which caused it to dispatch ALL available workers even when the user's request was fulfilled. Now explicitly tells the model to stop dispatching when all requested work is done, and that it does NOT need to use all available workers. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When RestorePreviousSessionsAsync encounters a 'session file corrupted' error (e.g., events.jsonl locked by another copilot process), fall back to CreateSessionAsync instead of silently dropping the session. This prevents multi-agent orchestrator sessions from vanishing after restart. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a session's events.jsonl is locked by another copilot process (e.g., Copilot CLI terminal), the SDK reports 'session file corrupted'. Updated the UI error message to explain the actual cause (CLI lock) and suggest closing the CLI session first. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Changed the dispatch loop from sequential (await each worker before dispatching the next) to parallel (fire-and-forget each worker, await all at the end). The SDK's one-callback-per-turn limitation means we still loop to dispatch (~3s per worker), but workers now run in parallel instead of serially.

Before: dispatch W1 (wait 43s) → W2 (wait 199s) → W3 (wait 531s) = 773s
After: dispatch all 3 (~9s) → await longest (531s) = ~540s

Key changes:
- WorkerDelegationTool.InvokeAsync is now non-blocking: starts the worker in the background via Task.Run, returns 'dispatched' immediately
- WorkerDelegationContext gains a PendingTasks dict and AwaitAllPendingAsync()
- FullReset() clears pending tasks; Reset() preserves them across iterations
- Dispatch loop collects dispatched names (not results), sends lightweight continuation prompts, then awaits all pending tasks after the loop completes
- Updated prompts to explain parallel execution to the model

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
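The non-blocking dispatch plus late await can be sketched as below. `PendingTasks` and `RunWorkerAsync` are assumed names from the commit message; the real tool callback takes SDK-specific arguments.

```csharp
using System.Collections.Generic;
using System.Linq;

public Dictionary<string, Task<string>> PendingTasks { get; } = new();

// Tool callback: start the long-running worker in the background and
// return immediately so the orchestrator turn can end.
public Task<string> InvokeAsync(string worker, string task, CancellationToken ct)
{
    PendingTasks[worker] = Task.Run(() => RunWorkerAsync(worker, task, ct), ct);
    return Task.FromResult($"dispatched: {worker}");
}

// After the dispatch loop has fired off every worker, await them all.
public async Task<IReadOnlyList<string>> AwaitAllPendingAsync()
{
    string[] results = await Task.WhenAll(PendingTasks.Values);
    PendingTasks.Clear();
    return results;
}
```

This is why the totals above change: dispatch cost stays serial (~3s each), but the waiting collapses to the longest worker instead of the sum.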
- Fix OrchestratorReflect round-robin: add ResetResults() that preserves the round-robin index across iterations so workers alternate correctly (worker-1 → worker-2 → worker-1) instead of always picking index 0
- Fix critical processing-state cleanup violation: the reconnect skip-retry path now clears all 9 companion fields matching the CompleteResponse pattern (was only clearing IsProcessing + watchdog, leaving stale API time, tool counts, phase, and resumed flag)
- Fix placeholder Success=true: add a Dispatched flag to ToolDispatchedResult so placeholders are semantically distinct from completed results
- Add ObserveAllPending() for safe cleanup of unobserved task exceptions when the dispatch loop crashes before AwaitAllPendingAsync()
- Replace Interlocked.CompareExchange read-only checks with Volatile.Read for ActiveToolCallCount (idiomatic and clearer intent)
- Normalize MSBuild path separators (backslash → forward slash)
- Improve session error message to cover both locked and corrupted cases
- Add 3 new tests: AwaitAllPendingAsync with throwing task, ResetResults preserves round-robin, ObserveAllPending suppresses exceptions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When the model responds with text (planning) instead of calling the task tool on the first turn, the Orchestrator mode dispatch branch was skipped entirely — falling through to text-parsing and never dispatching workers in parallel.

Fix: mirror the OrchestratorReflect nudge pattern (lines 2604-2630). If DispatchedResults is empty after the first turn, send a BuildToolDispatchNudgePrompt on a fresh session before falling through to text-parsing. This gives the model a second chance to use the task tool.

Also adds 7 new unit tests covering:
- ResetResults/Reset preserve pending tasks across iterations
- FullReset clears pending tasks
- PendingTasks accumulate across context resets
- Multiple dispatches round-robin workers
- Nudge scenario (empty results then dispatch)
- ToolDispatchedResult.Dispatched flag defaults

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ndingAsync

Two fixes for zombie worker sessions after SDK reconnect:

1. Worker revival: when SendPromptAndWaitAsync returns an empty response (dead session — SDK lost the event stream after reconnect), automatically send a 'please continue' steering message. The server-side session may still be alive; a new SendAsync can kick it back into action.
2. Per-worker timeout on AwaitAllPendingAsync: if a worker task doesn't complete within 5 minutes (configurable), return a timeout error instead of blocking the entire dispatch forever. Previously, a single dead worker with a stuck watchdog (600s tool timeout) blocked synthesis for 10+ min.

Root cause analysis: workers that reconnect during SDK event streaming get a new session object, but the server-side session sometimes goes silent after 1-2 turns. The watchdog eventually fires but uses the 600s tool timeout (isMultiAgentSession=true triggers Case B at 600s instead of 120s). This is a pre-existing SDK issue, not caused by our changes.

Also adds test: Context_AwaitAllPendingAsync_TimesOutDeadWorker

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
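The per-worker timeout variant of `AwaitAllPendingAsync` can be sketched by racing each pending task against a `Task.Delay`. Names and the dictionary shape are assumptions carried over from the earlier parallel-dispatch description.

```csharp
using System.Collections.Generic;

// Await every dispatched worker, but cap each wait so one dead worker
// cannot block synthesis indefinitely.
public async Task<Dictionary<string, string>> AwaitAllPendingAsync(
    TimeSpan perWorkerTimeout)
{
    var results = new Dictionary<string, string>();
    foreach (var (worker, task) in PendingTasks)
    {
        var winner = await Task.WhenAny(task, Task.Delay(perWorkerTimeout));
        results[worker] = winner == task
            ? await task // completed (or faulted: exception surfaces here)
            : $"[timeout] {worker} did not complete within {perWorkerTimeout}";
    }
    PendingTasks.Clear();
    return results;
}
```

Note the timed-out task keeps running in the background; clearing it from `PendingTasks` is why a separate `ObserveAllPending()`-style sweep is needed to avoid unobserved exceptions.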
When the initial ResumeSession produces a dead session, the nudge code
creates a fresh session via CreateSessionAsync. The fresh session has
no conversation history, so sending only the nudge text ('use the task
tool') left the model without context about being an orchestrator.
Fix: re-send the full planningPrompt (with appended tool-use instruction)
on the fresh session so the model has all the context it needs.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two fixes for zombie worker sessions:

1. Serialize client reconnect (thundering herd fix): when multiple workers hit IsConnectionError concurrently, each one would dispose and recreate _client, destroying the previous worker's SSE stream. Only the last worker to reconnect survived. Fix: add _clientReconnectLock (SemaphoreSlim) to serialize the path. The first worker recreates _client; subsequent workers reuse it.
2. Fresh-session revival: when a worker returns empty (dead SSE stream), instead of sending a steering message to the same dead session, create a brand new session via CreateSessionAsync and replay the full worker prompt. The old session is disposed, the new SessionId is stored in Info, and _sessions[name] is replaced with fresh state.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
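The thundering-herd fix is a serialize-plus-double-check pattern. A minimal sketch, assuming hypothetical `_clientGeneration` bookkeeping (the real code may detect "someone else already reconnected" differently):

```csharp
private readonly SemaphoreSlim _clientReconnectLock = new(1, 1);
private int _clientGeneration; // bumped on every successful reconnect

private async Task ReconnectClientAsync(int observedGeneration, CancellationToken ct)
{
    await _clientReconnectLock.WaitAsync(ct);
    try
    {
        // Double-check: if another worker already recreated the client
        // while we waited, reuse it instead of destroying its SSE stream.
        if (_clientGeneration != observedGeneration)
            return;

        _client?.Dispose();
        _client = await CreateCopilotClientAsync(ct); // assumed factory
        _clientGeneration++;
    }
    finally
    {
        _clientReconnectLock.Release();
    }
}
```

Each caller captures `_clientGeneration` when it first observes the connection error, so only the first worker through the lock pays the reconnect cost.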
PR #318 — 5-Model Consensus Review — Title: feat: Intercept SDK task tool for multi-agent orchestration — 🔴 CRITICAL — Reconnect-skip path leaves
PR #318 Review — feat: Intercept SDK task tool for multi-agent orchestration — CI Status: 🟡 MODERATE — Missing
| Location | Affected path |
|---|---|
| CopilotService.Organization.cs | EnsureOrchestratorToolsAsync — forceCreate=true |
| CopilotService.Organization.cs | EnsureOrchestratorToolsAsync — resume path |
| CopilotService.Organization.cs | EnsureOrchestratorToolsAsync — "session not found" fallback |
| CopilotService.Organization.cs | SendViaOrchestratorAsync — synthesis session |
| CopilotService.Organization.cs | ExecuteWorkerAsync — dead-worker revival |
Compare with the existing reconnect handler (CopilotService.cs:~2561) which correctly calls LoadMcpServers() and LoadSkillDirectories(). PR #330 fixed this exact gap for the standard reconnect path; this PR reintroduces it for five new paths.
The most impactful case is worker revival: a revived worker loses all external tool access — file editing, running tests, reading code — making the revival useless for any real task. The orchestrator forceCreate paths are less critical (orchestrators primarily call the task tool) but still lose custom system prompts and MCP context.
Fix: Copy the MCP/skills/system-message restoration pattern from the existing reconnect handler into each fresh SessionConfig construction.
🟡 MODERATE (disputed 2/5) — CompleteResponse called on threadpool when _syncContext is null
In OnToolDispatchEnd's background Task.Run, the fallback branch:

```csharp
if (_syncContext != null)
    _syncContext.Post(_ => CompleteResponse(cur, endGen), null);
else
    CompleteResponse(cur, endGen); // runs on threadpool
```

Per INV-2, all IsProcessing mutations must run on the UI thread. Two models (opus-4.6, gemini) flag this as an INV-2 violation. One model (sonnet) disagrees, arguing this is functionally identical to InvokeOnUI's null-context fallback behavior.
Additionally, the _syncContext.Post path dispatches CompleteResponse without a try-catch; all other dispatch sites use InvokeOnUI() or Invoke() which wrap the callback. An exception from CompleteResponse would propagate unhandled into the SynchronizationContext.
Suggestion: Use InvokeOnUI(() => CompleteResponse(cur, endGen)) to consolidate the dispatch logic and get the try-catch wrapping for free.
🟡 MODERATE (disputed 2/5) — Interlocked.Exchange(ref cur.ActiveToolCallCount, 0) races with concurrent ToolExecutionStartEvent
The polling loop unconditionally zeroes the count after ~3s:

```csharp
Interlocked.Exchange(ref cur.ActiveToolCallCount, 0); // Force-reset
_syncContext.Post(_ => CompleteResponse(cur, endGen), null);
```

Two models (opus-4.6, gemini) flag that this races with a concurrent ToolExecutionStartEvent increment for a second parallel task call, potentially leaving ActiveToolCallCount at 0 while a second tool is in-flight.
One model (sonnet) disagrees: "The SDK invokes at most one is_override callback per turn, so OnToolDispatchStart/End fire exactly once before the 3-second poll runs." The comment in the code itself acknowledges this: "The SDK fires ToolExecutionStartEvent for ALL parallel tool calls but may only invoke our callback for ONE of them."
Given the competing analysis, this warrants confirmation of the SDK behavior but may not be a bug in practice.
Test Coverage Assessment
The PR includes 452-line WorkerDelegationToolTests.cs and 14 new regression tests in MultiAgentRegressionTests.cs. Strong coverage overall. Missing test cases:
- MCP server propagation in revival/forceCreate paths: verify that `EnsureOrchestratorToolsAsync(forceCreate=true)` and `ExecuteWorkerAsync` revival restore MCP servers in the new session config.
- `_syncContext == null` dispatch path: test that `CompleteResponse` is invoked correctly when `_syncContext` is null in the `OnToolDispatchEnd` fallback.
Below consensus (single-model findings, for author awareness)
- (sonnet) `CopilotService.Events.cs:528` — `CancelTurnEndFallback` fires unconditionally before the `ActiveToolCallCount` guard. When the guard triggers (tools in-flight), both the `SessionIdleEvent`'s `CompleteResponse` path AND the TurnEnd-fallback timer are gone — only the 3s `OnToolDispatchEnd` poll remains. Consider only cancelling `CancelTurnEndFallback` after the guard passes.
- (opus-1) `CopilotService.Organization.cs:1182` — the continuation retry calls `retryCtx2.Reset(prompt, workerNames, ct)` with the full `workerNames` list instead of the remaining workers; round-robin restarts at worker-0 regardless of dispatch history.
- (opus-2) `CopilotService.cs:2682` — non-volatile `state.ActiveToolCallCount` direct read in the reconnect path; all other reads use `Volatile.Read`.
- (opus-2) `CopilotService.cs:2683-2698` — the orchestrator reconnect-skip path sets `IsProcessing=false` without calling `FlushCurrentResponse(state)` first (per INV-1).
- (codex) `WorkerDelegationContext.cs` — `_pendingTasks` is keyed only by `workerName`; dispatching the same worker twice in one iteration overwrites the first task handle, orphaning it.
Recommended Action
The MCP/skills regression (5/5 consensus) is the primary blocker: worker revival loses all external tool access, making the revival feature non-functional for any task requiring tools. The fix is mechanical — copy the MCP/skills restoration pattern from the existing reconnect handler into the five new SessionConfig construction sites.
Phase 1 (Critical/P0):
- Fix reconnect-skip SendingFlag deadlock (FlushCurrentResponse, ClearPermissionDenials, InvokeOnUI)
- Split watchdog tiers: active-tool=600s, used-tools-idle=180s, default=120s
- Reset HasUsedToolsThisTurn before reconnect retry

Phase 2 (Fresh Session Config):
- Extract BuildFreshSessionConfig helper with MCP/skills/system-message
- Applied to all 5 fresh SessionConfig sites + reconnect handler

Phase 3 (Moderate):
- Fix OnToolDispatchEnd threading (InvokeOnUI)
- Counter-keyed AddPendingTask prevents task overwrite
- Reset clears stale pending tasks
- Synthesis NRE guard when _client is null
- Wire CTS to ResponseCompletion TCS for real timeout

Phase 4 (Dispatch Coverage):
- Continuation prompt includes full planningPrompt + dispatch status
- Removed discouraging dispatch language
- Added duplicate-task prevention instruction

Bug fixes from multi-model review:
- Bug 1: Clean delegation dicts in reflect finally + CloseSessionCoreAsync
- Bug 2: Cancel watchdog/fallback timers before session replacement in EnsureOrchestratorToolsAsync

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Superseded by #339. The tool-intercept approach fought the SDK event model — reverting to enhanced text-parsing dispatch with JSON mode, retry loop, and ported reliability fixes.
## Summary

Replaces the tool-intercept approach (PR #318) with enhanced text-parsing dispatch. The tool-intercept approach fought the SDK's event model — 22/33 commits were workarounds for the impedance mismatch, and it *caused* the 'single worker dispatched' and 'identical tasks' production bugs. This PR keeps the text-parsing path that was working reliably, and enhances it:

### Text-Parsing Enhancements
- **JSON mode parsing**: Orchestrator can return `[{worker, task}]` JSON arrays, parsed with System.Text.Json. Falls back to `@worker:...@EnD` regex on parse failure.
- **Exact match only** for worker name resolution — removed the bidirectional `Contains` fallback that caused misroutes when names are substrings of each other.
- **Backtick/quote stripping** from worker names for robustness.
- **Differentiated task instruction**: 'Each worker MUST receive a DIFFERENT sub-task'
- **Retry loop** (up to 3 iterations) with conversation history when not all workers dispatched. The model remembers what it already assigned — no fresh sessions, no amnesia.

### Reliability Fixes (ported from tool-dispatch branch)
- `_clientReconnectLock`: SemaphoreSlim thundering-herd fix for concurrent workers hitting connection errors
- **Watchdog tier split**: active-tool=600s, used-tools-idle=180s, default=120s (cuts zombie detection from 10min to 3min)
- **BuildFreshSessionConfig** helper: MCP servers, skills, system message in one place
- **CTS-to-TCS wiring**: the 10-minute timeout in SendPromptAndWaitAsync actually cancels the TCS
- **HasUsedToolsThisTurn reset** before reconnect retry
- **Worker revival**: detect empty response → fresh session → retry once (~20 lines)
- **Volatile.Read** cleanup for ActiveToolCallCount
- **Corrupt/locked session** restore fallback

### Tests
- 6 new JSON parsing tests
- Updated fuzzy match tests for exact-match-only behavior
- Updated structural regression guards for BuildFreshSessionConfig

### Why Not Tool-Intercept?
Multi-model review (Opus 4.6, Sonnet 4, Gemini 3 Pro, GPT-5.2) unanimously recommended against the tool-intercept approach. The core issue: \is_override\ on a custom task AIFunction returns a fake placeholder while running real work in background Task.Run. The SDK doesn't handle this — events don't fire correctly, sessions get stuck, and fresh sessions lose conversation history. Closes #229 Supersedes #318 --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
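The JSON-first parsing with a regex fallback described above might look like the following sketch. The record shape, helper name, and exact `@worker:...@EnD` grammar are assumptions from the PR description.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;
using System.Text.RegularExpressions;

public sealed record TaskAssignment(string Worker, string Task);

static List<TaskAssignment> ParseAssignments(string text)
{
    // Preferred path: a JSON array of {worker, task} objects.
    try
    {
        var json = JsonSerializer.Deserialize<List<TaskAssignment>>(
            text, new JsonSerializerOptions { PropertyNameCaseInsensitive = true });
        if (json is { Count: > 0 })
            return json
                .Select(a => a with { Worker = a.Worker.Trim('`', '"', '\'') })
                .ToList();
    }
    catch (JsonException)
    {
        // Not valid JSON; fall through to the legacy text format.
    }

    // Fallback: @worker: ... @EnD blocks (legacy text dispatch).
    return Regex.Matches(text, @"@(?<w>[\w-]+):\s*(?<t>.*?)@EnD",
            RegexOptions.Singleline)
        .Select(m => new TaskAssignment(
            m.Groups["w"].Value.Trim('`', '"', '\''),
            m.Groups["t"].Value.Trim()))
        .ToList();
}
```

Worker names would then be resolved by exact match only, which is what removes the substring misroutes the PR describes.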
PR #318 Review: "feat: Intercept SDK task tool for multi-agent orchestration" — CI Status: 5 models reviewed (claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex). 4 consensus findings. 🔴 CRITICAL —
Closes #229
What
Replaces fragile @worker: text-parsing with structured tool-call dispatch for multi-agent orchestration.
Changes
Testing