feat: enhanced text-parsing dispatch with reliability fixes (#229) by PureWeen · Pull Request #339 · PureWeen/PolyPilot

PureWeen · 2026-03-10T14:48:55Z

Summary

Replaces the tool-intercept approach (PR #318) with enhanced text-parsing dispatch. The tool-intercept approach fought the SDK's event model — 22/33 commits were workarounds for the impedance mismatch, and it caused the 'single worker dispatched' and 'identical tasks' production bugs.

This PR keeps the text-parsing path that was working reliably, and enhances it:

Text-Parsing Enhancements

- JSON mode parsing: Orchestrator can return `[{worker, task}]` JSON arrays, parsed with System.Text.Json. Falls back to the `@worker:...@EnD` regex on parse failure.
- Exact match only for worker name resolution: removed the bidirectional `Contains` fallback that caused misroutes when names are substrings of each other.
- Backtick/quote stripping from worker names for robustness.
- Differentiated task instruction: 'Each worker MUST receive a DIFFERENT sub-task'
- Retry loop (up to 3 iterations) with conversation history when not all workers dispatched. Model remembers what it already assigned — no fresh sessions, no amnesia.
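As a rough illustration of the parse order described above (JSON first, block regex on failure, then exact-match name resolution), here is a minimal sketch. It is not the PR's code: the helper names `TryParseJson`, `ParseBlocks`, and `ParseAssignments`, the exact regex, and the case-insensitive comparison are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;
using System.Text.RegularExpressions;

// Hypothetical helper: try the JSON array form; null means "fall back to the regex".
static List<(string Name, string Task)>? TryParseJson(string response)
{
    var json = response.Trim();
    var fence = new string('`', 3); // Markdown code fence marker
    if (json.StartsWith(fence, StringComparison.Ordinal))
    {
        // Drop the opening fence line and the closing fence.
        var firstNewline = json.IndexOf('\n');
        var lastFence = json.LastIndexOf(fence, StringComparison.Ordinal);
        if (firstNewline < 0 || lastFence <= firstNewline) return null;
        json = json.Substring(firstNewline + 1, lastFence - firstNewline - 1).Trim();
    }
    if (!json.StartsWith("[", StringComparison.Ordinal)) return null;
    try
    {
        using var doc = JsonDocument.Parse(json);
        var list = new List<(string Name, string Task)>();
        foreach (var item in doc.RootElement.EnumerateArray())
        {
            if (item.ValueKind == JsonValueKind.Object
                && item.TryGetProperty("worker", out var w) && w.ValueKind == JsonValueKind.String
                && item.TryGetProperty("task", out var t) && t.ValueKind == JsonValueKind.String)
            {
                list.Add((w.GetString()!, t.GetString()!));
            }
        }
        return list;
    }
    catch (JsonException)
    {
        return null; // malformed JSON: let the caller use the block regex
    }
}

// Hypothetical fallback. Assumed block shape: "@worker:<name>" line, task body, "@end".
static List<(string Name, string Task)> ParseBlocks(string response)
{
    return Regex.Matches(response, @"@worker:(?<name>[^\r\n]+)\r?\n(?<task>.*?)@end",
                         RegexOptions.Singleline | RegexOptions.IgnoreCase)
        .Select(m => (Name: m.Groups["name"].Value, Task: m.Groups["task"].Value))
        .ToList();
}

static List<(string Worker, string Task)> ParseAssignments(string response, IReadOnlyCollection<string> workers)
{
    var raw = TryParseJson(response) ?? ParseBlocks(response);
    var result = new List<(string Worker, string Task)>();
    foreach (var (name, task) in raw)
    {
        // Strip backticks/quotes, then require a whole-name match (no Contains fallback).
        // Ignoring case here is an assumption of this sketch.
        var cleaned = name.Trim().Trim('`', '"', '\'').Trim();
        var match = workers.FirstOrDefault(x => string.Equals(x, cleaned, StringComparison.OrdinalIgnoreCase));
        if (match is not null && !string.IsNullOrWhiteSpace(task))
            result.Add((match, task.Trim()));
    }
    return result;
}

var knownWorkers = new[] { "worker-1", "worker-10" };
var parsed = ParseAssignments("[{\"worker\":\"`worker-1`\",\"task\":\"Review the parser\"}]", knownWorkers);
Console.WriteLine($"{parsed.Count} assignment(s) resolved");
```

With exact matching, a name such as `work` resolves to neither `worker-1` nor `worker-10`, which is the misroute the `Contains` fallback allowed.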

Reliability Fixes (ported from tool-dispatch branch)

- _clientReconnectLock: SemaphoreSlim thundering-herd fix for concurrent workers hitting connection errors
- Watchdog tier split: active-tool=600s, used-tools-idle=180s, default=120s (cuts zombie detection from 10min to 3min)
- BuildFreshSessionConfig helper: MCP servers, skills, system message in one place
- CTS-to-TCS wiring: 10-minute timeout in SendPromptAndWaitAsync actually cancels the TCS
- HasUsedToolsThisTurn reset before reconnect retry
- Worker revival: detect empty response → fresh session → retry once (~20 lines)
- Volatile.Read cleanup for ActiveToolCallCount
- Corrupt/locked session restore fallback
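The watchdog tier split can be read as a three-way selection. The sketch below shows only that selection logic; the function name `EffectiveWatchdogTimeout` and its parameters are hypothetical stand-ins, and the tier values are the ones quoted in the list above.

```csharp
using System;

// Hypothetical helper mirroring the three tiers quoted in the PR description.
static TimeSpan EffectiveWatchdogTimeout(int activeToolCallCount, bool hasUsedToolsThisTurn)
{
    if (activeToolCallCount > 0) return TimeSpan.FromSeconds(600); // a tool is running right now
    if (hasUsedToolsThisTurn) return TimeSpan.FromSeconds(180);    // tools were used, session now idle
    return TimeSpan.FromSeconds(120);                              // default tier
}

Console.WriteLine(EffectiveWatchdogTimeout(0, true).TotalSeconds);
```

Under this reading, a session that used tools earlier in the turn but has none active is declared dead after 180s rather than 600s, which matches the "10min to 3min" figure.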

Tests

- 6 new JSON parsing tests
- Updated fuzzy match tests for exact-match-only behavior
- Updated structural regression guards for BuildFreshSessionConfig

Why Not Tool-Intercept?

Multi-model review (Opus 4.6, Sonnet 4, Gemini 3 Pro, GPT-5.2) unanimously recommended against the tool-intercept approach. The core issue: `is_override` on a custom task AIFunction returns a fake placeholder while running real work in background Task.Run. The SDK doesn't handle this — events don't fire correctly, sessions get stuck, and fresh sessions lose conversation history.

Closes #229
Supersedes #318

When RestorePreviousSessionsAsync encounters a 'session file corrupted' error (e.g., events.jsonl locked by another copilot process), fall back to CreateSessionAsync instead of silently dropping the session. Updated error message to explain CLI lock cause.

Cherry-picked from: de5f0ae, 5a21b76

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Cherry-picked from feature/intercept-task-tool-229:
- Corrupt/locked session restore fallback (de5f0ae, 5a21b76)
- Volatile.Read cleanup for ActiveToolCallCount (45b34b3)

Ported reliability fixes:
- _clientReconnectLock: SemaphoreSlim thundering-herd fix for concurrent workers hitting IsConnectionError simultaneously
- Watchdog tier split: active-tool=600s, used-tools-idle=180s, default=120s (cuts zombie detection from 10min to 3min)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Text-Parsing Enhancements:
- JSON mode parsing: orchestrator can return [{worker,task}] array, parsed with System.Text.Json. Falls back to @worker:...@EnD regex on parse failure.
- Exact match only for worker names — removed bidirectional Contains fallback that caused misroutes when names are substrings of each other.
- Strip backticks/quotes from worker names for robustness.
- Differentiated task instruction: 'Each worker MUST receive a DIFFERENT sub-task'
- Retry loop (up to 3 iterations) with conversation history when not all workers dispatched. Model remembers what it already assigned (no fresh sessions/amnesia).

Reliability Fixes (ported from tool-dispatch branch):
- BuildFreshSessionConfig helper: MCP servers, skills, system message in one place. Applied to reconnect handler and worker revival.
- CTS-to-TCS wiring: 10-minute timeout in SendPromptAndWaitAsync actually cancels the ResponseCompletion TCS instead of being a no-op.
- Reset HasUsedToolsThisTurn before reconnect retry to prevent 600s zombie timeout.
- Worker revival: detect empty response in ExecuteWorkerAsync, create fresh session with BuildFreshSessionConfig, retry once (~20 lines vs ~70 in tool-dispatch).

Tests:
- 6 new JSON parsing tests (array, code-fenced, unknown worker, malformed, empty)
- Updated fuzzy match tests to verify exact-match-only behavior
- Updated ConnectionRecovery + ChatExperienceSafety for BuildFreshSessionConfig

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Update planning prompt to instruct model to assign ALL workers in a single response instead of 'at least one'. This fixes the issue where the model would assign only 1 worker per iteration, requiring 3 iterations to get 3 workers assigned (and missing workers 4-5).
- Update nudge prompt to request ALL remaining workers at once.
- Add per-group SemaphoreSlim (_groupDispatchLocks) to prevent concurrent dispatches to the same group. The bridge's send_message handler and the event queue drain can both call SendToMultiAgentGroupAsync; without this guard, the second call hits 'Session already processing' error.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…edup helper

- Fix reconnect lock double-check dead code: replace !IsConnectionError(ex) (always false) with !ReferenceEquals(_client, client) to detect if another worker already reconnected (CRITICAL)
- Update ComputeEffectiveTimeout test helper to match 3-tier production logic (was using old 2-tier formula, causing false-negative tests)
- Add WatchdogTimeout_UsedToolsIdle_Uses180s test for the middle tier
- Update WatchdogTimeout_BetweenToolRounds to expect 180s (not 600s)
- Update WatchdogTimeout_MultiAgent to expect 120s (isMultiAgent no longer escalates)
- Update AllCombinations theory data for 3-tier formula
- Extract DeduplicateAssignments() helper replacing 5 copy-pasted GroupBy chains
- Add WorkerExecutionTimeout named constant (was magic TimeSpan.FromMinutes(10))
- Document _groupDispatchLocks silent-skip invariant
- Narrow bare catch {} to catch (JsonException) in TryParseJsonAssignments
- Add StringComparison.Ordinal to json.StartsWith

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
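The reconnect double-check in the first item can be pictured as follows. This is a sketch under stated assumptions, not the service's code: a plain `object` stands in for the SDK client, `ReconnectIfStaleAsync` is a hypothetical name, and the reconnect itself is simulated with a short delay. The point is the `ReferenceEquals` comparison made after acquiring the lock.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

object sharedClient = new object();            // stand-in for the shared SDK client
var reconnectLock = new SemaphoreSlim(1, 1);   // plays the role of _clientReconnectLock
var reconnectCount = 0;

async Task<object> ReconnectIfStaleAsync(object failedClient)
{
    await reconnectLock.WaitAsync();
    try
    {
        // If the shared client is no longer the instance that failed,
        // another worker already reconnected while we waited: reuse its client.
        if (!ReferenceEquals(sharedClient, failedClient))
            return sharedClient;

        await Task.Delay(10);                  // simulated reconnect work
        reconnectCount++;
        sharedClient = new object();
        return sharedClient;
    }
    finally
    {
        reconnectLock.Release();
    }
}

// Five workers observe a connection error on the same client instance.
var stale = sharedClient;
await Task.WhenAll(Enumerable.Range(0, 5).Select(_ => ReconnectIfStaleAsync(stale)));
Console.WriteLine($"reconnects performed: {reconnectCount}"); // prints "reconnects performed: 1"
```

Checking the exception again inside the lock cannot tell the callers apart, because every caller arrived with a connection error; comparing client identity can.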

The _groupDispatchLocks guard used WaitAsync(0) which silently dropped any message arriving while a dispatch was in progress. This caused user messages sent to a busy orchestrator to vanish entirely. Changed to WaitAsync(ct) so concurrent callers wait their turn and execute sequentially instead of being discarded.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
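The difference between the two calls can be reproduced in isolation. The following sketch assumes a single `SemaphoreSlim` gate and a hypothetical `DispatchAsync` that records the prompts it handles; only `SemaphoreSlim.WaitAsync(int, CancellationToken)` and `WaitAsync(CancellationToken)` are real APIs here.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

var gate = new SemaphoreSlim(1, 1);
var handled = new List<string>();

// Hypothetical dispatcher: returns false when the prompt was dropped.
async Task<bool> DispatchAsync(string prompt, bool skipIfBusy, CancellationToken ct)
{
    if (skipIfBusy)
    {
        // Zero timeout: returns false immediately if another dispatch holds the gate.
        if (!await gate.WaitAsync(0, ct)) return false;
    }
    else
    {
        // No timeout: waits until the running dispatch releases the gate.
        await gate.WaitAsync(ct);
    }

    try
    {
        await Task.Delay(50, ct);  // simulated dispatch work
        handled.Add(prompt);       // only touched while holding the gate
        return true;
    }
    finally
    {
        gate.Release();
    }
}

// Skip mode: the second prompt arrives while the first is running and is lost.
var skipFirst = DispatchAsync("first", true, CancellationToken.None);
var skipSecond = DispatchAsync("second", true, CancellationToken.None);
await Task.WhenAll(skipFirst, skipSecond);
Console.WriteLine($"skip mode handled {handled.Count} of 2 prompts");

handled.Clear();

// Wait mode: the second prompt queues behind the first and both run.
var waitFirst = DispatchAsync("first", false, CancellationToken.None);
var waitSecond = DispatchAsync("second", false, CancellationToken.None);
await Task.WhenAll(waitFirst, waitSecond);
Console.WriteLine($"wait mode handled {handled.Count} of 2 prompts");
```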

PureWeen · 2026-03-10T16:58:37Z

🤖 Multi-Model Fleet Review — PR #339

PR: feat: enhanced text-parsing dispatch with reliability fixes (#229)
Method: 5 independent AI reviewers (2× Opus 4.6, Sonnet 4.6, Gemini 3 Pro, GPT-5.3-Codex) reviewed the full diff in parallel. Only findings flagged by ≥2/5 models survive the consensus filter.

📋 CI Status

⚠️ No CI checks configured. No automated tests were run against this PR.

🔍 Consensus Findings

🔴 F1: Revived worker session missing event handler registration (2/5 consensus)

File: CopilotService.Organization.cs — ExecuteWorkerAsync, worker revival block (~line 1389)

The worker revival path calls client.CreateSessionAsync() and stores the fresh state, but never calls freshSession.On(evt => HandleSessionEvent(freshState, evt)). Every other session-creation site in the codebase registers this handler — it is the sole mechanism by which CompleteResponse and ResponseCompletion.TrySetResult() fire.

Impact: Without the event handler, the revived session sends a prompt, but no SDK events arrive → ResponseCompletion never completes → the call hangs until the 10-minute CTS timeout → returns empty. The revival feature is dead on arrival.

Fix: Add freshSession.On(evt => HandleSessionEvent(freshState, evt)); after _sessions[workerName] = freshState;, mirroring the reconnect path.

Flagged by: Sonnet 4.6 (directly), Gemini 3 Pro (misdiagnosed as "infinite recursion" — same root cause)

🟡 F2: Silent dispatch drop via `_groupDispatchLocks` WaitAsync(0) (4/5 consensus)

File: CopilotService.Organization.cs:~936 — SendToMultiAgentGroupAsync

```csharp
if (!await dispatchLock.WaitAsync(0, cancellationToken))
{
    Debug($"... dispatch already in progress, skipping");
    return;  // ← prompt is silently lost
}
```

If a user sends a message via the bridge while a dispatch is already running, the new prompt is silently discarded. The _queuedGroupPrompts mechanism only handles the reflect loop — there is no queue on the skip path.

Fix: Either (a) queue the prompt for post-dispatch delivery, (b) use WaitAsync(cancellationToken) to block instead of skip, or (c) return a status so the caller can retry.

Flagged by: Opus 4.6 #1, Opus 4.6 #2, Gemini 3 Pro, GPT-5.3-Codex

🟡 F3: `TrySetCanceled()` without CancellationToken (3/5 consensus)

File: CopilotService.Organization.cs:~1414 — SendPromptAndWaitAsync

```csharp
cts.Token.Register(() => {
    if (_sessions.TryGetValue(sessionName, out var s))
        s.ResponseCompletion?.TrySetCanceled();  // ← no token
});
```

Calling TrySetCanceled() without a token cancels the task with CancellationToken.None, so the resulting OperationCanceledException carries no identifying token. Callers cannot distinguish a 10-minute timeout from a user-initiated abort, and filters such as catch (OperationCanceledException ex) when (ex.CancellationToken == ct) won't match.

Fix: Use s.ResponseCompletion?.TrySetCanceled(cts.Token).

Flagged by: Opus 4.6 #1, Opus 4.6 #2, GPT-5.3-Codex
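The token behaviour behind F3 can be checked with a small standalone program. This is an illustrative sketch, not code from the PR: `TokenMatchesAsync` is a hypothetical helper that cancels a `TaskCompletionSource` from a `CancellationTokenSource` callback, with and without passing the token, and reports whether the caught exception carries that token.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: true when the awaited cancellation carries cts.Token.
static async Task<bool> TokenMatchesAsync(bool passToken)
{
    using var cts = new CancellationTokenSource();
    var tcs = new TaskCompletionSource<string>(TaskCreationOptions.RunContinuationsAsynchronously);

    using var registration = cts.Token.Register(() =>
    {
        if (passToken) tcs.TrySetCanceled(cts.Token); // records the token on the task
        else tcs.TrySetCanceled();                    // records CancellationToken.None
    });

    cts.Cancel(); // runs the registered callback

    try
    {
        await tcs.Task;
        return false; // not reached: the task is always canceled above
    }
    catch (OperationCanceledException ex)
    {
        return ex.CancellationToken == cts.Token;
    }
}

Console.WriteLine(await TokenMatchesAsync(passToken: false)); // False
Console.WriteLine(await TokenMatchesAsync(passToken: true));  // True
```

Only the second form lets a `when (ex.CancellationToken == ...)` filter recognise the timeout.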

🟡 F4: Multi-agent orchestrator timeout reduced 600s → 120s (2/5 consensus)

File: CopilotService.Events.cs:~1536 — RunProcessingWatchdogAsync

isMultiAgent was removed from the useToolTimeout calculation. An orchestrator session waiting for workers — with no active tools of its own and HasUsedToolsThisTurn = false — now gets 120s instead of 600s. The PR's own retry loop (3 iterations × nudge + worker revival) can easily exceed 120s, at which point the watchdog kills the orchestrator mid-dispatch.

Fix: Either add isMultiAgent to the 180s useUsedToolsTimeout tier, or ensure the orchestrator's HasUsedToolsThisTurn is set early in the dispatch flow.

Flagged by: Opus 4.6 #2, Sonnet 4.6

🟡 F5: Worker revival non-atomic session replacement (2/5 consensus)

File: CopilotService.Organization.cs:~1380 — ExecuteWorkerAsync

The revival sequence disposes the old session, mutates deadState.Info.SessionId, then replaces _sessions[workerName]. Between dispose and replace, concurrent readers (watchdog, CTS registration, event handlers) see a state with a disposed Session but a mutated Info.SessionId, creating session-ID mismatches in logging and event routing. The old watchdog continues running against the shared Info object, potentially clearing IsProcessing on the new session.

Fix: Clone Info for the fresh state so old and new watchdogs don't share mutable state.

Flagged by: Opus 4.6 #1, Opus 4.6 #2

📊 Sub-Consensus Notable Mentions (1/5 — not actionable, listed for author awareness)

| Finding | Model | Note |
| --- | --- | --- |
| Overly broad `"corrupt"` / `"session file"` substring matching in `Persistence.cs` | Opus #1 | Could match unrelated errors; consider more specific strings |
| `_groupDispatchLocks` never cleaned up → memory leak | Gemini | SemaphoreSlim entries accumulate; add cleanup on group deletion |
| Duplicate comment block on `_groupDispatchLocks` declaration | All models noted | Copy-paste artifact — 3 lines repeated verbatim |

🧪 Test Coverage Assessment

| Area | Coverage |
| --- | --- |
| Watchdog 3-tier timeout logic | ✅ Thorough — theory + point tests updated |
| JSON parsing (`TryParseJsonAssignments`) | ✅ Good — 6 new tests covering valid, malformed, code-fence, unknown workers |
| Exact-match-only worker resolution | ✅ Good — 2 existing tests updated to verify rejection |
| `BuildFreshSessionConfig` structural guards | ✅ Good — 4 existing tests updated |
| Retry dispatch loop (3 iterations) | ❌ No unit test |
| Worker revival path | ❌ No unit test |
| `_clientReconnectLock` thundering-herd fix | ❌ No unit test |
| `_groupDispatchLocks` concurrent skip behavior | ❌ No unit test |
| CTS-to-TCS cancellation wiring | ❌ No unit test |

5 of the 9 areas listed above have zero test coverage. The untested paths are all concurrency-sensitive.

🏁 Verdict

⚠️ Request Changes

Blocking (must fix):

1. F1 — Register event handler on revived sessions (the revival feature is non-functional without this)
2. F4 — Ensure multi-agent orchestrator sessions get adequate watchdog timeout (120s is too short for the new retry loop)

Should fix:
3. F3 — Pass cts.Token to TrySetCanceled() (one-line fix)
4. F5 — Clone Info during worker revival to prevent shared-state races

Consider:
5. F2 — Queue or block on concurrent dispatch instead of dropping (design decision)

The architecture improvements (thundering-herd lock, JSON parsing, retry loop, BuildFreshSessionConfig extraction) are solid. The critical gap is F1 — without the event handler, worker revival silently fails every time.


PureWeen mentioned this pull request Mar 10, 2026

feat: Intercept SDK task tool for multi-agent orchestration (#229) #318

Closed

PureWeen and others added 5 commits March 10, 2026 11:31

PureWeen force-pushed the feature/enhanced-text-dispatch-229 branch from 24722ff to 30132f9 Compare March 10, 2026 16:32

PureWeen merged commit a79f9f4 into main Mar 10, 2026

PureWeen deleted the feature/enhanced-text-dispatch-229 branch March 10, 2026 17:54

PureWeen merged 6 commits into main from feature/enhanced-text-dispatch-229
