Skip to content

feat: enhanced text-parsing dispatch with reliability fixes (#229)#339

Merged
PureWeen merged 6 commits intomainfrom
feature/enhanced-text-dispatch-229
Mar 10, 2026
Merged

feat: enhanced text-parsing dispatch with reliability fixes (#229)#339
PureWeen merged 6 commits intomainfrom
feature/enhanced-text-dispatch-229

Conversation

@PureWeen
Copy link
Copy Markdown
Owner

Summary

Replaces the tool-intercept approach (PR #318) with enhanced text-parsing dispatch. The tool-intercept approach fought the SDK's event model — 22/33 commits were workarounds for the impedance mismatch, and it caused the 'single worker dispatched' and 'identical tasks' production bugs.

This PR keeps the text-parsing path that was working reliably, and enhances it:

Text-Parsing Enhancements

  • JSON mode parsing: Orchestrator can return [{worker, task}]\ JSON arrays, parsed with System.Text.Json. Falls back to @worker:...@EnD\ regex on parse failure.
  • Exact match only for worker name resolution — removed bidirectional \Contains\ fallback that caused misroutes when names are substrings of each other.
  • Backtick/quote stripping from worker names for robustness.
  • Differentiated task instruction: 'Each worker MUST receive a DIFFERENT sub-task'
  • Retry loop (up to 3 iterations) with conversation history when not all workers dispatched. Model remembers what it already assigned — no fresh sessions, no amnesia.

Reliability Fixes (ported from tool-dispatch branch)

  • _clientReconnectLock: SemaphoreSlim thundering-herd fix for concurrent workers hitting connection errors
  • Watchdog tier split: active-tool=600s, used-tools-idle=180s, default=120s (cuts zombie detection from 10min to 3min)
  • BuildFreshSessionConfig helper: MCP servers, skills, system message in one place
  • CTS-to-TCS wiring: 10-minute timeout in SendPromptAndWaitAsync actually cancels the TCS
  • HasUsedToolsThisTurn reset before reconnect retry
  • Worker revival: detect empty response → fresh session → retry once (~20 lines)
  • Volatile.Read cleanup for ActiveToolCallCount
  • Corrupt/locked session restore fallback

Tests

  • 6 new JSON parsing tests
  • Updated fuzzy match tests for exact-match-only behavior
  • Updated structural regression guards for BuildFreshSessionConfig

Why Not Tool-Intercept?

Multi-model review (Opus 4.6, Sonnet 4, Gemini 3 Pro, GPT-5.2) unanimously recommended against the tool-intercept approach. The core issue: \is_override\ on a custom task AIFunction returns a fake placeholder while running real work in background Task.Run. The SDK doesn't handle this — events don't fire correctly, sessions get stuck, and fresh sessions lose conversation history.

Closes #229
Supersedes #318

PureWeen and others added 5 commits March 10, 2026 11:31
When RestorePreviousSessionsAsync encounters a 'session file corrupted' error
(e.g., events.jsonl locked by another copilot process), fall back to
CreateSessionAsync instead of silently dropping the session. Updated error
message to explain CLI lock cause.

Cherry-picked from: de5f0ae, 5a21b76

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Cherry-picked from feature/intercept-task-tool-229:
- Corrupt/locked session restore fallback (de5f0ae, 5a21b76)
- Volatile.Read cleanup for ActiveToolCallCount (45b34b3)

Ported reliability fixes:
- _clientReconnectLock: SemaphoreSlim thundering-herd fix for concurrent
  workers hitting IsConnectionError simultaneously
- Watchdog tier split: active-tool=600s, used-tools-idle=180s, default=120s
  (cuts zombie detection from 10min to 3min)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Text-Parsing Enhancements:
- JSON mode parsing: orchestrator can return [{worker,task}] array, parsed
  with System.Text.Json. Falls back to @worker:...@EnD regex on parse failure.
- Exact match only for worker names — removed bidirectional Contains fallback
  that caused misroutes when names are substrings of each other.
- Strip backticks/quotes from worker names for robustness.
- Differentiated task instruction: 'Each worker MUST receive a DIFFERENT sub-task'
- Retry loop (up to 3 iterations) with conversation history when not all workers
  dispatched. Model remembers what it already assigned (no fresh sessions/amnesia).

Reliability Fixes (ported from tool-dispatch branch):
- BuildFreshSessionConfig helper: MCP servers, skills, system message in one place.
  Applied to reconnect handler and worker revival.
- CTS-to-TCS wiring: 10-minute timeout in SendPromptAndWaitAsync actually cancels
  the ResponseCompletion TCS instead of being a no-op.
- Reset HasUsedToolsThisTurn before reconnect retry to prevent 600s zombie timeout.
- Worker revival: detect empty response in ExecuteWorkerAsync, create fresh session
  with BuildFreshSessionConfig, retry once (~20 lines vs ~70 in tool-dispatch).

Tests:
- 6 new JSON parsing tests (array, code-fenced, unknown worker, malformed, empty)
- Updated fuzzy match tests to verify exact-match-only behavior
- Updated ConnectionRecovery + ChatExperienceSafety for BuildFreshSessionConfig

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Update planning prompt to instruct model to assign ALL workers in a single
  response instead of 'at least one'. This fixes the issue where the model
  would assign only 1 worker per iteration, requiring 3 iterations to get 3
  workers assigned (and missing workers 4-5).
- Update nudge prompt to request ALL remaining workers at once.
- Add per-group SemaphoreSlim (_groupDispatchLocks) to prevent concurrent
  dispatches to the same group. The bridge's send_message handler and the
  event queue drain can both call SendToMultiAgentGroupAsync; without this
  guard, the second call hits 'Session already processing' error.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…edup helper

- Fix reconnect lock double-check dead code: replace !IsConnectionError(ex)
  (always false) with !ReferenceEquals(_client, client) to detect if another
  worker already reconnected (CRITICAL)
- Update ComputeEffectiveTimeout test helper to match 3-tier production logic
  (was using old 2-tier formula, causing false-negative tests)
- Add WatchdogTimeout_UsedToolsIdle_Uses180s test for the middle tier
- Update WatchdogTimeout_BetweenToolRounds to expect 180s (not 600s)
- Update WatchdogTimeout_MultiAgent to expect 120s (isMultiAgent no longer escalates)
- Update AllCombinations theory data for 3-tier formula
- Extract DeduplicateAssignments() helper replacing 5 copy-pasted GroupBy chains
- Add WorkerExecutionTimeout named constant (was magic TimeSpan.FromMinutes(10))
- Document _groupDispatchLocks silent-skip invariant
- Narrow bare catch {} to catch (JsonException) in TryParseJsonAssignments
- Add StringComparison.Ordinal to json.StartsWith

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen force-pushed the feature/enhanced-text-dispatch-229 branch from 24722ff to 30132f9 Compare March 10, 2026 16:32
The _groupDispatchLocks guard used WaitAsync(0) which silently dropped any
message arriving while a dispatch was in progress. This caused user messages
sent to a busy orchestrator to vanish entirely. Changed to WaitAsync(ct)
so concurrent callers wait their turn and execute sequentially instead of
being discarded.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

🤖 Multi-Model Fleet Review — PR #339

PR: feat: enhanced text-parsing dispatch with reliability fixes (#229)
Method: 5 independent AI reviewers (2× Opus 4.6, Sonnet 4.6, Gemini 3 Pro, GPT-5.3-Codex) reviewed the full diff in parallel. Only findings flagged by ≥2/5 models survive the consensus filter.


📋 CI Status

⚠️ No CI checks configured. No automated tests were run against this PR.


🔍 Consensus Findings

🔴 F1: Revived worker session missing event handler registration (2/5 consensus)

File: CopilotService.Organization.csExecuteWorkerAsync, worker revival block (~line 1389)

The worker revival path calls client.CreateSessionAsync() and stores the fresh state, but never calls freshSession.On(evt => HandleSessionEvent(freshState, evt)). Every other session-creation site in the codebase registers this handler — it is the sole mechanism by which CompleteResponse and ResponseCompletion.TrySetResult() fire.

Impact: Without the event handler, the revived session sends a prompt, but no SDK events arrive → ResponseCompletion never completes → the call hangs until the 10-minute CTS timeout → returns empty. The revival feature is dead on arrival.

Fix: Add freshSession.On(evt => HandleSessionEvent(freshState, workerName)); after _sessions[workerName] = freshState;, mirroring the reconnect path.

Flagged by: Sonnet 4.6 (directly), Gemini 3 Pro (misdiagnosed as "infinite recursion" — same root cause)


🟡 F2: Silent dispatch drop via _groupDispatchLocks WaitAsync(0) (4/5 consensus)

File: CopilotService.Organization.cs:~936SendToMultiAgentGroupAsync

if (!await dispatchLock.WaitAsync(0, cancellationToken))
{
    Debug($"... dispatch already in progress, skipping");
    return;  // ← prompt is silently lost
}

If a user sends a message via the bridge while a dispatch is already running, the new prompt is silently discarded. The _queuedGroupPrompts mechanism only handles the reflect loop — there is no queue on the skip path.

Fix: Either (a) queue the prompt for post-dispatch delivery, (b) use WaitAsync(cancellationToken) to block instead of skip, or (c) return a status so the caller can retry.

Flagged by: Opus 4.6 #1, Opus 4.6 #2, Gemini 3 Pro, GPT-5.3-Codex


🟡 F3: TrySetCanceled() without CancellationToken (3/5 consensus)

File: CopilotService.Organization.cs:~1414SendPromptAndWaitAsync

cts.Token.Register(() => {
    if (_sessions.TryGetValue(sessionName, out var s))
        s.ResponseCompletion?.TrySetCanceled();  // ← no token
});

TrySetCanceled() without a token produces OperationCanceledException with CancellationToken.None. Callers cannot distinguish a 10-minute timeout from a user-initiated abort, and catch (OCE ex) when (ex.CancellationToken == ct) filters won't match.

Fix: Use s.ResponseCompletion?.TrySetCanceled(cts.Token).

Flagged by: Opus 4.6 #1, Opus 4.6 #2, GPT-5.3-Codex


🟡 F4: Multi-agent orchestrator timeout reduced 600s → 120s (2/5 consensus)

File: CopilotService.Events.cs:~1536RunProcessingWatchdogAsync

isMultiAgent was removed from the useToolTimeout calculation. An orchestrator session waiting for workers — with no active tools of its own and HasUsedToolsThisTurn = false — now gets 120s instead of 600s. The PR's own retry loop (3 iterations × nudge + worker revival) can easily exceed 120s, at which point the watchdog kills the orchestrator mid-dispatch.

Fix: Either add isMultiAgent to the 180s useUsedToolsTimeout tier, or ensure the orchestrator's HasUsedToolsThisTurn is set early in the dispatch flow.

Flagged by: Opus 4.6 #2, Sonnet 4.6


🟡 F5: Worker revival non-atomic session replacement (2/5 consensus)

File: CopilotService.Organization.cs:~1380ExecuteWorkerAsync

The revival sequence disposes the old session, mutates deadState.Info.SessionId, then replaces _sessions[workerName]. Between dispose and replace, concurrent readers (watchdog, CTS registration, event handlers) see a state with a disposed Session but a mutated Info.SessionId, creating session-ID mismatches in logging and event routing. The old watchdog continues running against the shared Info object, potentially clearing IsProcessing on the new session.

Fix: Clone Info for the fresh state so old and new watchdogs don't share mutable state.

Flagged by: Opus 4.6 #1, Opus 4.6 #2


📊 Sub-Consensus Notable Mentions (1/5 — not actionable, listed for author awareness)

Finding Model Note
Overly broad "corrupt" / "session file" substring matching in Persistence.cs Opus #1 Could match unrelated errors; consider more specific strings
_groupDispatchLocks never cleaned up → memory leak Gemini SemaphoreSlim entries accumulate; add cleanup on group deletion
Duplicate comment block on _groupDispatchLocks declaration All models noted Copy-paste artifact — 3 lines repeated verbatim

🧪 Test Coverage Assessment

Area Coverage
Watchdog 3-tier timeout logic ✅ Thorough — theory + point tests updated
JSON parsing (TryParseJsonAssignments) ✅ Good — 6 new tests covering valid, malformed, code-fence, unknown workers
Exact-match-only worker resolution ✅ Good — 2 existing tests updated to verify rejection
BuildFreshSessionConfig structural guards ✅ Good — 4 existing tests updated
Retry dispatch loop (3 iterations) ❌ No unit test
Worker revival path ❌ No unit test
_clientReconnectLock thundering-herd fix ❌ No unit test
_groupDispatchLocks concurrent skip behavior ❌ No unit test
CTS-to-TCS cancellation wiring ❌ No unit test

5 of the 10 key changes have zero test coverage. The untested paths are all concurrency-sensitive.


🏁 Verdict

⚠️ Request Changes

Blocking (must fix):

  1. F1 — Register event handler on revived sessions (the revival feature is non-functional without this)
  2. F4 — Ensure multi-agent orchestrator sessions get adequate watchdog timeout (120s is too short for the new retry loop)

Should fix:
3. F3 — Pass cts.Token to TrySetCanceled() (one-line fix)
4. F5 — Clone Info during worker revival to prevent shared-state races

Consider:
5. F2 — Queue or block on concurrent dispatch instead of dropping (design decision)

The architecture improvements (thundering-herd lock, JSON parsing, retry loop, BuildFreshSessionConfig extraction) are solid. The critical gap is F1 — without the event handler, worker revival silently fails every time.

@PureWeen PureWeen merged commit a79f9f4 into main Mar 10, 2026
@PureWeen PureWeen deleted the feature/enhanced-text-dispatch-229 branch March 10, 2026 17:54
arisng pushed a commit to arisng/PolyPilot that referenced this pull request Apr 4, 2026
…#229) (PureWeen#339)

## Summary

Replaces the tool-intercept approach (PR PureWeen#318) with enhanced
text-parsing dispatch. The tool-intercept approach fought the SDK's
event model — 22/33 commits were workarounds for the impedance mismatch,
and it *caused* the 'single worker dispatched' and 'identical tasks'
production bugs.

This PR keeps the text-parsing path that was working reliably, and
enhances it:

### Text-Parsing Enhancements
- **JSON mode parsing**: Orchestrator can return \[{worker, task}]\ JSON
arrays, parsed with System.Text.Json. Falls back to \@worker:...@EnD\
regex on parse failure.
- **Exact match only** for worker name resolution — removed
bidirectional \Contains\ fallback that caused misroutes when names are
substrings of each other.
- **Backtick/quote stripping** from worker names for robustness.
- **Differentiated task instruction**: 'Each worker MUST receive a
DIFFERENT sub-task'
- **Retry loop** (up to 3 iterations) with conversation history when not
all workers dispatched. Model remembers what it already assigned — no
fresh sessions, no amnesia.

### Reliability Fixes (ported from tool-dispatch branch)
- \_clientReconnectLock\: SemaphoreSlim thundering-herd fix for
concurrent workers hitting connection errors
- **Watchdog tier split**: active-tool=600s, used-tools-idle=180s,
default=120s (cuts zombie detection from 10min to 3min)
- **BuildFreshSessionConfig** helper: MCP servers, skills, system
message in one place
- **CTS-to-TCS wiring**: 10-minute timeout in SendPromptAndWaitAsync
actually cancels the TCS
- **HasUsedToolsThisTurn reset** before reconnect retry
- **Worker revival**: detect empty response → fresh session → retry once
(~20 lines)
- **Volatile.Read** cleanup for ActiveToolCallCount
- **Corrupt/locked session** restore fallback

### Tests
- 6 new JSON parsing tests
- Updated fuzzy match tests for exact-match-only behavior
- Updated structural regression guards for BuildFreshSessionConfig

### Why Not Tool-Intercept?
Multi-model review (Opus 4.6, Sonnet 4, Gemini 3 Pro, GPT-5.2)
unanimously recommended against the tool-intercept approach. The core
issue: \is_override\ on a custom task AIFunction returns a fake
placeholder while running real work in background Task.Run. The SDK
doesn't handle this — events don't fire correctly, sessions get stuck,
and fresh sessions lose conversation history.

Closes PureWeen#229
Supersedes PureWeen#318

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Rework multi-agent orchestrator: intercept SDK task tool to surface sub-agents as PolyPilot sessions

1 participant