feat: enhanced text-parsing dispatch with reliability fixes (#229)#339
feat: enhanced text-parsing dispatch with reliability fixes (#229)#339
Conversation
When RestorePreviousSessionsAsync encounters a 'session file corrupted' error (e.g., events.jsonl locked by another copilot process), fall back to CreateSessionAsync instead of silently dropping the session. Updated error message to explain CLI lock cause. Cherry-picked from: de5f0ae, 5a21b76 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Cherry-picked from feature/intercept-task-tool-229: - Corrupt/locked session restore fallback (de5f0ae, 5a21b76) - Volatile.Read cleanup for ActiveToolCallCount (45b34b3) Ported reliability fixes: - _clientReconnectLock: SemaphoreSlim thundering-herd fix for concurrent workers hitting IsConnectionError simultaneously - Watchdog tier split: active-tool=600s, used-tools-idle=180s, default=120s (cuts zombie detection from 10min to 3min) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Text-Parsing Enhancements:
- JSON mode parsing: orchestrator can return [{worker,task}] array, parsed
with System.Text.Json. Falls back to @worker:...@EnD regex on parse failure.
- Exact match only for worker names — removed bidirectional Contains fallback
that caused misroutes when names are substrings of each other.
- Strip backticks/quotes from worker names for robustness.
- Differentiated task instruction: 'Each worker MUST receive a DIFFERENT sub-task'
- Retry loop (up to 3 iterations) with conversation history when not all workers
dispatched. Model remembers what it already assigned (no fresh sessions/amnesia).
Reliability Fixes (ported from tool-dispatch branch):
- BuildFreshSessionConfig helper: MCP servers, skills, system message in one place.
Applied to reconnect handler and worker revival.
- CTS-to-TCS wiring: 10-minute timeout in SendPromptAndWaitAsync actually cancels
the ResponseCompletion TCS instead of being a no-op.
- Reset HasUsedToolsThisTurn before reconnect retry to prevent 600s zombie timeout.
- Worker revival: detect empty response in ExecuteWorkerAsync, create fresh session
with BuildFreshSessionConfig, retry once (~20 lines vs ~70 in tool-dispatch).
Tests:
- 6 new JSON parsing tests (array, code-fenced, unknown worker, malformed, empty)
- Updated fuzzy match tests to verify exact-match-only behavior
- Updated ConnectionRecovery + ChatExperienceSafety for BuildFreshSessionConfig
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Update planning prompt to instruct model to assign ALL workers in a single response instead of 'at least one'. This fixes the issue where the model would assign only 1 worker per iteration, requiring 3 iterations to get 3 workers assigned (and missing workers 4-5). - Update nudge prompt to request ALL remaining workers at once. - Add per-group SemaphoreSlim (_groupDispatchLocks) to prevent concurrent dispatches to the same group. The bridge's send_message handler and the event queue drain can both call SendToMultiAgentGroupAsync; without this guard, the second call hits 'Session already processing' error. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…edup helper
- Fix reconnect lock double-check dead code: replace !IsConnectionError(ex)
(always false) with !ReferenceEquals(_client, client) to detect if another
worker already reconnected (CRITICAL)
- Update ComputeEffectiveTimeout test helper to match 3-tier production logic
(was using old 2-tier formula, causing false-negative tests)
- Add WatchdogTimeout_UsedToolsIdle_Uses180s test for the middle tier
- Update WatchdogTimeout_BetweenToolRounds to expect 180s (not 600s)
- Update WatchdogTimeout_MultiAgent to expect 120s (isMultiAgent no longer escalates)
- Update AllCombinations theory data for 3-tier formula
- Extract DeduplicateAssignments() helper replacing 5 copy-pasted GroupBy chains
- Add WorkerExecutionTimeout named constant (was magic TimeSpan.FromMinutes(10))
- Document _groupDispatchLocks silent-skip invariant
- Narrow bare catch {} to catch (JsonException) in TryParseJsonAssignments
- Add StringComparison.Ordinal to json.StartsWith
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
24722ff to
30132f9
Compare
The _groupDispatchLocks guard used WaitAsync(0) which silently dropped any message arriving while a dispatch was in progress. This caused user messages sent to a busy orchestrator to vanish entirely. Changed to WaitAsync(ct) so concurrent callers wait their turn and execute sequentially instead of being discarded. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🤖 Multi-Model Fleet Review — PR #339PR: feat: enhanced text-parsing dispatch with reliability fixes (#229) 📋 CI Status🔍 Consensus Findings🔴 F1: Revived worker session missing event handler registration (2/5 consensus)File: The worker revival path calls Impact: Without the event handler, the revived session sends a prompt, but no SDK events arrive → Fix: Add
🟡 F2: Silent dispatch drop via
|
| Finding | Model | Note |
|---|---|---|
Overly broad "corrupt" / "session file" substring matching in Persistence.cs |
Opus #1 | Could match unrelated errors; consider more specific strings |
_groupDispatchLocks never cleaned up → memory leak |
Gemini | SemaphoreSlim entries accumulate; add cleanup on group deletion |
Duplicate comment block on _groupDispatchLocks declaration |
All models noted | Copy-paste artifact — 3 lines repeated verbatim |
🧪 Test Coverage Assessment
| Area | Coverage |
|---|---|
| Watchdog 3-tier timeout logic | ✅ Thorough — theory + point tests updated |
JSON parsing (TryParseJsonAssignments) |
✅ Good — 6 new tests covering valid, malformed, code-fence, unknown workers |
| Exact-match-only worker resolution | ✅ Good — 2 existing tests updated to verify rejection |
BuildFreshSessionConfig structural guards |
✅ Good — 4 existing tests updated |
| Retry dispatch loop (3 iterations) | ❌ No unit test |
| Worker revival path | ❌ No unit test |
_clientReconnectLock thundering-herd fix |
❌ No unit test |
_groupDispatchLocks concurrent skip behavior |
❌ No unit test |
| CTS-to-TCS cancellation wiring | ❌ No unit test |
5 of the 10 key changes have zero test coverage. The untested paths are all concurrency-sensitive.
🏁 Verdict
⚠️ Request Changes
Blocking (must fix):
- F1 — Register event handler on revived sessions (the revival feature is non-functional without this)
- F4 — Ensure multi-agent orchestrator sessions get adequate watchdog timeout (120s is too short for the new retry loop)
Should fix:
3. F3 — Pass cts.Token to TrySetCanceled() (one-line fix)
4. F5 — Clone Info during worker revival to prevent shared-state races
Consider:
5. F2 — Queue or block on concurrent dispatch instead of dropping (design decision)
The architecture improvements (thundering-herd lock, JSON parsing, retry loop, BuildFreshSessionConfig extraction) are solid. The critical gap is F1 — without the event handler, worker revival silently fails every time.
…#229) (PureWeen#339) ## Summary Replaces the tool-intercept approach (PR PureWeen#318) with enhanced text-parsing dispatch. The tool-intercept approach fought the SDK's event model — 22/33 commits were workarounds for the impedance mismatch, and it *caused* the 'single worker dispatched' and 'identical tasks' production bugs. This PR keeps the text-parsing path that was working reliably, and enhances it: ### Text-Parsing Enhancements - **JSON mode parsing**: Orchestrator can return \[{worker, task}]\ JSON arrays, parsed with System.Text.Json. Falls back to \@worker:...@EnD\ regex on parse failure. - **Exact match only** for worker name resolution — removed bidirectional \Contains\ fallback that caused misroutes when names are substrings of each other. - **Backtick/quote stripping** from worker names for robustness. - **Differentiated task instruction**: 'Each worker MUST receive a DIFFERENT sub-task' - **Retry loop** (up to 3 iterations) with conversation history when not all workers dispatched. Model remembers what it already assigned — no fresh sessions, no amnesia. ### Reliability Fixes (ported from tool-dispatch branch) - \_clientReconnectLock\: SemaphoreSlim thundering-herd fix for concurrent workers hitting connection errors - **Watchdog tier split**: active-tool=600s, used-tools-idle=180s, default=120s (cuts zombie detection from 10min to 3min) - **BuildFreshSessionConfig** helper: MCP servers, skills, system message in one place - **CTS-to-TCS wiring**: 10-minute timeout in SendPromptAndWaitAsync actually cancels the TCS - **HasUsedToolsThisTurn reset** before reconnect retry - **Worker revival**: detect empty response → fresh session → retry once (~20 lines) - **Volatile.Read** cleanup for ActiveToolCallCount - **Corrupt/locked session** restore fallback ### Tests - 6 new JSON parsing tests - Updated fuzzy match tests for exact-match-only behavior - Updated structural regression guards for BuildFreshSessionConfig ### Why Not Tool-Intercept? Multi-model review (Opus 4.6, Sonnet 4, Gemini 3 Pro, GPT-5.2) unanimously recommended against the tool-intercept approach. The core issue: \is_override\ on a custom task AIFunction returns a fake placeholder while running real work in background Task.Run. The SDK doesn't handle this — events don't fire correctly, sessions get stuck, and fresh sessions lose conversation history. Closes PureWeen#229 Supersedes PureWeen#318 --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Replaces the tool-intercept approach (PR #318) with enhanced text-parsing dispatch. The tool-intercept approach fought the SDK's event model — 22/33 commits were workarounds for the impedance mismatch, and it caused the 'single worker dispatched' and 'identical tasks' production bugs.
This PR keeps the text-parsing path that was working reliably, and enhances it:
Text-Parsing Enhancements
Reliability Fixes (ported from tool-dispatch branch)
Tests
Why Not Tool-Intercept?
Multi-model review (Opus 4.6, Sonnet 4, Gemini 3 Pro, GPT-5.2) unanimously recommended against the tool-intercept approach. The core issue: \is_override\ on a custom task AIFunction returns a fake placeholder while running real work in background Task.Run. The SDK doesn't handle this — events don't fire correctly, sessions get stuck, and fresh sessions lose conversation history.
Closes #229
Supersedes #318