Fix multi-agent dispatch bypass and premature watchdog timeout after restore by PureWeen · Pull Request #284 · PureWeen/PolyPilot

PureWeen · 2026-03-05T04:20:29Z

Problem

After app relaunch, two issues surfaced:

Multi-agent dispatch broken: The orchestrator generated @worker dispatch commands, but they were silently ignored because GetOrchestratorGroupId returned null.
Premature 'stuck' timeout: Sessions were killed by the watchdog after ~2 minutes instead of the expected 10 minutes for multi-agent work.

Root Cause

Both share the same root cause: ReconcileOrganization was fully skipped while IsRestoring=true (to prevent pruning sessions not yet loaded). This left Organization.Sessions stale:

GetOrchestratorGroupId couldn't find the session metadata → returned null → dispatch bypassed
IsSessionInMultiAgentGroup returned false → watchdog used 120s timeout instead of 600s

Fix

ReconcileOrganization(allowPruning: false): New mode that adds missing metadata for active sessions but never deletes anything. Safe to call during restore.
Queue drain guard: CompleteResponse now forces this additive reconciliation if IsRestoring is true, ensuring metadata is available for routing and watchdog configuration from the moment a restored turn completes.

Files Changed

CopilotService.Organization.cs — allowPruning parameter on ReconcileOrganization
CopilotService.Events.cs — Additive reconcile call during queue drain

…restore During app relaunch, ReconcileOrganization was fully skipped while IsRestoring=true to prevent pruning sessions not yet loaded. This left Organization.Sessions stale, causing two issues: 1. GetOrchestratorGroupId returned null for restored orchestrator sessions, bypassing the multi-agent dispatch pipeline entirely. 2. IsSessionInMultiAgentGroup returned false, making the watchdog use the 120s inactivity timeout instead of the 600s tool-execution timeout, killing long-running orchestrator/worker tasks prematurely. Fix: Add allowPruning parameter to ReconcileOrganization. When false, it adds missing metadata for active sessions but never deletes anything. CompleteResponse now forces this additive reconciliation during queue drain if IsRestoring is true, ensuring metadata is available for routing and watchdog configuration. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Bug: IsMultiAgentSession was never set during RestoreSingleSessionAsync, causing the watchdog to use the 120s inactivity timeout instead of 600s for multi-agent workers restored after relaunch. Fixed by calling IsSessionInMultiAgentGroup() before StartProcessingWatchdog. Documentation updates: - processing-state-safety: Add INV-9 (restore must init all watchdog state), add mistake #5 (missing restore initialization), update description to cover restore paths and IsRestoring window, update regression history with PR #284 root cause and pattern - performance-optimization: Add PERF-6 (ReconcileOrganization during IsRestoring needs allowPruning:false for additive-only mode) - regression-history: Add PR #284 entry with full analysis Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

ReconcileOrganization(allowPruning: false) was updating _lastReconcileSessionHash even though it skipped pruning. This caused the post-restore full ReconcileOrganization() to match the hash and return early, permanently suppressing pruning of dead sessions. Fix: only update the hash cache when doing a full reconciliation (allowPruning=true). Additive-only calls leave the hash stale so the next full call will always run. Found by multi-model review (Opus 4.6, GPT-5.2, Gemini 3 Pro consensus). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Adds ChatExperienceSafetyTests.cs with 41 tests (40 active, 1 pending PR #330) covering the invariants documented in processing-state-safety skill: - INV-1: All 8 termination paths clear state correctly (CompleteResponse, SessionErrorEvent, AbortSessionAsync, watchdog) - INV-2: State mutations marshaled to UI thread via InvokeOnUI - INV-3: ProcessingGeneration guard prevents stale IDLE from killing new turns - INV-5: HasUsedToolsThisTurn protects sessions between tool rounds (not just while tools are active) - INV-9: IsMultiAgentSession set before StartProcessingWatchdog in both SendPromptAsync and RestoreSingleSessionAsync paths Behavioral tests (demo mode integration): - Multi-turn message preservation (5 sequential turns, all history retained) - Abort clears all 9 INV-1 fields, fires OnSessionComplete - Post-abort send succeeds without deadlock (SendingFlag cleared) - Session isolation (stuck session doesn't block others) - WatchdogToolExecutionTimeoutSeconds > WatchdogInactivityTimeoutSeconds - WatchdogMaxProcessingTimeSeconds >= 30 minutes Source-code assertion tests (regression guards against future refactors): - useToolTimeout formula has all 4 conditions (INV-5) - TurnEnd fallback checks HasUsedToolsThisTurn before firing CompleteResponse - FlushCurrentResponse called at AssistantTurnEndEvent (content persistence fix) - FlushCurrentResponse dedup guard prevents SDK-replay duplicates - CompleteResponse cancels watchdog before cleanup - Reconnect path carries forward IsMultiAgentSession + HasUsedToolsThisTurn These tests are designed to catch the class of regressions documented in regression-history.md (PRs #141-#284). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PureWeen and others added 3 commits March 4, 2026 22:20

PureWeen merged commit d2b7c2b into main Mar 5, 2026

PureWeen deleted the fix/multi-agent-dispatch-restore branch March 5, 2026 13:39

gemini-code-assist Bot mentioned this pull request Mar 7, 2026

Update from head repo omonimus1/PolyPilot#1

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix multi-agent dispatch bypass and premature watchdog timeout after restore#284

Fix multi-agent dispatch bypass and premature watchdog timeout after restore#284
PureWeen merged 3 commits intomainfrom
fix/multi-agent-dispatch-restore

PureWeen commented Mar 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

PureWeen commented Mar 5, 2026

Problem

Root Cause

Fix

Files Changed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant