Fix multi-agent dispatch bypass and premature watchdog timeout after restore#284
Merged
Fix multi-agent dispatch bypass and premature watchdog timeout after restore#284
Conversation
…restore During app relaunch, ReconcileOrganization was fully skipped while IsRestoring=true to prevent pruning sessions not yet loaded. This left Organization.Sessions stale, causing two issues: 1. GetOrchestratorGroupId returned null for restored orchestrator sessions, bypassing the multi-agent dispatch pipeline entirely. 2. IsSessionInMultiAgentGroup returned false, making the watchdog use the 120s inactivity timeout instead of the 600s tool-execution timeout, killing long-running orchestrator/worker tasks prematurely. Fix: Add allowPruning parameter to ReconcileOrganization. When false, it adds missing metadata for active sessions but never deletes anything. CompleteResponse now forces this additive reconciliation during queue drain if IsRestoring is true, ensuring metadata is available for routing and watchdog configuration. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Bug: IsMultiAgentSession was never set during RestoreSingleSessionAsync, causing the watchdog to use the 120s inactivity timeout instead of 600s for multi-agent workers restored after relaunch. Fixed by calling IsSessionInMultiAgentGroup() before StartProcessingWatchdog. Documentation updates: - processing-state-safety: Add INV-9 (restore must init all watchdog state), add mistake #5 (missing restore initialization), update description to cover restore paths and IsRestoring window, update regression history with PR #284 root cause and pattern - performance-optimization: Add PERF-6 (ReconcileOrganization during IsRestoring needs allowPruning:false for additive-only mode) - regression-history: Add PR #284 entry with full analysis Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ReconcileOrganization(allowPruning: false) was updating _lastReconcileSessionHash even though it skipped pruning. This caused the post-restore full ReconcileOrganization() to match the hash and return early, permanently suppressing pruning of dead sessions. Fix: only update the hash cache when doing a full reconciliation (allowPruning=true). Additive-only calls leave the hash stale so the next full call will always run. Found by multi-model review (Opus 4.6, GPT-5.2, Gemini 3 Pro consensus). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen
added a commit
that referenced
this pull request
Mar 9, 2026
Adds ChatExperienceSafetyTests.cs with 41 tests (40 active, 1 pending PR #330) covering the invariants documented in processing-state-safety skill: - INV-1: All 8 termination paths clear state correctly (CompleteResponse, SessionErrorEvent, AbortSessionAsync, watchdog) - INV-2: State mutations marshaled to UI thread via InvokeOnUI - INV-3: ProcessingGeneration guard prevents stale IDLE from killing new turns - INV-5: HasUsedToolsThisTurn protects sessions between tool rounds (not just while tools are active) - INV-9: IsMultiAgentSession set before StartProcessingWatchdog in both SendPromptAsync and RestoreSingleSessionAsync paths Behavioral tests (demo mode integration): - Multi-turn message preservation (5 sequential turns, all history retained) - Abort clears all 9 INV-1 fields, fires OnSessionComplete - Post-abort send succeeds without deadlock (SendingFlag cleared) - Session isolation (stuck session doesn't block others) - WatchdogToolExecutionTimeoutSeconds > WatchdogInactivityTimeoutSeconds - WatchdogMaxProcessingTimeSeconds >= 30 minutes Source-code assertion tests (regression guards against future refactors): - useToolTimeout formula has all 4 conditions (INV-5) - TurnEnd fallback checks HasUsedToolsThisTurn before firing CompleteResponse - FlushCurrentResponse called at AssistantTurnEndEvent (content persistence fix) - FlushCurrentResponse dedup guard prevents SDK-replay duplicates - CompleteResponse cancels watchdog before cleanup - Reconnect path carries forward IsMultiAgentSession + HasUsedToolsThisTurn These tests are designed to catch the class of regressions documented in regression-history.md (PRs #141-#284). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen
added a commit
that referenced
this pull request
Mar 9, 2026
Adds ChatExperienceSafetyTests.cs with 41 tests (40 active, 1 pending PR #330) covering the invariants documented in processing-state-safety skill: - INV-1: All 8 termination paths clear state correctly (CompleteResponse, SessionErrorEvent, AbortSessionAsync, watchdog) - INV-2: State mutations marshaled to UI thread via InvokeOnUI - INV-3: ProcessingGeneration guard prevents stale IDLE from killing new turns - INV-5: HasUsedToolsThisTurn protects sessions between tool rounds (not just while tools are active) - INV-9: IsMultiAgentSession set before StartProcessingWatchdog in both SendPromptAsync and RestoreSingleSessionAsync paths Behavioral tests (demo mode integration): - Multi-turn message preservation (5 sequential turns, all history retained) - Abort clears all 9 INV-1 fields, fires OnSessionComplete - Post-abort send succeeds without deadlock (SendingFlag cleared) - Session isolation (stuck session doesn't block others) - WatchdogToolExecutionTimeoutSeconds > WatchdogInactivityTimeoutSeconds - WatchdogMaxProcessingTimeSeconds >= 30 minutes Source-code assertion tests (regression guards against future refactors): - useToolTimeout formula has all 4 conditions (INV-5) - TurnEnd fallback checks HasUsedToolsThisTurn before firing CompleteResponse - FlushCurrentResponse called at AssistantTurnEndEvent (content persistence fix) - FlushCurrentResponse dedup guard prevents SDK-replay duplicates - CompleteResponse cancels watchdog before cleanup - Reconnect path carries forward IsMultiAgentSession + HasUsedToolsThisTurn These tests are designed to catch the class of regressions documented in regression-history.md (PRs #141-#284). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen
added a commit
that referenced
this pull request
Mar 9, 2026
Adds ChatExperienceSafetyTests.cs with 41 tests (40 active, 1 pending PR #330) covering the invariants documented in processing-state-safety skill: - INV-1: All 8 termination paths clear state correctly (CompleteResponse, SessionErrorEvent, AbortSessionAsync, watchdog) - INV-2: State mutations marshaled to UI thread via InvokeOnUI - INV-3: ProcessingGeneration guard prevents stale IDLE from killing new turns - INV-5: HasUsedToolsThisTurn protects sessions between tool rounds (not just while tools are active) - INV-9: IsMultiAgentSession set before StartProcessingWatchdog in both SendPromptAsync and RestoreSingleSessionAsync paths Behavioral tests (demo mode integration): - Multi-turn message preservation (5 sequential turns, all history retained) - Abort clears all 9 INV-1 fields, fires OnSessionComplete - Post-abort send succeeds without deadlock (SendingFlag cleared) - Session isolation (stuck session doesn't block others) - WatchdogToolExecutionTimeoutSeconds > WatchdogInactivityTimeoutSeconds - WatchdogMaxProcessingTimeSeconds >= 30 minutes Source-code assertion tests (regression guards against future refactors): - useToolTimeout formula has all 4 conditions (INV-5) - TurnEnd fallback checks HasUsedToolsThisTurn before firing CompleteResponse - FlushCurrentResponse called at AssistantTurnEndEvent (content persistence fix) - FlushCurrentResponse dedup guard prevents SDK-replay duplicates - CompleteResponse cancels watchdog before cleanup - Reconnect path carries forward IsMultiAgentSession + HasUsedToolsThisTurn These tests are designed to catch the class of regressions documented in regression-history.md (PRs #141-#284). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
After app relaunch, two issues surfaced:
@workerdispatch commands, but they were silently ignored becauseGetOrchestratorGroupIdreturnednull.Root Cause
Both share the same root cause:
ReconcileOrganizationwas fully skipped whileIsRestoring=true(to prevent pruning sessions not yet loaded). This leftOrganization.Sessionsstale:GetOrchestratorGroupIdcouldn't find the session metadata → returnednull→ dispatch bypassedIsSessionInMultiAgentGroupreturnedfalse→ watchdog used 120s timeout instead of 600sFix
ReconcileOrganization(allowPruning: false): New mode that adds missing metadata for active sessions but never deletes anything. Safe to call during restore.CompleteResponsenow forces this additive reconciliation ifIsRestoringis true, ensuring metadata is available for routing and watchdog configuration from the moment a restored turn completes.Files Changed
CopilotService.Organization.cs—allowPruningparameter onReconcileOrganizationCopilotService.Events.cs— Additive reconcile call during queue drain