Fix multi-agent dispatch bypass and premature watchdog timeout after restore #284

Merged
PureWeen merged 3 commits into main from fix/multi-agent-dispatch-restore
Mar 5, 2026

@PureWeen PureWeen commented Mar 5, 2026

Problem

After app relaunch, two issues surfaced:

  1. Multi-agent dispatch broken: The orchestrator generated @worker dispatch commands, but they were silently ignored because GetOrchestratorGroupId returned null.
  2. Premature 'stuck' timeout: Sessions were killed by the watchdog after ~2 minutes instead of the expected 10 minutes for multi-agent work.

Root Cause

Both share the same root cause: ReconcileOrganization was fully skipped while IsRestoring=true (to prevent pruning sessions not yet loaded). This left Organization.Sessions stale:

  • GetOrchestratorGroupId couldn't find the session metadata → returned null → dispatch bypassed
  • IsSessionInMultiAgentGroup returned false → watchdog used 120s timeout instead of 600s

Fix

  • ReconcileOrganization(allowPruning: false): New mode that adds missing metadata for active sessions but never deletes anything. Safe to call during restore.
  • Queue drain guard: CompleteResponse now forces this additive reconciliation if IsRestoring is true, ensuring metadata is available for routing and watchdog configuration from the moment a restored turn completes.
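
A minimal sketch of the additive mode described above, not the actual implementation. `SessionMeta`, `OrganizationSketch`, and the dictionary-shaped `activeSessions` argument are hypothetical stand-ins, since the real `Organization` model is not shown in this PR. Only the `allowPruning` parameter name comes from the change itself.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the per-session metadata kept in Organization.Sessions.
public sealed record SessionMeta(string SessionName, string? GroupId);

public sealed class OrganizationSketch
{
    public List<SessionMeta> Sessions { get; } = new();

    // activeSessions maps session name -> group id (null when ungrouped).
    public void Reconcile(
        IReadOnlyDictionary<string, string?> activeSessions,
        bool allowPruning = true)
    {
        // Additive step: always add metadata for active sessions that lack it.
        foreach (var (name, groupId) in activeSessions)
        {
            if (!Sessions.Any(s => s.SessionName == name))
                Sessions.Add(new SessionMeta(name, groupId));
        }

        // Destructive step: only a full reconcile may drop metadata for
        // sessions that are no longer active. During restore this is skipped,
        // because sessions that have not loaded yet would look dead.
        if (allowPruning)
            Sessions.RemoveAll(s => !activeSessions.ContainsKey(s.SessionName));
    }
}
```

With `allowPruning: false`, an entry whose session has not loaded yet survives the call, while a newly active orchestrator session still gets the metadata that group lookup depends on.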

Files Changed

  • CopilotService.Organization.cs: allowPruning parameter on ReconcileOrganization
  • CopilotService.Events.cs: Additive reconcile call during queue drain

PureWeen and others added 3 commits March 4, 2026 22:20
…restore

During app relaunch, ReconcileOrganization was fully skipped while
IsRestoring=true to prevent pruning sessions not yet loaded. This left
Organization.Sessions stale, causing two issues:

1. GetOrchestratorGroupId returned null for restored orchestrator sessions,
   bypassing the multi-agent dispatch pipeline entirely.
2. IsSessionInMultiAgentGroup returned false, making the watchdog use the
   120s inactivity timeout instead of the 600s tool-execution timeout,
   killing long-running orchestrator/worker tasks prematurely.

Fix: Add allowPruning parameter to ReconcileOrganization. When false,
it adds missing metadata for active sessions but never deletes anything.
CompleteResponse now forces this additive reconciliation during queue
drain if IsRestoring is true, ensuring metadata is available for routing
and watchdog configuration.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Bug: IsMultiAgentSession was never set during RestoreSingleSessionAsync,
causing the watchdog to use the 120s inactivity timeout instead of 600s
for multi-agent workers restored after relaunch. Fixed by calling
IsSessionInMultiAgentGroup() before StartProcessingWatchdog.
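
The ordering requirement can be sketched as follows. This is an illustration under assumptions: `SessionStateSketch`, `WatchdogSketch`, and `RestoreAndStartWatchdog` are hypothetical, and the real timeout selection has more conditions than the single flag shown here. The 120s and 600s values and the two constant names are taken from this PR's text.

```csharp
using System;

// Hypothetical stand-in for the per-session processing state.
public sealed class SessionStateSketch
{
    public bool IsMultiAgentSession { get; set; }
}

public static class WatchdogSketch
{
    public const int WatchdogInactivityTimeoutSeconds = 120;
    public const int WatchdogToolExecutionTimeoutSeconds = 600;

    // Simplified selection: multi-agent sessions get the longer timeout.
    public static int SelectTimeoutSeconds(SessionStateSketch state) =>
        state.IsMultiAgentSession
            ? WatchdogToolExecutionTimeoutSeconds
            : WatchdogInactivityTimeoutSeconds;

    // Hypothetical restore step: the flag is initialized BEFORE the watchdog
    // reads it. Returns the timeout the watchdog would start with.
    public static int RestoreAndStartWatchdog(
        SessionStateSketch state,
        Func<bool> isSessionInMultiAgentGroup)
    {
        state.IsMultiAgentSession = isSessionInMultiAgentGroup();
        return SelectTimeoutSeconds(state);
    }
}
```

If the assignment were skipped, the flag would keep its default of `false` and the restored worker would start with the 120s timeout, which is the failure this commit describes.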

Documentation updates:
- processing-state-safety: Add INV-9 (restore must init all watchdog state),
  add mistake #5 (missing restore initialization), update description to
  cover restore paths and IsRestoring window, update regression history
  with PR #284 root cause and pattern
- performance-optimization: Add PERF-6 (ReconcileOrganization during
  IsRestoring needs allowPruning:false for additive-only mode)
- regression-history: Add PR #284 entry with full analysis

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ReconcileOrganization(allowPruning: false) was updating
_lastReconcileSessionHash even though it skipped pruning. This caused
the post-restore full ReconcileOrganization() to match the hash and
return early, permanently suppressing pruning of dead sessions.

Fix: only update the hash cache when doing a full reconciliation
(allowPruning=true). Additive-only calls leave the hash stale so the
next full call will always run.
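
A sketch of the hash-cache rule, assuming the early return is keyed on a hash of the active session names. `ReconcileCacheSketch`, `ComputeHash`, and the run counters are hypothetical and exist only to make the behavior observable. `_lastReconcileSessionHash` is the field named in this commit.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class ReconcileCacheSketch
{
    private int? _lastReconcileSessionHash;

    public int FullRuns { get; private set; }
    public int AdditiveRuns { get; private set; }

    public void ReconcileOrganization(
        IEnumerable<string> activeSessionNames,
        bool allowPruning = true)
    {
        int hash = ComputeHash(activeSessionNames);

        if (!allowPruning)
        {
            // Additive-only: do the work but leave the cached hash untouched,
            // so the next full call cannot be short-circuited by this one.
            AdditiveRuns++;
            return;
        }

        // Full reconcile: skip only if a previous FULL run saw the same set.
        if (_lastReconcileSessionHash == hash)
            return;

        FullRuns++;
        _lastReconcileSessionHash = hash;
    }

    // Order-independent hash of the session names (assumed scheme).
    private static int ComputeHash(IEnumerable<string> names)
    {
        var hash = new HashCode();
        foreach (var name in names.OrderBy(n => n, StringComparer.Ordinal))
            hash.Add(name);
        return hash.ToHashCode();
    }
}
```

Calling the additive mode and then the full mode with the same session set still runs the full reconcile once. A second full call with an unchanged set returns early.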

Found by multi-model review (Opus 4.6, GPT-5.2, Gemini 3 Pro consensus).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen merged commit d2b7c2b into main Mar 5, 2026
@PureWeen PureWeen deleted the fix/multi-agent-dispatch-restore branch March 5, 2026 13:39
PureWeen added a commit that referenced this pull request Mar 9, 2026
Adds ChatExperienceSafetyTests.cs with 41 tests (40 active, 1 pending PR #330)
covering the invariants documented in processing-state-safety skill:

- INV-1: All 8 termination paths clear state correctly (CompleteResponse,
  SessionErrorEvent, AbortSessionAsync, watchdog)
- INV-2: State mutations marshaled to UI thread via InvokeOnUI
- INV-3: ProcessingGeneration guard prevents stale IDLE from killing new turns
- INV-5: HasUsedToolsThisTurn protects sessions between tool rounds (not just
  while tools are active)
- INV-9: IsMultiAgentSession set before StartProcessingWatchdog in both
  SendPromptAsync and RestoreSingleSessionAsync paths
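
For readers unfamiliar with INV-3, a generation guard of this kind could look like the sketch below. This is an assumption about the general pattern, not the project's code: `TurnStateSketch`, `BeginTurn`, and `TryComplete` are hypothetical names, and only the idea of a `ProcessingGeneration` counter comes from the invariant list above.

```csharp
using System.Threading;

public sealed class TurnStateSketch
{
    private long _processingGeneration;

    public bool IsProcessing { get; private set; }

    // Starts a new turn and returns the generation token for it.
    public long BeginTurn()
    {
        IsProcessing = true;
        return Interlocked.Increment(ref _processingGeneration);
    }

    // Handles an idle/complete signal that captured `generation` when its
    // turn started. Signals from an older turn are ignored.
    public bool TryComplete(long generation)
    {
        if (generation != Interlocked.Read(ref _processingGeneration))
            return false; // stale signal: a newer turn is already running

        IsProcessing = false;
        return true;
    }
}
```

A late idle signal from turn 1 that arrives after turn 2 has begun fails the comparison and leaves `IsProcessing` set, so the new turn is not cleared by mistake.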

Behavioral tests (demo mode integration):
- Multi-turn message preservation (5 sequential turns, all history retained)
- Abort clears all 9 INV-1 fields, fires OnSessionComplete
- Post-abort send succeeds without deadlock (SendingFlag cleared)
- Session isolation (stuck session doesn't block others)
- WatchdogToolExecutionTimeoutSeconds > WatchdogInactivityTimeoutSeconds
- WatchdogMaxProcessingTimeSeconds >= 30 minutes

Source-code assertion tests (regression guards against future refactors):
- useToolTimeout formula has all 4 conditions (INV-5)
- TurnEnd fallback checks HasUsedToolsThisTurn before firing CompleteResponse
- FlushCurrentResponse called at AssistantTurnEndEvent (content persistence fix)
- FlushCurrentResponse dedup guard prevents SDK-replay duplicates
- CompleteResponse cancels watchdog before cleanup
- Reconnect path carries forward IsMultiAgentSession + HasUsedToolsThisTurn

These tests are designed to catch the class of regressions documented in
regression-history.md (PRs #141-#284).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>