Fix multi-agent bridge bugs and watchdog timeout for workers #195
Merged
…top IsProcessing race

Three fixes for mobile bridge reliability:

1. WsBridgeServer: Fire-and-forget SendPromptAsync in the SendMessage handler. The handler was awaiting ResponseCompletion, which blocks for the entire response duration (minutes), preventing abort/switch/new messages from being processed by the per-client WebSocket read loop.
2. CopilotService.Bridge: On TurnEnd, request fresh history before clearing the streaming guard. Previously, removing the session from _remoteStreamingSessions immediately allowed SyncRemoteSessions to overwrite incrementally-built history with a stale SessionHistories cache, losing the last message.
3. CopilotService.Bridge: Skip IsProcessing updates from SessionsList for sessions that are actively streaming. The periodic sessions list could race with event-driven TurnStart/TurnEnd, causing stop button flicker.

Also fixes: ParseTaskAssignments regex now captures worker names with spaces (e.g. 'PR Review Squad-worker-1') instead of only the first word.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
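The first fix can be illustrated with a minimal sketch. It is written in Python with asyncio rather than the project's C#, and every name in it (`send_prompt`, `read_loop`, the message tuples) is a hypothetical stand-in, not the bridge's real API. It only shows why scheduling the long-running send instead of awaiting it lets the read loop keep handling messages such as an abort.

```python
import asyncio


async def send_prompt(prompt: str, log: list) -> None:
    # Hypothetical stand-in for SendPromptAsync: it finishes only once the
    # whole response is complete (simulated here with a short sleep).
    await asyncio.sleep(0.05)
    log.append(f"response:{prompt}")


async def read_loop(messages: list, log: list) -> None:
    pending = set()
    for kind, payload in messages:
        if kind == "send":
            # Fire-and-forget: schedule the send and keep reading.
            task = asyncio.create_task(send_prompt(payload, log))
            pending.add(task)  # hold a reference so the task is not collected
            task.add_done_callback(pending.discard)
        elif kind == "abort":
            log.append("abort")
        await asyncio.sleep(0)  # yield control, as a real socket read would
    # Only for the demo: wait for outstanding sends before returning.
    await asyncio.gather(*pending)


def run_demo() -> list:
    log = []
    asyncio.run(read_loop([("send", "hi"), ("abort", None)], log))
    return log
```

In this sketch the abort is logged before the response arrives. Had the loop awaited `send_prompt` directly, the abort would have waited behind the full response.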
OnStateChanged broadcast only SessionsList, not OrganizationState. This left mobile with stale group assignments: sessions moved between groups on desktop wouldn't update on mobile until a specific org-triggering operation occurred. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
CreateGroupFromPresetAsync had role/group/model assignment inside the same try block as CreateSessionAsync. If the session already existed (e.g. recreating the same Squad team), CreateSessionAsync threw and the orchestrator lost its Orchestrator role, workers lost their group assignment and system prompts. Move assignment outside the try so it runs regardless of whether session creation succeeded or was skipped. Also adds 3 tests for ParseTaskAssignments with worker names containing spaces (the regex fix from the prior commit). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
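The restructuring described above can be sketched as follows. This is an illustrative Python model, not the project's C# code; `SessionExistsError`, `create_session`, and the dictionary layout are assumptions made for the example. The point is only that assignment runs after the try block, so an already-existing session still receives its role and group.

```python
class SessionExistsError(Exception):
    """Hypothetical error raised when a session name is already taken."""


def create_session(sessions: dict, name: str) -> None:
    # Stand-in for CreateSessionAsync: fails if the session already exists.
    if name in sessions:
        raise SessionExistsError(name)
    sessions[name] = {"role": None, "group": None}


def create_group_from_preset(sessions: dict, group: str, members: dict) -> None:
    for name, role in members.items():
        try:
            create_session(sessions, name)
        except SessionExistsError:
            pass  # creation skipped; the session is reused as-is
        # Assignment sits outside the try, so it runs whether creation
        # succeeded or was skipped.
        sessions[name]["role"] = role
        sessions[name]["group"] = group
```

With assignment inside the try, the exception for an existing session would have jumped past both assignment lines, which is the failure mode the commit describes.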
Workers doing text-heavy tasks (e.g., PR reviews) can take 2-4 minutes without tool calls. The 120s inactivity timeout was killing workers prematurely — the watchdog cleared IsProcessing and added a 'stuck' warning, then the actual response arrived but CompleteResponse skipped because IsProcessing was already false, losing the response. Now sessions in multi-agent groups use the 600s tool-execution timeout. The multi-agent flag is cached on SessionState at send time (UI thread) so the watchdog can read it safely from its background thread without accessing the Organization lists (plain List<T>, UI-thread-only). The orchestration loop already has its own 10-minute per-worker timeout via CancelAfter, so the watchdog is a safety net, not the primary guard. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
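A rough sketch of the cached-flag approach, in Python for illustration only. The constants mirror the 120s and 600s values from the commit message, while `SessionState`, `on_send`, and `watchdog_timeout` are hypothetical names. The flag is computed once on the sending side, and the watchdog reads only that boolean, never the shared group list.

```python
from dataclasses import dataclass

INACTIVITY_TIMEOUT_S = 120       # default inactivity timeout from the commit
TOOL_EXECUTION_TIMEOUT_S = 600   # longer timeout now used for multi-agent sessions


@dataclass
class SessionState:
    # Cached at send time so the watchdog never touches the group lists.
    is_multi_agent: bool = False


def on_send(state: SessionState, session_name: str, multi_agent_names: set) -> None:
    # Runs on the sending (UI) side, where reading the organization data is safe.
    state.is_multi_agent = session_name in multi_agent_names


def watchdog_timeout(state: SessionState) -> int:
    # Runs on the background side and reads only the cached boolean.
    if state.is_multi_agent:
        return TOOL_EXECUTION_TIMEOUT_S
    return INACTIVITY_TIMEOUT_S
```

Caching a single boolean trades a little staleness (group membership could change mid-turn) for not having to synchronize access to the lists.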
PureWeen added a commit that referenced this pull request on Feb 26, 2026
Three bugs fixed:

1. Sessions stuck forever when SDK sends repeated SessionUsageInfoEvent (e.g., FailedDelegation) without terminal events. Root cause: HandleSessionEvent unconditionally updated LastEventAtTicks for ALL events, including metrics-only events that don't indicate progress. Fix: Skip LastEventAtTicks update for SessionUsageInfoEvent and AssistantUsageEvent.
2. Added WatchdogMaxProcessingTimeSeconds (3600s) as absolute safety net. Even if progress events keep arriving, no turn should run for 60 minutes without user notification. Uses ProcessingStartedAt (set in SendPromptAsync) so it cannot be reset by events.
3. False 'session stuck' warnings after app restart. Root cause: GetEventsFileRestoreHints used WatchdogInactivityTimeoutSeconds (120s) as file age threshold, but tool executions can go 5-10 minutes without events.jsonl writes. Fix: Use WatchdogToolExecutionTimeoutSeconds (600s) threshold.

Added ~25 regression guard tests covering every known failure mode from PRs #148, #163, #195, #211, #224, plus invariant tests for the 8 processing state safety invariants.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
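The first fix in this commit amounts to filtering which events count as progress. A minimal Python sketch under that reading: the two event names come from the commit message, while `WatchedSession`, `handle_session_event`, and the use of plain float timestamps are assumptions made for illustration.

```python
from dataclasses import dataclass

# Event types named in the commit as metrics-only (no turn progress).
METRICS_ONLY_EVENTS = {"SessionUsageInfoEvent", "AssistantUsageEvent"}


@dataclass
class WatchedSession:
    last_event_at: float = 0.0  # stand-in for LastEventAtTicks


def handle_session_event(state: WatchedSession, event_type: str, now: float) -> None:
    # Only events that indicate real progress reset the inactivity clock.
    if event_type not in METRICS_ONLY_EVENTS:
        state.last_event_at = now
```

With this gate, a stream made up only of usage events leaves `last_event_at` untouched, so an inactivity check can still fire.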
PureWeen added a commit that referenced this pull request on Feb 26, 2026
## Problem

Three bugs in the processing watchdog:

1. **Session stuck at 361 minutes**: MergeNET11 session had `IsProcessing=true` for 361m. The SDK was sending repeated `SessionUsageInfoEvent` (FailedDelegation) which reset `LastEventAtTicks` without ever sending a terminal event. The watchdog timer kept resetting and never fired.
2. **No max processing time**: Even with fix #1, there was no absolute ceiling. A session could run forever as long as progress events arrived.
3. **False 'stuck' warnings**: After app restart, sessions got `⚠️ Session appears stuck — no events received for over 30 seconds` even though they were actively working. `GetEventsFileRestoreHints` used 120s threshold, but tool executions can go 5-10 minutes without events.jsonl writes.

## Fix

1. **Gate `LastEventAtTicks` update**: Skip for `SessionUsageInfoEvent` and `AssistantUsageEvent` (metrics-only events that don't indicate turn progress).
2. **Add `WatchdogMaxProcessingTimeSeconds`** (3600s = 60 min): Absolute safety net using `ProcessingStartedAt`, which cannot be reset by events.
3. **Fix restore hints threshold**: Changed from `WatchdogInactivityTimeoutSeconds` (120s) to `WatchdogToolExecutionTimeoutSeconds` (600s).

## Testing

- ~25 new regression guard tests covering every known failure mode from PRs #148, #163, #195, #211, #224
- Invariant tests for INV-1, INV-4, INV-5, INV-6
- End-to-end scenario tests
- All 1447 tests pass

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
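The absolute ceiling from fix 2 can be sketched as a second, independent condition alongside the inactivity check. This is an illustrative Python model with assumed names (`watchdog_should_fire` and its parameters) and timestamps in seconds. The real watchdog may combine the conditions differently.

```python
INACTIVITY_TIMEOUT_S = 120      # value quoted for WatchdogInactivityTimeoutSeconds
MAX_PROCESSING_TIME_S = 3600    # value quoted for WatchdogMaxProcessingTimeSeconds


def watchdog_should_fire(
    now: float,
    processing_started_at: float,
    last_event_at: float,
    inactivity_timeout: float = INACTIVITY_TIMEOUT_S,
) -> bool:
    # Condition 1: no progress event for longer than the inactivity timeout.
    inactive = now - last_event_at >= inactivity_timeout
    # Condition 2: the turn has run past the absolute ceiling. It is measured
    # from the start of processing, so incoming events cannot reset it.
    over_ceiling = now - processing_started_at >= MAX_PROCESSING_TIME_S
    return inactive or over_ceiling
```

Because the ceiling depends only on the start time, a session that keeps receiving events is still flagged once it crosses 3600 seconds.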
Summary
Fixes 6 bugs discovered while testing multi-agent PR review orchestration on desktop and mobile.
Bridge fixes (mobile)
- `SendMessage` handler was `await`ing `SendPromptAsync` (blocks until full response), preventing all other client messages from being processed. Now fire-and-forget via `Task.Run`.
- `SyncRemoteSessions` was overwriting incrementally-built streaming history with a stale cache on `TurnEnd`. Now requests fresh history before clearing the streaming guard.
- `SyncRemoteSessions` unconditionally overwrote `IsProcessing` from the periodic sessions list, racing with event-driven `TurnStart`/`TurnEnd`. Now skips processing state updates for actively streaming sessions.
- `OnStateChanged` only sent `SessionsList`, not `OrganizationState`, so mobile never saw group/role changes.

Multi-agent orchestration fixes
- `ParseTaskAssignments` regex `(\S+)` only captured the first word of worker names with spaces (e.g. "PR Review Squad-worker-1"). Changed to `([^\n]+?)`.
- `CreateGroupFromPresetAsync` did role/group/model assignment inside the same try block as `CreateSessionAsync`, so recreating an existing Squad skipped all assignments.

Watchdog timeout fix
- The 120s inactivity timeout cleared `IsProcessing` on workers that were still generating text, and `CompleteResponse` skipped when `IsProcessing` was already false. Now caches `IsMultiAgentSession` on `SessionState` at send time (thread-safe) and uses the 600s timeout.

Tests
- 3 tests for `ParseTaskAssignments` with worker names containing spaces.
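The `ParseTaskAssignments` regex change listed under the orchestration fixes can be illustrated with Python's `re` module, where `\S+` and `[^\n]+?` behave the same way as in the capture groups quoted above. The `@worker:` marker and the trailing `[ \t]*\n` anchor are hypothetical, since the page does not show the real assignment format or the rest of the pattern.

```python
import re

# Hypothetical assignment marker; only the capture groups come from the PR.
OLD_PATTERN = re.compile(r"@worker:(\S+)")
NEW_PATTERN = re.compile(r"@worker:([^\n]+?)[ \t]*\n")


def parse_names(pattern: re.Pattern, text: str) -> list:
    # Collect the captured worker name from every assignment marker.
    return [m.group(1) for m in pattern.finditer(text)]


PLAN = (
    "@worker:PR Review Squad-worker-1\n"
    "Review the diff.\n"
    "@worker:solo\n"
    "Run the tests.\n"
)

old_names = parse_names(OLD_PATTERN, PLAN)  # stops at the first space
new_names = parse_names(NEW_PATTERN, PLAN)  # keeps the full line
```

A lazy `[^\n]+?` needs something after it to stop against. Here that is optional trailing whitespace plus the newline, and without such an anchor the group would match a single character.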
Review
Fix reviewed by Opus 4.6, Sonnet 4.5, and GPT-5.2 — all agreed on the thread-safety fix (cache multi-agent flag at send time vs. reading Organization lists from background thread).