diff --git a/.claude/skills/performance-optimization/SKILL.md b/.claude/skills/performance-optimization/SKILL.md index ffd6ea01b1..7dbbab431a 100644 --- a/.claude/skills/performance-optimization/SKILL.md +++ b/.claude/skills/performance-optimization/SKILL.md @@ -61,6 +61,17 @@ Skips work when the active session set is unchanged (hash of session names). If you add new session types or visibility rules, ensure the hash accounts for them or reconciliation may be incorrectly skipped. +### PERF-6: ReconcileOrganization() during IsRestoring window +`ReconcileOrganization` is skipped during `IsRestoring=true` to prevent pruning +sessions not yet loaded. But code that needs metadata during restore (e.g., +`CompleteResponse` queue drain, `GetOrchestratorGroupId`, `IsSessionInMultiAgentGroup`) +must call `ReconcileOrganization(allowPruning: false)` to trigger an additive-only +update. This mode adds missing `SessionMeta` entries but never deletes anything. + +**Impact:** Without this, multi-agent dispatch is silently bypassed after relaunch, +and the watchdog uses the wrong timeout tier (120s instead of 600s), killing workers. +See PR #284 and processing-state-safety INV-9. + ## Caching Architecture ### Markdown Cache (`ChatMessageList.razor`) diff --git a/.claude/skills/processing-state-safety/SKILL.md b/.claude/skills/processing-state-safety/SKILL.md index b6281f203e..fc745343c2 100644 --- a/.claude/skills/processing-state-safety/SKILL.md +++ b/.claude/skills/processing-state-safety/SKILL.md @@ -7,7 +7,10 @@ description: > CompleteResponse, AbortSessionAsync, or the processing watchdog, (3) Adding new SDK event handlers, (4) Debugging stuck sessions showing "Thinking..." forever, (5) Modifying IsResumed, HasUsedToolsThisTurn, or ActiveToolCallCount, - (6) Adding diagnostic log tags. Covers: 8 invariants from 7 PRs of fix cycles, + (6) Adding diagnostic log tags, (7) Modifying session restore paths + (RestoreSingleSessionAsync) that must initialize watchdog-dependent state, + (8) Modifying ReconcileOrganization or any code that reads Organization.Sessions + during the IsRestoring window. Covers: 9 invariants from 8 PRs of fix cycles, the 8 code paths that clear IsProcessing, and common regression patterns. --- @@ -55,7 +58,7 @@ that crash the app. The DB is a write-through cache; `events.jsonl` is the sourc and replays on session resume via `BulkInsertAsync`. DB write failures are self-healing. **NEVER narrow the ChatDatabase catch filters** — use `catch (Exception ex)` always. -## 8 Invariants +## 9 Invariants ### INV-1: Complete state cleanup Every IsProcessing=false path clears ALL fields. See checklist above. @@ -89,7 +92,20 @@ Clearing guarded on `!hasActiveTool && !HasUsedToolsThisTurn`. `HandleComplete` is already on UI thread. `InvokeAsync` defers execution causing stale renders. -## Top 4 Recurring Mistakes +### INV-9: Session restore must initialize all watchdog-dependent state +The restore path (`RestoreSingleSessionAsync`) is separate from `SendPromptAsync`. +Any field that affects watchdog timeout selection or dispatch routing must be +initialized in BOTH paths: +- `IsMultiAgentSession` — set via `IsSessionInMultiAgentGroup()` before `StartProcessingWatchdog` +- `HasReceivedEventsSinceResume` / `HasUsedToolsThisTurn` — set via `GetEventsFileRestoreHints()` +- `IsResumed` — set on the `AgentSessionInfo` when `isStillProcessing` is true + +When `ReconcileOrganization` hasn't run yet (during `IsRestoring` window), +`Organization.Sessions` metadata may be stale. Any code that reads metadata +during this window must call `ReconcileOrganization(allowPruning: false)` first. +This additive mode safely adds missing entries without pruning loading sessions. + +## Top 5 Recurring Mistakes 1. **Incomplete cleanup** — modifying one IsProcessing path without updating ALL fields that must be cleared simultaneously. @@ -101,8 +117,15 @@ causing stale renders. must be called at every point where accumulated text could be lost (turn_end, tool_start, abort, error, watchdog). The turn_end call was missing until PR #224, causing response loss on app restart. +5. **Missing state initialization on session restore** — `IsMultiAgentSession`, + `IsResumed`, and other flags must be set on restored sessions BEFORE + `StartProcessingWatchdog` is called. The restore path in + `RestoreSingleSessionAsync` is separate from `SendPromptAsync` and must + independently initialize all state the watchdog depends on. PR #284 fixed + `IsMultiAgentSession` not being set during restore, causing the watchdog + to use 120s instead of 600s for multi-agent workers. ## Regression History -8 PRs of fix/regression cycles: #141 → #147 → #148 → #153 → #158 → #163 → #164 → #276. +9 PRs of fix/regression cycles: #141 → #147 → #148 → #153 → #158 → #163 → #164 → #276 → #284. See `references/regression-history.md` for the full timeline with root causes. diff --git a/.claude/skills/processing-state-safety/references/regression-history.md b/.claude/skills/processing-state-safety/references/regression-history.md index 25c3494ad0..3ac8d4057a 100644 --- a/.claude/skills/processing-state-safety/references/regression-history.md +++ b/.claude/skills/processing-state-safety/references/regression-history.md @@ -50,3 +50,23 @@ ## Known Remaining Issues - `HasUsedToolsThisTurn` resets use plain assignment (not Volatile.Write) - InvokeOnUI dispatch for IsResumed adds 15s delay (one watchdog cycle) + +## PR #284 — Multi-Agent Dispatch Bypass + Premature Watchdog on Restore +- **Root cause**: `ReconcileOrganization` was fully skipped during `IsRestoring=true` + to prevent pruning sessions not yet loaded. But `CompleteResponse` runs during + restore (when a restored turn completes), and it needs metadata to: + 1. Route through multi-agent dispatch (`GetOrchestratorGroupId`) + 2. Apply correct watchdog timeout (`IsSessionInMultiAgentGroup`) +- **Symptom 1**: Orchestrator generated `@worker` dispatch commands but they were + silently ignored — dispatch pipeline bypassed because `GetOrchestratorGroupId` + returned `null` (no metadata entry for the restored session) +- **Symptom 2**: Sessions killed as "stuck" after ~2 minutes — `IsMultiAgentSession` + was never set on restored sessions, so watchdog used 120s instead of 600s +- **Fix 1**: `ReconcileOrganization(allowPruning: false)` — new additive mode that + adds missing metadata without deleting anything. Called in `CompleteResponse` + during the `IsRestoring` window. +- **Fix 2**: Set `state.IsMultiAgentSession` in `RestoreSingleSessionAsync` before + `StartProcessingWatchdog` (reads from organization.json loaded at startup). +- **Pattern**: This is a variant of mistake #5 ("Missing state initialization on + session restore") — the restore path must independently initialize ALL state that + `SendPromptAsync` initializes, because events/watchdog/dispatch all depend on it. diff --git a/PolyPilot/Services/CopilotService.Events.cs b/PolyPilot/Services/CopilotService.Events.cs index a618c401b0..c65c593195 100644 --- a/PolyPilot/Services/CopilotService.Events.cs +++ b/PolyPilot/Services/CopilotService.Events.cs @@ -892,6 +892,11 @@ private void CompleteResponse(SessionState state, long? expectedGeneration = nul // Check if the dequeued message is for an orchestrator session — if so, // route through the multi-agent dispatch pipeline instead of direct send. + + // If we are restoring, the global ReconcileOrganization() hasn't run yet. + // We must force an additive-only update so this session's metadata exists. + if (IsRestoring) ReconcileOrganization(allowPruning: false); + var orchGroupId = GetOrchestratorGroupId(state.Info.Name); // Use Task.Run to dispatch on a clean stack frame, avoiding reentrancy diff --git a/PolyPilot/Services/CopilotService.Organization.cs b/PolyPilot/Services/CopilotService.Organization.cs index fbc0f01ab7..05936c305f 100644 --- a/PolyPilot/Services/CopilotService.Organization.cs +++ b/PolyPilot/Services/CopilotService.Organization.cs @@ -207,14 +207,14 @@ private void WriteOrgFile(string json) /// Skips work if the active session set hasn't changed since last reconciliation. /// private int _lastReconcileSessionHash; - internal void ReconcileOrganization() + internal void ReconcileOrganization(bool allowPruning = true) { var activeNames = _sessions.Where(kv => !kv.Value.Info.IsHidden).Select(kv => kv.Key).ToHashSet(); // Safety: skip reconciliation during startup when sessions haven't been restored yet. // LoadOrganization loads the org before RestorePreviousSessionsAsync populates _sessions, // so reconciling then would prune all sessions. Use IsRestoring as the precise scope guard. - if (IsRestoring) + if (IsRestoring && allowPruning) { Debug("ReconcileOrganization: skipping — session restore in progress"); return; @@ -232,7 +232,10 @@ internal void ReconcileOrganization() var currentHash = activeNames.Count; unchecked { foreach (var name in activeNames) currentHash += name.GetHashCode() * 31; } if (currentHash == _lastReconcileSessionHash && currentHash != 0) return; - _lastReconcileSessionHash = currentHash; + // Only update the hash when doing a full reconciliation (with pruning). + // Additive-only calls (allowPruning=false) during restore must not poison the cache, + // or the post-restore full reconciliation will be skipped via hash match. (PR #284 review) + if (allowPruning) _lastReconcileSessionHash = currentHash; bool changed = false; // Build lookup of multi-agent group IDs so we can protect their sessions @@ -365,13 +368,16 @@ internal void ReconcileOrganization() .Select(m => m.SessionName)); // Remove metadata only for sessions that are truly gone (not in any known set) - var toRemove = Organization.Sessions.Where(m => !knownNames.Contains(m.SessionName) && !protectedNames.Contains(m.SessionName)).ToList(); - if (toRemove.Count > 0) + if (allowPruning) { - Debug($"ReconcileOrganization: pruning {toRemove.Count} sessions: {string.Join(", ", toRemove.Select(m => m.SessionName))}"); - changed = true; + var toRemove = Organization.Sessions.Where(m => !knownNames.Contains(m.SessionName) && !protectedNames.Contains(m.SessionName)).ToList(); + if (toRemove.Count > 0) + { + Debug($"ReconcileOrganization: pruning {toRemove.Count} sessions: {string.Join(", ", toRemove.Select(m => m.SessionName))}"); + changed = true; + } + Organization.Sessions.RemoveAll(m => !knownNames.Contains(m.SessionName) && !protectedNames.Contains(m.SessionName)); } - Organization.Sessions.RemoveAll(m => !knownNames.Contains(m.SessionName) && !protectedNames.Contains(m.SessionName)); if (changed) SaveOrganization(); } diff --git a/PolyPilot/Services/CopilotService.cs b/PolyPilot/Services/CopilotService.cs index 415decaf95..377ee73285 100644 --- a/PolyPilot/Services/CopilotService.cs +++ b/PolyPilot/Services/CopilotService.cs @@ -1321,6 +1321,13 @@ public async Task ResumeSessionAsync(string sessionId, string Info = info }; + // Cache multi-agent membership for the watchdog timeout tier. + // Must be set BEFORE StartProcessingWatchdog — otherwise the watchdog uses the + // 120s inactivity timeout instead of the 600s tool timeout, killing workers prematurely. + // IsSessionInMultiAgentGroup reads Organization.Sessions which was loaded from disk + // by LoadOrganization() before RestorePreviousSessionsAsync runs. + state.IsMultiAgentSession = IsSessionInMultiAgentGroup(displayName); + // Wire up event handler BEFORE starting watchdog/timeout so events // arriving immediately after SDK resume are not missed. copilotSession.On(evt => HandleSessionEvent(state, evt));