Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 11 additions & 0 deletions .claude/skills/performance-optimization/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -61,6 +61,17 @@ Skips work when the active session set is unchanged (hash of session names).
If you add new session types or visibility rules, ensure the hash accounts
for them or reconciliation may be incorrectly skipped.

### PERF-6: ReconcileOrganization() during IsRestoring window
`ReconcileOrganization` is skipped during `IsRestoring=true` to prevent pruning
sessions not yet loaded. But code that needs metadata during restore (e.g.,
`CompleteResponse` queue drain, `GetOrchestratorGroupId`, `IsSessionInMultiAgentGroup`)
must call `ReconcileOrganization(allowPruning: false)` to trigger an additive-only
update. This mode adds missing `SessionMeta` entries but never deletes anything.

**Impact:** Without this, multi-agent dispatch is silently bypassed after relaunch,
and the watchdog uses the wrong timeout tier (120s instead of 600s), killing workers.
See PR #284 and processing-state-safety INV-9.

## Caching Architecture

### Markdown Cache (`ChatMessageList.razor`)
Expand Down
31 changes: 27 additions & 4 deletions .claude/skills/processing-state-safety/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,10 @@ description: >
CompleteResponse, AbortSessionAsync, or the processing watchdog, (3) Adding new
SDK event handlers, (4) Debugging stuck sessions showing "Thinking..." forever,
(5) Modifying IsResumed, HasUsedToolsThisTurn, or ActiveToolCallCount,
(6) Adding diagnostic log tags. Covers: 8 invariants from 7 PRs of fix cycles,
(6) Adding diagnostic log tags, (7) Modifying session restore paths
(RestoreSingleSessionAsync) that must initialize watchdog-dependent state,
(8) Modifying ReconcileOrganization or any code that reads Organization.Sessions
during the IsRestoring window. Covers: 9 invariants from 8 PRs of fix cycles,
the 8 code paths that clear IsProcessing, and common regression patterns.
---

Expand Down Expand Up @@ -55,7 +58,7 @@ that crash the app. The DB is a write-through cache; `events.jsonl` is the sourc
and replays on session resume via `BulkInsertAsync`. DB write failures are self-healing.
**NEVER narrow the ChatDatabase catch filters** — use `catch (Exception ex)` always.

## 8 Invariants
## 9 Invariants

### INV-1: Complete state cleanup
Every IsProcessing=false path clears ALL fields. See checklist above.
Expand Down Expand Up @@ -89,7 +92,20 @@ Clearing guarded on `!hasActiveTool && !HasUsedToolsThisTurn`.
`HandleComplete` is already on UI thread. `InvokeAsync` defers execution
causing stale renders.

## Top 4 Recurring Mistakes
### INV-9: Session restore must initialize all watchdog-dependent state
The restore path (`RestoreSingleSessionAsync`) is separate from `SendPromptAsync`.
Any field that affects watchdog timeout selection or dispatch routing must be
initialized in BOTH paths:
- `IsMultiAgentSession` — set via `IsSessionInMultiAgentGroup()` before `StartProcessingWatchdog`
- `HasReceivedEventsSinceResume` / `HasUsedToolsThisTurn` — set via `GetEventsFileRestoreHints()`
- `IsResumed` — set on the `AgentSessionInfo` when `isStillProcessing` is true

When `ReconcileOrganization` hasn't run yet (during `IsRestoring` window),
`Organization.Sessions` metadata may be stale. Any code that reads metadata
during this window must call `ReconcileOrganization(allowPruning: false)` first.
This additive mode safely adds missing entries without pruning loading sessions.

## Top 5 Recurring Mistakes

1. **Incomplete cleanup** — modifying one IsProcessing path without
updating ALL fields that must be cleared simultaneously.
Expand All @@ -101,8 +117,15 @@ causing stale renders.
must be called at every point where accumulated text could be lost
(turn_end, tool_start, abort, error, watchdog). The turn_end call
was missing until PR #224, causing response loss on app restart.
5. **Missing state initialization on session restore** — `IsMultiAgentSession`,
`IsResumed`, and other flags must be set on restored sessions BEFORE
`StartProcessingWatchdog` is called. The restore path in
`RestoreSingleSessionAsync` is separate from `SendPromptAsync` and must
independently initialize all state the watchdog depends on. PR #284 fixed
`IsMultiAgentSession` not being set during restore, causing the watchdog
to use 120s instead of 600s for multi-agent workers.

## Regression History

8 PRs of fix/regression cycles: #141 → #147 → #148 → #153 → #158 → #163 → #164 → #276.
9 PRs of fix/regression cycles: #141 → #147 → #148 → #153 → #158 → #163 → #164 → #276 → #284.
See `references/regression-history.md` for the full timeline with root causes.
Original file line number Diff line number Diff line change
Expand Up @@ -50,3 +50,23 @@
## Known Remaining Issues
- `HasUsedToolsThisTurn` resets use plain assignment (not Volatile.Write)
- InvokeOnUI dispatch for IsResumed adds 15s delay (one watchdog cycle)

## PR #284 — Multi-Agent Dispatch Bypass + Premature Watchdog on Restore
- **Root cause**: `ReconcileOrganization` was fully skipped during `IsRestoring=true`
to prevent pruning sessions not yet loaded. But `CompleteResponse` runs during
restore (when a restored turn completes), and it needs metadata to:
1. Route through multi-agent dispatch (`GetOrchestratorGroupId`)
2. Apply correct watchdog timeout (`IsSessionInMultiAgentGroup`)
- **Symptom 1**: Orchestrator generated `@worker` dispatch commands but they were
silently ignored — dispatch pipeline bypassed because `GetOrchestratorGroupId`
returned `null` (no metadata entry for the restored session)
- **Symptom 2**: Sessions killed as "stuck" after ~2 minutes — `IsMultiAgentSession`
was never set on restored sessions, so watchdog used 120s instead of 600s
- **Fix 1**: `ReconcileOrganization(allowPruning: false)` — new additive mode that
adds missing metadata without deleting anything. Called in `CompleteResponse`
during the `IsRestoring` window.
- **Fix 2**: Set `state.IsMultiAgentSession` in `RestoreSingleSessionAsync` before
`StartProcessingWatchdog` (reads from organization.json loaded at startup).
- **Pattern**: This is a variant of mistake #5 ("Missing state initialization on
session restore") — the restore path must independently initialize ALL state that
`SendPromptAsync` initializes, because events/watchdog/dispatch all depend on it.
5 changes: 5 additions & 0 deletions PolyPilot/Services/CopilotService.Events.cs
Original file line number Diff line number Diff line change
Expand Up @@ -892,6 +892,11 @@ private void CompleteResponse(SessionState state, long? expectedGeneration = nul

// Check if the dequeued message is for an orchestrator session — if so,
// route through the multi-agent dispatch pipeline instead of direct send.

// If we are restoring, the global ReconcileOrganization() hasn't run yet.
// We must force an additive-only update so this session's metadata exists.
if (IsRestoring) ReconcileOrganization(allowPruning: false);

var orchGroupId = GetOrchestratorGroupId(state.Info.Name);

// Use Task.Run to dispatch on a clean stack frame, avoiding reentrancy
Expand Down
22 changes: 14 additions & 8 deletions PolyPilot/Services/CopilotService.Organization.cs
Original file line number Diff line number Diff line change
Expand Up @@ -207,14 +207,14 @@ private void WriteOrgFile(string json)
/// Skips work if the active session set hasn't changed since last reconciliation.
/// </summary>
private int _lastReconcileSessionHash;
internal void ReconcileOrganization()
internal void ReconcileOrganization(bool allowPruning = true)
{
var activeNames = _sessions.Where(kv => !kv.Value.Info.IsHidden).Select(kv => kv.Key).ToHashSet();

// Safety: skip reconciliation during startup when sessions haven't been restored yet.
// LoadOrganization loads the org before RestorePreviousSessionsAsync populates _sessions,
// so reconciling then would prune all sessions. Use IsRestoring as the precise scope guard.
if (IsRestoring)
if (IsRestoring && allowPruning)
{
Debug("ReconcileOrganization: skipping — session restore in progress");
return;
Expand All @@ -232,7 +232,10 @@ internal void ReconcileOrganization()
var currentHash = activeNames.Count;
unchecked { foreach (var name in activeNames) currentHash += name.GetHashCode() * 31; }
if (currentHash == _lastReconcileSessionHash && currentHash != 0) return;
_lastReconcileSessionHash = currentHash;
// Only update the hash when doing a full reconciliation (with pruning).
// Additive-only calls (allowPruning=false) during restore must not poison the cache,
// or the post-restore full reconciliation will be skipped via hash match. (PR #284 review)
if (allowPruning) _lastReconcileSessionHash = currentHash;
bool changed = false;

// Build lookup of multi-agent group IDs so we can protect their sessions
Expand Down Expand Up @@ -365,13 +368,16 @@ internal void ReconcileOrganization()
.Select(m => m.SessionName));

// Remove metadata only for sessions that are truly gone (not in any known set)
var toRemove = Organization.Sessions.Where(m => !knownNames.Contains(m.SessionName) && !protectedNames.Contains(m.SessionName)).ToList();
if (toRemove.Count > 0)
if (allowPruning)
{
Debug($"ReconcileOrganization: pruning {toRemove.Count} sessions: {string.Join(", ", toRemove.Select(m => m.SessionName))}");
changed = true;
var toRemove = Organization.Sessions.Where(m => !knownNames.Contains(m.SessionName) && !protectedNames.Contains(m.SessionName)).ToList();
if (toRemove.Count > 0)
{
Debug($"ReconcileOrganization: pruning {toRemove.Count} sessions: {string.Join(", ", toRemove.Select(m => m.SessionName))}");
changed = true;
}
Organization.Sessions.RemoveAll(m => !knownNames.Contains(m.SessionName) && !protectedNames.Contains(m.SessionName));
}
Organization.Sessions.RemoveAll(m => !knownNames.Contains(m.SessionName) && !protectedNames.Contains(m.SessionName));

if (changed) SaveOrganization();
}
Expand Down
7 changes: 7 additions & 0 deletions PolyPilot/Services/CopilotService.cs
Original file line number Diff line number Diff line change
Expand Up @@ -1321,6 +1321,13 @@ public async Task<AgentSessionInfo> ResumeSessionAsync(string sessionId, string
Info = info
};

// Cache multi-agent membership for the watchdog timeout tier.
// Must be set BEFORE StartProcessingWatchdog — otherwise the watchdog uses the
// 120s inactivity timeout instead of the 600s tool timeout, killing workers prematurely.
// IsSessionInMultiAgentGroup reads Organization.Sessions which was loaded from disk
// by LoadOrganization() before RestorePreviousSessionsAsync runs.
state.IsMultiAgentSession = IsSessionInMultiAgentGroup(displayName);

// Wire up event handler BEFORE starting watchdog/timeout so events
// arriving immediately after SDK resume are not missed.
copilotSession.On(evt => HandleSessionEvent(state, evt));
Expand Down