Merged
17 commits
- `2e26c72` Strengthen multi-agent skills, Fix flow, and stability tests (PureWeen, Mar 15, 2026)
- `75ddcd1` Fix dead event stream after worker revival + long-running session saf… (PureWeen, Mar 15, 2026)
- `eaab7fd` Fix premature session.idle + PR #375 review findings (PureWeen, Mar 15, 2026)
- `4d61424` Queue messages to busy orchestrators instead of steering (PureWeen, Mar 15, 2026)
- `1403381` Add orchestrator-steer conflict regression tests (PureWeen, Mar 15, 2026)
- `1d68dd1` Address PR review findings: crash log path, ReadAllLines, orchestrato… (PureWeen, Mar 15, 2026)
- `66a1cb5` Add premature session.idle recovery for multi-agent workers (PureWeen, Mar 15, 2026)
- `f4e297d` fix: replace polling with ManualResetEventSlim for premature idle det… (PureWeen, Mar 16, 2026)
- `5ea0318` Fix premature idle recovery: events.jsonl freshness + multi-round loop (PureWeen, Mar 16, 2026)
- `7dd372e` fix: lazy session resume to prevent blue screen on startup (PureWeen, Mar 16, 2026)
- `52e1a3d` Fix SyncContext deadlock causing blue screen on launch (PureWeen, Mar 16, 2026)
- `04d23a7` Fix 4-model code review findings: recovery loop, thread safety, race (PureWeen, Mar 16, 2026)
- `3454da3` Fix disk fallback paths missing dispatchTime filter (PR review N4) (PureWeen, Mar 16, 2026)
- `ce3f62c` Fix flaky RenameSession bridge tests: wait for initial sync (PureWeen, Mar 16, 2026)
- `875eec1` Fix Round 5 review findings: DateTime mismatch, OCE discard, resume f… (PureWeen, Mar 16, 2026)
- `87a6dc7` Fix Round 6 review findings: wrong client for lazy-resume fallback, O… (PureWeen, Mar 16, 2026)
- `7049eae` Dispose PrematureIdleSignal on session teardown + eager resume for in… (PureWeen, Mar 16, 2026)
504 changes: 498 additions & 6 deletions .claude/skills/multi-agent-orchestration/SKILL.md


55 changes: 54 additions & 1 deletion .claude/skills/processing-state-safety/SKILL.md
@@ -34,7 +34,9 @@ Every code path that sets `IsProcessing = false` MUST also:
12. Run on UI thread (via `InvokeOnUI()` or already on UI thread)
13. After changes, run `ProcessingWatchdogTests.cs` to catch regressions

## The 9 Paths That Clear IsProcessing
## The 10 Paths That Set/Clear IsProcessing

### Paths that CLEAR IsProcessing (→ false)

| # | Path | File | Thread | Notes |
|---|------|------|--------|-------|
@@ -48,6 +50,17 @@ Every code path that sets `IsProcessing = false` MUST also:
| 8 | SendAsync initial failure | CopilotService.cs | UI | Prompt send failed |
| 9 | Bridge OnTurnEnd | Bridge.cs | Background → InvokeOnUI | Remote mode turn complete |

### Path that RE-ARMS IsProcessing (→ true)

| # | Path | File | Thread | Notes |
|---|------|------|--------|-------|
| 10 | TurnStart re-arm | Events.cs | Background → InvokeOnUI | Premature session.idle recovery (PR #375) |

Path #10 fires when an `AssistantTurnStartEvent` arrives while `IsProcessing=false` on the
current, non-orphaned state. This signals a premature `session.idle`: the SDK sent idle
mid-turn and then kept going. The re-arm sets `IsProcessing=true`, restarts the watchdog, and
logs `[EVT-REARM]`. It does NOT create a new TCS, because the old one was already completed
with partial content.
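The re-arm check can be sketched as follows. This is a minimal, single-threaded illustration under stated assumptions: `SessionState` is a simplified hypothetical stand-in, `WatchdogRestarts` replaces the real watchdog restart, `TurnStartRearm.OnAssistantTurnStart` is a hypothetical name, and the TCS is assumed to be a `TaskCompletionSource<string>` held in `ResponseCompletion`. The real handler marshals onto the UI thread via `InvokeOnUI()`, which this sketch omits.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical, simplified stand-in for the real SessionState.
class SessionState
{
    public volatile bool IsOrphaned;
    public bool IsProcessing;
    public TaskCompletionSource<string> ResponseCompletion = new(); // assumed TCS type
    public int WatchdogRestarts; // counts simulated watchdog restarts
}

static class TurnStartRearm
{
    // Handles an AssistantTurnStartEvent; returns true if it re-armed.
    // The real handler runs this on the UI thread via InvokeOnUI().
    public static bool OnAssistantTurnStart(SessionState state)
    {
        if (state.IsOrphaned) return false;   // stale session: ignore (INV-14)
        if (state.IsProcessing) return false; // normal turn start: nothing to recover

        // IsProcessing is false but a turn is starting: session.idle was premature.
        state.IsProcessing = true;
        state.WatchdogRestarts++;             // stands in for restarting the watchdog
        Console.WriteLine("[EVT-REARM] IsProcessing re-armed after premature session.idle");

        // No new TaskCompletionSource: the existing one already completed
        // with the partial content, so ResponseCompletion is left as-is.
        return true;
    }
}
```

Keeping the completed TCS means any orchestrator that already awaited it has its partial result; the re-arm only restores the processing flag and watchdog coverage for the rest of the turn.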

## Content Persistence Safety

### Turn-End Flush
@@ -170,6 +183,46 @@ Use the class-level `InvokeOnUI()` method in all `Task.Run` and timer callbacks
for explicit, unambiguous UI thread dispatch. The local `Invoke` works but the
intent is less clear when reading cross-threaded code.

### INV-14: IsOrphaned guards on all event/timer entry points (PR #373)
When a `SessionState` is orphaned (after reconnect creates a replacement):
1. Set `state.IsOrphaned = true` (volatile)
2. Set `ProcessingGeneration = long.MaxValue` (prevents any generation check from passing)
3. Call `state.ResponseCompletion?.TrySetCanceled()` (unblocks orchestrator waits)

ALL event/timer entry points must check `state.IsOrphaned` and return immediately:
- `HandleSessionEvent` (line ~214)
- `CompleteResponse` (line ~913) — TrySetCanceled + return
- Watchdog loop (line ~1820) — exit loop
- Watchdog InvokeOnUI callbacks (line ~2095) — skip
- Tool health/recovery handlers — skip

Without this, stale SDK events from the disposed old `CopilotSession` pass through
to the shared `Info` object and corrupt the replacement session's state.
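The three orphaning steps and the entry-point guard can be sketched as below. Everything here is a simplified, hypothetical stand-in (`SessionState`, `OrphanGuards`, `EventsApplied`, `IsCurrent`); only the field names `IsOrphaned`, `ProcessingGeneration`, and `ResponseCompletion` come from INV-14, and the TCS type is assumed. The sketch shows why poisoning the generation works: a callback that captured an earlier generation value no longer matches once it is set to `long.MaxValue`.

```csharp
using System.Threading;
using System.Threading.Tasks;

// Hypothetical, simplified SessionState; only the three field names come from INV-14.
class SessionState
{
    public volatile bool IsOrphaned;
    public long ProcessingGeneration;
    public TaskCompletionSource<string> ResponseCompletion = new(); // assumed TCS type
    public int EventsApplied; // counts events that reached session state
}

static class OrphanGuards
{
    // The three orphaning steps, in order.
    public static void Orphan(SessionState state)
    {
        state.IsOrphaned = true;                                              // 1. volatile flag
        Interlocked.Exchange(ref state.ProcessingGeneration, long.MaxValue); // 2. poison generation
        state.ResponseCompletion?.TrySetCanceled();                          // 3. unblock waiters
    }

    // Entry-point guard in the style of HandleSessionEvent: orphaned states return immediately.
    public static void HandleSessionEvent(SessionState state, string evt)
    {
        if (state.IsOrphaned) return;
        state.EventsApplied++;
    }

    // Generation check used by timer/watchdog callbacks that captured a generation earlier.
    public static bool IsCurrent(SessionState state, long capturedGeneration) =>
        !state.IsOrphaned &&
        Interlocked.Read(ref state.ProcessingGeneration) == capturedGeneration;
}
```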

### INV-15: TryUpdate for atomic state swaps (PR #373)
When replacing a `SessionState` in `_sessions` after reconnect, use
`_sessions.TryUpdate(key, newState, expectedOldState)` instead of
`_sessions[key] = newState`. This prevents a stale `Task.Run` (from an earlier
reconnect) from overwriting a newer reconnect's state. If `TryUpdate` returns false,
discard the result: another reconnect has already replaced the entry.
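The race that `TryUpdate` closes can be reproduced with a plain `ConcurrentDictionary`. In this sketch, strings stand in for `SessionState` objects and the key `worker-1` is hypothetical. `TryUpdate` only swaps when the current value equals the expected one (using the value type's default equality comparer, which is reference equality for a class that does not override `Equals`), so the stale writer loses instead of clobbering the newer state.

```csharp
using System;
using System.Collections.Concurrent;

// Strings stand in for SessionState objects (assumption for brevity).
var sessions = new ConcurrentDictionary<string, string>();
sessions["worker-1"] = "state-v0";

// Both reconnects captured the same original state before starting.
string capturedByOldReconnect = sessions["worker-1"];
string capturedByNewReconnect = sessions["worker-1"];

// The newer reconnect finishes first and publishes its replacement.
bool newWon = sessions.TryUpdate("worker-1", "state-v2", capturedByNewReconnect);

// The stale Task.Run finishes later. Its expected value no longer matches,
// so TryUpdate returns false and the stale result is discarded.
bool staleWon = sessions.TryUpdate("worker-1", "state-v1", capturedByOldReconnect);

Console.WriteLine($"{newWon} {staleWon} {sessions["worker-1"]}"); // prints: True False state-v2
```

With `_sessions[key] = newState`, the second write would have succeeded unconditionally and left the older reconnect's state in the dictionary.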

### INV-16: Register handler BEFORE publishing to dictionary (PR #373)
When creating a new `SessionState` (reconnect or sibling re-resume):
```csharp
resumed.On(evt => HandleSessionEvent(newState, evt)); // 1. Handler first
_sessions.TryUpdate(key, newState, oldState); // 2. Publish second
```
If the order is reversed, events can arrive in the window before the handler is
registered, and those events are lost permanently.

### INV-17: Sibling re-resume must reload MCP servers (PR #373)
Both the primary reconnect path and the sibling loop must call:
- `cfg.LoadMcpServers()` — MCP server handles are tied to the disposed client
- `cfg.LoadSkillDirectories()` — same issue

The primary path was missing these until PR #373 Round 5. Asymmetry between
the sibling and primary reconnect configs is a recurring bug pattern.
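One way to rule out that asymmetry is to route both reconnect paths through a single config builder. This is a sketch under assumptions: `ResumeConfig`, `Reconnect`, `BuildResumeConfig`, `ResumePrimary`, and `ResumeSiblings` are hypothetical names, and the `Load*` bodies only record that they ran; only `LoadMcpServers()` and `LoadSkillDirectories()` come from INV-17.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical config type: only the two Load* method names come from INV-17.
// The bodies just record that each loader ran.
class ResumeConfig
{
    public List<string> Loaded { get; } = new();
    public void LoadMcpServers() => Loaded.Add("mcp");
    public void LoadSkillDirectories() => Loaded.Add("skills");
}

static class Reconnect
{
    // Single builder shared by both paths. MCP server handles and skill
    // directories are tied to the disposed client, so every re-resume
    // must reload them.
    public static ResumeConfig BuildResumeConfig()
    {
        var cfg = new ResumeConfig();
        cfg.LoadMcpServers();
        cfg.LoadSkillDirectories();
        return cfg;
    }

    // Primary reconnect path.
    public static ResumeConfig ResumePrimary() => BuildResumeConfig();

    // Sibling loop: one config per sibling, built by the same helper.
    public static List<ResumeConfig> ResumeSiblings(IEnumerable<string> siblingIds) =>
        siblingIds.Select(_ => BuildResumeConfig()).ToList();
}
```

Because neither path assembles its own config, a loader added to `BuildResumeConfig` reaches both, which is exactly the drift INV-17 warns about.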

## Top 5 Recurring Mistakes

1. **Incomplete cleanup** — modifying one IsProcessing path without
512 changes: 512 additions & 0 deletions PolyPilot.Tests/LongRunningSessionSafetyTests.cs

