…ulti-agent synthesis by ~16s

## Problem
In RecoverFromPrematureIdleIfNeededAsync, after a worker completes legitimately, IsEventsFileActive() returned true because events.jsonl was just written by the idle event itself (mtime < PrematureIdleEventsFileFreshnessSeconds = 15s). This triggered unnecessary DISPATCH-RECOVER polling (8 rounds over ~16 seconds) before the code finally confirmed the worker was truly done.

Observed in diagnostics (FlyoutLeak session, 2026-03-22):

    [COMPLETE] 'FlyoutLeak-worker-1' CompleteResponse executing (responseLen=0, flushedLen=3406)
    [DISPATCH-RECOVER] premature idle detected via events.jsonl freshness — truncated response=3406 chars
    [DISPATCH-RECOVER] events.jsonl stale and not processing after round 8 — worker is truly done (3406 chars)
    [DISPATCH-RECOVER] recovery found no additional content after 8 rounds — using original response (3406 chars)

(16 seconds wasted; synthesis was delayed by this amount per worker)

## Fix
Replace the raw IsEventsFileActive() check with an mtime-comparison approach:
1. Snapshot events.jsonl mtime at TCS completion
2. Wait PrematureIdleEventsGracePeriodMs (2s) to observe file activity
3. Only declare premature idle if mtime CHANGED, proving the CLI is still writing

Normal completions: idle event writes events.jsonl → mtime frozen during 2s → not detected
True premature idles: CLI continues writing events → mtime advances → detected correctly

The polling loop was also updated to compare against the stable mtime baseline rather than using raw freshness, making it consistent with the initial detection approach.

## New constant
PrematureIdleEventsGracePeriodMs = 2000: grace period for mtime observation

## New helper
GetEventsFileMtime(sessionId): internal DateTime? helper returning events.jsonl mtime

## Tests added (5 new, all in MultiAgentRegressionTests.cs)
- PrematureIdleEventsGracePeriodMs_ConstantExists: bounds check (500ms–5s)
- GetEventsFileMtime_HelperExists: structural check for internal DateTime? helper
- RecoverFromPrematureIdleIfNeededAsync_UsesMtimeComparisonForInitialDetection
- RecoverFromPrematureIdleIfNeededAsync_PollingLoopUsesMtimeComparison
- HasDiskFallback char window increased 8000→12000 (method grew by ~10 lines)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
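
The snapshot, wait, and compare steps described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the class name `PrematureIdleSketch`, the method `IsPrematureIdleAsync`, and the choice to pass a file path (rather than a `sessionId`) are assumptions made so the example is self-contained. Only the constant name and its 2000 ms value come from the commit message.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of mtime-comparison detection.
public static class PrematureIdleSketch
{
    // Grace period named in the commit message (2000 ms).
    public const int PrematureIdleEventsGracePeriodMs = 2000;

    // Assumed shape of the mtime helper: returns null when the file is missing.
    public static DateTime? GetEventsFileMtime(string eventsFilePath) =>
        File.Exists(eventsFilePath)
            ? File.GetLastWriteTimeUtc(eventsFilePath)
            : (DateTime?)null;

    // Returns true only if the file's mtime changed during the grace period,
    // which suggests something kept writing after the idle event.
    public static async Task<bool> IsPrematureIdleAsync(
        string eventsFilePath,
        int gracePeriodMs = PrematureIdleEventsGracePeriodMs,
        CancellationToken ct = default)
    {
        DateTime? before = GetEventsFileMtime(eventsFilePath); // 1. snapshot once
        await Task.Delay(gracePeriodMs, ct);                   // 2. observe
        DateTime? after = GetEventsFileMtime(eventsFilePath);  // 3. compare once

        // A file that exists afterwards with a different mtime counts as activity.
        return after.HasValue && after != before;
    }
}
```

Because the check compares two snapshots instead of measuring file age, a file that was written by the idle event and then left alone reports no activity, regardless of how recently it was written.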
…re-existing Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… migration

ReconcileOrganization's promotion/heal migration compared wt.Path (raw, potentially using forward slashes or relative forms) against normalizedExtPath/normalizedLocalPath (computed via Path.GetFullPath). On Windows, unix-style paths like /tmp/wt-1 normalize to C:\tmp\wt-1, causing the equality check to always fail and incorrectly treating every session as 'stranded', migrating them to newly-created URL groups.

Two sites fixed:
1. Promotion migration (line ~673): wt.Path normalized before StartsWith/Equals check
2. Heal stranded sessions (line ~696): wt.Path normalized before StartsWith/Equals check

This fixes Scenario_MixedSessions_ReconcileDoesNotScatter, which was failing because sessions in a repo group were incorrectly being migrated to a different group ID.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
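
The idea of normalizing both sides before a StartsWith/Equals check can be illustrated with a small sketch. The helper name `IsUnderOrEqual`, the class name, and the case-insensitive comparison on Windows are assumptions for illustration; the commit only states that wt.Path is now normalized via Path.GetFullPath before comparison.

```csharp
using System;
using System.IO;

// Hypothetical sketch: compare paths only after normalizing both of them.
public static class WorktreePathSketch
{
    // True when worktreePath equals groupRoot or lies beneath it.
    public static bool IsUnderOrEqual(string worktreePath, string groupRoot)
    {
        string wt = Normalize(worktreePath);
        string root = Normalize(groupRoot);

        // Assumption: Windows paths are compared case-insensitively.
        StringComparison cmp = OperatingSystem.IsWindows()
            ? StringComparison.OrdinalIgnoreCase
            : StringComparison.Ordinal;

        // Appending the separator avoids matching sibling folders such as
        // "repo-other" against the root "repo".
        return wt.Equals(root, cmp)
            || wt.StartsWith(root + Path.DirectorySeparatorChar, cmp);
    }

    // Resolves relative segments and separators, then drops a trailing separator.
    private static string Normalize(string path) =>
        Path.TrimEndingDirectorySeparator(Path.GetFullPath(path));
}
```

Comparing a raw path against a normalized one fails whenever the two differ only in form, which is the failure mode the commit describes for unix-style paths on Windows.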
Extend the existing 'zero tolerance' callout to explicitly prohibit worker/implementer agents from dismissing failures as 'pre-existing'. The previous wording was clear but insufficient — agents still reported pre-existing failures without fixing them. New language makes the requirement unambiguous: encountering a pre-existing failure and not fixing it is itself a task failure. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🔍 PR Review Squad -- Round 2 (re-review with multi-agent skill analysis)
4 commits, HEAD
Previous Findings Status
New Findings (2+ model consensus)
Multi-Agent Skill Consistency
✅ Cardinal rule respected: Mtime-comparison snapshots ONCE before grace period, compares ONCE after, distinct from the ❌-listed "mtime staleness counter (kill after N unchanged checks)". Long-running sessions with thinking pauses are safe.
✅ IDLE-DEFER primacy: Primary fix (PR #399 BackgroundTasks check) unaffected. This PR improves defense-in-depth only.
Verdict: ✅ Approve (2 non-blocking follow-ups)
Core fix is sound and correctly eliminates false positives without violating long-running session safety invariants. N2 and N3 are pre-existing or documentation issues that don't block merging:
## Problem
When a concurrent worker's connection error sets `IsInitialized=false`,
all subsequent workers in the same orchestration run immediately throw
`InvalidOperationException("Service not initialized")`. This was **not
treated as retryable** in `ExecuteWorkerAsync`, causing the entire
worker wave to fail at 0.0s elapsed — the "all workers failed with
'Service not initialized'" pattern seen during PR #421 review.
## Root Cause
`ExecuteWorkerAsync`'s retry gate only catches `IsConnectionError(ex)`:
```csharp
catch (Exception ex) when (attempt < maxRetries && IsConnectionError(ex))
```
`InvalidOperationException` is not an `IOException`/`SocketException`,
so it falls straight through to the final catch and returns a failed
`WorkerResult`.
## Fix
1. **`CopilotService.Utilities.cs`**: Add `IsInitializationError()` —
matches `InvalidOperationException` with "not initialized" in the
message.
2. **`CopilotService.Organization.cs`** (~line 2250): Extend the retry
gate:
```csharp
catch (Exception ex) when (attempt < maxRetries &&
(IsConnectionError(ex) || IsInitializationError(ex)))
```
And inside the catch, attempt lazy `InitializeAsync()` before the 2s
delay so the next attempt finds the client ready.
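
A rough sketch of how the extended retry gate might fit together is shown below. It is illustrative only: `RetrySketch`, `RunWithRetryAsync`, the `reinitialize` callback, and the stand-in `IsConnectionError` body are assumptions, not the code in `CopilotService`. What comes from the description above is the filter shape, the "not initialized" message match, the case-insensitivity, and the re-initialize-then-delay order.

```csharp
using System;
using System.IO;
using System.Net.Sockets;
using System.Threading.Tasks;

// Hypothetical sketch of a retry gate that also retries initialization errors.
public static class RetrySketch
{
    // Stand-in only: the real IsConnectionError is not shown in this PR.
    public static bool IsConnectionError(Exception ex) =>
        ex is IOException || ex is SocketException;

    // InvalidOperationException whose message contains "not initialized",
    // matched case-insensitively.
    public static bool IsInitializationError(Exception ex) =>
        ex is InvalidOperationException &&
        ex.Message.Contains("not initialized", StringComparison.OrdinalIgnoreCase);

    public static async Task<T> RunWithRetryAsync<T>(
        Func<Task<T>> work,
        Func<Task> reinitialize,   // plays the role of a lazy InitializeAsync()
        int maxRetries = 2,
        int delayMs = 2000)
    {
        for (int attempt = 0; ; attempt++)
        {
            try
            {
                return await work();
            }
            catch (Exception ex) when (attempt < maxRetries &&
                (IsConnectionError(ex) || IsInitializationError(ex)))
            {
                // Best-effort re-initialization before the delay; if it fails,
                // the next attempt surfaces the error again.
                try { await reinitialize(); } catch { }
                await Task.Delay(delayMs);
            }
        }
    }
}
```

Once `attempt` reaches `maxRetries`, the filter evaluates to false and the exception propagates to the caller, which mirrors the "final catch" path described in the Root Cause section.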
## Tests
10 new tests in `InitializationErrorDetectionTests` covering:
- True/false detection for `InvalidOperationException` variants
- Case-insensitivity
- Wrong exception type returns false
- Structural verification that the retry gate includes both checks
**All 2911 tests pass. Build clean.**
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ulti-agent synthesis by ~16s (PureWeen#421)
## Problem
After a worker completed legitimately, `IsEventsFileActive()` returned true because `events.jsonl` was just written by the idle event itself (mtime < 15s threshold). This triggered 8 rounds of unnecessary DISPATCH-RECOVER polling (~16 seconds) per worker before synthesis could proceed.
## Root Cause
Detection compared file age against an absolute threshold, making it impossible to distinguish 'file fresh because idle event just wrote it' (false positive) from 'file fresh because CLI is still actively writing events' (true positive).
## Fix
Replaced absolute-threshold detection with mtime-comparison approach:
1. Snapshot events.jsonl mtime at TCS completion
2. Wait `PrematureIdleEventsGracePeriodMs` (2s) to observe activity
3. Only flag premature idle if mtime changed during that window
## Impact
- Normal completion: returns in 2s with no recovery loop (was: 16s wasted)
- True premature idle: detected within 2s if CLI writes events
- No risk to long-running sessions
## Also fixed: worktree path normalization in ReconcileOrganization
The promotion/heal migration code was comparing raw `wt.Path` against `Path.GetFullPath()`-normalized paths. On Windows, unix-style paths like `/tmp/wt-1` normalize to `C:\tmp\wt-1`, causing the equality check to always fail and incorrectly treating sessions as stranded. Both comparison sites now normalize `wt.Path` via `Path.GetFullPath()` before comparison.
## Tests
- 5 new structural tests in `MultiAgentRegressionTests.cs` covering the mtime-comparison fix
- All 2905 tests pass (including `Scenario_MixedSessions_ReconcileDoesNotScatter`, which was previously failing)

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>