Skip to content

fix: eliminate false-positive premature idle detection that delayed multi-agent synthesis by ~16s#421

Merged
PureWeen merged 4 commits intomainfrom
MultiAgent-Fix-orchestrator-af46
Mar 23, 2026
Merged

fix: eliminate false-positive premature idle detection that delayed multi-agent synthesis by ~16s#421
PureWeen merged 4 commits intomainfrom
MultiAgent-Fix-orchestrator-af46

Conversation

@PureWeen
Copy link
Copy Markdown
Owner

Problem

After a worker completed legitimately, \IsEventsFileActive()\ returned true because \�vents.jsonl\ was just written by the idle event itself (mtime < 15s threshold). This triggered 8 rounds of unnecessary DISPATCH-RECOVER polling (~16 seconds) per worker before synthesis could proceed.

Root Cause

Detection compared file age against an absolute threshold, making it impossible to distinguish 'file fresh because idle event just wrote it' (false positive) from 'file fresh because CLI is still actively writing events' (true positive).

Fix

Replaced absolute-threshold detection with mtime-comparison approach:

  1. Snapshot events.jsonl mtime at TCS completion
  2. Wait \PrematureIdleEventsGracePeriodMs\ (2s) to observe activity
  3. Only flag premature idle if mtime changed during that window

Impact

  • Normal completion: returns in 2s with no recovery loop (was: 16s wasted)
  • True premature idle: detected within 2s if CLI writes events
  • No risk to long-running sessions

Also fixed: worktree path normalization in ReconcileOrganization

The promotion/heal migration code was comparing raw \wt.Path\ against \Path.GetFullPath()-normalized paths. On Windows, unix-style paths like /tmp/wt-1\ normalize to \C:\tmp\wt-1, causing the equality check to always fail and incorrectly treating sessions as stranded. Both comparison sites now normalize \wt.Path\ via \Path.GetFullPath()\ before comparison.

Tests

  • 5 new structural tests in \MultiAgentRegressionTests.cs\ covering the mtime-comparison fix
  • All 2905 tests pass (including \Scenario_MixedSessions_ReconcileDoesNotScatter\ which was previously failing)

PureWeen and others added 4 commits March 23, 2026 15:15
…ulti-agent synthesis by ~16s

## Problem

In RecoverFromPrematureIdleIfNeededAsync, after a worker completes legitimately,
IsEventsFileActive() returned true because events.jsonl was just written by the idle
event itself (mtime < PrematureIdleEventsFileFreshnessSeconds = 15s). This triggered
unnecessary DISPATCH-RECOVER polling — 8 rounds over ~16 seconds — before the code
finally confirmed the worker was truly done.

Observed in diagnostics (FlyoutLeak session, 2026-03-22):
  [COMPLETE] 'FlyoutLeak-worker-1' CompleteResponse executing (responseLen=0, flushedLen=3406)
  [DISPATCH-RECOVER] premature idle detected via events.jsonl freshness — truncated response=3406 chars
  [DISPATCH-RECOVER] events.jsonl stale and not processing after round 8 — worker is truly done (3406 chars)
  [DISPATCH-RECOVER] recovery found no additional content after 8 rounds — using original response (3406 chars)
  (16 seconds wasted; synthesis was delayed by this amount per worker)

## Fix

Replace the raw IsEventsFileActive() check with a mtime-comparison approach:
1. Snapshot events.jsonl mtime at TCS completion
2. Wait PrematureIdleEventsGracePeriodMs (2s) to observe file activity
3. Only declare premature idle if mtime CHANGED — proving CLI is still writing

Normal completions: idle event writes events.jsonl → mtime frozen during 2s → not detected
True premature idles: CLI continues writing events → mtime advances → detected correctly

The polling loop also updated to compare against the stable mtime baseline rather than
using raw freshness, making it consistent with the initial detection approach.

## New constant

PrematureIdleEventsGracePeriodMs = 2000 — grace period for mtime observation

## New helper

GetEventsFileMtime(sessionId) — internal DateTime? returning events.jsonl mtime

## Tests added (5 new, all in MultiAgentRegressionTests.cs)

- PrematureIdleEventsGracePeriodMs_ConstantExists — bounds check (500ms–5s)
- GetEventsFileMtime_HelperExists — structural check for internal DateTime? helper
- RecoverFromPrematureIdleIfNeededAsync_UsesMtimeComparisonForInitialDetection
- RecoverFromPrematureIdleIfNeededAsync_PollingLoopUsesMtimeComparison
- HasDiskFallback char window increased 8000→12000 (method grew by ~10 lines)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…re-existing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… migration

ReconcileOrganization's promotion/heal migration compared wt.Path (raw, potentially
using forward slashes or relative forms) against normalizedExtPath/normalizedLocalPath
(computed via Path.GetFullPath). On Windows, unix-style paths like /tmp/wt-1 normalize
to C:\tmp\wt-1, causing the equality check to always fail and incorrectly treating
every session as 'stranded', migrating them to newly-created URL groups.

Two sites fixed:
1. Promotion migration (line ~673): wt.Path normalized before StartsWith/Equals check
2. Heal stranded sessions (line ~696): wt.Path normalized before StartsWith/Equals check

This fixes Scenario_MixedSessions_ReconcileDoesNotScatter which was failing because
sessions in a repo group were incorrectly being migrated to a different group ID.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extend the existing 'zero tolerance' callout to explicitly prohibit worker/implementer
agents from dismissing failures as 'pre-existing'. The previous wording was clear but
insufficient — agents still reported pre-existing failures without fixing them.

New language makes the requirement unambiguous: encountering a pre-existing failure
and not fixing it is itself a task failure.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

🔍 PR Review Squad -- Round 2 (re-review with multi-agent skill analysis)

4 commits, HEAD bd7ba54d | Tests: ✅ 235/235 MultiAgentRegressionTests pass


Previous Findings Status

  • N1 🟢 OCE propagation in polling loopSTILL PRESENT (4/5 models). WaitForPrematureIdleSignal() catches ObjectDisposedException but not OperationCanceledException when cancellationToken fires in the detection polling loop. OCE propagates to the method's outer catch (OperationCanceledException) which returns bestResponse ?? initialResponse — behavior is safe, just inconsistent with the explicit catch 15 lines above in the grace period block. Non-blocking.

New Findings (2+ model consensus)

# Severity Location Description Consensus
N2 🟡 MODERATE CopilotService.Organization.cs:2499 Recovery loop termination !IsEventsFileActive(...) && !IsProcessing still uses the old absolute 15s threshold. After a real premature idle resolves (worker's final idle event writes events.jsonl), IsEventsFileActive=true blocks the && condition even when IsProcessing=false. Loop spins up to 15 more seconds before the file ages out. Pre-existing, not introduced by this PR, but directly relevant to the same root cause — worth a follow-up. 3/5
N3 🟡 MODERATE CopilotService.Organization.cs (detection phase) PR description says "Normal completion: returns in 2s". Actual behavior: 2s grace period + up to 10s PrematureIdleDetectionWindowMs polling = up to 12s before returning initialResponse. Pre-PR, normal completion took 16s (false-positive triggered 8-round recovery loop). Post-PR, no recovery loop but detection polling still runs. The improvement is real (16s→12s, no recovery CPU/disk I/O) but the "2s" claim in the description is inaccurate. 2/5
N4 🟢 MINOR .claude/skills/multi-agent-orchestration/SKILL.md Skill doc describes IsEventsFileActive as part of the initial detection mechanism. After this PR, IsEventsFileActive is only used for recovery loop termination (line 2499); initial detection now uses GetEventsFileMtime mtime-comparison. Future agents reading the skill doc will reason incorrectly about detection latency and mechanism. 3/5

Multi-Agent Skill Consistency

Cardinal rule respected: Mtime-comparison snapshots ONCE before grace period, compares ONCE after — distinct from the ❌-listed "mtime staleness counter (kill after N unchanged checks)". Long-running sessions with thinking pauses are safe.

IDLE-DEFER primacy: Primary fix (PR #399 BackgroundTasks check) unaffected. This PR improves defense-in-depth only.

⚠️ Skill doc stale: The defense-in-depth section still lists IsEventsFileActive for initial detection. Should be updated to describe the new GetEventsFileMtime/mtime-comparison approach.


Verdict: ✅ Approve (2 non-blocking follow-ups)

Core fix is sound and correctly eliminates false positives without violating long-running session safety invariants. N2 and N3 are pre-existing or documentation issues that don't block merging:

  1. Follow-up: Update line ~2499 recovery termination to use mtime-comparison instead of IsEventsFileActive (same fix applied to the termination check)
  2. Follow-up: Update multi-agent-orchestration/SKILL.md defense-in-depth section to reflect new detection mechanism

PureWeen added a commit that referenced this pull request Mar 23, 2026
## Problem

When a concurrent worker's connection error sets `IsInitialized=false`,
all subsequent workers in the same orchestration run immediately throw
`InvalidOperationException("Service not initialized")`. This was **not
treated as retryable** in `ExecuteWorkerAsync`, causing the entire
worker wave to fail at 0.0s elapsed — the "all workers failed with
'Service not initialized'" pattern seen during PR #421 review.

## Root Cause

`ExecuteWorkerAsync`'s retry gate only catches `IsConnectionError(ex)`:

```csharp
catch (Exception ex) when (attempt < maxRetries && IsConnectionError(ex))
```

`InvalidOperationException` is not an `IOException`/`SocketException`,
so it falls straight through to the final catch and returns a failed
`WorkerResult`.

## Fix

1. **`CopilotService.Utilities.cs`**: Add `IsInitializationError()` —
matches `InvalidOperationException` with "not initialized" in the
message.

2. **`CopilotService.Organization.cs`** (~line 2250): Extend the retry
gate:
   ```csharp
catch (Exception ex) when (attempt < maxRetries &&
(IsConnectionError(ex) || IsInitializationError(ex)))
   ```
And inside the catch, attempt lazy `InitializeAsync()` before the 2s
delay so the next attempt finds the client ready.

## Tests

10 new tests in `InitializationErrorDetectionTests` covering:
- True/false detection for `InvalidOperationException` variants
- Case-insensitivity
- Wrong exception type returns false
- Structural verification that the retry gate includes both checks

**All 2911 tests pass. Build clean.**

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen merged commit ba53257 into main Mar 23, 2026
@PureWeen PureWeen deleted the MultiAgent-Fix-orchestrator-af46 branch March 23, 2026 21:44
arisng pushed a commit to arisng/PolyPilot that referenced this pull request Apr 4, 2026
…eWeen#422)

## Problem

When a concurrent worker's connection error sets `IsInitialized=false`,
all subsequent workers in the same orchestration run immediately throw
`InvalidOperationException("Service not initialized")`. This was **not
treated as retryable** in `ExecuteWorkerAsync`, causing the entire
worker wave to fail at 0.0s elapsed — the "all workers failed with
'Service not initialized'" pattern seen during PR PureWeen#421 review.

## Root Cause

`ExecuteWorkerAsync`'s retry gate only catches `IsConnectionError(ex)`:

```csharp
catch (Exception ex) when (attempt < maxRetries && IsConnectionError(ex))
```

`InvalidOperationException` is not an `IOException`/`SocketException`,
so it falls straight through to the final catch and returns a failed
`WorkerResult`.

## Fix

1. **`CopilotService.Utilities.cs`**: Add `IsInitializationError()` —
matches `InvalidOperationException` with "not initialized" in the
message.

2. **`CopilotService.Organization.cs`** (~line 2250): Extend the retry
gate:
   ```csharp
catch (Exception ex) when (attempt < maxRetries &&
(IsConnectionError(ex) || IsInitializationError(ex)))
   ```
And inside the catch, attempt lazy `InitializeAsync()` before the 2s
delay so the next attempt finds the client ready.

## Tests

10 new tests in `InitializationErrorDetectionTests` covering:
- True/false detection for `InvalidOperationException` variants
- Case-insensitivity
- Wrong exception type returns false
- Structural verification that the retry gate includes both checks

**All 2911 tests pass. Build clean.**

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
arisng pushed a commit to arisng/PolyPilot that referenced this pull request Apr 4, 2026
…ulti-agent synthesis by ~16s (PureWeen#421)

## Problem
After a worker completed legitimately, \IsEventsFileActive()\ returned
true because \�vents.jsonl\ was just written by the idle event itself
(mtime < 15s threshold). This triggered 8 rounds of unnecessary
DISPATCH-RECOVER polling (~16 seconds) per worker before synthesis could
proceed.

## Root Cause
Detection compared file age against an absolute threshold, making it
impossible to distinguish 'file fresh because idle event just wrote it'
(false positive) from 'file fresh because CLI is still actively writing
events' (true positive).

## Fix
Replaced absolute-threshold detection with mtime-comparison approach:
1. Snapshot events.jsonl mtime at TCS completion
2. Wait \PrematureIdleEventsGracePeriodMs\ (2s) to observe activity
3. Only flag premature idle if mtime changed during that window

## Impact
- Normal completion: returns in 2s with no recovery loop (was: 16s
wasted)
- True premature idle: detected within 2s if CLI writes events
- No risk to long-running sessions

## Also fixed: worktree path normalization in ReconcileOrganization
The promotion/heal migration code was comparing raw \wt.Path\ against
\Path.GetFullPath()\-normalized paths. On Windows, unix-style paths like
\/tmp/wt-1\ normalize to \C:\\tmp\\wt-1\, causing the equality check to
always fail and incorrectly treating sessions as stranded. Both
comparison sites now normalize \wt.Path\ via \Path.GetFullPath()\ before
comparison.

## Tests
- 5 new structural tests in \MultiAgentRegressionTests.cs\ covering the
mtime-comparison fix
- All 2905 tests pass (including
\Scenario_MixedSessions_ReconcileDoesNotScatter\ which was previously
failing)

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant