Skip to content

fix: make 'Service not initialized' retryable in worker dispatch#422

Merged
PureWeen merged 1 commit intomainfrom
fix/retryable-init-error
Mar 23, 2026
Merged

fix: make 'Service not initialized' retryable in worker dispatch#422
PureWeen merged 1 commit intomainfrom
fix/retryable-init-error

Conversation

@PureWeen
Copy link
Copy Markdown
Owner

Problem

When a concurrent worker's connection error sets IsInitialized=false, all subsequent workers in the same orchestration run immediately throw InvalidOperationException("Service not initialized"). This was not treated as retryable in ExecuteWorkerAsync, causing the entire worker wave to fail at 0.0s elapsed — the "all workers failed with 'Service not initialized'" pattern seen during PR #421 review.

Root Cause

ExecuteWorkerAsync's retry gate only catches IsConnectionError(ex):

catch (Exception ex) when (attempt < maxRetries && IsConnectionError(ex))

InvalidOperationException is not an IOException/SocketException, so it falls straight through to the final catch and returns a failed WorkerResult.

Fix

  1. CopilotService.Utilities.cs: Add IsInitializationError() — matches InvalidOperationException with "not initialized" in the message.

  2. CopilotService.Organization.cs (~line 2250): Extend the retry gate:

    catch (Exception ex) when (attempt < maxRetries && (IsConnectionError(ex) || IsInitializationError(ex)))

    And inside the catch, attempt lazy InitializeAsync() before the 2s delay so the next attempt finds the client ready.

Tests

10 new tests in InitializationErrorDetectionTests covering:

  • True/false detection for InvalidOperationException variants
  • Case-insensitivity
  • Wrong exception type returns false
  • Structural verification that the retry gate includes both checks

All 2911 tests pass. Build clean.

When a concurrent worker's connection error sets IsInitialized=false, subsequent
workers in the same orchestration run immediately throw InvalidOperationException
('Service not initialized') which was not treated as retryable, causing all-workers-
failed results at 0.0s elapsed (the 'PR Review Squad all failed' pattern).

Changes:
- Add IsInitializationError() helper to CopilotService.Utilities.cs: matches
  InvalidOperationException with 'not initialized' in the message
- Extend the ExecuteWorkerAsync retry gate to include IsInitializationError alongside
  IsConnectionError (line ~2250 in Organization.cs)
- Inside the retry catch, attempt lazy InitializeAsync() before the 2s delay so the
  next attempt finds the client ready
- Add 10 new tests in InitializationErrorDetectionTests covering: true/false detection,
  wrong exception type, case-insensitivity, and structural verification that the retry
  gate includes both checks

All 2911 tests pass. Build clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

🔍 PR Review Squad — Round 2 (Full Review)

CI Status

⚠️ No checks configured on branch fix/retryable-init-error

Build & Tests

  • ✅ 10/10 new InitializationErrorDetectionTests pass
  • ✅ Full test suite passes (pre-existing 2 flaky WsBridge timing tests only)

PR Summary

When a concurrent worker's connection error sets IsInitialized=false globally, all subsequent workers in the same dispatch wave throw InvalidOperationException("Service not initialized"). This was not in the retry gate, causing all workers to fail with 0s elapsed. Fix: add IsInitializationError() to the retry gate, plus lazy InitializeAsync() before the retry attempt.


Multi-Agent Orchestration Skill Review

Checked against invariants:

The skill documents that worker failures are collected (not fatal to orchestration) — the retry fix is additive and compatible with all documented invariants (INV-O1 through INV-O13). Specifically:

  • INV-O9 (IsOrphaned): Lazy re-init creates a new _client but doesn't touch SessionState objects — orphaned states are unaffected ✅
  • INV-O13 (handler-before-publish ordering): Re-init registers handlers before publishing to _sessions
  • Phase 2 (DISPATCH): SavePendingOrchestration() is called BEFORE workers are dispatched — a retry within ExecuteWorkerAsync doesn't affect PendingOrchestration state ✅
  • Worker failure pattern: Failed workers still return WorkerResult(success: false) if all retries fail — synthesis receives the error, not a crash ✅

Skill documentation gap (not a blocker): The skill's ExecuteWorkerAsync pseudocode example doesn't show the retry mechanism. The new IsInitializationError retry path should be documented there. Low priority.


Processing State Safety Review

Checked against invariants:

The lazy InitializeAsync in the catch block runs on the worker's task thread (from Task.WhenAll). Key concern: does calling InitializeAsync from a background thread violate INV-2 (UI thread for mutations)?

  • InitializeAsync sets IsInitialized = true and assigns _client — these are NOT IsProcessing mutations. The 18 invariants in the skill cover IsProcessing mutations specifically.
  • InitializeAsync internally calls OnStateChanged?.Invoke() which may trigger UI rendering — this is already done in the existing reconnect path from background threads.
  • Verdict: no invariant violation. The re-init is intentionally lightweight and idempotent.

One pre-existing observation: InitializeAsync has no concurrency lock. Multiple workers hitting IsInitializationError simultaneously could all enter InitializeAsync concurrently — but the if (IsInitialized) return; fast path is effectively a read-only check (not atomic). In practice, IsInitialized is set to true early in the method, so subsequent callers return quickly. Low risk, pre-existing.


Code Analysis

IsInitializationError implementation:

internal static bool IsInitializationError(Exception ex) =>
    ex is InvalidOperationException && ex.Message.Contains("not initialized", StringComparison.OrdinalIgnoreCase);
  • ✅ Type guard (InvalidOperationException) prevents false-positives from other exception types
  • "not initialized" is distinct from IsConnectionError's "not connected" check
  • ✅ Case-insensitive match handles SDK message variations
  • ✅ Doesn't walk InnerException chain — initialization errors are always direct throws, not wrapped

Retry gate:

catch (Exception ex) when (attempt < maxRetries && (IsConnectionError(ex) || IsInitializationError(ex)))
  • ✅ Short-circuit: attempt < maxRetries evaluated first (cheap)
  • ✅ If lazy re-init fails, the Debug log captures it but execution continues to the retry — correct (re-init failure will likely cause the next attempt to throw again, eventually exhausting retries and falling to the final catch)
  • await Task.Delay(2000, cancellationToken) — cancellable, respects shutdown

Potential concern: If InitializeAsync throws (not just logs), the exception propagates OUT of the catch block, bypassing the retry continue. But the code wraps it in try/catch (Exception reinitEx) — the inner exception is logged and swallowed. After swallowing, execution reaches await Task.Delay(2000) then continue — the next attempt will likely hit the same IsInitializationError again (since re-init failed), eventually exhausting retries. This is correct behavior: failed re-init → all retries fail → WorkerResult(success: false) → synthesis gets error message.


Test Coverage Assessment

8 new unit tests in InitializationErrorDetectionTests:

  • ✅ True positives: both "Service not initialized" and "Client is not initialized" variants
  • ✅ Case-insensitivity: "NOT INITIALIZED" via [Theory]
  • ✅ False positives: wrong exception type (Exception not InvalidOperationException), wrong message, unrelated InvalidOperationException
  • ✅ Structural: verifies IsInitializationError exists in Utilities.cs
  • ✅ Structural: verifies retry gate includes IsInitializationError(ex)

Missing: no behavioral integration test that simulates the cascade failure scenario (one worker sets IsInitialized=false, concurrent workers detect and retry). Acceptable — difficult to test deterministically without mocking.


Verdict: ✅ Approve

Fix is correct, minimal, and well-tested. The IsInitializationError helper is consistent with the existing IsConnectionError family. No multi-agent orchestration invariants are violated. The cascade-failure pattern that killed all workers during PR #421 review is correctly addressed.

@PureWeen PureWeen merged commit ecf81dd into main Mar 23, 2026
@PureWeen PureWeen deleted the fix/retryable-init-error branch March 23, 2026 21:43
arisng pushed a commit to arisng/PolyPilot that referenced this pull request Apr 4, 2026
…eWeen#422)

## Problem

When a concurrent worker's connection error sets `IsInitialized=false`,
all subsequent workers in the same orchestration run immediately throw
`InvalidOperationException("Service not initialized")`. This was **not
treated as retryable** in `ExecuteWorkerAsync`, causing the entire
worker wave to fail at 0.0s elapsed — the "all workers failed with
'Service not initialized'" pattern seen during PR PureWeen#421 review.

## Root Cause

`ExecuteWorkerAsync`'s retry gate only catches `IsConnectionError(ex)`:

```csharp
catch (Exception ex) when (attempt < maxRetries && IsConnectionError(ex))
```

`InvalidOperationException` is not an `IOException`/`SocketException`,
so it falls straight through to the final catch and returns a failed
`WorkerResult`.

## Fix

1. **`CopilotService.Utilities.cs`**: Add `IsInitializationError()` —
matches `InvalidOperationException` with "not initialized" in the
message.

2. **`CopilotService.Organization.cs`** (~line 2250): Extend the retry
gate:
   ```csharp
catch (Exception ex) when (attempt < maxRetries &&
(IsConnectionError(ex) || IsInitializationError(ex)))
   ```
And inside the catch, attempt lazy `InitializeAsync()` before the 2s
delay so the next attempt finds the client ready.

## Tests

10 new tests in `InitializationErrorDetectionTests` covering:
- True/false detection for `InvalidOperationException` variants
- Case-insensitivity
- Wrong exception type returns false
- Structural verification that the retry gate includes both checks

**All 2911 tests pass. Build clean.**

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant