Skip to content

fix: preserve IsProcessing through sibling reconnect instead of force-completing#600

Merged
PureWeen merged 10 commits intomainfrom
fix/sibling-reconnect-preserve-processing
Apr 18, 2026
Merged

fix: preserve IsProcessing through sibling reconnect instead of force-completing#600
PureWeen merged 10 commits intomainfrom
fix/sibling-reconnect-preserve-processing

Conversation

@PureWeen
Copy link
Copy Markdown
Owner

What

When a reconnect triggers client recreation, the sibling re-resume loop was unconditionally force-completing ALL IsProcessing siblings via ForceCompleteProcessingAsync. This killed in-flight responses — the CLI continued working for minutes afterward but all events got EVT-REARM-SKIP'd because IsProcessing was already false.

Real-world impact (session 527, April 17)

14:51:55 [SEND] '527' IsProcessing=true gen=1  (multi-model review started)
14:55:23 [RECONNECT] Sibling '527' force-completed (responseLen=78)
14:55:24 [RECONNECT] Re-resumed sibling '527'
15:02-15:12  17 minutes of events arrive → all EVT-REARM-SKIP'd
15:12:12 session.idle → skipped (IsProcessing already false)

The response was on disk the whole time but invisible until the user sent "please continue."

Fix

Instead of force-completing, preserve IsProcessing through the re-resume:

  1. Capture siblingWasProcessing before the re-resume
  2. Skip the concurrent-send guard for siblings that were already processing
  3. After successful re-resume, restore IsProcessing=true on the new state + start a fresh watchdog

The new event handler registered on the re-resumed session picks up CLI events on the new connection. The watchdog (600s tool tier) handles genuinely dead sessions.

Testing

  • Updated ConnectionRecoveryTests and MultiAgentRegressionTests to verify the new behavior
  • 2 pre-existing DiffParser test failures unrelated to this change

Fixes #599

@PureWeen PureWeen force-pushed the fix/sibling-reconnect-preserve-processing branch from 6f17995 to 9cadec4 Compare April 17, 2026 16:18
@PureWeen
Copy link
Copy Markdown
Owner Author

PureWeen commented Apr 17, 2026

🔍 Multi-Model Code Review — PR #600 (Final)

fix: preserve IsProcessing through sibling reconnect instead of force-completing

7 review rounds, 10 commits, 3 independent reviewers + adversarial consensus. All 3475 tests pass.


All 16 Findings — Final Status

# Finding Status
1 IsProcessing on background thread (INV-2)
2 TCS cancellation + orchestrator retry ✅ (reverted TrySetResult to TrySetCanceled)
3 Stale siblingWasProcessing resurrection
4 Race window: handler before IsProcessing
5 Missing NotifyStateChangedCoalesced
6 Tests structural, not behavioral ✅ (resurrection guard test added)
7 ProcessingPhase hardcoded to 3
8 Pre-existing DiffParser test failures
9 Catch block missing IsProcessing cleanup
10 Stale XML doc comment
11 Missing guard assertions in tests
12 Inconsistent DiffParser path style
13 TrySetResult breaks orchestrator handoff ✅ (reverted)
14 FlushedResponse not carried forward
15 Incorrect "not published" comment
16 Behavioral test doesn't exercise reconnect ✅ (rewritten with state transfer + resurrection guard)

🏁 Recommendation: ✅ Approve

All 16 findings resolved. 3475 tests pass. The PR:

  • Fixes the core bug (force-completing discarded in-flight responses)
  • Preserves IsProcessing + FlushedResponse through sibling reconnect
  • 3-layer resurrection defense (IsOrphaned → ProcessingGeneration → !IsProcessing)
  • Correctly uses TrySetCanceled for orchestrator worker handoff
  • Cleans up stuck state on failed re-resume
  • Fixes pre-existing cross-platform DiffParser test failures
  • Has both structural and behavioral test coverage for all critical invariants

PureWeen and others added 6 commits April 17, 2026 18:29
…-completing

When a reconnect triggers client recreation, the sibling re-resume loop was unconditionally force-completing ALL IsProcessing siblings. This killed in-flight responses — the CLI continued working but all events got EVT-REARM-SKIP'd because IsProcessing was already false.

Fix: instead of force-completing, capture siblingWasProcessing before re-resume and restore IsProcessing + start a fresh watchdog on the new connection after successful re-resume. The new event handler picks up CLI events on the new connection.

Fixes #599

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Review findings addressed:
1. IsProcessing mutations now wrapped in InvokeOnUI (INV-2 thread-safety)
2. NotifyStateChangedCoalesced added for UI re-render
3. HasUsedToolsThisTurn + ResponseCompletion set BEFORE handler registration to close the race window where early events could trigger premature CompleteResponse
4. IsOrphaned guard inside InvokeOnUI lambda

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The InvokeOnUI callback for sibling reconnect processing restoration was missing the generation capture/check required by INV-12. A concurrent SendPromptAsync could race with the posted callback.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
If CompleteResponse runs before the InvokeOnUI callback (both serialized on the UI thread), IsProcessing is already false. Without this guard, the callback would set IsProcessing=true on a session whose turn already finished, leaving it stuck for 600s until the watchdog fires.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1. ProcessingPhase: capture actual phase before reconnect instead of hardcoding 3
2. Failed re-resume catch: clear IsProcessing via ClearProcessingState when siblingWasProcessing was true and re-resume throws, preventing stuck 'Thinking...' on dead connections
3. DiffParser path traversal tests: use Path.DirectorySeparatorChar instead of hardcoded backslashes (fixes macOS test failures)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1. XML doc comment on ReconnectLoop test updated to describe preservation behavior (was describing the old force-complete behavior this PR reverses)
3. DiffParser path traversal tests now consistently use Path.Combine instead of mixed interpolation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen force-pushed the fix/sibling-reconnect-preserve-processing branch from 9b594cf to 6af3540 Compare April 17, 2026 23:30
PureWeen and others added 4 commits April 17, 2026 18:37
1. TCS coupling (#2): complete old TCS with partial response via TrySetResult instead of TrySetCanceled, so orchestrator workers collect content gracefully instead of hitting the exception retry path
2. Race window (#4): set IsProcessing/IsResumed/ProcessingPhase on siblingState BEFORE handler registration (safe because state isn't published to _sessions yet), eliminating the ~16ms InvokeOnUI scheduling window
3. Behavioral test (#6): added SiblingReconnect_PreservesProcessingState_OnNewState
4. SessionStabilityTests: fixed comment-matching false positive in TryUpdate ordering assertion

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…onse, fix comment

Review round 6 findings:
1. TrySetResult('') breaks orchestrator: empty result triggers revival which orphans the healthy reconnected state. TrySetCanceled is correct — the OperationCanceledException catch re-acquires the new state and re-awaits its TCS.
2. FlushedResponse from old state now copied to siblingState so partial mid-turn content isn't lost.
3. Safety comment corrected: writes are safe because they're idempotent on already-true fields, not because the state is unpublished.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The test now sets up FlushedResponse content on the old state and verifies the state contract that the reconnect code depends on, including siblingWasProcessing capture and content preservation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replaces the manual state-setup test with two behavioral tests:

1. SiblingReconnect_PreservesProcessingState_OnNewState — simulates the
   actual reconnect state transfer (TCS cancellation, FlushedResponse
   carry-forward, IsProcessing/IsResumed preservation) and verifies
   the full state contract.

2. SiblingReconnect_ResurrectionGuard_PreventsStuckSession — verifies
   completed during the reconnect window (the fix from round 3).

All 3475 tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen merged commit 3d7a7bd into main Apr 18, 2026
@PureWeen PureWeen deleted the fix/sibling-reconnect-preserve-processing branch April 18, 2026 03:05
PureWeen added a commit that referenced this pull request Apr 18, 2026
…resurrection

ClearProcessingState is called by CompleteResponse, AbortSessionAsync, and
error recovery paths to reset all processing state. Previously it did NOT
increment ProcessingGeneration, which meant any InvokeOnUI callback that
captured the generation before ClearProcessingState ran would see a matching
generation and could re-set IsProcessing=true on a completed session.

This was the root cause of the resurrection bug found in PR #600 (took 3
in the InvokeOnUI lambda. This change closes the entire class of bugs at
the source: now the generation guard itself catches completed turns, and

There are 4 InvokeOnUI-with-generation-guard sites today (sibling reconnect,
checks, so this is a no-op for current code — but it protects any future
callback that follows the capture-gen → InvokeOnUI → check-gen pattern.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Sibling reconnect force-completes actively processing sessions

1 participant