Skip to content

fix: ClearProcessingState increments ProcessingGeneration to prevent resurrection#612

Merged
PureWeen merged 4 commits intomainfrom
fix/clear-processing-increments-generation
Apr 18, 2026
Merged

fix: ClearProcessingState increments ProcessingGeneration to prevent resurrection#612
PureWeen merged 4 commits intomainfrom
fix/clear-processing-increments-generation

Conversation

@PureWeen
Copy link
Copy Markdown
Owner

What

ClearProcessingState is called by CompleteResponse, AbortSessionAsync, and error recovery paths to reset all processing state. Previously it did not increment ProcessingGeneration, which meant any InvokeOnUI callback that captured the generation before ClearProcessingState ran would see a matching generation and could re-set IsProcessing=true on a completed session — a "resurrection" bug.

Why

This was the root cause of the resurrection bug found in PR #600 (took 3 review rounds to identify). The fix there was a manual !IsProcessing guard in the InvokeOnUI lambda. This change closes the entire class of bugs at the source.

Before (fragile)

InvokeOnUI captures gen=5 → CompleteResponse runs (gen stays 5) → guard passes → resurrection!

After (defense in depth)

InvokeOnUI captures gen=5 → ClearProcessingState increments to 6 → guard fails → skip ✅

Scope

  • 1 line added to ClearProcessingState: Interlocked.Increment(ref state.ProcessingGeneration)
  • 1 behavioral test verifying the increment

Testing

All 3476 tests pass.

…resurrection

ClearProcessingState is called by CompleteResponse, AbortSessionAsync, and
error recovery paths to reset all processing state. Previously it did NOT
increment ProcessingGeneration, which meant any InvokeOnUI callback that
captured the generation before ClearProcessingState ran would see a matching
generation and could re-set IsProcessing=true on a completed session.

This was the root cause of the resurrection bug found in PR #600 (took 3
in the InvokeOnUI lambda. This change closes the entire class of bugs at
the source: now the generation guard itself catches completed turns, and

There are 4 InvokeOnUI-with-generation-guard sites today (sibling reconnect,
checks, so this is a no-op for current code — but it protects any future
callback that follows the capture-gen → InvokeOnUI → check-gen pattern.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

PureWeen commented Apr 18, 2026

🔍 Multi-Model Code Review — PR #612 (Final Re-Review)

fix: ClearProcessingState increments ProcessingGeneration to prevent resurrection

Re-reviewed by 3 independent reviewers after 3 fix commits addressing all prior review + expert review findings. CI: ⚠️ No checks reported.


Previous Review Findings — All Addressed

# Finding Status
1 Core fix correct (ClearProcessingState increment) ✅ Unchanged
2 ClearProcessingStateForRecoveryFailure same gap ✅ FIXED (commit ec5ac94)
3 No negative resurrection test ✅ FIXED (GenerationGuard_PreventsResurrection_AfterCompletion)
4 STEER-ERROR path bypasses increment (expert review) ✅ FIXED (commit 40cddef)
5 MCP-reload path bypasses increment (expert review) ✅ FIXED (commit 40cddef)
6 long.MaxValue overflow on orphaned state (expert review) ✅ FIXED (overflow guard on ClearProcessingState + ClearProcessingStateForRecoveryFailure)

New Questions From This Round

Is the missing overflow guard on STEER-ERROR/MCP-reload a problem?

No (3/3 reviewers agree). ProcessingGeneration = long.MaxValue is only set on orphan paths where IsOrphaned = true is written first. STEER-ERROR and MCP-reload only operate on live, non-orphaned sessions — long.MaxValue is unreachable on those paths. The guard is correctly omitted.

Is if (Read != MaxValue) Increment a TOCTOU race?

Technically yes, but provably benign (2/3 reviewers confirm safe; 1/3 flagged as concern). The race requires a concurrent reconnect to orphan the state between Read and Increment. If that happens, the state is already dead (IsOrphaned = true, removed from _sessions), so wrapping to long.MinValue is harmless — no future code uses the orphaned state. A CAS loop would close the window atomically but adds complexity with no correctness benefit.


📊 Final Coverage

All 4 paths that set IsProcessing = false with a SessionState now increment ProcessingGeneration:

  • ClearProcessingState — with overflow guard ✅
  • ClearProcessingStateForRecoveryFailure — with overflow guard ✅
  • STEER-ERROR — bare increment (orphaned state unreachable) ✅
  • MCP-reload — bare increment (orphaned state unreachable) ✅

2 tests verify the invariant:

  • ClearProcessingState_IncrementsGeneration — mechanical increment verification
  • GenerationGuard_PreventsResurrection_AfterCompletion — full resurrection scenario

🏁 Recommendation: ✅ Approve

All 6 findings (2 from initial review + 4 from expert review) are resolved. The implementation is minimal, correctly scoped, and consistent across all processing-clear paths. 3477 tests pass.

…yFailure + add resurrection test

1. ClearProcessingStateForRecoveryFailure now increments ProcessingGeneration
   before IsProcessing=false, matching ClearProcessingState. This closes the
   same resurrection gap for the 3 permission-recovery failure paths.

2. Added GenerationGuard_PreventsResurrection_AfterCompletion test that
   demonstrates the actual resurrection scenario: captures gen before
   completion, completes the turn, verifies the stale callback would see
   a generation mismatch.

All 3477 tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

/review

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 18, 2026

Expert Code Review completed successfully!

@github-actions

This comment has been minimized.

… guard overflow

Addresses expert review findings:

1. STEER-ERROR path (~line 4484) and MCP-reload path (~line 4764) both
   set IsProcessing=false inline without incrementing ProcessingGeneration.
   Added the increment before IsProcessing=false in both paths.

2. All 4 increment sites now guard against long.MaxValue overflow on
   orphaned states (ProcessingGeneration set to long.MaxValue during
   orphaning). Incrementing would wrap to long.MinValue — unreachable
   in practice but worth guarding defensively.

All 3477 tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

/review

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 18, 2026

Expert Code Review completed successfully!

@github-actions
Copy link
Copy Markdown
Contributor

🟡 Design-Level Findings (Outside Diff)

Two IsProcessing = false paths were not touched by this PR but suffer from the exact same vulnerability it fixes:

1. SessionErrorEvent handler — CopilotService.Events.cs:1189 (3/3 reviewers)

The SessionErrorEvent handler (lines 1158–1205) is a standalone InvokeOnUI block that manually clears all processing state but does not increment ProcessingGeneration. A stale watchdog/TurnEnd-fallback callback that captured the generation before the error handler ran would pass the generation guard, leaving only the !IsProcessing check — the exact fragile single-point-of-failure this PR aims to eliminate.

Scenario: TurnEnd fallback timer fires → posts InvokeOnUI(CompleteResponse(state, N))SessionErrorEvent clears IsProcessing at gen=N (no increment) → IDLE-DEFER-REARM re-arms IsProcessing=true at gen=N → stale fallback passes generation guard → premature completion.

2. Permission recovery success path — CopilotService.Events.cs:3261 (3/3 reviewers)

TryRecoverPermissionAsync success branch (lines 3243–3291) posts an InvokeOnUI that sets IsProcessing = false at line 3261 without incrementing ProcessingGeneration. Between this clear and the subsequent SendPromptAsync (which increments generation), there is a window where stale callbacks can pass the guard.


Recommendation: Add the guarded increment to both paths, or refactor them to call ClearProcessingState() directly to ensure they stay in sync with future changes.

Discarded finding (1/3 — single reviewer only): Missing ClearFlushedReplayDedup in STEER-ERROR path. Follow-up reviewer confirmed FlushedResponse.Clear() already prevents the dedup issue.

Generated by Expert Code Review for issue #612 ·

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expert Code Review — PR #612

Methodology: 3 independent reviewers with adversarial consensus. Findings required 2/3+ agreement to be included. Single-reviewer findings were escalated to the other 2 for follow-up before inclusion/discard.

CI Status

Tests pass (66/66 in ChatExperienceSafetyTests). Build succeeds.


Findings Summary

# Severity Finding Consensus Location
1 🟡 MODERATE STEER-ERROR path missing long.MaxValue overflow guard 3/3 CopilotService.cs:4486 (in diff)
2 🟡 MODERATE MCP-reload path missing long.MaxValue overflow guard 3/3 CopilotService.cs:4766 (in diff)
3 🟡 MODERATE SessionErrorEvent handler missing generation increment 3/3 Events.cs:1189 (outside diff)
4 🟡 MODERATE Permission recovery missing generation increment 3/3 Events.cs:3261 (outside diff)
5 🟢 MINOR TOCTOU window in read-then-increment pattern 2/3 CopilotService.cs:794 (in diff)
6 🟢 MINOR Second test verifies arithmetic, not behavior; 3/4 paths untested 2/3 ChatExperienceSafetyTests.cs:335 (in diff)

Discarded: Missing ClearFlushedReplayDedup in STEER-ERROR (1/3 only — follow-up reviewer confirmed FlushedResponse.Clear() prevents the issue).


Assessment

The core fix is correct and well-motivated — incrementing ProcessingGeneration before IsProcessing = false eliminates the fragile single-point-of-failure where only the !IsProcessing check prevented resurrection.

Blocking issues (1–2): The two bare Interlocked.Increment calls in the STEER-ERROR and MCP-reload paths are inconsistent with the guarded pattern used everywhere else. If a concurrent reconnect orphans the session, these will overflow long.MaxValue to long.MinValue, breaking the monotonicity invariant. The fix is a one-line change each.

Recommended improvements (3–4): Two IsProcessing = false paths outside the diff (SessionErrorEvent and permission recovery) were missed. They suffer from the exact same vulnerability this PR fixes. Adding the increment there would complete the defense-in-depth story.

Non-blocking (5–6): The TOCTOU window and test quality are real observations but low-risk and acceptable as follow-up items.

Generated by Expert Code Review for issue #612

state.IsReconnectedSend = false; // INV-1 item 8: prevent stale 35s timeout on next watchdog start
Interlocked.Exchange(ref state.SendingFlag, 0);
// Clear IsProcessing BEFORE completing TCS (INV-O3)
Interlocked.Increment(ref state.ProcessingGeneration); // Invalidate stale callbacks
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MODERATE — Missing long.MaxValue overflow guard (Flagged by: 3/3 reviewers)

This bare Interlocked.Increment lacks the long.MaxValue sentinel check that ClearProcessingState (line 794) and ClearProcessingStateForRecoveryFailure (line 3120) both have. If a concurrent reconnect orphans this session (Exchange(..., long.MaxValue)) while the STEER-ERROR catch is in-flight, the increment wraps to long.MinValue, breaking the monotonicity invariant for all future generation comparisons.

Suggested fix:

if (Interlocked.Read(ref state.ProcessingGeneration) != long.MaxValue)
    Interlocked.Increment(ref state.ProcessingGeneration); // Invalidate stale callbacks

await InvokeOnUIAsync(() =>
{
FlushCurrentResponse(state);
Interlocked.Increment(ref state.ProcessingGeneration); // Invalidate stale callbacks
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MODERATE — Missing long.MaxValue overflow guard (Flagged by: 3/3 reviewers)

Same issue as the STEER-ERROR path. This unconditional Interlocked.Increment can wrap to long.MinValue if a concurrent reconnect has already set the sentinel. The session is orphaned at line 4789, after this block runs — so a concurrent reconnect at lines 3966/4000/4011 can set long.MaxValue between the IsProcessing check (line 4760) and this increment.

Suggested fix:

if (Interlocked.Read(ref state.ProcessingGeneration) != long.MaxValue)
    Interlocked.Increment(ref state.ProcessingGeneration); // Invalidate stale callbacks

// bail out — preventing resurrection of a completed turn. Without this,
// the generation guard passes and only the !IsProcessing check saves us.
// Skip if orphaned (long.MaxValue) — incrementing would overflow to long.MinValue.
if (Interlocked.Read(ref state.ProcessingGeneration) != long.MaxValue)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 MINOR — TOCTOU window in read-then-increment (Flagged by: 2/3 reviewers)

Theoretical race between Interlocked.Read and Interlocked.Increment: a background thread could Exchange(..., long.MaxValue) between them. Some ClearProcessingState callers (SendPromptAsync catch blocks, AbortSessionAsync) may run on thread-pool threads, making the window wider than "UI thread only" implies.

Practically low-risk. A CompareExchange loop would close it definitively. Not blocking — acceptable as follow-up.

/// generation before CompleteResponse ran must see a mismatch and bail out,
/// rather than re-setting IsProcessing=true on a completed session.
/// </summary>
[Fact]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 MINOR — Test verifies arithmetic, not behavior (Flagged by: 2/3 reviewers)

This test checks that 6 ≠ 5 after incrementing — tautologically true regardless of whether the production guard works. A stronger test would create a closure simulating a stale callback, complete the session, execute the closure, and assert IsProcessing was NOT changed.

Also: only the ClearProcessingState path (via CompleteResponse) is tested. The other 3 modified paths have zero generation increment verification.

Not blocking — the first test (ClearProcessingState_IncrementsGeneration) is solid.

…d paths

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen merged commit 9c6f593 into main Apr 18, 2026
@PureWeen PureWeen deleted the fix/clear-processing-increments-generation branch April 18, 2026 20:02
PureWeen added a commit that referenced this pull request Apr 19, 2026
…ssion history

7 new behavioral tests for GetLastEventType (the static method used by
the TurnEnd fallback's last-event-type check):
- tool.execution_start as last event → detected
- session.idle as terminal event → detected
- file doesn't exist → returns null
- empty file → returns null
- invalid JSON → returns null (fail-open)
- trailing blank lines → ignored
- FileShare.ReadWrite → concurrent read works

Update regression history: add #612 and #619.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant