Skip to content

Remove RESUME-ABORT: never abort sessions on resume#452

Merged
PureWeen merged 5 commits intomainfrom
fix/remove-resume-abort
Mar 29, 2026
Merged

Remove RESUME-ABORT: never abort sessions on resume#452
PureWeen merged 5 commits intomainfrom
fix/remove-resume-abort

Conversation

@PureWeen
Copy link
Copy Markdown
Owner

Summary

Removes the RESUME-ABORT logic that was killing legitimate long-running sessions on app restart.

Problem

In persistent mode, the headless CLI keeps running tools while PolyPilot is down. On resume, the old code aborted sessions with unmatched tool starts, destructively killing 15-30+ min tool runs.

Fix

Single RESUME-ACTIVE path: mark as processing, let events flow, watchdog handles dead sessions via timeout.

PureWeen and others added 2 commits March 29, 2026 09:22
In persistent mode the headless CLI server keeps running tools even while
PolyPilot is down — tool results WILL arrive once we reconnect. The abort
was killing legitimate long-running tool executions (builds, tests that
run 15-30+ min without writing to events.jsonl).

The watchdog already handles truly dead sessions via timeout (30-600s
depending on state), making the abort redundant and destructive.

Replaced both RESUME-ABORT and RESUME-SKIP-ABORT branches with a single
RESUME-ACTIVE path that marks the session as processing and lets events
flow naturally.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Diagnostic filter: [RESUME-ABORT -> [RESUME-ACTIVE] + [RESUME-CHECK]
  (without this, RESUME-ACTIVE entries wouldn't appear in event-diagnostics.log)
- Utilities.cs: [RESUME-ABORT] -> [RESUME-CHECK] in HasInterruptedToolExecution
- RESUME-ACTIVE InvokeOnUI: add ProcessingGeneration capture/check per INV-3/INV-12
  to prevent stale callback from re-arming IsProcessing after a user-initiated
  turn has already completed (race window: send -> complete -> stale callback)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen force-pushed the fix/remove-resume-abort branch from 474db57 to d466c7b Compare March 29, 2026 14:24
@PureWeen
Copy link
Copy Markdown
Owner Author

🤖 Multi-Model Code Review — PR #452

Remove RESUME-ABORT: never abort sessions on resume
Reviewed by: Claude Opus 4.6, GPT-5.3-Codex · 2-model consensus applied


🟡 MODERATE — Dead sessions now wait up to 600s instead of immediate resolution

File: PolyPilot/Services/CopilotService.Persistence.cs:436-437
Flagged by: Opus, Codex

The code unconditionally sets HasUsedToolsThisTurn = true, which forces the watchdog into the 600s tool-execution timeout tier. This bypasses the 30s resume-quiescence path. Previously, IsSessionStillProcessing() checked events.jsonl freshness — if stale (>600s), the session was immediately aborted. Now, even for sessions where events.jsonl is hours old, the user sees "Working…" for up to 10 minutes before the watchdog fires.

Suggestion: Use IsSessionStillProcessing() to differentiate: if events.jsonl is clearly stale, skip setting HasUsedToolsThisTurn = true so the 30s resume-quiescence path kicks in instead. Only set it when events.jsonl indicates the CLI was recently active. This preserves the fix for long-running tools while giving fast cleanup for dead sessions.


🟡 MODERATE — No tests for the behavioral change

File: PR-wide
Flagged by: Opus, Codex

High-risk state-machine change with no regression coverage added. Key untested scenarios:

  • Resume with dead CLI: HasInterruptedToolExecution=true, CLI stopped → verify 600s timeout path
  • Resume with live CLI: events arrive → verify watchdog clears IsResumed normally
  • Generation guard race: resume callback fires after user sends new prompt → verify callback is a no-op
  • Watchdog integration: after timeout on dead resumed session, verify SendAsync still works

🟢 POSITIVE — Generation guard is a good addition

File: CopilotService.Persistence.cs:433-435, 440
Flagged by: Opus

The ProcessingGeneration capture-and-check pattern (INV-3/INV-12) correctly prevents a race where a stale resume callback re-arms IsProcessing=true after a user-initiated turn has completed. The old branch lacked this guard — genuine improvement.


📋 Additional Notes

  • CI: ⚠️ No checks configured for this branch
  • Prior review comments: None

Opus-only observation (not consensus, for consideration):

The watchdog clears UI state but does NOT call AbortAsync() on the SDK session when it fires for resumed sessions with zero events. If the CLI truly died and the SDK has pending tool-result expectations, subsequent SendAsync calls may be silently dropped. Consider calling AbortAsync() in the watchdog timeout path for IsResumed sessions that received zero events.


Recommended action: ⚠️ Request changes

  1. Use IsSessionStillProcessing() to gate HasUsedToolsThisTurn — fast cleanup for dead sessions, patience for live ones
  2. Add test coverage for the resume-active path and generation guard race

Use IsSessionStillProcessing() to choose watchdog flags, not to decide abort:
- CLI active: HasUsedToolsThisTurn=true → 600s tool timeout (events will flow)
- CLI stale: leave HasUsedToolsThisTurn=false → 30s quiescence (fast cleanup)

Never abort in either case. Fixes stuck session UX regression where dead
CLI sessions appeared processing for ~10 minutes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

🤖 Multi-Model Code Review — PR #452 (v2)

Remove RESUME-ABORT: never abort sessions on resume (updated diff)
Reviewed by: Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.3-Codex · 2-model consensus applied


Previous findings status

Finding Status
🟡 Dead sessions stuck for 600s Fixed — new two-branch approach uses IsSessionStillProcessing() to gate timeout tier (30s quiescence for dead CLI, 600s for live CLI)
🟡 No tests for behavioral change ⚠️ Still open — no tests added in updated diff

🟡 MODERATE — SDK pending tool state not cleared on dead-CLI resume path

File: CopilotService.Persistence.cs, quiescence branch (~lines 430-446 in diff)
Flagged by: Opus, GPT (Sonnet disagrees — see note)

The old code called AbortAsync() when the CLI was dead, clearing SDK-internal pending tool expectations. The new quiescence path sets IsProcessing=true for 30s, then the watchdog clears it — but never calls AbortAsync.

Verified regression path:

  1. Resume with interrupted tools + stale CLI → enters RESUME-QUIESCE branch
  2. 30s watchdog fires → clears IsProcessing (user sees session as idle)
  3. User sends message → SendPromptAsync finds state.Session != null (line 3122) → skips EnsureSessionConnectedAsync → calls SendAsync directly
  4. SDK still has pending tool expectations → silently accepts but ignores the message (per old code's own comment: "silently queues/ignores new SendAsync calls")
  5. No error thrown → reconnect catch block never fires
  6. Recovery happens via watchdog Case D after ~30s more (AbortAsync on file-size stall)

Net effect: User's first message after a dead-CLI resume appears to hang for ~60s total (30s quiescence + 30s Case D) before the session recovers. Not permanent — but a meaningful UX regression vs the old immediate abort.

Suggested fix: Call AbortAsync() in the quiescence branch after the 30s watchdog fires (or inline before marking processing), OR null state.Session in the watchdog timeout so the next send goes through EnsureSessionConnectedAsync.

Sonnet's counter-argument: Sonnet found that SendAsync would hit a connection error and trigger reconnect. However, code verification shows SendAsync does NOT throw when the SDK has pending tool expectations — it silently accepts the message per the codebase's own documentation.


🟡 MODERATE — No test coverage for new resume branches

File: PR-wide
Flagged by: Opus, GPT

No tests exist for RESUME-ACTIVE or RESUME-QUIESCE paths (confirmed via grep — zero matches in PolyPilot.Tests/). Key untested scenarios:

  • CLI alive → HasUsedToolsThisTurn=true, IsResumed=true (600s tier)
  • CLI stale → HasUsedToolsThisTurn=false, IsResumed=true (30s tier)
  • Generation guard: resume callback fires after user sends new prompt → verify no-op

✅ Improvements confirmed (all 3 models agree)

  • Generation guard (ProcessingGeneration capture + check) — correctly added to both branches. The old RESUME-SKIP-ABORT branch lacked this guard, so this is a net improvement.
  • Two-branch timeout approach — correctly maps to watchdog tiers: IsResumed=true + HasUsedToolsThisTurn=false → 30s quiescence; with HasUsedToolsThisTurn=true → 600s tool timeout.
  • Log tag migration[RESUME-ABORT]/[RESUME-SKIP-ABORT] fully replaced with [RESUME-ACTIVE]/[RESUME-QUIESCE]/[RESUME-CHECK]. ShouldPersistDiagnostic updated.

📋 Notes

  • CI: ⚠️ No checks configured for this branch
  • Prior review comments: Previous review's "600s stuck" finding has been addressed

Recommended action: ⚠️ Request changes

  1. Clear SDK pending tool state for the dead-CLI (quiescence) path — either call AbortAsync inline, or null state.Session so next send triggers full reconnect
  2. Add test coverage for the two resume branches and generation guard

Address review findings:
1. RESUME-QUIESCE (dead CLI) now calls AbortAsync() to clear SDK-internal
   pending tool expectations before entering 30s quiescence. Without this,
   the user's first message after watchdog clears IsProcessing would be
   silently dropped by the SDK (~60s total delay before recovery).

2. Add 4 tests for the resume branch selection logic:
   - ResumeActive_CliAlive_SetsHasUsedToolsThisTurn (600s tier)
   - ResumeQuiesce_CliDead_DoesNotSetHasUsedToolsThisTurn (30s tier)
   - ResumeActive_WatchdogTimeout_NotQuiescence (tier verification)
   - GenerationGuard_StaleCallback_IsNoOp (INV-3/INV-12 race)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

🤖 Multi-Model Code Review — PR #452 (v3)

Remove RESUME-ABORT: never abort sessions on resume (3rd revision)
Reviewed by: Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.3-Codex + manual verification


Previous findings status

Finding Status
🟡 SDK pending tool state not cleared (v2) FixedAbortAsync restored in RESUME-QUIESCE branch
🟡 No test coverage (v1-v2) Addressed — 4 tests added (with caveats, see below)

🟡 MODERATE — AbortAsync race can push dead-CLI timeout from 30s to 120s

File: CopilotService.Persistence.cs, RESUME-QUIESCE branch (~lines 435-450 in diff)
Flagged by: Sonnet (as CRITICAL/600s), manually verified as MODERATE/120s

The RESUME-QUIESCE path calls await copilotSession.AbortAsync() before InvokeOnUI() sets IsResumed=true. If AbortAsync triggers any SDK event (e.g., SessionIdleEvent), HandleSessionEvent (Events.cs line 223) sets HasReceivedEventsSinceResume = true via Volatile.Write on a background thread — bypassing SyncContext FIFO ordering.

When the watchdog first evaluates, hasReceivedEvents=true → quiescence condition fails. However, this does NOT fall to 600s as Sonnet initially claimed. The watchdog (Events.cs lines 2075-2079) detects IsResumed=true && HasReceivedEventsSinceResume=true && !hasActiveTool && !HasUsedToolsThisTurn and clears IsResumed, dropping to the 120s inactivity timeout — not 600s.

Net effect: Dead-CLI sessions may take ~120s instead of 30s to clear if AbortAsync triggers an event. Still better than the old behavior (immediate abort could kill legitimate work), but worse than intended 30s.

Suggested fix: Reset HasReceivedEventsSinceResume = false inside the RESUME-QUIESCE InvokeOnUI callback, after AbortAsync completes.


🟢 MINOR — Tests 3 & 4 test local variables, not production code

File: StuckSessionRecoveryTests.cs, ResumeActive_WatchdogTimeout_NotQuiescence and GenerationGuard_StaleCallback_IsNoOp
Flagged by: Opus, Sonnet, GPT (all 3)

GenerationGuard_StaleCallback_IsNoOp asserts 43 != 42 — would pass even if the production generation guard were deleted. ResumeActive_WatchdogTimeout_NotQuiescence evaluates a hardcoded boolean formula. These provide documentation value but zero regression protection.

Tests 1 & 2 (ResumeActive_CliAlive_SetsHasUsedToolsThisTurn and ResumeQuiesce_CliDead_DoesNotSetHasUsedToolsThisTurn) are stronger — they exercise real IsSessionStillProcessing and HasInterruptedToolExecution with filesystem fixtures.


🟢 MINOR — 30s "Working…" UX change on dead-CLI resume

File: CopilotService.Persistence.cs, RESUME-QUIESCE branch
Flagged by: Opus, GPT

Old code: CLI dead → AbortAsync + no IsProcessing → session immediately interactive.
New code: CLI dead → AbortAsync + IsProcessing=true for 30s → watchdog clears.

This is an intentional design choice (documented in comments) and a reasonable tradeoff for unified processing state management. Noting for awareness.


✅ Improvements confirmed (all 3 models)

  • Generation guard (ProcessingGeneration capture + check) added to both branches — was missing from old RESUME-SKIP-ABORT. Net improvement.
  • AbortAsync retained for dead-CLI path — SDK pending tool expectations are cleared.
  • Log tags fully migrated: [RESUME-ABORT]/[RESUME-SKIP-ABORT][RESUME-ACTIVE]/[RESUME-QUIESCE]/[RESUME-CHECK]. ShouldPersistDiagnostic updated.

📋 Notes

  • CI: ⚠️ No checks configured for this branch
  • Prior review comments: 2 previous review comments — both findings now addressed

Recommended action:Approve with minor suggestion

The critical finding from v2 (SDK tool state) is fixed. The AbortAsync race (MODERATE) has a simple one-line fix (HasReceivedEventsSinceResume = false in the InvokeOnUI callback) but the worst case is 120s instead of 30s — acceptable. The test quality is imperfect but the first two tests provide real coverage. Ship-ready.

…sume

AbortAsync may trigger SDK events on a background thread before InvokeOnUI
runs, setting HasReceivedEventsSinceResume=true. This defeats the 30s
quiescence check, pushing dead-CLI timeout from 30s to 120s.

Reset the flag inside InvokeOnUI after AbortAsync completes so the watchdog
correctly uses the 30s resume quiescence path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen merged commit 72d9db6 into main Mar 29, 2026
@PureWeen PureWeen deleted the fix/remove-resume-abort branch March 29, 2026 20:56
PureWeen added a commit that referenced this pull request Apr 1, 2026
…elaunch

Two bugs prevented worker sessions from surviving app relaunch:

1. IsSessionStillProcessing used a whitelist of 'active' event types,
   but missed intermediate states like assistant.turn_end (between tool
   rounds), assistant.message, and tool.execution_end. Sessions caught
   between tool rounds were incorrectly detected as idle.

   Fix: Use a blacklist of terminal events (session.idle, session.error,
   session.shutdown) instead. Any non-terminal event means still active.

2. Actively-processing sessions were left as lazy placeholders with no
   SDK event handler (commit ff9d3d7). Events from the CLI never reached
   PolyPilot, the watchdog timed out after 600s, and multi-agent
   orchestrator TCSs were never completed.

   Fix: Add actively-processing sessions to eagerResumeCandidates so
   EnsureSessionConnectedAsync establishes the SDK connection. PR #452
   already removed RESUME-ABORT, so ResumeSessionAsync no longer disrupts
   in-flight tool execution.

Verified end-to-end: session processing during relaunch → new app detects
active session → eager resume → SDK connected → events flow → session
completes normally. Multi-agent orchestration resumes correctly via
ResumeOrchestrationIfPendingAsync.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen added a commit that referenced this pull request Apr 1, 2026
…elaunch

Two bugs prevented worker sessions from surviving app relaunch:

1. IsSessionStillProcessing used a whitelist of 'active' event types,
   but missed intermediate states like assistant.turn_end (between tool
   rounds), assistant.message, and tool.execution_end. Sessions caught
   between tool rounds were incorrectly detected as idle.

   Fix: Use a blacklist of terminal events (session.idle, session.error,
   session.shutdown) instead. Any non-terminal event means still active.

2. Actively-processing sessions were left as lazy placeholders with no
   SDK event handler (commit ff9d3d7). Events from the CLI never reached
   PolyPilot, the watchdog timed out after 600s, and multi-agent
   orchestrator TCSs were never completed.

   Fix: Add actively-processing sessions to eagerResumeCandidates so
   EnsureSessionConnectedAsync establishes the SDK connection. PR #452
   already removed RESUME-ABORT, so ResumeSessionAsync no longer disrupts
   in-flight tool execution.

Verified end-to-end: session processing during relaunch → new app detects
active session → eager resume → SDK connected → events flow → session
completes normally. Multi-agent orchestration resumes correctly via
ResumeOrchestrationIfPendingAsync.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen added a commit that referenced this pull request Apr 1, 2026
…elaunch

Two bugs prevented worker sessions from surviving app relaunch:

1. IsSessionStillProcessing used a whitelist of 'active' event types,
   but missed intermediate states like assistant.turn_end (between tool
   rounds), assistant.message, and tool.execution_end. Sessions caught
   between tool rounds were incorrectly detected as idle.

   Fix: Use a blacklist of terminal events (session.idle, session.error,
   session.shutdown) instead. Any non-terminal event means still active.

2. Actively-processing sessions were left as lazy placeholders with no
   SDK event handler (commit ff9d3d7). Events from the CLI never reached
   PolyPilot, the watchdog timed out after 600s, and multi-agent
   orchestrator TCSs were never completed.

   Fix: Add actively-processing sessions to eagerResumeCandidates so
   EnsureSessionConnectedAsync establishes the SDK connection. PR #452
   already removed RESUME-ABORT, so ResumeSessionAsync no longer disrupts
   in-flight tool execution.

Verified end-to-end: session processing during relaunch → new app detects
active session → eager resume → SDK connected → events flow → session
completes normally. Multi-agent orchestration resumes correctly via
ResumeOrchestrationIfPendingAsync.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen added a commit that referenced this pull request Apr 1, 2026
…elaunch

Two bugs prevented worker sessions from surviving app relaunch:

1. IsSessionStillProcessing used a whitelist of 'active' event types,
   but missed intermediate states like assistant.turn_end (between tool
   rounds), assistant.message, and tool.execution_end. Sessions caught
   between tool rounds were incorrectly detected as idle.

   Fix: Use a blacklist of terminal events (session.idle, session.error,
   session.shutdown) instead. Any non-terminal event means still active.

2. Actively-processing sessions were left as lazy placeholders with no
   SDK event handler (commit ff9d3d7). Events from the CLI never reached
   PolyPilot, the watchdog timed out after 600s, and multi-agent
   orchestrator TCSs were never completed.

   Fix: Add actively-processing sessions to eagerResumeCandidates so
   EnsureSessionConnectedAsync establishes the SDK connection. PR #452
   already removed RESUME-ABORT, so ResumeSessionAsync no longer disrupts
   in-flight tool execution.

Verified end-to-end: session processing during relaunch → new app detects
active session → eager resume → SDK connected → events flow → session
completes normally. Multi-agent orchestration resumes correctly via
ResumeOrchestrationIfPendingAsync.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
arisng pushed a commit to arisng/PolyPilot that referenced this pull request Apr 4, 2026
## Summary

Removes the RESUME-ABORT logic that was killing legitimate long-running
sessions on app restart.

## Problem

In persistent mode, the headless CLI keeps running tools while PolyPilot
is down. On resume, the old code aborted sessions with unmatched tool
starts, destructively killing 15-30+ min tool runs.

## Fix

Single `RESUME-ACTIVE` path: mark as processing, let events flow,
watchdog handles dead sessions via timeout.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant