fix(engine): recover from stalled in-progress turns#2283
Open
HUQIANTAO wants to merge 3 commits into
Open
Conversation
reconcile_turn_liveness() had a blind spot: when TurnStarted arrived (setting runtime_turn_status to "in_progress") but TurnComplete never came (sub-agent hang, engine panic, lost event), neither existing watchdog branch fired. is_loading stayed true permanently, queuing all subsequent messages. Add Branch 3 with a 5-minute timeout (matched to stream idle timeout) that checks turn_started_at for staleness when the turn is stuck in "in_progress" with no running sub-agents.
a3c73bf to
8ed924f
Compare
Contributor
There was a problem hiding this comment.
Code Review
This pull request introduces a watchdog timeout (TURN_STALL_WATCHDOG_TIMEOUT set to 300 seconds) to detect and recover from stalled turns in the TUI, along with corresponding unit tests. Feedback on the changes points out that when a turn stalls and is recovered, active tool executions or streaming assistant messages are left in a running state, causing permanent spinners in the UI. It is recommended to finalize the active cell and streaming assistant as interrupted and reset the streaming state during recovery.
The watchdog Branch 3 recovery left in-flight tool cells and streaming assistant messages in a running state, causing permanent spinners in the transcript. Also left runtime_turn_id stale, showing "(in progress)" for a turn that had already been recovered. Align the cleanup with apply_engine_error_to_app: finalize thinking, streaming assistant, and active cells as interrupted; reset streaming state; clear runtime_turn_id and streaming indices.
Without this, the turn immediately after a stall recovery would inherit the scroll-lock from the stalled turn and silently skip auto-scroll, leaving the user staring at stale content.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
reconcile_turn_liveness()where a turn stuck in"in_progress"was never recovered, leavingis_loadingpermanentlytrueTURN_STALL_WATCHDOG_TIMEOUT(5 min) constant matched to stream idle timeoutturn_started_atfor staleness when no sub-agents are runningProblem
Users reported requests hanging indefinitely — the UI showed a spinner, new messages were queued ("edit last queued message"), and the app appeared frozen.
Root cause:
reconcile_turn_liveness()had two branches that missed the"in_progress"state:runtime_turn_status.is_none()TurnStartedarrivesruntime_turn_status ∈ {completed, interrupted, failed}TurnCompleteruntime_turn_status == "in_progress"+ timeoutWhen
TurnStartedarrived butTurnCompletewas lost (sub-agent hang, engine panic, etc.), neither existing branch triggered.dispatch_started_atis cleared onTurnStarted(line 1387), so Branch 1's 30s timeout only covers the dispatch window, not the actual turn execution.Fix
Added a third recovery branch that fires when:
is_loadingis trueruntime_turn_statusis"in_progress"turn_started_atexceedsTURN_STALL_WATCHDOG_TIMEOUT(300s)Recovery clears
is_loading,turn_started_at,runtime_turn_status, anddispatch_started_at, then shows an Error toast.Test plan
turn_liveness_leaves_active_turn_running— turn within timeout does NOT trigger recoveryturn_liveness_recovers_stalled_in_progress_turn— turn exceeding timeout DOES trigger recoveryturn_liveness_watchdog_clears_stale_dispatch,turn_liveness_reconciles_completed_busy_state)Greptile Summary
This PR fixes a watchdog blind spot in
reconcile_turn_liveness()where a turn stuck in\"in_progress\"(engine panic, lost completion event) was never recovered, leavingis_loadingpermanentlytrueand the UI frozen. All three issues flagged in the previous review round are addressed in this version.is_loading,turn_started_at,runtime_turn_status,runtime_turn_id,dispatch_started_at, anduser_scrolled_during_stream, calls the same cell-finalisation helpers asapply_engine_error_to_app, and fires an Error toast after 300 s with no completion signal.Confidence Score: 5/5
Safe to merge; the new watchdog branch is well-guarded and mirrors the existing error-recovery helpers correctly.
The change adds one new conditional branch with tight guards. All fields touched by the branch are the same ones cleared in the existing TurnComplete and apply_engine_error_to_app paths. The only open item is a broken rustdoc link, which has no runtime impact.
No files require special attention; both changed files are narrow in scope.
Important Files Changed
Reviews (3): Last reviewed commit: "fix(tui): reset user_scrolled_during_str..." | Re-trigger Greptile