Description
When the OpenCode server restarts (or the process crashes) while a session is actively executing tool calls, the session gets permanently stuck in a "Thinking" state. The root cause is that there is no startup recovery that cleans up orphaned assistant messages and tool parts.
What happens
- A session is actively executing tool calls (e.g., bash commands)
- The server restarts or crashes
- The in-memory session state (SessionStatus) is lost, so the session is no longer "busy"
- But the database state is stale: the last assistant message has time.completed = undefined (never completed) and tool parts remain in status: "running" forever
- The UI sees the incomplete assistant message and shows a permanent "Thinking" spinner
- The session cannot recover — sending a new message creates a new loop iteration, but the old orphaned message still exists
Root cause analysis
The existing cleanup in processor.ts:402-417 correctly handles the normal case — when the stream ends (normally, via error, or abort), it force-sets any non-terminal tool parts to status: "error". However, this cleanup only runs if the process survives long enough to reach it.
There is zero recovery at startup:
- Session.initialize() does not scan for orphaned messages
- SessionStatus (in-memory map) is empty after restart, so there is no stale detection
- No background watchdog checks for sessions stuck in busy state
The only defense is in toModelMessages() (message-v2.ts:740-746), which converts pending/running tool parts into "[Tool execution was interrupted]" when building the next LLM prompt. This helps contextual recovery if the user sends a new message, but the UI still shows the session as stuck because the orphaned assistant message has no time.completed.
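The mismatch described above can be expressed as a small predicate. This is a minimal sketch, not OpenCode's actual code: the `AssistantMessage` and `ToolPart` shapes are simplified assumptions, and `isOrphaned` and `busySessions` are hypothetical names standing in for a check against the in-memory SessionStatus map.

```typescript
// Hypothetical, simplified shapes; the real OpenCode types carry more fields.
type ToolStatus = "pending" | "running" | "completed" | "error";

interface ToolPart {
  id: string;
  status: ToolStatus;
}

interface AssistantMessage {
  id: string;
  sessionID: string;
  role: "assistant";
  time: { created: number; completed?: number };
  parts: ToolPart[];
}

// A message is orphaned when it never completed and no in-memory loop
// claims its session as busy (for example, after a restart emptied the map).
function isOrphaned(
  msg: AssistantMessage,
  busySessions: ReadonlySet<string>,
): boolean {
  return msg.time.completed === undefined && !busySessions.has(msg.sessionID);
}
```

After a restart the busy set is empty, so every assistant message without time.completed matches this predicate, which is the state the UI currently renders as "Thinking".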
Observed in production
- Session ses_2f4299f5cffeVZfxCt3ViZ7eVJ was stuck for 3+ hours with a git log tool part permanently in "running" status
- Session ses_2e9127723ffeKJ1JpjLNS35B4z showed a similar pattern (this one was actually still running a long k8s test, but it demonstrates the same vulnerability)
Relation to existing issues
This is the backend root cause behind several reported symptoms:
Open PRs #16907 and #17593 address frontend symptoms (making the UI more defensive about stale state), but neither fixes the backend root cause — orphaned messages and tool parts in the database.
Proposed fix
Startup recovery in Session or app bootstrap:
- On server start, query all messages where time.completed IS NULL and the message role = "assistant"
- For each orphaned message:
  - Set time.completed = Date.now()
  - Set all tool parts with status = "running" or status = "pending" to status = "error" with error = "Tool execution was interrupted (server restart)"
- Emit Bus events so connected frontends update
This is a small, safe change: the cleanup logic already exists in processor.ts:402-417; it just needs to be callable from a recovery path at startup.
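The recovery steps above could look roughly like the following. This is a sketch under stated assumptions, not the project's implementation: it works on an in-memory array rather than the real storage layer, and `StoredMessage`, `StoredPart`, `recoverOrphanedMessages`, the `emit` callback, and the `"message.updated"` event name are all hypothetical placeholders for whatever the database and Bus APIs actually expose.

```typescript
// Hypothetical, simplified shapes; the real storage schema may differ.
type PartStatus = "pending" | "running" | "completed" | "error";

interface StoredPart {
  id: string;
  status: PartStatus;
  error?: string;
}

interface StoredMessage {
  id: string;
  role: "user" | "assistant";
  time: { created: number; completed?: number };
  parts: StoredPart[];
}

// Placeholder event shape standing in for a real Bus event.
interface RecoveryEvent {
  type: "message.updated";
  messageID: string;
}

const INTERRUPTED = "Tool execution was interrupted (server restart)";

// Closes every assistant message that never completed and marks its
// non-terminal tool parts as errored. Returns the number of messages fixed.
function recoverOrphanedMessages(
  messages: StoredMessage[],
  emit: (event: RecoveryEvent) => void,
  now: () => number = Date.now,
): number {
  let recovered = 0;
  for (const msg of messages) {
    // Only assistant messages without a completion time are orphans.
    if (msg.role !== "assistant" || msg.time.completed !== undefined) continue;
    for (const part of msg.parts) {
      if (part.status === "running" || part.status === "pending") {
        part.status = "error";
        part.error = INTERRUPTED;
      }
    }
    msg.time.completed = now();
    // Notify connected frontends so the spinner clears without a reload.
    emit({ type: "message.updated", messageID: msg.id });
    recovered++;
  }
  return recovered;
}
```

Running this once during bootstrap, before any session loop starts, avoids racing with live sessions: at that point no message can legitimately be in flight, so every incomplete assistant message is safe to close.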
Steps to reproduce
- Start opencode serve
- Start a session that uses tool calls (e.g., ask it to run tests)
- Kill the server process while tools are executing (kill -9)
- Restart the server
- Open the session in the UI: it shows a permanent "Thinking" spinner
- Session status API returns {} (idle) but the UI is stuck
Environment
- opencode serve (long-running, multiple sessions)
- macOS / Linux
- Any provider (observed with gpt-5.3-codex via github-copilot)
OpenCode version
Latest dev branch (commit 814a515a8)