[FEATURE]: Runtime watchdog for stuck tool/session recovery (complements #19023 startup recovery) #20099

@ESRE-dev

Description

Feature hasn't been suggested before.

  • I have verified this feature I'm about to request hasn't been suggested before.

Describe the enhancement you want to request

This is a follow-up to #19023 (startup orphaned state recovery).

#19023 addresses cleanup at process startup after restart/crash. This request is specifically for a runtime watchdog while the server stays alive.

Requested runtime enhancements:

  1. Periodic watchdog tick that scans for tool parts that have been in the running state longer than a configured threshold.
  2. Leaf-level filtering so task tools waiting on child sessions are not force-failed prematurely.
  3. Session-level idle detection with configurable threshold to cancel inactive stuck sessions.
  4. Startup orphan cleanup remains covered by #19023 (Sessions permanently stuck after server restart or stream interruption: no startup recovery for orphaned messages/tool parts); this issue is intentionally scoped to runtime watchdog behavior.
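
To make items 1 and 2 concrete, here is a minimal sketch of what a watchdog tick with leaf-level filtering might look like. Everything in it is hypothetical: `ToolPart`, `childSessionID`, `findStuckParts`, and `startWatchdog` are illustrative names, not opencode's actual types or APIs. It also assumes that the longer task threshold applies to task tools that are still waiting on a child session with running parts, which is one possible reading of the request.

```typescript
// Hypothetical shapes for illustration only; not opencode's real data model.
type ToolPart = {
  id: string;
  sessionID: string;
  tool: string; // e.g. "bash" or "task"
  status: "running" | "completed" | "error";
  startedAt: number; // epoch milliseconds
  childSessionID?: string; // set when a task tool spawned a child session
};

type WatchdogThresholds = { toolTimeoutMs: number; taskTimeoutMs: number };

// A part is a leaf unless it is waiting on a child session that still has running parts.
function isLeaf(part: ToolPart, all: ToolPart[]): boolean {
  if (part.childSessionID === undefined) return true;
  return !all.some(
    (p) => p.sessionID === part.childSessionID && p.status === "running",
  );
}

// Returns the running parts that exceeded their threshold at time `now`.
function findStuckParts(
  parts: ToolPart[],
  now: number,
  t: WatchdogThresholds,
): ToolPart[] {
  return parts.filter((part) => {
    if (part.status !== "running") return false;
    const age = now - part.startedAt;
    // Assumption: non-leaf task tools get the longer task timeout so they
    // are not force-failed while their child session is still working.
    return isLeaf(part, parts) ? age > t.toolTimeoutMs : age > t.taskTimeoutMs;
  });
}

// Periodic tick; returns a function that stops the watchdog.
function startWatchdog(
  getParts: () => ToolPart[],
  fail: (part: ToolPart, reason: string) => void,
  t: WatchdogThresholds,
  intervalMs: number,
): () => void {
  const timer = setInterval(() => {
    const now = Date.now();
    for (const part of findStuckParts(getParts(), now, t)) {
      fail(part, `tool ${part.tool} exceeded watchdog timeout after ${now - part.startedAt}ms`);
    }
  }, intervalMs);
  return () => clearInterval(timer);
}
```

With this shape, a deadlocked child tool is failed first, and the parent task tool only becomes eligible once its child has no running parts left or the task threshold itself is exceeded.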

Suggested config keys:

  • experimental.tool_timeout
  • experimental.task_timeout
  • experimental.idle_timeout
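
As a rough sketch of how these keys could be consumed, the snippet below resolves the three timeouts with fallbacks and applies the idle check from item 3. The units (milliseconds), the default values, and the names `WatchdogConfig`, `resolveTimeouts`, and `isSessionIdle` are all assumptions for illustration; only the three key names come from this request.

```typescript
// Hypothetical config shape; only the three key names are from this issue.
type WatchdogConfig = {
  experimental?: {
    tool_timeout?: number; // assumed to be milliseconds
    task_timeout?: number; // assumed to be milliseconds
    idle_timeout?: number; // assumed to be milliseconds
  };
};

type ResolvedTimeouts = {
  tool_timeout: number;
  task_timeout: number;
  idle_timeout: number;
};

// Placeholder defaults, chosen only so the example runs.
const DEFAULT_TIMEOUTS: ResolvedTimeouts = {
  tool_timeout: 5 * 60 * 1000,
  task_timeout: 30 * 60 * 1000,
  idle_timeout: 10 * 60 * 1000,
};

// Fill in any missing key with its placeholder default.
function resolveTimeouts(config: WatchdogConfig): ResolvedTimeouts {
  const exp = config.experimental ?? {};
  return {
    tool_timeout: exp.tool_timeout ?? DEFAULT_TIMEOUTS.tool_timeout,
    task_timeout: exp.task_timeout ?? DEFAULT_TIMEOUTS.task_timeout,
    idle_timeout: exp.idle_timeout ?? DEFAULT_TIMEOUTS.idle_timeout,
  };
}

// Session-level idle detection: true when there was no activity for longer
// than the idle threshold.
function isSessionIdle(
  lastActivityAt: number,
  now: number,
  idleTimeoutMs: number,
): boolean {
  return now - lastActivityAt > idleTimeoutMs;
}
```

Keeping the three thresholds separate would let a deployment tolerate long-running task tools while still cancelling sessions that show no activity at all.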

Why this matters

Startup-only recovery does not help when a session wedges during normal runtime. If abort propagation fails or a child tool deadlocks, sessions can remain stuck indefinitely until manual intervention.

A runtime watchdog provides a safety net and keeps sessions progressing or failing fast with actionable errors.

Labels

core
