Tracked job lifecycle: hard timeouts and runner watchdogs #13

@JohnnyVicious

Summary

runTrackedJob in plugins/opencode/scripts/lib/tracked-jobs.mjs:64 awaits the runner promise with no wall-clock guard:

const result = await runner({ report, log });

If the runner stalls (dropped SSE, a provider that stops responding after the HTTP call returns, an exception inside getSessionDiff, a wedged follow-up API call), the await never resolves and nothing writes a terminal status for the job. The job file stays at status: "running" until SessionEnd reaps it, and SessionEnd only reaps jobs whose PID is already dead, so a stalled job whose process is still alive stays "running" indefinitely.

Partial mitigation already exists: sendPrompt in lib/opencode-server.mjs:195 wraps its fetch in AbortSignal.timeout(600_000), so the big inference call has a 10-minute cap. But everything outside that single fetch (the diff fetch in handleTask, JSON parsing, result-file writes) is unguarded.
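To illustrate the gap, here is a minimal sketch (not the plugin's code) of a signal-based cap that only bounds the one operation it is passed to. `abortableDelay` and `promptThenDiff` are hypothetical stand-ins for the guarded fetch and the unguarded follow-up work; only `AbortSignal.timeout` is a real API.

```javascript
// Hypothetical helper: a delay that honors an AbortSignal, standing in for
// the fetch that sendPrompt guards with AbortSignal.timeout(600_000).
function abortableDelay(ms, signal) {
  return new Promise((resolve, reject) => {
    if (signal.aborted) {
      reject(signal.reason);
      return;
    }
    const onAbort = () => {
      clearTimeout(timer);
      reject(signal.reason);
    };
    const timer = setTimeout(() => {
      signal.removeEventListener("abort", onAbort);
      resolve("done");
    }, ms);
    signal.addEventListener("abort", onAbort, { once: true });
  });
}

// Hypothetical flow: one guarded step followed by one unguarded step.
async function promptThenDiff({ capMs, promptMs, diffMs }) {
  // Guarded: rejects with a TimeoutError if it runs longer than capMs.
  await abortableDelay(promptMs, AbortSignal.timeout(capMs));
  // Unguarded: the cap above does not apply here, so this step can take
  // arbitrarily long (or never finish) without anything rejecting.
  await new Promise((resolve) => setTimeout(resolve, diffMs));
  return "finished";
}
```

With `capMs: 100`, a 5 ms guarded step followed by a 150 ms unguarded step still resolves even though the whole call outlives the cap, which is the behavior the issue describes for the work after the fetch.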

Suggested fix

  1. Add a taskTimeoutMs option to runTrackedJob (default ~15 min, configurable via env or setup state). On expiry, write status: "failed" with phase: "timeout" and reject.
  2. Optional: an idle watchdog on report() calls. If report() has not been called for N seconds, fail the job.
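Both suggestions could be sketched roughly as below. This is an illustration under assumptions, not the actual runTrackedJob: the function name `runTrackedJobWithTimeout`, the `idleTimeoutMs` option, the injected `writeStatus` callback, and the `"completed"` and `"idle"` values are all hypothetical. Only `taskTimeoutMs`, `report`, `log`, and the failed / phase: "timeout" outcome come from the issue.

```javascript
// Sketch only. writeStatus stands in for whatever writes the job file.
async function runTrackedJobWithTimeout(runner, options = {}) {
  const {
    taskTimeoutMs = 15 * 60 * 1000, // ~15 min default, as suggested above
    idleTimeoutMs = 0,              // hypothetical; 0 disables the idle watchdog
    writeStatus = async () => {},   // hypothetical job-file writer
    log = () => {},
  } = options;

  let idleTimer;
  let rejectGuard;
  // Never resolves; rejects only when one of the guards fires.
  const guard = new Promise((_, reject) => {
    rejectGuard = reject;
  });

  const fail = (phase, message) => {
    const err = new Error(message);
    err.phase = phase;
    rejectGuard(err);
  };

  // (Re)start the idle countdown; called once up front and on every report().
  const armIdle = () => {
    if (idleTimeoutMs <= 0) return;
    clearTimeout(idleTimer);
    idleTimer = setTimeout(
      () => fail("idle", `no progress for ${idleTimeoutMs} ms`),
      idleTimeoutMs,
    );
  };

  const report = (update) => {
    armIdle();
    log(update);
  };

  const hardTimer = setTimeout(
    () => fail("timeout", `job exceeded ${taskTimeoutMs} ms`),
    taskTimeoutMs,
  );
  armIdle();

  try {
    // Whichever settles first wins: the runner or a guard rejection.
    const result = await Promise.race([runner({ report, log }), guard]);
    await writeStatus({ status: "completed" }); // assumed success status name
    return result;
  } catch (err) {
    // Always leave a terminal status behind, including for timeouts.
    await writeStatus({
      status: "failed",
      phase: err.phase ?? "error",
      error: err.message,
    });
    throw err;
  } finally {
    clearTimeout(hardTimer);
    clearTimeout(idleTimer);
  }
}
```

Note that Promise.race only stops waiting; it does not cancel the runner. A stalled runner keeps whatever resources it holds unless it also receives an AbortSignal or similar cancellation hook, which this sketch leaves out.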

Upstream reference

Derived from openai/codex-plugin-cc#183 — same root-cause pattern.
