Review jobs stuck in 'running' after process death — no dead-PID detection #164

@peterdrier

Problem

When a Codex review process crashes or times out, the companion job state file stays at "status": "running" indefinitely. The --wait polling loop keeps checking for completion of a process whose PID no longer exists, and eventually times out with no useful error message.

Reproduction

  1. Launch a large review: /codex:review --wait --base upstream/main (148 files, ~19.5k lines)
  2. The review session starts, logs "Reviewer started", then the process dies silently
  3. Job state file still says "status": "running"
  4. The --wait poll loop spins for 10+ minutes before the caller gives up
  5. No error is recorded — the job just looks "stuck"
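To confirm that a job is in the state described in steps 3 to 5, a check along the following lines could be run against the parsed job state. This is a minimal sketch, not the plugin's code: only the "status" field comes from this report, while the `pid` field and the function names `pid_is_alive` and `is_stale` are assumptions made for illustration. It relies on POSIX signal 0 semantics and should not be used as-is on Windows.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    if pid <= 0:
        # Guard: non-positive PIDs address process groups, not a single process.
        return False
    try:
        os.kill(pid, 0)  # signal 0 only performs the existence/permission check
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def is_stale(job: dict) -> bool:
    """A job is stale if it claims to be running but its recorded PID is gone.

    `pid` is a hypothetical field name; the real state file may differ.
    """
    pid = job.get("pid")
    if job.get("status") != "running" or not isinstance(pid, int):
        return False  # finished, or no PID recorded, so nothing can be concluded
    return not pid_is_alive(pid)
```

Two caveats apply to any check of this kind: a child that has exited but has not yet been reaped by its parent (a zombie) still counts as existing, and PIDs can be reused by unrelated processes, so recording the process start time alongside the PID would make the check more robust.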

Evidence from job log

[2026-04-07T02:21:02.393Z] Starting Codex Review.
[2026-04-07T02:21:03.165Z] Starting Codex review thread.
[2026-04-07T02:21:04.768Z] Thread ready (019d65be-512f-7cc0-85f2-a5ddeb9bce70).
[2026-04-07T02:21:04.825Z] Reviewer started: changes against 'upstream/main'
(nothing further — process PID 2568362 died)

Expected behavior

  1. The broker should detect when the child Codex process exits unexpectedly and mark the job as "failed" with an error message
  2. The --wait poll should check if the PID is still alive and fail fast if it isn't
  3. On timeout, the error message should indicate the process died rather than just "still running"
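Items 2 and 3 could look roughly like the poll loop below. This is a sketch under stated assumptions rather than a proposed patch: the state file is assumed to be JSON with a `pid` field next to "status", and `wait_for_job`, the `error` field, and the timeout and interval defaults are all hypothetical. The liveness check uses POSIX signal 0.

```python
import json
import os
import time
from pathlib import Path


def pid_is_alive(pid: int) -> bool:
    """POSIX-only existence check using signal 0."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True
    return True


def wait_for_job(state_path: Path, timeout_s: float = 600.0,
                 interval_s: float = 2.0) -> dict:
    """Poll the job state file; fail fast if the recorded PID has died."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = json.loads(state_path.read_text())
        if job.get("status") != "running":
            return job  # the job reached a terminal state on its own
        pid = job.get("pid")  # hypothetical field name
        if isinstance(pid, int) and not pid_is_alive(pid):
            # Re-read once: the process may have written its final state
            # just before exiting, after our first read.
            job = json.loads(state_path.read_text())
            if job.get("status") == "running":
                job["status"] = "failed"
                job["error"] = (f"review process (PID {pid}) exited "
                                "without recording a result")
                state_path.write_text(json.dumps(job, indent=2))
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"job still running after {timeout_s:.0f}s (PID {pid} is alive)")
        time.sleep(interval_s)
```

With a check like this, the case in the reproduction would return a "failed" job within one poll interval instead of waiting out the full timeout, and a genuine timeout would state that the process is still alive.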

Impact

  • Wastes 10+ minutes of wall time before the caller realizes the review failed
  • No diagnostic information about why it failed (OOM? API timeout? context overflow?)
  • Makes large reviews unreliable — the only recovery is manual cancel + retry
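On the diagnostics point, whatever component launches the reviewer is usually in the best position to record why it exited. The sketch below illustrates the idea with Python's `subprocess` module; the function names and the `error`, `exitCode`, and `stderrTail` fields are hypothetical and not taken from the plugin. The SIGKILL hint is only a guess at the cause, since other things can send that signal too.

```python
import signal
import subprocess


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a human-readable reason (POSIX)."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # subprocess reports "killed by signal N" as a return code of -N.
        sig = -returncode
        try:
            name = signal.Signals(sig).name
        except ValueError:
            name = f"signal {sig}"
        hint = " (possibly the kernel OOM killer)" if sig == signal.SIGKILL else ""
        return f"killed by {name}{hint}"
    return f"exited with code {returncode}"


def run_and_record(cmd: list[str], job: dict) -> dict:
    """Run the child to completion and record the outcome in the job dict."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        job["status"] = "completed"
    else:
        job["status"] = "failed"
        job["error"] = describe_exit(result.returncode)
        job["stderrTail"] = result.stderr[-2000:]  # keep the last part only
    job["exitCode"] = result.returncode
    return job
```

Recording the exit reason at the point where the child is reaped means the job never stays "running" after the process is gone, and the caller sees the difference between a non-zero exit and a kill signal.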
