Problem
When a Codex review process crashes or times out, the companion job state file stays in "status": "running" indefinitely. The --wait polling loop continues checking for completion against a dead PID, eventually timing out with no useful error message.
Reproduction
- Launch a large review: /codex:review --wait --base upstream/main (148 files, ~19.5k lines)
- The review session starts, logs "Reviewer started", then the process dies silently
- Job state file still says "status": "running"
- The --wait poll loop spins for 10+ minutes before the caller gives up
- No error is recorded — the job just looks "stuck"
Evidence from job log
[2026-04-07T02:21:02.393Z] Starting Codex Review.
[2026-04-07T02:21:03.165Z] Starting Codex review thread.
[2026-04-07T02:21:04.768Z] Thread ready (019d65be-512f-7cc0-85f2-a5ddeb9bce70).
[2026-04-07T02:21:04.825Z] Reviewer started: changes against 'upstream/main'
(nothing further — process PID 2568362 died)
Expected behavior
- The broker should detect when the child Codex process exits unexpectedly and mark the job as "failed" with an error message
- The --wait poll should check if the PID is still alive and fail fast if it isn't
- On timeout, the error message should indicate the process died rather than just "still running"
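
A minimal sketch of the fail-fast behavior described above, written in Python for illustration only; it is not the plugin's actual code. It assumes a POSIX system, a JSON job state file, and a "pid" field alongside "status". The "pid" and "error" field names and the helpers pid_alive and wait_for_job are hypothetical.

```python
import json
import os
import time
from pathlib import Path


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence; nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def wait_for_job(state_path: Path, timeout_s: float = 600.0,
                 interval_s: float = 1.0) -> dict:
    """Poll the job state file, failing fast if the recorded PID is gone."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = json.loads(state_path.read_text())
        if state.get("status") != "running":
            return state  # the job reached a terminal state on its own

        pid = state.get("pid")  # hypothetical field name
        if isinstance(pid, int) and not pid_alive(pid):
            # Re-read once: the process may have written its final state
            # just before exiting.
            state = json.loads(state_path.read_text())
            if state.get("status") == "running":
                state["status"] = "failed"
                state["error"] = (
                    f"review process {pid} exited without recording a result"
                )
                state_path.write_text(json.dumps(state, indent=2))
            return state

        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"job still running after {timeout_s:.0f}s (PID {pid} is alive)"
            )
        time.sleep(interval_s)
```

With a check like this, the timeout path is reached only while the process is still alive, so a timeout message can be told apart from a crash. PID reuse can still produce a false "alive" result; recording the process start time next to the PID would be one way to narrow that gap.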
Impact
- Wastes 10+ minutes of wall time before the caller realizes the review failed
- No diagnostic information about why it failed (OOM? API timeout? context overflow?)
- Makes large reviews unreliable — the only recovery is manual cancel + retry