Problem
Conductor intermittently crashes during startup, after emitting workflow_started and agent_started events but before rendering the first agent prompt. The process exits silently with no error event logged. Retrying the same command succeeds.
Observed across 56 workflow runs of pr-reviewer (v0.1.8 and v0.1.9) on Windows over 8 days. 26 of 56 runs (46%) crashed on startup. Every crash is recoverable by simply re-running the same command.
Evidence
Crash signature
All crash logs have exactly 2 events and nearly identical file sizes (90,748-90,750 bytes):
- Event 1: `workflow_started` (timestamp T)
- Event 2: `agent_started` (timestamp T + 0.001s)
- EOF -- no further events, no error event
Time between `workflow_started` and `agent_started` is ~1-2 ms. The process dies before the next event (`agent_prompt_rendered`) would be emitted.
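A crash log can therefore be detected mechanically. A minimal sketch in Python, assuming each log is JSONL with one `event` field per line (an assumption -- the actual Conductor log schema is not documented here):

```python
import json

# The 2-event startup-crash signature: workflow_started then agent_started,
# followed by EOF with no error event.
CRASH_SIGNATURE = ("workflow_started", "agent_started")

def is_startup_crash(log_lines):
    """Return True if a run's event log matches the crash signature exactly."""
    events = tuple(json.loads(line)["event"] for line in log_lines if line.strip())
    return events == CRASH_SIGNATURE
```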
Crash rate
| Metric | Value |
| --- | --- |
| Total pr-reviewer runs | 56 |
| 2-event crashes | 26 (46%) |
| Successful runs | 30 |
| Crash log sizes | 90,748 (4x), 90,749 (6x), 90,750 (16x) bytes |
| Successful log sizes | 92 KB - 722 KB |
Crash-before-success patterns
Every crash was eventually followed by a successful run of the same workflow with the same arguments:
- 5 crashes, then success
- 1 crash, then success
- 3 crashes, then success
- 1 crash, then success
- 1 crash, then success
- 1 crash, then success
- 2 crashes, then success
- 1 crash, then success
- 3 crashes, then success
- 3 crashes, then success
- 1 crash, then success
- 2 crashes, then success
- 2 crashes, then success
Worst case: 5 consecutive crashes before success. No crash was permanent -- retrying always works.
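The grouping above can be recomputed from a chronological list of run outcomes. A small sketch (the outcome labels are illustrative, not anything Conductor emits):

```python
def crashes_before_success(outcomes):
    """Given a chronological list of "crash"/"success" outcomes for one
    workflow, return the number of consecutive crashes preceding each
    success (0 means the run succeeded on the first try)."""
    groups, pending = [], 0
    for outcome in outcomes:
        if outcome == "crash":
            pending += 1
        else:
            groups.append(pending)
            pending = 0
    return groups
```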
Near-crash logs (survived slightly longer)
Some runs survived past agent_started but died before completing the first turn:
| Events | Size | Last event |
| --- | --- | --- |
| 2 | 90,750 bytes | agent_started (crash) |
| 3 | 92,645 bytes | agent_prompt_rendered (crash) |
| 4 | 92,764 bytes | agent_turn_start (crash) |
This suggests the failure window extends from `agent_started` through early `agent_turn_start`, with most crashes happening before `agent_prompt_rendered`.
Environment
- Conductor: v0.1.9 (also observed on v0.1.8)
- Python: 3.13.13
- OS: Windows NT 10.0.26220.0 (Windows 11)
- Workflow: pr-reviewer (8 agents, copilot provider)
- Provider: copilot (Copilot CLI MCP)
Reproduction
Not deterministic -- crashes are intermittent (~46% of runs). Steps:
- Run any conductor workflow that uses the copilot provider:
  `conductor --silent run pr-review.yaml --input pr_data_dir="tmp/pr-review" --web-bg`
- Observe that approximately half of runs crash immediately (check `$TEMP/conductor/` for 2-event logs)
- Re-run the same command -- it succeeds
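Checking for 2-event logs can be scripted. A sketch assuming one JSON event per line and a `.jsonl` extension (both assumptions -- adjust to the actual log layout):

```python
import json
from pathlib import Path

def find_startup_crash_logs(log_dir):
    """Return log files in log_dir that show the 2-event crash signature."""
    crashes = []
    for path in Path(log_dir).glob("*.jsonl"):  # extension is an assumption
        lines = [l for l in path.read_text().splitlines() if l.strip()]
        events = [json.loads(l)["event"] for l in lines]
        if events == ["workflow_started", "agent_started"]:
            crashes.append(path)
    return crashes
```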
Impact
- Wasted time: Each crash adds 10-60 seconds (backoff) before retry
- Pipeline reliability: Automated pipelines that invoke conductor must implement retry loops with crash detection
- Silent failure: No error message or event is logged -- the process just exits. Callers must infer the crash from the 2-event log pattern or short runtime
Possible causes (speculation)
Since the crash happens between process startup and first LLM API call, possible causes include:
- Race condition during MCP server initialization (copilot provider starts filesystem server via Node.js)
- Port binding failure for the MCP transport
- Python subprocess spawning race (conductor spawns child processes for MCP servers)
- Async event loop initialization failure
The fact that retrying always works suggests a transient resource contention issue rather than a configuration problem.
Workaround
Callers should retry conductor on short-lived crashes. Example (PowerShell):
```powershell
$maxRetries = 3
$crashThresholdSec = 10   # runs shorter than this are treated as startup crashes

for ($attempt = 0; $attempt -le $maxRetries; $attempt++) {
    $start = [DateTime]::UtcNow
    # $args forwards this wrapper script's own arguments to conductor
    $proc = Start-Process conductor -ArgumentList $args -NoNewWindow -PassThru
    $proc.WaitForExit()
    $runtime = ([DateTime]::UtcNow - $start).TotalSeconds
    if ($proc.ExitCode -eq 0) { break }             # success
    if ($runtime -ge $crashThresholdSec) { break }  # real failure, not a crash
    $backoff = [math]::Min(10 * [math]::Pow(2, $attempt), 60)
    Start-Sleep -Seconds $backoff
}
```
This distinguishes crashes (runtime < 10s) from real workflow failures (runtime > 10s) and only retries crashes.
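For pipelines that are not PowerShell-based, an equivalent sketch in Python (same retry count, threshold, and backoff as above; `cmd` is whatever you would pass on the command line):

```python
import subprocess
import sys
import time

MAX_RETRIES = 3
CRASH_THRESHOLD_SEC = 10   # shorter non-zero runs are treated as startup crashes

def run_with_retry(cmd):
    """Run cmd, retrying only short-lived non-zero exits (startup crashes)."""
    for attempt in range(MAX_RETRIES + 1):
        start = time.monotonic()
        result = subprocess.run(cmd)
        runtime = time.monotonic() - start
        if result.returncode == 0:
            return 0                          # success
        if runtime >= CRASH_THRESHOLD_SEC:
            return result.returncode          # real failure, do not retry
        if attempt < MAX_RETRIES:
            time.sleep(min(10 * 2 ** attempt, 60))  # 10s, 20s, 40s backoff
    return result.returncode
```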