fix: persist transcripts on idle + reconnect idle workers on restart (#334)
Conversation
Both OpenCode and builtin interactive workers now write a transcript snapshot to the DB each time they go idle (initial task + every follow-up). If spacebot restarts while a worker is waiting for input, the transcript survives rather than being lost. OpenCode workers get an optional sqlite_pool field; builtin workers already have it via AgentDeps. The snapshot overwrites the blob each time with the complete accumulated history.
> Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
Walkthrough

Resume + persistence for idle interactive workers: adds an IdleWorker DB row and resume APIs, startup logic to pre-create channels and resume idle workers, OpenCode server pool directory claiming, worker resume/spawn augmentation with secrets + SQLite, resumed-run flows, and transcript-to-history conversion (duplicate definition present).
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed
src/agent/worker.rs (outdated)
```rust
let mut follow_up_failure: Option<String> = None;

if let Some(mut input_rx) = self.input_rx.take() {
    self.state = WorkerState::WaitingForInput;
    self.persist_transcript(&compacted_history, &history);
```
Nice durability improvement, but since persist_transcript is fire-and-forget, there’s still a window where we emit WorkerIdle while the UPDATE hasn’t actually landed yet. If the process dies in that gap you can still lose the “idle” snapshot.
Also worth considering write ordering: with multiple idle transitions, older spawned UPDATEs can complete after newer ones and overwrite a more complete transcript.
Might be worth making the persistence async + awaited at idle boundaries (or funnel snapshots through a single ordered queue) so “idle” implies the snapshot is actually durable.
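To make that ordering concrete, here is a minimal sketch of an "idle implies persisted" checkpoint. The `persist` and `emit_idle` closures are hypothetical stand-ins for the real `persist_transcript` write and `send_worker_idle`; the point is only the sequencing and error handling, not the actual spacebot API.

```rust
/// Minimal ordering sketch: the snapshot write completes (or fails loudly)
/// before "idle" is ever advertised, so a crash after the idle event can
/// no longer lose the snapshot.
fn idle_checkpoint<E>(
    persist: impl FnOnce() -> Result<(), E>,
    emit_idle: impl FnOnce(),
) -> Result<(), E> {
    persist()?; // durable snapshot first; propagate write failures
    emit_idle(); // only now does "idle" imply a persisted transcript
    Ok(())
}
```

A failed write never emits the idle event, which is exactly the invariant the comment asks for.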
src/opencode/worker.rs (outdated)
```rust
let worker_id = self.id.to_string();
let pool = pool.clone();

tokio::spawn(async move {
```
Same concern here: tokio::spawn makes the snapshot persistence best-effort, so a crash right after send_idle() can still drop the transcript, and multiple in-flight snapshots can complete out of order (older snapshot overwriting a newer one).
If the goal is “idle implies persisted”, consider making this an async fn and awaiting the UPDATE (or pushing snapshots through an ordered persistence task) before emitting the idle event.
Actionable comments posted: 2
Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/agent/worker.rs (1)
330-337: 🛠️ Refactor suggestion | 🟠 Major

Use `transition_to()` in this touched lifecycle path.

These assignments still bypass `can_transition_to()`, so illegal worker-state regressions in the follow-up loop stay silent instead of failing fast.

♻️ Minimal change

```diff
-        self.state = WorkerState::WaitingForInput;
+        self.transition_to(WorkerState::WaitingForInput)?;
         self.persist_transcript(&compacted_history, &history);
         self.hook.send_status("waiting for input");
         self.hook.send_worker_idle();

         while let Some(follow_up) = input_rx.recv().await {
-            self.state = WorkerState::Running;
+            self.transition_to(WorkerState::Running)?;
             self.hook.send_status("processing follow-up");
@@
-            self.state = WorkerState::Failed;
+            self.transition_to(WorkerState::Failed)?;
             self.hook.send_status("failed");
             follow_up_failure = Some(failure_reason);
             break;
@@
-        self.state = WorkerState::WaitingForInput;
+        self.transition_to(WorkerState::WaitingForInput)?;
         self.persist_transcript(&compacted_history, &history);
         self.hook.send_status("waiting for input");
         self.hook.send_worker_idle();
```

As per coding guidelines, "src/agent/worker.rs: Worker state transitions must be validated with `can_transition_to()` using `matches!` to catch illegal transitions as runtime errors".

Also applies to: 413-423
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agent/worker.rs` around lines 330 - 337, The code sets self.state directly to WorkerState::WaitingForInput and WorkerState::Running in the follow-up lifecycle path, bypassing the validation logic; replace those direct assignments with calls to self.transition_to(...) so transitions go through can_transition_to() and produce runtime errors on illegal transitions (apply the same change for the other touched range around the follow-up loop and the block at lines 413-423). Locate the direct writes to self.state and swap them for self.transition_to(WorkerState::WaitingForInput) and self.transition_to(WorkerState::Running), ensuring existing calls like persist_transcript(...) and hook.send_status(...) remain in place and order is preserved.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/agent/worker.rs`:
- Around line 331-332: persist_transcript() is fire-and-forget and can race,
allowing an older idle snapshot to overwrite a later transcript; serialize or
guard these writes by adding per-worker write ordering (e.g., a mutex/queue) or
attach a monotonically increasing checkpoint/version to each persist and ignore
stale updates when writing worker_runs.transcript; specifically modify the calls
around persist_transcript(&compacted_history, &history) (and the other identical
idle checkpoint call) so they either acquire the worker-level lock/sequence
before persisting or pass/check a checkpoint/version and only update the DB when
the incoming checkpoint is newer.
In `@src/opencode/worker.rs`:
- Around line 761-772: The detached tokio::spawn that runs the SQL UPDATE on
worker_runs (the query updating transcript and tool_calls by worker_id) can race
with the completion path in src/agent/channel_dispatch.rs which later does an
unconditional UPDATE with get_messages(); fix by adding a monotonic
checkpoint/version or sequencing per-worker writes: either (A) add a numeric
transcript_version (or last_snapshot_seq) column and include it in the UPDATE
WHERE clause and increment it when writing so the idle snapshot only succeeds if
its expected version matches, or (B) serialize writes for a given worker by
routing all transcript persistence through a single async task/Mutex keyed by
worker_id (instead of spawning a detached task in the block that uses blob,
tool_calls, worker_id). Ensure whichever approach you choose updates the
corresponding read in channel_dispatch.rs to respect the version/serialization
so later definitive writes cannot be overwritten by stale snapshots.
---
Outside diff comments:
In `@src/agent/worker.rs`:
- Around line 330-337: The code sets self.state directly to
WorkerState::WaitingForInput and WorkerState::Running in the follow-up lifecycle
path, bypassing the validation logic; replace those direct assignments with
calls to self.transition_to(...) so transitions go through can_transition_to()
and produce runtime errors on illegal transitions (apply the same change for the
other touched range around the follow-up loop and the block at lines 413-423).
Locate the direct writes to self.state and swap them for
self.transition_to(WorkerState::WaitingForInput) and
self.transition_to(WorkerState::Running), ensuring existing calls like
persist_transcript(...) and hook.send_status(...) remain in place and order is
preserved.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 719c7d64-01dd-4162-888d-1040c9d5a47e
📒 Files selected for processing (3)
src/agent/channel_dispatch.rs, src/agent/worker.rs, src/opencode/worker.rs
src/agent/worker.rs (outdated)

```rust
self.persist_transcript(&compacted_history, &history);
self.hook.send_status("waiting for input");
```
These new idle checkpoints need write ordering.
persist_transcript() is fire-and-forget and overwrites the full transcript blob. With the new calls here, an earlier idle snapshot can complete after a later/final persist and roll worker_runs.transcript back to a shorter history. Please serialize transcript writes per worker or guard the update with a monotonically increasing checkpoint.
Also applies to: 421-422
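The monotonic-checkpoint guard can be sketched in memory. A `HashMap` stands in for the `worker_runs` table here; in SQL the same guard would be a `WHERE transcript_version < ?` clause on the UPDATE (the `transcript_version` column name is an assumption for illustration).

```rust
use std::collections::HashMap;

/// In-memory stand-in for `worker_runs`: worker_id -> (version, transcript blob).
struct TranscriptStore {
    rows: HashMap<String, (u64, Vec<u8>)>,
}

impl TranscriptStore {
    fn new() -> Self {
        Self { rows: HashMap::new() }
    }

    /// Apply a snapshot only when its checkpoint is newer than the stored
    /// one, so a delayed in-flight write can never roll the transcript
    /// back. Returns whether the write was applied.
    fn persist_snapshot(&mut self, worker_id: &str, version: u64, blob: Vec<u8>) -> bool {
        match self.rows.get(worker_id) {
            Some((stored, _)) if *stored >= version => false, // stale update: ignore
            _ => {
                self.rows.insert(worker_id.to_string(), (version, blob));
                true
            }
        }
    }
}
```

Each idle transition would bump the version before persisting, so out-of-order completions become harmless no-ops instead of rollbacks.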
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/agent/worker.rs` around lines 331 - 332, persist_transcript() is
fire-and-forget and can race, allowing an older idle snapshot to overwrite a
later transcript; serialize or guard these writes by adding per-worker write
ordering (e.g., a mutex/queue) or attach a monotonically increasing
checkpoint/version to each persist and ignore stale updates when writing
worker_runs.transcript; specifically modify the calls around
persist_transcript(&compacted_history, &history) (and the other identical idle
checkpoint call) so they either acquire the worker-level lock/sequence before
persisting or pass/check a checkpoint/version and only update the DB when the
incoming checkpoint is newer.
On startup, query worker_runs for idle interactive workers and attempt to resume them instead of marking them as failed:

- Split reconciliation: only running workers are marked failed; idle workers are left for reconnection
- New get_idle_interactive_workers() query loads idle workers with transcript blobs and session metadata
- transcript_to_history() converts persisted TranscriptStep[] back into Rig Message history for builtin worker resume
- Builtin Worker::resume_interactive() creates a worker with prior history, skips initial task execution, enters follow-up loop directly
- OpenCode OpenCodeWorker::resume_interactive() reconnects to existing server+session via get_or_create(), verifies session via get_messages(), repopulates accumulated_parts, enters follow-up loop
- resume_idle_worker_into_state() orchestrates resumption into a ChannelState (handles both builtin and opencode worker types)
- Startup code in main.rs queries idle workers, groups by channel_id, pre-creates channels with outbound routing, and resumes each worker
- Workers that fail to resume are marked failed via fail_idle_worker()
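The transcript-to-history conversion in the list above can be illustrated with simplified stand-in types. These are only a shape sketch; the real TranscriptStep and Rig Message types in the repo carry more variants and metadata.

```rust
/// Simplified stand-ins for the persisted transcript step and the
/// in-memory chat message used on resume.
#[derive(Clone, Debug, PartialEq)]
enum Step {
    Text(String),
    ToolCall { name: String, args: String },
}

#[derive(Debug, PartialEq)]
enum ChatMessage {
    Assistant(String),
    ToolCall { name: String, args: String },
}

/// Rebuild model-facing history from persisted transcript steps so a
/// resumed worker enters its follow-up loop with prior context intact.
fn transcript_to_history(steps: &[Step]) -> Vec<ChatMessage> {
    steps
        .iter()
        .map(|step| match step {
            Step::Text(text) => ChatMessage::Assistant(text.clone()),
            Step::ToolCall { name, args } => ChatMessage::ToolCall {
                name: name.clone(),
                args: args.clone(),
            },
        })
        .collect()
}
```

Note this sketch, like the version under review, maps all text to the assistant role; the review below flags that the real conversion should preserve user vs assistant attribution.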
Actionable comments posted: 7
♻️ Duplicate comments (2)
src/agent/worker.rs (1)
393-396: ⚠️ Potential issue | 🟠 Major

`idle` is still advertised before the snapshot is durable.

These checkpoints still fire-and-forget the full-transcript update and then immediately emit `waiting for input`/`WorkerIdle`. A crash or slow SQLite write in that gap can leave the row marked idle with an older transcript, and concurrent spawned updates can still land out of order.

Also applies to: 484-487
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agent/worker.rs` around lines 393 - 396, The code sets WorkerState::WaitingForInput and immediately calls persist_transcript(...) fire-and-forget, then send_status("waiting for input") / send_worker_idle(), which can advertise idle before the transcript is durably written; change the sequence to await/handle the result of persist_transcript(...) before transitioning state or emitting status/idle, propagate or log errors on persistence failures, and ensure the same fix is applied to the other identical block that calls WorkerState::WaitingForInput, persist_transcript, send_status, and send_worker_idle (the later occurrence of these symbols elsewhere in the file).

src/opencode/worker.rs (1)
907-918: ⚠️ Potential issue | 🟠 Major

`WorkerIdle` is still only best-effort durable.

This `tokio::spawn` is detached from the idle transition, so a crash/shutdown right after `send_idle()` can still leave `worker_runs.transcript` stale. It also leaves whole-row transcript updates free to complete out of order.

You can confirm the remaining ordering risk with:

```shell
#!/bin/bash
rg -n -C3 'persist_transcript_snapshot|UPDATE worker_runs SET transcript|send_idle\(' src
```

Expected result: `persist_transcript_snapshot()` is detached here, while callers emit `send_idle()` immediately after invoking it and other transcript `UPDATE`s exist elsewhere in the flow.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/opencode/worker.rs` around lines 907 - 918, The persist_transcript_snapshot call is being detached via tokio::spawn which makes WorkerIdle transitions only best-effort and allows out-of-order whole-row UPDATEs; change the persist_transcript_snapshot implementation (the tokio::spawn block that runs sqlx::query("UPDATE worker_runs SET transcript = ?, tool_calls = ? WHERE id = ?")...) so it is awaited inline (i.e., remove the detached tokio::spawn and return/await a Result from persist_transcript_snapshot), and ensure callers of persist_transcript_snapshot (places that call send_idle() immediately after) await its completion before calling send_idle(); additionally, to prevent update reordering use a transaction or an optimistic-lock WHERE clause (e.g., version or updated_at) in the worker_runs UPDATE to ensure updates are applied in order.
🧹 Nitpick comments (1)
src/opencode/worker.rs (1)
394-399: Use `.ok()` for these broadcast sends instead of `let _ =`.

These are the one case the repo explicitly allows to be ignored, but it wants the ignore to be spelled as `.ok()` rather than `let _ =`.

As per coding guidelines, "Never silently discard errors; no `let _ =` on Results; handle, log, or propagate errors, with the exception of `.ok()` on channel sends where the receiver may be dropped".

Also applies to: 443-448
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/opencode/worker.rs` around lines 394 - 399, Replace the ignored Result patterns that use "let _ = self.event_tx.send(...)" with the explicit ".ok()" discard on the send calls; specifically update the send of ProcessEvent::WorkerInitialResult (the self.event_tx.send call wrapping WorkerInitialResult with agent_id/self.id/channel_id/scrubbed_result) and the analogous send at the later block (the send that emits the worker result event around lines 443-448) to call .ok() on the Result instead of using "let _ =", e.g., change the expression ending in ".send(...)" to ".send(...).ok()" so the channel-send ignore is explicit per guidelines.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/agent/worker.rs`:
- Around line 273-277: The resumed worker loads the transcript into
compacted_history but calls prompt_once() with history, leaving the model
context empty after restart; to fix, initialize history from prior_history
(e.g., set history = self.prior_history.take().unwrap_or_default() or push a
recap marker into history) and set compacted_history to empty (compacted_history
= Vec::new()) so prompt_once() receives the restored transcript; update the
logic around prior_history, history, compacted_history and prompt_once() to
reflect this transfer.
- Around line 183-199: When resuming a worker you must recreate the SpacebotHook
to reflect the swapped-in existing_id and set the state via the state machine
instead of assigning fields directly: after calling Self::build(...) and before
using send_status()/send_worker_idle(), reconstruct the SpacebotHook instance
(or reinitialize its internal ProcessId) so its publishable ID matches
existing_id, and call transition_to(WorkerState::WaitingForInput) (guarded by
can_transition_to()/matches!) rather than setting worker.state =
WorkerState::WaitingForInput; keep prior_history as-is so run_follow_up_loop()
can pick it up and ensure all outgoing publishes use the rebuilt hook tied to
existing_id.
In `@src/conversation/history.rs`:
- Around line 507-519: The query in get_idle_interactive_workers uses
query_as::<_, IdleWorkerRow> but can fail when tool_calls is NULL; update the
query to coalesce nulls (e.g., COALESCE(tool_calls, 0) AS tool_calls) or change
the IdleWorkerRow definition to make tool_calls an Option<i64> so nullable DB
values map safely; ensure references to IdleWorkerRow,
get_idle_interactive_workers, and the tool_calls column are updated consistently
(or adapt any downstream code that expects a non-optional i64).
In `@src/conversation/worker_transcript.rs`:
- Around line 300-367: transcript_to_history currently maps every
TranscriptStep::Action/ActionContent::Text into Message::Assistant, losing
original speaker role; update the reverse mapping to preserve user vs assistant
text by encoding the role when converting in convert_history and checking it
here: modify transcript_to_history to inspect ActionContent (or a new role-tag
on TranscriptStep/Action) and push Message::User { content:
OneOrMany::one(UserContent::Text(...)) } for user-originated texts and
Message::Assistant { ... } for assistant-originated texts (keep ToolCall
handling as-is), ensuring you reference and update the constructors used here
(transcript_to_history, TranscriptStep::Action, ActionContent::Text,
Message::Assistant, Message::User) so resumed transcripts restore correct
speaker roles.
In `@src/main.rs`:
- Around line 1717-1746: The startup-only outbound router handles
OutboundResponse::Text/RichMessage/ThreadReply by sending
ApiEvent::OutboundMessage but omits emitting ApiEvent::TypingState
(Status::Thinking/Status::StopTyping), causing resumed channels to lose typing
indicators; fix by routing startup responses through the same routing helper
used by the normal channel router (the helper that consumes
spacebot::OutboundResponse and emits both ApiEvent::TypingState and
ApiEvent::OutboundMessage) instead of duplicating the match here so that
api_event_tx and ApiEvent::TypingState (Status::Thinking/Status::StopTyping) are
produced consistently.
- Around line 1595-1601: The calls to
run_logger.fail_idle_worker(...).await.ok() are silently discarding Result
errors; replace the .ok() usage with proper error handling: await the Result
returned by fail_idle_worker (and the other occurrence around the same logic),
then on Err log the error via run_logger.error or propagate the error (return
Err) so SQL update failures are not hidden; locate both occurrences by the
symbol fail_idle_worker and ensure you either match on the Result or use
.await.map_err(|e| { run_logger.error(..., e); e }) to record the failure
instead of swallowing it.
In `@src/opencode/worker.rs`:
- Around line 385-390: After rebuilding event_state in the resuming branch,
persist the recovered transcript/state to durable storage before emitting
WorkerIdle: add a call to the existing checkpoint/save routine (e.g. invoke a
method like self.checkpoint_event_state() or
self.persist_transcript(event_state)) immediately after rebuilding event_state
and before self.send_status(...) / self.send_idle(), so the recovered history is
recorded and the WorkerIdle snapshot reflects the restored transcript.
---
Duplicate comments:
In `@src/agent/worker.rs`:
- Around line 393-396: The code sets WorkerState::WaitingForInput and
immediately calls persist_transcript(...) fire-and-forget, then
send_status("waiting for input") / send_worker_idle(), which can advertise idle
before the transcript is durably written; change the sequence to await/handle
the result of persist_transcript(...) before transitioning state or emitting
status/idle, propagate or log errors on persistence failures, and ensure the
same fix is applied to the other identical block that calls
WorkerState::WaitingForInput, persist_transcript, send_status, and
send_worker_idle (the later occurrence of these symbols elsewhere in the file).
In `@src/opencode/worker.rs`:
- Around line 907-918: The persist_transcript_snapshot call is being detached
via tokio::spawn which makes WorkerIdle transitions only best-effort and allows
out-of-order whole-row UPDATEs; change the persist_transcript_snapshot
implementation (the tokio::spawn block that runs sqlx::query("UPDATE worker_runs
SET transcript = ?, tool_calls = ? WHERE id = ?")...) so it is awaited inline
(i.e., remove the detached tokio::spawn and return/await a Result from
persist_transcript_snapshot), and ensure callers of persist_transcript_snapshot
(places that call send_idle() immediately after) await its completion before
calling send_idle(); additionally, to prevent update reordering use a
transaction or an optimistic-lock WHERE clause (e.g., version or updated_at) in
the worker_runs UPDATE to ensure updates are applied in order.
---
Nitpick comments:
In `@src/opencode/worker.rs`:
- Around line 394-399: Replace the ignored Result patterns that use "let _ =
self.event_tx.send(...)" with the explicit ".ok()" discard on the send calls;
specifically update the send of ProcessEvent::WorkerInitialResult (the
self.event_tx.send call wrapping WorkerInitialResult with
agent_id/self.id/channel_id/scrubbed_result) and the analogous send at the later
block (the send that emits the worker result event around lines 443-448) to call
.ok() on the Result instead of using "let _ =", e.g., change the expression
ending in ".send(...)" to ".send(...).ok()" so the channel-send ignore is
explicit per guidelines.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 28479e10-d093-49dc-8797-4c8801146acb
📒 Files selected for processing (6)
src/agent/channel_dispatch.rs, src/agent/worker.rs, src/conversation/history.rs, src/conversation/worker_transcript.rs, src/main.rs, src/opencode/worker.rs
```rust
let (input_tx, input_rx) = mpsc::channel(32);
let mut worker = Self::build(
    channel_id,
    task,
    system_prompt,
    deps,
    browser_config,
    screenshot_dir,
    brave_search_key,
    logs_dir,
    Some(input_rx),
);
// Reuse the original worker ID so DB row stays linked.
worker.id = existing_id;
worker.state = WorkerState::WaitingForInput;
// Stash the prior history so `run_follow_up_loop()` can pick it up.
worker.prior_history = Some(prior_history);
```
Rebuild SpacebotHook when you swap in existing_id.
Self::build() creates the hook with a fresh random ProcessId::Worker, but this constructor only rewrites worker.id. After resume, send_status() / send_worker_idle() will still publish under the stale random ID while DB/state tracking uses existing_id. Also seed WaitingForInput through transition_to() so the resumed path still uses the state machine guardrails.
🛠️ Suggested fix
```diff
 let mut worker = Self::build(
     channel_id,
     task,
     system_prompt,
     deps,
     browser_config,
     screenshot_dir,
     brave_search_key,
     logs_dir,
     Some(input_rx),
 );
 // Reuse the original worker ID so DB row stays linked.
 worker.id = existing_id;
-worker.state = WorkerState::WaitingForInput;
+worker.hook = SpacebotHook::new(
+    worker.deps.agent_id.clone(),
+    ProcessId::Worker(existing_id),
+    ProcessType::Worker,
+    worker.channel_id.clone(),
+    worker.deps.event_tx.clone(),
+);
+worker
+    .transition_to(WorkerState::WaitingForInput)
+    .expect("freshly built worker should be able to enter waiting state");
 // Stash the prior history so `run_follow_up_loop()` can pick it up.
 worker.prior_history = Some(prior_history);
```

As per coding guidelines "Worker state transitions must be validated with can_transition_to() using matches! to catch illegal transitions as runtime errors".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/agent/worker.rs` around lines 183 - 199, When resuming a worker you must
recreate the SpacebotHook to reflect the swapped-in existing_id and set the
state via the state machine instead of assigning fields directly: after calling
Self::build(...) and before using send_status()/send_worker_idle(), reconstruct
the SpacebotHook instance (or reinitialize its internal ProcessId) so its
publishable ID matches existing_id, and call
transition_to(WorkerState::WaitingForInput) (guarded by
can_transition_to()/matches!) rather than setting worker.state =
WorkerState::WaitingForInput; keep prior_history as-is so run_follow_up_loop()
can pick it up and ensure all outgoing publishes use the rebuilt hook tied to
existing_id.
…limit

OpenCode workers can no longer be spawned as non-interactive (fire-and-forget). The spawn_worker tool forces interactive=true for opencode, and spawn_opencode_worker_from_state rejects interactive=false as defense-in-depth.

Adds directory claim/release tracking to OpenCodeServerPool so only one opencode worker can be active per directory at a time. The directory is claimed before the worker starts and released after it completes (success or failure).
Actionable comments posted: 3
🧹 Nitpick comments (1)
src/agent/channel_dispatch.rs (1)
806-829: Consider extracting transcript persistence into a helper.

The transcript persistence logic (lines 806-829) is nearly identical to the spawn path (lines 570-590). Both serialize the transcript, spawn a fire-and-forget task, and execute the same SQL UPDATE.

Extracting this into a shared helper would reduce duplication and ensure consistent error handling.

♻️ Example helper function

```rust
fn persist_transcript_async(
    sqlite_pool: SqlitePool,
    worker_id: WorkerId,
    transcript: &[TranscriptStep],
    tool_calls: i64,
) {
    if transcript.is_empty() {
        return;
    }
    let blob = crate::conversation::worker_transcript::serialize_steps(transcript);
    let wid = worker_id.to_string();
    tokio::spawn(async move {
        if let Err(error) = sqlx::query(
            "UPDATE worker_runs SET transcript = ?, tool_calls = ? WHERE id = ?",
        )
        .bind(&blob)
        .bind(tool_calls)
        .bind(&wid)
        .execute(&sqlite_pool)
        .await
        {
            tracing::warn!(%error, worker_id = wid, "failed to persist transcript");
        }
    });
}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agent/channel_dispatch.rs` around lines 806 - 829, Extract the duplicated transcript persistence into a shared helper (e.g., persist_transcript_async) and replace both inline blocks (the one after worker.run().await and the earlier spawn path) with calls to it; the helper should accept SqlitePool, WorkerId, a &[TranscriptStep] (or slice), and tool_calls (i64), return early if transcript.is_empty(), call crate::conversation::worker_transcript::serialize_steps to produce the blob, then tokio::spawn an async task that runs the same sqlx::query("UPDATE worker_runs SET transcript = ?, tool_calls = ? WHERE id = ?").bind(&blob).bind(tool_calls).bind(&wid).execute(&sqlite_pool).await and logs failures with tracing::warn!(%error, worker_id = wid, "failed to persist transcript"). Ensure you replace the two duplicated code sites with calls to persist_transcript_async to keep behavior identical.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/agent/channel_dispatch.rs`:
- Around line 761-772: resume_idle_worker_into_state is missing the directory
claim/release that spawn_opencode_worker_from_state uses, creating a race when
resuming OpenCode workers; before calling OpenCodeWorker::resume_interactive in
resume_idle_worker_into_state, call server_pool.claim_directory(&directory) and
ensure you release it (server_pool.release_directory(&directory)) in the resumed
worker's completion/error handler or finally block to mirror
spawn_opencode_worker_from_state behavior and avoid leaving directories claimed
on early exits.
- Line 758: The resumed OpenCode workers lose their original directory because
the directory isn't persisted; add a directory column to the worker_runs table
and a directory: String (or Option<String> if nullable) field to the
IdleWorkerRow struct, persist the directory when creating records in
spawn_opencode_worker_from_state, and change resume_idle_worker_into_state to
use idle_row.directory (falling back to rc.workspace_dir only if directory is
None) instead of unconditionally using rc.workspace_dir so resumed workers
return to their original spawn path.
In `@src/opencode/server.rs`:
- Around line 670-680: claim_directory currently inserts the canonicalized path
but release_directory falls back to the raw path on canonicalize failure, which
can leave orphaned claims; change claim_directory to return the canonical
PathBuf it inserted (e.g., make claim_directory -> Result<PathBuf, _>) and
update all callers to hold and later pass that canonical PathBuf into
release_directory (or change release_directory signature to accept the canonical
PathBuf directly), and ensure release_directory uses that canonical path for
removal; if you keep a fallback canonicalize in release_directory, log an
error-level message when the fallback raw path does not match any existing claim
to make the mismatch observable.
---
Nitpick comments:
In `@src/agent/channel_dispatch.rs`:
- Around line 806-829: Extract the duplicated transcript persistence into a
shared helper (e.g., persist_transcript_async) and replace both inline blocks
(the one after worker.run().await and the earlier spawn path) with calls to it;
the helper should accept SqlitePool, WorkerId, a &[TranscriptStep] (or slice),
and tool_calls (i64), return early if transcript.is_empty(), call
crate::conversation::worker_transcript::serialize_steps to produce the blob,
then tokio::spawn an async task that runs the same sqlx::query("UPDATE
worker_runs SET transcript = ?, tool_calls = ? WHERE id =
?").bind(&blob).bind(tool_calls).bind(&wid).execute(&sqlite_pool).await and logs
failures with tracing::warn!(%error, worker_id = wid, "failed to persist
transcript"). Ensure you replace the two duplicated code sites with calls to
persist_transcript_async to keep behavior identical.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9382d81a-1456-4cc6-9f48-a571fe60e562
📒 Files selected for processing (3)
src/agent/channel_dispatch.rs, src/opencode/server.rs, src/tools/spawn_worker.rs
```rust
let result = crate::opencode::OpenCodeWorker::resume_interactive(
    worker_id,
    Some(state.channel_id.clone()),
    state.deps.agent_id.clone(),
    &idle_worker.task,
    directory,
    server_pool,
    state.deps.event_tx.clone(),
    session_id.to_string(),
    idle_worker.transcript.clone(),
)
.await;
```
Missing claim_directory call for resumed OpenCode workers.
spawn_opencode_worker_from_state calls server_pool.claim_directory(&directory) to prevent concurrent workers on the same directory, but resume_idle_worker_into_state does not. This creates a race condition at startup:

1. An idle OpenCode worker for /path/to/project is resumed
2. Before it completes, a channel spawns a new OpenCode worker for the same directory
3. The new spawn succeeds because no claim exists for the resumed worker

Additionally, there's no corresponding release_directory call in the resume path's completion handler.
🐛 Proposed fix: Add directory claim/release to resume path
```diff
 let directory = rc.workspace_dir.clone();
 let server_pool = rc.opencode_server_pool.load().clone();
+// Claim directory to prevent concurrent workers (same as spawn path).
+server_pool
+    .claim_directory(&directory)
+    .await
+    .map_err(|e| e.to_string())?;
+
+let release_pool = server_pool.clone();
+let release_directory = directory.clone();
+
 let result = crate::opencode::OpenCodeWorker::resume_interactive(
```
let result = crate::opencode::OpenCodeWorker::resume_interactive(And in the async worker task (around line 807):
```diff
 async move {
-    let result = worker.run().await.map_err(SpacebotError::from)?;
+    let result = worker.run().await.map_err(SpacebotError::from);
+
+    // Release directory claim regardless of success/failure.
+    release_pool.release_directory(&release_directory).await;
+
+    let result = result?;
     // Persist final transcript.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/agent/channel_dispatch.rs` around lines 761 - 772,
resume_idle_worker_into_state is missing the directory claim/release that
spawn_opencode_worker_from_state uses, creating a race when resuming OpenCode
workers; before calling OpenCodeWorker::resume_interactive in
resume_idle_worker_into_state, call server_pool.claim_directory(&directory) and
ensure you release it (server_pool.release_directory(&directory)) in the resumed
worker's completion/error handler or finally block to mirror
spawn_opencode_worker_from_state behavior and avoid leaving directories claimed
on early exits.
```rust
let canonical = match directory.canonicalize() {
    Ok(path) => path,
    Err(error) => {
        tracing::warn!(
            directory = %directory.display(),
            %error,
            "could not canonicalize directory during release, using raw path"
        );
        directory.to_path_buf()
    }
};
```
Potential orphaned directory claim if canonicalization fails on release.
When claim_directory succeeds, it inserts the canonical path. If release_directory fails to canonicalize (e.g., directory was deleted during worker run), the fallback uses the raw path which won't match the canonical path in the HashSet, leaving a stale claim.
Consider storing the canonical path at claim time and passing it to release, or logging an error-level message when the fallback path doesn't match any claim.
🛡️ Proposed approach: return the canonical path from claim_directory
```diff
-pub async fn claim_directory(&self, directory: &Path) -> anyhow::Result<()> {
+pub async fn claim_directory(&self, directory: &Path) -> anyhow::Result<PathBuf> {
     let canonical = directory
         .canonicalize()
         .with_context(|| format!("directory '{}' does not exist", directory.display()))?;
     let mut active = self.active_directories.lock().await;
     if !active.insert(canonical.clone()) {
         bail!(
             "An OpenCode worker is already active in directory '{}'",
             canonical.display()
         );
     }
     tracing::debug!(
         directory = %canonical.display(),
         "claimed directory for OpenCode worker"
     );
-    Ok(())
+    Ok(canonical)
 }
```

Then pass the returned canonical path to `release_directory` instead of the original path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/opencode/server.rs` around lines 670 - 680, claim_directory currently
inserts the canonicalized path but release_directory falls back to the raw path
on canonicalize failure, which can leave orphaned claims; change claim_directory
to return the canonical PathBuf it inserted (e.g., make claim_directory ->
Result<PathBuf, _>) and update all callers to hold and later pass that canonical
PathBuf into release_directory (or change release_directory signature to accept
the canonical PathBuf directly), and ensure release_directory uses that
canonical path for removal; if you keep a fallback canonicalize in
release_directory, log an error-level message when the fallback raw path does
not match any existing claim to make the mismatch observable.
Six fixes from review bot findings: 1. Rebuild SpacebotHook with the correct existing worker ID in Worker::resume_interactive() — Self::build() creates a hook with a fresh random ProcessId, but the resumed worker needs events published under the original ID that matches the DB row and state tracking. 2. Load restored transcript into `history` (not `compacted_history`) so the LLM actually sees prior conversation context on the first post-restart follow-up call. 3. COALESCE(tool_calls, 0) in get_idle_interactive_workers() query to prevent sqlx::FromRow failure on rows where tool_calls is NULL. 4. Replace .ok() with proper error logging on fail_idle_worker() calls in main.rs startup — these are SQL errors, not dropped channel sends. 5. Persist recovered transcript before emitting idle in the OpenCode resume path, so a second crash doesn't lose the rebuilt state. 6. Make persist_transcript() and persist_transcript_snapshot() async and await them directly instead of fire-and-forget tokio::spawn. Ensures "idle implies persisted" and prevents out-of-order writes where an older snapshot could overwrite a newer one.
Actionable comments posted: 2
♻️ Duplicate comments (2)
src/agent/worker.rs (1)
195-209: 🛠️ Refactor suggestion | 🟠 Major

Use `transition_to()` for these new waiting-state transitions. Both assignments bypass the state machine guardrails this file is supposed to enforce. Please route them through `transition_to(WorkerState::WaitingForInput)` instead.

Suggested fix

```diff
-        worker.state = WorkerState::WaitingForInput;
+        worker
+            .transition_to(WorkerState::WaitingForInput)
+            .expect("freshly built worker should be able to enter waiting state");
         // Stash the prior history so `run_follow_up_loop()` can pick it up.
         worker.prior_history = Some(prior_history);
```

```diff
-        self.state = WorkerState::WaitingForInput;
+        self.transition_to(WorkerState::WaitingForInput)?;
         self.persist_transcript(&compacted_history, &history).await;
```

As per coding guidelines: "Worker state transitions must be validated with `can_transition_to()` using `matches!` to catch illegal transitions as runtime errors."

Also applies to: 404-405
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agent/worker.rs` around lines 195 - 209: this block directly assigns `worker.state = WorkerState::WaitingForInput` (and similarly at lines ~404-405), which bypasses the state-machine checks. Change the code to call `worker.transition_to(WorkerState::WaitingForInput)` instead, so the transition is validated by `can_transition_to()` (the `matches!` check) and errors on illegal transitions. Keep the hook rebuild (`SpacebotHook::new(...)`) and the `prior_history` assignment, but do not set `worker.state` directly; invoke `transition_to` on the same `Worker` instance after rebuilding the hook and before returning, so the guarded transition logic is used.

src/main.rs (1)

1729-1758: ⚠️ Potential issue | 🟡 Minor

Reuse the normal outbound router here too. This startup-only path forwards outbound text, but it never emits `ApiEvent::TypingState`, so resumed channels lose typing indicators after a restart even though the regular channel path keeps them. Reusing the same routing helper would keep the two flows in sync.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/main.rs` around lines 1729 - 1758: the startup-only match on `spacebot::OutboundResponse` manually sends `ApiEvent::OutboundMessage` and omits emitting `spacebot::api::ApiEvent::TypingState`, causing resumed channels to lose typing indicators. Instead, call the existing outbound routing helper used by the regular channel path (the same function that takes a response and `api_event_tx` and routes `TypingState` + `OutboundMessage`) with the current response, `api_event_tx`, `sse_agent_id`, and `sse_channel_id`, so both flows stay in sync and `TypingState` is emitted consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/agent/worker.rs`:
- Around line 290-298: The resumed-worker branch currently logs and publishes an
initial "running" status before actually receiving input, causing stale
'running' rows if the process crashes; update the logic in run() and/or
log_worker_status() so that when resuming (detect via the resuming boolean used
in the branch and the self.hook calls like send_status/send_worker_idle), you
skip emitting the generic "running" status until the worker actually begins
processing a follow-up input; specifically, add a conditional around the
call/site that sets worker_runs.status='running' (the place where run() calls
log_worker_status("running") or similar) to only execute when resuming==false or
after the first task starts, leaving the existing send_status("resumed — waiting
for input") and send_worker_idle() unchanged.
In `@src/opencode/worker.rs`:
- Around line 297-303: The current resume logic sets
EventState.has_received_event and has_assistant_message to true, which causes
process_events() to consider a stale session.idle as completion and may skip the
first post-restart assistant output; undo that by leaving
EventState.has_received_event and EventState.has_assistant_message false when
constructing the resumed EventState (keep event_state.accumulated_parts and
event_state.tool_calls from resume but do not set the two flags), so the
early-idle guard remains active for the first resumed follow-up and
process_events() will wait for fresh assistant output.
---
Duplicate comments:
In `@src/agent/worker.rs`:
- Around line 195-209: This block directly assigns worker.state =
WorkerState::WaitingForInput (and similar at lines ~404-405) which bypasses the
state-machine checks; change the code to call the
worker.transition_to(WorkerState::WaitingForInput) API instead so the transition
is validated by can_transition_to() (matches! check) and will error on illegal
transitions—keep the hook rebuild (SpacebotHook::new(...)) and prior_history
assignment, but do not set worker.state directly; invoke transition_to on the
same Worker instance after rebuilding the hook and before returning so the
guarded transition logic is used.
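The guarded-transition pattern the guideline calls for can be sketched as follows — the states, allowed edges, and error type here are illustrative, not the crate's real `WorkerState`:

```rust
// Minimal sketch of a state machine whose transitions are validated by a
// can_transition_to() check built on matches!, surfacing illegal moves as
// runtime errors instead of silently assigning the field.
#[derive(Debug, Clone, Copy, PartialEq)]
enum WorkerState {
    Running,
    WaitingForInput,
    Done,
}

struct Worker {
    state: WorkerState,
}

impl Worker {
    fn can_transition_to(&self, next: WorkerState) -> bool {
        use WorkerState::*;
        matches!(
            (self.state, next),
            (Running, WaitingForInput) | (WaitingForInput, Running) | (Running, Done)
        )
    }

    fn transition_to(&mut self, next: WorkerState) -> Result<(), String> {
        if !self.can_transition_to(next) {
            return Err(format!("illegal transition {:?} -> {:?}", self.state, next));
        }
        self.state = next;
        Ok(())
    }
}

fn main() {
    let mut worker = Worker { state: WorkerState::Running };
    assert!(worker.transition_to(WorkerState::WaitingForInput).is_ok());
    // Done is not reachable from WaitingForInput in this sketch, so the
    // guard rejects it instead of corrupting the state machine.
    assert!(worker.transition_to(WorkerState::Done).is_err());
    assert_eq!(worker.state, WorkerState::WaitingForInput);
}
```

Direct field assignment skips the `matches!` guard entirely, which is exactly what the review comment flags.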
In `@src/main.rs`:
- Around line 1729-1758: The startup-only match on spacebot::OutboundResponse
manually sends ApiEvent::OutboundMessage and omits emitting
spacebot::api::ApiEvent::TypingState, causing resumed channels to lose typing
indicators; instead call the existing outbound routing helper used by the
regular channel path (the same function that takes a response and api_event_tx
and routes TypingState + OutboundMessage) with the current response,
api_event_tx, sse_agent_id and sse_channel_id so both flows stay in sync and
TypingState is emitted consistently.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c25061c7-3ebb-4ba0-80c3-4d37334f8314
📒 Files selected for processing (4)
src/agent/worker.rs
src/conversation/history.rs
src/main.rs
src/opencode/worker.rs
```rust
if resuming {
    tracing::info!(
        worker_id = %self.id,
        prior_messages = history.len(),
        "resuming interactive worker with prior history"
    );
    self.hook.send_status("resumed — waiting for input");
    self.hook.send_worker_idle();
}
```
Don't mark a resumed idle worker as running before it actually gets input.
This branch goes straight back to idle, but run() still emits the generic "running" status first. Since log_worker_status() persists that as worker_runs.status='running', a second crash before send_worker_idle() leaves the row looking orphaned and startup reconciliation will fail it on the next boot. Only publish "running" once the resumed worker starts processing a follow-up.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/agent/worker.rs` around lines 290 - 298, The resumed-worker branch
currently logs and publishes an initial "running" status before actually
receiving input, causing stale 'running' rows if the process crashes; update the
logic in run() and/or log_worker_status() so that when resuming (detect via the
resuming boolean used in the branch and the self.hook calls like
send_status/send_worker_idle), you skip emitting the generic "running" status
until the worker actually begins processing a follow-up input; specifically, add
a conditional around the call/site that sets worker_runs.status='running' (the
place where run() calls log_worker_status("running") or similar) to only execute
when resuming==false or after the first task starts, leaving the existing
send_status("resumed — waiting for input") and send_worker_idle() unchanged.
```rust
let mut event_state = EventState::new();
event_state.accumulated_parts = resume.accumulated_parts;
event_state.tool_calls = resume.tool_calls;
event_state.has_received_event = true;
event_state.has_assistant_message = true;

// ...

// Process SSE events until session goes idle or errors.
// EventState tracks status and last_text for the initial result delivery.
// The full transcript is fetched from the OpenCode API on completion.
let mut event_state = EventState::new();
self.process_events(event_response, &session_id, &server, &mut event_state)
    .await?;
(server, resume.session_id, event_state, String::new())
```
Keep the early-idle guard active for the first resumed follow-up.
Setting has_received_event and has_assistant_message here makes the next process_events() treat a stale session.idle as completion. On the first post-restart follow-up, that can return before any new assistant output is processed.
Suggested fix
```diff
 let mut event_state = EventState::new();
 event_state.accumulated_parts = resume.accumulated_parts;
 event_state.tool_calls = resume.tool_calls;
-event_state.has_received_event = true;
-event_state.has_assistant_message = true;
 (server, resume.session_id, event_state, String::new())
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/opencode/worker.rs` around lines 297 - 303, The current resume logic sets
EventState.has_received_event and has_assistant_message to true, which causes
process_events() to consider a stale session.idle as completion and may skip the
first post-restart assistant output; undo that by leaving
EventState.has_received_event and EventState.has_assistant_message false when
constructing the resumed EventState (keep event_state.accumulated_parts and
event_state.tool_calls from resume but do not set the two flags), so the
early-idle guard remains active for the first resumed follow-up and
process_events() will wait for fresh assistant output.
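The early-idle guard under discussion reduces to a small sketch — the `EventState` shape and SSE event names below are simplified stand-ins for the OpenCode event stream:

```rust
// Sketch of the early-idle guard: a stale `session.idle` arriving before any
// fresh event must not complete the resumed follow-up.
#[derive(Default)]
struct EventState {
    has_received_event: bool,
    has_assistant_message: bool,
}

enum SseEvent {
    AssistantMessage,
    SessionIdle,
}

// Returns true when the session should be treated as complete.
fn process_event(state: &mut EventState, event: SseEvent) -> bool {
    match event {
        SseEvent::AssistantMessage => {
            state.has_received_event = true;
            state.has_assistant_message = true;
            false
        }
        SseEvent::SessionIdle => {
            // Ignore idle until we have seen fresh assistant output; this is
            // why a resumed EventState must leave both flags false.
            state.has_received_event && state.has_assistant_message
        }
    }
}

fn main() {
    let mut state = EventState::default();
    // Stale idle right after resume: ignored.
    assert!(!process_event(&mut state, SseEvent::SessionIdle));
    // Fresh assistant output, then idle: now treated as completion.
    assert!(!process_event(&mut state, SseEvent::AssistantMessage));
    assert!(process_event(&mut state, SseEvent::SessionIdle));
}
```

Pre-setting the flags on resume short-circuits this guard, which is the bug the suggested diff removes.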
Previously, when an idle worker couldn't be reconnected on startup (e.g. OpenCode server died with kill_on_drop, session expired), the code called fail_idle_worker() which transitioned the row to 'failed'. Now the worker stays as 'idle' in the DB with its transcript preserved. The resume attempt is best-effort — if it fails, the worker is effectively dead but its history remains available for inspection in the UI. Only dangling 'running' workers (from unclean shutdown) get marked as failed via reconcile_running_workers_for_agent().
Actionable comments posted: 2
♻️ Duplicate comments (1)
src/main.rs (1)
1709-1738: ⚠️ Potential issue | 🟡 Minor

Mirror typing events in the startup outbound router. This startup-only match still never maps `Status(Thinking)`/`Status(StopTyping)` to `ApiEvent::TypingState`, so resumed channels lose web typing indicators after restart. Reuse the same outbound-routing helper as the normal path to keep the two behaviors in sync.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/main.rs` around lines 1709 - 1738: the startup-only match on `response: spacebot::OutboundResponse` duplicates outbound routing but omits typing status mapping. Replace this custom match with a call to the same outbound-routing helper used by the normal path (the function that converts `spacebot::OutboundResponse` into `spacebot::api::ApiEvent` and sends via `api_event_tx`), ensuring that `Status::Thinking` and `Status::StopTyping` are mapped to `spacebot::api::ApiEvent::TypingState` (using `sse_agent_id` and `sse_channel_id`) when emitting through `api_event_tx`, so both flows stay in sync.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/main.rs`:
- Around line 1761-1768: Pre-created resumed channels currently bypass the
post-restart gap backfill because inserting an ActiveChannel causes the
downstream contains_key check to treat the first inbound as already-active; add
a one-time "gap backfilled" flag to ActiveChannel (e.g., gap_backfilled: bool)
initialized false when you insert the pre-created channel, then in the inbound
handling path (the branch that currently checks
active_channels.contains_key(...) and before forwarding the first real inbound)
detect gap_backfilled == false, call fetch_history(conversation_id) to reconcile
missed messages, merge/update latest_message as needed, set gap_backfilled =
true, and only then forward the inbound; update any code that clones or
constructs ActiveChannel (_outbound_handle, latest_message) to include the new
flag and ensure the check-and-backfill is performed atomically (or via the same
mutex/entry) so multiple simultaneous inbounds don't double-run fetch_history.
- Around line 1557-1560: Extract the idle-worker recovery block (the code that
pre-creates channels and spawns resumed workers when agents_initialized is true)
into a named helper function (e.g., resume_idle_workers_after_init or
resume_persisted_idle_workers) and replace the inline if agents_initialized {
... } with a call to that helper; then call the same helper from both places
where initialize_agents(...) completes successfully (the original startup path
and the provider-setup branch that runs later) so stranded idle workers are
retried; finally add a regression test that simulates daemon start in setup
mode, performs the provider-setup path to reach provider-ready, and asserts
persisted idle workers are resumed (use the same helper path or integration test
hooks).
---
Duplicate comments:
In `@src/main.rs`:
- Around line 1709-1738: The startup-only match on `response:
spacebot::OutboundResponse` duplicates outbound routing but omits typing status
mapping; replace this custom match with a call to the same outbound-routing
helper used by the normal path (the function that converts
`spacebot::OutboundResponse` into `spacebot::api::ApiEvent` and sends via
`api_event_tx`) so behaviors stay in sync, and ensure that `Status::Thinking`
and `Status::StopTyping` are mapped to `spacebot::api::ApiEvent::TypingState`
(using `sse_agent_id` and `sse_channel_id`) when emitting through
`api_event_tx`.
…el after restart Two bugs preventing idle worker visibility after restart: 1. opencode_session_id was never persisted to the DB. The OpenCodeSessionCreated event is filtered out by event_is_for_channel() (returns false for that variant), so the channel handler's log_opencode_metadata() call was unreachable dead code. Fix: persist session metadata directly from the OpenCode worker's run() method via sqlite_pool, bypassing the event system entirely. 2. When resume fails (e.g. session expired because server was killed on shutdown), the channel's status block had no knowledge of the idle worker. Fix: on resume failure, still register the worker into the status block as 'idle (session expired)' so it appears in the channel's UI and the LLM knows about it.
Fixes zombie cycle where OpenCode workers with expired sessions were left as 'idle' in the DB on every restart, reappearing in the channel status block as 'idle (session expired)' and showing as 'running' in the workers tab with no way to dismiss them. - Retire irrecoverable idle workers as 'done' (not 'failed') on startup since the worker completed its work — only the follow-up session expired - Pre-filter OpenCode workers without session metadata before creating the channel (these can never resume due to kill_on_drop) - Stop registering expired workers in the channel status block - Add channel_id to OpenCodeSessionCreated event so it routes through event_is_for_channel() properly, removing the persist_opencode_metadata workaround (log_opencode_metadata in channel.rs is no longer dead code) - Add TranscriptStep::UserText variant to preserve user vs assistant text distinction across transcript serialization round-trips - Emit TypingState SSE events from pre-created channel outbound router
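The role-preserving round-trip that `TranscriptStep::UserText` enables can be sketched like this — the enum variants and the `transcript_to_history` signature here are simplified; the real conversion targets `rig::message::Message`:

```rust
// Sketch of the user/assistant distinction a dedicated UserText variant
// preserves: without it, every text step would deserialize back as
// assistant output.
#[derive(Debug, Clone, PartialEq)]
enum TranscriptStep {
    UserText(String),
    AssistantText(String),
}

#[derive(Debug, PartialEq)]
enum Role {
    User,
    Assistant,
}

// Rebuilds (role, text) history from a persisted transcript, keeping the
// original speaker for each step across the serialization round-trip.
fn transcript_to_history(steps: &[TranscriptStep]) -> Vec<(Role, String)> {
    steps
        .iter()
        .map(|step| match step {
            TranscriptStep::UserText(text) => (Role::User, text.clone()),
            TranscriptStep::AssistantText(text) => (Role::Assistant, text.clone()),
        })
        .collect()
}

fn main() {
    let steps = vec![
        TranscriptStep::UserText("fix the bug".into()),
        TranscriptStep::AssistantText("done, see the diff".into()),
    ];
    let history = transcript_to_history(&steps);
    assert_eq!(history[0].0, Role::User);
    assert_eq!(history[1].0, Role::Assistant);
}
```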
Resumed OpenCode workers were losing their original directory because it was never persisted to the DB — resume fell back to workspace_dir, spawning a new server in the wrong directory and ignoring the one-per-directory limit. - Add 'directory' column to worker_runs (new migration) - Persist directory from spawn_opencode_worker_from_state via UPDATE - Read directory from IdleWorkerRow in resume_idle_worker_into_state - Fall back to workspace_dir only when directory is NULL (pre-existing workers without the column)
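The NULL-column fallback described in this commit boils down to a one-branch choice — the function name and signature below are hypothetical:

```rust
// Sketch of the resume-time directory choice: prefer the persisted
// worker_runs.directory column and fall back to the channel's workspace_dir
// only when the column is NULL (rows created before the migration).
fn resume_directory(persisted: Option<&str>, workspace_dir: &str) -> String {
    match persisted {
        Some(dir) if !dir.is_empty() => dir.to_string(),
        _ => workspace_dir.to_string(),
    }
}

fn main() {
    // Worker spawned after the migration: its original directory wins.
    assert_eq!(
        resume_directory(Some("/repos/project-a"), "/workspace"),
        "/repos/project-a"
    );
    // Pre-migration row with a NULL directory: fall back to workspace_dir.
    assert_eq!(resume_directory(None, "/workspace"), "/workspace");
}
```

Resuming into the persisted directory also keeps the one-server-per-directory claim meaningful across restarts.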
Summary
Changes
Commit 1: Incremental transcript persistence on idle

- `worker.rs` / `opencode/worker.rs`: persist transcript snapshot on each idle transition

Commit 2: Reconnect idle interactive workers on restart

- `conversation/history.rs`: Split reconciliation — only `running` workers marked failed; idle workers left for reconnection. New `get_idle_interactive_workers()`, `fail_idle_worker()`, `IdleWorkerRow`
- `conversation/worker_transcript.rs`: New `transcript_to_history()` — converts `Vec<TranscriptStep>` back to `Vec<rig::message::Message>` for builtin worker resume
- `agent/worker.rs`: `resume_interactive()` constructor + `prior_history` field — creates worker with pre-existing ID and prior history, skips initial task, enters follow-up loop
- `opencode/worker.rs`: `ResumeSession` struct + `resume_interactive()` — reconnects to server via `get_or_create()`, verifies session via `get_messages()`, repopulates `accumulated_parts`
- `agent/channel_dispatch.rs`: `resume_idle_worker_into_state()` — orchestrates worker resumption into a `ChannelState` for both worker types
- `main.rs`: Startup code queries idle workers, groups by channel, pre-creates channels with outbound routing, resumes workers; failures marked via `fail_idle_worker()`

Testing

- `cargo clippy --all-targets` clean (no warnings)
- `just gate-pr` passes all gates

Race/terminal-state reasoning

- `reconcile_running_workers_for_agent()` runs first (marks `running` as failed), then `get_idle_interactive_workers()` loads remaining idle rows — no race between the two
- When `resume_interactive()` returns `None` (server dead, session gone), `fail_idle_worker()` marks the DB row as `failed` — no orphaned idle rows
- Resumed workers re-emit `WorkerStarted` + `WorkerIdle` but skip re-emitting `WorkerInitialResult` (already relayed before restart)
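The ordering argument above can be sketched as a two-phase pass over an in-memory stand-in for `worker_runs` rows — the row shape and function name are illustrative:

```rust
// Sketch of the startup ordering: first mark dangling `running` rows as
// failed (unclean shutdown), then collect the remaining `idle` rows as
// resume candidates. Running both phases in this order means no row can
// be both failed and queued for resume.
#[derive(Debug, Clone, PartialEq)]
struct WorkerRow {
    id: u32,
    status: String,
}

fn reconcile_and_collect_idle(rows: &mut Vec<WorkerRow>) -> Vec<u32> {
    // Phase 1: anything still `running` at boot cannot be reattached.
    for row in rows.iter_mut() {
        if row.status == "running" {
            row.status = "failed".to_string();
        }
    }
    // Phase 2: idle rows are candidates for resume.
    rows.iter()
        .filter(|row| row.status == "idle")
        .map(|row| row.id)
        .collect()
}

fn main() {
    let mut rows = vec![
        WorkerRow { id: 1, status: "running".into() },
        WorkerRow { id: 2, status: "idle".into() },
        WorkerRow { id: 3, status: "done".into() },
    ];
    let idle = reconcile_and_collect_idle(&mut rows);
    assert_eq!(idle, vec![2]);
    assert_eq!(rows[0].status, "failed");
}
```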