
fix: persist transcripts on idle + reconnect idle workers on restart#334

Merged
jamiepine merged 8 commits into main from fix/opencode-worker-resilience on Mar 6, 2026

Conversation


@jamiepine jamiepine commented Mar 6, 2026

Summary

  • Incremental transcript persistence: Both OpenCode and builtin interactive workers now snapshot their transcript to DB each time they transition to idle, ensuring no work is lost if the process exits while a worker is waiting for input.
  • Idle worker reconnection on restart: At startup, idle interactive workers are resumed instead of being marked as failed. OpenCode workers reconnect to their surviving server+session; builtin workers restore conversation history from the transcript blob. Channels are pre-created with full outbound routing so follow-up messages route correctly.

Changes

Commit 1: Incremental transcript persistence on idle

  • worker.rs / opencode/worker.rs: persist transcript snapshot on each idle transition
  • Covers both builtin (Rig history → TranscriptStep) and OpenCode (accumulated_parts → TranscriptStep) workers

Commit 2: Reconnect idle interactive workers on restart

  • conversation/history.rs: Split reconciliation — only running workers marked failed; idle workers left for reconnection. New get_idle_interactive_workers(), fail_idle_worker(), IdleWorkerRow
  • conversation/worker_transcript.rs: New transcript_to_history() — converts Vec<TranscriptStep> back to Vec<rig::message::Message> for builtin worker resume
  • agent/worker.rs: resume_interactive() constructor + prior_history field — creates worker with pre-existing ID and prior history, skips initial task, enters follow-up loop
  • opencode/worker.rs: ResumeSession struct + resume_interactive() — reconnects to server via get_or_create(), verifies session via get_messages(), repopulates accumulated_parts
  • agent/channel_dispatch.rs: resume_idle_worker_into_state() — orchestrates worker resumption into a ChannelState for both worker types
  • main.rs: Startup code queries idle workers, groups by channel, pre-creates channels with outbound routing, resumes workers; failures marked via fail_idle_worker()
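The startup flow in main.rs can be pictured as a group-then-resume pass over the idle rows. A minimal std-only sketch, with a simplified stand-in for IdleWorkerRow (field and function names here are illustrative, not the actual schema):

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-in for the idle-worker row loaded from the DB.
#[derive(Debug, Clone)]
struct IdleWorkerRow {
    worker_id: String,
    channel_id: String,
}

// Group idle rows by channel so each channel is pre-created exactly once,
// with all of its resumed workers registered before the event loop starts.
fn group_by_channel(rows: Vec<IdleWorkerRow>) -> HashMap<String, Vec<IdleWorkerRow>> {
    let mut by_channel: HashMap<String, Vec<IdleWorkerRow>> = HashMap::new();
    for row in rows {
        by_channel.entry(row.channel_id.clone()).or_default().push(row);
    }
    by_channel
}
```

Grouping first is what lets the channel be created with its outbound routing in place before any individual worker resume runs (or fails).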

Testing

  • All 428 unit tests pass
  • cargo clippy --all-targets clean (no warnings)
  • Integration tests compile
  • just gate-pr passes all gates

Race/terminal-state reasoning

  • Reconciliation ordering: reconcile_running_workers_for_agent() runs first (marks running as failed), then get_idle_interactive_workers() loads remaining idle rows — no race between the two
  • Channel pre-creation: Channels are created before the main event loop starts, so any inbound message arriving during startup finds the channel already active with resumed workers registered
  • Failed resume path: If resume_interactive() returns None (server dead, session gone), fail_idle_worker() marks the DB row as failed — no orphaned idle rows
  • Event emission: Resumed workers emit WorkerStarted + WorkerIdle but skip re-emitting WorkerInitialResult (already relayed before restart)

Both OpenCode and builtin interactive workers now write a transcript
snapshot to the DB each time they go idle (initial task + every
follow-up). If spacebot restarts while a worker is waiting for
input, the transcript survives rather than being lost.

OpenCode workers get an optional sqlite_pool field; builtin workers
already have it via AgentDeps. The snapshot overwrites the blob each
time with the complete accumulated history.
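The overwrite semantics can be sketched with an in-memory stand-in for the worker_runs table (the real code issues a SQLite UPDATE via sqlx; the names below are illustrative):

```rust
use std::collections::HashMap;

// In-memory stand-in for the `worker_runs` transcript column, keyed by worker ID.
struct TranscriptStore {
    blobs: HashMap<String, Vec<u8>>,
}

impl TranscriptStore {
    fn new() -> Self {
        Self { blobs: HashMap::new() }
    }

    // Each snapshot carries the complete accumulated history, so a plain
    // overwrite is sufficient: the newest write always contains everything
    // the previous writes contained, plus the latest steps.
    fn persist_snapshot(&mut self, worker_id: &str, full_transcript: Vec<u8>) {
        self.blobs.insert(worker_id.to_string(), full_transcript);
    }
}
```

The full-overwrite design is what makes the snapshot idempotent; the ordering caveats the reviewers raise below apply when multiple overwrites are in flight at once.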

coderabbitai bot commented Mar 6, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Resume + persistence for idle interactive workers: adds IdleWorker DB row and resume APIs, startup logic to pre-create channels and resume idle workers, OpenCode server pool directory-claiming, worker resume/spawn augmentation with secrets + SQLite, resumed-run flows, and transcript-to-history conversion (duplicate definition present).

Changes

  • Channel dispatch & startup (src/main.rs, src/agent/channel_dispatch.rs): Startup pre-creates channels and resumes idle interactive workers via the new resume_idle_worker_into_state, which handles builtin and OpenCode resume paths, attaches the secrets store + sqlite pool, registers inputs, spawns follow-up loops, emits WorkerStarted, and persists transcripts.
  • Worker runtime (src/agent/worker.rs): Added prior_history: Option<Vec<rig::message::Message>> and resume_interactive(...); run() branches for resumed workers to initialize history, skip the initial prompt/task, synthesize the initial result, and persist transcripts at idle boundaries.
  • OpenCode worker & pool (src/opencode/worker.rs, src/opencode/server.rs): OpenCodeWorker gains ResumeSession, resuming_session, sqlite_pool, with_sqlite_pool(), resume_interactive(...), resumed-run handling, and persist_transcript_snapshot(). OpenCodeServerPool gains active_directories: Mutex<HashSet<PathBuf>> with claim_directory/release_directory to enforce directory exclusivity.
  • Conversation history & DB (src/conversation/history.rs): New public IdleWorkerRow (sqlx::FromRow); get_idle_interactive_workers(agent_id) loads idle interactive rows; fail_idle_worker(worker_id, reason) marks idle workers failed; the reconciler now updates only running rows.
  • Transcript conversion (src/conversation/worker_transcript.rs): Added transcript_to_history(steps: &[TranscriptStep]) -> Vec<rig::message::Message> to reconstruct Rig message history from transcripts; note: the function is duplicated (two identical public definitions).
  • CLI/wiring & spawn helper (src/tools/spawn_worker.rs): Treats OpenCode workers as always interactive (ignoring the input interactive flag), computes effectively_interactive, and sets SpawnWorkerOutput.interactive accordingly.
  • Persistence wiring (src/opencode/..., src/agent/...): Spawn and resume flows consistently attach the secrets store and SQLite pool to workers; transcript snapshots are persisted for resumed and fresh OpenCode workers.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3 passed
  • Title check — Passed: The title 'persist transcripts on idle + reconnect idle workers on restart' directly and accurately describes the main changes: incremental transcript persistence and idle worker reconnection on startup.
  • Description check — Passed: The description comprehensively covers the changeset, detailing incremental transcript persistence, idle worker reconnection, specific file modifications, testing results, and race-condition reasoning.
  • Docstring coverage — Passed: Docstring coverage is 100.00%, which meets the required threshold of 80.00%.


let mut follow_up_failure: Option<String> = None;
if let Some(mut input_rx) = self.input_rx.take() {
self.state = WorkerState::WaitingForInput;
self.persist_transcript(&compacted_history, &history);

Nice durability improvement, but since persist_transcript is fire-and-forget, there’s still a window where we emit WorkerIdle while the UPDATE hasn’t actually landed yet. If the process dies in that gap you can still lose the “idle” snapshot.

Also worth considering write ordering: with multiple idle transitions, older spawned UPDATEs can complete after newer ones and overwrite a more complete transcript.

Might be worth making the persistence async + awaited at idle boundaries (or funnel snapshots through a single ordered queue) so “idle” implies the snapshot is actually durable.
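The "single ordered queue" suggestion can be sketched with a std-only channel and one writer: all snapshots for a worker flow through one receiver, so they can never land out of order. The DB write is simulated here by overwriting a local value; this is a sketch of the reviewer's idea, not the PR's code:

```rust
use std::sync::mpsc;
use std::thread;

// One writer thread drains snapshots in send order, so a later snapshot can
// never be overwritten by an earlier in-flight one. Returns the value that
// ended up "durable" after the queue closed.
fn run_ordered_persistence(snapshots: Vec<String>) -> String {
    let (tx, rx) = mpsc::channel::<String>();
    let writer = thread::spawn(move || {
        let mut durable = String::new();
        // Snapshots are applied strictly in the order they were sent.
        for snapshot in rx {
            durable = snapshot;
        }
        durable
    });
    for s in snapshots {
        tx.send(s).expect("writer alive");
    }
    drop(tx); // close the queue so the writer drains and exits
    writer.join().expect("writer thread panicked")
}
```

In the real async codebase the analogue would be a per-worker tokio task consuming an mpsc of snapshots, with the idle event emitted only after the corresponding write completes.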

let worker_id = self.id.to_string();
let pool = pool.clone();

tokio::spawn(async move {

Same concern here: tokio::spawn makes the snapshot persistence best-effort, so a crash right after send_idle() can still drop the transcript, and multiple in-flight snapshots can complete out of order (older snapshot overwriting a newer one).

If the goal is “idle implies persisted”, consider making this an async fn and awaiting the UPDATE (or pushing snapshots through an ordered persistence task) before emitting the idle event.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/agent/worker.rs (1)

330-337: 🛠️ Refactor suggestion | 🟠 Major

Use transition_to() in this touched lifecycle path.

These assignments still bypass can_transition_to(), so illegal worker-state regressions in the follow-up loop stay silent instead of failing fast.

♻️ Minimal change
-            self.state = WorkerState::WaitingForInput;
+            self.transition_to(WorkerState::WaitingForInput)?;
             self.persist_transcript(&compacted_history, &history);
             self.hook.send_status("waiting for input");
             self.hook.send_worker_idle();

             while let Some(follow_up) = input_rx.recv().await {
-                self.state = WorkerState::Running;
+                self.transition_to(WorkerState::Running)?;
                 self.hook.send_status("processing follow-up");
@@
-                        self.state = WorkerState::Failed;
+                        self.transition_to(WorkerState::Failed)?;
                         self.hook.send_status("failed");
                         follow_up_failure = Some(failure_reason);
                         break;
@@
-                self.state = WorkerState::WaitingForInput;
+                self.transition_to(WorkerState::WaitingForInput)?;
                 self.persist_transcript(&compacted_history, &history);
                 self.hook.send_status("waiting for input");
                 self.hook.send_worker_idle();

As per coding guidelines, "src/agent/worker.rs: Worker state transitions must be validated with can_transition_to() using matches! to catch illegal transitions as runtime errors".

Also applies to: 413-423
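The guideline the review cites can be illustrated with a small state machine where transitions are validated via matches!. The variant set below mirrors the states named in the diff; the actual enum and its full transition table live in worker.rs, so treat this as a hedged sketch:

```rust
// Illustrative worker state machine: illegal transitions are rejected up
// front rather than silently assigned, so regressions fail fast at runtime.
#[derive(Debug, Clone, Copy, PartialEq)]
enum WorkerState {
    Running,
    WaitingForInput,
    Failed,
}

// `matches!` enumerates the legal (from, to) pairs; anything else is illegal.
fn can_transition_to(from: WorkerState, to: WorkerState) -> bool {
    use WorkerState::*;
    matches!(
        (from, to),
        (Running, WaitingForInput) | (Running, Failed) | (WaitingForInput, Running)
    )
}
```

A transition_to() wrapper would call can_transition_to() and return an error on a false result, which is what the suggested `?` propagation in the diff above relies on.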

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agent/worker.rs` around lines 330 - 337, The code sets self.state
directly to WorkerState::WaitingForInput and WorkerState::Running in the
follow-up lifecycle path, bypassing the validation logic; replace those direct
assignments with calls to self.transition_to(...) so transitions go through
can_transition_to() and produce runtime errors on illegal transitions (apply the
same change for the other touched range around the follow-up loop and the block
at lines 413-423). Locate the direct writes to self.state and swap them for
self.transition_to(WorkerState::WaitingForInput) and
self.transition_to(WorkerState::Running), ensuring existing calls like
persist_transcript(...) and hook.send_status(...) remain in place and order is
preserved.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:

In `@src/opencode/worker.rs`:
- Around line 761-772: The detached tokio::spawn that runs the SQL UPDATE on
worker_runs (the query updating transcript and tool_calls by worker_id) can race
with the completion path in src/agent/channel_dispatch.rs which later does an
unconditional UPDATE with get_messages(); fix by adding a monotonic
checkpoint/version or sequencing per-worker writes: either (A) add a numeric
transcript_version (or last_snapshot_seq) column and include it in the UPDATE
WHERE clause and increment it when writing so the idle snapshot only succeeds if
its expected version matches, or (B) serialize writes for a given worker by
routing all transcript persistence through a single async task/Mutex keyed by
worker_id (instead of spawning a detached task in the block that uses blob,
tool_calls, worker_id). Ensure whichever approach you choose updates the
corresponding read in channel_dispatch.rs to respect the version/serialization
so later definitive writes cannot be overwritten by stale snapshots.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 719c7d64-01dd-4162-888d-1040c9d5a47e

📥 Commits

Reviewing files that changed from the base of the PR and between eb6e877 and 58acc81.

📒 Files selected for processing (3)
  • src/agent/channel_dispatch.rs
  • src/agent/worker.rs
  • src/opencode/worker.rs

Comment on lines 331 to 332
self.persist_transcript(&compacted_history, &history);
self.hook.send_status("waiting for input");

⚠️ Potential issue | 🟠 Major

These new idle checkpoints need write ordering.

persist_transcript() is fire-and-forget and overwrites the full transcript blob. With the new calls here, an earlier idle snapshot can complete after a later/final persist and roll worker_runs.transcript back to shorter history. Please serialize transcript writes per worker or guard the update with a monotonically increasing checkpoint.

Also applies to: 421-422
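The monotonically-increasing checkpoint guard suggested here can be sketched std-only: a write is applied only if its sequence number is newer than the last one persisted, so a stale in-flight snapshot cannot roll the transcript back. Field and type names are hypothetical:

```rust
// In the real code this guard would be a WHERE clause on the UPDATE
// (e.g. `WHERE id = ? AND transcript_version < ?`); here the check is
// modelled in memory.
struct GuardedTranscript {
    last_seq: u64,
    transcript: Vec<String>,
}

impl GuardedTranscript {
    fn new() -> Self {
        Self { last_seq: 0, transcript: Vec::new() }
    }

    // Returns true if the snapshot was applied, false if it was stale.
    fn apply_snapshot(&mut self, seq: u64, transcript: Vec<String>) -> bool {
        if seq <= self.last_seq {
            return false; // an older in-flight write arrived late; ignore it
        }
        self.last_seq = seq;
        self.transcript = transcript;
        true
    }
}
```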

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agent/worker.rs` around lines 331 - 332, persist_transcript() is
fire-and-forget and can race, allowing an older idle snapshot to overwrite a
later transcript; serialize or guard these writes by adding per-worker write
ordering (e.g., a mutex/queue) or attach a monotonically increasing
checkpoint/version to each persist and ignore stale updates when writing
worker_runs.transcript; specifically modify the calls around
persist_transcript(&compacted_history, &history) (and the other identical idle
checkpoint call) so they either acquire the worker-level lock/sequence before
persisting or pass/check a checkpoint/version and only update the DB when the
incoming checkpoint is newer.

On startup, query worker_runs for idle interactive workers and attempt
to resume them instead of marking them as failed:

- Split reconciliation: only running workers are marked failed; idle
  workers are left for reconnection
- New get_idle_interactive_workers() query loads idle workers with
  transcript blobs and session metadata
- transcript_to_history() converts persisted TranscriptStep[] back into
  Rig Message history for builtin worker resume
- Builtin Worker::resume_interactive() creates a worker with prior
  history, skips initial task execution, enters follow-up loop directly
- OpenCode OpenCodeWorker::resume_interactive() reconnects to existing
  server+session via get_or_create(), verifies session via get_messages(),
  repopulates accumulated_parts, enters follow-up loop
- resume_idle_worker_into_state() orchestrates resumption into a
  ChannelState (handles both builtin and opencode worker types)
- Startup code in main.rs queries idle workers, groups by channel_id,
  pre-creates channels with outbound routing, and resumes each worker
- Workers that fail to resume are marked failed via fail_idle_worker()
@jamiepine jamiepine changed the title from "fix: persist transcript incrementally for interactive workers on idle" to "fix: persist transcripts on idle + reconnect idle workers on restart" on Mar 6, 2026
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

♻️ Duplicate comments (2)
src/agent/worker.rs (1)

393-396: ⚠️ Potential issue | 🟠 Major

idle is still advertised before the snapshot is durable.

These checkpoints still fire-and-forget the full-transcript update and then immediately emit waiting for input / WorkerIdle. A crash or slow SQLite write in that gap can leave the row marked idle with an older transcript, and concurrent spawned updates can still land out of order.

Also applies to: 484-487

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agent/worker.rs` around lines 393 - 396, The code sets
WorkerState::WaitingForInput and immediately calls persist_transcript(...)
fire-and-forget, then send_status("waiting for input") / send_worker_idle(),
which can advertise idle before the transcript is durably written; change the
sequence to await/handle the result of persist_transcript(...) before
transitioning state or emitting status/idle, propagate or log errors on
persistence failures, and ensure the same fix is applied to the other identical
block that calls WorkerState::WaitingForInput, persist_transcript, send_status,
and send_worker_idle (the later occurrence of these symbols elsewhere in the
file).
src/opencode/worker.rs (1)

907-918: ⚠️ Potential issue | 🟠 Major

WorkerIdle is still only best-effort durable.

This tokio::spawn is detached from the idle transition, so a crash/shutdown right after send_idle() can still leave worker_runs.transcript stale. It also leaves whole-row transcript updates free to complete out of order.

You can confirm the remaining ordering risk with:

#!/bin/bash
rg -n -C3 'persist_transcript_snapshot|UPDATE worker_runs SET transcript|send_idle\(' src

Expected result: persist_transcript_snapshot() is detached here, while callers emit send_idle() immediately after invoking it and other transcript UPDATEs exist elsewhere in the flow.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/opencode/worker.rs` around lines 907 - 918, The
persist_transcript_snapshot call is being detached via tokio::spawn which makes
WorkerIdle transitions only best-effort and allows out-of-order whole-row
UPDATEs; change the persist_transcript_snapshot implementation (the tokio::spawn
block that runs sqlx::query("UPDATE worker_runs SET transcript = ?, tool_calls =
? WHERE id = ?")...) so it is awaited inline (i.e., remove the detached
tokio::spawn and return/await a Result from persist_transcript_snapshot), and
ensure callers of persist_transcript_snapshot (places that call send_idle()
immediately after) await its completion before calling send_idle();
additionally, to prevent update reordering use a transaction or an
optimistic-lock WHERE clause (e.g., version or updated_at) in the worker_runs
UPDATE to ensure updates are applied in order.
🧹 Nitpick comments (1)
src/opencode/worker.rs (1)

394-399: Use .ok() for these broadcast sends instead of let _ =.

These are the one case the repo explicitly allows to be ignored, but it wants the ignore to be spelled as .ok() rather than let _ =.

As per coding guidelines, "Never silently discard errors; no let _ = on Results; handle, log, or propagate errors, with the exception of .ok() on channel sends where the receiver may be dropped".

Also applies to: 443-448

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/opencode/worker.rs` around lines 394 - 399, Replace the ignored Result
patterns that use "let _ = self.event_tx.send(...)" with the explicit ".ok()"
discard on the send calls; specifically update the send of
ProcessEvent::WorkerInitialResult (the self.event_tx.send call wrapping
WorkerInitialResult with agent_id/self.id/channel_id/scrubbed_result) and the
analogous send at the later block (the send that emits the worker result event
around lines 443-448) to call .ok() on the Result instead of using "let _ =",
e.g., change the expression ending in ".send(...)" to ".send(...).ok()" so the
channel-send ignore is explicit per guidelines.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/agent/worker.rs`:
- Around line 273-277: The resumed worker loads the transcript into
compacted_history but calls prompt_once() with history, leaving the model
context empty after restart; to fix, initialize history from prior_history
(e.g., set history = self.prior_history.take().unwrap_or_default() or push a
recap marker into history) and set compacted_history to empty (compacted_history
= Vec::new()) so prompt_once() receives the restored transcript; update the
logic around prior_history, history, compacted_history and prompt_once() to
reflect this transfer.

In `@src/conversation/history.rs`:
- Around line 507-519: The query in get_idle_interactive_workers uses
query_as::<_, IdleWorkerRow> but can fail when tool_calls is NULL; update the
query to coalesce nulls (e.g., COALESCE(tool_calls, 0) AS tool_calls) or change
the IdleWorkerRow definition to make tool_calls an Option<i64> so nullable DB
values map safely; ensure references to IdleWorkerRow,
get_idle_interactive_workers, and the tool_calls column are updated consistently
(or adapt any downstream code that expects a non-optional i64).

In `@src/conversation/worker_transcript.rs`:
- Around line 300-367: transcript_to_history currently maps every
TranscriptStep::Action/ActionContent::Text into Message::Assistant, losing
original speaker role; update the reverse mapping to preserve user vs assistant
text by encoding the role when converting in convert_history and checking it
here: modify transcript_to_history to inspect ActionContent (or a new role-tag
on TranscriptStep/Action) and push Message::User { content:
OneOrMany::one(UserContent::Text(...)) } for user-originated texts and
Message::Assistant { ... } for assistant-originated texts (keep ToolCall
handling as-is), ensuring you reference and update the constructors used here
(transcript_to_history, TranscriptStep::Action, ActionContent::Text,
Message::Assistant, Message::User) so resumed transcripts restore correct
speaker roles.

In `@src/main.rs`:
- Around line 1717-1746: The startup-only outbound router handles
OutboundResponse::Text/RichMessage/ThreadReply by sending
ApiEvent::OutboundMessage but omits emitting ApiEvent::TypingState
(Status::Thinking/Status::StopTyping), causing resumed channels to lose typing
indicators; fix by routing startup responses through the same routing helper
used by the normal channel router (the helper that consumes
spacebot::OutboundResponse and emits both ApiEvent::TypingState and
ApiEvent::OutboundMessage) instead of duplicating the match here so that
api_event_tx and ApiEvent::TypingState (Status::Thinking/Status::StopTyping) are
produced consistently.
- Around line 1595-1601: The calls to
run_logger.fail_idle_worker(...).await.ok() are silently discarding Result
errors; replace the .ok() usage with proper error handling: await the Result
returned by fail_idle_worker (and the other occurrence around the same logic),
then on Err log the error via run_logger.error or propagate the error (return
Err) so SQL update failures are not hidden; locate both occurrences by the
symbol fail_idle_worker and ensure you either match on the Result or use
.await.map_err(|e| { run_logger.error(..., e); e }) to record the failure
instead of swallowing it.

In `@src/opencode/worker.rs`:
- Around line 385-390: After rebuilding event_state in the resuming branch,
persist the recovered transcript/state to durable storage before emitting
WorkerIdle: add a call to the existing checkpoint/save routine (e.g. invoke a
method like self.checkpoint_event_state() or
self.persist_transcript(event_state)) immediately after rebuilding event_state
and before self.send_status(...) / self.send_idle(), so the recovered history is
recorded and the WorkerIdle snapshot reflects the restored transcript.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 28479e10-d093-49dc-8797-4c8801146acb

📥 Commits

Reviewing files that changed from the base of the PR and between 58acc81 and 14e9b0f.

📒 Files selected for processing (6)
  • src/agent/channel_dispatch.rs
  • src/agent/worker.rs
  • src/conversation/history.rs
  • src/conversation/worker_transcript.rs
  • src/main.rs
  • src/opencode/worker.rs

Comment on lines +183 to +199
let (input_tx, input_rx) = mpsc::channel(32);
let mut worker = Self::build(
channel_id,
task,
system_prompt,
deps,
browser_config,
screenshot_dir,
brave_search_key,
logs_dir,
Some(input_rx),
);
// Reuse the original worker ID so DB row stays linked.
worker.id = existing_id;
worker.state = WorkerState::WaitingForInput;
// Stash the prior history so `run_follow_up_loop()` can pick it up.
worker.prior_history = Some(prior_history);

⚠️ Potential issue | 🔴 Critical

Rebuild SpacebotHook when you swap in existing_id.

Self::build() creates the hook with a fresh random ProcessId::Worker, but this constructor only rewrites worker.id. After resume, send_status() / send_worker_idle() will still publish under the stale random ID while DB/state tracking uses existing_id. Also seed WaitingForInput through transition_to() so the resumed path still uses the state machine guardrails.

🛠️ Suggested fix
         let mut worker = Self::build(
             channel_id,
             task,
             system_prompt,
             deps,
             browser_config,
             screenshot_dir,
             brave_search_key,
             logs_dir,
             Some(input_rx),
         );
         // Reuse the original worker ID so DB row stays linked.
         worker.id = existing_id;
-        worker.state = WorkerState::WaitingForInput;
+        worker.hook = SpacebotHook::new(
+            worker.deps.agent_id.clone(),
+            ProcessId::Worker(existing_id),
+            ProcessType::Worker,
+            worker.channel_id.clone(),
+            worker.deps.event_tx.clone(),
+        );
+        worker
+            .transition_to(WorkerState::WaitingForInput)
+            .expect("freshly built worker should be able to enter waiting state");
         // Stash the prior history so `run_follow_up_loop()` can pick it up.
         worker.prior_history = Some(prior_history);

As per coding guidelines "Worker state transitions must be validated with can_transition_to() using matches! to catch illegal transitions as runtime errors".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agent/worker.rs` around lines 183 - 199, When resuming a worker you must
recreate the SpacebotHook to reflect the swapped-in existing_id and set the
state via the state machine instead of assigning fields directly: after calling
Self::build(...) and before using send_status()/send_worker_idle(), reconstruct
the SpacebotHook instance (or reinitialize its internal ProcessId) so its
publishable ID matches existing_id, and call
transition_to(WorkerState::WaitingForInput) (guarded by
can_transition_to()/matches!) rather than setting worker.state =
WorkerState::WaitingForInput; keep prior_history as-is so run_follow_up_loop()
can pick it up and ensure all outgoing publishes use the rebuilt hook tied to
existing_id.

…limit

OpenCode workers can no longer be spawned as non-interactive (fire-and-forget).
The spawn_worker tool forces interactive=true for opencode, and
spawn_opencode_worker_from_state rejects interactive=false as defense-in-depth.

Adds directory claim/release tracking to OpenCodeServerPool so only one
opencode worker can be active per directory at a time. The directory is
claimed before the worker starts and released after it completes (success
or failure).
@coderabbitai bot left a comment
Actionable comments posted: 3

🧹 Nitpick comments (1)
src/agent/channel_dispatch.rs (1)

806-829: Consider extracting transcript persistence into a helper.

The transcript persistence logic (lines 806-829) is nearly identical to the spawn path (lines 570-590). Both serialize the transcript, spawn a fire-and-forget task, and execute the same SQL UPDATE.

Extracting this into a shared helper would reduce duplication and ensure consistent error handling.

♻️ Example helper function
fn persist_transcript_async(
    sqlite_pool: SqlitePool,
    worker_id: WorkerId,
    transcript: &[TranscriptStep],
    tool_calls: i64,
) {
    if transcript.is_empty() {
        return;
    }
    let blob = crate::conversation::worker_transcript::serialize_steps(transcript);
    let wid = worker_id.to_string();
    tokio::spawn(async move {
        if let Err(error) = sqlx::query(
            "UPDATE worker_runs SET transcript = ?, tool_calls = ? WHERE id = ?",
        )
        .bind(&blob)
        .bind(tool_calls)
        .bind(&wid)
        .execute(&sqlite_pool)
        .await
        {
            tracing::warn!(%error, worker_id = wid, "failed to persist transcript");
        }
    });
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agent/channel_dispatch.rs` around lines 806 - 829, Extract the duplicated
transcript persistence into a shared helper (e.g., persist_transcript_async) and
replace both inline blocks (the one after worker.run().await and the earlier
spawn path) with calls to it; the helper should accept SqlitePool, WorkerId, a
&[TranscriptStep] (or slice), and tool_calls (i64), return early if
transcript.is_empty(), call
crate::conversation::worker_transcript::serialize_steps to produce the blob,
then tokio::spawn an async task that runs the same sqlx::query("UPDATE
worker_runs SET transcript = ?, tool_calls = ? WHERE id =
?").bind(&blob).bind(tool_calls).bind(&wid).execute(&sqlite_pool).await and logs
failures with tracing::warn!(%error, worker_id = wid, "failed to persist
transcript"). Ensure you replace the two duplicated code sites with calls to
persist_transcript_async to keep behavior identical.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/agent/channel_dispatch.rs`:
- Around line 761-772: resume_idle_worker_into_state is missing the directory
claim/release that spawn_opencode_worker_from_state uses, creating a race when
resuming OpenCode workers; before calling OpenCodeWorker::resume_interactive in
resume_idle_worker_into_state, call server_pool.claim_directory(&directory) and
ensure you release it (server_pool.release_directory(&directory)) in the resumed
worker's completion/error handler or finally block to mirror
spawn_opencode_worker_from_state behavior and avoid leaving directories claimed
on early exits.
- Line 758: The resumed OpenCode workers lose their original directory because
the directory isn't persisted; add a directory column to the worker_runs table
and a directory: String (or Option<String> if nullable) field to the
IdleWorkerRow struct, persist the directory when creating records in
spawn_opencode_worker_from_state, and change resume_idle_worker_into_state to
use idle_row.directory (falling back to rc.workspace_dir only if directory is
None) instead of unconditionally using rc.workspace_dir so resumed workers
return to their original spawn path.

In `@src/opencode/server.rs`:
- Around line 670-680: claim_directory currently inserts the canonicalized path
but release_directory falls back to the raw path on canonicalize failure, which
can leave orphaned claims; change claim_directory to return the canonical
PathBuf it inserted (e.g., make claim_directory -> Result<PathBuf, _>) and
update all callers to hold and later pass that canonical PathBuf into
release_directory (or change release_directory signature to accept the canonical
PathBuf directly), and ensure release_directory uses that canonical path for
removal; if you keep a fallback canonicalize in release_directory, log an
error-level message when the fallback raw path does not match any existing claim
to make the mismatch observable.

---

Nitpick comments:
In `@src/agent/channel_dispatch.rs`:
- Around line 806-829: Extract the duplicated transcript persistence into a
shared helper (e.g., persist_transcript_async) and replace both inline blocks
(the one after worker.run().await and the earlier spawn path) with calls to it;
the helper should accept SqlitePool, WorkerId, a &[TranscriptStep] (or slice),
and tool_calls (i64), return early if transcript.is_empty(), call
crate::conversation::worker_transcript::serialize_steps to produce the blob,
then tokio::spawn an async task that runs the same sqlx::query("UPDATE
worker_runs SET transcript = ?, tool_calls = ? WHERE id =
?").bind(&blob).bind(tool_calls).bind(&wid).execute(&sqlite_pool).await and logs
failures with tracing::warn!(%error, worker_id = wid, "failed to persist
transcript"). Ensure you replace the two duplicated code sites with calls to
persist_transcript_async to keep behavior identical.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9382d81a-1456-4cc6-9f48-a571fe60e562

📥 Commits

Reviewing files that changed from the base of the PR and between 14e9b0f and 26d578f.

📒 Files selected for processing (3)
  • src/agent/channel_dispatch.rs
  • src/opencode/server.rs
  • src/tools/spawn_worker.rs

Comment on lines +761 to +772
let result = crate::opencode::OpenCodeWorker::resume_interactive(
worker_id,
Some(state.channel_id.clone()),
state.deps.agent_id.clone(),
&idle_worker.task,
directory,
server_pool,
state.deps.event_tx.clone(),
session_id.to_string(),
idle_worker.transcript.clone(),
)
.await;
⚠️ Potential issue | 🟠 Major

Missing claim_directory call for resumed OpenCode workers.

spawn_opencode_worker_from_state calls server_pool.claim_directory(&directory) to prevent concurrent workers on the same directory, but resume_idle_worker_into_state does not. This creates a race condition at startup:

  1. An idle OpenCode worker for /path/to/project is resumed
  2. Before it completes, a channel spawns a new OpenCode worker for the same directory
  3. The new spawn succeeds because no claim exists for the resumed worker

Additionally, there's no corresponding release_directory call in the resume path's completion handler.

🐛 Proposed fix: Add directory claim/release to resume path
             let directory = rc.workspace_dir.clone();
             let server_pool = rc.opencode_server_pool.load().clone();

+            // Claim directory to prevent concurrent workers (same as spawn path).
+            server_pool
+                .claim_directory(&directory)
+                .await
+                .map_err(|e| e.to_string())?;
+
+            let release_pool = server_pool.clone();
+            let release_directory = directory.clone();
+
             let result = crate::opencode::OpenCodeWorker::resume_interactive(

And in the async worker task (around line 807):

                 async move {
-                    let result = worker.run().await.map_err(SpacebotError::from)?;
+                    let result = worker.run().await.map_err(SpacebotError::from);
+
+                    // Release directory claim regardless of success/failure.
+                    release_pool.release_directory(&release_directory).await;
+
+                    let result = result?;
                     // Persist final transcript.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agent/channel_dispatch.rs` around lines 761 - 772,
resume_idle_worker_into_state is missing the directory claim/release that
spawn_opencode_worker_from_state uses, creating a race when resuming OpenCode
workers; before calling OpenCodeWorker::resume_interactive in
resume_idle_worker_into_state, call server_pool.claim_directory(&directory) and
ensure you release it (server_pool.release_directory(&directory)) in the resumed
worker's completion/error handler or finally block to mirror
spawn_opencode_worker_from_state behavior and avoid leaving directories claimed
on early exits.

Comment on lines +670 to +680
let canonical = match directory.canonicalize() {
Ok(path) => path,
Err(error) => {
tracing::warn!(
directory = %directory.display(),
%error,
"could not canonicalize directory during release, using raw path"
);
directory.to_path_buf()
}
};
⚠️ Potential issue | 🟡 Minor

Potential orphaned directory claim if canonicalization fails on release.

When claim_directory succeeds, it inserts the canonical path. If release_directory fails to canonicalize (e.g., directory was deleted during worker run), the fallback uses the raw path which won't match the canonical path in the HashSet, leaving a stale claim.

Consider storing the canonical path at claim time and passing it to release, or logging an error-level message when the fallback path doesn't match any claim.

🛡️ Proposed approach: return the canonical path from claim_directory
-    pub async fn claim_directory(&self, directory: &Path) -> anyhow::Result<()> {
+    pub async fn claim_directory(&self, directory: &Path) -> anyhow::Result<PathBuf> {
         let canonical = directory
             .canonicalize()
             .with_context(|| format!("directory '{}' does not exist", directory.display()))?;

         let mut active = self.active_directories.lock().await;
         if !active.insert(canonical.clone()) {
             bail!(
                 "An OpenCode worker is already active in directory '{}'",
                 canonical.display()
             );
         }
         tracing::debug!(
             directory = %canonical.display(),
             "claimed directory for OpenCode worker"
         );
-        Ok(())
+        Ok(canonical)
     }

Then pass the returned canonical path to release_directory instead of the original path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/opencode/server.rs` around lines 670 - 680, claim_directory currently
inserts the canonicalized path but release_directory falls back to the raw path
on canonicalize failure, which can leave orphaned claims; change claim_directory
to return the canonical PathBuf it inserted (e.g., make claim_directory ->
Result<PathBuf, _>) and update all callers to hold and later pass that canonical
PathBuf into release_directory (or change release_directory signature to accept
the canonical PathBuf directly), and ensure release_directory uses that
canonical path for removal; if you keep a fallback canonicalize in
release_directory, log an error-level message when the fallback raw path does
not match any existing claim to make the mismatch observable.

Six fixes from review bot findings:

1. Rebuild SpacebotHook with the correct existing worker ID in
   Worker::resume_interactive() — Self::build() creates a hook with a
   fresh random ProcessId, but the resumed worker needs events published
   under the original ID that matches the DB row and state tracking.

2. Load restored transcript into `history` (not `compacted_history`)
   so the LLM actually sees prior conversation context on the first
   post-restart follow-up call.

3. COALESCE(tool_calls, 0) in get_idle_interactive_workers() query to
   prevent sqlx::FromRow failure on rows where tool_calls is NULL.

4. Replace .ok() with proper error logging on fail_idle_worker() calls
   in main.rs startup — these are SQL errors, not dropped channel sends.

5. Persist recovered transcript before emitting idle in the OpenCode
   resume path, so a second crash doesn't lose the rebuilt state.

6. Make persist_transcript() and persist_transcript_snapshot() async
   and await them directly instead of fire-and-forget tokio::spawn.
   Ensures "idle implies persisted" and prevents out-of-order writes
   where an older snapshot could overwrite a newer one.
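The ordering hazard fix 6 closes can be illustrated with a version-guarded row, a simplified stand-in for the `worker_runs` transcript column rather than the real persistence code. Awaited, sequential writes can carry a monotonically increasing snapshot version, so a stale snapshot can never land after a newer one; fire-and-forget `tokio::spawn` gives no such ordering guarantee:

```rust
// Simplified stand-in for the transcript row. All names are illustrative.
struct TranscriptRow {
    version: u64,
    blob: String,
}

impl TranscriptRow {
    /// Returns false if the write arrived out of order and was rejected.
    fn persist_snapshot(&mut self, version: u64, blob: &str) -> bool {
        if version <= self.version {
            // A stale snapshot from an earlier idle transition; drop it
            // rather than overwrite the newer blob.
            return false;
        }
        self.version = version;
        self.blob = blob.to_string();
        true
    }
}
```

With awaited persistence the guard is never triggered in practice, because each idle transition completes its write before the next one begins — "idle implies persisted".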
@coderabbitai bot left a comment
Actionable comments posted: 2

♻️ Duplicate comments (2)
src/agent/worker.rs (1)

195-209: 🛠️ Refactor suggestion | 🟠 Major

Use transition_to() for these new waiting-state transitions.

Both assignments bypass the state machine guardrails this file is supposed to enforce. Please route them through transition_to(WorkerState::WaitingForInput) instead.

Suggested fix
-        worker.state = WorkerState::WaitingForInput;
+        worker
+            .transition_to(WorkerState::WaitingForInput)
+            .expect("freshly built worker should be able to enter waiting state");
         // Stash the prior history so `run_follow_up_loop()` can pick it up.
         worker.prior_history = Some(prior_history);
-                self.state = WorkerState::WaitingForInput;
+                self.transition_to(WorkerState::WaitingForInput)?;
                 self.persist_transcript(&compacted_history, &history).await;

As per coding guidelines "Worker state transitions must be validated with can_transition_to() using matches! to catch illegal transitions as runtime errors".

Also applies to: 404-405

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agent/worker.rs` around lines 195 - 209, This block directly assigns
worker.state = WorkerState::WaitingForInput (and similar at lines ~404-405)
which bypasses the state-machine checks; change the code to call the
worker.transition_to(WorkerState::WaitingForInput) API instead so the transition
is validated by can_transition_to() (matches! check) and will error on illegal
transitions—keep the hook rebuild (SpacebotHook::new(...)) and prior_history
assignment, but do not set worker.state directly; invoke transition_to on the
same Worker instance after rebuilding the hook and before returning so the
guarded transition logic is used.
src/main.rs (1)

1729-1758: ⚠️ Potential issue | 🟡 Minor

Reuse the normal outbound router here too.

This startup-only path forwards outbound text, but it never emits ApiEvent::TypingState, so resumed channels lose typing indicators after a restart even though the regular channel path keeps them. Reusing the same routing helper would keep the two flows in sync.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main.rs` around lines 1729 - 1758, The startup-only match on
spacebot::OutboundResponse manually sends ApiEvent::OutboundMessage and omits
emitting spacebot::api::ApiEvent::TypingState, causing resumed channels to lose
typing indicators; instead call the existing outbound routing helper used by the
regular channel path (the same function that takes a response and api_event_tx
and routes TypingState + OutboundMessage) with the current response,
api_event_tx, sse_agent_id and sse_channel_id so both flows stay in sync and
TypingState is emitted consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/agent/worker.rs`:
- Around line 290-298: The resumed-worker branch currently logs and publishes an
initial "running" status before actually receiving input, causing stale
'running' rows if the process crashes; update the logic in run() and/or
log_worker_status() so that when resuming (detect via the resuming boolean used
in the branch and the self.hook calls like send_status/send_worker_idle), you
skip emitting the generic "running" status until the worker actually begins
processing a follow-up input; specifically, add a conditional around the
call/site that sets worker_runs.status='running' (the place where run() calls
log_worker_status("running") or similar) to only execute when resuming==false or
after the first task starts, leaving the existing send_status("resumed — waiting
for input") and send_worker_idle() unchanged.

In `@src/opencode/worker.rs`:
- Around line 297-303: The current resume logic sets
EventState.has_received_event and has_assistant_message to true, which causes
process_events() to consider a stale session.idle as completion and may skip the
first post-restart assistant output; undo that by leaving
EventState.has_received_event and EventState.has_assistant_message false when
constructing the resumed EventState (keep event_state.accumulated_parts and
event_state.tool_calls from resume but do not set the two flags), so the
early-idle guard remains active for the first resumed follow-up and
process_events() will wait for fresh assistant output.

---

Duplicate comments:
In `@src/agent/worker.rs`:
- Around line 195-209: This block directly assigns worker.state =
WorkerState::WaitingForInput (and similar at lines ~404-405) which bypasses the
state-machine checks; change the code to call the
worker.transition_to(WorkerState::WaitingForInput) API instead so the transition
is validated by can_transition_to() (matches! check) and will error on illegal
transitions—keep the hook rebuild (SpacebotHook::new(...)) and prior_history
assignment, but do not set worker.state directly; invoke transition_to on the
same Worker instance after rebuilding the hook and before returning so the
guarded transition logic is used.

In `@src/main.rs`:
- Around line 1729-1758: The startup-only match on spacebot::OutboundResponse
manually sends ApiEvent::OutboundMessage and omits emitting
spacebot::api::ApiEvent::TypingState, causing resumed channels to lose typing
indicators; instead call the existing outbound routing helper used by the
regular channel path (the same function that takes a response and api_event_tx
and routes TypingState + OutboundMessage) with the current response,
api_event_tx, sse_agent_id and sse_channel_id so both flows stay in sync and
TypingState is emitted consistently.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c25061c7-3ebb-4ba0-80c3-4d37334f8314

📥 Commits

Reviewing files that changed from the base of the PR and between 26d578f and b58bca5.

📒 Files selected for processing (4)
  • src/agent/worker.rs
  • src/conversation/history.rs
  • src/main.rs
  • src/opencode/worker.rs

Comment on lines +290 to +298
if resuming {
tracing::info!(
worker_id = %self.id,
prior_messages = history.len(),
"resuming interactive worker with prior history"
);
self.hook.send_status("resumed — waiting for input");
self.hook.send_worker_idle();
}
⚠️ Potential issue | 🟠 Major

Don't mark a resumed idle worker as running before it actually gets input.

This branch goes straight back to idle, but run() still emits the generic "running" status first. Since log_worker_status() persists that as worker_runs.status='running', a second crash before send_worker_idle() leaves the row looking orphaned and startup reconciliation will fail it on the next boot. Only publish "running" once the resumed worker starts processing a follow-up.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agent/worker.rs` around lines 290 - 298, The resumed-worker branch
currently logs and publishes an initial "running" status before actually
receiving input, causing stale 'running' rows if the process crashes; update the
logic in run() and/or log_worker_status() so that when resuming (detect via the
resuming boolean used in the branch and the self.hook calls like
send_status/send_worker_idle), you skip emitting the generic "running" status
until the worker actually begins processing a follow-up input; specifically, add
a conditional around the call/site that sets worker_runs.status='running' (the
place where run() calls log_worker_status("running") or similar) to only execute
when resuming==false or after the first task starts, leaving the existing
send_status("resumed — waiting for input") and send_worker_idle() unchanged.

Comment on lines +297 to +303
let mut event_state = EventState::new();
event_state.accumulated_parts = resume.accumulated_parts;
event_state.tool_calls = resume.tool_calls;
event_state.has_received_event = true;
event_state.has_assistant_message = true;

// Process SSE events until session goes idle or errors.
// EventState tracks status and last_text for the initial result delivery.
// The full transcript is fetched from the OpenCode API on completion.
let mut event_state = EventState::new();
self.process_events(event_response, &session_id, &server, &mut event_state)
.await?;
(server, resume.session_id, event_state, String::new())
⚠️ Potential issue | 🟠 Major

Keep the early-idle guard active for the first resumed follow-up.

Setting has_received_event and has_assistant_message here makes the next process_events() treat a stale session.idle as completion. On the first post-restart follow-up, that can return before any new assistant output is processed.

Suggested fix
                 let mut event_state = EventState::new();
                 event_state.accumulated_parts = resume.accumulated_parts;
                 event_state.tool_calls = resume.tool_calls;
-                event_state.has_received_event = true;
-                event_state.has_assistant_message = true;

                 (server, resume.session_id, event_state, String::new())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/opencode/worker.rs` around lines 297 - 303, The current resume logic sets
EventState.has_received_event and has_assistant_message to true, which causes
process_events() to consider a stale session.idle as completion and may skip the
first post-restart assistant output; undo that by leaving
EventState.has_received_event and EventState.has_assistant_message false when
constructing the resumed EventState (keep event_state.accumulated_parts and
event_state.tool_calls from resume but do not set the two flags), so the
early-idle guard remains active for the first resumed follow-up and
process_events() will wait for fresh assistant output.

Previously, when an idle worker couldn't be reconnected on startup
(e.g. OpenCode server died with kill_on_drop, session expired), the
code called fail_idle_worker() which transitioned the row to 'failed'.

Now the worker stays as 'idle' in the DB with its transcript preserved.
The resume attempt is best-effort — if it fails, the worker is
effectively dead but its history remains available for inspection in
the UI. Only dangling 'running' workers (from unclean shutdown) get
marked as failed via reconcile_running_workers_for_agent().
@coderabbitai bot left a comment
Actionable comments posted: 2

♻️ Duplicate comments (1)
src/main.rs (1)

1709-1738: ⚠️ Potential issue | 🟡 Minor

Mirror typing events in the startup outbound router.

This startup-only match still never maps Status(Thinking) / Status(StopTyping) to ApiEvent::TypingState, so resumed channels lose web typing indicators after restart. Reuse the same outbound-routing helper as the normal path to keep the two behaviors in sync.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main.rs` around lines 1709 - 1738, The startup-only match on `response:
spacebot::OutboundResponse` duplicates outbound routing but omits typing status
mapping; replace this custom match with a call to the same outbound-routing
helper used by the normal path (the function that converts
`spacebot::OutboundResponse` into `spacebot::api::ApiEvent` and sends via
`api_event_tx`) so behaviors stay in sync, and ensure that `Status::Thinking`
and `Status::StopTyping` are mapped to `spacebot::api::ApiEvent::TypingState`
(using `sse_agent_id` and `sse_channel_id`) when emitting through
`api_event_tx`.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/main.rs`:
- Around line 1761-1768: Pre-created resumed channels currently bypass the
post-restart gap backfill because inserting an ActiveChannel causes the
downstream contains_key check to treat the first inbound as already-active; add
a one-time "gap backfilled" flag to ActiveChannel (e.g., gap_backfilled: bool)
initialized false when you insert the pre-created channel, then in the inbound
handling path (the branch that currently checks
active_channels.contains_key(...) and before forwarding the first real inbound)
detect gap_backfilled == false, call fetch_history(conversation_id) to reconcile
missed messages, merge/update latest_message as needed, set gap_backfilled =
true, and only then forward the inbound; update any code that clones or
constructs ActiveChannel (_outbound_handle, latest_message) to include the new
flag and ensure the check-and-backfill is performed atomically (or via the same
mutex/entry) so multiple simultaneous inbounds don't double-run fetch_history.
- Around line 1557-1560: Extract the idle-worker recovery block (the code that
pre-creates channels and spawns resumed workers when agents_initialized is true)
into a named helper function (e.g., resume_idle_workers_after_init or
resume_persisted_idle_workers) and replace the inline if agents_initialized {
... } with a call to that helper; then call the same helper from both places
where initialize_agents(...) completes successfully (the original startup path
and the provider-setup branch that runs later) so stranded idle workers are
retried; finally add a regression test that simulates daemon start in setup
mode, performs the provider-setup path to reach provider-ready, and asserts
persisted idle workers are resumed (use the same helper path or integration test
hooks).

---

Duplicate comments:
In `@src/main.rs`:
- Around line 1709-1738: The startup-only match on `response:
spacebot::OutboundResponse` duplicates outbound routing but omits typing status
mapping; replace this custom match with a call to the same outbound-routing
helper used by the normal path (the function that converts
`spacebot::OutboundResponse` into `spacebot::api::ApiEvent` and sends via
`api_event_tx`) so behaviors stay in sync, and ensure that `Status::Thinking`
and `Status::StopTyping` are mapped to `spacebot::api::ApiEvent::TypingState`
(using `sse_agent_id` and `sse_channel_id`) when emitting through
`api_event_tx`.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a360de4b-bc94-4ebd-963c-ddb3add3382b

📥 Commits

Reviewing files that changed from the base of the PR and between b58bca5 and 43c4d03.

📒 Files selected for processing (1)
  • src/main.rs

…el after restart

Two bugs preventing idle worker visibility after restart:

1. opencode_session_id was never persisted to the DB. The
   OpenCodeSessionCreated event is filtered out by event_is_for_channel()
   (returns false for that variant), so the channel handler's
   log_opencode_metadata() call was unreachable dead code. Fix: persist
   session metadata directly from the OpenCode worker's run() method
   via sqlite_pool, bypassing the event system entirely.

2. When resume fails (e.g. session expired because server was killed on
   shutdown), the channel's status block had no knowledge of the idle
   worker. Fix: on resume failure, still register the worker into the
   status block as 'idle (session expired)' so it appears in the
   channel's UI and the LLM knows about it.
Fixes zombie cycle where OpenCode workers with expired sessions were
left as 'idle' in the DB on every restart, reappearing in the channel
status block as 'idle (session expired)' and showing as 'running' in
the workers tab with no way to dismiss them.

- Retire irrecoverable idle workers as 'done' (not 'failed') on startup
  since the worker completed its work — only the follow-up session expired
- Pre-filter OpenCode workers without session metadata before creating
  the channel (these can never resume due to kill_on_drop)
- Stop registering expired workers in the channel status block
- Add channel_id to OpenCodeSessionCreated event so it routes through
  event_is_for_channel() properly, removing the persist_opencode_metadata
  workaround (log_opencode_metadata in channel.rs is no longer dead code)
- Add TranscriptStep::UserText variant to preserve user vs assistant text
  distinction across transcript serialization round-trips
- Emit TypingState SSE events from pre-created channel outbound router
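The routing change in the fourth bullet can be sketched as follows. The variant fields and function shape are illustrative assumptions, not the real event definition — the point is that once `OpenCodeSessionCreated` carries a `channel_id`, `event_is_for_channel()` can route it like any other channel-scoped event instead of unconditionally returning false:

```rust
// Hypothetical shape of the routing fix; field names are illustrative.
enum ChannelEvent {
    OpenCodeSessionCreated {
        channel_id: Option<String>,
        session_id: String,
    },
}

fn event_is_for_channel(event: &ChannelEvent, channel: &str) -> bool {
    match event {
        // Previously this variant had no channel_id and the function
        // returned false, making the channel handler unreachable.
        ChannelEvent::OpenCodeSessionCreated { channel_id, .. } => {
            channel_id.as_deref() == Some(channel)
        }
    }
}
```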
Resumed OpenCode workers were losing their original directory because
it was never persisted to the DB — resume fell back to workspace_dir,
spawning a new server in the wrong directory and ignoring the
one-per-directory limit.

- Add 'directory' column to worker_runs (new migration)
- Persist directory from spawn_opencode_worker_from_state via UPDATE
- Read directory from IdleWorkerRow in resume_idle_worker_into_state
- Fall back to workspace_dir only when directory is NULL (pre-existing
  workers without the column)
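The fallback in the last bullet reduces to a one-liner; a minimal sketch, assuming the persisted column is read into an `Option<String>` (names are illustrative):

```rust
// A resumed worker uses its persisted directory when present, and falls
// back to workspace_dir only for rows created before the migration,
// where the directory column is NULL.
fn resume_directory(persisted: Option<String>, workspace_dir: &str) -> String {
    persisted.unwrap_or_else(|| workspace_dir.to_string())
}
```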