
fix: stop worker secret scans from killing internal work#297

Merged
jamiepine merged 2 commits into main from fix/worker-leak-egress-sandbox-off on Mar 3, 2026

Conversation


@jamiepine jamiepine commented Mar 3, 2026

Summary

  • Move secret-pattern hard-blocking to channel egress (reply + plaintext fallback) so worker/internal tool output no longer kills long-running jobs.
  • In sandbox passthrough mode, preserve parent HOME and allow shell/exec working directories outside the workspace when sandbox is disabled.
  • Fix stuck running workers by handling worker-task panics as failed completions, reconciling orphaned running rows to failed on startup, and allowing API cancel to close detached runs after restart.


coderabbitai bot commented Mar 3, 2026

Walkthrough

This PR changes sandbox HOME handling (workspace vs parent HOME), shifts leak detection to channel-egress enforcement with redaction and non-fatal worker output handling, makes workspace-boundary enforcement for working directories conditional on sandbox mode, and adds worker-run reconciliation/cancellation and status removal logic.

Changes

Cohort / File(s) Summary
Documentation
README.md, docs/content/docs/(configuration)/config.mdx, docs/content/docs/(configuration)/sandbox.mdx, docs/content/docs/(features)/workers.mdx
Update docs to describe mode-dependent HOME behavior, channel-egress leak handling (blocking outbound text), output scrubbing, worker secret management, and durable tools/bin note.
Channel & Fallback Handling
src/agent/channel.rs, src/agent/channel_dispatch.rs
Adds runtime leak scans in plaintext fallback paths and wraps worker futures with unwind (panic) handling; preserves scrubbing but logs leak detections and reports panics as worker failures.
Hook / Reply Enforcement
src/hooks/spacebot.rs, src/tools/reply.rs
Restricts strict leak enforcement to channel egress for the reply tool (on call/result); blocks reply content on detected secrets and returns explicit errors/reasons.
Worker Runtime Behavior
src/opencode/worker.rs, src/agent/status.rs
Downgrades fatal leak termination to warnings for worker outputs; adds StatusBlock::remove_worker to remove active workers by ID for cancellation cleanup.
Sandbox / Environment
src/sandbox.rs
In passthrough mode, derive HOME from existing parent HOME when present, otherwise fall back to workspace; preserves other env sanitization behaviors.
Tool Execution & Shell
src/tools/exec.rs, src/tools/shell.rs
Resolve relative working_dir against workspace, canonicalize, and only enforce workspace-boundary restrictions when sandbox mode is enabled (absolute paths allowed otherwise).
API & Worker Cancellation
src/api/channels.rs
Two-stage cancel: try live channel_state cancellation, then fall back to iterating agent pools to cancel detached runs; returns detailed messages and error handling for detached cancellations.
Process Run Persistence / Reconciliation
src/conversation/history.rs, src/main.rs
Adds ProcessRunLogger methods to reconcile or cancel running worker runs and invokes reconciliation at per-agent startup to mark orphaned workers failed.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks — ✅ 3 passed
  • Title check — ✅ Passed: The title clearly and specifically describes the main change: moving secret scans from worker termination to channel egress to prevent internal work disruption.
  • Description check — ✅ Passed: The description is directly related to the changeset, outlining the three primary objectives: secret-pattern blocking at channel egress, sandbox passthrough mode changes, and worker lifecycle handling.
  • Docstring coverage — ✅ Passed: Docstring coverage is 100.00%, above the required threshold of 80.00%.


@jamiepine jamiepine marked this pull request as ready for review March 3, 2026 09:32

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
src/tools/exec.rs (1)

162-192: Extract shared working_dir resolution to a helper.

Lines 162–192 duplicate the resolution/boundary logic used in src/tools/shell.rs (lines 105–135). Centralizing this in one utility will reduce the risk of policy drift.

As per coding guidelines, "Do not abbreviate variable names; use queue not q, message not msg, channel not ch; common abbreviations like config are acceptable."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/tools/exec.rs` around lines 162 - 192, Extract the duplicated working_dir
resolution and workspace-boundary check into a shared helper named
resolve_working_dir that takes (workspace: &Path, sandbox: &Sandbox,
working_dir_option: Option<&str>) -> Result<PathBuf, ExecError> (or a small
shared error type) and replace the inline logic in the Exec method (where
args.working_dir, self.workspace, and self.sandbox.mode_enabled() are used) and
the identical block in the shell code to call this helper; ensure the helper
resolves absolute vs relative paths, canonicalizes safely (falling back to the
original resolved path), enforces the workspace canonical start-with check when
sandbox.mode_enabled() is true, returns the canonical PathBuf on success, and
use full, non-abbreviated names like resolve_working_dir, workspace_canonical,
and working_dir_option per naming guidelines.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 301bab5 and d6d8322.

📒 Files selected for processing (11)
  • README.md
  • docs/content/docs/(configuration)/config.mdx
  • docs/content/docs/(configuration)/sandbox.mdx
  • docs/content/docs/(features)/workers.mdx
  • src/agent/channel.rs
  • src/hooks/spacebot.rs
  • src/opencode/worker.rs
  • src/sandbox.rs
  • src/tools/exec.rs
  • src/tools/reply.rs
  • src/tools/shell.rs

The agent's data directory (databases, config) is explicitly re-mounted read-only even if it would otherwise be writable due to path overlap.

-Regardless of mode, all worker subprocesses start with a clean environment. Only `PATH` (with `tools/bin` prepended), safe variables (`HOME`, `USER`, `LANG`, `TERM`, `TMPDIR`), and any `passthrough_env` entries are injected. Use `passthrough_env` with least privilege: only forward variables required by worker tools (for example specific credentials set in Docker Compose or systemd), and avoid forwarding broad or highly sensitive credentials.
+Regardless of mode, all worker subprocesses start with a clean environment. Only `PATH` (with `tools/bin` prepended), safe variables (`HOME`, `USER`, `LANG`, `TERM`, `TMPDIR`), and any `passthrough_env` entries are injected. `HOME` is mode-dependent: workspace path when sandboxed, parent `HOME` when passthrough is enabled. Use `passthrough_env` with least privilege: only forward variables required by worker tools (for example specific credentials set in Docker Compose or systemd), and avoid forwarding broad or highly sensitive credentials.
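The clean-environment behavior these docs describe can be sketched with `std::process::Command` alone. This is a sketch, not the project's implementation: the `LEAKY_TOKEN` name and `/workspace` home are illustrative, and a POSIX `sh` at the usual locations is assumed.

```rust
use std::process::Command;

// Spawn a child with a cleared environment plus an explicit allowlist,
// mirroring the "clean environment + safe variables" model described above.
// Anything inherited from the parent (e.g. a LEAKY_TOKEN credential) is dropped.
fn spawn_clean(home: &str) -> String {
    let output = Command::new("sh")
        .arg("-c")
        .arg(r#"echo "${HOME:-unset} ${LEAKY_TOKEN:-unset}""#)
        .env_clear()                      // drop the entire inherited environment
        .env("HOME", home)                // re-inject only the allowlisted vars
        .env("PATH", "/usr/bin:/bin")
        .output()
        .expect("failed to spawn sh");
    String::from_utf8_lossy(&output.stdout).trim().to_string()
}

fn main() {
    // HOME is whatever we injected; LEAKY_TOKEN never reaches the child.
    assert_eq!(spawn_clean("/workspace"), "/workspace unset");
}
```

Swapping the `home` argument between a workspace path and the parent's `HOME` is exactly the mode-dependent choice the doc diff above documents.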

⚠️ Potential issue | 🟡 Minor

Environment list omits tool secret injection.

Line 574 says only PATH, safe vars, and passthrough_env are injected, but runtime also injects tool-category secrets from the secret store. Please include that to avoid operator confusion.

Suggested doc tweak
-Regardless of mode, all worker subprocesses start with a clean environment. Only `PATH` (with `tools/bin` prepended), safe variables (`HOME`, `USER`, `LANG`, `TERM`, `TMPDIR`), and any `passthrough_env` entries are injected.
+Regardless of mode, all worker subprocesses start with a clean environment. Only `PATH` (with `tools/bin` prepended), safe variables (`HOME`, `USER`, `LANG`, `TERM`, `TMPDIR`), tool-category secrets from the secret store, and any `passthrough_env` entries are injected.
📝 Committable suggestion


Suggested change
-Regardless of mode, all worker subprocesses start with a clean environment. Only `PATH` (with `tools/bin` prepended), safe variables (`HOME`, `USER`, `LANG`, `TERM`, `TMPDIR`), and any `passthrough_env` entries are injected. `HOME` is mode-dependent: workspace path when sandboxed, parent `HOME` when passthrough is enabled. Use `passthrough_env` with least privilege: only forward variables required by worker tools (for example specific credentials set in Docker Compose or systemd), and avoid forwarding broad or highly sensitive credentials.
+Regardless of mode, all worker subprocesses start with a clean environment. Only `PATH` (with `tools/bin` prepended), safe variables (`HOME`, `USER`, `LANG`, `TERM`, `TMPDIR`), tool-category secrets from the secret store, and any `passthrough_env` entries are injected. `HOME` is mode-dependent: workspace path when sandboxed, parent `HOME` when passthrough is enabled. Use `passthrough_env` with least privilege: only forward variables required by worker tools (for example specific credentials set in Docker Compose or systemd), and avoid forwarding broad or highly sensitive credentials.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/content/docs/`(configuration)/config.mdx at line 574, Update the
sentence describing injected environment variables to also mention that
tool-category secrets from the secret store are injected into worker runtimes;
modify the paragraph that currently lists `PATH`, safe vars, and
`passthrough_env` to explicitly include "tool-category secrets (from the secret
store)" and clarify whether those secrets are always injected or conditional,
and add a brief operator guidance to limit passthrough_env and secrets to least
privilege use. Reference the existing identifiers `PATH`, `HOME`,
`passthrough_env`, and "tool-category secrets" so the change clearly augments
the current environment list.

Secrets are split into two categories: **system** (LLM API keys, messaging tokens — never exposed to subprocesses) and **tool** (CLI credentials like `GH_TOKEN` — injected as env vars into workers). The category is auto-assigned based on the secret name, or set explicitly.
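The docs don't spell out the auto-assignment heuristic. Purely as a hypothetical illustration (the marker strings and function name are invented, not from the codebase), a name-based classifier might look like:

```rust
#[derive(Debug, PartialEq)]
enum SecretCategory {
    System, // never exposed to subprocesses (LLM API keys, messaging tokens)
    Tool,   // injected as env vars into workers (CLI credentials)
}

// Hypothetical heuristic: names matching system-style markers are treated
// as system secrets; everything else defaults to the tool category.
fn classify(name: &str) -> SecretCategory {
    const SYSTEM_MARKERS: [&str; 3] = ["API_KEY", "BOT_TOKEN", "SIGNING"];
    if SYSTEM_MARKERS.iter().any(|marker| name.contains(marker)) {
        SecretCategory::System
    } else {
        SecretCategory::Tool
    }
}

fn main() {
    assert_eq!(classify("ANTHROPIC_API_KEY"), SecretCategory::System);
    assert_eq!(classify("GH_TOKEN"), SecretCategory::Tool);
}
```

An explicit per-secret override, as the text mentions, would simply take precedence over this heuristic.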

-- **Environment sanitization** — every subprocess starts with a clean environment (`--clearenv` on Linux, `env_clear()` everywhere else). Only safe baseline vars (`PATH`, `HOME`, `LANG`), tool-category secrets, and explicit `passthrough_env` entries are present. System secrets never enter any subprocess
+- **Environment sanitization** — every subprocess starts with a clean environment (`--clearenv` on Linux, `env_clear()` everywhere else). Only safe baseline vars (`PATH`, `HOME`, `LANG`), tool-category secrets, and explicit `passthrough_env` entries are present. In sandbox mode, `HOME` is set to the workspace; in passthrough mode, `HOME` uses the parent environment. System secrets never enter any subprocess

⚠️ Potential issue | 🟡 Minor

Safe-variable list is currently incomplete.

Line 251 lists only PATH, HOME, and LANG, but current runtime also injects USER, TERM, and TMPDIR. Please align the README list to avoid ambiguity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` at line 251, Update the README's "Environment sanitization"
safe-variable list to match runtime behavior by adding USER, TERM, and TMPDIR
alongside PATH, HOME, and LANG; edit the sentence that currently names only
PATH, HOME, and LANG so it enumerates PATH, HOME, LANG, USER, TERM, and TMPDIR
and clarifies which of those are set in sandbox vs passthrough modes (HOME
behavior remains as described).

Comment on lines +151 to 163
         if self.process_type == ProcessType::Channel
             && tool_name == "reply"
             && let Some(leak) = self.scan_for_leaks(result)
         {
             tracing::error!(
                 process_id = %self.process_id,
                 tool_name = %tool_name,
                 leak_prefix = %&leak[..leak.len().min(8)],
-                "secret leak detected in tool output, terminating agent"
+                "secret leak detected in reply result, terminating channel turn"
             );
             return HookAction::Terminate {
-                reason: "Tool output contained a secret. Agent terminated to prevent exfiltration."
-                    .into(),
+                reason: "Reply contained a secret. Channel turn terminated.".into(),
             };

⚠️ Potential issue | 🟠 Major

Leak-termination path skips metrics timer cleanup.

Line 151–163 returns early before removing _internal_call_id from TOOL_CALL_TIMERS, so repeated blocked replies can accumulate stale timer entries when metrics are enabled.

Suggested fix
         if self.process_type == ProcessType::Channel
             && tool_name == "reply"
             && let Some(leak) = self.scan_for_leaks(result)
         {
+            #[cfg(feature = "metrics")]
+            if let Ok(mut timers) = TOOL_CALL_TIMERS.lock() {
+                timers.remove(_internal_call_id);
+            }
+
             tracing::error!(
                 process_id = %self.process_id,
                 tool_name = %tool_name,
                 leak_prefix = %&leak[..leak.len().min(8)],
                 "secret leak detected in reply result, terminating channel turn"
             );
             return HookAction::Terminate {
                 reason: "Reply contained a secret. Channel turn terminated.".into(),
             };
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/hooks/spacebot.rs` around lines 151 - 163, The early return in the
leak-detection branch inside the Channel reply path skips cleanup of the
per-call timer; before returning the HookAction::Terminate (in the block
checking self.process_type == ProcessType::Channel && tool_name == "reply" &&
let Some(leak) = self.scan_for_leaks(result)), remove the `_internal_call_id`
entry from TOOL_CALL_TIMERS (or otherwise stop/clear the timer) so metrics don't
leak, then log and return; you can call the same cleanup logic used elsewhere
for removing `_internal_call_id` to keep behavior consistent with non-leak
paths.

} else {
self.workspace.join(raw_path)
};
let canonical = resolved.canonicalize().unwrap_or(resolved);

If canonicalize() fails, resolved can still contain .. (e.g. workspace/../outside), and the starts_with(workspace_canonical) check will pass. In sandbox mode that looks like a workspace-boundary bypass. Consider requiring a successful canonicalization (or lexically normalizing) before the boundary check.

Suggested change
-        let canonical = resolved.canonicalize().unwrap_or(resolved);
+        let canonicalized = resolved.canonicalize();
+        let canonical = if self.sandbox.mode_enabled() {
+            canonicalized.map_err(|_| ShellError {
+                message: format!(
+                    "working_dir must exist and be within the workspace ({}).",
+                    self.workspace.display()
+                ),
+                exit_code: -1,
+            })?
+        } else {
+            canonicalized.unwrap_or(resolved)
+        };
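The bypass both reviewers describe is easy to reproduce: when the target path does not exist, `canonicalize()` fails, the `unwrap_or` fallback keeps the lexical join, and the `starts_with` boundary check passes despite the `..` escape. A self-contained demonstration (the `ws_demo` directory name is arbitrary):

```rust
use std::fs;
use std::path::{Path, PathBuf};

// Resolve a working_dir the way the reviewed code does: join onto the
// workspace, then canonicalize with a lexical fallback on failure.
fn resolve(workspace: &Path, raw_path: &str) -> PathBuf {
    let resolved = workspace.join(raw_path);
    resolved.canonicalize().unwrap_or(resolved)
}

fn main() -> std::io::Result<()> {
    let workspace = std::env::temp_dir().join("ws_demo");
    fs::create_dir_all(&workspace)?;
    // "missing" does not exist, so canonicalize() errors and the raw join
    // survives with its `..` components uncollapsed.
    let escaped = resolve(&workspace, "missing/../../outside");
    // The component-wise starts_with check still passes: a boundary bypass.
    assert!(escaped.starts_with(&workspace));
    assert!(escaped.components().any(|c| c.as_os_str() == ".."));
    Ok(())
}
```

This is why the suggested fixes above require a successful canonicalization before the boundary check whenever sandbox mode is enabled.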

} else {
self.workspace.join(raw_path)
};
let canonical = resolved.canonicalize().unwrap_or(resolved);

Same canonicalize().unwrap_or(resolved) issue here: when canonicalization fails, a relative working_dir containing .. can still pass the starts_with(workspace) check in sandbox mode.

Suggested change
-        let canonical = resolved.canonicalize().unwrap_or(resolved);
+        let canonicalized = resolved.canonicalize();
+        let canonical = if self.sandbox.mode_enabled() {
+            canonicalized.map_err(|_| ExecError {
+                message: format!(
+                    "working_dir must exist and be within the workspace ({}).",
+                    self.workspace.display()
+                ),
+                exit_code: -1,
+            })?
+        } else {
+            canonicalized.unwrap_or(resolved)
+        };

"blocked retrigger fallback output containing structured or tool syntax"
);
} else if let Some(leak) = crate::secrets::scrub::scan_for_leaks(text) {
tracing::warn!(

scan_for_leaks() returns the matched secret substring (or "url-encoded: {matched}", etc). Logging leak_prefix leaks part of the secret into logs, and the [..] slice can also panic on non-ASCII boundaries. Maybe log only the encoding kind/length instead (same pattern repeats in the other fallback blocks below).

Suggested change
-            tracing::warn!(
+        } else if let Some(leak) = crate::secrets::scrub::scan_for_leaks(text) {
+            let leak_kind = leak
+                .split_once(": ")
+                .map(|(kind, _)| kind)
+                .unwrap_or("plaintext");
+            tracing::warn!(
+                channel_id = %self.id,
+                leak_kind,
+                "blocked retrigger fallback output matching secret pattern"
+            );
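The suggested `split_once`-based extraction, plus the non-ASCII slicing hazard the comment mentions, in a runnable form (the input strings are illustrative):

```rust
// Log only the detector kind, never any portion of the matched secret.
// scan_for_leaks-style results are either "kind: matched" or a bare match.
fn leak_kind(leak: &str) -> &str {
    leak.split_once(": ").map(|(kind, _)| kind).unwrap_or("plaintext")
}

fn main() {
    assert_eq!(leak_kind("url-encoded: sk-abc123"), "url-encoded");
    assert_eq!(leak_kind("sk-abc123"), "plaintext");
    // Byte-slicing a prefix like &leak[..8] can panic mid-codepoint:
    // in "sécrete", the 'é' occupies bytes 1..3, so byte 2 is not a boundary.
    assert!(!"sécrete".is_char_boundary(2));
}
```

Logging only the kind also avoids the original `leak_prefix` pattern, which wrote the first bytes of the secret itself into the logs.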


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
src/api/channels.rs (1)

210-241: Add targeted async race tests for this two-stage cancellation path.

Given the new live-then-detached reconciliation flow, please add focused tests for terminal-state races (live success, live miss + detached success, detached error, already-finished worker).

Based on learnings: "For changes in async/stateful paths (worker lifecycle, cancellation, retrigger, recall cache behavior), include explicit race/terminal-state reasoning in the PR summary and run targeted tests in addition to just gate-pr".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/api/channels.rs` around lines 210 - 241, Add targeted async race tests
covering the two-stage cancellation flow exercised by the handler in
channels.rs: simulate the live-path and fallback detached-path behavior by
invoking the same cancellation flow that calls
ProcessRunLogger::cancel_running_worker (and the HTTP cancel endpoint that wraps
it) under concurrent interleavings; include tests for (1) live success (worker
canceled while live), (2) live miss + detached success (no live state then
cancel_running_worker returns true), (3) detached error (cancel_running_worker
returns Err and the handler returns 500), and (4) already-finished worker
(returns not found). Use tokio or the existing test harness to orchestrate races
and assert terminal states and HTTP/status-code outcomes so the reconciliation
logic is explicitly exercised.


📥 Commits

Reviewing files that changed from the base of the PR and between d6d8322 and 6a3f93d.

📒 Files selected for processing (6)
  • src/agent/channel.rs
  • src/agent/channel_dispatch.rs
  • src/agent/status.rs
  • src/api/channels.rs
  • src/conversation/history.rs
  • src/main.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/agent/channel.rs

Comment on lines +531 to +547
Err(panic_payload) => {
let panic_message = panic_payload
.downcast_ref::<&str>()
.map(|message| (*message).to_string())
.or_else(|| panic_payload.downcast_ref::<String>().cloned())
.unwrap_or_else(|| "unknown panic payload".to_string());
tracing::error!(
worker_id = %worker_id,
panic_message = %panic_message,
"worker task panicked"
);
(
format!("Worker failed: worker task panicked: {panic_message}"),
true,
false,
)
}

⚠️ Potential issue | 🟡 Minor

Apply secret scrubbing to panic messages.

The success path scrubs secrets via scrub_with_store() before the result reaches the channel, but the panic path sends panic_message unscrubbed. If a worker panics while processing sensitive data (e.g., panic!("failed at step: {}", data)), the secret could leak through the event.

Given the PR's focus on enforcing secret-pattern blocking at egress, consider scrubbing the panic message consistently with the success path.

🛡️ Proposed fix to scrub panic messages
             Err(panic_payload) => {
                 let panic_message = panic_payload
                     .downcast_ref::<&str>()
                     .map(|message| (*message).to_string())
                     .or_else(|| panic_payload.downcast_ref::<String>().cloned())
                     .unwrap_or_else(|| "unknown panic payload".to_string());
+                let panic_message = if let Some(store) = &secrets_store {
+                    crate::secrets::scrub::scrub_with_store(&panic_message, store)
+                } else {
+                    panic_message
+                };
                 tracing::error!(
                     worker_id = %worker_id,
                     panic_message = %panic_message,
                     "worker task panicked"
                 );

Based on learnings: "Implement leak detection with regex patterns for API keys, tokens, PEM keys; scan in SpacebotHook.on_tool_result() and before outbound HTTP to block exfiltration"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agent/channel_dispatch.rs` around lines 531 - 547, The panic branch in
channel_dispatch.rs returns and logs the raw panic_message, which can leak
secrets; before logging and returning, run the panic_message through the same
scrubber used on success (call scrub_with_store(panic_message, &secret_store) or
the local secret store instance used elsewhere) and use the scrubbed string for
tracing::error and the returned formatted message (i.e., replace uses of
panic_message with the scrubbed result); ensure you reference the same
scrub_with_store function and the same secret store variable used by the success
path so panic output is consistently scrubbed.

Comment on lines +201 to +203
if let Some(channel_state) = channel_state
&& channel_state.cancel_worker(worker_id).await.is_ok()
{

⚠️ Potential issue | 🟠 Major

Don’t suppress live cancellation errors before fallback.

At Line 202, .is_ok() drops the actual cancel_worker error, which makes failure diagnosis and terminal-state handling harder. Please log (or branch on) the error before attempting detached reconciliation.

Proposed fix
-            if let Some(channel_state) = channel_state
-                && channel_state.cancel_worker(worker_id).await.is_ok()
-            {
-                return Ok(Json(CancelProcessResponse {
-                    success: true,
-                    message: format!("Worker {} cancelled", request.process_id),
-                }));
-            }
+            if let Some(channel_state) = channel_state {
+                match channel_state.cancel_worker(worker_id).await {
+                    Ok(()) => {
+                        return Ok(Json(CancelProcessResponse {
+                            success: true,
+                            message: format!("Worker {} cancelled", request.process_id),
+                        }));
+                    }
+                    Err(error) => {
+                        tracing::warn!(
+                            %error,
+                            channel_id = %request.channel_id,
+                            process_id = %request.process_id,
+                            "live worker cancel failed; attempting detached reconciliation"
+                        );
+                    }
+                }
+            }

As per coding guidelines: "Don't silently discard errors; use let _ = only on channel sends where the receiver may be dropped; handle, log, or propagate all other errors".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/api/channels.rs` around lines 201 - 203, The code currently discards the
result of channel_state.cancel_worker(worker_id).await by calling .is_ok(),
which hides errors; change it to match on the Result from
channel_state.cancel_worker(worker_id).await (or bind it to a variable) and if
Err(e) log the error (using the appropriate logger) or handle/propagate it
before falling back to detached reconciliation; ensure the branch still performs
the successful cancellation path when Ok(_) and only proceeds to the fallback
when cancellation truly failed or when channel_state is None.

Comment on lines +229 to +237
Err(error) => {
tracing::warn!(
%error,
channel_id = %request.channel_id,
process_id = %request.process_id,
"failed to cancel detached worker run"
);
return Err(StatusCode::INTERNAL_SERVER_ERROR);
}

⚠️ Potential issue | 🟠 Major

Avoid returning 500 on the first detached-pool error.

At Lines 229-237, a single pool failure short-circuits the scan. That can produce a 500 even when another pool could still cancel the worker. Continue scanning, track whether any pool errored, and decide 500 vs 404 after the loop.

Proposed fix
             let pools = state.agent_pools.load();
+            let mut saw_detached_error = false;
             for (_agent_id, pool) in pools.iter() {
                 let logger = ProcessRunLogger::new(pool.clone());
                 match logger
                     .cancel_running_worker(&request.channel_id, worker_id)
                     .await
                 {
                     Ok(true) => {
                         return Ok(Json(CancelProcessResponse {
                             success: true,
                             message: format!(
                                 "Worker {} cancelled (detached run reconciled)",
                                 request.process_id
                             ),
                         }));
                     }
                     Ok(false) => {}
                     Err(error) => {
+                        saw_detached_error = true;
                         tracing::warn!(
                             %error,
                             channel_id = %request.channel_id,
                             process_id = %request.process_id,
                             "failed to cancel detached worker run"
                         );
-                        return Err(StatusCode::INTERNAL_SERVER_ERROR);
                     }
                 }
             }
 
+            if saw_detached_error {
+                return Err(StatusCode::INTERNAL_SERVER_ERROR);
+            }
             Err(StatusCode::NOT_FOUND)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/api/channels.rs` around lines 229 - 237, The current Err(error) arm in
the pool-scan loop (the block that logs via tracing::warn with %error,
channel_id and process_id and returns StatusCode::INTERNAL_SERVER_ERROR)
short-circuits the scan and can produce a 500 even if another pool might
succeed; change this so the loop continues on errors: introduce two boolean
flags (e.g., any_errored and any_cancelled), in the Err(error) branch set
any_errored = true and log the error but do not return, in the Ok branch set
any_cancelled = true (and optionally break/return success if you want
first-success behavior), and after the loop decide the response: if
any_cancelled return success, else if any_errored return
StatusCode::INTERNAL_SERVER_ERROR, otherwise return StatusCode::NOT_FOUND;
update the Err(error) arm that currently returns immediately (and references
tracing::warn and StatusCode::INTERNAL_SERVER_ERROR) accordingly.
