
fix: stop worker secret scans from killing internal work#297

Merged
jamiepine merged 2 commits into main from fix/worker-leak-egress-sandbox-off on Mar 3, 2026

Conversation


@jamiepine jamiepine commented Mar 3, 2026

Summary

  • Move secret-pattern hard-blocking to channel egress (reply + plaintext fallback) so worker/internal tool output no longer kills long-running jobs.
  • In sandbox passthrough mode, preserve parent HOME and allow shell/exec working directories outside the workspace when sandbox is disabled.
  • Fix stuck running workers by handling worker-task panics as failed completions, reconciling orphaned running rows to failed on startup, and allowing API cancel to close detached runs after restart.


coderabbitai bot commented Mar 3, 2026

Walkthrough

This PR changes sandbox HOME handling (workspace vs parent HOME), shifts leak detection to channel-egress enforcement with redaction and non-fatal worker output handling, makes workspace-boundary enforcement for working directories conditional on sandbox mode, and adds worker-run reconciliation/cancellation and status removal logic.

Changes

Cohort / File(s) Summary
Documentation
README.md, docs/content/docs/(configuration)/config.mdx, docs/content/docs/(configuration)/sandbox.mdx, docs/content/docs/(features)/workers.mdx
Update docs to describe mode-dependent HOME behavior, channel-egress leak handling (blocking outbound text), output scrubbing, worker secret management, and durable tools/bin note.
Channel & Fallback Handling
src/agent/channel.rs, src/agent/channel_dispatch.rs
Adds runtime leak scans in plaintext fallback paths and wraps worker futures with unwind (panic) handling; preserves scrubbing but logs leak detections and reports panics as worker failures.
Hook / Reply Enforcement
src/hooks/spacebot.rs, src/tools/reply.rs
Restricts strict leak enforcement to channel egress for the reply tool (on call/result); blocks reply content on detected secrets and returns explicit errors/reasons.
Worker Runtime Behavior
src/opencode/worker.rs, src/agent/status.rs
Downgrades fatal leak termination to warnings for worker outputs; adds StatusBlock::remove_worker to remove active workers by ID for cancellation cleanup.
Sandbox / Environment
src/sandbox.rs
In passthrough mode, derive HOME from existing parent HOME when present, otherwise fall back to workspace; preserves other env sanitization behaviors.
Tool Execution & Shell
src/tools/exec.rs, src/tools/shell.rs
Resolve relative working_dir against workspace, canonicalize, and only enforce workspace-boundary restrictions when sandbox mode is enabled (absolute paths allowed otherwise).
API & Worker Cancellation
src/api/channels.rs
Two-stage cancel: try live channel_state cancellation, then fall back to iterating agent pools to cancel detached runs; returns detailed messages and error handling for detached cancellations.
Process Run Persistence / Reconciliation
src/conversation/history.rs, src/main.rs
Adds ProcessRunLogger methods to reconcile or cancel running worker runs and invokes reconciliation at per-agent startup to mark orphaned workers failed.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks — ✅ 3 passed
  • Title check — ✅ Passed: The title clearly and specifically describes the main change: moving secret scans from worker termination to channel egress to prevent internal work disruption.
  • Description check — ✅ Passed: The description is directly related to the changeset, outlining the three primary objectives: secret-pattern blocking at channel egress, sandbox passthrough mode changes, and worker lifecycle handling.
  • Docstring coverage — ✅ Passed: Docstring coverage is 100.00%, above the required threshold of 80.00%.


@jamiepine jamiepine marked this pull request as ready for review March 3, 2026 09:32

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
src/tools/exec.rs (1)

162-192: Extract shared working_dir resolution to a helper.

Lines 162–192 duplicate the resolution/boundary logic used in src/tools/shell.rs (lines 105–135). Centralizing this in one utility will reduce the risk of policy drift.

As per coding guidelines, "Do not abbreviate variable names; use queue not q, message not msg, channel not ch; common abbreviations like config are acceptable."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/tools/exec.rs` around lines 162 - 192, Extract the duplicated working_dir
resolution and workspace-boundary check into a shared helper named
resolve_working_dir that takes (workspace: &Path, sandbox: &Sandbox,
working_dir_option: Option<&str>) -> Result<PathBuf, ExecError> (or a small
shared error type) and replace the inline logic in the Exec method (where
args.working_dir, self.workspace, and self.sandbox.mode_enabled() are used) and
the identical block in the shell code to call this helper; ensure the helper
resolves absolute vs relative paths, canonicalizes safely (falling back to the
original resolved path), enforces the workspace canonical start-with check when
sandbox.mode_enabled() is true, returns the canonical PathBuf on success, and
use full, non-abbreviated names like resolve_working_dir, workspace_canonical,
and working_dir_option per naming guidelines.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 301bab5 and d6d8322.

📒 Files selected for processing (11)
  • README.md
  • docs/content/docs/(configuration)/config.mdx
  • docs/content/docs/(configuration)/sandbox.mdx
  • docs/content/docs/(features)/workers.mdx
  • src/agent/channel.rs
  • src/hooks/spacebot.rs
  • src/opencode/worker.rs
  • src/sandbox.rs
  • src/tools/exec.rs
  • src/tools/reply.rs
  • src/tools/shell.rs

The agent's data directory (databases, config) is explicitly re-mounted read-only even if it would otherwise be writable due to path overlap.

-Regardless of mode, all worker subprocesses start with a clean environment. Only `PATH` (with `tools/bin` prepended), safe variables (`HOME`, `USER`, `LANG`, `TERM`, `TMPDIR`), and any `passthrough_env` entries are injected. Use `passthrough_env` with least privilege: only forward variables required by worker tools (for example specific credentials set in Docker Compose or systemd), and avoid forwarding broad or highly sensitive credentials.
+Regardless of mode, all worker subprocesses start with a clean environment. Only `PATH` (with `tools/bin` prepended), safe variables (`HOME`, `USER`, `LANG`, `TERM`, `TMPDIR`), and any `passthrough_env` entries are injected. `HOME` is mode-dependent: workspace path when sandboxed, parent `HOME` when passthrough is enabled. Use `passthrough_env` with least privilege: only forward variables required by worker tools (for example specific credentials set in Docker Compose or systemd), and avoid forwarding broad or highly sensitive credentials.
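The clean-environment behavior these docs describe can be sketched with `std::process::Command` alone. This is a sketch, not the project's implementation: the `LEAKY_TOKEN` name and `/workspace` home are illustrative, and a POSIX `sh` at the usual locations is assumed.

```rust
use std::process::Command;

// Spawn a child with a cleared environment plus an explicit allowlist,
// mirroring the "clean environment + safe variables" model described above.
// Anything inherited from the parent (e.g. a LEAKY_TOKEN credential) is dropped.
fn spawn_clean(home: &str) -> String {
    let output = Command::new("sh")
        .arg("-c")
        .arg(r#"echo "${HOME:-unset} ${LEAKY_TOKEN:-unset}""#)
        .env_clear()                      // drop the entire inherited environment
        .env("HOME", home)                // re-inject only the allowlisted vars
        .env("PATH", "/usr/bin:/bin")
        .output()
        .expect("failed to spawn sh");
    String::from_utf8_lossy(&output.stdout).trim().to_string()
}

fn main() {
    // HOME is whatever we injected; LEAKY_TOKEN never reaches the child.
    assert_eq!(spawn_clean("/workspace"), "/workspace unset");
}
```

Swapping the `home` argument between a workspace path and the parent's `HOME` is exactly the mode-dependent choice the doc diff above documents.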

⚠️ Potential issue | 🟡 Minor

Environment list omits tool secret injection.

Line 574 says only PATH, safe vars, and passthrough_env are injected, but runtime also injects tool-category secrets from the secret store. Please include that to avoid operator confusion.

Suggested doc tweak
-Regardless of mode, all worker subprocesses start with a clean environment. Only `PATH` (with `tools/bin` prepended), safe variables (`HOME`, `USER`, `LANG`, `TERM`, `TMPDIR`), and any `passthrough_env` entries are injected.
+Regardless of mode, all worker subprocesses start with a clean environment. Only `PATH` (with `tools/bin` prepended), safe variables (`HOME`, `USER`, `LANG`, `TERM`, `TMPDIR`), tool-category secrets from the secret store, and any `passthrough_env` entries are injected.
📝 Committable suggestion


Suggested change
-Regardless of mode, all worker subprocesses start with a clean environment. Only `PATH` (with `tools/bin` prepended), safe variables (`HOME`, `USER`, `LANG`, `TERM`, `TMPDIR`), and any `passthrough_env` entries are injected. `HOME` is mode-dependent: workspace path when sandboxed, parent `HOME` when passthrough is enabled. Use `passthrough_env` with least privilege: only forward variables required by worker tools (for example specific credentials set in Docker Compose or systemd), and avoid forwarding broad or highly sensitive credentials.
+Regardless of mode, all worker subprocesses start with a clean environment. Only `PATH` (with `tools/bin` prepended), safe variables (`HOME`, `USER`, `LANG`, `TERM`, `TMPDIR`), tool-category secrets from the secret store, and any `passthrough_env` entries are injected. `HOME` is mode-dependent: workspace path when sandboxed, parent `HOME` when passthrough is enabled. Use `passthrough_env` with least privilege: only forward variables required by worker tools (for example specific credentials set in Docker Compose or systemd), and avoid forwarding broad or highly sensitive credentials.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/content/docs/`(configuration)/config.mdx at line 574, Update the
sentence describing injected environment variables to also mention that
tool-category secrets from the secret store are injected into worker runtimes;
modify the paragraph that currently lists `PATH`, safe vars, and
`passthrough_env` to explicitly include "tool-category secrets (from the secret
store)" and clarify whether those secrets are always injected or conditional,
and add a brief operator guidance to limit passthrough_env and secrets to least
privilege use. Reference the existing identifiers `PATH`, `HOME`,
`passthrough_env`, and "tool-category secrets" so the change clearly augments
the current environment list.

Secrets are split into two categories: **system** (LLM API keys, messaging tokens — never exposed to subprocesses) and **tool** (CLI credentials like `GH_TOKEN` — injected as env vars into workers). The category is auto-assigned based on the secret name, or set explicitly.
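The docs don't spell out the auto-assignment heuristic. Purely as a hypothetical illustration (the marker strings and function name are invented, not from the codebase), a name-based classifier might look like:

```rust
#[derive(Debug, PartialEq)]
enum SecretCategory {
    System, // never exposed to subprocesses (LLM API keys, messaging tokens)
    Tool,   // injected as env vars into workers (CLI credentials)
}

// Hypothetical heuristic: names matching system-style markers are treated
// as system secrets; everything else defaults to the tool category.
fn classify(name: &str) -> SecretCategory {
    const SYSTEM_MARKERS: [&str; 3] = ["API_KEY", "BOT_TOKEN", "SIGNING"];
    if SYSTEM_MARKERS.iter().any(|marker| name.contains(marker)) {
        SecretCategory::System
    } else {
        SecretCategory::Tool
    }
}

fn main() {
    assert_eq!(classify("ANTHROPIC_API_KEY"), SecretCategory::System);
    assert_eq!(classify("GH_TOKEN"), SecretCategory::Tool);
}
```

An explicit per-secret override, as the text mentions, would simply take precedence over this heuristic.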

-- **Environment sanitization** — every subprocess starts with a clean environment (`--clearenv` on Linux, `env_clear()` everywhere else). Only safe baseline vars (`PATH`, `HOME`, `LANG`), tool-category secrets, and explicit `passthrough_env` entries are present. System secrets never enter any subprocess
+- **Environment sanitization** — every subprocess starts with a clean environment (`--clearenv` on Linux, `env_clear()` everywhere else). Only safe baseline vars (`PATH`, `HOME`, `LANG`), tool-category secrets, and explicit `passthrough_env` entries are present. In sandbox mode, `HOME` is set to the workspace; in passthrough mode, `HOME` uses the parent environment. System secrets never enter any subprocess

⚠️ Potential issue | 🟡 Minor

Safe-variable list is currently incomplete.

Line 251 lists only PATH, HOME, and LANG, but current runtime also injects USER, TERM, and TMPDIR. Please align the README list to avoid ambiguity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` at line 251, Update the README's "Environment sanitization"
safe-variable list to match runtime behavior by adding USER, TERM, and TMPDIR
alongside PATH, HOME, and LANG; edit the sentence that currently names only
PATH, HOME, and LANG so it enumerates PATH, HOME, LANG, USER, TERM, and TMPDIR
and clarifies which of those are set in sandbox vs passthrough modes (HOME
behavior remains as described).

Comment on lines +151 to 163
         if self.process_type == ProcessType::Channel
             && tool_name == "reply"
             && let Some(leak) = self.scan_for_leaks(result)
         {
             tracing::error!(
                 process_id = %self.process_id,
                 tool_name = %tool_name,
                 leak_prefix = %&leak[..leak.len().min(8)],
-                "secret leak detected in tool output, terminating agent"
+                "secret leak detected in reply result, terminating channel turn"
             );
             return HookAction::Terminate {
-                reason: "Tool output contained a secret. Agent terminated to prevent exfiltration."
-                    .into(),
+                reason: "Reply contained a secret. Channel turn terminated.".into(),
             };

⚠️ Potential issue | 🟠 Major

Leak-termination path skips metrics timer cleanup.

Line 151–163 returns early before removing _internal_call_id from TOOL_CALL_TIMERS, so repeated blocked replies can accumulate stale timer entries when metrics are enabled.

Suggested fix
         if self.process_type == ProcessType::Channel
             && tool_name == "reply"
             && let Some(leak) = self.scan_for_leaks(result)
         {
+            #[cfg(feature = "metrics")]
+            if let Ok(mut timers) = TOOL_CALL_TIMERS.lock() {
+                timers.remove(_internal_call_id);
+            }
+
             tracing::error!(
                 process_id = %self.process_id,
                 tool_name = %tool_name,
                 leak_prefix = %&leak[..leak.len().min(8)],
                 "secret leak detected in reply result, terminating channel turn"
             );
             return HookAction::Terminate {
                 reason: "Reply contained a secret. Channel turn terminated.".into(),
             };
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/hooks/spacebot.rs` around lines 151 - 163, The early return in the
leak-detection branch inside the Channel reply path skips cleanup of the
per-call timer; before returning the HookAction::Terminate (in the block
checking self.process_type == ProcessType::Channel && tool_name == "reply" &&
let Some(leak) = self.scan_for_leaks(result)), remove the `_internal_call_id`
entry from TOOL_CALL_TIMERS (or otherwise stop/clear the timer) so metrics don't
leak, then log and return; you can call the same cleanup logic used elsewhere
for removing `_internal_call_id` to keep behavior consistent with non-leak
paths.

} else {
self.workspace.join(raw_path)
};
let canonical = resolved.canonicalize().unwrap_or(resolved);

If canonicalize() fails, resolved can still contain .. (e.g. workspace/../outside), and the starts_with(workspace_canonical) check will pass. In sandbox mode that looks like a workspace-boundary bypass. Consider requiring a successful canonicalization (or lexically normalizing) before the boundary check.

Suggested change
-        let canonical = resolved.canonicalize().unwrap_or(resolved);
+        let canonicalized = resolved.canonicalize();
+        let canonical = if self.sandbox.mode_enabled() {
+            canonicalized.map_err(|_| ShellError {
+                message: format!(
+                    "working_dir must exist and be within the workspace ({}).",
+                    self.workspace.display()
+                ),
+                exit_code: -1,
+            })?
+        } else {
+            canonicalized.unwrap_or(resolved)
+        };
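The bypass both reviewers describe is easy to reproduce: when the target path does not exist, `canonicalize()` fails, the `unwrap_or` fallback keeps the lexical join, and the `starts_with` boundary check passes despite the `..` escape. A self-contained demonstration (the `ws_demo` directory name is arbitrary):

```rust
use std::fs;
use std::path::{Path, PathBuf};

// Resolve a working_dir the way the reviewed code does: join onto the
// workspace, then canonicalize with a lexical fallback on failure.
fn resolve(workspace: &Path, raw_path: &str) -> PathBuf {
    let resolved = workspace.join(raw_path);
    resolved.canonicalize().unwrap_or(resolved)
}

fn main() -> std::io::Result<()> {
    let workspace = std::env::temp_dir().join("ws_demo");
    fs::create_dir_all(&workspace)?;
    // "missing" does not exist, so canonicalize() errors and the raw join
    // survives with its `..` components uncollapsed.
    let escaped = resolve(&workspace, "missing/../../outside");
    // The component-wise starts_with check still passes: a boundary bypass.
    assert!(escaped.starts_with(&workspace));
    assert!(escaped.components().any(|c| c.as_os_str() == ".."));
    Ok(())
}
```

This is why the suggested fixes above require a successful canonicalization before the boundary check whenever sandbox mode is enabled.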

} else {
self.workspace.join(raw_path)
};
let canonical = resolved.canonicalize().unwrap_or(resolved);

Same canonicalize().unwrap_or(resolved) issue here: when canonicalization fails, a relative working_dir containing .. can still pass the starts_with(workspace) check in sandbox mode.

Suggested change
-        let canonical = resolved.canonicalize().unwrap_or(resolved);
+        let canonicalized = resolved.canonicalize();
+        let canonical = if self.sandbox.mode_enabled() {
+            canonicalized.map_err(|_| ExecError {
+                message: format!(
+                    "working_dir must exist and be within the workspace ({}).",
+                    self.workspace.display()
+                ),
+                exit_code: -1,
+            })?
+        } else {
+            canonicalized.unwrap_or(resolved)
+        };

"blocked retrigger fallback output containing structured or tool syntax"
);
} else if let Some(leak) = crate::secrets::scrub::scan_for_leaks(text) {
tracing::warn!(

scan_for_leaks() returns the matched secret substring (or "url-encoded: {matched}", etc). Logging leak_prefix leaks part of the secret into logs, and the [..] slice can also panic on non-ASCII boundaries. Maybe log only the encoding kind/length instead (same pattern repeats in the other fallback blocks below).

Suggested change
-            tracing::warn!(
+        } else if let Some(leak) = crate::secrets::scrub::scan_for_leaks(text) {
+            let leak_kind = leak
+                .split_once(": ")
+                .map(|(kind, _)| kind)
+                .unwrap_or("plaintext");
+            tracing::warn!(
+                channel_id = %self.id,
+                leak_kind,
+                "blocked retrigger fallback output matching secret pattern"
+            );
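The suggested `split_once`-based extraction, plus the non-ASCII slicing hazard the comment mentions, in a runnable form (the input strings are illustrative):

```rust
// Log only the detector kind, never any portion of the matched secret.
// scan_for_leaks-style results are either "kind: matched" or a bare match.
fn leak_kind(leak: &str) -> &str {
    leak.split_once(": ").map(|(kind, _)| kind).unwrap_or("plaintext")
}

fn main() {
    assert_eq!(leak_kind("url-encoded: sk-abc123"), "url-encoded");
    assert_eq!(leak_kind("sk-abc123"), "plaintext");
    // Byte-slicing a prefix like &leak[..8] can panic mid-codepoint:
    // in "sécrete", the 'é' occupies bytes 1..3, so byte 2 is not a boundary.
    assert!(!"sécrete".is_char_boundary(2));
}
```

Logging only the kind also avoids the original `leak_prefix` pattern, which wrote the first bytes of the secret itself into the logs.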


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
src/api/channels.rs (1)

210-241: Add targeted async race tests for this two-stage cancellation path.

Given the new live-then-detached reconciliation flow, please add focused tests for terminal-state races (live success, live miss + detached success, detached error, already-finished worker).

Based on learnings: "For changes in async/stateful paths (worker lifecycle, cancellation, retrigger, recall cache behavior), include explicit race/terminal-state reasoning in the PR summary and run targeted tests in addition to just gate-pr".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/api/channels.rs` around lines 210 - 241, Add targeted async race tests
covering the two-stage cancellation flow exercised by the handler in
channels.rs: simulate the live-path and fallback detached-path behavior by
invoking the same cancellation flow that calls
ProcessRunLogger::cancel_running_worker (and the HTTP cancel endpoint that wraps
it) under concurrent interleavings; include tests for (1) live success (worker
canceled while live), (2) live miss + detached success (no live state then
cancel_running_worker returns true), (3) detached error (cancel_running_worker
returns Err and the handler returns 500), and (4) already-finished worker
(returns not found). Use tokio or the existing test harness to orchestrate races
and assert terminal states and HTTP/status-code outcomes so the reconciliation
logic is explicitly exercised.


📥 Commits

Reviewing files that changed from the base of the PR and between d6d8322 and 6a3f93d.

📒 Files selected for processing (6)
  • src/agent/channel.rs
  • src/agent/channel_dispatch.rs
  • src/agent/status.rs
  • src/api/channels.rs
  • src/conversation/history.rs
  • src/main.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/agent/channel.rs

Comment on lines +531 to +547
Err(panic_payload) => {
let panic_message = panic_payload
.downcast_ref::<&str>()
.map(|message| (*message).to_string())
.or_else(|| panic_payload.downcast_ref::<String>().cloned())
.unwrap_or_else(|| "unknown panic payload".to_string());
tracing::error!(
worker_id = %worker_id,
panic_message = %panic_message,
"worker task panicked"
);
(
format!("Worker failed: worker task panicked: {panic_message}"),
true,
false,
)
}

⚠️ Potential issue | 🟡 Minor

Apply secret scrubbing to panic messages.

The success path scrubs secrets via scrub_with_store() before the result reaches the channel, but the panic path sends panic_message unscrubbed. If a worker panics while processing sensitive data (e.g., panic!("failed at step: {}", data)), the secret could leak through the event.

Given the PR's focus on enforcing secret-pattern blocking at egress, consider scrubbing the panic message consistently with the success path.

🛡️ Proposed fix to scrub panic messages
             Err(panic_payload) => {
                 let panic_message = panic_payload
                     .downcast_ref::<&str>()
                     .map(|message| (*message).to_string())
                     .or_else(|| panic_payload.downcast_ref::<String>().cloned())
                     .unwrap_or_else(|| "unknown panic payload".to_string());
+                let panic_message = if let Some(store) = &secrets_store {
+                    crate::secrets::scrub::scrub_with_store(&panic_message, store)
+                } else {
+                    panic_message
+                };
                 tracing::error!(
                     worker_id = %worker_id,
                     panic_message = %panic_message,
                     "worker task panicked"
                 );

Based on learnings: "Implement leak detection with regex patterns for API keys, tokens, PEM keys; scan in SpacebotHook.on_tool_result() and before outbound HTTP to block exfiltration"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agent/channel_dispatch.rs` around lines 531 - 547, The panic branch in
channel_dispatch.rs returns and logs the raw panic_message, which can leak
secrets; before logging and returning, run the panic_message through the same
scrubber used on success (call scrub_with_store(panic_message, &secret_store) or
the local secret store instance used elsewhere) and use the scrubbed string for
tracing::error and the returned formatted message (i.e., replace uses of
panic_message with the scrubbed result); ensure you reference the same
scrub_with_store function and the same secret store variable used by the success
path so panic output is consistently scrubbed.

Comment on lines +201 to +203
if let Some(channel_state) = channel_state
&& channel_state.cancel_worker(worker_id).await.is_ok()
{

⚠️ Potential issue | 🟠 Major

Don’t suppress live cancellation errors before fallback.

At Line 202, .is_ok() drops the actual cancel_worker error, which makes failure diagnosis and terminal-state handling harder. Please log (or branch on) the error before attempting detached reconciliation.

Proposed fix
-            if let Some(channel_state) = channel_state
-                && channel_state.cancel_worker(worker_id).await.is_ok()
-            {
-                return Ok(Json(CancelProcessResponse {
-                    success: true,
-                    message: format!("Worker {} cancelled", request.process_id),
-                }));
-            }
+            if let Some(channel_state) = channel_state {
+                match channel_state.cancel_worker(worker_id).await {
+                    Ok(()) => {
+                        return Ok(Json(CancelProcessResponse {
+                            success: true,
+                            message: format!("Worker {} cancelled", request.process_id),
+                        }));
+                    }
+                    Err(error) => {
+                        tracing::warn!(
+                            %error,
+                            channel_id = %request.channel_id,
+                            process_id = %request.process_id,
+                            "live worker cancel failed; attempting detached reconciliation"
+                        );
+                    }
+                }
+            }

As per coding guidelines: "Don't silently discard errors; use let _ = only on channel sends where the receiver may be dropped; handle, log, or propagate all other errors".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/api/channels.rs` around lines 201 - 203, The code currently discards the
result of channel_state.cancel_worker(worker_id).await by calling .is_ok(),
which hides errors; change it to match on the Result from
channel_state.cancel_worker(worker_id).await (or bind it to a variable) and if
Err(e) log the error (using the appropriate logger) or handle/propagate it
before falling back to detached reconciliation; ensure the branch still performs
the successful cancellation path when Ok(_) and only proceeds to the fallback
when cancellation truly failed or when channel_state is None.

Comment on lines +229 to +237
Err(error) => {
tracing::warn!(
%error,
channel_id = %request.channel_id,
process_id = %request.process_id,
"failed to cancel detached worker run"
);
return Err(StatusCode::INTERNAL_SERVER_ERROR);
}

⚠️ Potential issue | 🟠 Major

Avoid returning 500 on the first detached-pool error.

At Lines 229-237, a single pool failure short-circuits the scan. That can produce a 500 even when another pool could still cancel the worker. Continue scanning, track whether any pool errored, and decide 500 vs 404 after the loop.

Proposed fix
             let pools = state.agent_pools.load();
+            let mut saw_detached_error = false;
             for (_agent_id, pool) in pools.iter() {
                 let logger = ProcessRunLogger::new(pool.clone());
                 match logger
                     .cancel_running_worker(&request.channel_id, worker_id)
                     .await
                 {
                     Ok(true) => {
                         return Ok(Json(CancelProcessResponse {
                             success: true,
                             message: format!(
                                 "Worker {} cancelled (detached run reconciled)",
                                 request.process_id
                             ),
                         }));
                     }
                     Ok(false) => {}
                     Err(error) => {
+                        saw_detached_error = true;
                         tracing::warn!(
                             %error,
                             channel_id = %request.channel_id,
                             process_id = %request.process_id,
                             "failed to cancel detached worker run"
                         );
-                        return Err(StatusCode::INTERNAL_SERVER_ERROR);
                     }
                 }
             }
 
+            if saw_detached_error {
+                return Err(StatusCode::INTERNAL_SERVER_ERROR);
+            }
             Err(StatusCode::NOT_FOUND)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/api/channels.rs` around lines 229 - 237, The current Err(error) arm in
the pool-scan loop (the block that logs via tracing::warn with %error,
channel_id and process_id and returns StatusCode::INTERNAL_SERVER_ERROR)
short-circuits the scan and can produce a 500 even if another pool might
succeed; change this so the loop continues on errors: introduce two boolean
flags (e.g., any_errored and any_cancelled), in the Err(error) branch set
any_errored = true and log the error but do not return, in the Ok branch set
any_cancelled = true (and optionally break/return success if you want
first-success behavior), and after the loop decide the response: if
any_cancelled return success, else if any_errored return
StatusCode::INTERNAL_SERVER_ERROR, otherwise return StatusCode::NOT_FOUND;
update the Err(error) arm that currently returns immediately (and references
tracing::warn and StatusCode::INTERNAL_SERVER_ERROR) accordingly.
