fix: stop worker secret scans from killing internal work (#297)
Conversation
Walkthrough
This PR changes sandbox HOME handling (workspace vs parent HOME), shifts leak detection to channel-egress enforcement with redaction and non-fatal worker output handling, makes workspace-boundary enforcement for working directories conditional on sandbox mode, and adds worker-run reconciliation/cancellation and status-removal logic.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 3
🧹 Nitpick comments (1)
src/tools/exec.rs (1)
162-192: Extract shared `working_dir` resolution to a helper.

Lines 162–192 duplicate the same resolution/boundary logic used in `src/tools/shell.rs` (lines 105–135). Centralizing this in one utility will reduce the risk of policy drift. As per coding guidelines, "Do not abbreviate variable names; use `queue` not `q`, `message` not `msg`, `channel` not `ch`; common abbreviations like `config` are acceptable."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tools/exec.rs` around lines 162 - 192, Extract the duplicated working_dir resolution and workspace-boundary check into a shared helper named resolve_working_dir that takes (workspace: &Path, sandbox: &Sandbox, working_dir_option: Option<&str>) -> Result<PathBuf, ExecError> (or a small shared error type) and replace the inline logic in the Exec method (where args.working_dir, self.workspace, and self.sandbox.mode_enabled() are used) and the identical block in the shell code to call this helper; ensure the helper resolves absolute vs relative paths, canonicalizes safely (falling back to the original resolved path), enforces the workspace canonical start-with check when sandbox.mode_enabled() is true, returns the canonical PathBuf on success, and use full, non-abbreviated names like resolve_working_dir, workspace_canonical, and working_dir_option per naming guidelines.
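The helper the prompt describes could be sketched roughly as follows. This is a hedged sketch: the function name, the `WorkingDirError` type, and the boolean `sandbox_enabled` parameter are all hypothetical stand-ins for the project's real tool error types and `Sandbox::mode_enabled()`.

```rust
use std::path::{Path, PathBuf};

// Hypothetical error type; the real helper would use the exec/shell error types.
#[derive(Debug)]
struct WorkingDirError(String);

fn resolve_working_dir(
    workspace: &Path,
    sandbox_enabled: bool,
    working_dir_option: Option<&str>,
) -> Result<PathBuf, WorkingDirError> {
    let raw_path = match working_dir_option {
        Some(raw) => raw,
        None => return Ok(workspace.to_path_buf()),
    };
    let resolved = if Path::new(raw_path).is_absolute() {
        PathBuf::from(raw_path)
    } else {
        workspace.join(raw_path)
    };
    // Outside sandbox mode a failed canonicalization falls back to the raw
    // resolved path; inside sandbox mode it is rejected outright, matching
    // the review's suggested hardening.
    let canonical = match resolved.canonicalize() {
        Ok(path) => path,
        Err(_) if !sandbox_enabled => resolved,
        Err(_) => {
            return Err(WorkingDirError(format!(
                "working_dir must exist and be within the workspace ({})",
                workspace.display()
            )))
        }
    };
    if sandbox_enabled {
        let workspace_canonical = workspace
            .canonicalize()
            .unwrap_or_else(|_| workspace.to_path_buf());
        if !canonical.starts_with(&workspace_canonical) {
            return Err(WorkingDirError("working_dir escapes the workspace".into()));
        }
    }
    Ok(canonical)
}

fn main() {
    let workspace = std::env::temp_dir();
    // No working_dir -> the workspace itself.
    assert_eq!(resolve_working_dir(&workspace, true, None).unwrap(), workspace);
    // A nonexistent directory is rejected in sandbox mode but tolerated otherwise.
    assert!(resolve_working_dir(&workspace, true, Some("does-not-exist-xyz")).is_err());
    assert!(resolve_working_dir(&workspace, false, Some("does-not-exist-xyz")).is_ok());
    println!("ok");
}
```

Both call sites in `exec.rs` and `shell.rs` would then reduce to a single call plus error mapping.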
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- README.md
- docs/content/docs/(configuration)/config.mdx
- docs/content/docs/(configuration)/sandbox.mdx
- docs/content/docs/(features)/workers.mdx
- src/agent/channel.rs
- src/hooks/spacebot.rs
- src/opencode/worker.rs
- src/sandbox.rs
- src/tools/exec.rs
- src/tools/reply.rs
- src/tools/shell.rs
The agent's data directory (databases, config) is explicitly re-mounted read-only even if it would otherwise be writable due to path overlap.

- Regardless of mode, all worker subprocesses start with a clean environment. Only `PATH` (with `tools/bin` prepended), safe variables (`HOME`, `USER`, `LANG`, `TERM`, `TMPDIR`), and any `passthrough_env` entries are injected. Use `passthrough_env` with least privilege: only forward variables required by worker tools (for example specific credentials set in Docker Compose or systemd), and avoid forwarding broad or highly sensitive credentials.
+ Regardless of mode, all worker subprocesses start with a clean environment. Only `PATH` (with `tools/bin` prepended), safe variables (`HOME`, `USER`, `LANG`, `TERM`, `TMPDIR`), and any `passthrough_env` entries are injected. `HOME` is mode-dependent: workspace path when sandboxed, parent `HOME` when passthrough is enabled. Use `passthrough_env` with least privilege: only forward variables required by worker tools (for example specific credentials set in Docker Compose or systemd), and avoid forwarding broad or highly sensitive credentials.
Environment list omits tool secret injection.
Line 574 says only PATH, safe vars, and passthrough_env are injected, but runtime also injects tool-category secrets from the secret store. Please include that to avoid operator confusion.
Suggested doc tweak
-Regardless of mode, all worker subprocesses start with a clean environment. Only `PATH` (with `tools/bin` prepended), safe variables (`HOME`, `USER`, `LANG`, `TERM`, `TMPDIR`), and any `passthrough_env` entries are injected.
+Regardless of mode, all worker subprocesses start with a clean environment. Only `PATH` (with `tools/bin` prepended), safe variables (`HOME`, `USER`, `LANG`, `TERM`, `TMPDIR`), tool-category secrets from the secret store, and any `passthrough_env` entries are injected.

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Current:
Regardless of mode, all worker subprocesses start with a clean environment. Only `PATH` (with `tools/bin` prepended), safe variables (`HOME`, `USER`, `LANG`, `TERM`, `TMPDIR`), and any `passthrough_env` entries are injected. `HOME` is mode-dependent: workspace path when sandboxed, parent `HOME` when passthrough is enabled. Use `passthrough_env` with least privilege: only forward variables required by worker tools (for example specific credentials set in Docker Compose or systemd), and avoid forwarding broad or highly sensitive credentials.

Suggested:
Regardless of mode, all worker subprocesses start with a clean environment. Only `PATH` (with `tools/bin` prepended), safe variables (`HOME`, `USER`, `LANG`, `TERM`, `TMPDIR`), tool-category secrets from the secret store, and any `passthrough_env` entries are injected. `HOME` is mode-dependent: workspace path when sandboxed, parent `HOME` when passthrough is enabled. Use `passthrough_env` with least privilege: only forward variables required by worker tools (for example specific credentials set in Docker Compose or systemd), and avoid forwarding broad or highly sensitive credentials.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/content/docs/`(configuration)/config.mdx at line 574, Update the
sentence describing injected environment variables to also mention that
tool-category secrets from the secret store are injected into worker runtimes;
modify the paragraph that currently lists `PATH`, safe vars, and
`passthrough_env` to explicitly include "tool-category secrets (from the secret
store)" and clarify whether those secrets are always injected or conditional,
and add a brief operator guidance to limit passthrough_env and secrets to least
privilege use. Reference the existing identifiers `PATH`, `HOME`,
`passthrough_env`, and "tool-category secrets" so the change clearly augments
the current environment list.
Secrets are split into two categories: **system** (LLM API keys, messaging tokens — never exposed to subprocesses) and **tool** (CLI credentials like `GH_TOKEN` — injected as env vars into workers). The category is auto-assigned based on the secret name, or set explicitly.

- **Environment sanitization** — every subprocess starts with a clean environment (`--clearenv` on Linux, `env_clear()` everywhere else). Only safe baseline vars (`PATH`, `HOME`, `LANG`), tool-category secrets, and explicit `passthrough_env` entries are present. System secrets never enter any subprocess
+ **Environment sanitization** — every subprocess starts with a clean environment (`--clearenv` on Linux, `env_clear()` everywhere else). Only safe baseline vars (`PATH`, `HOME`, `LANG`), tool-category secrets, and explicit `passthrough_env` entries are present. In sandbox mode, `HOME` is set to the workspace; in passthrough mode, `HOME` uses the parent environment. System secrets never enter any subprocess
Safe-variable list is currently incomplete.
Line 251 lists only PATH, HOME, and LANG, but current runtime also injects USER, TERM, and TMPDIR. Please align the README list to avoid ambiguity.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@README.md` at line 251, Update the README's "Environment sanitization"
safe-variable list to match runtime behavior by adding USER, TERM, and TMPDIR
alongside PATH, HOME, and LANG; edit the sentence that currently names only
PATH, HOME, and LANG so it enumerates PATH, HOME, LANG, USER, TERM, and TMPDIR
and clarifies which of those are set in sandbox vs passthrough modes (HOME
behavior remains as described).
if self.process_type == ProcessType::Channel
    && tool_name == "reply"
    && let Some(leak) = self.scan_for_leaks(result)
{
    tracing::error!(
        process_id = %self.process_id,
        tool_name = %tool_name,
        leak_prefix = %&leak[..leak.len().min(8)],
-       "secret leak detected in tool output, terminating agent"
+       "secret leak detected in reply result, terminating channel turn"
    );
    return HookAction::Terminate {
-       reason: "Tool output contained a secret. Agent terminated to prevent exfiltration."
-           .into(),
+       reason: "Reply contained a secret. Channel turn terminated.".into(),
    };
Leak-termination path skips metrics timer cleanup.
Line 151–163 returns early before removing _internal_call_id from TOOL_CALL_TIMERS, so repeated blocked replies can accumulate stale timer entries when metrics are enabled.
Suggested fix
if self.process_type == ProcessType::Channel
&& tool_name == "reply"
&& let Some(leak) = self.scan_for_leaks(result)
{
+ #[cfg(feature = "metrics")]
+ if let Ok(mut timers) = TOOL_CALL_TIMERS.lock() {
+ timers.remove(_internal_call_id);
+ }
+
tracing::error!(
process_id = %self.process_id,
tool_name = %tool_name,
leak_prefix = %&leak[..leak.len().min(8)],
"secret leak detected in reply result, terminating channel turn"
);
return HookAction::Terminate {
reason: "Reply contained a secret. Channel turn terminated.".into(),
};
}

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/hooks/spacebot.rs` around lines 151 - 163, The early return in the
leak-detection branch inside the Channel reply path skips cleanup of the
per-call timer; before returning the HookAction::Terminate (in the block
checking self.process_type == ProcessType::Channel && tool_name == "reply" &&
let Some(leak) = self.scan_for_leaks(result)), remove the `_internal_call_id`
entry from TOOL_CALL_TIMERS (or otherwise stop/clear the timer) so metrics don't
leak, then log and return; you can call the same cleanup logic used elsewhere
for removing `_internal_call_id` to keep behavior consistent with non-leak
paths.
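Beyond patching this one early return, an RAII guard removes the timer entry on every exit path, so no future early return can leak it again. A minimal sketch — the static's name echoes the review, but the map shape, guard type, and everything else here are hypothetical:

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::Instant;

// Hypothetical stand-in for the real TOOL_CALL_TIMERS registry.
static TOOL_CALL_TIMERS: Mutex<Option<HashMap<String, Instant>>> = Mutex::new(None);

// Dropping the guard removes the entry, whether the hook returns normally,
// early-returns on a detected leak, or unwinds.
struct TimerGuard {
    call_id: String,
}

impl Drop for TimerGuard {
    fn drop(&mut self) {
        if let Ok(mut timers) = TOOL_CALL_TIMERS.lock() {
            if let Some(map) = timers.as_mut() {
                map.remove(&self.call_id);
            }
        }
    }
}

fn main() {
    {
        let mut timers = TOOL_CALL_TIMERS.lock().unwrap();
        timers
            .get_or_insert_with(HashMap::new)
            .insert("call-1".into(), Instant::now());
    }
    {
        let _guard = TimerGuard { call_id: "call-1".into() };
        // An early return anywhere in this scope still runs Drop.
    }
    let timers = TOOL_CALL_TIMERS.lock().unwrap();
    assert!(timers.as_ref().map_or(true, |m| !m.contains_key("call-1")));
    println!("timer entry cleaned up");
}
```

The trade-off versus the suggested inline fix is a small amount of extra machinery in exchange for not having to remember the cleanup in each branch.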
} else {
    self.workspace.join(raw_path)
};
let canonical = resolved.canonicalize().unwrap_or(resolved);
If canonicalize() fails, resolved can still contain .. (e.g. workspace/../outside), and the starts_with(workspace_canonical) check will pass. In sandbox mode that looks like a workspace-boundary bypass. Consider requiring a successful canonicalization (or lexically normalizing) before the boundary check.
Current:
let canonical = resolved.canonicalize().unwrap_or(resolved);

Suggested:
let canonicalized = resolved.canonicalize();
let canonical = if self.sandbox.mode_enabled() {
    canonicalized.map_err(|_| ShellError {
        message: format!(
            "working_dir must exist and be within the workspace ({}).",
            self.workspace.display()
        ),
        exit_code: -1,
    })?
} else {
    canonicalized.unwrap_or(resolved)
};
} else {
    self.workspace.join(raw_path)
};
let canonical = resolved.canonicalize().unwrap_or(resolved);
Same canonicalize().unwrap_or(resolved) issue here: when canonicalization fails, a relative working_dir containing .. can still pass the starts_with(workspace) check in sandbox mode.
Current:
let canonical = resolved.canonicalize().unwrap_or(resolved);

Suggested:
let canonicalized = resolved.canonicalize();
let canonical = if self.sandbox.mode_enabled() {
    canonicalized.map_err(|_| ExecError {
        message: format!(
            "working_dir must exist and be within the workspace ({}).",
            self.workspace.display()
        ),
        exit_code: -1,
    })?
} else {
    canonicalized.unwrap_or(resolved)
};
    "blocked retrigger fallback output containing structured or tool syntax"
);
} else if let Some(leak) = crate::secrets::scrub::scan_for_leaks(text) {
    tracing::warn!(
scan_for_leaks() returns the matched secret substring (or "url-encoded: {matched}", etc). Logging leak_prefix leaks part of the secret into logs, and the [..] slice can also panic on non-ASCII boundaries. Maybe log only the encoding kind/length instead (same pattern repeats in the other fallback blocks below).
Suggested:
} else if let Some(leak) = crate::secrets::scrub::scan_for_leaks(text) {
    let leak_kind = leak
        .split_once(": ")
        .map(|(kind, _)| kind)
        .unwrap_or("plaintext");
    tracing::warn!(
        channel_id = %self.id,
        leak_kind,
        "blocked retrigger fallback output matching secret pattern"
    );
Actionable comments posted: 3
🧹 Nitpick comments (1)
src/api/channels.rs (1)
210-241: Add targeted async race tests for this two-stage cancellation path.

Given the new live-then-detached reconciliation flow, please add focused tests for terminal-state races (live success, live miss + detached success, detached error, already-finished worker). Based on learnings: "For changes in async/stateful paths (worker lifecycle, cancellation, retrigger, recall cache behavior), include explicit race/terminal-state reasoning in the PR summary and run targeted tests in addition to `just gate-pr`".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/api/channels.rs` around lines 210 - 241, Add targeted async race tests covering the two-stage cancellation flow exercised by the handler in channels.rs: simulate the live-path and fallback detached-path behavior by invoking the same cancellation flow that calls ProcessRunLogger::cancel_running_worker (and the HTTP cancel endpoint that wraps it) under concurrent interleavings; include tests for (1) live success (worker canceled while live), (2) live miss + detached success (no live state then cancel_running_worker returns true), (3) detached error (cancel_running_worker returns Err and the handler returns 500), and (4) already-finished worker (returns not found). Use tokio or the existing test harness to orchestrate races and assert terminal states and HTTP/status-code outcomes so the reconciliation logic is explicitly exercised.
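Project harness aside, the core terminal-state property such race tests assert — exactly one of "finished" and "cancelled" wins, and the loser observes a terminal state — can be sketched with plain std threads. The state machine below is a hypothetical stand-in, not the real worker types:

```rust
use std::sync::atomic::{AtomicU8, Ordering};
use std::sync::Arc;
use std::thread;

const RUNNING: u8 = 0;
const FINISHED: u8 = 1;
const CANCELLED: u8 = 2;

fn main() {
    // Repeat to exercise different interleavings.
    for _ in 0..100 {
        let state = Arc::new(AtomicU8::new(RUNNING));
        let finisher = {
            let state = Arc::clone(&state);
            thread::spawn(move || {
                // Worker completion: RUNNING -> FINISHED, first writer wins.
                state
                    .compare_exchange(RUNNING, FINISHED, Ordering::SeqCst, Ordering::SeqCst)
                    .is_ok()
            })
        };
        let canceller = {
            let state = Arc::clone(&state);
            thread::spawn(move || {
                // Cancellation: RUNNING -> CANCELLED, first writer wins.
                state
                    .compare_exchange(RUNNING, CANCELLED, Ordering::SeqCst, Ordering::SeqCst)
                    .is_ok()
            })
        };
        let finished = finisher.join().unwrap();
        let cancelled = canceller.join().unwrap();
        // Exactly one transition wins each iteration.
        assert!(finished ^ cancelled);
    }
    println!("terminal-state race property held");
}
```

The real tests would drive `cancel_worker` and `cancel_running_worker` through the same kind of interleavings and assert the HTTP outcomes listed in the prompt.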
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- src/agent/channel.rs
- src/agent/channel_dispatch.rs
- src/agent/status.rs
- src/api/channels.rs
- src/conversation/history.rs
- src/main.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- src/agent/channel.rs
Err(panic_payload) => {
    let panic_message = panic_payload
        .downcast_ref::<&str>()
        .map(|message| (*message).to_string())
        .or_else(|| panic_payload.downcast_ref::<String>().cloned())
        .unwrap_or_else(|| "unknown panic payload".to_string());
    tracing::error!(
        worker_id = %worker_id,
        panic_message = %panic_message,
        "worker task panicked"
    );
    (
        format!("Worker failed: worker task panicked: {panic_message}"),
        true,
        false,
    )
}
Apply secret scrubbing to panic messages.
The success path scrubs secrets via scrub_with_store() before the result reaches the channel, but the panic path sends panic_message unscrubbed. If a worker panics while processing sensitive data (e.g., panic!("failed at step: {}", data)), the secret could leak through the event.
Given the PR's focus on enforcing secret-pattern blocking at egress, consider scrubbing the panic message consistently with the success path.
🛡️ Proposed fix to scrub panic messages
Err(panic_payload) => {
let panic_message = panic_payload
.downcast_ref::<&str>()
.map(|message| (*message).to_string())
.or_else(|| panic_payload.downcast_ref::<String>().cloned())
.unwrap_or_else(|| "unknown panic payload".to_string());
+ let panic_message = if let Some(store) = &secrets_store {
+ crate::secrets::scrub::scrub_with_store(&panic_message, store)
+ } else {
+ panic_message
+ };
tracing::error!(
worker_id = %worker_id,
panic_message = %panic_message,
"worker task panicked"
    );

Based on learnings: "Implement leak detection with regex patterns for API keys, tokens, PEM keys; scan in SpacebotHook.on_tool_result() and before outbound HTTP to block exfiltration"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/agent/channel_dispatch.rs` around lines 531 - 547, The panic branch in
channel_dispatch.rs returns and logs the raw panic_message, which can leak
secrets; before logging and returning, run the panic_message through the same
scrubber used on success (call scrub_with_store(panic_message, &secret_store) or
the local secret store instance used elsewhere) and use the scrubbed string for
tracing::error and the returned formatted message (i.e., replace uses of
panic_message with the scrubbed result); ensure you reference the same
scrub_with_store function and the same secret store variable used by the success
path so panic output is consistently scrubbed.
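For intuition on the kind of egress scan the learnings reference, a deliberately minimal prefix-based sketch is below. The markers are illustrative only; the project's real `scan_for_leaks` presumably uses fuller regex patterns for API keys, tokens, and PEM blocks:

```rust
// Return the first suspicious-looking substring, or None if the text is clean.
// Marker list is a toy; real scanners match full token shapes, not prefixes.
fn scan_for_leaks(text: &str) -> Option<&str> {
    const MARKERS: &[&str] = &["sk-", "ghp_", "-----BEGIN"];
    for marker in MARKERS {
        if let Some(position) = text.find(marker) {
            return Some(&text[position..]);
        }
    }
    None
}

fn main() {
    assert!(scan_for_leaks("token is ghp_abc123").is_some());
    assert!(scan_for_leaks("all clear").is_none());
    println!("ok");
}
```

The review's point is simply that the panic path must route through the same scrubber as the success path so no egress channel is exempt.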
if let Some(channel_state) = channel_state
    && channel_state.cancel_worker(worker_id).await.is_ok()
{
Don’t suppress live cancellation errors before fallback.
At Line 202, .is_ok() drops the actual cancel_worker error, which makes failure diagnosis and terminal-state handling harder. Please log (or branch on) the error before attempting detached reconciliation.
Proposed fix
- if let Some(channel_state) = channel_state
- && channel_state.cancel_worker(worker_id).await.is_ok()
- {
- return Ok(Json(CancelProcessResponse {
- success: true,
- message: format!("Worker {} cancelled", request.process_id),
- }));
- }
+ if let Some(channel_state) = channel_state {
+ match channel_state.cancel_worker(worker_id).await {
+ Ok(()) => {
+ return Ok(Json(CancelProcessResponse {
+ success: true,
+ message: format!("Worker {} cancelled", request.process_id),
+ }));
+ }
+ Err(error) => {
+ tracing::warn!(
+ %error,
+ channel_id = %request.channel_id,
+ process_id = %request.process_id,
+ "live worker cancel failed; attempting detached reconciliation"
+ );
+ }
+ }
}

As per coding guidelines: "Don't silently discard errors; use `let _ =` only on channel sends where the receiver may be dropped; handle, log, or propagate all other errors".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/api/channels.rs` around lines 201 - 203, The code currently discards the
result of channel_state.cancel_worker(worker_id).await by calling .is_ok(),
which hides errors; change it to match on the Result from
channel_state.cancel_worker(worker_id).await (or bind it to a variable) and if
Err(e) log the error (using the appropriate logger) or handle/propagate it
before falling back to detached reconciliation; ensure the branch still performs
the successful cancellation path when Ok(_) and only proceeds to the fallback
when cancellation truly failed or when channel_state is None.
Err(error) => {
    tracing::warn!(
        %error,
        channel_id = %request.channel_id,
        process_id = %request.process_id,
        "failed to cancel detached worker run"
    );
    return Err(StatusCode::INTERNAL_SERVER_ERROR);
}
Avoid returning 500 on the first detached-pool error.
At Lines 229-237, a single pool failure short-circuits the scan. That can produce a 500 even when another pool could still cancel the worker. Continue scanning, track whether any pool errored, and decide 500 vs 404 after the loop.
Proposed fix
let pools = state.agent_pools.load();
+ let mut saw_detached_error = false;
for (_agent_id, pool) in pools.iter() {
let logger = ProcessRunLogger::new(pool.clone());
match logger
.cancel_running_worker(&request.channel_id, worker_id)
.await
{
Ok(true) => {
return Ok(Json(CancelProcessResponse {
success: true,
message: format!(
"Worker {} cancelled (detached run reconciled)",
request.process_id
),
}));
}
Ok(false) => {}
Err(error) => {
+ saw_detached_error = true;
tracing::warn!(
%error,
channel_id = %request.channel_id,
process_id = %request.process_id,
"failed to cancel detached worker run"
);
- return Err(StatusCode::INTERNAL_SERVER_ERROR);
}
}
}
+ if saw_detached_error {
+ return Err(StatusCode::INTERNAL_SERVER_ERROR);
+ }
Err(StatusCode::NOT_FOUND)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/api/channels.rs` around lines 229 - 237, The current Err(error) arm in
the pool-scan loop (the block that logs via tracing::warn with %error,
channel_id and process_id and returns StatusCode::INTERNAL_SERVER_ERROR)
short-circuits the scan and can produce a 500 even if another pool might
succeed; change this so the loop continues on errors: introduce two boolean
flags (e.g., any_errored and any_cancelled), in the Err(error) branch set
any_errored = true and log the error but do not return, in the Ok branch set
any_cancelled = true (and optionally break/return success if you want
first-success behavior), and after the loop decide the response: if
any_cancelled return success, else if any_errored return
StatusCode::INTERNAL_SERVER_ERROR, otherwise return StatusCode::NOT_FOUND;
update the Err(error) arm that currently returns immediately (and references
tracing::warn and StatusCode::INTERNAL_SERVER_ERROR) accordingly.
Summary
- Restricts fatal secret-leak blocking to channel egress (`reply` + plaintext fallback) so worker/internal tool output no longer kills long-running jobs.
- Makes sandbox mode govern `HOME` (workspace path when sandboxed, parent `HOME` in passthrough) and allows `shell`/`exec` working directories outside the workspace when sandbox is disabled.
- Reconciles stale `running` workers by handling worker-task panics as failed completions, reconciling orphaned `running` rows to `failed` on startup, and allowing API cancel to close detached runs after restart.
reply+ plaintext fallback) so worker/internal tool output no longer kills long-running jobs.HOMEand allowshell/execworking directories outside the workspace when sandbox is disabled.runningworkers by handling worker-task panics as failed completions, reconciling orphanedrunningrows tofailedon startup, and allowing API cancel to close detached runs after restart.