cortex: phase2/3 health supervision and detached lifecycle hardening #307
vsumner wants to merge 8 commits into spacedriveapp:main
Conversation
Note: Reviews paused. This branch is under active development, so CodeRabbit has automatically paused this review to avoid an influx of comments on new commits.
Walkthrough

Adds a cancellation-aware, embedding-backed memory maintenance pipeline (decay, prune, merge) and integrates it into the cortex tick loop with timeout/cancellation controls; extends the config and TOML schema with maintenance tuning fields and validation; expands the process control registry with registration IDs, stale-entry handling, detached-worker snapshots, and new cancellation APIs; adds store merge helpers and integration tests.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
src/tools/cancel.rs (1)
86-118: ⚠️ Potential issue | 🟡 Minor
Normalize empty/whitespace cancellation reasons before forwarding.

`Some("")` (or a whitespace-only string) currently bypasses the default and produces an empty reason in downstream calls and message text.

Proposed fix:

```diff
-    let reason = args.reason.as_deref().unwrap_or("cancelled by tool");
+    let reason = args
+        .reason
+        .as_deref()
+        .map(str::trim)
+        .filter(|reason| !reason.is_empty())
+        .unwrap_or("cancelled by tool");
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `src/tools/cancel.rs` around lines 86-118: Normalize the cancellation reason by treating empty or all-whitespace strings as absent before using the default and before forwarding to downstream calls: trim args.reason (or clone/inspect args.reason.as_deref()) and if the trimmed string is empty treat it like None so the local variable reason uses "cancelled by tool"; pass that normalized reason into cancel_branch_with_reason and cancel_worker_with_reason and use it when building the message text in the final message block (affecting the variables reason, the args.reason check, and the calls to cancel_branch_with_reason / cancel_worker_with_reason).

src/agent/compactor.rs (1)
226-236: 🛠️ Refactor suggestion | 🟠 Major
Compactor path still runs an LLM agent; move this work out of the compactor.
This code keeps the compactor executing LLM summarization directly. Please keep compactor purely programmatic and delegate summarization/extraction to a spawned worker path.
As per coding guidelines
src/agent/compactor.rs: "The Compactor must NOT be an LLM process; it is a programmatic monitor per channel that watches context size and triggers compaction using tiered thresholds: >80% background, >85% aggressive, >95% emergency truncation".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `src/agent/compactor.rs` around lines 226-236: This compactor currently constructs an LLM Agent (creates ToolServerHandle via create_cortex_tool_server and builds an AgentBuilder with compactor_prompt and agent), which violates the rule that the Compactor must be purely programmatic; remove the AgentBuilder/ToolServer creation from compactor.rs and instead have the compactor monitor channel context sizes and, when thresholds are crossed (>80% background, >85% aggressive, >95% emergency truncation), spawn or send a task to a dedicated summarization worker (e.g., spawn_compaction_worker or summary_worker) passing the minimal deps (deps.agent_id, deps.memory_event_tx, deps.memory_search and the compaction request payload) so the worker performs LLM summarization/extraction; ensure the compactor only contains the tiered threshold logic and a one-way trigger to the worker rather than invoking AgentBuilder, compactor_prompt, create_cortex_tool_server, or running the LLM inline.

docs/content/docs/(configuration)/config.mdx (1)
500-508: ⚠️ Potential issue | 🟡 Minor
Add the new maintenance/supervision cortex keys to this config table.
This section still documents only the older cortex fields, but the PR adds new user-facing knobs (maintenance interval, decay/prune/similarity, retry/budget limits). Without listing them here, the config reference is incomplete.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `docs/content/docs/(configuration)/config.mdx` around lines 500-508: Update the `[defaults.cortex]` config table to include the new maintenance/supervision keys added by the PR: add rows for the exact keys introduced (e.g., maintenance_interval_secs, prune_decay_secs, prune_interval_secs, similarity_threshold, decay_rate, retry_budget, max_retries — use the exact key names from the PR), specifying each key's Type, Default value and a one-line Description (e.g., maintenance interval, prune/decay behavior, similarity threshold, retry/budget limits) so the docs match the code; ensure you reference the same symbol names used in the codebase (defaults.cortex.<key>) and fill in the correct defaults and descriptions from the PR.

src/agent/channel.rs (1)
1689-1708: ⚠️ Potential issue | 🟡 Minor
Cancelled branches can leave stale branch-tracking entries.

When a branch is cancelled, its `active_branches` entry is already removed before the synthetic `BranchResult` is observed; this early-return path skips cleanup of `memory_persistence_branches` and `branch_reply_targets`.

🧹 Suggested fix:

```diff
 ProcessEvent::BranchResult { branch_id, conclusion, .. } => {
     let mut branches = self.state.active_branches.write().await;
-    if branches.remove(branch_id).is_none() {
+    let was_active = branches.remove(branch_id).is_some();
+    if !was_active {
+        self.memory_persistence_branches.remove(branch_id);
+        self.branch_reply_targets.remove(branch_id);
         return Ok(());
-    };
+    }
     drop(branches);
     run_logger.log_branch_completed(*branch_id, conclusion);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agent/channel.rs` around lines 1689 - 1708, The early return after branches.remove(branch_id).is_none() causes cancelled branches to skip cleanup of memory_persistence_branches and branch_reply_targets; modify the logic so that when branches.remove(branch_id) returns None you still perform the same cleanup steps (remove from self.memory_persistence_branches and self.branch_reply_targets, log the memory persistence info via tracing::info! if applicable, and update metrics/counts) before returning; specifically update the block around branches.remove(branch_id) in the function containing run_logger.log_branch_completed and ensure cleanup of self.memory_persistence_branches and self.branch_reply_targets happens unconditionally for the branch_id (or at least in the branch where branches.remove returned None) so no stale tracking entries remain.
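The fix above can be modeled standalone. The sketch below uses stdlib maps as stand-ins for the real channel state (the struct and field names mirror the review text, not the crate's actual types): bookkeeping is cleared whether or not the branch was still active, so an already-cancelled branch cannot leak tracking entries.

```rust
use std::collections::HashMap;

// Stand-in for the real channel state described in the review.
struct ChannelState {
    active_branches: HashMap<u64, String>,
    memory_persistence_branches: HashMap<u64, String>,
    branch_reply_targets: HashMap<u64, String>,
}

impl ChannelState {
    /// Returns whether the branch was still active. Cleanup of the two
    /// tracking maps runs unconditionally, so the early-return path for a
    /// cancelled branch no longer skips it.
    fn on_branch_result(&mut self, branch_id: u64) -> bool {
        let was_active = self.active_branches.remove(&branch_id).is_some();
        self.memory_persistence_branches.remove(&branch_id);
        self.branch_reply_targets.remove(&branch_id);
        was_active
    }
}

fn main() {
    // Branch 7 was already cancelled: not in active_branches, but still tracked.
    let mut state = ChannelState {
        active_branches: HashMap::new(),
        memory_persistence_branches: HashMap::from([(7, String::from("mem"))]),
        branch_reply_targets: HashMap::from([(7, String::from("reply"))]),
    };
    assert!(!state.on_branch_result(7));
    // The tracking entries are cleared even on the inactive path.
    assert!(state.memory_persistence_branches.is_empty());
    assert!(state.branch_reply_targets.is_empty());
}
```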
🧹 Nitpick comments (2)
src/config.rs (2)
4856-4906: Add bounds validation for the new maintenance numeric knobs at config load.

These paths currently accept invalid values (e.g., zero interval, negative age, out-of-range/non-finite thresholds), which can cause unstable runtime maintenance behavior.
Proposed validation guardrails
```diff
+fn validate_cortex_maintenance(cortex: &CortexConfig, scope: &str) -> Result<()> {
+    if cortex.maintenance_interval_secs == 0 {
+        return Err(ConfigError::Invalid(format!(
+            "{scope}.maintenance_interval_secs must be > 0"
+        ))
+        .into());
+    }
+    if cortex.maintenance_min_age_days < 0 {
+        return Err(ConfigError::Invalid(format!(
+            "{scope}.maintenance_min_age_days must be >= 0"
+        ))
+        .into());
+    }
+    for (name, value) in [
+        ("maintenance_decay_rate", cortex.maintenance_decay_rate),
+        ("maintenance_prune_threshold", cortex.maintenance_prune_threshold),
+        (
+            "maintenance_merge_similarity_threshold",
+            cortex.maintenance_merge_similarity_threshold,
+        ),
+    ] {
+        if !value.is_finite() || !(0.0..=1.0).contains(&value) {
+            return Err(ConfigError::Invalid(format!(
+                "{scope}.{name} must be a finite value in [0.0, 1.0]"
+            ))
+            .into());
+        }
+    }
+    Ok(())
+}

 let defaults = DefaultsConfig {
     // ...
 };
+validate_cortex_maintenance(&defaults.cortex, "defaults.cortex")?;

-    cortex: a.cortex.map(|c| CortexConfig {
+    cortex: a.cortex.map(|c| CortexConfig {
         // ...
     }),
```

(then validate `AgentConfig.cortex` after construction, similar to the defaults scope)

Also applies to: 5083-5133
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/config.rs` around lines 4856 - 4906, The new numeric maintenance knobs (e.g., maintenance_interval_secs, worker_timeout_secs, branch_timeout_secs, detached_worker_timeout_retry_limit, supervisor_kill_budget_per_tick, circuit_breaker_threshold, bulletin_interval_secs, bulletin_max_words, bulletin_max_turns, association_interval_secs, association_similarity_threshold, association_updates_threshold, association_max_per_pass, maintenance_decay_rate, maintenance_prune_threshold, maintenance_min_age_days, maintenance_merge_similarity_threshold) currently accept invalid values; add bounds validation after building AgentConfig.cortex (similar to how defaults are validated) to reject or clamp zero/negative intervals, non-positive counts, out-of-range/non-finite thresholds, and unreasonable ages. Implement a validate_agent_cortex or AgentConfig::validate_cortex function and call it after construction to return an error (or apply safe clamping) when any field is invalid, using the same defaults scope/base_defaults.cortex for reference and include clear error messages mentioning the offending field.
7613-7660: Extend this regression test to assert the two new supervision control fields as well.

The test currently validates maintenance interval/decay-related propagation, but it does not verify `detached_worker_timeout_retry_limit` and `supervisor_kill_budget_per_tick`.

Suggested test additions:

```diff
 [defaults.cortex]
 maintenance_interval_secs = 1200
+detached_worker_timeout_retry_limit = 4
+supervisor_kill_budget_per_tick = 12
 maintenance_decay_rate = 0.11
 maintenance_prune_threshold = 0.21
 maintenance_min_age_days = 17
 maintenance_merge_similarity_threshold = 0.97
@@
     assert_eq!(resolved.cortex.maintenance_interval_secs, 1200);
+    assert_eq!(resolved.cortex.detached_worker_timeout_retry_limit, 4);
+    assert_eq!(resolved.cortex.supervisor_kill_budget_per_tick, 12);
     assert_eq!(resolved.cortex.maintenance_decay_rate, 0.11);
@@
-cortex = { maintenance_interval_secs = 300, maintenance_decay_rate = 0.33 }
+cortex = { maintenance_interval_secs = 300, maintenance_decay_rate = 0.33, detached_worker_timeout_retry_limit = 6, supervisor_kill_budget_per_tick = 16 }
@@
     assert_eq!(resolved_override.cortex.maintenance_interval_secs, 300);
+    assert_eq!(resolved_override.cortex.detached_worker_timeout_retry_limit, 6);
+    assert_eq!(resolved_override.cortex.supervisor_kill_budget_per_tick, 16);
     assert_eq!(resolved_override.cortex.maintenance_decay_rate, 0.33);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/config.rs` around lines 7613 - 7660, The test test_cortex_maintenance_config_resolution needs to also assert the two new supervision control fields on the resolved cortex config: add assertions after the first resolved variable checks to verify resolved.cortex.detached_worker_timeout_retry_limit and resolved.cortex.supervisor_kill_budget_per_tick equal the values coming from config.defaults.cortex (use config.defaults.cortex.detached_worker_timeout_retry_limit and config.defaults.cortex.supervisor_kill_budget_per_tick), and likewise add assertions after resolved_override checks to verify resolved_override.cortex.detached_worker_timeout_retry_limit and resolved_override.cortex.supervisor_kill_budget_per_tick equal config_override.defaults.cortex.* so the test covers both default propagation and override behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/design-docs/cortex-history.md`:
- Around line 71-74: Update the document to remove the contradictory
rollout-status by making the implementation status consistent: either mark
"Health supervision (Phase 2, implemented)" as implemented throughout or change
it to "planned/partially implemented" and clarify what is actually emitted
initially where the bulletin sentence occurs; specifically edit the "Health
supervision" heading and the earlier sentence that currently says "only bulletin
events are emitted initially" (around the bulletin text) so both state the same
rollout phase and exact emitted events (e.g., "bulletin events only initially;
full health supervision events (worker_killed, branch_killed,
circuit_breaker_tripped, health_check) are available from Phase 2 after rollout"
or mark Phase 2 as not yet rolled out), and ensure you reference the Phase 2
breaker volatility note remains accurate.
In `@docs/design-docs/cortex-implementation.md`:
- Around line 86-89: The fenced lifecycle diagram block containing the lines "0
active -> 1 completing -> 3 terminal" and "0 active -> 2 killing -> 3
terminal" is missing a language specifier which triggers markdownlint MD040;
update that fenced block by adding a language tag (e.g., "text") immediately
after the opening triple backticks so the block becomes a labeled code fence
(open the file and edit the block that contains those two lifecycle lines).
- Around line 19-21: The heading and bullet contradict each other — either
remove "never called" from the section title or change the bullet to indicate
these functions are not wired; specifically update the text to unambiguously
state the real status of apply_decay, prune_memories, and merge_similar_memories
(e.g., "Implemented and wired into the cortex loop" if they are invoked, or
"Implemented but not yet wired into the cortex loop" if they are not), and
ensure the section references the correct runtime behavior of the cortex loop
rather than both claims.
In `@src/agent/process_control.rs`:
- Around line 102-112: The code removes the map entry unconditionally when the
cloned Weak handle fails to upgrade, which races with a concurrent registration
that may have replaced the entry; change the cleanup to be conditional: keep the
originally read Weak (e.g. the variable bound to handle before upgrade), and
when upgrade() returns None acquire the write lock and only remove(channel_id)
if the current map entry still equals that original Weak
(pointer-equality/Weak::ptr_eq or by comparing the cloned Weak), otherwise leave
it intact; preserve current behavior of calling
handle.cancel_worker_with_reason(worker_id, reason).await when upgrade succeeds
and return ControlActionResult::NotFound only when the entry was indeed the same
stale handle.
In `@src/api/channels.rs`:
- Around line 293-298: The code currently collapses all errors from
channel_state.cancel_worker_with_reason(...) into boolean checks and map_err(|_|
NOT_FOUND), losing actionable failure details; update the handling in the
channel_state branches (the one using cancel_worker_with_reason(...) near the
shown diff and the similar block around lines 349-352) to match on the Result
instead of calling is_ok()/map_err(|_| ...), then log the error and translate
known domain errors to the appropriate API response (e.g., return NOT_FOUND only
when the error truly means "not found") and propagate or convert unexpected
errors to a 5xx response; reference the cancel_worker_with_reason(...) call and
worker_id to locate and change the logic.
In `@src/api/config.rs`:
- Around line 600-605: The code currently casts
cortex.detached_worker_timeout_retry_limit and
cortex.supervisor_kill_budget_per_tick directly to i64 which can overflow;
replace the direct casts with checked conversions (e.g., i64::try_from or
TryInto) and if conversion fails return a 400 BAD_REQUEST (use the same error
path used by the warmup update logic). Update both occurrences (the block
handling detached_worker_timeout_retry_limit and supervisor_kill_budget_per_tick
and the similar block at the later occurrence) to perform the conversion into an
i64 variable, use toml_edit::value on that variable, and map conversion errors
to the existing BAD_REQUEST response type used elsewhere in this module.
- Around line 621-632: The code is writing maintenance knobs directly into the
TOML table; add validation for each optional field before persisting: ensure
cortex.maintenance_decay_rate, cortex.maintenance_prune_threshold, and
cortex.maintenance_merge_similarity_threshold (the three places where you cast
to f64 and assign to table keys "maintenance_decay_rate",
"maintenance_prune_threshold", "maintenance_merge_similarity_threshold") are
within 0.0..=1.0, and ensure cortex.maintenance_min_age_days (the
"maintenance_min_age_days" assignment) is >= 0; if any value is out of range,
return an appropriate error (or propagate a validation error) instead of writing
to the table so the API rejects invalid inputs. Ensure error messages mention
the specific field name for clarity.
In `@src/lib.rs`:
- Around line 267-277: The function create_process_event_buses_with_capacity
must guard against zero capacities to avoid Tokios internal panic: validate
control_event_capacity and memory_event_capacity are > 0 at the start of the
function and return an error instead of calling broadcast::channel when either
is zero (change the signature to return a
Result<(tokio::sync::broadcast::Sender<ProcessEvent>,
tokio::sync::broadcast::Sender<ProcessEvent>), SomeErrorType> or use an existing
error type in the crate); ensure you reference the incoming parameter names
control_event_capacity and memory_event_capacity and return an Err with a clear
message when either is zero, otherwise proceed to create the channels (event_tx,
_event_rx) and (memory_event_tx, _memory_event_rx) and return Ok((event_tx,
memory_event_tx)).
In `@src/memory/maintenance.rs`:
- Around line 324-377: The merge flow (the sequence calling
maintenance_cancelable_op with memory_store.update, get_associations,
delete_associations_for_memory, create_association,
memory_store.create_association, memory_store.forget and embedding_table.delete)
must be executed atomically: open a single DB transaction around that whole
sequence (e.g., in merge_pair), perform one pre-check for cancellation (call
check_maintenance_cancellation once before beginning the transaction), run all
reads/writes inside the transaction, and commit or rollback the transaction on
success/error so partial rewires cannot persist; remove per-association
cancellation checks and avoid issuing intermediate independent
maintenance_cancelable_op calls for those operations so they happen as a single
transactional unit.
In `@tests/maintenance.rs`:
- Around line 8-11: The fixture make_memory_maintenance_fixture drops the
TempDir before embedding_table is used, causing the backing directory to be
deleted; update the fixture signature and return value to include the tempdir
(e.g., return std::sync::Arc<MemoryStore>, spacebot::memory::EmbeddingTable, and
the TempDir) so the TempDir lives as long as embedding_table, update all call
sites that destructure the tuple accordingly, and ensure the TempDir variable is
held for the lifetime of tests that access embedding_table.
---
Outside diff comments:
In `docs/content/docs/(configuration)/config.mdx`:
- Around line 500-508: Update the `[defaults.cortex]` config table to include
the new maintenance/supervision keys added by the PR: add rows for the exact
keys introduced (e.g., maintenance_interval_secs, prune_decay_secs,
prune_interval_secs, similarity_threshold, decay_rate, retry_budget, max_retries
— use the exact key names from the PR), specifying each key's Type, Default
value and a one-line Description (e.g., maintenance interval, prune/decay
behavior, similarity threshold, retry/budget limits) so the docs match the code;
ensure you reference the same symbol names used in the codebase
(defaults.cortex.<key>) and fill in the correct defaults and descriptions from
the PR.
In `@src/agent/channel.rs`:
- Around line 1689-1708: The early return after
branches.remove(branch_id).is_none() causes cancelled branches to skip cleanup
of memory_persistence_branches and branch_reply_targets; modify the logic so
that when branches.remove(branch_id) returns None you still perform the same
cleanup steps (remove from self.memory_persistence_branches and
self.branch_reply_targets, log the memory persistence info via tracing::info! if
applicable, and update metrics/counts) before returning; specifically update the
block around branches.remove(branch_id) in the function containing
run_logger.log_branch_completed and ensure cleanup of
self.memory_persistence_branches and self.branch_reply_targets happens
unconditionally for the branch_id (or at least in the branch where
branches.remove returned None) so no stale tracking entries remain.
In `@src/agent/compactor.rs`:
- Around line 226-236: This compactor currently constructs an LLM Agent (creates
ToolServerHandle via create_cortex_tool_server and builds an AgentBuilder with
compactor_prompt and agent), which violates the rule that Compactor must be
purely programmatic; remove the AgentBuilder/ToolServer creation from
compactor.rs and instead have the compactor monitor channel context sizes and,
when thresholds are crossed (>80% background, >85% aggressive, >95% emergency
truncation), spawn or send a task to a dedicated summarization worker (e.g.,
spawn_compaction_worker or summary_worker) passing the minimal deps
(deps.agent_id, deps.memory_event_tx, deps.memory_search and the compaction
request payload) so the worker performs LLM summarization/extraction; ensure
compactor only contains the tiered threshold logic and a one-way trigger to the
worker rather than invoking AgentBuilder, compactor_prompt,
create_cortex_tool_server, or running the LLM inline.
In `@src/tools/cancel.rs`:
- Around line 86-118: Normalize the cancellation reason by treating empty or
all-whitespace strings as absent before using the default and before forwarding
to downstream calls: trim args.reason (or clone/inspect args.reason.as_deref())
and if the trimmed string is empty treat it like None so the local variable
reason uses "cancelled by tool"; pass that normalized reason into
cancel_branch_with_reason and cancel_worker_with_reason and use it when building
the message text in the final message block (affecting the variables reason,
args.reason check, and the calls to cancel_branch_with_reason /
cancel_worker_with_reason).
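The normalization this comment asks for can be exercised as a standalone sketch. The helper name is hypothetical; the real code operates on `args.reason` inside the tool handler, but the trim/filter/default chain is the same:

```rust
// An empty or all-whitespace reason falls back to the default instead of
// propagating "" into downstream calls and message text.
fn normalize_reason(reason: Option<&str>) -> &str {
    reason
        .map(str::trim)
        .filter(|r| !r.is_empty())
        .unwrap_or("cancelled by tool")
}

fn main() {
    assert_eq!(normalize_reason(None), "cancelled by tool");
    assert_eq!(normalize_reason(Some("   ")), "cancelled by tool");
    assert_eq!(normalize_reason(Some(" user asked ")), "user asked");
}
```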
---
Nitpick comments:
In `@src/config.rs`:
- Around line 4856-4906: The new numeric maintenance knobs (e.g.,
maintenance_interval_secs, worker_timeout_secs, branch_timeout_secs,
detached_worker_timeout_retry_limit, supervisor_kill_budget_per_tick,
circuit_breaker_threshold, bulletin_interval_secs, bulletin_max_words,
bulletin_max_turns, association_interval_secs, association_similarity_threshold,
association_updates_threshold, association_max_per_pass, maintenance_decay_rate,
maintenance_prune_threshold, maintenance_min_age_days,
maintenance_merge_similarity_threshold) currently accept invalid values; add
bounds validation after building AgentConfig.cortex (similar to how defaults are
validated) to reject or clamp zero/negative intervals, non-positive counts,
out-of-range/non-finite thresholds, and unreasonable ages. Implement a
validate_agent_cortex or AgentConfig::validate_cortex function and call it after
construction to return an error (or apply safe clamping) when any field is
invalid, using the same defaults scope/base_defaults.cortex for reference and
include clear error messages mentioning the offending field.
- Around line 7613-7660: The test test_cortex_maintenance_config_resolution
needs to also assert the two new supervision control fields on the resolved
cortex config: add assertions after the first resolved variable checks to verify
resolved.cortex.detached_worker_timeout_retry_limit and
resolved.cortex.supervisor_kill_budget_per_tick equal the values coming from
config.defaults.cortex (use
config.defaults.cortex.detached_worker_timeout_retry_limit and
config.defaults.cortex.supervisor_kill_budget_per_tick), and likewise add
assertions after resolved_override checks to verify
resolved_override.cortex.detached_worker_timeout_retry_limit and
resolved_override.cortex.supervisor_kill_budget_per_tick equal
config_override.defaults.cortex.* so the test covers both default propagation
and override behavior.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 6443e609-32e5-4011-bbce-78571bb04b30
⛔ Files ignored due to path filters (1)
Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (29)
docs/content/docs/(configuration)/config.mdx
docs/content/docs/(core)/architecture.mdx
docs/design-docs/cortex-history.md
docs/design-docs/cortex-implementation.md
src/agent.rs
src/agent/channel.rs
src/agent/channel_dispatch.rs
src/agent/channel_history.rs
src/agent/compactor.rs
src/agent/cortex.rs
src/agent/ingestion.rs
src/agent/process_control.rs
src/agent/status.rs
src/api/agents.rs
src/api/channels.rs
src/api/config.rs
src/config.rs
src/lib.rs
src/main.rs
src/memory/maintenance.rs
src/memory/store.rs
src/telemetry/registry.rs
src/tools.rs
src/tools/cancel.rs
src/tools/memory_save.rs
tests/bulletin.rs
tests/context_dump.rs
tests/detached_worker_bootstrap.rs
tests/maintenance.rs
a41e2c7 to 87f1038
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/agent/worker.rs (1)
348-383: ⚠️ Potential issue | 🟠 Major
Handle follow-up cancellation explicitly instead of treating it as failure.
In the follow-up loop, `PromptCancelled` currently falls into the generic `Err(error)` arm (Line 377), which reclassifies cancellation as a failure and changes downstream completion semantics (`failed` vs `cancelled`).
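The distinction can be shown with a minimal sketch. The enum names are stand-ins mirroring the comment, not the crate's real types; the point is that `PromptCancelled` gets its own match arm instead of falling into the generic error arm:

```rust
#[derive(Debug, PartialEq)]
enum PromptError {
    PromptCancelled,
    Other(String),
}

#[derive(Debug, PartialEq)]
enum Completion {
    Succeeded,
    Cancelled,
    Failed(String),
}

fn classify(result: Result<(), PromptError>) -> Completion {
    match result {
        Ok(()) => Completion::Succeeded,
        // Explicit arm: cancellation is not reclassified as a failure.
        Err(PromptError::PromptCancelled) => Completion::Cancelled,
        Err(PromptError::Other(message)) => Completion::Failed(message),
    }
}

fn main() {
    assert_eq!(
        classify(Err(PromptError::PromptCancelled)),
        Completion::Cancelled
    );
    assert_eq!(
        classify(Err(PromptError::Other("boom".to_string()))),
        Completion::Failed("boom".to_string())
    );
}
```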
♻️ Duplicate comments (8)
src/api/config.rs (2)
621-632: ⚠️ Potential issue | 🟠 Major
Validate maintenance parameter bounds before persisting.
These fields are accepted and written without bounds checks. Please reject invalid values (`0.0..=1.0` for decay/prune/similarity; `>= 0` for min age) before mutating the TOML document.

Suggested fix:

```diff
+fn validate_unit_interval(value: f32) -> Result<(), StatusCode> {
+    if (0.0..=1.0).contains(&value) {
+        Ok(())
+    } else {
+        Err(StatusCode::BAD_REQUEST)
+    }
+}
@@
-    if let Some(v) = cortex.maintenance_decay_rate {
-        table["maintenance_decay_rate"] = toml_edit::value(v as f64);
+    if let Some(value) = cortex.maintenance_decay_rate {
+        validate_unit_interval(value)?;
+        table["maintenance_decay_rate"] = toml_edit::value(value as f64);
     }
-    if let Some(v) = cortex.maintenance_prune_threshold {
-        table["maintenance_prune_threshold"] = toml_edit::value(v as f64);
+    if let Some(value) = cortex.maintenance_prune_threshold {
+        validate_unit_interval(value)?;
+        table["maintenance_prune_threshold"] = toml_edit::value(value as f64);
     }
-    if let Some(v) = cortex.maintenance_min_age_days {
-        table["maintenance_min_age_days"] = toml_edit::value(v);
+    if let Some(value) = cortex.maintenance_min_age_days {
+        if value < 0 {
+            return Err(StatusCode::BAD_REQUEST);
+        }
+        table["maintenance_min_age_days"] = toml_edit::value(value);
     }
-    if let Some(v) = cortex.maintenance_merge_similarity_threshold {
-        table["maintenance_merge_similarity_threshold"] = toml_edit::value(v as f64);
+    if let Some(value) = cortex.maintenance_merge_similarity_threshold {
+        validate_unit_interval(value)?;
+        table["maintenance_merge_similarity_threshold"] = toml_edit::value(value as f64);
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/api/config.rs` around lines 621 - 632, The maintenance parameters are written into the TOML without validation; add range checks before mutating table: for maintenance_decay_rate, maintenance_prune_threshold, and maintenance_merge_similarity_threshold (the cortex fields referenced) ensure their values are within 0.0..=1.0 and for maintenance_min_age_days ensure value is >= 0; if a value is out of bounds, return an error (or propagate a validation error) instead of setting table[...] — perform these checks immediately before the existing assignments that set table["maintenance_decay_rate"], table["maintenance_prune_threshold"], table["maintenance_min_age_days"], and table["maintenance_merge_similarity_threshold"] so invalid inputs are rejected prior to TOML mutation.
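The bounds check itself is easy to exercise standalone. Here the API's status-code error type is swapped for a plain string (an assumption for the sake of a self-contained example); the range logic is the part that matters:

```rust
// Reject non-finite values and anything outside the closed unit interval.
fn validate_unit_interval(name: &str, value: f64) -> Result<(), String> {
    if value.is_finite() && (0.0..=1.0).contains(&value) {
        Ok(())
    } else {
        Err(format!("{name} must be a finite value in [0.0, 1.0]"))
    }
}

fn main() {
    assert!(validate_unit_interval("maintenance_decay_rate", 0.11).is_ok());
    assert!(validate_unit_interval("maintenance_decay_rate", 1.5).is_err());
    // NaN fails the is_finite() check rather than slipping through a
    // comparison that would be false for NaN anyway.
    assert!(validate_unit_interval("maintenance_decay_rate", f64::NAN).is_err());
}
```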
603-605: ⚠️ Potential issue | 🟠 Major
Guard lossy integer conversions before writing cortex values.
`supervisor_kill_budget_per_tick` (usize) and `maintenance_interval_secs` (u64) are cast with `as i64`; oversized inputs can wrap into negative persisted values.

Suggested fix:
```diff
-    if let Some(v) = cortex.supervisor_kill_budget_per_tick {
-        table["supervisor_kill_budget_per_tick"] = toml_edit::value(v as i64);
+    if let Some(value) = cortex.supervisor_kill_budget_per_tick {
+        let value = i64::try_from(value).map_err(|_| StatusCode::BAD_REQUEST)?;
+        table["supervisor_kill_budget_per_tick"] = toml_edit::value(value);
     }
@@
-    if let Some(v) = cortex.maintenance_interval_secs {
-        table["maintenance_interval_secs"] = toml_edit::value(v as i64);
+    if let Some(value) = cortex.maintenance_interval_secs {
+        let value = i64::try_from(value).map_err(|_| StatusCode::BAD_REQUEST)?;
+        table["maintenance_interval_secs"] = toml_edit::value(value);
     }
```

Run this read-only check to confirm remaining risky casts in this path:
```shell
#!/bin/bash
rg -nP -C2 'supervisor_kill_budget_per_tick"\]\s*=\s*toml_edit::value\(.+\s+as\s+i64\)|maintenance_interval_secs"\]\s*=\s*toml_edit::value\(.+\s+as\s+i64\)' src/api/config.rs
```

Also applies to: 618-620
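The difference between the checked and unchecked conversion is easy to demonstrate in isolation (the function and error string here are illustrative stand-ins for the module's BAD_REQUEST path):

```rust
// i64::try_from rejects values an `as i64` cast would silently wrap negative.
fn to_persistable_i64(value: u64) -> Result<i64, String> {
    i64::try_from(value).map_err(|_| "value too large to persist".to_string())
}

fn main() {
    assert_eq!(to_persistable_i64(300), Ok(300));
    assert!(to_persistable_i64(u64::MAX).is_err());
    // By contrast, the unchecked cast wraps: u64::MAX as i64 == -1.
    assert_eq!(u64::MAX as i64, -1);
}
```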
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `src/api/config.rs` around lines 603-605: The code is performing lossy casts like `cortex.supervisor_kill_budget_per_tick as i64` (and similarly `maintenance_interval_secs as i64`) which can wrap negative for oversized inputs; change these assignments to validate and convert safely using `i64::try_from(...)` or `usize::try_into::<i64>()` and handle the Err case (return an error, skip persisting with a logged warning, or clamp to a safe max) instead of an unchecked `as` cast so only representable values are written into `table["supervisor_kill_budget_per_tick"]` and `table["maintenance_interval_secs"]`. Ensure you update the code paths around the `table[...] = toml_edit::value(...)` assignments (references: cortex.supervisor_kill_budget_per_tick, maintenance_interval_secs) to perform the checked conversion and appropriate error handling or fallback.

src/memory/maintenance.rs (1)
324-377: ⚠️ Potential issue | 🔴 Critical
Merge rewiring still is not atomic across cancellation/error boundaries.
This sequence mutates survivor content, rewires associations, writes `Updates` associations, forgets the loser, and deletes embeddings in separate operations with cancellation checkpoints in the middle. A mid-sequence failure can persist a partially merged graph.

Please verify available transaction primitives in storage code, then wrap this critical section as one atomic unit:
```shell
#!/bin/bash
rg -nP --type=rust -C2 'impl\s+MemoryStore|fn\s+.*(transaction|begin|commit|rollback)|sqlx::Transaction' src/memory
```
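The all-or-nothing property being asked for can be illustrated without a database. This sketch is purely a model, not the crate's API: every mutation is staged on a copy of the store and swapped in only at the commit point, which is the boundary the real code would get from a storage transaction (with a single cancellation check before the transaction begins).

```rust
use std::collections::HashMap;

#[derive(Clone)]
struct Store {
    memories: HashMap<u64, String>,
    associations: Vec<(u64, u64)>,
}

// Merge `loser` into `survivor` atomically: either every step lands or none do.
fn merge_pair(store: &mut Store, survivor: u64, loser: u64, merged: String) -> Result<(), String> {
    // Single pre-check before the "transaction" begins (stands in for the
    // one-time cancellation check the review recommends).
    if !store.memories.contains_key(&survivor) || !store.memories.contains_key(&loser) {
        return Err("missing memory".to_string());
    }
    // Stage all mutations on a copy; the live store is untouched until commit.
    let mut staged = store.clone();
    staged.memories.insert(survivor, merged);
    // Rewire every association that pointed at the loser to the survivor.
    for assoc in &mut staged.associations {
        if assoc.0 == loser {
            assoc.0 = survivor;
        }
        if assoc.1 == loser {
            assoc.1 = survivor;
        }
    }
    staged.memories.remove(&loser);
    // "Commit": a partially merged graph was never observable.
    *store = staged;
    Ok(())
}

fn main() {
    let mut store = Store {
        memories: HashMap::from([(1, String::from("a")), (2, String::from("b"))]),
        associations: vec![(2, 9)],
    };
    merge_pair(&mut store, 1, 2, String::from("a+b")).unwrap();
    assert_eq!(store.associations, vec![(1, 9)]);
    assert!(!store.memories.contains_key(&2));
}
```

In the real code the staging copy would be replaced by the storage layer's begin/commit/rollback primitive, which the shell check above is meant to locate.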
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `src/memory/maintenance.rs` around lines 324-377: The merge sequence mutates survivor content and rewrites associations via memory_store.update, get_associations, delete_associations_for_memory, create_association, creates an Updates association, calls memory_store.forget and embedding_table.delete with cancellation checkpoints between them; wrap all of these operations in a single storage transaction (use whatever transaction primitive exists on MemoryStore — e.g., begin_transaction/transaction/tx or sqlx::Transaction if present) so the update + rewire + create Updates + forget + embedding delete are executed atomically, remove intermediate maintenance_cancelable_op awaits inside the transactional block (perform a single cancellable check before starting and ensure cancellation causes a rollback), and on error ensure you rollback the transaction and only after commit run any non-transactional cleanup (or queue embedding_table.delete inside the same transaction if supported) so partial merges cannot persist; reference functions: memory_store.update, memory_store.get_associations, memory_store.delete_associations_for_memory, memory_store.create_association, memory_store.forget, embedding_table.delete, Association::new, maintenance_cancelable_op, check_maintenance_cancellation.

docs/design-docs/cortex-implementation.md (2)
19-21: ⚠️ Potential issue | 🟡 Minor
Resolve the contradictory maintenance status wording.
Line 19 says “never called” while Line 20 says “wired into the cortex loop.” Keep one unambiguous state.
🛠️ Suggested wording fix
```diff
-**Implemented but never called:**
+**Implemented and called:**
 - `memory/maintenance.rs` — `apply_decay()`, `prune_memories()`, and `merge_similar_memories()` are implemented and wired into the cortex loop.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design-docs/cortex-implementation.md` around lines 19 - 21, The document currently contradicts itself by saying the maintenance functions are "Implemented but never called" while also stating they are "wired into the cortex loop"; update the wording to a single clear state: either mark memory/maintenance.rs and its functions apply_decay(), prune_memories(), and merge_similar_memories() as "Implemented and wired into the cortex loop" or as "Implemented but not yet wired into the cortex loop" (choose the true runtime state), and make the cortex loop description consistent with that choice so there is no ambiguity.
86-89: ⚠️ Potential issue | 🟡 Minor
Add language identifiers to fenced code blocks.
These fences are unlabeled and will keep triggering MD040.
🛠️ Suggested lint-safe fix
````diff
-```
+```text
 0 active -> 1 completing -> 3 terminal
 0 active -> 2 killing -> 3 terminal
@@
-```
+```text
 Phase 1 (tick loop) — standalone, just plumbing
 Phase 2 (health) — depends on Phase 1
 Phase 3 (maintenance) — depends on Phase 1, independent of Phase 2
 Phase 4 (consolidation) — depends on Phase 1, benefits from Phase 3
 Phase 5 (hook) — independent, can be done anytime after Phase 1
````

Also applies to: 204-210
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design-docs/cortex-implementation.md` around lines 86 - 89, Two fenced code blocks in the document are unlabeled and trigger MD040; add a language identifier (e.g., "text") to each opening fence for the blocks containing the lines "0 active -> 1 completing -> 3 terminal / 0 active -> 2 killing -> 3 terminal" and the block starting with "Phase 1 (tick loop) — standalone, just plumbing" so the fences become ```text and satisfy the linter.tests/maintenance.rs (1)
8-11: ⚠️ Potential issue | 🟠 Major
Keep the temporary directory alive for the full test lifetime.
Line 27 creates `TempDir`, but it is dropped when the fixture returns at Line 36. `embedding_table` can then reference deleted storage and become unstable during assertions.

🛠️ Suggested fix
```diff
 async fn make_memory_maintenance_fixture() -> (
     std::sync::Arc<MemoryStore>,
     spacebot::memory::EmbeddingTable,
+    tempfile::TempDir,
 ) {
@@
-    let dir = tempdir().expect("failed to create temp dir");
-    let lance_conn = lancedb::connect(dir.path().to_str().expect("temp path"))
+    let temp_directory = tempdir().expect("failed to create temp dir");
+    let lance_conn = lancedb::connect(temp_directory.path().to_str().expect("temp path"))
         .execute()
         .await
         .expect("failed to connect to lancedb");
@@
-    (store, embedding_table)
+    (store, embedding_table, temp_directory)
 }
@@
-    let (store, embedding_table) = make_memory_maintenance_fixture().await;
+    let (store, embedding_table, _temp_directory) = make_memory_maintenance_fixture().await;
@@
-    let (store, embedding_table) = make_memory_maintenance_fixture().await;
+    let (store, embedding_table, _temp_directory) = make_memory_maintenance_fixture().await;
```

Also applies to: 27-37, 41-42, 163-164
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/maintenance.rs` around lines 8 - 11, The temporary directory created inside make_memory_maintenance_fixture is dropped when the function returns, leaving embedding_table referencing deleted storage; fix this by extending the TempDir lifetime (e.g., include the TempDir in the fixture's return value or wrap it in an Arc and return it alongside std::sync::Arc<MemoryStore> and spacebot::memory::EmbeddingTable) so the TempDir is kept alive for the entire test; update all call sites and other fixtures mentioned (the other occurrences around where TempDir is created) to accept/propagate the extra TempDir holder to prevent use-after-drop of embedding_table storage.src/agent/process_control.rs (1)
102-112: ⚠️ Potential issue | 🟠 Major
Avoid removing a potentially newer channel handle on stale upgrade failure.
Line 110 and Line 129 unconditionally `remove(channel_id)` after `upgrade()` fails. That can race with concurrent re-registration and delete a fresh live handle.

🛠️ Race-safe cleanup
```diff
 if let Some(handle) = handle.upgrade() {
     handle.cancel_worker_with_reason(worker_id, reason).await
 } else {
-    self.channels.write().await.remove(channel_id);
+    let mut channels = self.channels.write().await;
+    if channels
+        .get(channel_id)
+        .is_some_and(|current| current.upgrade().is_none())
+    {
+        channels.remove(channel_id);
+    }
     ControlActionResult::NotFound
 }
@@
 if let Some(handle) = handle.upgrade() {
     handle.cancel_branch_with_reason(branch_id, reason).await
 } else {
-    self.channels.write().await.remove(channel_id);
+    let mut channels = self.channels.write().await;
+    if channels
+        .get(channel_id)
+        .is_some_and(|current| current.upgrade().is_none())
+    {
+        channels.remove(channel_id);
+    }
     ControlActionResult::NotFound
 }
```

Also applies to: 121-131
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agent/process_control.rs` around lines 102 - 112, The code unconditionally calls self.channels.write().await.remove(channel_id) when Weak::upgrade() fails, which can delete a newly-registered handle; change the cleanup to be race-safe by taking the write lock and removing the entry only if the stored Weak pointer is the same stale Weak you originally looked up (or use the HashMap entry API to compare-and-remove): locate the code around channels.read().await.get(channel_id).cloned() and the subsequent upgrade() check in process_control.rs and replace the unconditional remove(channel_id) with a conditional remove that verifies pointer identity of the Weak handle (or re-check the map entry equals the cloned Weak) before removing, preserving correct behavior for cancel_worker_with_reason and ControlActionResult paths.src/lib.rs (1)
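The compare-before-remove idea generalizes beyond this registry. A stdlib-only sketch (the map type and `remove_if_stale` helper are illustrative, not the crate's actual API, and the real code guards the map with an async `RwLock` rather than a plain `&mut`):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Weak};

// Race-safe eviction: remove the registry entry only while the stored Weak is
// still dead at removal time, so a concurrent re-registration that swapped in
// a live handle is never deleted by a stale-cleanup path.
fn remove_if_stale(channels: &mut HashMap<String, Weak<String>>, channel_id: &str) -> bool {
    let is_stale = channels
        .get(channel_id)
        .is_some_and(|handle| handle.upgrade().is_none());
    if is_stale {
        channels.remove(channel_id);
    }
    is_stale
}

fn main() {
    let mut channels: HashMap<String, Weak<String>> = HashMap::new();

    // A live handle must survive the cleanup attempt.
    let live = Arc::new("handle".to_string());
    channels.insert("chan-a".into(), Arc::downgrade(&live));
    assert!(!remove_if_stale(&mut channels, "chan-a"));
    assert!(channels.contains_key("chan-a"));

    // A dead handle (its Arc was dropped) is evicted.
    let dead = Arc::new("handle".to_string());
    channels.insert("chan-b".into(), Arc::downgrade(&dead));
    drop(dead);
    assert!(remove_if_stale(&mut channels, "chan-b"));
    assert!(!channels.contains_key("chan-b"));
}
```

Because the staleness check and the removal happen under the same write lock in the suggested diff, a re-registration cannot interleave between them.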
267-277: ⚠️ Potential issue | 🟠 Major
Guard zero-capacity bus creation to avoid runtime panic.
On Line 274 and Line 276, `broadcast::channel` will panic if either capacity is `0`, and this public helper currently accepts unvalidated input.

🔧 Suggested hardening (non-breaking)
```diff
 pub fn create_process_event_buses_with_capacity(
     control_event_capacity: usize,
     memory_event_capacity: usize,
 ) -> (
     tokio::sync::broadcast::Sender<ProcessEvent>,
     tokio::sync::broadcast::Sender<ProcessEvent>,
 ) {
-    let (event_tx, _event_rx) = tokio::sync::broadcast::channel(control_event_capacity);
-    let (memory_event_tx, _memory_event_rx) =
-        tokio::sync::broadcast::channel(memory_event_capacity);
+    let control_event_capacity = control_event_capacity.max(1);
+    let memory_event_capacity = memory_event_capacity.max(1);
+    let (event_tx, _event_rx) = tokio::sync::broadcast::channel(control_event_capacity);
+    let (memory_event_tx, _memory_event_rx) =
+        tokio::sync::broadcast::channel(memory_event_capacity);
     (event_tx, memory_event_tx)
 }
```

Does `tokio::sync::broadcast::channel(0)` panic? Please check the Tokio docs “Panics” section for `broadcast::channel`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib.rs` around lines 267 - 277, The function create_process_event_buses_with_capacity currently calls tokio::sync::broadcast::channel with unvalidated capacities which will panic for 0; before calling channel, clamp control_event_capacity and memory_event_capacity to a minimum of 1 (e.g., let control_capacity = control_event_capacity.max(1); let memory_capacity = memory_event_capacity.max(1);) and use those local values when creating event_tx and memory_event_tx so the public helper never passes 0 into broadcast::channel.
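The clamping itself is plain arithmetic and can be checked in isolation. A minimal stdlib sketch of the non-breaking guard (the helper name is illustrative; the real helper constructs tokio broadcast channels with these values):

```rust
// Clamp caller-supplied bus capacities to at least 1, since a capacity of 0
// would panic inside tokio's broadcast::channel constructor.
fn clamped_capacities(control_event_capacity: usize, memory_event_capacity: usize) -> (usize, usize) {
    (control_event_capacity.max(1), memory_event_capacity.max(1))
}

fn main() {
    // Zero inputs are lifted to the minimum safe capacity.
    assert_eq!(clamped_capacities(0, 0), (1, 1));
    // Valid inputs pass through unchanged, so the change is non-breaking.
    assert_eq!(clamped_capacities(256, 64), (256, 64));
}
```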
🧹 Nitpick comments (8)
docs/design-docs/cortex-chat.md (1)
119-120: Consider specifying the exact 409 response contract.

Please document whether the `409` includes a JSON error body and/or `Retry-After` header so clients can implement deterministic retry behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design-docs/cortex-chat.md` around lines 119 - 120, Clarify the 409 CONFLICT contract by specifying both the JSON error body and Retry-After header format: state that the endpoint returns HTTP 409 with a JSON payload containing standardized fields (e.g., an error code like "session_busy" and a human-readable message plus an optional numeric retry_after_seconds) and also sets a Retry-After header with the retry interval in seconds; update the sentence about "409 CONFLICT" to include these exact field names and header semantics so clients can implement deterministic retry behavior.prompts/en/fragments/worker_capabilities.md.j2 (1)
47-48: Clarify “default” language to avoid ambiguity.

Line 48 is useful, but paired with Line 7 (“Builtin Workers (default)”) it can sound like two competing defaults. Consider making scope explicit (global vs code-task default).
Suggested wording
```diff
-Default to OpenCode for code work when it's enabled. Use builtin workers for short command/file tasks.
+Default to builtin workers for quick tasks. For code work, default to OpenCode when enabled.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@prompts/en/fragments/worker_capabilities.md.j2` around lines 47 - 48, The current phrasing creates two competing "defaults" between the phrase "Default to OpenCode for code work when it's enabled" and the header "Builtin Workers (default)"; update the copy to make the scope explicit by stating that "OpenCode" is the default for code-related tasks when enabled while "Builtin Workers" remain the default for short command/file tasks (or for global/non-code operations). Locate and edit the strings "Default to OpenCode for code work when it's enabled" and "Builtin Workers (default)" in worker_capabilities.md.j2 so each specifies its scope (e.g., "default for code tasks" vs "global/default for short tasks") to remove ambiguity.prompts/en/worker.md.j2 (1)
39-39: Clarify this capability in both passthrough branches.

Line 39 documents unrestricted `file` access only when sandbox is fully disabled, but Line 32 already defines another passthrough mode (enabled-in-config, unsupported backend). If `file` behaves the same there, mirror this sentence in that branch too so worker behavior doesn’t drift by host/backend.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@prompts/en/worker.md.j2` at line 39, Update the documentation in prompts/en/worker.md.j2 to clarify that the `file` tool can access any path readable by the process in both passthrough branches: add the sentence "The `file` tool can access any path readable by the process (no workspace restriction)." to the branch describing the fully disabled sandbox and mirror that exact sentence into the other passthrough branch that covers the enabled-in-config/unsupported-backend case so both branches state identical `file` access behavior.src/agent/status.rs (1)
322-332: Add symmetric test coverage for `remove_worker`.

Nice branch-removal coverage; I’d add the equivalent worker-removal test so both new public removal paths are protected.
Proposed test addition
```diff
 #[test]
 fn remove_branch_removes_existing_branch() {
     let mut status = StatusBlock::new();
     let branch_id = Uuid::new_v4();
     status.add_branch(branch_id, "work");
     let removed = status.remove_branch(branch_id);
     assert!(removed);
     assert!(status.active_branches.is_empty());
 }
+
+#[test]
+fn remove_worker_removes_existing_worker() {
+    let mut status = StatusBlock::new();
+    let worker_id = Uuid::new_v4();
+    status.add_worker(worker_id, "task", false);
+
+    let removed = status.remove_worker(worker_id);
+
+    assert!(removed);
+    assert!(status.active_workers.is_empty());
+}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agent/status.rs` around lines 322 - 332, Add a symmetric unit test for remove_worker similar to remove_branch_removes_existing_branch: create a StatusBlock via StatusBlock::new(), add a worker id using add_worker (use Uuid::new_v4()), call remove_worker(worker_id), assert the call returns true and that status.active_workers is empty; name it something like remove_worker_removes_existing_worker so both public removal paths (remove_branch and remove_worker) are covered.src/memory/store.rs (1)
255-265: Add focused test coverage for the new association cleanup helper.
`delete_associations_for_memory` is a deletion primitive; please add a unit test asserting both incoming and outgoing associations are removed and `rows_affected` matches expectation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/memory/store.rs` around lines 255 - 265, Add a focused unit test for delete_associations_for_memory that sets up associations referencing a given memory id both as source_id and as target_id, calls MemoryStore::delete_associations_for_memory(memory_id), asserts the returned rows_affected equals the number of inserted rows, and verifies no rows remain in the associations table for that memory id; use the existing test DB harness or a transaction/fixture to insert the test associations, clean up after the test, and reference the delete_associations_for_memory function, associations table, and rows_affected in your assertions.src/hooks/spacebot.rs (1)
435-438: Remove duplicate comment.

Lines 435-438 contain a duplicate comment block that repeats “Send event without blocking. Truncate args to keep broadcast payloads bounded.”
Proposed fix
```diff
-    // Send event without blocking. Truncate args to keep broadcast payloads bounded.
     // Send event without blocking. Truncate args to keep broadcast payloads bounded.
     let capped_args = crate::tools::truncate_output(args, 2_000);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/hooks/spacebot.rs` around lines 435 - 438, Duplicate comment lines were introduced above the creation of capped_args using crate::tools::truncate_output; remove the redundant comment so only a single descriptive line remains. Locate the block where capped_args is assigned (the call to truncate_output) and delete the extra repeated comment text "Send event without blocking. Truncate args to keep broadcast payloads bounded" leaving one occurrence directly above the capped_args assignment.src/config.rs (2)
2669-2672: Avoid silently dropping parse errors in the unknown-key pre-pass.

Returning on parse failure here is understandable, but adding a debug log would preserve diagnostics without changing behavior.
♻️ Suggested tweak
```diff
 fn warn_unknown_config_keys(content: &str) {
     let table: toml::Table = match content.parse() {
-        Ok(t) => t,
-        Err(_) => return, // the typed parse will report the real error
+        Ok(table) => table,
+        Err(error) => {
+            tracing::debug!(
+                %error,
+                "skipping unknown-top-level-key warning pass due to TOML parse error"
+            );
+            return; // the typed parse will report the real error
+        }
     };
```

As per coding guidelines: “Don't silently discard errors; use `let _ =` only on channel sends where the receiver may be dropped; handle, log, or propagate all other errors.”
Verify each finding against the current code and only fix it if needed. In `@src/config.rs` around lines 2669 - 2672, The current pre-pass silently returns on toml parse failure (the match around content.parse() that assigns table: toml::Table), losing diagnostics; modify that match to capture the Err(e) and emit a debug log containing the parsing error and context (e.g., the filename/variable) before returning, e.g., match content.parse() { Ok(t) => t, Err(e) => { debug!("...: {}", e); return } }, so behavior stays the same but diagnostics are preserved; update the match handling around content.parse() and the table: toml::Table binding accordingly.
2650-2663: `KNOWN_TOP_LEVEL_KEYS` is easy to drift from `TomlConfig` over time.

Consider adding a small guard test that fails when a new top-level TOML field is added without updating this list, to avoid false warnings during future config evolution.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/config.rs` around lines 2650 - 2663, Add a unit test that serializes the TomlConfig struct to a toml::Value (or serde JSON value) and compares the top-level field names against the KNOWN_TOP_LEVEL_KEYS constant; if the sets differ the test should fail, forcing updates to KNOWN_TOP_LEVEL_KEYS whenever TomlConfig changes. Locate TomlConfig and KNOWN_TOP_LEVEL_KEYS in src/config.rs, derive/ensure serde Serialize on TomlConfig (or construct a representative instance), collect the top-level keys from the serialized value, and assert equality with KNOWN_TOP_LEVEL_KEYS so the CI catches any drift.
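The guard test's core is just a set difference. A stdlib-only sketch, assuming the key names and the `unknown_keys` helper are illustrative (a real test would derive `actual` by serializing `TomlConfig` and `known` from `KNOWN_TOP_LEVEL_KEYS`):

```rust
use std::collections::BTreeSet;

// Report any top-level key present in the serialized config but missing from
// the allow-list; the guard test asserts this comes back empty.
fn unknown_keys(known: &[&str], actual: &[&str]) -> Vec<String> {
    let known: BTreeSet<&str> = known.iter().copied().collect();
    actual
        .iter()
        .filter(|key| !known.contains(*key))
        .map(|key| key.to_string())
        .collect()
}

fn main() {
    // Hypothetical key names for illustration only.
    let known_top_level_keys = ["agents", "cortex", "memory"];
    let serialized_keys = ["agents", "cortex", "memory", "maintenance"];

    // A field added to the config but not to the allow-list trips the guard.
    assert_eq!(
        unknown_keys(&known_top_level_keys, &serialized_keys),
        vec!["maintenance".to_string()]
    );
}
```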
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/agent/channel_dispatch.rs`:
- Around line 620-633: Panic payload text is logged and returned without
secret-scrubbing; pass the constructed panic_message through the same secret
scrubber used for non-panic errors before emitting it. Locate the panic branch
that builds panic_message (references: panic_payload, panic_message,
tracing::error, WorkerCompletionError::failed) and replace usages with the
scrubbed version (e.g., let scrubbed = scrubber.scrub(&panic_message)) — use the
scrubbed string for the tracing::error fields and for the
WorkerCompletionError::failed payload so sensitive data is removed consistently.
In `@src/agent/channel.rs`:
- Around line 1752-1762: The early return when
self.state.active_workers.write().await.remove(worker_id) is None skips
necessary completion work; change the logic so removal result is checked but do
not return early — always call run_logger.log_worker_completed(*worker_id,
result, *success) and always attempt to remove entries from
self.state.worker_handles and self.state.worker_inputs (and perform any
retrigger/enqueue logic) even if the worker was not present in active_workers;
keep the drop(workers) before logging to avoid holding the write lock while
calling run_logger, and ensure removals are idempotent (i.e., ignore missing
entries) so repeated/late WorkerComplete events are safe.
In `@src/agent/worker.rs`:
- Around line 387-395: The code directly assigns self.state =
WorkerState::Failed and self.state = WorkerState::WaitingForInput; replace these
direct assignments with the state-machine helpers: call
self.can_transition_to(...) (using matches! on current state as the other
transitions do) and then self.transition_to(...) so transitions are validated;
if can_transition_to returns false, propagate or surface a runtime error (panic
or return Err) consistent with other uses of transition_to()/can_transition_to()
to enforce the state-machine rules for WorkerState transitions instead of
mutating self.state directly.
In `@src/api/settings.rs`:
- Around line 343-353: The current iterator over runtime_configs uses filter_map
and silently drops agents missing in mcp_managers (symbols: runtime_configs,
mcp_managers, reload_targets, agent_id), so change the logic to explicitly
handle the missing case: iterate over runtime_configs with .map or .filter_map
that first checks mcp_managers.get(agent_id) and if missing emit a warning via
the logger (or return an Err/Result) so the missing agent is not discarded
silently; only collect the (agent_id.clone(), runtime_config.clone(),
mcp_manager.clone()) tuples when a manager exists, and ensure you log the
agent_id and reason when skipping so the omission is observable.
In `@src/config.rs`:
- Around line 5763-5764: The hot-swap of the ArcSwap field opencode_server_pool
should explicitly call shutdown_all() on the old OpenCodeServerPool before
replacing it: load the current pool via
self.opencode_server_pool.load().as_ref().clone() to capture the old Arc,
construct the new OpenCodeServerPool, call
self.opencode_server_pool.store(Arc::new(new_pool)), then await
old_pool.shutdown_all() to drain and kill servers instead of relying on Drop;
apply this pattern wherever opencode_server_pool is replaced (e.g., in the
blocks around the existing replacements).
In `@src/conversation/history.rs`:
- Around line 375-397: The DB update in reconcile_running_workers_for_agent and
the awaited worker-log persistence calls inside ProcessRunLogger must be
converted to fire-and-forget tasks: for reconcile_running_workers_for_agent (and
any ProcessRunLogger methods that currently await .execute against self.pool)
move the SQL query + .execute call into a tokio::spawn(async move { ... }) so
the caller is not held by SQLite latency; ensure the closure clones/owns the
pool and the failure_message/agent_id (or other params), await the query inside
the spawned task and handle/map errors there (e.g., processLogger.error or
tracing::error) rather than propagating them, and remove the original .await
from the caller path (or replace the call with spawning the task). Ensure
signatures that must return sync results are adjusted to not await DB writes or
provide a separate non-blocking helper that spawns the task.
In `@src/main.rs`:
- Around line 1662-1680: The spawned task that runs channel.run() cleans the
registry and api_state (process_control_registry.unregister_channel,
api_state_for_cleanup.unregister_channel_status, unregister_channel_state) but
never evicts the corresponding entry from active_channels, so the first inbound
message after exit is dropped; fix by creating a mpsc channel pair during main
loop setup (closed_channel_tx / closed_channel_rx via mpsc::channel::<String>)
and, after channel.run() returns, send the cleanup_channel_id over
closed_channel_tx (from the spawned task) so the main loop's tokio::select!
branch that matches Some(closed_id) = closed_channel_rx.recv() can call
active_channels.remove(&closed_id) to immediately evict the stale entry; ensure
closed_channel_tx is cloned into the spawn and handle send failures gracefully.
In `@src/memory/maintenance.rs`:
- Around line 251-270: The active_survivor snapshot becomes stale after calling
merge_pair which persists merged content; to fix, reload the updated survivor
from the persistent store (or otherwise refresh its snapshot) immediately after
merge_pair completes and before any further merging decisions—i.e., after
merge_pair(...) .await? and after merged_memory_ids/merged_count updates,
replace active_survivor with the freshly fetched memory for winner.id (or
recompute its embeddings/text) so subsequent iterations use the persisted,
up-to-date content; apply the same change in the other loop region (lines
~311-379) that uses active_survivor, and reference choose_merge_pair,
merge_pair, active_survivor, winner, loser, and source_id when locating the code
to modify.
In `@src/opencode/worker.rs`:
- Around line 344-349: The logging currently slices the detected secret bytes
unsafely using &leak[..leak.len().min(8)], which can panic on UTF‑8 multi‑byte
boundaries; replace both occurrences with a char‑safe truncation so the prefix
is taken by characters rather than bytes (e.g., build a prefix with
leak.chars().take(8).collect::<String>() or compute a safe byte index via
leak.char_indices() and use that). Update the calls around scan_for_leaks(), the
local variable leak, and the tracing::warn/tracing::info invocations to use the
new char-safe prefix value.
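The byte-vs-char distinction behind this fix can be shown with a stdlib-only sketch (the `char_safe_prefix` helper name is illustrative):

```rust
// Char-safe prefix for log redaction: `&leak[..n]` slices by byte offset and
// panics when `n` lands inside a multi-byte UTF-8 character, while collecting
// the first `max_chars` chars is always safe.
fn char_safe_prefix(leak: &str, max_chars: usize) -> String {
    leak.chars().take(max_chars).collect()
}

fn main() {
    // Byte index 8 falls inside the two-byte 'é' here, so `&leak[..8]` would panic.
    let leak = "token-sécret-value";
    assert!(!leak.is_char_boundary(8));
    assert_eq!(char_safe_prefix(leak, 8), "token-sé");
}
```

The original `&leak[..leak.len().min(8)]` only happens to work when the first eight bytes are ASCII; any multi-byte character straddling the cut triggers the panic.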
In `@src/tools/file.rs`:
- Around line 216-232: The protection for identity/memory files should be
unconditional: remove the sandbox-mode conditional around the protected-file
check and always enforce the guard in the file write path. Locate the block that
calls self.sandbox.mode_enabled() and the PROTECTED_FILES constant (and the
file_name extraction via path.file_name().and_then(|n| n.to_str())), and change
the logic so that PROTECTED_FILES.iter().any(...) immediately returns the
FileError directing the caller to use the identity management API regardless of
sandbox state; keep the existing error message and FileError return shape.
In `@src/tools/reply.rs`:
- Around line 375-384: The trace currently logs a substring of the detected
secret from scan_for_leaks (variable leak) via leak_prefix, which exposes
sensitive data; update the logging in the block that calls
crate::secrets::scrub::scan_for_leaks to never include any characters of the
leak string—use a non-sensitive indicator such as leak_len = leak.len() or a
boolean leak_detected = true (and keep conversation_id) in the tracing::error
call, then return the same ReplyError; modify the code around the leak variable
and the tracing::error invocation in src/tools/reply.rs to remove any substring
access (e.g., &leak[..]) and log only length or detection status instead.
---
Duplicate comments:
In `@docs/design-docs/cortex-implementation.md`:
- Around line 19-21: The document currently contradicts itself by saying the
maintenance functions are "Implemented but never called" while also stating they
are "wired into the cortex loop"; update the wording to a single clear state:
either mark memory/maintenance.rs and its functions apply_decay(),
prune_memories(), and merge_similar_memories() as "Implemented and wired into
the cortex loop" or as "Implemented but not yet wired into the cortex loop"
(choose the true runtime state), and make the cortex loop description consistent
with that choice so there is no ambiguity.
- Around line 86-89: Two fenced code blocks in the document are unlabeled and
trigger MD040; add a language identifier (e.g., "text") to each opening fence
for the blocks containing the lines "0 active -> 1 completing -> 3 terminal / 0
active -> 2 killing -> 3 terminal" and the block starting with "Phase 1 (tick
loop) — standalone, just plumbing" so the fences become ```text and satisfy
the linter.
In `@src/agent/process_control.rs`:
- Around line 102-112: The code unconditionally calls
self.channels.write().await.remove(channel_id) when Weak::upgrade() fails, which
can delete a newly-registered handle; change the cleanup to be race-safe by
taking the write lock and removing the entry only if the stored Weak pointer is
the same stale Weak you originally looked up (or use the HashMap entry API to
compare-and-remove): locate the code around
channels.read().await.get(channel_id).cloned() and the subsequent upgrade()
check in process_control.rs and replace the unconditional remove(channel_id)
with a conditional remove that verifies pointer identity of the Weak handle (or
re-check the map entry equals the cloned Weak) before removing, preserving
correct behavior for cancel_worker_with_reason and ControlActionResult paths.
In `@src/api/config.rs`:
- Around line 621-632: The maintenance parameters are written into the TOML
without validation; add range checks before mutating table: for
maintenance_decay_rate, maintenance_prune_threshold, and
maintenance_merge_similarity_threshold (the cortex fields referenced) ensure
their values are within 0.0..=1.0 and for maintenance_min_age_days ensure value
is >= 0; if a value is out of bounds, return an error (or propagate a validation
error) instead of setting table[...] — perform these checks immediately before
the existing assignments that set table["maintenance_decay_rate"],
table["maintenance_prune_threshold"], table["maintenance_min_age_days"], and
table["maintenance_merge_similarity_threshold"] so invalid inputs are rejected
prior to TOML mutation.
- Around line 603-605: The code is performing lossy casts like
`cortex.supervisor_kill_budget_per_tick as i64` (and similarly
`maintenance_interval_secs as i64`) which can wrap negative for oversized
inputs; change these assignments to validate and convert safely using
`i64::try_from(...)` or `usize::try_into::<i64>()` and handle the Err case
(return an error, skip persisting with a logged warning, or clamp to a safe max)
instead of an unchecked `as` cast so only representable values are written into
`table["supervisor_kill_budget_per_tick"]` and
`table["maintenance_interval_secs"]`. Ensure you update the code paths around
the `table[...] = toml_edit::value(...)` assignments (references:
cortex.supervisor_kill_budget_per_tick, maintenance_interval_secs) to perform
the checked conversion and appropriate error handling or fallback.
In `@src/lib.rs`:
- Around line 267-277: The function create_process_event_buses_with_capacity
currently calls tokio::sync::broadcast::channel with unvalidated capacities
which will panic for 0; before calling channel, clamp control_event_capacity and
memory_event_capacity to a minimum of 1 (e.g., let control_capacity =
control_event_capacity.max(1); let memory_capacity =
memory_event_capacity.max(1);) and use those local values when creating event_tx
and memory_event_tx so the public helper never passes 0 into broadcast::channel.
In `@src/memory/maintenance.rs`:
- Around line 324-377: The merge sequence mutates survivor content and rewrites
associations via memory_store.update, get_associations,
delete_associations_for_memory, create_association, creates an Updates
association, calls memory_store.forget and embedding_table.delete with
cancellation checkpoints between them; wrap all of these operations in a single
storage transaction (use whatever transaction primitive exists on MemoryStore —
e.g., begin_transaction/transaction/tx or sqlx::Transaction if present) so the
update + rewire + create Updates + forget + embedding delete are executed
atomically, remove intermediate maintenance_cancelable_op awaits inside the
transactional block (perform a single cancellable check before starting and
ensure cancellation causes a rollback), and on error ensure you rollback the
transaction and only after commit run any non-transactional cleanup (or queue
embedding_table.delete inside the same transaction if supported) so partial
merges cannot persist; reference functions: memory_store.update,
memory_store.get_associations, memory_store.delete_associations_for_memory,
memory_store.create_association, memory_store.forget, embedding_table.delete,
Association::new, maintenance_cancelable_op, check_maintenance_cancellation.
In `@tests/maintenance.rs`:
- Around line 8-11: The temporary directory created inside
make_memory_maintenance_fixture is dropped when the function returns, leaving
embedding_table referencing deleted storage; fix this by extending the TempDir
lifetime (e.g., include the TempDir in the fixture's return value or wrap it in
an Arc and return it alongside std::sync::Arc<MemoryStore> and
spacebot::memory::EmbeddingTable) so the TempDir is kept alive for the entire
test; update all call sites and other fixtures mentioned (the other occurrences
around where TempDir is created) to accept/propagate the extra TempDir holder to
prevent use-after-drop of embedding_table storage.
---
Nitpick comments:
In `@docs/design-docs/cortex-chat.md`:
- Around line 119-120: Clarify the 409 CONFLICT contract by specifying both the
JSON error body and Retry-After header format: state that the endpoint returns
HTTP 409 with a JSON payload containing standardized fields (e.g., an error code
like "session_busy" and a human-readable message plus an optional numeric
retry_after_seconds) and also sets a Retry-After header with the retry interval
in seconds; update the sentence about "409 CONFLICT" to include these exact
field names and header semantics so clients can implement deterministic retry
behavior.
In `@prompts/en/fragments/worker_capabilities.md.j2`:
- Around line 47-48: The current phrasing creates two competing "defaults"
between the phrase "Default to OpenCode for code work when it's enabled" and the
header "Builtin Workers (default)"; update the copy to make the scope explicit
by stating that "OpenCode" is the default for code-related tasks when enabled
while "Builtin Workers" remain the default for short command/file tasks (or for
global/non-code operations). Locate and edit the strings "Default to OpenCode
for code work when it's enabled" and "Builtin Workers (default)" in
worker_capabilities.md.j2 so each specifies its scope (e.g., "default for code
tasks" vs "global/default for short tasks") to remove ambiguity.
In `@prompts/en/worker.md.j2`:
- Line 39: Update the documentation in prompts/en/worker.md.j2 to clarify that
the `file` tool can access any path readable by the process in both passthrough
branches: add the sentence "The `file` tool can access any path readable by the
process (no workspace restriction)." to the branch describing the fully disabled
sandbox and mirror that exact sentence into the other passthrough branch that
covers the enabled-in-config/unsupported-backend case so both branches state
identical `file` access behavior.
In `@src/agent/status.rs`:
- Around line 322-332: Add a symmetric unit test for remove_worker similar to
remove_branch_removes_existing_branch: create a StatusBlock via
StatusBlock::new(), add a worker id using add_worker (use Uuid::new_v4()), call
remove_worker(worker_id), assert the call returns true and that
status.active_workers is empty; name it something like
remove_worker_removes_existing_worker so both public removal paths
(remove_branch and remove_worker) are covered.
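A minimal sketch of the suggested test, using a stand-in `StatusBlock` with integer ids in place of the real struct and `Uuid`:

```rust
use std::collections::HashSet;

/// Minimal stand-in for the real `StatusBlock`; only the pieces the
/// suggested test touches are modeled here.
#[derive(Default)]
struct StatusBlock {
    active_workers: HashSet<u64>, // the real type stores `Uuid`s
}

impl StatusBlock {
    fn new() -> Self {
        Self::default()
    }

    fn add_worker(&mut self, worker_id: u64) {
        self.active_workers.insert(worker_id);
    }

    /// Returns true when the worker was present and removed.
    fn remove_worker(&mut self, worker_id: u64) -> bool {
        self.active_workers.remove(&worker_id)
    }
}

fn remove_worker_removes_existing_worker() -> bool {
    let mut status = StatusBlock::new();
    let worker_id = 42;
    status.add_worker(worker_id);
    // both conditions the suggested test asserts
    status.remove_worker(worker_id) && status.active_workers.is_empty()
}
```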
In `@src/config.rs`:
- Around line 2669-2672: The current pre-pass silently returns on toml parse
failure (the match around content.parse() that assigns table: toml::Table),
losing diagnostics; modify that match to capture the Err(e) and emit a debug log
containing the parsing error and context (e.g., the filename/variable) before
returning, e.g., match content.parse() { Ok(t) => t, Err(e) => { debug!("...:
{}", e); return } }, so behavior stays the same but diagnostics are preserved;
update the match handling around content.parse() and the table: toml::Table
binding accordingly.
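The suggested pattern can be sketched with stand-ins (`i32` parsing in place of `toml`, `eprintln!` in place of `tracing::debug!`) — behavior stays an early return, but the error is recorded first:

```rust
/// Stand-in for `tracing::debug!`.
fn log_debug(message: &str) {
    eprintln!("DEBUG: {message}");
}

fn parse_or_log(content: &str, source_name: &str) -> Option<i32> {
    let value = match content.parse::<i32>() {
        Ok(value) => value,
        Err(error) => {
            // Behavior is unchanged (we still bail out), but the
            // diagnostic is no longer silently discarded.
            log_debug(&format!("skipping {source_name}: parse failed: {error}"));
            return None;
        }
    };
    Some(value)
}
```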
- Around line 2650-2663: Add a unit test that serializes the TomlConfig struct
to a toml::Value (or serde JSON value) and compares the top-level field names
against the KNOWN_TOP_LEVEL_KEYS constant; if the sets differ the test should
fail, forcing updates to KNOWN_TOP_LEVEL_KEYS whenever TomlConfig changes.
Locate TomlConfig and KNOWN_TOP_LEVEL_KEYS in src/config.rs, derive/ensure serde
Serialize on TomlConfig (or construct a representative instance), collect the
top-level keys from the serialized value, and assert equality with
KNOWN_TOP_LEVEL_KEYS so the CI catches any drift.
In `@src/hooks/spacebot.rs`:
- Around line 435-438: Duplicate comment lines were introduced above the
creation of capped_args using crate::tools::truncate_output; remove the
redundant comment so only a single descriptive line remains. Locate the block
where capped_args is assigned (the call to truncate_output) and delete the extra
repeated comment text "Send event without blocking. Truncate args to keep
broadcast payloads bounded" leaving one occurrence directly above the
capped_args assignment.
In `@src/memory/store.rs`:
- Around line 255-265: Add a focused unit test for
delete_associations_for_memory that sets up associations referencing a given
memory id both as source_id and as target_id, calls
MemoryStore::delete_associations_for_memory(memory_id), asserts the returned
rows_affected equals the number of inserted rows, and verifies no rows remain in
the associations table for that memory id; use the existing test DB harness or a
transaction/fixture to insert the test associations, clean up after the test,
and reference the delete_associations_for_memory function, associations table,
and rows_affected in your assertions.
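The suggested test shape, sketched against an in-memory stand-in rather than the real SQL-backed `MemoryStore` — associations referencing the memory as source or target are inserted, deleted, and counted:

```rust
/// In-memory stand-in for the associations table.
struct Association {
    source_id: u32,
    target_id: u32,
}

struct MemoryStore {
    associations: Vec<Association>,
}

impl MemoryStore {
    /// Mirrors the real API's shape: deletes rows where the memory
    /// appears as either source or target, returning rows_affected.
    fn delete_associations_for_memory(&mut self, memory_id: u32) -> usize {
        let before = self.associations.len();
        self.associations
            .retain(|a| a.source_id != memory_id && a.target_id != memory_id);
        before - self.associations.len()
    }
}
```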
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8217132e-dad6-485b-90f6-645b4192be41
⛔ Files ignored due to path filters (15)
- `Cargo.lock` is excluded by `!**/*.lock`
- `Cargo.toml` is excluded by `!**/*.toml`
- `vendor/imap-proto-0.10.2/Cargo.toml` is excluded by `!**/*.toml`, `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/body.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/body_structure.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/builders/command.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/builders/mod.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/core.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/lib.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/macros.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/parser/mod.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/parser/rfc3501.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/parser/rfc4551.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/parser/rfc5464.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/types.rs` is excluded by `!**/vendor/**`
📒 Files selected for processing (64)
`README.md`, `docs/content/docs/(configuration)/config.mdx`, `docs/content/docs/(configuration)/sandbox.mdx`, `docs/content/docs/(core)/architecture.mdx`, `docs/content/docs/(features)/workers.mdx`, `docs/design-docs/cortex-chat.md`, `docs/design-docs/cortex-history.md`, `docs/design-docs/cortex-implementation.md`, `docs/design-docs/tool-nudging.md`, `prompts/en/channel.md.j2`, `prompts/en/fragments/worker_capabilities.md.j2`, `prompts/en/tools/spawn_worker_description.md.j2`, `prompts/en/worker.md.j2`, `src/agent.rs`, `src/agent/branch.rs`, `src/agent/channel.rs`, `src/agent/channel_dispatch.rs`, `src/agent/channel_history.rs`, `src/agent/compactor.rs`, `src/agent/cortex.rs`, `src/agent/cortex_chat.rs`, `src/agent/ingestion.rs`, `src/agent/invariant_harness.rs`, `src/agent/process_control.rs`, `src/agent/status.rs`, `src/agent/worker.rs`, `src/api/agents.rs`, `src/api/channels.rs`, `src/api/config.rs`, `src/api/cortex.rs`, `src/api/server.rs`, `src/api/settings.rs`, `src/api/state.rs`, `src/api/system.rs`, `src/config.rs`, `src/conversation/channels.rs`, `src/conversation/history.rs`, `src/hooks.rs`, `src/hooks/spacebot.rs`, `src/lib.rs`, `src/llm/model.rs`, `src/main.rs`, `src/memory/maintenance.rs`, `src/memory/store.rs`, `src/messaging/discord.rs`, `src/opencode/types.rs`, `src/opencode/worker.rs`, `src/sandbox.rs`, `src/telemetry/registry.rs`, `src/tools.rs`, `src/tools/cancel.rs`, `src/tools/exec.rs`, `src/tools/file.rs`, `src/tools/memory_save.rs`, `src/tools/reply.rs`, `src/tools/send_file.rs`, `src/tools/shell.rs`, `src/tools/spawn_worker.rs`, `src/tools/worker_inspect.rs`, `tests/bulletin.rs`, `tests/context_dump.rs`, `tests/detached_worker_bootstrap.rs`, `tests/maintenance.rs`, `tests/tool_nudge.rs`
✅ Files skipped from review due to trivial changes (1)
- docs/design-docs/tool-nudging.md
🚧 Files skipped from review as they are similar to previous changes (3)
- tests/bulletin.rs
- docs/design-docs/cortex-history.md
- docs/content/docs/(configuration)/config.mdx
```rust
Err(panic_payload) => {
    let panic_message = panic_payload
        .downcast_ref::<&str>()
        .map(|message| (*message).to_string())
        .or_else(|| panic_payload.downcast_ref::<String>().cloned())
        .unwrap_or_else(|| "unknown panic payload".to_string());
    tracing::error!(
        worker_id = %worker_id,
        panic_message = %panic_message,
        "worker task panicked"
    );
    Err(WorkerCompletionError::failed(format!(
        "worker task panicked: {panic_message}"
    )))
```
Scrub panic payload text before emitting worker completion result.
The panic branch builds panic_message and sends it downstream without passing through the secret scrubber, unlike the non-panic failure path.
🔒 Suggested fix
```diff
 Err(panic_payload) => {
     let panic_message = panic_payload
         .downcast_ref::<&str>()
         .map(|message| (*message).to_string())
         .or_else(|| panic_payload.downcast_ref::<String>().cloned())
         .unwrap_or_else(|| "unknown panic payload".to_string());
+    let panic_message = if let Some(store) = &secrets_store {
+        crate::secrets::scrub::scrub_with_store(&panic_message, store)
+    } else {
+        panic_message
+    };
     tracing::error!(
         worker_id = %worker_id,
         panic_message = %panic_message,
         "worker task panicked"
     );
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
Err(panic_payload) => {
    let panic_message = panic_payload
        .downcast_ref::<&str>()
        .map(|message| (*message).to_string())
        .or_else(|| panic_payload.downcast_ref::<String>().cloned())
        .unwrap_or_else(|| "unknown panic payload".to_string());
    let panic_message = if let Some(store) = &secrets_store {
        crate::secrets::scrub::scrub_with_store(&panic_message, store)
    } else {
        panic_message
    };
    tracing::error!(
        worker_id = %worker_id,
        panic_message = %panic_message,
        "worker task panicked"
    );
    Err(WorkerCompletionError::failed(format!(
        "worker task panicked: {panic_message}"
    )))
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/agent/channel_dispatch.rs` around lines 620 - 633, Panic payload text is
logged and returned without secret-scrubbing; pass the constructed panic_message
through the same secret scrubber used for non-panic errors before emitting it.
Locate the panic branch that builds panic_message (references: panic_payload,
panic_message, tracing::error, WorkerCompletionError::failed) and replace usages
with the scrubbed version (e.g., let scrubbed = scrubber.scrub(&panic_message))
— use the scrubbed string for the tracing::error fields and for the
WorkerCompletionError::failed payload so sensitive data is removed consistently.
```diff
 let mut workers = self.state.active_workers.write().await;
-workers.remove(worker_id);
+if workers.remove(worker_id).is_none() {
+    return Ok(());
+}
+drop(workers);

 run_logger.log_worker_completed(*worker_id, result, *success);

 self.state.worker_handles.write().await.remove(worker_id);
 self.state.worker_inputs.write().await.remove(worker_id);
```
WorkerComplete can be dropped before convergence/cleanup.
Line 1753 returns early when active_workers has no entry, which skips the rest of completion handling (run logging, retrigger enqueue, and handle/input cleanup) for that event.
🛠️ Suggested fix
```diff
-let mut workers = self.state.active_workers.write().await;
-if workers.remove(worker_id).is_none() {
-    return Ok(());
-}
-drop(workers);
+let removed_worker = self.state.active_workers.write().await.remove(worker_id).is_some();
+let removed_handle = self.state.worker_handles.write().await.remove(worker_id).is_some();
+let removed_input = self.state.worker_inputs.write().await.remove(worker_id).is_some();
+if !(removed_worker || removed_handle || removed_input) {
+    return Ok(());
+}
 run_logger.log_worker_completed(*worker_id, result, *success);
-self.state.worker_handles.write().await.remove(worker_id);
-self.state.worker_inputs.write().await.remove(worker_id);
```

Based on learnings: For changes in async/stateful paths (worker lifecycle, cancellation, retrigger, recall cache behavior), include explicit race/terminal-state reasoning in the PR summary and run targeted tests in addition to just gate-pr.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/agent/channel.rs` around lines 1752 - 1762, The early return when
self.state.active_workers.write().await.remove(worker_id) is None skips
necessary completion work; change the logic so removal result is checked but do
not return early — always call run_logger.log_worker_completed(*worker_id,
result, *success) and always attempt to remove entries from
self.state.worker_handles and self.state.worker_inputs (and perform any
retrigger/enqueue logic) even if the worker was not present in active_workers;
keep the drop(workers) before logging to avoid holding the write lock while
calling run_logger, and ensure removals are idempotent (i.e., ignore missing
entries) so repeated/late WorkerComplete events are safe.
```rust
self.state = WorkerState::Failed;
self.hook.send_status("failed");
follow_up_failure = Some(failure_reason);
break;
}

self.state = WorkerState::WaitingForInput;
self.hook.send_status("waiting for input");
}
```
🛠️ Refactor suggestion | 🟠 Major
Use validated state transitions in the new follow-up path.
Line 387 and Line 393 assign self.state directly, bypassing transition_to() and can_transition_to() checks.
♻️ Suggested change
```diff
-self.state = WorkerState::Failed;
+self.transition_to(WorkerState::Failed)?;
 self.hook.send_status("failed");
 follow_up_failure = Some(failure_reason);
 break;
 }
-self.state = WorkerState::WaitingForInput;
+self.transition_to(WorkerState::WaitingForInput)?;
 self.hook.send_status("waiting for input");
```

As per coding guidelines: Worker state machine must validate transitions with can_transition_to() using matches!; illegal transitions are runtime errors, not silent bugs.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
self.transition_to(WorkerState::Failed)?;
self.hook.send_status("failed");
follow_up_failure = Some(failure_reason);
break;
}
self.transition_to(WorkerState::WaitingForInput)?;
self.hook.send_status("waiting for input");
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/agent/worker.rs` around lines 387 - 395, The code directly assigns
self.state = WorkerState::Failed and self.state = WorkerState::WaitingForInput;
replace these direct assignments with the state-machine helpers: call
self.can_transition_to(...) (using matches! on current state as the other
transitions do) and then self.transition_to(...) so transitions are validated;
if can_transition_to returns false, propagate or surface a runtime error (panic
or return Err) consistent with other uses of transition_to()/can_transition_to()
to enforce the state-machine rules for WorkerState transitions instead of
mutating self.state directly.
```rust
let reload_targets = runtime_configs
    .iter()
    .filter_map(|(agent_id, runtime_config)| {
        mcp_managers.get(agent_id).map(|mcp_manager| {
            (
                agent_id.clone(),
                runtime_config.clone(),
                mcp_manager.clone(),
            )
        })
    })
```
Config reload currently skips some agents silently.
The filter_map drops agents that exist in runtime_configs but not in mcp_managers, so those runtimes never reload and there’s no signal this happened.
💡 Proposed fix
```diff
-let reload_targets = runtime_configs
-    .iter()
-    .filter_map(|(agent_id, runtime_config)| {
-        mcp_managers.get(agent_id).map(|mcp_manager| {
-            (
-                agent_id.clone(),
-                runtime_config.clone(),
-                mcp_manager.clone(),
-            )
-        })
-    })
-    .collect::<Vec<_>>();
+let mut reload_targets = Vec::new();
+for (agent_id, runtime_config) in runtime_configs.iter() {
+    match mcp_managers.get(agent_id) {
+        Some(mcp_manager) => reload_targets.push((
+            agent_id.clone(),
+            runtime_config.clone(),
+            mcp_manager.clone(),
+        )),
+        None => tracing::warn!(
+            agent_id = %agent_id,
+            "settings written but skipped runtime reload (missing MCP manager)"
+        ),
+    }
+}
```

As per coding guidelines: "Don't silently discard errors; use let _ = only on channel sends where the receiver may be dropped; handle, log, or propagate all other errors."
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
let mut reload_targets = Vec::new();
for (agent_id, runtime_config) in runtime_configs.iter() {
    match mcp_managers.get(agent_id) {
        Some(mcp_manager) => reload_targets.push((
            agent_id.clone(),
            runtime_config.clone(),
            mcp_manager.clone(),
        )),
        None => tracing::warn!(
            agent_id = %agent_id,
            "settings written but skipped runtime reload (missing MCP manager)"
        ),
    }
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/api/settings.rs` around lines 343 - 353, The current iterator over
runtime_configs uses filter_map and silently drops agents missing in
mcp_managers (symbols: runtime_configs, mcp_managers, reload_targets, agent_id),
so change the logic to explicitly handle the missing case: iterate over
runtime_configs with .map or .filter_map that first checks
mcp_managers.get(agent_id) and if missing emit a warning via the logger (or
return an Err/Result) so the missing agent is not discarded silently; only
collect the (agent_id.clone(), runtime_config.clone(), mcp_manager.clone())
tuples when a manager exists, and ensure you log the agent_id and reason when
skipping so the omission is observable.
src/config.rs
Outdated
```rust
pub opencode_server_pool: ArcSwap<crate::opencode::OpenCodeServerPool>,
/// Cron store, set after agent initialization.
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Locate OpenCodeServerPool definitions =="
fd 'opencode.*\.rs$' src
echo "== Check for Drop impl on OpenCodeServerPool =="
ast-grep --pattern 'impl Drop for OpenCodeServerPool { $$$ }'
echo "== Inspect lifecycle/cleanup methods and child process ownership =="
rg -n -C3 'OpenCodeServerPool|impl Drop|fn (shutdown|close|stop|terminate)|tokio::process::Command|Child' src
echo "== Inspect all pool swap/load call sites =="
rg -n -C2 'opencode_server_pool\.(store|load)\(' src
Repository: spacedriveapp/spacebot
Length of output: 173
🏁 Script executed:
#!/bin/bash
echo "== Read OpenCodeServerPool definition in src/opencode.rs =="
wc -l src/opencode.rs
echo "== Find OpenCodeServerPool struct definition =="
rg -n 'struct OpenCodeServerPool' src/opencode.rs -A 20
echo "== Search for Drop implementations in opencode.rs =="
rg -n 'impl Drop' src/opencode.rs -B 2 -A 10
echo "== Search for shutdown/cleanup methods on OpenCodeServerPool =="
rg -n 'fn (shutdown|close|stop|terminate|drop)' src/opencode.rs -B 2 -A 5
echo "== Check config.rs lines around 5763-5764 =="
sed -n '5755,5775p' src/config.rs
echo "== Check config.rs lines around 5824 =="
sed -n '5815,5835p' src/config.rs
echo "== Check config.rs lines around 5917-5927 =="
sed -n '5910,5935p' src/config.rs
Repository: spacedriveapp/spacebot
Length of output: 3889
🏁 Script executed:
cat -n src/opencode.rs
Repository: spacedriveapp/spacebot
Length of output: 755
🏁 Script executed:
echo "== Read src/opencode/server.rs =="
wc -l src/opencode/server.rs
echo "== Show first 100 lines of server.rs =="
head -100 src/opencode/server.rs
echo "== Search for OpenCodeServerPool struct and Drop impl =="
rg -n 'struct OpenCodeServerPool|impl Drop|impl.*OpenCodeServerPool' src/opencode/server.rs -A 15
Repository: spacedriveapp/spacebot
Length of output: 5031
🏁 Script executed:
echo "== Search for OpenCodeServerPool Drop impl =="
rg -n 'impl Drop for OpenCodeServerPool' src/opencode/server.rs -A 15
echo "== Show lines around end of file to check for Drop impl =="
tail -100 src/opencode/server.rs | cat -n
Repository: spacedriveapp/spacebot
Length of output: 4330
Call shutdown_all() on replaced OpenCodeServerPool before hot-swap, rather than relying on implicit Drop cleanup.
While OpenCodeServer processes are killed via kill_on_drop(true) when the pool is replaced, the code should explicitly call opencode_server_pool.shutdown_all() on the old pool during hot-swap (lines 5917-5927) for graceful shutdown. The pool provides an explicit shutdown_all() method (in src/opencode/server.rs) that drains servers and calls kill() on each, but the hot-swap only calls .store(), leaving cleanup to implicit Drop semantics. For robust resource management, store the old pool before replacement and call shutdown_all() on it:
```rust
let old_pool = self.opencode_server_pool.load().as_ref().clone();
let new_pool = crate::opencode::OpenCodeServerPool::new(...);
self.opencode_server_pool.store(Arc::new(new_pool));
// Explicit cleanup of old pool
old_pool.shutdown_all().await;
```

Also applies to: 5824-5824, 5917-5927
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/config.rs` around lines 5763 - 5764, The hot-swap of the ArcSwap field
opencode_server_pool should explicitly call shutdown_all() on the old
OpenCodeServerPool before replacing it: load the current pool via
self.opencode_server_pool.load().as_ref().clone() to capture the old Arc,
construct the new OpenCodeServerPool, call
self.opencode_server_pool.store(Arc::new(new_pool)), then await
old_pool.shutdown_all() to drain and kill servers instead of relying on Drop;
apply this pattern wherever opencode_server_pool is replaced (e.g., in the
blocks around the existing replacements).
```diff
 if let Some(leak) = crate::secrets::scrub::scan_for_leaks(&scrubbed) {
     tracing::warn!(
         worker_id = %self.id,
-        "LEAK DETECTED in OpenCode worker output — terminating"
+        leak_prefix = %&leak[..leak.len().min(8)],
+        "potential secret detected in OpenCode worker output"
     );
```
🧩 Analysis chain
🏁 Script executed:
# First, let's locate and examine the file
find . -name "worker.rs" -path "*/opencode/*" 2>/dev/null
Repository: spacedriveapp/spacebot
Length of output: 90
🏁 Script executed:
# Check the specific lines mentioned in the review
cat -n src/opencode/worker.rs | sed -n '340,395p'
Repository: spacedriveapp/spacebot
Length of output: 3586
🏁 Script executed:
# Let's also check the type of the `leak` variable to understand the context better
cat -n src/opencode/worker.rs | sed -n '330,360p'
Repository: spacedriveapp/spacebot
Length of output: 1626
🏁 Script executed:
# Find the secrets module
find . -path "*/secrets/*" -name "*.rs" 2>/dev/null | head -20
Repository: spacedriveapp/spacebot
Length of output: 137
🏁 Script executed:
# Look for the scan_for_leaks function definition
rg "fn scan_for_leaks" -A 5
Repository: spacedriveapp/spacebot
Length of output: 801
🏁 Script executed:
# Search for similar byte-slicing patterns on strings in logging contexts
rg '\[\.\..*\.len\(\)\.min' -A 2 -B 2
Repository: spacedriveapp/spacebot
Length of output: 4254
Fix unsafe byte-slicing of potential secret values in logging.
Lines 347 and 387 use &leak[..leak.len().min(8)] which can panic when byte 8 falls in the middle of a UTF-8 multi-byte character (emojis, accented chars, CJK, etc.). Since scan_for_leaks() returns String, use a char-safe method:
```diff
-leak_prefix = %&leak[..leak.len().min(8)],
+leak_prefix = %leak.chars().take(8).collect::<String>(),
```

This applies identically to both occurrences.
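A runnable illustration of the distinction: a byte index can land mid-character (where `&s[..8]` would panic), while taking characters always stays on a boundary. The helper names here are illustrative, not from the codebase:

```rust
/// Char-safe truncation: never splits a multi-byte character.
fn char_safe_prefix(text: &str, max_chars: usize) -> String {
    text.chars().take(max_chars).collect()
}

/// Reports whether slicing at `max_bytes` would be legal; `&text[..n]`
/// panics whenever `n` is not a char boundary.
fn byte_cut_is_char_boundary(text: &str, max_bytes: usize) -> bool {
    text.is_char_boundary(max_bytes.min(text.len()))
}
```

For ASCII the two approaches agree, but CJK text (3 bytes per character in UTF-8) makes byte offset 8 fall inside a character.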
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
if let Some(leak) = crate::secrets::scrub::scan_for_leaks(&scrubbed) {
    tracing::warn!(
        worker_id = %self.id,
        leak_prefix = %leak.chars().take(8).collect::<String>(),
        "potential secret detected in OpenCode worker output"
    );
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/opencode/worker.rs` around lines 344 - 349, The logging currently slices
the detected secret bytes unsafely using &leak[..leak.len().min(8)], which can
panic on UTF‑8 multi‑byte boundaries; replace both occurrences with a char‑safe
truncation so the prefix is taken by characters rather than bytes (e.g., build a
prefix with leak.chars().take(8).collect::<String>() or compute a safe byte
index via leak.char_indices() and use that). Update the calls around
scan_for_leaks(), the local variable leak, and the tracing::warn/tracing::info
invocations to use the new char-safe prefix value.
```rust
// When sandbox is enabled, identity files must go through the
// dedicated identity API to keep the update flow consistent.
// When sandbox is disabled, the user has opted into full access.
if self.sandbox.mode_enabled() {
    let file_name = path.file_name().and_then(|n| n.to_str()).unwrap_or("");
    const PROTECTED_FILES: &[&str] = &["SOUL.md", "IDENTITY.md", "USER.md"];
    if PROTECTED_FILES
        .iter()
        .any(|f| file_name.eq_ignore_ascii_case(f))
    {
        return Err(FileError(
            "ACCESS DENIED: Identity files are protected and cannot be modified \
             through file operations. Use the identity management API instead."
                .to_string(),
        ));
    }
}
```
Keep identity/memory write protection unconditional in the file tool.
Line 219 makes protection depend on sandbox mode, which allows direct writes to protected identity paths when sandbox is disabled. That bypasses the required tool-routing guard.
🛠️ Suggested fix
```diff
-if self.sandbox.mode_enabled() {
-    let file_name = path.file_name().and_then(|n| n.to_str()).unwrap_or("");
-    const PROTECTED_FILES: &[&str] = &["SOUL.md", "IDENTITY.md", "USER.md"];
-    if PROTECTED_FILES
-        .iter()
-        .any(|f| file_name.eq_ignore_ascii_case(f))
-    {
-        return Err(FileError(
-            "ACCESS DENIED: Identity files are protected and cannot be modified \
-             through file operations. Use the identity management API instead."
-                .to_string(),
-        ));
-    }
-}
+let file_name = path.file_name().and_then(|n| n.to_str()).unwrap_or("");
+const PROTECTED_FILES: &[&str] = &["SOUL.md", "IDENTITY.md", "USER.md"];
+if PROTECTED_FILES
+    .iter()
+    .any(|f| file_name.eq_ignore_ascii_case(f))
+{
+    return Err(FileError(
+        "ACCESS DENIED: Identity files are protected and cannot be modified \
+         through file operations. Use the identity management API instead."
+            .to_string(),
+    ));
+}
```

As per coding guidelines: Workspace path guard: File tools must reject writes to identity/memory paths with an error directing the LLM to the correct tool.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
let file_name = path.file_name().and_then(|n| n.to_str()).unwrap_or("");
const PROTECTED_FILES: &[&str] = &["SOUL.md", "IDENTITY.md", "USER.md"];
if PROTECTED_FILES
    .iter()
    .any(|f| file_name.eq_ignore_ascii_case(f))
{
    return Err(FileError(
        "ACCESS DENIED: Identity files are protected and cannot be modified \
         through file operations. Use the identity management API instead."
            .to_string(),
    ));
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/tools/file.rs` around lines 216 - 232, The protection for identity/memory
files should be unconditional: remove the sandbox-mode conditional around the
protected-file check and always enforce the guard in the file write path. Locate
the block that calls self.sandbox.mode_enabled() and the PROTECTED_FILES
constant (and the file_name extraction via path.file_name().and_then(|n|
n.to_str())), and change the logic so that PROTECTED_FILES.iter().any(...)
immediately returns the FileError directing the caller to use the identity
management API regardless of sandbox state; keep the existing error message and
FileError return shape.
```rust
if let Some(leak) = crate::secrets::scrub::scan_for_leaks(&converted_content) {
    tracing::error!(
        conversation_id = %self.conversation_id,
        leak_prefix = %&leak[..leak.len().min(8)],
        "reply tool blocked content matching secret pattern"
    );
    return Err(ReplyError(
        "blocked reply content: potential secret detected".into(),
    ));
}
```
🧩 Analysis chain
🏁 Script executed:
# Search for scan_for_leaks definition
fd . -type f -name "*.rs" | xargs rg -l "fn scan_for_leaks"
Repository: spacedriveapp/spacebot
Length of output: 296
🏁 Script executed:
# Look for the secrets/scrub module structure
fd . -type f -path "*secrets*" -name "*.rs"
Repository: spacedriveapp/spacebot
Length of output: 236
🏁 Script executed:
# Search for scan_for_leaks definition and usage
rg "fn scan_for_leaks" --type rust -A 10
Repository: spacedriveapp/spacebot
Length of output: 1492
🏁 Script executed:
# Check the scrub module to understand what scan_for_leaks returns
rg "scan_for_leaks" --type rust -B 2 -A 5
Repository: spacedriveapp/spacebot
Length of output: 8592
Avoid logging any portion of the detected secret.
The scan_for_leaks() function returns the actual matched secret content as a String, not pattern metadata. Logging the first 8 characters of line 378 (leak_prefix = %&leak[..leak.len().min(8)]) exposes partial secrets in logs, which:
- Can aid attackers in reconstructing full credentials
- Violates compliance policies that restrict secret retention
- Persists the leaked value indefinitely in log storage
Log the length or detection status instead of any portion of the matched content.
Proposed fix
```diff
 if let Some(leak) = crate::secrets::scrub::scan_for_leaks(&converted_content) {
     tracing::error!(
         conversation_id = %self.conversation_id,
-        leak_prefix = %&leak[..leak.len().min(8)],
+        leak_len = leak.len(),
         "reply tool blocked content matching secret pattern"
     );
     return Err(ReplyError(
         "blocked reply content: potential secret detected".into(),
     ));
 }
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
if let Some(leak) = crate::secrets::scrub::scan_for_leaks(&converted_content) {
    tracing::error!(
        conversation_id = %self.conversation_id,
        leak_len = leak.len(),
        "reply tool blocked content matching secret pattern"
    );
    return Err(ReplyError(
        "blocked reply content: potential secret detected".into(),
    ));
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/tools/reply.rs` around lines 375 - 384, The trace currently logs a
substring of the detected secret from scan_for_leaks (variable leak) via
leak_prefix, which exposes sensitive data; update the logging in the block that
calls crate::secrets::scrub::scan_for_leaks to never include any characters of
the leak string—use a non-sensitive indicator such as leak_len = leak.len() or a
boolean leak_detected = true (and keep conversation_id) in the tracing::error
call, then return the same ReplyError; modify the code around the leak variable
and the tracing::error invocation in src/tools/reply.rs to remove any substring
access (e.g., &leak[..]) and log only length or detection status instead.
87f1038 to 5d6b65c (Compare)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/config.rs (1)
4856-4906: ⚠️ Potential issue | 🟠 Major
Validate maintenance numeric bounds before accepting config.
New maintenance knobs are accepted without guardrails. Values like negative
`maintenance_min_age_days` or out-of-range thresholds can silently produce invalid maintenance behavior.
🛠️ Proposed fix
```diff
+fn validate_cortex_maintenance(scope: &str, cortex: &CortexConfig) -> Result<()> {
+    for (field, value) in [
+        ("maintenance_decay_rate", cortex.maintenance_decay_rate),
+        ("maintenance_prune_threshold", cortex.maintenance_prune_threshold),
+        (
+            "maintenance_merge_similarity_threshold",
+            cortex.maintenance_merge_similarity_threshold,
+        ),
+    ] {
+        if !(0.0..=1.0).contains(&value) {
+            return Err(ConfigError::Invalid(format!(
+                "{scope}.cortex.{field} must be within [0.0, 1.0], got {value}"
+            ))
+            .into());
+        }
+    }
+
+    if cortex.maintenance_min_age_days < 0 {
+        return Err(ConfigError::Invalid(format!(
+            "{scope}.cortex.maintenance_min_age_days must be >= 0, got {}",
+            cortex.maintenance_min_age_days
+        ))
+        .into());
+    }
+
+    Ok(())
+}
```

```diff
-    cortex: toml
+    cortex: {
+        let resolved_cortex = toml
             .defaults
             .cortex
             .map(|c| CortexConfig { ... })
-            .unwrap_or(base_defaults.cortex),
+            .unwrap_or(base_defaults.cortex);
+        validate_cortex_maintenance("defaults", &resolved_cortex)?;
+        resolved_cortex
+    },
```

```diff
-    cortex: a.cortex.map(|c| CortexConfig {
-        ...
-    }),
+    cortex: match a.cortex {
+        Some(c) => {
+            let resolved_cortex = CortexConfig { ... };
+            validate_cortex_maintenance(
+                &format!("agents[{}]", a.id),
+                &resolved_cortex,
+            )?;
+            Some(resolved_cortex)
+        }
+        None => None,
+    },
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/config.rs` around lines 4856 - 4906, The new cortex fields are accepted blindly; add validation in the config build (where these fields are set, e.g., the Cortex config constructor or TryFrom implementation) to check numeric bounds before accepting values: ensure integer durations/counters (maintenance_min_age_days, maintenance_interval_secs, worker_timeout_secs, branch_timeout_secs, supervisor_kill_budget_per_tick, bulletin_max_words, bulletin_max_turns, association_max_per_pass, association_updates_threshold, detached_worker_timeout_retry_limit) are non-negative (and >0 where applicable), ensure probability/ratio thresholds (association_similarity_threshold, maintenance_merge_similarity_threshold, maintenance_prune_threshold, maintenance_decay_rate, circuit_breaker_threshold, bulletin_interval_secs if considered ratio) are within sensible ranges (e.g., 0.0..=1.0 for similarity/decay thresholds), and either return a ConfigError/Err on invalid values or clamp to safe defaults using base_defaults.cortex; implement these checks adjacent to the current unwrap_or usage so invalid incoming values are rejected or corrected before assignment.src/agent/compactor.rs (1)
142-157: ⚠️ Potential issue | 🟠 Major
Add the required 3-strike circuit breaker for recurring compaction failures.
The current failure path only logs and clears
`is_compacting`, so persistent backend failures can retrigger on every check cycle without backoff/disable behavior. As per coding guidelines:
`src/{cron,agent/compactor,agent/cortex}.rs`: Implement circuit breaker: auto-disable recurring tasks after 3 consecutive failures. Apply to cron jobs, maintenance workers, and cortex routines.
♻️ Duplicate comments (10)
src/api/config.rs (2)
619-621: ⚠️ Potential issue | 🟠 Major
Unchecked narrowing on
`maintenance_interval_secs` can persist invalid values. Line 620 uses
`u64 as i64`; values above `i64::MAX` will wrap and write invalid negative TOML integers. Use checked conversion and reject with `400`.
Proposed fix
```diff
-        if let Some(v) = cortex.maintenance_interval_secs {
-            table["maintenance_interval_secs"] = toml_edit::value(v as i64);
+        if let Some(value) = cortex.maintenance_interval_secs {
+            table["maintenance_interval_secs"] =
+                toml_edit::value(i64::try_from(value).map_err(|_| StatusCode::BAD_REQUEST)?);
         }
```
Verify each finding against the current code and only fix it if needed. In `@src/api/config.rs` around lines 619 - 621, The code currently casts cortex.maintenance_interval_secs from u64 to i64 unsafely (in the block that assigns table["maintenance_interval_secs"]), which will wrap values > i64::MAX and produce invalid negative TOML integers; change this to perform a checked conversion (e.g., use i64::try_from or compare against i64::MAX) and only set table["maintenance_interval_secs"] when conversion succeeds, otherwise return a 400 Bad Request error from the surrounding handler/function indicating the value is out of range; update the logic that references maintenance_interval_secs and the error path so callers get a proper rejection instead of writing a wrapped value.
622-633: ⚠️ Potential issue | 🟠 Major
Missing bounds validation for maintenance knobs before persistence.
Lines 622-633 write raw values without range checks. Invalid values (e.g., negative
`maintenance_min_age_days`, or thresholds outside `0.0..=1.0`) should be rejected at write time.
Proposed fix
```diff
+fn validate_unit_interval(field: &'static str, value: f32) -> Result<(), StatusCode> {
+    if (0.0..=1.0).contains(&value) {
+        Ok(())
+    } else {
+        tracing::warn!(field, value, "out-of-range maintenance config value");
+        Err(StatusCode::BAD_REQUEST)
+    }
+}
@@
-        if let Some(v) = cortex.maintenance_decay_rate {
-            table["maintenance_decay_rate"] = toml_edit::value(v as f64);
+        if let Some(value) = cortex.maintenance_decay_rate {
+            validate_unit_interval("maintenance_decay_rate", value)?;
+            table["maintenance_decay_rate"] = toml_edit::value(value as f64);
         }
-        if let Some(v) = cortex.maintenance_prune_threshold {
-            table["maintenance_prune_threshold"] = toml_edit::value(v as f64);
+        if let Some(value) = cortex.maintenance_prune_threshold {
+            validate_unit_interval("maintenance_prune_threshold", value)?;
+            table["maintenance_prune_threshold"] = toml_edit::value(value as f64);
         }
-        if let Some(v) = cortex.maintenance_min_age_days {
-            table["maintenance_min_age_days"] = toml_edit::value(v);
+        if let Some(value) = cortex.maintenance_min_age_days {
+            if value < 0 {
+                tracing::warn!(value, "maintenance_min_age_days must be >= 0");
+                return Err(StatusCode::BAD_REQUEST);
+            }
+            table["maintenance_min_age_days"] = toml_edit::value(value);
         }
-        if let Some(v) = cortex.maintenance_merge_similarity_threshold {
-            table["maintenance_merge_similarity_threshold"] = toml_edit::value(v as f64);
+        if let Some(value) = cortex.maintenance_merge_similarity_threshold {
+            validate_unit_interval("maintenance_merge_similarity_threshold", value)?;
+            table["maintenance_merge_similarity_threshold"] = toml_edit::value(value as f64);
         }
```
Verify each finding against the current code and only fix it if needed. In `@src/api/config.rs` around lines 622 - 633, Validate each maintenance knob Option before writing to the TOML table: for cortex.maintenance_decay_rate, cortex.maintenance_prune_threshold, and cortex.maintenance_merge_similarity_threshold ensure the Some(v) value is within 0.0..=1.0; for cortex.maintenance_min_age_days ensure the Some(v) value is >= 0. Do the checks in the same scope that currently writes to table (i.e., before assigning table["maintenance_*"] = toml_edit::value(...)); if a value is out of bounds return or propagate a clear error (with which field failed and why) instead of persisting the invalid value so the invalid config is rejected at write time.src/agent/worker.rs (1)
386-395: ⚠️ Potential issue | 🟠 Major
Use validated transitions in the follow-up failure/waiting path.
This segment still mutates
`self.state` directly, which bypasses state-machine guards and can hide illegal transitions.
♻️ Suggested fix
```diff
-    if let Err(failure_reason) = follow_up_result {
-        self.state = WorkerState::Failed;
+    if let Err(failure_reason) = follow_up_result {
+        self.transition_to(WorkerState::Failed)?;
         self.hook.send_status("failed");
         follow_up_failure = Some(failure_reason);
         break;
     }
-    self.state = WorkerState::WaitingForInput;
+    self.transition_to(WorkerState::WaitingForInput)?;
     self.hook.send_status("waiting for input");
```
src/agent/worker.rs: Implement worker state machine usingcan_transition_to()withmatches!. Illegal transitions are runtime errors, not silent bugs.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agent/worker.rs` around lines 386 - 395, The code is directly assigning self.state = WorkerState::Failed / WorkerState::WaitingForInput after follow_up_result handling; instead validate transitions via the worker state machine before mutating state: check self.state.can_transition_to(WorkerState::Failed) (and similarly for WorkerState::WaitingForInput) using matches! as per guidelines, and only set self.state when the check passes; if the transition is illegal, convert it into a runtime error/panic or return an Err instead of silently mutating; update the block around follow_up_result, follow_up_failure, and the hook.send_status calls to perform these guarded transitions.src/memory/maintenance.rs (2)
324-377:⚠️ Potential issue | 🔴 Critical
`merge_pair` needs atomicity; current cancel/error points can persist partial rewires.
This block updates survivor content, deletes/recreates associations, writes an updates edge, forgets the loser, and deletes embeddings as separate operations. A mid-sequence cancellation or DB failure can leave the memory graph inconsistent.
251-270: ⚠️ Potential issue | 🔴 Critical
Refresh
`active_survivor` after each successful merge.
`active_survivor` stays stale after `merge_pair(...)`, so the next merge can recompute content from outdated text and overwrite previously merged additions.
♻️ Suggested fix
```diff
-            merge_pair(
+            merge_pair(
                 memory_store,
                 embedding_table,
                 &winner,
                 &loser,
                 maintenance_cancel_rx,
             )
             .await?;
             merged_memory_ids.insert(loser.id.clone());
             merged_count += 1;
@@
-            if winner.id != active_survivor.id {
-                active_survivor = winner;
-            }
+            active_survivor = memory_store
+                .load(&winner.id)
+                .await?
+                .ok_or_else(|| anyhow::anyhow!("merged survivor disappeared during maintenance"))?;
```
Verify each finding against the current code and only fix it if needed. In `@src/memory/maintenance.rs` around lines 251 - 270, active_survivor can become stale after merge_pair because the stored winner document is updated in-place; after each successful merge (where you currently call merge_pair and then use winner/loser), reload the latest survivor from the store and assign it to active_survivor so subsequent choose_merge_pair calls use up-to-date content — i.e., after merge_pair(...).await? insert merged_memory_ids/merged_count logic, then fetch the current winner by id from memory_store (using the winner.id) and set active_survivor = that fresh record (or change merge_pair to return the updated survivor and assign it) before continuing or breaking on source_id.src/api/channels.rs (1)
293-297: ⚠️ Potential issue | 🟠 Major
Don’t collapse cancellation errors into boolean/404 paths.
`is_ok()` and `map_err(|_| NOT_FOUND)` hide real cancellation failures and can misreport server/runtime issues as not found.
🛠️ Proposed fix
```diff
-    if let Some(channel_state) = channel_state
-        && channel_state
-            .cancel_worker_with_reason(worker_id, "cancelled via API")
-            .await
-            .is_ok()
-    {
-        return Ok(Json(CancelProcessResponse {
-            success: true,
-            message: format!("Worker {} cancelled", request.process_id),
-        }));
-    }
+    if let Some(channel_state) = channel_state {
+        match channel_state
+            .cancel_worker_with_reason(worker_id, "cancelled via API")
+            .await
+        {
+            Ok(()) => {
+                return Ok(Json(CancelProcessResponse {
+                    success: true,
+                    message: format!("Worker {} cancelled", request.process_id),
+                }));
+            }
+            Err(error) => {
+                tracing::warn!(
+                    %error,
+                    channel_id = %request.channel_id,
+                    process_id = %request.process_id,
+                    "live worker cancellation failed; trying detached reconciliation"
+                );
+            }
+        }
+    }
 ...
-    channel_state
-        .cancel_branch_with_reason(branch_id, "cancelled via API")
-        .await
-        .map_err(|_| StatusCode::NOT_FOUND)?;
+    channel_state
+        .cancel_branch_with_reason(branch_id, "cancelled via API")
+        .await
+        .map_err(|error| {
+            tracing::warn!(
+                %error,
+                channel_id = %request.channel_id,
+                process_id = %request.process_id,
+                "branch cancellation failed"
+            );
+            StatusCode::INTERNAL_SERVER_ERROR
+        })?;
```
let _ =on Results. Handle them, log them, or propagate them. The only exception is.ok()on channel sends where the receiver may be dropped”.Also applies to: 349-352
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/api/channels.rs` around lines 293 - 297, The current use of .is_ok() (and similar mapping to NOT_FOUND) on channel_state.cancel_worker_with_reason(worker_id, "cancelled via API").await silently discards real errors; instead match on the Result from cancel_worker_with_reason and handle errors explicitly (e.g., log the error with process/server logger and return an appropriate server error or propagate the error instead of converting everything to 404). Update the same pattern around the other occurrence (lines referencing cancel_worker_with_reason at 349-352) so neither path swallows failures — use match/if let Err(e) to log e and return a proper error response rather than .is_ok()/map_err(|_| NOT_FOUND).src/agent/channel_dispatch.rs (1)
620-632: ⚠️ Potential issue | 🟠 Major
Scrub panic payload text before logging/emitting completion error.
The panic branch logs and returns
`panic_message` directly. That bypasses secret scrubbing used in other failure paths and can leak sensitive text.
🔒 Proposed fix
```diff
         Err(panic_payload) => {
             let panic_message = panic_payload
                 .downcast_ref::<&str>()
                 .map(|message| (*message).to_string())
                 .or_else(|| panic_payload.downcast_ref::<String>().cloned())
                 .unwrap_or_else(|| "unknown panic payload".to_string());
+            let panic_message = if let Some(store) = &secrets_store {
+                crate::secrets::scrub::scrub_with_store(&panic_message, store)
+            } else {
+                panic_message
+            };
             tracing::error!(
                 worker_id = %worker_id,
                 panic_message = %panic_message,
                 "worker task panicked"
             );
             Err(WorkerCompletionError::failed(format!(
                 "worker task panicked: {panic_message}"
             )))
         }
```
Verify each finding against the current code and only fix it if needed. In `@src/agent/channel_dispatch.rs` around lines 620 - 632, The panic path builds panic_message from panic_payload and logs it directly (tracing::error) and returns it in WorkerCompletionError::failed, which can leak secrets; fix by passing the derived panic_message through the project's secret-scrubbing helper used elsewhere (the same scrub/redact function applied to other failure paths) before logging and before constructing WorkerCompletionError::failed, i.e., compute panic_message as now, then let scrubbed = scrub_secret_text(panic_message) (or the actual project scrubbing function name), use scrubbed in tracing::error and in WorkerCompletionError::failed.src/agent/channel.rs (1)
1774-1783:⚠️ Potential issue | 🔴 Critical
`WorkerComplete` still exits too early and skips convergence work.
active_workers, Line 1776 returns before completion logging, handle/input cleanup, and retrigger queuing. This can drop synthetic terminal events and late completions.🛠️ Suggested fix
- let mut workers = self.state.active_workers.write().await; - if workers.remove(worker_id).is_none() { - return Ok(()); - } - drop(workers); - - run_logger.log_worker_completed(*worker_id, result, *success); - - self.state.worker_handles.write().await.remove(worker_id); - self.state.worker_inputs.write().await.remove(worker_id); + let removed_worker = self.state.active_workers.write().await.remove(worker_id).is_some(); + let removed_handle = self.state.worker_handles.write().await.remove(worker_id).is_some(); + let removed_input = self.state.worker_inputs.write().await.remove(worker_id).is_some(); + if !(removed_worker || removed_handle || removed_input) { + return Ok(()); + } + + run_logger.log_worker_completed(*worker_id, result, *success);Based on learnings "For changes in async/stateful paths (worker lifecycle, cancellation, retrigger, recall cache behavior), include explicit race/terminal-state reasoning in the PR summary and run targeted tests in addition to
just gate-pr".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agent/channel.rs` around lines 1774 - 1783, The early return when active_workers.remove(worker_id).is_none() skips completion logging, cleanup, and retriggering; change the logic in the WorkerComplete handling so you do not return early: record whether the worker was removed but always call run_logger.log_worker_completed(*worker_id, result, *success), always remove entries from self.state.worker_handles and self.state.worker_inputs, and ensure any retrigger/queuing logic still runs for late or synthetic terminal events; reference active_workers, run_logger.log_worker_completed, worker_handles, worker_inputs and the WorkerComplete handling path when making the change.src/main.rs (1)
1662-1680: ⚠️ Potential issue | 🟠 Major
Channel termination cleanup still leaves stale
`active_channels` entries. After
`channel.run()` returns, cleanup unregisters registry/API state but never signals the main loop to remove `active_channels[conversation_id]`. The first subsequent inbound message for that conversation can still be dropped before retry.
💡 Suggested fix
```diff
-    let cleanup_channel_id = conversation_id.clone();
+    let cleanup_channel_id = conversation_id.clone();
+    let closed_channel_tx = closed_channel_tx.clone();
     let process_control_registry = agent.deps.process_control_registry.clone();
     let api_state_for_cleanup = api_state.clone();
     tokio::spawn(async move {
         if let Err(error) = channel.run().await {
             tracing::error!(%error, "channel event loop failed");
         }
         let scoped_channel_id: spacebot::ChannelId = Arc::from(cleanup_channel_id.as_str());
         process_control_registry
             .unregister_channel(&scoped_channel_id)
             .await;
         api_state_for_cleanup
             .unregister_channel_status(&cleanup_channel_id)
             .await;
         api_state_for_cleanup
             .unregister_channel_state(&cleanup_channel_id)
             .await;
+        let _ = closed_channel_tx.send(cleanup_channel_id).await;
     });
```
just gate-pr".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.rs` around lines 1662 - 1680, The spawned task cleans registry and API state after channel.run() returns but never notifies the main loop to remove the entry from active_channels for conversation_id, leaving a stale entry that can drop the next inbound message; after the existing cleanup steps (after process_control_registry.unregister_channel and api_state_for_cleanup.unregister_channel_*), send a clear/terminate signal to the component that owns active_channels (or call its removal API) to remove active_channels[&conversation_id] (or invoke the main-loop's channel removal method), using the same conversation_id/Arc scoped_channel_id to avoid races and ensure removal happens before returning from the spawned task.docs/design-docs/cortex-implementation.md (1)
19-20: ⚠️ Potential issue | 🟡 Minor
Resolve the contradictory maintenance status text.
Line 19 says “never called,” but Line 20 says these functions are wired into the cortex loop. Please make this section state one clear runtime truth.
📝 Suggested doc fix
```diff
-**Implemented but never called:**
-- `memory/maintenance.rs` — `apply_decay()`, `prune_memories()`, and `merge_similar_memories()` are implemented and wired into the cortex loop.
+**Implemented and wired:**
+- `memory/maintenance.rs` — `apply_decay()`, `prune_memories()`, and `merge_similar_memories()` are implemented and wired into the cortex loop.
```
Verify each finding against the current code and only fix it if needed. In `@docs/design-docs/cortex-implementation.md` around lines 19 - 20, The heading "Implemented but never called" contradicts the following sentence saying apply_decay(), prune_memories(), and merge_similar_memories() are wired into the cortex loop; update the text so it states the single runtime truth: either remove "never called" and say these functions are implemented and wired into the cortex loop, or if they truly aren't invoked, change the sentence to indicate they are implemented but not yet wired; ensure you update the header and the sentence about apply_decay(), prune_memories(), and merge_similar_memories() to match.
🧹 Nitpick comments (3)
src/config.rs (2)
7613-7660: Expand this test to cover all newly added Cortex fields.
The test verifies maintenance fields, but it does not assert
`detached_worker_timeout_retry_limit` and `supervisor_kill_budget_per_tick`. Please add assertions so partial wiring regressions get caught.
✅ Test extension example
```diff
 [defaults.cortex]
 maintenance_interval_secs = 1200
+detached_worker_timeout_retry_limit = 4
+supervisor_kill_budget_per_tick = 16
 maintenance_decay_rate = 0.11
 maintenance_prune_threshold = 0.21
 maintenance_min_age_days = 17
 maintenance_merge_similarity_threshold = 0.97
@@
     assert_eq!(resolved.cortex.maintenance_interval_secs, 1200);
+    assert_eq!(resolved.cortex.detached_worker_timeout_retry_limit, 4);
+    assert_eq!(resolved.cortex.supervisor_kill_budget_per_tick, 16);
     assert_eq!(resolved.cortex.maintenance_decay_rate, 0.11);
@@
-cortex = { maintenance_interval_secs = 300, maintenance_decay_rate = 0.33 }
+cortex = {
+    maintenance_interval_secs = 300,
+    detached_worker_timeout_retry_limit = 1,
+    supervisor_kill_budget_per_tick = 3,
+    maintenance_decay_rate = 0.33
+}
@@
     assert_eq!(resolved_override.cortex.maintenance_interval_secs, 300);
+    assert_eq!(resolved_override.cortex.detached_worker_timeout_retry_limit, 1);
+    assert_eq!(resolved_override.cortex.supervisor_kill_budget_per_tick, 3);
     assert_eq!(resolved_override.cortex.maintenance_decay_rate, 0.33);
```
Verify each finding against the current code and only fix it if needed. In `@src/config.rs` around lines 7613 - 7660, The test test_cortex_maintenance_config_resolution currently asserts several maintenance fields but omits the new cortex settings detached_worker_timeout_retry_limit and supervisor_kill_budget_per_tick; update the assertions to also check resolved.cortex.detached_worker_timeout_retry_limit and resolved.cortex.supervisor_kill_budget_per_tick after parsing the defaults TOML, and likewise assert resolved_override.cortex.detached_worker_timeout_retry_limit and resolved_override.cortex.supervisor_kill_budget_per_tick after the override TOML (using the expected default values for fields not overridden and the explicit override values if you add any to the override case).
4850-4908: Extract Cortex merge into one resolver to avoid drift.
The same field-by-field
`CortexConfig` merge now exists in both defaults and per-agent paths. Centralizing this into one resolver will prevent divergence as new fields keep being added.
As per coding guidelines
`src/config.rs`: “Use config resolution with precedence: env > DB > default, with per-subsystem `resolve()` methods”.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/config.rs` around lines 4850 - 4908, The field-by-field CortexConfig merge is duplicated; extract the merge logic into a single resolver on the CortexConfig type (e.g., implement CortexConfig::resolve(&self, base: &CortexConfig) -> CortexConfig or CortexConfig::merge_with_defaults(&self, base: &CortexConfig) -> CortexConfig) and call that resolver wherever you currently build CortexConfig by manually mapping each field (the current closures that call .map(|c| CortexConfig { ... }).unwrap_or(base_defaults.cortex) and the similar per-agent merge). Replace both inline merges with a call to the new resolve/merge method so future fields are handled in one place and follow the intended env > DB > default precedence.src/hooks/spacebot.rs (1)
477-491: Redundant leak detection — this code is unreachable.
`guard_tool_result` (line 468) already scans `result` for leaks and returns `Terminate` if one is found. The early return at lines 469-472 ensures we never reach this channel/reply-specific check when a leak exists.
`guard_tool_result` for channel/reply cases.
♻️ Proposed fix to remove redundant code
```diff
             self.record_tool_result_metrics(tool_name, internal_call_id);
             return guard_action;
         }
-        // Only enforce hard-stop leak blocking on channel egress (`reply`).
-        // Worker and branch tool outputs are internal and should not terminate
-        // long-running jobs.
-        if self.process_type == ProcessType::Channel
-            && tool_name == "reply"
-            && let Some(leak) = self.scan_for_leaks(result)
-        {
-            tracing::error!(
-                process_id = %self.process_id,
-                tool_name = %tool_name,
-                leak_prefix = %&leak[..leak.len().min(8)],
-                "secret leak detected in reply result, terminating channel turn"
-            );
-            self.record_tool_result_metrics(tool_name, internal_call_id);
-            return HookAction::Terminate {
-                reason: "Reply contained a secret. Channel turn terminated.".into(),
-            };
-        }
-
         // Cap the result stored in the broadcast event to avoid blowing up
```
Verify each finding against the current code and only fix it if needed. In `@src/hooks/spacebot.rs` around lines 477 - 491, The channel/reply-specific leak detection block is unreachable because guard_tool_result already checks result for leaks and returns Terminate; remove the redundant if-block that checks self.process_type == ProcessType::Channel && tool_name == "reply" && let Some(leak) = self.scan_for_leaks(result) (and its tracing::error, record_tool_result_metrics call, and HookAction::Terminate return) OR, if you intend a distinct termination reason for channel replies, move this check to run before guard_tool_result so it can short-circuit earlier; update any references to record_tool_result_metrics and the termination reason string accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/agent/channel_dispatch.rs`:
- Around line 19-31: reserve_worker_slot_local currently only performs a
non-atomic check against active_worker_count and returns Ok(()) so concurrent
spawns can both pass and oversubscribe workers; change it to perform an atomic
reservation instead (e.g., acquire a mutex/lock around the channel's active
worker state or use a semaphore/counter with fetch_update) so you
increment/reserve a slot before returning Ok and decrement/release on worker
exit/failure; keep the same error path using AgentError::WorkerLimitReached with
channel_id and max_workers when reservation fails, and apply the same
atomic-reservation fix to the analogous logic referenced around lines 309-313.
In `@src/agent/channel.rs`:
- Around line 1693-1702: The code removes branch_reply_targets too early, so the
later retrieval for retrigger metadata (the later access around where
branch_reply_targets is read) always misses; change the logic in the function
handling branch completion so you do not unconditionally remove from
self.branch_reply_targets before you read it for retrigger metadata—either move
the remove(branch_id) call to after the retrigger metadata is read/used or
replace the unconditional remove with a read-only access (e.g., get/peek) and
only remove when you truly want to drop the target; refer to the symbols
self.branch_reply_targets.remove(branch_id) and the later retrieval code around
the retrigger metadata access and adjust so the reply target is available when
needed.
In `@src/agent/cortex_chat.rs`:
- Around line 93-104: persist_and_emit_cortex_chat_error currently discards the
Result from store.save_message; change it to explicitly handle errors by
awaiting the save and matching its Result (e.g., if let Err(e) =
store.save_message(thread_id, "assistant", &message, channel_ref).await {
tracing::error!("persist_and_emit_cortex_chat_error: failed to save message for
thread_id={} channel_ref={:?}: {:?}", thread_id, channel_ref, e); } ), and then
still send the CortexChatEvent::Error (use .await.ok() on event_tx.send if the
receiver may be dropped). Reference the function
persist_and_emit_cortex_chat_error, the CortexChatStore::save_message call, and
the event_tx.send(CortexChatEvent::Error { message }) send to locate where to
add the match and logging.
In `@src/agent/ingestion.rs`:
- Around line 521-525: The tracing::debug call in ingestion.rs currently logs
the model's response content (response) which may expose sensitive ingested
text; update the log in the chunk processing path (the tracing::debug block that
references filename, chunk_number, total_chunks, and response) to avoid
including response text — log only metadata such as filename, chunk id/total
(chunk_number and total_chunks) and response length (e.g., response.len()) or
other non-textual attributes instead of the response string itself.
---
Outside diff comments:
In `@src/config.rs`:
- Around line 4856-4906: The new cortex fields are accepted blindly; add
validation in the config build (where these fields are set, e.g., the Cortex
config constructor or TryFrom implementation) to check numeric bounds before
accepting values: ensure integer durations/counters (maintenance_min_age_days,
maintenance_interval_secs, worker_timeout_secs, branch_timeout_secs,
supervisor_kill_budget_per_tick, bulletin_max_words, bulletin_max_turns,
association_max_per_pass, association_updates_threshold,
detached_worker_timeout_retry_limit) are non-negative (and >0 where applicable),
ensure probability/ratio thresholds (association_similarity_threshold,
maintenance_merge_similarity_threshold, maintenance_prune_threshold,
maintenance_decay_rate, circuit_breaker_threshold, bulletin_interval_secs if
considered ratio) are within sensible ranges (e.g., 0.0..=1.0 for
similarity/decay thresholds), and either return a ConfigError/Err on invalid
values or clamp to safe defaults using base_defaults.cortex; implement these
checks adjacent to the current unwrap_or usage so invalid incoming values are
rejected or corrected before assignment.
---
Duplicate comments:
In `@docs/design-docs/cortex-implementation.md`:
- Around line 19-20: The heading "Implemented but never called" contradicts the
following sentence saying apply_decay(), prune_memories(), and
merge_similar_memories() are wired into the cortex loop; update the text so it
states the single runtime truth: either remove "never called" and say these
functions are implemented and wired into the cortex loop, or if they truly
aren't invoked, change the sentence to indicate they are implemented but not yet
wired; ensure you update the header and the sentence about apply_decay(),
prune_memories(), and merge_similar_memories() to match.
In `@src/agent/channel_dispatch.rs`:
- Around line 620-632: The panic path builds panic_message from panic_payload
and logs it directly (tracing::error) and returns it in
WorkerCompletionError::failed, which can leak secrets; fix by passing the
derived panic_message through the project's secret-scrubbing helper used
elsewhere (the same scrub/redact function applied to other failure paths) before
logging and before constructing WorkerCompletionError::failed, i.e., compute
panic_message as now, then let scrubbed = scrub_secret_text(panic_message) (or
the actual project scrubbing function name), use scrubbed in tracing::error and
in WorkerCompletionError::failed.
In `@src/agent/channel.rs`:
- Around line 1774-1783: The early return when
active_workers.remove(worker_id).is_none() skips completion logging, cleanup,
and retriggering; change the logic in the WorkerComplete handling so you do not
return early: record whether the worker was removed but always call
run_logger.log_worker_completed(*worker_id, result, *success), always remove
entries from self.state.worker_handles and self.state.worker_inputs, and ensure
any retrigger/queuing logic still runs for late or synthetic terminal events;
reference active_workers, run_logger.log_worker_completed, worker_handles,
worker_inputs and the WorkerComplete handling path when making the change.
In `@src/agent/worker.rs`:
- Around line 386-395: The code is directly assigning self.state =
WorkerState::Failed / WorkerState::WaitingForInput after follow_up_result
handling; instead validate transitions via the worker state machine before
mutating state: check self.state.can_transition_to(WorkerState::Failed) (and
similarly for WorkerState::WaitingForInput) using matches! as per guidelines,
and only set self.state when the check passes; if the transition is illegal,
convert it into a runtime error/panic or return an Err instead of silently
mutating; update the block around follow_up_result, follow_up_failure, and the
hook.send_status calls to perform these guarded transitions.
In `@src/api/channels.rs`:
- Around line 293-297: The current use of .is_ok() (and similar mapping to
NOT_FOUND) on channel_state.cancel_worker_with_reason(worker_id, "cancelled via
API").await silently discards real errors; instead match on the Result from
cancel_worker_with_reason and handle errors explicitly (e.g., log the error with
process/server logger and return an appropriate server error or propagate the
error instead of converting everything to 404). Update the same pattern around
the other occurrence (lines referencing cancel_worker_with_reason at 349-352) so
neither path swallows failures — use match/if let Err(e) to log e and return a
proper error response rather than .is_ok()/map_err(|_| NOT_FOUND).
In `@src/api/config.rs`:
- Around line 619-621: The code currently casts cortex.maintenance_interval_secs
from u64 to i64 unsafely (in the block that assigns
table["maintenance_interval_secs"]), which will wrap values > i64::MAX and
produce invalid negative TOML integers; change this to perform a checked
conversion (e.g., use i64::try_from or compare against i64::MAX) and only set
table["maintenance_interval_secs"] when conversion succeeds, otherwise return a
400 Bad Request error from the surrounding handler/function indicating the value
is out of range; update the logic that references maintenance_interval_secs and
the error path so callers get a proper rejection instead of writing a wrapped
value.
- Around line 622-633: Validate each maintenance knob Option before writing to
the TOML table: for cortex.maintenance_decay_rate,
cortex.maintenance_prune_threshold, and
cortex.maintenance_merge_similarity_threshold ensure the Some(v) value is within
0.0..=1.0; for cortex.maintenance_min_age_days ensure the Some(v) value is >= 0.
Do the checks in the same scope that currently writes to table (i.e., before
assigning table["maintenance_*"] = toml_edit::value(...)); if a value is out of
bounds return or propagate a clear error (with which field failed and why)
instead of persisting the invalid value so the invalid config is rejected at
write time.
In `@src/main.rs`:
- Around line 1662-1680: The spawned task cleans registry and API state after
channel.run() returns but never notifies the main loop to remove the entry from
active_channels for conversation_id, leaving a stale entry that can drop the
next inbound message; after the existing cleanup steps (after
process_control_registry.unregister_channel and
api_state_for_cleanup.unregister_channel_*), send a clear/terminate signal to
the component that owns active_channels (or call its removal API) to remove
active_channels[&conversation_id] (or invoke the main-loop's channel removal
method), using the same conversation_id/Arc scoped_channel_id to avoid races and
ensure removal happens before returning from the spawned task.
In `@src/memory/maintenance.rs`:
- Around line 251-270: active_survivor can become stale after merge_pair because
the stored winner document is updated in-place; after each successful merge
(where you currently call merge_pair and then use winner/loser), reload the
latest survivor from the store and assign it to active_survivor so subsequent
choose_merge_pair calls use up-to-date content — i.e., after
merge_pair(...).await? insert merged_memory_ids/merged_count logic, then fetch
the current winner by id from memory_store (using the winner.id) and set
active_survivor = that fresh record (or change merge_pair to return the updated
survivor and assign it) before continuing or breaking on source_id.
---
Nitpick comments:
In `@src/config.rs`:
- Around line 7613-7660: The test test_cortex_maintenance_config_resolution
currently asserts several maintenance fields but omits the new cortex settings
detached_worker_timeout_retry_limit and supervisor_kill_budget_per_tick; update
the assertions to also check resolved.cortex.detached_worker_timeout_retry_limit
and resolved.cortex.supervisor_kill_budget_per_tick after parsing the defaults
TOML, and likewise assert
resolved_override.cortex.detached_worker_timeout_retry_limit and
resolved_override.cortex.supervisor_kill_budget_per_tick after the override TOML
(using the expected default values for fields not overridden and the explicit
override values if you add any to the override case).
- Around line 4850-4908: The field-by-field CortexConfig merge is duplicated;
extract the merge logic into a single resolver on the CortexConfig type (e.g.,
implement CortexConfig::resolve(&self, base: &CortexConfig) -> CortexConfig or
CortexConfig::merge_with_defaults(&self, base: &CortexConfig) -> CortexConfig)
and call that resolver wherever you currently build CortexConfig by manually
mapping each field (the current closures that call .map(|c| CortexConfig { ...
}).unwrap_or(base_defaults.cortex) and the similar per-agent merge). Replace
both inline merges with a call to the new resolve/merge method so future fields
are handled in one place and follow the intended env > DB > default precedence.
In `@src/hooks/spacebot.rs`:
- Around line 477-491: The channel/reply-specific leak detection block is
unreachable because guard_tool_result already checks result for leaks and
returns Terminate; remove the redundant if-block that checks self.process_type
== ProcessType::Channel && tool_name == "reply" && let Some(leak) =
self.scan_for_leaks(result) (and its tracing::error, record_tool_result_metrics
call, and HookAction::Terminate return) OR, if you intend a distinct termination
reason for channel replies, move this check to run before guard_tool_result so
it can short-circuit earlier; update any references to
record_tool_result_metrics and the termination reason string accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5941001f-4362-4d61-a385-f666bd620bf2
⛔ Files ignored due to path filters (15)
- `Cargo.lock` is excluded by `!**/*.lock`, `!**/*.lock`
- `Cargo.toml` is excluded by `!**/*.toml`
- `vendor/imap-proto-0.10.2/Cargo.toml` is excluded by `!**/*.toml`, `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/body.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/body_structure.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/builders/command.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/builders/mod.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/core.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/lib.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/macros.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/parser/mod.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/parser/rfc3501.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/parser/rfc4551.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/parser/rfc5464.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/types.rs` is excluded by `!**/vendor/**`
📒 Files selected for processing (44)
- README.md
- docs/content/docs/(configuration)/config.mdx
- docs/content/docs/(core)/architecture.mdx
- docs/design-docs/cortex-chat.md
- docs/design-docs/cortex-history.md
- docs/design-docs/cortex-implementation.md
- docs/design-docs/tool-nudging.md
- src/agent.rs
- src/agent/branch.rs
- src/agent/channel.rs
- src/agent/channel_dispatch.rs
- src/agent/channel_history.rs
- src/agent/compactor.rs
- src/agent/cortex.rs
- src/agent/cortex_chat.rs
- src/agent/ingestion.rs
- src/agent/invariant_harness.rs
- src/agent/process_control.rs
- src/agent/status.rs
- src/agent/worker.rs
- src/api/agents.rs
- src/api/channels.rs
- src/api/config.rs
- src/api/cortex.rs
- src/api/state.rs
- src/api/system.rs
- src/config.rs
- src/hooks.rs
- src/hooks/spacebot.rs
- src/lib.rs
- src/llm/model.rs
- src/main.rs
- src/memory/maintenance.rs
- src/memory/store.rs
- src/telemetry/registry.rs
- src/tools.rs
- src/tools/cancel.rs
- src/tools/memory_save.rs
- src/tools/worker_inspect.rs
- tests/bulletin.rs
- tests/context_dump.rs
- tests/detached_worker_bootstrap.rs
- tests/maintenance.rs
- tests/tool_nudge.rs
🚧 Files skipped from review as they are similar to previous changes (10)
- src/api/system.rs
- README.md
- src/tools/worker_inspect.rs
- src/api/state.rs
- src/agent/status.rs
- tests/maintenance.rs
- docs/design-docs/cortex-chat.md
- docs/content/docs/(configuration)/config.mdx
- docs/content/docs/(core)/architecture.mdx
- src/tools.rs
```rust
pub(crate) fn reserve_worker_slot_local(
    active_worker_count: usize,
    channel_id: &Arc<str>,
    max_workers: usize,
) -> std::result::Result<(), AgentError> {
    if active_worker_count >= max_workers {
        return Err(AgentError::WorkerLimitReached {
            channel_id: channel_id.to_string(),
            max: max_workers,
        });
    }

    Ok(())
}
```
reserve_worker_slot_local is check-only; concurrent spawns can exceed max_workers.
This gate reads active_workers.len() and returns Ok(()) without reserving a slot atomically. Two concurrent spawns can both pass and oversubscribe worker capacity.
Based on learnings: “For changes in async/stateful paths (worker lifecycle, cancellation, retrigger, recall cache behavior), include explicit race/terminal-state reasoning…”.
Also applies to: 309-313
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/agent/channel_dispatch.rs` around lines 19 - 31,
reserve_worker_slot_local currently only performs a non-atomic check against
active_worker_count and returns Ok(()) so concurrent spawns can both pass and
oversubscribe workers; change it to perform an atomic reservation instead (e.g.,
acquire a mutex/lock around the channel's active worker state or use a
semaphore/counter with fetch_update) so you increment/reserve a slot before
returning Ok and decrement/release on worker exit/failure; keep the same error
path using AgentError::WorkerLimitReached with channel_id and max_workers when
reservation fails, and apply the same atomic-reservation fix to the analogous
logic referenced around lines 309-313.
```rust
async fn persist_and_emit_cortex_chat_error(
    store: &CortexChatStore,
    event_tx: &mpsc::Sender<CortexChatEvent>,
    thread_id: &str,
    channel_ref: Option<&str>,
    message: String,
) {
    let _ = store
        .save_message(thread_id, "assistant", &message, channel_ref)
        .await;
    let _ = event_tx.send(CortexChatEvent::Error { message }).await;
}
```
Don’t silently drop DB persistence failures in error reporting path.
persist_and_emit_cortex_chat_error ignores save_message errors, which can hide storage failures during the exact path meant to preserve failure context.
🛠️ Suggested fix
```diff
 async fn persist_and_emit_cortex_chat_error(
@@
 ) {
-    let _ = store
-        .save_message(thread_id, "assistant", &message, channel_ref)
-        .await;
-    let _ = event_tx.send(CortexChatEvent::Error { message }).await;
+    if let Err(error) = store
+        .save_message(thread_id, "assistant", &message, channel_ref)
+        .await
+    {
+        tracing::warn!(%error, thread_id, "failed to persist cortex chat error message");
+    }
+    event_tx.send(CortexChatEvent::Error { message }).await.ok();
 }
```

As per coding guidelines "Never silently discard errors. No let _ = on Results. Handle them, log them, or propagate them. The only exception is .ok() on channel sends where the receiver may be dropped".
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
async fn persist_and_emit_cortex_chat_error(
    store: &CortexChatStore,
    event_tx: &mpsc::Sender<CortexChatEvent>,
    thread_id: &str,
    channel_ref: Option<&str>,
    message: String,
) {
    if let Err(error) = store
        .save_message(thread_id, "assistant", &message, channel_ref)
        .await
    {
        tracing::warn!(%error, thread_id, "failed to persist cortex chat error message");
    }
    event_tx.send(CortexChatEvent::Error { message }).await.ok();
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/agent/cortex_chat.rs` around lines 93 - 104,
persist_and_emit_cortex_chat_error currently discards the Result from
store.save_message; change it to explicitly handle errors by awaiting the save
and matching its Result (e.g., if let Err(e) = store.save_message(thread_id,
"assistant", &message, channel_ref).await {
tracing::error!("persist_and_emit_cortex_chat_error: failed to save message for
thread_id={} channel_ref={:?}: {:?}", thread_id, channel_ref, e); } ), and then
still send the CortexChatEvent::Error (use .await.ok() on event_tx.send if the
receiver may be dropped). Reference the function
persist_and_emit_cortex_chat_error, the CortexChatStore::save_message call, and
the event_tx.send(CortexChatEvent::Error { message }) send to locate where to
add the match and logging.
```rust
tracing::debug!(
    file = %filename,
    chunk = %format!("{chunk_number}/{total_chunks}"),
    response = %response.chars().take(200).collect::<String>(),
    "chunk processed"
);
```
Avoid logging model response content for ingested chunks.
This logs text derived from ingested files, which can leak sensitive document content into logs. Prefer metadata-only logging (e.g., length/chunk id).
🔧 Suggested fix
```diff
 Ok(response) => {
     tracing::debug!(
         file = %filename,
         chunk = %format!("{chunk_number}/{total_chunks}"),
-        response = %response.chars().take(200).collect::<String>(),
+        response_chars = response.chars().count(),
         "chunk processed"
     );
     Ok(())
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
tracing::debug!(
    file = %filename,
    chunk = %format!("{chunk_number}/{total_chunks}"),
    response_chars = response.chars().count(),
    "chunk processed"
);
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/agent/ingestion.rs` around lines 521 - 525, The tracing::debug call in
ingestion.rs currently logs the model's response content (response) which may
expose sensitive ingested text; update the log in the chunk processing path (the
tracing::debug block that references filename, chunk_number, total_chunks, and
response) to avoid including response text — log only metadata such as filename,
chunk id/total (chunk_number and total_chunks) and response length (e.g.,
response.len()) or other non-textual attributes instead of the response string
itself.
Force-pushed 5d6b65c to 59ffec7 (Compare)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/agent/process_control.rs (1)
101-103: ⚠️ Potential issue | 🟠 Major
Stale unregister can remove a fresh channel registration.

Lines 101–103 remove by `channel_id` only without validating the registration. If an older channel instance unregisters after a newer instance re-registers the same `ChannelId` (e.g., user deletes/recreates a conversation), this silently evicts the live entry and breaks cancellation routing for the new channel.

The cleanup task in `src/main.rs:1668–1680` calls `unregister_channel` without a registration guard. Pair the `register_channel` return of `registration_id` with `unregister_channel` validation of it:

🛠️ Proposed hardening

```diff
-    pub async fn register_channel(&self, channel_id: ChannelId, handle: WeakChannelControlHandle) {
+    pub async fn register_channel(
+        &self,
+        channel_id: ChannelId,
+        handle: WeakChannelControlHandle,
+    ) -> u64 {
         let registration_id = self
             .next_channel_registration
             .fetch_add(1, Ordering::AcqRel);
         self.channels.write().await.insert(
             channel_id,
             ChannelControlEntry {
                 handle,
                 registration_id,
             },
         );
+        registration_id
     }

-    pub async fn unregister_channel(&self, channel_id: &ChannelId) -> bool {
-        self.channels.write().await.remove(channel_id).is_some()
+    pub async fn unregister_channel(
+        &self,
+        channel_id: &ChannelId,
+        expected_registration_id: u64,
+    ) -> bool {
+        let mut channels = self.channels.write().await;
+        if channels
+            .get(channel_id)
+            .is_some_and(|entry| entry.registration_id == expected_registration_id)
+        {
+            channels.remove(channel_id);
+            true
+        } else {
+            false
+        }
     }
```

Also update `src/main.rs:1613` to capture `registration_id` from `register_channel` and pass it to the cleanup task at lines 1673–1674.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agent/process_control.rs` around lines 101 - 103, The unregister routine currently removes entries by ChannelId only and can evict a newer registration; change register_channel to return a unique registration_id (e.g., UUID or incrementing token) when inserting a channel, then modify unregister_channel to accept both &ChannelId and the registration_id and only remove the map entry if the stored registration_id matches the provided one (preserving the live entry if it differs). Update the caller in the cleanup task to capture the registration_id returned from register_channel and pass that registration_id into unregister_channel when scheduling cleanup; ensure types and tests are updated to reflect the new unregister_channel signature and behavior.
♻️ Duplicate comments (5)
tests/maintenance.rs (1)
8-37: ⚠️ Potential issue | 🟠 Major
Keep the TempDir alive for the full fixture lifetime.

Line 27 creates a temp directory, but it is dropped when the fixture returns at Line 36. That can invalidate the LanceDB-backed `EmbeddingTable` used later in the tests.

🛠️ Proposed fix

```diff
 async fn make_memory_maintenance_fixture() -> (
     std::sync::Arc<MemoryStore>,
     spacebot::memory::EmbeddingTable,
+    tempfile::TempDir,
 ) {
@@
-    let dir = tempdir().expect("failed to create temp dir");
-    let lance_conn = lancedb::connect(dir.path().to_str().expect("temp path"))
+    let temp_directory = tempdir().expect("failed to create temp dir");
+    let lance_conn = lancedb::connect(temp_directory.path().to_str().expect("temp path"))
         .execute()
         .await
         .expect("failed to connect to lancedb");
@@
-    (store, embedding_table)
+    (store, embedding_table, temp_directory)
 }
@@
-    let (store, embedding_table) = make_memory_maintenance_fixture().await;
+    let (store, embedding_table, _temp_directory) = make_memory_maintenance_fixture().await;
@@
-    let (store, embedding_table) = make_memory_maintenance_fixture().await;
+    let (store, embedding_table, _temp_directory) = make_memory_maintenance_fixture().await;
```

```bash
#!/bin/bash
set -euo pipefail

echo "== Fixture signature, tempdir allocation, and return tuple =="
rg -n 'async fn make_memory_maintenance_fixture|tempdir\(|\(store,\s*embedding_table\)' tests/maintenance.rs -C2
echo
echo "== Fixture call sites that destructure the tuple =="
rg -n 'make_memory_maintenance_fixture\(\)\.await|let \(' tests/maintenance.rs -C2
echo
echo "== Lance embedding table storage lifecycle context =="
rg -n 'open_or_create|drop_table|connect' src/memory/lance.rs -C2
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/maintenance.rs` around lines 8 - 37, The tempdir created in make_memory_maintenance_fixture (dir via tempdir()) is dropped when the function returns, invalidating the LanceDB-backed EmbeddingTable; preserve the TempDir for the lifetime of the fixture by returning it (or an Arc-wrapped TempDir) alongside the MemoryStore and EmbeddingTable so the directory stays alive, e.g., change the function signature and return tuple to include the TempDir (or Arc<TempDir>) and ensure lance_conn/embedding_table keep owning references to that dir.
docs/design-docs/cortex-implementation.md (1)
19-20: ⚠️ Potential issue | 🟡 Minor
Resolve contradictory maintenance status wording.

Line 19 says “never called” while Line 20 says the same functions are wired into the cortex loop. Please make this status unambiguous.

📝 Suggested wording fix

```diff
-**Implemented but never called:**
+**Implemented and wired:**
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design-docs/cortex-implementation.md` around lines 19 - 20, The heading is contradictory: remove "never called" and make the status unambiguous by changing the heading to reflect that the functions in memory/maintenance.rs (apply_decay(), prune_memories(), merge_similar_memories()) are implemented and wired into the cortex loop — e.g., replace "Implemented but never called:" with "Implemented and wired into the cortex loop:" so the description matches the actual wiring.
src/api/config.rs (1)
646-660: ⚠️ Potential issue | 🟠 Major
Harden maintenance writes with checked conversion + range validation.

Line 647 still uses unchecked `u64 -> i64` narrowing, and Lines 649–660 persist float/int maintenance knobs without rejecting invalid ranges. This can store invalid values and destabilize runtime behavior.

💡 Suggested fix

```diff
+fn validate_unit_interval(value: f32) -> Result<(), StatusCode> {
+    if (0.0..=1.0).contains(&value) {
+        Ok(())
+    } else {
+        Err(StatusCode::BAD_REQUEST)
+    }
+}
+
 fn update_cortex_table(
     doc: &mut toml_edit::DocumentMut,
     agent_idx: usize,
     cortex: &CortexUpdate,
 ) -> Result<(), StatusCode> {
@@
-    if let Some(v) = cortex.maintenance_interval_secs {
-        table["maintenance_interval_secs"] = toml_edit::value(v as i64);
+    if let Some(maintenance_interval_secs) = cortex.maintenance_interval_secs {
+        table["maintenance_interval_secs"] = toml_edit::value(
+            i64::try_from(maintenance_interval_secs).map_err(|_| StatusCode::BAD_REQUEST)?,
+        );
     }
-    if let Some(v) = cortex.maintenance_decay_rate {
-        table["maintenance_decay_rate"] = toml_edit::value(v as f64);
+    if let Some(maintenance_decay_rate) = cortex.maintenance_decay_rate {
+        validate_unit_interval(maintenance_decay_rate)?;
+        table["maintenance_decay_rate"] = toml_edit::value(maintenance_decay_rate as f64);
     }
-    if let Some(v) = cortex.maintenance_prune_threshold {
-        table["maintenance_prune_threshold"] = toml_edit::value(v as f64);
+    if let Some(maintenance_prune_threshold) = cortex.maintenance_prune_threshold {
+        validate_unit_interval(maintenance_prune_threshold)?;
+        table["maintenance_prune_threshold"] = toml_edit::value(maintenance_prune_threshold as f64);
     }
-    if let Some(v) = cortex.maintenance_min_age_days {
-        table["maintenance_min_age_days"] = toml_edit::value(v);
+    if let Some(maintenance_min_age_days) = cortex.maintenance_min_age_days {
+        if maintenance_min_age_days < 0 {
+            return Err(StatusCode::BAD_REQUEST);
+        }
+        table["maintenance_min_age_days"] = toml_edit::value(maintenance_min_age_days);
     }
-    if let Some(v) = cortex.maintenance_merge_similarity_threshold {
-        table["maintenance_merge_similarity_threshold"] = toml_edit::value(v as f64);
+    if let Some(maintenance_merge_similarity_threshold) = cortex.maintenance_merge_similarity_threshold {
+        validate_unit_interval(maintenance_merge_similarity_threshold)?;
+        table["maintenance_merge_similarity_threshold"] =
+            toml_edit::value(maintenance_merge_similarity_threshold as f64);
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/api/config.rs` around lines 646 - 660, The maintenance fields are written without checked narrowing and range validation; update the cortex.maintenance_... handling to use safe conversions and explicit range checks: for integer fields (maintenance_interval_secs, maintenance_min_age_days) use i64::try_from(...) (or TryInto) and reject values that don't fit i64 (return Err or propagate a config validation error) before assigning to table; for float knobs (maintenance_decay_rate, maintenance_prune_threshold, maintenance_merge_similarity_threshold) ensure the f64 is finite and inside an acceptable range (e.g. 0.0..=1.0 for rates/thresholds) and only then call toml_edit::value(validated_value), otherwise return/propagate an error or skip writing the field; refer to the cortex.maintenance_interval_secs, maintenance_decay_rate, maintenance_prune_threshold, maintenance_min_age_days, and maintenance_merge_similarity_threshold identifiers when making these changes.
src/memory/maintenance.rs (1)
251-270: ⚠️ Potential issue | 🔴 Critical
`active_survivor` is stale after `merge_pair`, risking loss of earlier merged content.

After Line 252 (`merge_pair(...).await?`), `active_survivor` is not refreshed to the persisted merged survivor state. In subsequent candidate merges (Line 235+), merge content can be recomputed from pre-merge text and overwrite prior additions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/memory/maintenance.rs` around lines 251 - 270, The active_survivor is stale after calling merge_pair which can cause later merges to rebuild from pre-merge data; after merge_pair(...) completes, reload the persisted survivor state and set active_survivor to that refreshed record (e.g., fetch the memory by winner.id from memory_store) before continuing merges and before checking/using winner.id or setting merged_memory_ids/source_merged; keep using choose_merge_pair and merge_pair as-is but ensure active_survivor is updated from the authoritative store rather than relying on the pre-merge in-memory winner.
324-377: ⚠️ Potential issue | 🔴 Critical
Make the merge sequence atomic to prevent partial graph rewires.
From Line 324 through Line 377, this is a multi-step mutation across memories, associations, and embeddings with cancellation/error points in between. A mid-sequence failure can leave survivor content, edges, and forgotten/embedding state out of sync.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/memory/maintenance.rs` around lines 324 - 377, The current merge logic (calls to maintenance_cancelable_op with memory_store.update, get_associations, delete_associations_for_memory, create_association, forget and embedding_table.delete, and the created Association::new) must be made atomic: wrap the whole sequence in a single transactional/locked operation or implement an explicit two-phase commit so either all steps succeed or none are applied. Concretely, replace the series of independent maintenance_cancelable_op calls with a transaction API on memory_store/embedding_table (or acquire a per-memory lock for merged.id and updated_survivor.id), perform update(&updated_survivor), fetch and delete associations, re-create remapped associations, create the updates_assoc, and then forget(&merged.id) and embedding_table.delete(&merged.id) only on commit; if the store has no transaction support, implement rollback logic to undo create_association calls and restore deleted associations/embedding on any intermediate error and ensure check_maintenance_cancellation is still honored inside the locked/transactional block.
🧹 Nitpick comments (1)
src/memory/maintenance.rs (1)
483-619: Add a regression test for multi-candidate merge accumulation in one maintenance pass.

Current coverage validates a single duplicate merge, but not repeated merges onto the same survivor in one run. A test with one source + two near-duplicates would catch stale-survivor regressions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/memory/maintenance.rs` around lines 483 - 619, Add a new async test (similar to merges_near_duplicate_memories_and_transfers_associations) that creates one survivor memory and two distinct near-duplicate memories via create_memory_with_embedding, links associations from each duplicate to other memories, then runs run_maintenance once with a high merge_similarity_threshold so both duplicates should merge into the same survivor in a single pass; after run_maintenance assert merged == 2, the survivor content includes updates from both duplicates, both duplicates are marked forgotten and have no embeddings in EmbeddingTable::find_similar, and associations from both duplicates have been transferred/rewired to the survivor (use MemoryStore::get_associations, MemoryStore::load, EmbeddingTable::find_similar to verify).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/agent/cortex.rs`:
- Around line 1814-1851: Add a circuit-breaker: introduce a counter field (e.g.,
maintenance_consecutive_failures: usize) and a disabled flag/timestamp (e.g.,
maintenance_disabled_at: Option<Instant>) on the same struct that owns
maintenance_task, then in the maintenance task completion branch (the block
matching task.await in the maintenance_task handling code) increment
maintenance_consecutive_failures on any failure path (Ok(Err(_)) and Err(_)),
reset it to 0 on Ok(Ok(report)), and when the counter reaches the threshold (3)
set maintenance_disabled_at (or a boolean like maintenance_disabled = true) and
stop scheduling new maintenance tasks; also emit a clear log entry when the
circuit trips and when it is later reset. Apply the same pattern to the other
maintenance/cron/compactor routines referenced (around lines ~1932-1972) using
similarly named counters/flags so recurring retries are auto-disabled after 3
consecutive failures and reset on a successful run.
- Around line 1865-1867: Replace the ignored send result pattern "let _ =
cancel_tx.send(true);" with the canonical ".ok()" drop for channel send
failures: find the block that checks maintenance_task_cancel_tx.as_ref() and
where cancel_tx.send(true) is invoked (the variable names
maintenance_task_cancel_tx and cancel_tx), and change the send call to
cancel_tx.send(true).ok() so failures are intentionally ignored per repo policy.
---
Outside diff comments:
In `@src/agent/process_control.rs`:
- Around line 101-103: The unregister routine currently removes entries by
ChannelId only and can evict a newer registration; change register_channel to
return a unique registration_id (e.g., UUID or incrementing token) when
inserting a channel, then modify unregister_channel to accept both &ChannelId
and the registration_id and only remove the map entry if the stored
registration_id matches the provided one (preserving the live entry if it
differs). Update the caller in the cleanup task to capture the registration_id
returned from register_channel and pass that registration_id into
unregister_channel when scheduling cleanup; ensure types and tests are updated
to reflect the new unregister_channel signature and behavior.
---
Duplicate comments:
In `@docs/design-docs/cortex-implementation.md`:
- Around line 19-20: The heading is contradictory: remove "never called" and
make the status unambiguous by changing the heading to reflect that the
functions in memory/maintenance.rs (apply_decay(), prune_memories(),
merge_similar_memories()) are implemented and wired into the cortex loop — e.g.,
replace "Implemented but never called:" with "Implemented and wired into the
cortex loop:" so the description matches the actual wiring.
In `@src/api/config.rs`:
- Around line 646-660: The maintenance fields are written without checked
narrowing and range validation; update the cortex.maintenance_... handling to
use safe conversions and explicit range checks: for integer fields
(maintenance_interval_secs, maintenance_min_age_days) use i64::try_from(...) (or
TryInto) and reject values that don't fit i64 (return Err or propagate a config
validation error) before assigning to table; for float knobs
(maintenance_decay_rate, maintenance_prune_threshold,
maintenance_merge_similarity_threshold) ensure the f64 is finite and inside an
acceptable range (e.g. 0.0..=1.0 for rates/thresholds) and only then call
toml_edit::value(validated_value), otherwise return/propagate an error or skip
writing the field; refer to the cortex.maintenance_interval_secs,
maintenance_decay_rate, maintenance_prune_threshold, maintenance_min_age_days,
and maintenance_merge_similarity_threshold identifiers when making these
changes.
In `@src/memory/maintenance.rs`:
- Around line 251-270: The active_survivor is stale after calling merge_pair
which can cause later merges to rebuild from pre-merge data; after
merge_pair(...) completes, reload the persisted survivor state and set
active_survivor to that refreshed record (e.g., fetch the memory by winner.id
from memory_store) before continuing merges and before checking/using winner.id
or setting merged_memory_ids/source_merged; keep using choose_merge_pair and
merge_pair as-is but ensure active_survivor is updated from the authoritative
store rather than relying on the pre-merge in-memory winner.
- Around line 324-377: The current merge logic (calls to
maintenance_cancelable_op with memory_store.update, get_associations,
delete_associations_for_memory, create_association, forget and
embedding_table.delete, and the created Association::new) must be made atomic:
wrap the whole sequence in a single transactional/locked operation or implement
an explicit two-phase commit so either all steps succeed or none are applied.
Concretely, replace the series of independent maintenance_cancelable_op calls
with a transaction API on memory_store/embedding_table (or acquire a per-memory
lock for merged.id and updated_survivor.id), perform update(&updated_survivor),
fetch and delete associations, re-create remapped associations, create the
updates_assoc, and then forget(&merged.id) and
embedding_table.delete(&merged.id) only on commit; if the store has no
transaction support, implement rollback logic to undo create_association calls
and restore deleted associations/embedding on any intermediate error and ensure
check_maintenance_cancellation is still honored inside the locked/transactional
block.
In `@tests/maintenance.rs`:
- Around line 8-37: The tempdir created in make_memory_maintenance_fixture (dir
via tempdir()) is dropped when the function returns, invalidating the
LanceDB-backed EmbeddingTable; preserve the TempDir for the lifetime of the
fixture by returning it (or an Arc-wrapped TempDir) alongside the MemoryStore
and EmbeddingTable so the directory stays alive, e.g., change the function
signature and return tuple to include the TempDir (or Arc<TempDir>) and ensure
lance_conn/embedding_table keep owning references to that dir.
---
Nitpick comments:
In `@src/memory/maintenance.rs`:
- Around line 483-619: Add a new async test (similar to
merges_near_duplicate_memories_and_transfers_associations) that creates one
survivor memory and two distinct near-duplicate memories via
create_memory_with_embedding, links associations from each duplicate to other
memories, then runs run_maintenance once with a high merge_similarity_threshold
so both duplicates should merge into the same survivor in a single pass; after
run_maintenance assert merged == 2, the survivor content includes updates from
both duplicates, both duplicates are marked forgotten and have no embeddings in
EmbeddingTable::find_similar, and associations from both duplicates have been
transferred/rewired to the survivor (use MemoryStore::get_associations,
MemoryStore::load, EmbeddingTable::find_similar to verify).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 71e55097-3fa0-4b23-b396-dcb19dac8ee0
📒 Files selected for processing (12)
docs/design-docs/cortex-implementation.md
src/agent/cortex.rs
src/agent/cortex_chat.rs
src/agent/process_control.rs
src/api/config.rs
src/config.rs
src/config/load.rs
src/config/toml_schema.rs
src/config/types.rs
src/memory/maintenance.rs
src/memory/store.rs
tests/maintenance.rs
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/config/load.rs (1)
130-188: ⚠️ Potential issue | 🟠 Major

Missing lower-bound validation for `maintenance_interval_secs`.

Line 187 validates maintenance bounds, but this path still needs an explicit `maintenance_interval_secs >= 1` guard at load/update resolution time. Without it, a zero interval can pass resolution and lead to pathological maintenance scheduling behavior.

🔧 Proposed fix

```diff
 fn resolve(overrides: TomlCortexConfig, defaults: CortexConfig) -> Result<CortexConfig> {
     let config = CortexConfig {
         tick_interval_secs: overrides
             .tick_interval_secs
             .unwrap_or(defaults.tick_interval_secs),
@@
         association_max_per_pass: overrides
             .association_max_per_pass
             .unwrap_or(defaults.association_max_per_pass),
     };
+    if config.maintenance_interval_secs < 1 {
+        return Err(
+            ConfigError::Invalid("maintenance_interval_secs must be >= 1".into()).into()
+        );
+    }
     config.validate_maintenance_bounds()?;
     Ok(config)
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/config/load.rs` around lines 130 - 188, In resolve(overrides: TomlCortexConfig, defaults: CortexConfig) add an explicit guard that ensures the resolved maintenance_interval_secs is at least 1 before constructing/returning the CortexConfig: compute the resolved value for maintenance_interval_secs from overrides.maintenance_interval_secs.unwrap_or(defaults.maintenance_interval_secs) and if it is < 1 return an Err with a clear validation error (consistent with the function's Result error type) instead of allowing zero; keep the subsequent call to config.validate_maintenance_bounds() for the other checks.
🧹 Nitpick comments (1)
src/config.rs (1)
1028-1065: Extend invalid-range test to cover all maintenance validation branches.

Nice negative coverage already. To fully lock the validator contract, also assert rejection for zero `maintenance_interval_secs` and out-of-range `maintenance_merge_similarity_threshold`.

Proposed test extension

```diff
     let invalid_agent_override = r#"
 [[agents]]
 id = "main"

 [agents.cortex]
 maintenance_decay_rate = -0.1
 "#;
     let parsed: TomlConfig =
         toml::from_str(invalid_agent_override).expect("failed to parse invalid agent TOML");
     assert!(
         Config::from_toml(parsed, PathBuf::from(".")).is_err(),
         "expected invalid agent maintenance_decay_rate to be rejected"
     );
+
+    let invalid_interval = r#"
+[defaults.cortex]
+maintenance_interval_secs = 0
+"#;
+    let parsed: TomlConfig =
+        toml::from_str(invalid_interval).expect("failed to parse invalid interval TOML");
+    assert!(
+        Config::from_toml(parsed, PathBuf::from(".")).is_err(),
+        "expected zero maintenance_interval_secs to be rejected"
+    );
+
+    let invalid_merge_similarity = r#"
+[defaults.cortex]
+maintenance_merge_similarity_threshold = 1.1
+"#;
+    let parsed: TomlConfig = toml::from_str(invalid_merge_similarity)
+        .expect("failed to parse invalid merge similarity TOML");
+    assert!(
+        Config::from_toml(parsed, PathBuf::from(".")).is_err(),
+        "expected invalid maintenance_merge_similarity_threshold to be rejected"
+    );
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/config.rs` around lines 1028 - 1065, Add additional negative test cases inside test_cortex_maintenance_config_rejects_invalid_ranges to cover the remaining validation branches: create a TOML snippet with maintenance_interval_secs = 0 under [defaults.cortex] and assert Config::from_toml(parsed, PathBuf::from(".")).is_err(); then add snippets for maintenance_merge_similarity_threshold values out of range (e.g., -0.1 and 1.1) under [defaults.cortex] and assert they are rejected as well. Locate the existing test function test_cortex_maintenance_config_rejects_invalid_ranges, reuse TomlConfig and Config::from_toml for parsing/validation, and add assertions mirroring the existing style for the new invalid inputs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/memory/maintenance.rs`:
- Around line 336-346: The survivor's content is updated (updated_survivor,
merged_memory_content) but we never re-compute its embedding; after awaiting
memory_store.merge_memories_atomic(...) you must run the same
re-embedding/upsert code used for normal memory updates to compute and store the
new embedding for updated_survivor (i.e., call the project’s embed-and-upsert
routine for the memory or embedding_table.upsert-equivalent) before calling
embedding_table.delete(&merged.id) and returning; place this re-embedding step
between maintenance_cancelable_op(...merge_memories_atomic...) and the
embedding_table.delete(...) call so the survivor embedding is fresh.
---
Outside diff comments:
In `@src/config/load.rs`:
- Around line 130-188: In resolve(overrides: TomlCortexConfig, defaults:
CortexConfig) add an explicit guard that ensures the resolved
maintenance_interval_secs is at least 1 before constructing/returning the
CortexConfig: compute the resolved value for maintenance_interval_secs from
overrides.maintenance_interval_secs.unwrap_or(defaults.maintenance_interval_secs)
and if it is < 1 return an Err with a clear validation error (consistent with
the function's Result error type) instead of allowing zero; keep the subsequent
call to config.validate_maintenance_bounds() for the other checks.
---
Nitpick comments:
In `@src/config.rs`:
- Around line 1028-1065: Add additional negative test cases inside
test_cortex_maintenance_config_rejects_invalid_ranges to cover the remaining
validation branches: create a TOML snippet with maintenance_interval_secs = 0
under [defaults.cortex] and assert Config::from_toml(parsed,
PathBuf::from(".")).is_err(); then add snippets for
maintenance_merge_similarity_threshold values out of range (e.g., -0.1 and 1.1)
under [defaults.cortex] and assert they are rejected as well. Locate the
existing test function test_cortex_maintenance_config_rejects_invalid_ranges,
reuse TomlConfig and Config::from_toml for parsing/validation, and add
assertions mirroring the existing style for the new invalid inputs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 25de05ab-141b-48f2-99e3-61e87a80653f
📒 Files selected for processing (11)
docs/design-docs/cortex-implementation.md
src/agent/cortex.rs
src/agent/process_control.rs
src/api/config.rs
src/config.rs
src/config/load.rs
src/config/types.rs
src/main.rs
src/memory/maintenance.rs
src/memory/store.rs
tests/maintenance.rs
…anics Fixes spacedriveapp#273 — cortex_chat.rs panics on CJK memory content because `&result[..200]` can land inside a multi-byte character. Same issue existed in discord.rs reply content truncation. Replace bare byte-index slices with `floor_char_boundary(200)` which rounds down to the nearest char boundary.
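The byte-index panic fixed here is easy to reproduce. As a minimal stable-Rust sketch (std's `str::floor_char_boundary` is nightly-only at the time of writing, so this hand-rolls the equivalent rounding with `is_char_boundary`; it is an illustration, not the repo's actual helper):

```rust
/// Round `idx` down to the nearest UTF-8 char boundary in `s`
/// (a stable-Rust stand-in for the unstable `str::floor_char_boundary`).
fn floor_char_boundary(s: &str, mut idx: usize) -> usize {
    if idx >= s.len() {
        return s.len();
    }
    while !s.is_char_boundary(idx) {
        idx -= 1;
    }
    idx
}

fn main() {
    // Four CJK chars, 3 bytes each: 12 bytes total.
    let content = "记忆内容";
    assert_eq!(content.len(), 12);

    // A bare `&content[..4]` would panic: byte 4 sits inside the second char.
    assert!(!content.is_char_boundary(4));

    // Rounding down lands on a boundary, so slicing is safe.
    let cut = floor_char_boundary(content, 4);
    assert_eq!(cut, 3);
    assert_eq!(&content[..cut], "记");

    // Indices past the end clamp to the full string.
    assert_eq!(floor_char_boundary(content, 200), content.len());
}
```

The same pattern applies to any fixed-length preview truncation (the 200-byte slices in cortex_chat.rs and discord.rs): round the cut point down first, then slice.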
641765b to 6d649e7
Overview
This PR lands the Phase 3 execution pass of the Cortex health supervision work (stacked on the Phase 2 baseline in PR-305), with a focus on bounded maintenance execution, deterministic cancellation, and detached-worker correctness.
It carries through required Phase 2 plan items from:
- docs/design-docs/cortex-implementation.md
- docs/design-docs/cortex-phase2-health-supervision-plan.md

Scope
- `run_maintenance` with bounded execution and explicit two-step cancellation (CancelRequest -> Abort), preventing hung maintenance passes.
- Detached-worker lifecycle hardening (`select!` biased to completion over cancel receive).
- Touched: src/agent/cortex.rs, src/memory/maintenance.rs, tests/maintenance.rs

PR relationship
- cortex: implement phase 2 health supervision

Verification run

- cargo test agent::cortex::tests::run_maintenance_with_cancel_stops_when_cancel_requested --lib -- --nocapture
- cargo test agent::cortex::tests::detached_worker_completion_takes_priority_when_cancel_signal_and_worker_finish_simultaneously --lib -- --nocapture
- cargo test agent::cortex::tests::maintenance_timeout_action_progresses_from_none_to_cancel_to_abort --lib -- --nocapture
- cargo test agent::cortex::tests::maintenance_task_timeout_bounds --lib -- --nocapture
- cargo test --test maintenance -- --nocapture
- cargo test --lib
- cargo test --tests --no-run
- just preflight
- just gate-pr
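The escalation these tests exercise (request cooperative cancellation first, stop waiting and treat the worker as abandoned only after a grace period, with worker completion winning a simultaneous race) can be sketched with plain std threads. The actual implementation uses tokio's `select!` with `biased`; every name below is illustrative:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical stand-in for the CancelRequest -> Abort escalation.
fn supervise<F: FnOnce(Arc<AtomicBool>) + Send + 'static>(
    work: F,
    grace: Duration,
) -> &'static str {
    let cancel = Arc::new(AtomicBool::new(false));
    let flag = cancel.clone();
    let handle = thread::spawn(move || work(flag));

    // Step 1: CancelRequest — flip the cooperative cancellation flag.
    cancel.store(true, Ordering::SeqCst);

    // Step 2: wait up to `grace`; completion is checked before the deadline,
    // so a worker that finishes "simultaneously" still resolves as cancelled.
    let deadline = Instant::now() + grace;
    while Instant::now() < deadline {
        if handle.is_finished() {
            let _ = handle.join();
            return "cancelled"; // worker observed the flag and exited
        }
        thread::sleep(Duration::from_millis(5));
    }
    "aborted" // worker never yielded; supervisor stops waiting
}

fn main() {
    // Cooperative worker: polls the flag and exits promptly.
    let outcome = supervise(
        |cancel| {
            while !cancel.load(Ordering::SeqCst) {
                thread::sleep(Duration::from_millis(1));
            }
        },
        Duration::from_millis(500),
    );
    assert_eq!(outcome, "cancelled");

    // Unresponsive worker: ignores the flag, so the supervisor escalates.
    let outcome = supervise(
        |_cancel| thread::sleep(Duration::from_secs(2)),
        Duration::from_millis(100),
    );
    assert_eq!(outcome, "aborted");
}
```

The polling loop stands in for the `select!` bias: completion is always observed before the deadline fires, which is the deterministic ordering the detached-worker test asserts.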
cargo test agent::cortex::tests::run_maintenance_with_cancel_stops_when_cancel_requested --lib -- --nocapturecargo test agent::cortex::tests::detached_worker_completion_takes_priority_when_cancel_signal_and_worker_finish_simultaneously --lib -- --nocapturecargo test agent::cortex::tests::maintenance_timeout_action_progresses_from_none_to_cancel_to_abort --lib -- --nocapturecargo test agent::cortex::tests::maintenance_task_timeout_bounds --lib -- --nocapturecargo test --test maintenance -- --nocapturecargo test --libcargo test --tests --no-runjust preflightjust gate-prResidual risk