cortex: phase2/3 health supervision and detached lifecycle hardening #307
vsumner wants to merge 8 commits into spacedriveapp:main
Conversation
Note: Reviews paused. This branch is under active development, so CodeRabbit has automatically paused this review to avoid an influx of comments on new commits.
Walkthrough

Adds a cancellation-aware, embedding-backed memory maintenance pipeline (decay, prune, merge) and integrates it into the cortex tick loop with timeout/cancellation controls; extends the config and TOML schema with maintenance tuning fields and validation; expands the process control registry with registration IDs, stale-entry handling, detached-worker snapshots, and new cancellation APIs; adds store merge helpers and integration tests.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
src/tools/cancel.rs (1)
86-118: ⚠️ Potential issue | 🟡 Minor
Normalize empty/whitespace cancellation reasons before forwarding.

`Some("")` (or a whitespace-only string) currently bypasses the default and produces an empty reason in downstream calls and message text.

Proposed fix:

```diff
-    let reason = args.reason.as_deref().unwrap_or("cancelled by tool");
+    let reason = args
+        .reason
+        .as_deref()
+        .map(str::trim)
+        .filter(|reason| !reason.is_empty())
+        .unwrap_or("cancelled by tool");
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `src/tools/cancel.rs` around lines 86-118: Normalize the cancellation reason by treating empty or all-whitespace strings as absent before using the default and before forwarding to downstream calls: trim args.reason (or clone/inspect args.reason.as_deref()) and if the trimmed string is empty treat it like None so the local variable reason uses "cancelled by tool"; pass that normalized reason into cancel_branch_with_reason and cancel_worker_with_reason and use it when building the message text in the final message block (affecting the variables reason, the args.reason check, and the calls to cancel_branch_with_reason / cancel_worker_with_reason).

src/agent/compactor.rs (1)
226-236: 🛠️ Refactor suggestion | 🟠 Major
Compactor path still runs an LLM agent; move this work out of the compactor.
This code keeps the compactor executing LLM summarization directly. Please keep compactor purely programmatic and delegate summarization/extraction to a spawned worker path.
As per coding guidelines
src/agent/compactor.rs: "The Compactor must NOT be an LLM process; it is a programmatic monitor per channel that watches context size and triggers compaction using tiered thresholds: >80% background, >85% aggressive, >95% emergency truncation".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `src/agent/compactor.rs` around lines 226-236: This compactor currently constructs an LLM Agent (creates ToolServerHandle via create_cortex_tool_server and builds an AgentBuilder with compactor_prompt and agent), which violates the rule that the Compactor must be purely programmatic; remove the AgentBuilder/ToolServer creation from compactor.rs and instead have the compactor monitor channel context sizes and, when thresholds are crossed (>80% background, >85% aggressive, >95% emergency truncation), spawn or send a task to a dedicated summarization worker (e.g., spawn_compaction_worker or summary_worker) passing the minimal deps (deps.agent_id, deps.memory_event_tx, deps.memory_search and the compaction request payload) so the worker performs LLM summarization/extraction; ensure the compactor only contains the tiered threshold logic and a one-way trigger to the worker rather than invoking AgentBuilder, compactor_prompt, create_cortex_tool_server, or running the LLM inline.

docs/content/docs/(configuration)/config.mdx (1)
500-508: ⚠️ Potential issue | 🟡 Minor
Add the new maintenance/supervision cortex keys to this config table.
This section still documents only the older cortex fields, but the PR adds new user-facing knobs (maintenance interval, decay/prune/similarity, retry/budget limits). Without listing them here, the config reference is incomplete.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `docs/content/docs/(configuration)/config.mdx` around lines 500-508: Update the `[defaults.cortex]` config table to include the new maintenance/supervision keys added by the PR: add rows for the exact keys introduced (e.g., maintenance_interval_secs, prune_decay_secs, prune_interval_secs, similarity_threshold, decay_rate, retry_budget, max_retries — use the exact key names from the PR), specifying each key's Type, Default value and a one-line Description (e.g., maintenance interval, prune/decay behavior, similarity threshold, retry/budget limits) so the docs match the code; ensure you reference the same symbol names used in the codebase (defaults.cortex.<key>) and fill in the correct defaults and descriptions from the PR.

src/agent/channel.rs (1)
1689-1708: ⚠️ Potential issue | 🟡 Minor
Cancelled branches can leave stale branch-tracking entries.

When a branch is cancelled, its `active_branches` entry is already removed before the synthetic `BranchResult` is observed; this early-return path skips cleanup of `memory_persistence_branches` and `branch_reply_targets`.

🧹 Suggested fix:

```diff
 ProcessEvent::BranchResult { branch_id, conclusion, .. } => {
     let mut branches = self.state.active_branches.write().await;
-    if branches.remove(branch_id).is_none() {
+    let was_active = branches.remove(branch_id).is_some();
+    if !was_active {
+        self.memory_persistence_branches.remove(branch_id);
+        self.branch_reply_targets.remove(branch_id);
         return Ok(());
-    };
+    }
     drop(branches);
     run_logger.log_branch_completed(*branch_id, conclusion);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agent/channel.rs` around lines 1689 - 1708, The early return after branches.remove(branch_id).is_none() causes cancelled branches to skip cleanup of memory_persistence_branches and branch_reply_targets; modify the logic so that when branches.remove(branch_id) returns None you still perform the same cleanup steps (remove from self.memory_persistence_branches and self.branch_reply_targets, log the memory persistence info via tracing::info! if applicable, and update metrics/counts) before returning; specifically update the block around branches.remove(branch_id) in the function containing run_logger.log_branch_completed and ensure cleanup of self.memory_persistence_branches and self.branch_reply_targets happens unconditionally for the branch_id (or at least in the branch where branches.remove returned None) so no stale tracking entries remain.
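The fix above can be modeled standalone. The sketch below uses stdlib maps as stand-ins for the real channel state (the struct and field names mirror the review text, not the crate's actual types): bookkeeping is cleared whether or not the branch was still active, so an already-cancelled branch cannot leak tracking entries.

```rust
use std::collections::HashMap;

// Stand-in for the real channel state described in the review.
struct ChannelState {
    active_branches: HashMap<u64, String>,
    memory_persistence_branches: HashMap<u64, String>,
    branch_reply_targets: HashMap<u64, String>,
}

impl ChannelState {
    /// Returns whether the branch was still active. Cleanup of the two
    /// tracking maps runs unconditionally, so the early-return path for a
    /// cancelled branch no longer skips it.
    fn on_branch_result(&mut self, branch_id: u64) -> bool {
        let was_active = self.active_branches.remove(&branch_id).is_some();
        self.memory_persistence_branches.remove(&branch_id);
        self.branch_reply_targets.remove(&branch_id);
        was_active
    }
}

fn main() {
    // Branch 7 was already cancelled: not in active_branches, but still tracked.
    let mut state = ChannelState {
        active_branches: HashMap::new(),
        memory_persistence_branches: HashMap::from([(7, String::from("mem"))]),
        branch_reply_targets: HashMap::from([(7, String::from("reply"))]),
    };
    assert!(!state.on_branch_result(7));
    // The tracking entries are cleared even on the inactive path.
    assert!(state.memory_persistence_branches.is_empty());
    assert!(state.branch_reply_targets.is_empty());
}
```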
🧹 Nitpick comments (2)
src/config.rs (2)
4856-4906: Add bounds validation for the new maintenance numeric knobs at config load.

These paths currently accept invalid values (e.g., zero interval, negative age, out-of-range/non-finite thresholds), which can cause unstable runtime maintenance behavior.
Proposed validation guardrails
```diff
+fn validate_cortex_maintenance(cortex: &CortexConfig, scope: &str) -> Result<()> {
+    if cortex.maintenance_interval_secs == 0 {
+        return Err(ConfigError::Invalid(format!(
+            "{scope}.maintenance_interval_secs must be > 0"
+        ))
+        .into());
+    }
+    if cortex.maintenance_min_age_days < 0 {
+        return Err(ConfigError::Invalid(format!(
+            "{scope}.maintenance_min_age_days must be >= 0"
+        ))
+        .into());
+    }
+    for (name, value) in [
+        ("maintenance_decay_rate", cortex.maintenance_decay_rate),
+        ("maintenance_prune_threshold", cortex.maintenance_prune_threshold),
+        (
+            "maintenance_merge_similarity_threshold",
+            cortex.maintenance_merge_similarity_threshold,
+        ),
+    ] {
+        if !value.is_finite() || !(0.0..=1.0).contains(&value) {
+            return Err(ConfigError::Invalid(format!(
+                "{scope}.{name} must be a finite value in [0.0, 1.0]"
+            ))
+            .into());
+        }
+    }
+    Ok(())
+}

 let defaults = DefaultsConfig {
     // ...
 };
+validate_cortex_maintenance(&defaults.cortex, "defaults.cortex")?;

-    cortex: a.cortex.map(|c| CortexConfig {
+    cortex: a.cortex.map(|c| CortexConfig {
         // ...
     }),
```

(then validate `AgentConfig.cortex` after construction, similar to the defaults scope)

Also applies to: 5083-5133
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/config.rs` around lines 4856 - 4906, The new numeric maintenance knobs (e.g., maintenance_interval_secs, worker_timeout_secs, branch_timeout_secs, detached_worker_timeout_retry_limit, supervisor_kill_budget_per_tick, circuit_breaker_threshold, bulletin_interval_secs, bulletin_max_words, bulletin_max_turns, association_interval_secs, association_similarity_threshold, association_updates_threshold, association_max_per_pass, maintenance_decay_rate, maintenance_prune_threshold, maintenance_min_age_days, maintenance_merge_similarity_threshold) currently accept invalid values; add bounds validation after building AgentConfig.cortex (similar to how defaults are validated) to reject or clamp zero/negative intervals, non-positive counts, out-of-range/non-finite thresholds, and unreasonable ages. Implement a validate_agent_cortex or AgentConfig::validate_cortex function and call it after construction to return an error (or apply safe clamping) when any field is invalid, using the same defaults scope/base_defaults.cortex for reference and include clear error messages mentioning the offending field.
7613-7660: Extend this regression test to assert the two new supervision control fields as well.

The test currently validates maintenance interval/decay-related propagation, but it does not verify `detached_worker_timeout_retry_limit` and `supervisor_kill_budget_per_tick`.

Suggested test additions:

```diff
 [defaults.cortex]
 maintenance_interval_secs = 1200
+detached_worker_timeout_retry_limit = 4
+supervisor_kill_budget_per_tick = 12
 maintenance_decay_rate = 0.11
 maintenance_prune_threshold = 0.21
 maintenance_min_age_days = 17
 maintenance_merge_similarity_threshold = 0.97
@@
     assert_eq!(resolved.cortex.maintenance_interval_secs, 1200);
+    assert_eq!(resolved.cortex.detached_worker_timeout_retry_limit, 4);
+    assert_eq!(resolved.cortex.supervisor_kill_budget_per_tick, 12);
     assert_eq!(resolved.cortex.maintenance_decay_rate, 0.11);
@@
-cortex = { maintenance_interval_secs = 300, maintenance_decay_rate = 0.33 }
+cortex = { maintenance_interval_secs = 300, maintenance_decay_rate = 0.33, detached_worker_timeout_retry_limit = 6, supervisor_kill_budget_per_tick = 16 }
@@
     assert_eq!(resolved_override.cortex.maintenance_interval_secs, 300);
+    assert_eq!(resolved_override.cortex.detached_worker_timeout_retry_limit, 6);
+    assert_eq!(resolved_override.cortex.supervisor_kill_budget_per_tick, 16);
     assert_eq!(resolved_override.cortex.maintenance_decay_rate, 0.33);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/config.rs` around lines 7613 - 7660, The test test_cortex_maintenance_config_resolution needs to also assert the two new supervision control fields on the resolved cortex config: add assertions after the first resolved variable checks to verify resolved.cortex.detached_worker_timeout_retry_limit and resolved.cortex.supervisor_kill_budget_per_tick equal the values coming from config.defaults.cortex (use config.defaults.cortex.detached_worker_timeout_retry_limit and config.defaults.cortex.supervisor_kill_budget_per_tick), and likewise add assertions after resolved_override checks to verify resolved_override.cortex.detached_worker_timeout_retry_limit and resolved_override.cortex.supervisor_kill_budget_per_tick equal config_override.defaults.cortex.* so the test covers both default propagation and override behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/design-docs/cortex-history.md`:
- Around line 71-74: Update the document to remove the contradictory
rollout-status by making the implementation status consistent: either mark
"Health supervision (Phase 2, implemented)" as implemented throughout or change
it to "planned/partially implemented" and clarify what is actually emitted
initially where the bulletin sentence occurs; specifically edit the "Health
supervision" heading and the earlier sentence that currently says "only bulletin
events are emitted initially" (around the bulletin text) so both state the same
rollout phase and exact emitted events (e.g., "bulletin events only initially;
full health supervision events (worker_killed, branch_killed,
circuit_breaker_tripped, health_check) are available from Phase 2 after rollout"
or mark Phase 2 as not yet rolled out), and ensure you reference the Phase 2
breaker volatility note remains accurate.
In `@docs/design-docs/cortex-implementation.md`:
- Around line 86-89: The fenced lifecycle diagram block containing the lines "0
active -> 1 completing -> 3 terminal" and "0 active -> 2 killing -> 3
terminal" is missing a language specifier which triggers markdownlint MD040;
update that fenced block by adding a language tag (e.g., "text") immediately
after the opening triple backticks so the block becomes a labeled code fence
(open the file and edit the block that contains those two lifecycle lines).
- Around line 19-21: The heading and bullet contradict each other — either
remove "never called" from the section title or change the bullet to indicate
these functions are not wired; specifically update the text to unambiguously
state the real status of apply_decay, prune_memories, and merge_similar_memories
(e.g., "Implemented and wired into the cortex loop" if they are invoked, or
"Implemented but not yet wired into the cortex loop" if they are not), and
ensure the section references the correct runtime behavior of the cortex loop
rather than both claims.
In `@src/agent/process_control.rs`:
- Around line 102-112: The code removes the map entry unconditionally when the
cloned Weak handle fails to upgrade, which races with a concurrent registration
that may have replaced the entry; change the cleanup to be conditional: keep the
originally read Weak (e.g. the variable bound to handle before upgrade), and
when upgrade() returns None acquire the write lock and only remove(channel_id)
if the current map entry still equals that original Weak
(pointer-equality/Weak::ptr_eq or by comparing the cloned Weak), otherwise leave
it intact; preserve current behavior of calling
handle.cancel_worker_with_reason(worker_id, reason).await when upgrade succeeds
and return ControlActionResult::NotFound only when the entry was indeed the same
stale handle.
In `@src/api/channels.rs`:
- Around line 293-298: The code currently collapses all errors from
channel_state.cancel_worker_with_reason(...) into boolean checks and map_err(|_|
NOT_FOUND), losing actionable failure details; update the handling in the
channel_state branches (the one using cancel_worker_with_reason(...) near the
shown diff and the similar block around lines 349-352) to match on the Result
instead of calling is_ok()/map_err(|_| ...), then log the error and translate
known domain errors to the appropriate API response (e.g., return NOT_FOUND only
when the error truly means "not found") and propagate or convert unexpected
errors to a 5xx response; reference the cancel_worker_with_reason(...) call and
worker_id to locate and change the logic.
In `@src/api/config.rs`:
- Around line 600-605: The code currently casts
cortex.detached_worker_timeout_retry_limit and
cortex.supervisor_kill_budget_per_tick directly to i64 which can overflow;
replace the direct casts with checked conversions (e.g., i64::try_from or
TryInto) and if conversion fails return a 400 BAD_REQUEST (use the same error
path used by the warmup update logic). Update both occurrences (the block
handling detached_worker_timeout_retry_limit and supervisor_kill_budget_per_tick
and the similar block at the later occurrence) to perform the conversion into an
i64 variable, use toml_edit::value on that variable, and map conversion errors
to the existing BAD_REQUEST response type used elsewhere in this module.
- Around line 621-632: The code is writing maintenance knobs directly into the
TOML table; add validation for each optional field before persisting: ensure
cortex.maintenance_decay_rate, cortex.maintenance_prune_threshold, and
cortex.maintenance_merge_similarity_threshold (the three places where you cast
to f64 and assign to table keys "maintenance_decay_rate",
"maintenance_prune_threshold", "maintenance_merge_similarity_threshold") are
within 0.0..=1.0, and ensure cortex.maintenance_min_age_days (the
"maintenance_min_age_days" assignment) is >= 0; if any value is out of range,
return an appropriate error (or propagate a validation error) instead of writing
to the table so the API rejects invalid inputs. Ensure error messages mention
the specific field name for clarity.
In `@src/lib.rs`:
- Around line 267-277: The function create_process_event_buses_with_capacity
must guard against zero capacities to avoid Tokios internal panic: validate
control_event_capacity and memory_event_capacity are > 0 at the start of the
function and return an error instead of calling broadcast::channel when either
is zero (change the signature to return a
Result<(tokio::sync::broadcast::Sender<ProcessEvent>,
tokio::sync::broadcast::Sender<ProcessEvent>), SomeErrorType> or use an existing
error type in the crate); ensure you reference the incoming parameter names
control_event_capacity and memory_event_capacity and return an Err with a clear
message when either is zero, otherwise proceed to create the channels (event_tx,
_event_rx) and (memory_event_tx, _memory_event_rx) and return Ok((event_tx,
memory_event_tx)).
In `@src/memory/maintenance.rs`:
- Around line 324-377: The merge flow (the sequence calling
maintenance_cancelable_op with memory_store.update, get_associations,
delete_associations_for_memory, create_association,
memory_store.create_association, memory_store.forget and embedding_table.delete)
must be executed atomically: open a single DB transaction around that whole
sequence (e.g., in merge_pair), perform one pre-check for cancellation (call
check_maintenance_cancellation once before beginning the transaction), run all
reads/writes inside the transaction, and commit or rollback the transaction on
success/error so partial rewires cannot persist; remove per-association
cancellation checks and avoid issuing intermediate independent
maintenance_cancelable_op calls for those operations so they happen as a single
transactional unit.
In `@tests/maintenance.rs`:
- Around line 8-11: The fixture make_memory_maintenance_fixture drops the
TempDir before embedding_table is used, causing the backing directory to be
deleted; update the fixture signature and return value to include the tempdir
(e.g., return std::sync::Arc<MemoryStore>, spacebot::memory::EmbeddingTable, and
the TempDir) so the TempDir lives as long as embedding_table, update all call
sites that destructure the tuple accordingly, and ensure the TempDir variable is
held for the lifetime of tests that access embedding_table.
---
Outside diff comments:
In `docs/content/docs/(configuration)/config.mdx`:
- Around line 500-508: Update the `[defaults.cortex]` config table to include
the new maintenance/supervision keys added by the PR: add rows for the exact
keys introduced (e.g., maintenance_interval_secs, prune_decay_secs,
prune_interval_secs, similarity_threshold, decay_rate, retry_budget, max_retries
— use the exact key names from the PR), specifying each key's Type, Default
value and a one-line Description (e.g., maintenance interval, prune/decay
behavior, similarity threshold, retry/budget limits) so the docs match the code;
ensure you reference the same symbol names used in the codebase
(defaults.cortex.<key>) and fill in the correct defaults and descriptions from
the PR.
In `@src/agent/channel.rs`:
- Around line 1689-1708: The early return after
branches.remove(branch_id).is_none() causes cancelled branches to skip cleanup
of memory_persistence_branches and branch_reply_targets; modify the logic so
that when branches.remove(branch_id) returns None you still perform the same
cleanup steps (remove from self.memory_persistence_branches and
self.branch_reply_targets, log the memory persistence info via tracing::info! if
applicable, and update metrics/counts) before returning; specifically update the
block around branches.remove(branch_id) in the function containing
run_logger.log_branch_completed and ensure cleanup of
self.memory_persistence_branches and self.branch_reply_targets happens
unconditionally for the branch_id (or at least in the branch where
branches.remove returned None) so no stale tracking entries remain.
In `@src/agent/compactor.rs`:
- Around line 226-236: This compactor currently constructs an LLM Agent (creates
ToolServerHandle via create_cortex_tool_server and builds an AgentBuilder with
compactor_prompt and agent), which violates the rule that Compactor must be
purely programmatic; remove the AgentBuilder/ToolServer creation from
compactor.rs and instead have the compactor monitor channel context sizes and,
when thresholds are crossed (>80% background, >85% aggressive, >95% emergency
truncation), spawn or send a task to a dedicated summarization worker (e.g.,
spawn_compaction_worker or summary_worker) passing the minimal deps
(deps.agent_id, deps.memory_event_tx, deps.memory_search and the compaction
request payload) so the worker performs LLM summarization/extraction; ensure
compactor only contains the tiered threshold logic and a one-way trigger to the
worker rather than invoking AgentBuilder, compactor_prompt,
create_cortex_tool_server, or running the LLM inline.
In `@src/tools/cancel.rs`:
- Around line 86-118: Normalize the cancellation reason by treating empty or
all-whitespace strings as absent before using the default and before forwarding
to downstream calls: trim args.reason (or clone/inspect args.reason.as_deref())
and if the trimmed string is empty treat it like None so the local variable
reason uses "cancelled by tool"; pass that normalized reason into
cancel_branch_with_reason and cancel_worker_with_reason and use it when building
the message text in the final message block (affecting the variables reason,
args.reason check, and the calls to cancel_branch_with_reason /
cancel_worker_with_reason).
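The normalization this comment asks for can be exercised as a standalone sketch. The helper name is hypothetical; the real code operates on `args.reason` inside the tool handler, but the trim/filter/default chain is the same:

```rust
// An empty or all-whitespace reason falls back to the default instead of
// propagating "" into downstream calls and message text.
fn normalize_reason(reason: Option<&str>) -> &str {
    reason
        .map(str::trim)
        .filter(|r| !r.is_empty())
        .unwrap_or("cancelled by tool")
}

fn main() {
    assert_eq!(normalize_reason(None), "cancelled by tool");
    assert_eq!(normalize_reason(Some("   ")), "cancelled by tool");
    assert_eq!(normalize_reason(Some(" user asked ")), "user asked");
}
```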
---
Nitpick comments:
In `@src/config.rs`:
- Around line 4856-4906: The new numeric maintenance knobs (e.g.,
maintenance_interval_secs, worker_timeout_secs, branch_timeout_secs,
detached_worker_timeout_retry_limit, supervisor_kill_budget_per_tick,
circuit_breaker_threshold, bulletin_interval_secs, bulletin_max_words,
bulletin_max_turns, association_interval_secs, association_similarity_threshold,
association_updates_threshold, association_max_per_pass, maintenance_decay_rate,
maintenance_prune_threshold, maintenance_min_age_days,
maintenance_merge_similarity_threshold) currently accept invalid values; add
bounds validation after building AgentConfig.cortex (similar to how defaults are
validated) to reject or clamp zero/negative intervals, non-positive counts,
out-of-range/non-finite thresholds, and unreasonable ages. Implement a
validate_agent_cortex or AgentConfig::validate_cortex function and call it after
construction to return an error (or apply safe clamping) when any field is
invalid, using the same defaults scope/base_defaults.cortex for reference and
include clear error messages mentioning the offending field.
- Around line 7613-7660: The test test_cortex_maintenance_config_resolution
needs to also assert the two new supervision control fields on the resolved
cortex config: add assertions after the first resolved variable checks to verify
resolved.cortex.detached_worker_timeout_retry_limit and
resolved.cortex.supervisor_kill_budget_per_tick equal the values coming from
config.defaults.cortex (use
config.defaults.cortex.detached_worker_timeout_retry_limit and
config.defaults.cortex.supervisor_kill_budget_per_tick), and likewise add
assertions after resolved_override checks to verify
resolved_override.cortex.detached_worker_timeout_retry_limit and
resolved_override.cortex.supervisor_kill_budget_per_tick equal
config_override.defaults.cortex.* so the test covers both default propagation
and override behavior.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 6443e609-32e5-4011-bbce-78571bb04b30
⛔ Files ignored due to path filters (1)
Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (29)
docs/content/docs/(configuration)/config.mdx
docs/content/docs/(core)/architecture.mdx
docs/design-docs/cortex-history.md
docs/design-docs/cortex-implementation.md
src/agent.rs
src/agent/channel.rs
src/agent/channel_dispatch.rs
src/agent/channel_history.rs
src/agent/compactor.rs
src/agent/cortex.rs
src/agent/ingestion.rs
src/agent/process_control.rs
src/agent/status.rs
src/api/agents.rs
src/api/channels.rs
src/api/config.rs
src/config.rs
src/lib.rs
src/main.rs
src/memory/maintenance.rs
src/memory/store.rs
src/telemetry/registry.rs
src/tools.rs
src/tools/cancel.rs
src/tools/memory_save.rs
tests/bulletin.rs
tests/context_dump.rs
tests/detached_worker_bootstrap.rs
tests/maintenance.rs
a41e2c7 to 87f1038
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/agent/worker.rs (1)
348-383: ⚠️ Potential issue | 🟠 Major
Handle follow-up cancellation explicitly instead of treating it as failure.
In the follow-up loop, `PromptCancelled` currently falls into the generic `Err(error)` arm (Line 377), which reclassifies cancellation as a failure and changes downstream completion semantics (`failed` vs `cancelled`).
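The distinction can be shown with a minimal sketch. The enum names are stand-ins mirroring the comment, not the crate's real types; the point is that `PromptCancelled` gets its own match arm instead of falling into the generic error arm:

```rust
#[derive(Debug, PartialEq)]
enum PromptError {
    PromptCancelled,
    Other(String),
}

#[derive(Debug, PartialEq)]
enum Completion {
    Succeeded,
    Cancelled,
    Failed(String),
}

fn classify(result: Result<(), PromptError>) -> Completion {
    match result {
        Ok(()) => Completion::Succeeded,
        // Explicit arm: cancellation is not reclassified as a failure.
        Err(PromptError::PromptCancelled) => Completion::Cancelled,
        Err(PromptError::Other(message)) => Completion::Failed(message),
    }
}

fn main() {
    assert_eq!(
        classify(Err(PromptError::PromptCancelled)),
        Completion::Cancelled
    );
    assert_eq!(
        classify(Err(PromptError::Other("boom".to_string()))),
        Completion::Failed("boom".to_string())
    );
}
```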
♻️ Duplicate comments (8)
src/api/config.rs (2)
621-632: ⚠️ Potential issue | 🟠 Major
Validate maintenance parameter bounds before persisting.
These fields are accepted and written without bounds checks. Please reject invalid values (`0.0..=1.0` for decay/prune/similarity; `>= 0` for min age) before mutating the TOML document.

Suggested fix:

```diff
+fn validate_unit_interval(value: f32) -> Result<(), StatusCode> {
+    if (0.0..=1.0).contains(&value) {
+        Ok(())
+    } else {
+        Err(StatusCode::BAD_REQUEST)
+    }
+}
@@
-    if let Some(v) = cortex.maintenance_decay_rate {
-        table["maintenance_decay_rate"] = toml_edit::value(v as f64);
+    if let Some(value) = cortex.maintenance_decay_rate {
+        validate_unit_interval(value)?;
+        table["maintenance_decay_rate"] = toml_edit::value(value as f64);
     }
-    if let Some(v) = cortex.maintenance_prune_threshold {
-        table["maintenance_prune_threshold"] = toml_edit::value(v as f64);
+    if let Some(value) = cortex.maintenance_prune_threshold {
+        validate_unit_interval(value)?;
+        table["maintenance_prune_threshold"] = toml_edit::value(value as f64);
     }
-    if let Some(v) = cortex.maintenance_min_age_days {
-        table["maintenance_min_age_days"] = toml_edit::value(v);
+    if let Some(value) = cortex.maintenance_min_age_days {
+        if value < 0 {
+            return Err(StatusCode::BAD_REQUEST);
+        }
+        table["maintenance_min_age_days"] = toml_edit::value(value);
     }
-    if let Some(v) = cortex.maintenance_merge_similarity_threshold {
-        table["maintenance_merge_similarity_threshold"] = toml_edit::value(v as f64);
+    if let Some(value) = cortex.maintenance_merge_similarity_threshold {
+        validate_unit_interval(value)?;
+        table["maintenance_merge_similarity_threshold"] = toml_edit::value(value as f64);
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/api/config.rs` around lines 621 - 632, The maintenance parameters are written into the TOML without validation; add range checks before mutating table: for maintenance_decay_rate, maintenance_prune_threshold, and maintenance_merge_similarity_threshold (the cortex fields referenced) ensure their values are within 0.0..=1.0 and for maintenance_min_age_days ensure value is >= 0; if a value is out of bounds, return an error (or propagate a validation error) instead of setting table[...] — perform these checks immediately before the existing assignments that set table["maintenance_decay_rate"], table["maintenance_prune_threshold"], table["maintenance_min_age_days"], and table["maintenance_merge_similarity_threshold"] so invalid inputs are rejected prior to TOML mutation.
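The bounds check itself is easy to exercise standalone. Here the API's status-code error type is swapped for a plain string (an assumption for the sake of a self-contained example); the range logic is the part that matters:

```rust
// Reject non-finite values and anything outside the closed unit interval.
fn validate_unit_interval(name: &str, value: f64) -> Result<(), String> {
    if value.is_finite() && (0.0..=1.0).contains(&value) {
        Ok(())
    } else {
        Err(format!("{name} must be a finite value in [0.0, 1.0]"))
    }
}

fn main() {
    assert!(validate_unit_interval("maintenance_decay_rate", 0.11).is_ok());
    assert!(validate_unit_interval("maintenance_decay_rate", 1.5).is_err());
    // NaN fails the is_finite() check rather than slipping through a
    // comparison that would be false for NaN anyway.
    assert!(validate_unit_interval("maintenance_decay_rate", f64::NAN).is_err());
}
```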
603-605: ⚠️ Potential issue | 🟠 Major
Guard lossy integer conversions before writing cortex values.
`supervisor_kill_budget_per_tick` (usize) and `maintenance_interval_secs` (u64) are cast with `as i64`; oversized inputs can wrap into negative persisted values.

Suggested fix:
```diff
-    if let Some(v) = cortex.supervisor_kill_budget_per_tick {
-        table["supervisor_kill_budget_per_tick"] = toml_edit::value(v as i64);
+    if let Some(value) = cortex.supervisor_kill_budget_per_tick {
+        let value = i64::try_from(value).map_err(|_| StatusCode::BAD_REQUEST)?;
+        table["supervisor_kill_budget_per_tick"] = toml_edit::value(value);
     }
@@
-    if let Some(v) = cortex.maintenance_interval_secs {
-        table["maintenance_interval_secs"] = toml_edit::value(v as i64);
+    if let Some(value) = cortex.maintenance_interval_secs {
+        let value = i64::try_from(value).map_err(|_| StatusCode::BAD_REQUEST)?;
+        table["maintenance_interval_secs"] = toml_edit::value(value);
     }
```

Run this read-only check to confirm remaining risky casts in this path:
```shell
#!/bin/bash
rg -nP -C2 'supervisor_kill_budget_per_tick"\]\s*=\s*toml_edit::value\(.+\s+as\s+i64\)|maintenance_interval_secs"\]\s*=\s*toml_edit::value\(.+\s+as\s+i64\)' src/api/config.rs
```

Also applies to: 618-620
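The difference between the checked and unchecked conversion is easy to demonstrate in isolation (the function and error string here are illustrative stand-ins for the module's BAD_REQUEST path):

```rust
// i64::try_from rejects values an `as i64` cast would silently wrap negative.
fn to_persistable_i64(value: u64) -> Result<i64, String> {
    i64::try_from(value).map_err(|_| "value too large to persist".to_string())
}

fn main() {
    assert_eq!(to_persistable_i64(300), Ok(300));
    assert!(to_persistable_i64(u64::MAX).is_err());
    // By contrast, the unchecked cast wraps: u64::MAX as i64 == -1.
    assert_eq!(u64::MAX as i64, -1);
}
```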
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `src/api/config.rs` around lines 603-605: The code is performing lossy casts like `cortex.supervisor_kill_budget_per_tick as i64` (and similarly `maintenance_interval_secs as i64`) which can wrap negative for oversized inputs; change these assignments to validate and convert safely using `i64::try_from(...)` or `usize::try_into::<i64>()` and handle the Err case (return an error, skip persisting with a logged warning, or clamp to a safe max) instead of an unchecked `as` cast so only representable values are written into `table["supervisor_kill_budget_per_tick"]` and `table["maintenance_interval_secs"]`. Ensure you update the code paths around the `table[...] = toml_edit::value(...)` assignments (references: cortex.supervisor_kill_budget_per_tick, maintenance_interval_secs) to perform the checked conversion and appropriate error handling or fallback.

src/memory/maintenance.rs (1)
324-377: ⚠️ Potential issue | 🔴 Critical
Merge rewiring still is not atomic across cancellation/error boundaries.
This sequence mutates survivor content, rewires associations, writes `Updates` associations, forgets the loser, and deletes embeddings in separate operations with cancellation checkpoints in the middle. A mid-sequence failure can persist a partially merged graph.

Please verify available transaction primitives in storage code, then wrap this critical section as one atomic unit:
```shell
#!/bin/bash
rg -nP --type=rust -C2 'impl\s+MemoryStore|fn\s+.*(transaction|begin|commit|rollback)|sqlx::Transaction' src/memory
```
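The all-or-nothing property being asked for can be illustrated without a database. This sketch is purely a model, not the crate's API: every mutation is staged on a copy of the store and swapped in only at the commit point, which is the boundary the real code would get from a storage transaction (with a single cancellation check before the transaction begins).

```rust
use std::collections::HashMap;

#[derive(Clone)]
struct Store {
    memories: HashMap<u64, String>,
    associations: Vec<(u64, u64)>,
}

// Merge `loser` into `survivor` atomically: either every step lands or none do.
fn merge_pair(store: &mut Store, survivor: u64, loser: u64, merged: String) -> Result<(), String> {
    // Single pre-check before the "transaction" begins (stands in for the
    // one-time cancellation check the review recommends).
    if !store.memories.contains_key(&survivor) || !store.memories.contains_key(&loser) {
        return Err("missing memory".to_string());
    }
    // Stage all mutations on a copy; the live store is untouched until commit.
    let mut staged = store.clone();
    staged.memories.insert(survivor, merged);
    // Rewire every association that pointed at the loser to the survivor.
    for assoc in &mut staged.associations {
        if assoc.0 == loser {
            assoc.0 = survivor;
        }
        if assoc.1 == loser {
            assoc.1 = survivor;
        }
    }
    staged.memories.remove(&loser);
    // "Commit": a partially merged graph was never observable.
    *store = staged;
    Ok(())
}

fn main() {
    let mut store = Store {
        memories: HashMap::from([(1, String::from("a")), (2, String::from("b"))]),
        associations: vec![(2, 9)],
    };
    merge_pair(&mut store, 1, 2, String::from("a+b")).unwrap();
    assert_eq!(store.associations, vec![(1, 9)]);
    assert!(!store.memories.contains_key(&2));
}
```

In the real code the staging copy would be replaced by the storage layer's begin/commit/rollback primitive, which the shell check above is meant to locate.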
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `src/memory/maintenance.rs` around lines 324-377: The merge sequence mutates survivor content and rewrites associations via memory_store.update, get_associations, delete_associations_for_memory, create_association, creates an Updates association, calls memory_store.forget and embedding_table.delete with cancellation checkpoints between them; wrap all of these operations in a single storage transaction (use whatever transaction primitive exists on MemoryStore — e.g., begin_transaction/transaction/tx or sqlx::Transaction if present) so the update + rewire + create Updates + forget + embedding delete are executed atomically, remove intermediate maintenance_cancelable_op awaits inside the transactional block (perform a single cancellable check before starting and ensure cancellation causes a rollback), and on error ensure you rollback the transaction and only after commit run any non-transactional cleanup (or queue embedding_table.delete inside the same transaction if supported) so partial merges cannot persist; reference functions: memory_store.update, memory_store.get_associations, memory_store.delete_associations_for_memory, memory_store.create_association, memory_store.forget, embedding_table.delete, Association::new, maintenance_cancelable_op, check_maintenance_cancellation.

docs/design-docs/cortex-implementation.md (2)
19-21: ⚠️ Potential issue | 🟡 Minor
Resolve the contradictory maintenance status wording.
Line 19 says “never called” while Line 20 says “wired into the cortex loop.” Keep one unambiguous state.
🛠️ Suggested wording fix
```diff
-**Implemented but never called:**
+**Implemented and called:**
 - `memory/maintenance.rs` — `apply_decay()`, `prune_memories()`, and `merge_similar_memories()` are implemented and wired into the cortex loop.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design-docs/cortex-implementation.md` around lines 19 - 21, The document currently contradicts itself by saying the maintenance functions are "Implemented but never called" while also stating they are "wired into the cortex loop"; update the wording to a single clear state: either mark memory/maintenance.rs and its functions apply_decay(), prune_memories(), and merge_similar_memories() as "Implemented and wired into the cortex loop" or as "Implemented but not yet wired into the cortex loop" (choose the true runtime state), and make the cortex loop description consistent with that choice so there is no ambiguity.
86-89: ⚠️ Potential issue | 🟡 Minor
Add language identifiers to fenced code blocks.
These fences are unlabeled and will keep triggering MD040.
🛠️ Suggested lint-safe fix
````diff
-```
+```text
 0 active -> 1 completing -> 3 terminal
 0 active -> 2 killing -> 3 terminal
@@
-```
+```text
 Phase 1 (tick loop) — standalone, just plumbing
 Phase 2 (health) — depends on Phase 1
 Phase 3 (maintenance) — depends on Phase 1, independent of Phase 2
 Phase 4 (consolidation) — depends on Phase 1, benefits from Phase 3
 Phase 5 (hook) — independent, can be done anytime after Phase 1
````

Also applies to: 204-210
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design-docs/cortex-implementation.md` around lines 86 - 89, Two fenced code blocks in the document are unlabeled and trigger MD040; add a language identifier (e.g., "text") to each opening fence for the blocks containing the lines "0 active -> 1 completing -> 3 terminal / 0 active -> 2 killing -> 3 terminal" and the block starting with "Phase 1 (tick loop) — standalone, just plumbing" so the fences become ```text and satisfy the linter.tests/maintenance.rs (1)
8-11: ⚠️ Potential issue | 🟠 Major
Keep the temporary directory alive for the full test lifetime.
Line 27 creates `TempDir`, but it is dropped when the fixture returns at Line 36. `embedding_table` can then reference deleted storage and become unstable during assertions.

🛠️ Suggested fix
```diff
 async fn make_memory_maintenance_fixture() -> (
     std::sync::Arc<MemoryStore>,
     spacebot::memory::EmbeddingTable,
+    tempfile::TempDir,
 ) {
@@
-    let dir = tempdir().expect("failed to create temp dir");
-    let lance_conn = lancedb::connect(dir.path().to_str().expect("temp path"))
+    let temp_directory = tempdir().expect("failed to create temp dir");
+    let lance_conn = lancedb::connect(temp_directory.path().to_str().expect("temp path"))
         .execute()
         .await
         .expect("failed to connect to lancedb");
@@
-    (store, embedding_table)
+    (store, embedding_table, temp_directory)
 }
@@
-    let (store, embedding_table) = make_memory_maintenance_fixture().await;
+    let (store, embedding_table, _temp_directory) = make_memory_maintenance_fixture().await;
@@
-    let (store, embedding_table) = make_memory_maintenance_fixture().await;
+    let (store, embedding_table, _temp_directory) = make_memory_maintenance_fixture().await;
```

Also applies to: 27-37, 41-42, 163-164
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/maintenance.rs` around lines 8 - 11, The temporary directory created inside make_memory_maintenance_fixture is dropped when the function returns, leaving embedding_table referencing deleted storage; fix this by extending the TempDir lifetime (e.g., include the TempDir in the fixture's return value or wrap it in an Arc and return it alongside std::sync::Arc<MemoryStore> and spacebot::memory::EmbeddingTable) so the TempDir is kept alive for the entire test; update all call sites and other fixtures mentioned (the other occurrences around where TempDir is created) to accept/propagate the extra TempDir holder to prevent use-after-drop of embedding_table storage.src/agent/process_control.rs (1)
102-112: ⚠️ Potential issue | 🟠 Major
Avoid removing a potentially newer channel handle on stale upgrade failure.
Line 110 and Line 129 unconditionally `remove(channel_id)` after `upgrade()` fails. That can race with concurrent re-registration and delete a fresh live handle.

🛠️ Race-safe cleanup
```diff
 if let Some(handle) = handle.upgrade() {
     handle.cancel_worker_with_reason(worker_id, reason).await
 } else {
-    self.channels.write().await.remove(channel_id);
+    let mut channels = self.channels.write().await;
+    if channels
+        .get(channel_id)
+        .is_some_and(|current| current.upgrade().is_none())
+    {
+        channels.remove(channel_id);
+    }
     ControlActionResult::NotFound
 }
@@
 if let Some(handle) = handle.upgrade() {
     handle.cancel_branch_with_reason(branch_id, reason).await
 } else {
-    self.channels.write().await.remove(channel_id);
+    let mut channels = self.channels.write().await;
+    if channels
+        .get(channel_id)
+        .is_some_and(|current| current.upgrade().is_none())
+    {
+        channels.remove(channel_id);
+    }
     ControlActionResult::NotFound
 }
```

Also applies to: 121-131
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agent/process_control.rs` around lines 102 - 112, The code unconditionally calls self.channels.write().await.remove(channel_id) when Weak::upgrade() fails, which can delete a newly-registered handle; change the cleanup to be race-safe by taking the write lock and removing the entry only if the stored Weak pointer is the same stale Weak you originally looked up (or use the HashMap entry API to compare-and-remove): locate the code around channels.read().await.get(channel_id).cloned() and the subsequent upgrade() check in process_control.rs and replace the unconditional remove(channel_id) with a conditional remove that verifies pointer identity of the Weak handle (or re-check the map entry equals the cloned Weak) before removing, preserving correct behavior for cancel_worker_with_reason and ControlActionResult paths.src/lib.rs (1)
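The compare-before-remove idea generalizes beyond this registry. A stdlib-only sketch (the map type and `remove_if_stale` helper are illustrative, not the crate's actual API, and the real code guards the map with an async `RwLock` rather than a plain `&mut`):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Weak};

// Race-safe eviction: remove the registry entry only while the stored Weak is
// still dead at removal time, so a concurrent re-registration that swapped in
// a live handle is never deleted by a stale-cleanup path.
fn remove_if_stale(channels: &mut HashMap<String, Weak<String>>, channel_id: &str) -> bool {
    let is_stale = channels
        .get(channel_id)
        .is_some_and(|handle| handle.upgrade().is_none());
    if is_stale {
        channels.remove(channel_id);
    }
    is_stale
}

fn main() {
    let mut channels: HashMap<String, Weak<String>> = HashMap::new();

    // A live handle must survive the cleanup attempt.
    let live = Arc::new("handle".to_string());
    channels.insert("chan-a".into(), Arc::downgrade(&live));
    assert!(!remove_if_stale(&mut channels, "chan-a"));
    assert!(channels.contains_key("chan-a"));

    // A dead handle (its Arc was dropped) is evicted.
    let dead = Arc::new("handle".to_string());
    channels.insert("chan-b".into(), Arc::downgrade(&dead));
    drop(dead);
    assert!(remove_if_stale(&mut channels, "chan-b"));
    assert!(!channels.contains_key("chan-b"));
}
```

Because the staleness check and the removal happen under the same write lock in the suggested diff, a re-registration cannot interleave between them.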
267-277: ⚠️ Potential issue | 🟠 Major
Guard zero-capacity bus creation to avoid runtime panic.
On Line 274 and Line 276, `broadcast::channel` will panic if either capacity is `0`, and this public helper currently accepts unvalidated input.

🔧 Suggested hardening (non-breaking)
```diff
 pub fn create_process_event_buses_with_capacity(
     control_event_capacity: usize,
     memory_event_capacity: usize,
 ) -> (
     tokio::sync::broadcast::Sender<ProcessEvent>,
     tokio::sync::broadcast::Sender<ProcessEvent>,
 ) {
-    let (event_tx, _event_rx) = tokio::sync::broadcast::channel(control_event_capacity);
-    let (memory_event_tx, _memory_event_rx) =
-        tokio::sync::broadcast::channel(memory_event_capacity);
+    let control_event_capacity = control_event_capacity.max(1);
+    let memory_event_capacity = memory_event_capacity.max(1);
+    let (event_tx, _event_rx) = tokio::sync::broadcast::channel(control_event_capacity);
+    let (memory_event_tx, _memory_event_rx) =
+        tokio::sync::broadcast::channel(memory_event_capacity);
     (event_tx, memory_event_tx)
 }
```

Does `tokio::sync::broadcast::channel(0)` panic? Please check the Tokio docs “Panics” section for `broadcast::channel`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib.rs` around lines 267 - 277, The function create_process_event_buses_with_capacity currently calls tokio::sync::broadcast::channel with unvalidated capacities which will panic for 0; before calling channel, clamp control_event_capacity and memory_event_capacity to a minimum of 1 (e.g., let control_capacity = control_event_capacity.max(1); let memory_capacity = memory_event_capacity.max(1);) and use those local values when creating event_tx and memory_event_tx so the public helper never passes 0 into broadcast::channel.
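The clamping itself is plain arithmetic and can be checked in isolation. A minimal stdlib sketch of the non-breaking guard (the helper name is illustrative; the real helper constructs tokio broadcast channels with these values):

```rust
// Clamp caller-supplied bus capacities to at least 1, since a capacity of 0
// would panic inside tokio's broadcast::channel constructor.
fn clamped_capacities(control_event_capacity: usize, memory_event_capacity: usize) -> (usize, usize) {
    (control_event_capacity.max(1), memory_event_capacity.max(1))
}

fn main() {
    // Zero inputs are lifted to the minimum safe capacity.
    assert_eq!(clamped_capacities(0, 0), (1, 1));
    // Valid inputs pass through unchanged, so the change is non-breaking.
    assert_eq!(clamped_capacities(256, 64), (256, 64));
}
```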
🧹 Nitpick comments (8)
docs/design-docs/cortex-chat.md (1)
119-120: Consider specifying the exact 409 response contract.

Please document whether the `409` includes a JSON error body and/or `Retry-After` header so clients can implement deterministic retry behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design-docs/cortex-chat.md` around lines 119 - 120, Clarify the 409 CONFLICT contract by specifying both the JSON error body and Retry-After header format: state that the endpoint returns HTTP 409 with a JSON payload containing standardized fields (e.g., an error code like "session_busy" and a human-readable message plus an optional numeric retry_after_seconds) and also sets a Retry-After header with the retry interval in seconds; update the sentence about "409 CONFLICT" to include these exact field names and header semantics so clients can implement deterministic retry behavior.prompts/en/fragments/worker_capabilities.md.j2 (1)
47-48: Clarify “default” language to avoid ambiguity.

Line 48 is useful, but paired with Line 7 (“Builtin Workers (default)”) it can sound like two competing defaults. Consider making scope explicit (global vs code-task default).
Suggested wording
```diff
-Default to OpenCode for code work when it's enabled. Use builtin workers for short command/file tasks.
+Default to builtin workers for quick tasks. For code work, default to OpenCode when enabled.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@prompts/en/fragments/worker_capabilities.md.j2` around lines 47 - 48, The current phrasing creates two competing "defaults" between the phrase "Default to OpenCode for code work when it's enabled" and the header "Builtin Workers (default)"; update the copy to make the scope explicit by stating that "OpenCode" is the default for code-related tasks when enabled while "Builtin Workers" remain the default for short command/file tasks (or for global/non-code operations). Locate and edit the strings "Default to OpenCode for code work when it's enabled" and "Builtin Workers (default)" in worker_capabilities.md.j2 so each specifies its scope (e.g., "default for code tasks" vs "global/default for short tasks") to remove ambiguity.prompts/en/worker.md.j2 (1)
39-39: Clarify this capability in both passthrough branches.

Line 39 documents unrestricted `file` access only when sandbox is fully disabled, but Line 32 already defines another passthrough mode (enabled-in-config, unsupported backend). If `file` behaves the same there, mirror this sentence in that branch too so worker behavior doesn’t drift by host/backend.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@prompts/en/worker.md.j2` at line 39, Update the documentation in prompts/en/worker.md.j2 to clarify that the `file` tool can access any path readable by the process in both passthrough branches: add the sentence "The `file` tool can access any path readable by the process (no workspace restriction)." to the branch describing the fully disabled sandbox and mirror that exact sentence into the other passthrough branch that covers the enabled-in-config/unsupported-backend case so both branches state identical `file` access behavior.src/agent/status.rs (1)
322-332: Add symmetric test coverage for `remove_worker`.

Nice branch-removal coverage; I’d add the equivalent worker-removal test so both new public removal paths are protected.
Proposed test addition
```diff
 #[test]
 fn remove_branch_removes_existing_branch() {
     let mut status = StatusBlock::new();
     let branch_id = Uuid::new_v4();
     status.add_branch(branch_id, "work");
     let removed = status.remove_branch(branch_id);
     assert!(removed);
     assert!(status.active_branches.is_empty());
 }
+
+#[test]
+fn remove_worker_removes_existing_worker() {
+    let mut status = StatusBlock::new();
+    let worker_id = Uuid::new_v4();
+    status.add_worker(worker_id, "task", false);
+
+    let removed = status.remove_worker(worker_id);
+
+    assert!(removed);
+    assert!(status.active_workers.is_empty());
+}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agent/status.rs` around lines 322 - 332, Add a symmetric unit test for remove_worker similar to remove_branch_removes_existing_branch: create a StatusBlock via StatusBlock::new(), add a worker id using add_worker (use Uuid::new_v4()), call remove_worker(worker_id), assert the call returns true and that status.active_workers is empty; name it something like remove_worker_removes_existing_worker so both public removal paths (remove_branch and remove_worker) are covered.src/memory/store.rs (1)
255-265: Add focused test coverage for the new association cleanup helper.
`delete_associations_for_memory` is a deletion primitive; please add a unit test asserting both incoming and outgoing associations are removed and `rows_affected` matches expectation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/memory/store.rs` around lines 255 - 265, Add a focused unit test for delete_associations_for_memory that sets up associations referencing a given memory id both as source_id and as target_id, calls MemoryStore::delete_associations_for_memory(memory_id), asserts the returned rows_affected equals the number of inserted rows, and verifies no rows remain in the associations table for that memory id; use the existing test DB harness or a transaction/fixture to insert the test associations, clean up after the test, and reference the delete_associations_for_memory function, associations table, and rows_affected in your assertions.src/hooks/spacebot.rs (1)
435-438: Remove duplicate comment.

Lines 435-438 contain a duplicate comment block that repeats “Send event without blocking. Truncate args to keep broadcast payloads bounded.”
Proposed fix
```diff
-    // Send event without blocking. Truncate args to keep broadcast payloads bounded.
     // Send event without blocking. Truncate args to keep broadcast payloads bounded.
     let capped_args = crate::tools::truncate_output(args, 2_000);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/hooks/spacebot.rs` around lines 435 - 438, Duplicate comment lines were introduced above the creation of capped_args using crate::tools::truncate_output; remove the redundant comment so only a single descriptive line remains. Locate the block where capped_args is assigned (the call to truncate_output) and delete the extra repeated comment text "Send event without blocking. Truncate args to keep broadcast payloads bounded" leaving one occurrence directly above the capped_args assignment.src/config.rs (2)
2669-2672: Avoid silently dropping parse errors in the unknown-key pre-pass.

Returning on parse failure here is understandable, but adding a debug log would preserve diagnostics without changing behavior.
♻️ Suggested tweak
```diff
 fn warn_unknown_config_keys(content: &str) {
     let table: toml::Table = match content.parse() {
-        Ok(t) => t,
-        Err(_) => return, // the typed parse will report the real error
+        Ok(table) => table,
+        Err(error) => {
+            tracing::debug!(
+                %error,
+                "skipping unknown-top-level-key warning pass due to TOML parse error"
+            );
+            return; // the typed parse will report the real error
+        }
     };
```

As per coding guidelines: “Don't silently discard errors; use `let _ =` only on channel sends where the receiver may be dropped; handle, log, or propagate all other errors.”
Verify each finding against the current code and only fix it if needed. In `@src/config.rs` around lines 2669 - 2672, The current pre-pass silently returns on toml parse failure (the match around content.parse() that assigns table: toml::Table), losing diagnostics; modify that match to capture the Err(e) and emit a debug log containing the parsing error and context (e.g., the filename/variable) before returning, e.g., match content.parse() { Ok(t) => t, Err(e) => { debug!("...: {}", e); return } }, so behavior stays the same but diagnostics are preserved; update the match handling around content.parse() and the table: toml::Table binding accordingly.
2650-2663: `KNOWN_TOP_LEVEL_KEYS` is easy to drift from `TomlConfig` over time.

Consider adding a small guard test that fails when a new top-level TOML field is added without updating this list, to avoid false warnings during future config evolution.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/config.rs` around lines 2650 - 2663, Add a unit test that serializes the TomlConfig struct to a toml::Value (or serde JSON value) and compares the top-level field names against the KNOWN_TOP_LEVEL_KEYS constant; if the sets differ the test should fail, forcing updates to KNOWN_TOP_LEVEL_KEYS whenever TomlConfig changes. Locate TomlConfig and KNOWN_TOP_LEVEL_KEYS in src/config.rs, derive/ensure serde Serialize on TomlConfig (or construct a representative instance), collect the top-level keys from the serialized value, and assert equality with KNOWN_TOP_LEVEL_KEYS so the CI catches any drift.
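The guard test's core is just a set difference. A stdlib-only sketch, assuming the key names and the `unknown_keys` helper are illustrative (a real test would derive `actual` by serializing `TomlConfig` and `known` from `KNOWN_TOP_LEVEL_KEYS`):

```rust
use std::collections::BTreeSet;

// Report any top-level key present in the serialized config but missing from
// the allow-list; the guard test asserts this comes back empty.
fn unknown_keys(known: &[&str], actual: &[&str]) -> Vec<String> {
    let known: BTreeSet<&str> = known.iter().copied().collect();
    actual
        .iter()
        .filter(|key| !known.contains(*key))
        .map(|key| key.to_string())
        .collect()
}

fn main() {
    // Hypothetical key names for illustration only.
    let known_top_level_keys = ["agents", "cortex", "memory"];
    let serialized_keys = ["agents", "cortex", "memory", "maintenance"];

    // A field added to the config but not to the allow-list trips the guard.
    assert_eq!(
        unknown_keys(&known_top_level_keys, &serialized_keys),
        vec!["maintenance".to_string()]
    );
}
```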
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/agent/channel_dispatch.rs`:
- Around line 620-633: Panic payload text is logged and returned without
secret-scrubbing; pass the constructed panic_message through the same secret
scrubber used for non-panic errors before emitting it. Locate the panic branch
that builds panic_message (references: panic_payload, panic_message,
tracing::error, WorkerCompletionError::failed) and replace usages with the
scrubbed version (e.g., let scrubbed = scrubber.scrub(&panic_message)) — use the
scrubbed string for the tracing::error fields and for the
WorkerCompletionError::failed payload so sensitive data is removed consistently.
In `@src/agent/channel.rs`:
- Around line 1752-1762: The early return when
self.state.active_workers.write().await.remove(worker_id) is None skips
necessary completion work; change the logic so removal result is checked but do
not return early — always call run_logger.log_worker_completed(*worker_id,
result, *success) and always attempt to remove entries from
self.state.worker_handles and self.state.worker_inputs (and perform any
retrigger/enqueue logic) even if the worker was not present in active_workers;
keep the drop(workers) before logging to avoid holding the write lock while
calling run_logger, and ensure removals are idempotent (i.e., ignore missing
entries) so repeated/late WorkerComplete events are safe.
In `@src/agent/worker.rs`:
- Around line 387-395: The code directly assigns self.state =
WorkerState::Failed and self.state = WorkerState::WaitingForInput; replace these
direct assignments with the state-machine helpers: call
self.can_transition_to(...) (using matches! on current state as the other
transitions do) and then self.transition_to(...) so transitions are validated;
if can_transition_to returns false, propagate or surface a runtime error (panic
or return Err) consistent with other uses of transition_to()/can_transition_to()
to enforce the state-machine rules for WorkerState transitions instead of
mutating self.state directly.
In `@src/api/settings.rs`:
- Around line 343-353: The current iterator over runtime_configs uses filter_map
and silently drops agents missing in mcp_managers (symbols: runtime_configs,
mcp_managers, reload_targets, agent_id), so change the logic to explicitly
handle the missing case: iterate over runtime_configs with .map or .filter_map
that first checks mcp_managers.get(agent_id) and if missing emit a warning via
the logger (or return an Err/Result) so the missing agent is not discarded
silently; only collect the (agent_id.clone(), runtime_config.clone(),
mcp_manager.clone()) tuples when a manager exists, and ensure you log the
agent_id and reason when skipping so the omission is observable.
In `@src/config.rs`:
- Around line 5763-5764: The hot-swap of the ArcSwap field opencode_server_pool
should explicitly call shutdown_all() on the old OpenCodeServerPool before
replacing it: load the current pool via
self.opencode_server_pool.load().as_ref().clone() to capture the old Arc,
construct the new OpenCodeServerPool, call
self.opencode_server_pool.store(Arc::new(new_pool)), then await
old_pool.shutdown_all() to drain and kill servers instead of relying on Drop;
apply this pattern wherever opencode_server_pool is replaced (e.g., in the
blocks around the existing replacements).
In `@src/conversation/history.rs`:
- Around line 375-397: The DB update in reconcile_running_workers_for_agent and
the awaited worker-log persistence calls inside ProcessRunLogger must be
converted to fire-and-forget tasks: for reconcile_running_workers_for_agent (and
any ProcessRunLogger methods that currently await .execute against self.pool)
move the SQL query + .execute call into a tokio::spawn(async move { ... }) so
the caller is not held by SQLite latency; ensure the closure clones/owns the
pool and the failure_message/agent_id (or other params), await the query inside
the spawned task and handle/map errors there (e.g., processLogger.error or
tracing::error) rather than propagating them, and remove the original .await
from the caller path (or replace the call with spawning the task). Ensure
signatures that must return sync results are adjusted to not await DB writes or
provide a separate non-blocking helper that spawns the task.
In `@src/main.rs`:
- Around line 1662-1680: The spawned task that runs channel.run() cleans the
registry and api_state (process_control_registry.unregister_channel,
api_state_for_cleanup.unregister_channel_status, unregister_channel_state) but
never evicts the corresponding entry from active_channels, so the first inbound
message after exit is dropped; fix by creating a mpsc channel pair during main
loop setup (closed_channel_tx / closed_channel_rx via mpsc::channel::<String>)
and, after channel.run() returns, send the cleanup_channel_id over
closed_channel_tx (from the spawned task) so the main loop's tokio::select!
branch that matches Some(closed_id) = closed_channel_rx.recv() can call
active_channels.remove(&closed_id) to immediately evict the stale entry; ensure
closed_channel_tx is cloned into the spawn and handle send failures gracefully.
In `@src/memory/maintenance.rs`:
- Around line 251-270: The active_survivor snapshot becomes stale after calling
merge_pair which persists merged content; to fix, reload the updated survivor
from the persistent store (or otherwise refresh its snapshot) immediately after
merge_pair completes and before any further merging decisions—i.e., after
merge_pair(...) .await? and after merged_memory_ids/merged_count updates,
replace active_survivor with the freshly fetched memory for winner.id (or
recompute its embeddings/text) so subsequent iterations use the persisted,
up-to-date content; apply the same change in the other loop region (lines
~311-379) that uses active_survivor, and reference choose_merge_pair,
merge_pair, active_survivor, winner, loser, and source_id when locating the code
to modify.
In `@src/opencode/worker.rs`:
- Around line 344-349: The logging currently slices the detected secret bytes
unsafely using &leak[..leak.len().min(8)], which can panic on UTF‑8 multi‑byte
boundaries; replace both occurrences with a char‑safe truncation so the prefix
is taken by characters rather than bytes (e.g., build a prefix with
leak.chars().take(8).collect::<String>() or compute a safe byte index via
leak.char_indices() and use that). Update the calls around scan_for_leaks(), the
local variable leak, and the tracing::warn/tracing::info invocations to use the
new char-safe prefix value.
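The byte-vs-char distinction behind this fix can be shown with a stdlib-only sketch (the `char_safe_prefix` helper name is illustrative):

```rust
// Char-safe prefix for log redaction: `&leak[..n]` slices by byte offset and
// panics when `n` lands inside a multi-byte UTF-8 character, while collecting
// the first `max_chars` chars is always safe.
fn char_safe_prefix(leak: &str, max_chars: usize) -> String {
    leak.chars().take(max_chars).collect()
}

fn main() {
    // Byte index 8 falls inside the two-byte 'é' here, so `&leak[..8]` would panic.
    let leak = "token-sécret-value";
    assert!(!leak.is_char_boundary(8));
    assert_eq!(char_safe_prefix(leak, 8), "token-sé");
}
```

The original `&leak[..leak.len().min(8)]` only happens to work when the first eight bytes are ASCII; any multi-byte character straddling the cut triggers the panic.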
In `@src/tools/file.rs`:
- Around line 216-232: The protection for identity/memory files should be
unconditional: remove the sandbox-mode conditional around the protected-file
check and always enforce the guard in the file write path. Locate the block that
calls self.sandbox.mode_enabled() and the PROTECTED_FILES constant (and the
file_name extraction via path.file_name().and_then(|n| n.to_str())), and change
the logic so that PROTECTED_FILES.iter().any(...) immediately returns the
FileError directing the caller to use the identity management API regardless of
sandbox state; keep the existing error message and FileError return shape.
In `@src/tools/reply.rs`:
- Around line 375-384: The trace currently logs a substring of the detected
secret from scan_for_leaks (variable leak) via leak_prefix, which exposes
sensitive data; update the logging in the block that calls
crate::secrets::scrub::scan_for_leaks to never include any characters of the
leak string—use a non-sensitive indicator such as leak_len = leak.len() or a
boolean leak_detected = true (and keep conversation_id) in the tracing::error
call, then return the same ReplyError; modify the code around the leak variable
and the tracing::error invocation in src/tools/reply.rs to remove any substring
access (e.g., &leak[..]) and log only length or detection status instead.
---
Duplicate comments:
In `@docs/design-docs/cortex-implementation.md`:
- Around line 19-21: The document currently contradicts itself by saying the
maintenance functions are "Implemented but never called" while also stating they
are "wired into the cortex loop"; update the wording to a single clear state:
either mark memory/maintenance.rs and its functions apply_decay(),
prune_memories(), and merge_similar_memories() as "Implemented and wired into
the cortex loop" or as "Implemented but not yet wired into the cortex loop"
(choose the true runtime state), and make the cortex loop description consistent
with that choice so there is no ambiguity.
- Around line 86-89: Two fenced code blocks in the document are unlabeled and
trigger MD040; add a language identifier (e.g., "text") to each opening fence
for the blocks containing the lines "0 active -> 1 completing -> 3 terminal / 0
active -> 2 killing -> 3 terminal" and the block starting with "Phase 1 (tick
loop) — standalone, just plumbing" so the fences become ```text and satisfy
the linter.
In `@src/agent/process_control.rs`:
- Around line 102-112: The code unconditionally calls
self.channels.write().await.remove(channel_id) when Weak::upgrade() fails, which
can delete a newly-registered handle; change the cleanup to be race-safe by
taking the write lock and removing the entry only if the stored Weak pointer is
the same stale Weak you originally looked up (or use the HashMap entry API to
compare-and-remove): locate the code around
channels.read().await.get(channel_id).cloned() and the subsequent upgrade()
check in process_control.rs and replace the unconditional remove(channel_id)
with a conditional remove that verifies pointer identity of the Weak handle (or
re-check the map entry equals the cloned Weak) before removing, preserving
correct behavior for cancel_worker_with_reason and ControlActionResult paths.
In `@src/api/config.rs`:
- Around line 621-632: The maintenance parameters are written into the TOML
without validation; add range checks before mutating table: for
maintenance_decay_rate, maintenance_prune_threshold, and
maintenance_merge_similarity_threshold (the cortex fields referenced) ensure
their values are within 0.0..=1.0 and for maintenance_min_age_days ensure value
is >= 0; if a value is out of bounds, return an error (or propagate a validation
error) instead of setting table[...] — perform these checks immediately before
the existing assignments that set table["maintenance_decay_rate"],
table["maintenance_prune_threshold"], table["maintenance_min_age_days"], and
table["maintenance_merge_similarity_threshold"] so invalid inputs are rejected
prior to TOML mutation.
- Around line 603-605: The code is performing lossy casts like
`cortex.supervisor_kill_budget_per_tick as i64` (and similarly
`maintenance_interval_secs as i64`) which can wrap negative for oversized
inputs; change these assignments to validate and convert safely using
`i64::try_from(...)` or `usize::try_into::<i64>()` and handle the Err case
(return an error, skip persisting with a logged warning, or clamp to a safe max)
instead of an unchecked `as` cast so only representable values are written into
`table["supervisor_kill_budget_per_tick"]` and
`table["maintenance_interval_secs"]`. Ensure you update the code paths around
the `table[...] = toml_edit::value(...)` assignments (references:
cortex.supervisor_kill_budget_per_tick, maintenance_interval_secs) to perform
the checked conversion and appropriate error handling or fallback.
In `@src/lib.rs`:
- Around line 267-277: The function create_process_event_buses_with_capacity
currently calls tokio::sync::broadcast::channel with unvalidated capacities
which will panic for 0; before calling channel, clamp control_event_capacity and
memory_event_capacity to a minimum of 1 (e.g., let control_capacity =
control_event_capacity.max(1); let memory_capacity =
memory_event_capacity.max(1);) and use those local values when creating event_tx
and memory_event_tx so the public helper never passes 0 into broadcast::channel.
In `@src/memory/maintenance.rs`:
- Around line 324-377: The merge sequence mutates survivor content and rewrites
associations via memory_store.update, get_associations,
delete_associations_for_memory, create_association, creates an Updates
association, calls memory_store.forget and embedding_table.delete with
cancellation checkpoints between them; wrap all of these operations in a single
storage transaction (use whatever transaction primitive exists on MemoryStore —
e.g., begin_transaction/transaction/tx or sqlx::Transaction if present) so the
update + rewire + create Updates + forget + embedding delete are executed
atomically, remove intermediate maintenance_cancelable_op awaits inside the
transactional block (perform a single cancellable check before starting and
ensure cancellation causes a rollback), and on error ensure you rollback the
transaction and only after commit run any non-transactional cleanup (or queue
embedding_table.delete inside the same transaction if supported) so partial
merges cannot persist; reference functions: memory_store.update,
memory_store.get_associations, memory_store.delete_associations_for_memory,
memory_store.create_association, memory_store.forget, embedding_table.delete,
Association::new, maintenance_cancelable_op, check_maintenance_cancellation.
In `@tests/maintenance.rs`:
- Around line 8-11: The temporary directory created inside
make_memory_maintenance_fixture is dropped when the function returns, leaving
embedding_table referencing deleted storage; fix this by extending the TempDir
lifetime (e.g., include the TempDir in the fixture's return value or wrap it in
an Arc and return it alongside std::sync::Arc<MemoryStore> and
spacebot::memory::EmbeddingTable) so the TempDir is kept alive for the entire
test; update all call sites and other fixtures mentioned (the other occurrences
around where TempDir is created) to accept/propagate the extra TempDir holder to
prevent use-after-drop of embedding_table storage.
---
Nitpick comments:
In `@docs/design-docs/cortex-chat.md`:
- Around line 119-120: Clarify the 409 CONFLICT contract by specifying both the
JSON error body and Retry-After header format: state that the endpoint returns
HTTP 409 with a JSON payload containing standardized fields (e.g., an error code
like "session_busy" and a human-readable message plus an optional numeric
retry_after_seconds) and also sets a Retry-After header with the retry interval
in seconds; update the sentence about "409 CONFLICT" to include these exact
field names and header semantics so clients can implement deterministic retry
behavior.
In `@prompts/en/fragments/worker_capabilities.md.j2`:
- Around line 47-48: The current phrasing creates two competing "defaults"
between the phrase "Default to OpenCode for code work when it's enabled" and the
header "Builtin Workers (default)"; update the copy to make the scope explicit
by stating that "OpenCode" is the default for code-related tasks when enabled
while "Builtin Workers" remain the default for short command/file tasks (or for
global/non-code operations). Locate and edit the strings "Default to OpenCode
for code work when it's enabled" and "Builtin Workers (default)" in
worker_capabilities.md.j2 so each specifies its scope (e.g., "default for code
tasks" vs "global/default for short tasks") to remove ambiguity.
In `@prompts/en/worker.md.j2`:
- Line 39: Update the documentation in prompts/en/worker.md.j2 to clarify that
the `file` tool can access any path readable by the process in both passthrough
branches: add the sentence "The `file` tool can access any path readable by the
process (no workspace restriction)." to the branch describing the fully disabled
sandbox and mirror that exact sentence into the other passthrough branch that
covers the enabled-in-config/unsupported-backend case so both branches state
identical `file` access behavior.
In `@src/agent/status.rs`:
- Around line 322-332: Add a symmetric unit test for remove_worker similar to
remove_branch_removes_existing_branch: create a StatusBlock via
StatusBlock::new(), add a worker id using add_worker (use Uuid::new_v4()), call
remove_worker(worker_id), assert the call returns true and that
status.active_workers is empty; name it something like
remove_worker_removes_existing_worker so both public removal paths
(remove_branch and remove_worker) are covered.
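A minimal sketch of the suggested test, using a stand-in `StatusBlock` with integer ids in place of the real struct and `Uuid`:

```rust
use std::collections::HashSet;

/// Minimal stand-in for the real `StatusBlock`; only the pieces the
/// suggested test touches are modeled here.
#[derive(Default)]
struct StatusBlock {
    active_workers: HashSet<u64>, // the real type stores `Uuid`s
}

impl StatusBlock {
    fn new() -> Self {
        Self::default()
    }

    fn add_worker(&mut self, worker_id: u64) {
        self.active_workers.insert(worker_id);
    }

    /// Returns true when the worker was present and removed.
    fn remove_worker(&mut self, worker_id: u64) -> bool {
        self.active_workers.remove(&worker_id)
    }
}

fn remove_worker_removes_existing_worker() -> bool {
    let mut status = StatusBlock::new();
    let worker_id = 42;
    status.add_worker(worker_id);
    // both conditions the suggested test asserts
    status.remove_worker(worker_id) && status.active_workers.is_empty()
}
```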
In `@src/config.rs`:
- Around line 2669-2672: The current pre-pass silently returns on toml parse
failure (the match around content.parse() that assigns table: toml::Table),
losing diagnostics; modify that match to capture the Err(e) and emit a debug log
containing the parsing error and context (e.g., the filename/variable) before
returning, e.g., match content.parse() { Ok(t) => t, Err(e) => { debug!("...:
{}", e); return } }, so behavior stays the same but diagnostics are preserved;
update the match handling around content.parse() and the table: toml::Table
binding accordingly.
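The suggested pattern can be sketched with stand-ins (`i32` parsing in place of `toml`, `eprintln!` in place of `tracing::debug!`) — behavior stays an early return, but the error is recorded first:

```rust
/// Stand-in for `tracing::debug!`.
fn log_debug(message: &str) {
    eprintln!("DEBUG: {message}");
}

fn parse_or_log(content: &str, source_name: &str) -> Option<i32> {
    let value = match content.parse::<i32>() {
        Ok(value) => value,
        Err(error) => {
            // Behavior is unchanged (we still bail out), but the
            // diagnostic is no longer silently discarded.
            log_debug(&format!("skipping {source_name}: parse failed: {error}"));
            return None;
        }
    };
    Some(value)
}
```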
- Around line 2650-2663: Add a unit test that serializes the TomlConfig struct
to a toml::Value (or serde JSON value) and compares the top-level field names
against the KNOWN_TOP_LEVEL_KEYS constant; if the sets differ the test should
fail, forcing updates to KNOWN_TOP_LEVEL_KEYS whenever TomlConfig changes.
Locate TomlConfig and KNOWN_TOP_LEVEL_KEYS in src/config.rs, derive/ensure serde
Serialize on TomlConfig (or construct a representative instance), collect the
top-level keys from the serialized value, and assert equality with
KNOWN_TOP_LEVEL_KEYS so the CI catches any drift.
In `@src/hooks/spacebot.rs`:
- Around line 435-438: Duplicate comment lines were introduced above the
creation of capped_args using crate::tools::truncate_output; remove the
redundant comment so only a single descriptive line remains. Locate the block
where capped_args is assigned (the call to truncate_output) and delete the extra
repeated comment text "Send event without blocking. Truncate args to keep
broadcast payloads bounded" leaving one occurrence directly above the
capped_args assignment.
In `@src/memory/store.rs`:
- Around line 255-265: Add a focused unit test for
delete_associations_for_memory that sets up associations referencing a given
memory id both as source_id and as target_id, calls
MemoryStore::delete_associations_for_memory(memory_id), asserts the returned
rows_affected equals the number of inserted rows, and verifies no rows remain in
the associations table for that memory id; use the existing test DB harness or a
transaction/fixture to insert the test associations, clean up after the test,
and reference the delete_associations_for_memory function, associations table,
and rows_affected in your assertions.
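The suggested test shape, sketched against an in-memory stand-in rather than the real SQL-backed `MemoryStore` — associations referencing the memory as source or target are inserted, deleted, and counted:

```rust
/// In-memory stand-in for the associations table.
struct Association {
    source_id: u32,
    target_id: u32,
}

struct MemoryStore {
    associations: Vec<Association>,
}

impl MemoryStore {
    /// Mirrors the real API's shape: deletes rows where the memory
    /// appears as either source or target, returning rows_affected.
    fn delete_associations_for_memory(&mut self, memory_id: u32) -> usize {
        let before = self.associations.len();
        self.associations
            .retain(|a| a.source_id != memory_id && a.target_id != memory_id);
        before - self.associations.len()
    }
}
```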
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8217132e-dad6-485b-90f6-645b4192be41
⛔ Files ignored due to path filters (15)
- `Cargo.lock` is excluded by `!**/*.lock`
- `Cargo.toml` is excluded by `!**/*.toml`
- `vendor/imap-proto-0.10.2/Cargo.toml` is excluded by `!**/*.toml`, `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/body.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/body_structure.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/builders/command.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/builders/mod.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/core.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/lib.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/macros.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/parser/mod.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/parser/rfc3501.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/parser/rfc4551.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/parser/rfc5464.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/types.rs` is excluded by `!**/vendor/**`
📒 Files selected for processing (64)
`README.md`, `docs/content/docs/(configuration)/config.mdx`, `docs/content/docs/(configuration)/sandbox.mdx`, `docs/content/docs/(core)/architecture.mdx`, `docs/content/docs/(features)/workers.mdx`, `docs/design-docs/cortex-chat.md`, `docs/design-docs/cortex-history.md`, `docs/design-docs/cortex-implementation.md`, `docs/design-docs/tool-nudging.md`, `prompts/en/channel.md.j2`, `prompts/en/fragments/worker_capabilities.md.j2`, `prompts/en/tools/spawn_worker_description.md.j2`, `prompts/en/worker.md.j2`, `src/agent.rs`, `src/agent/branch.rs`, `src/agent/channel.rs`, `src/agent/channel_dispatch.rs`, `src/agent/channel_history.rs`, `src/agent/compactor.rs`, `src/agent/cortex.rs`, `src/agent/cortex_chat.rs`, `src/agent/ingestion.rs`, `src/agent/invariant_harness.rs`, `src/agent/process_control.rs`, `src/agent/status.rs`, `src/agent/worker.rs`, `src/api/agents.rs`, `src/api/channels.rs`, `src/api/config.rs`, `src/api/cortex.rs`, `src/api/server.rs`, `src/api/settings.rs`, `src/api/state.rs`, `src/api/system.rs`, `src/config.rs`, `src/conversation/channels.rs`, `src/conversation/history.rs`, `src/hooks.rs`, `src/hooks/spacebot.rs`, `src/lib.rs`, `src/llm/model.rs`, `src/main.rs`, `src/memory/maintenance.rs`, `src/memory/store.rs`, `src/messaging/discord.rs`, `src/opencode/types.rs`, `src/opencode/worker.rs`, `src/sandbox.rs`, `src/telemetry/registry.rs`, `src/tools.rs`, `src/tools/cancel.rs`, `src/tools/exec.rs`, `src/tools/file.rs`, `src/tools/memory_save.rs`, `src/tools/reply.rs`, `src/tools/send_file.rs`, `src/tools/shell.rs`, `src/tools/spawn_worker.rs`, `src/tools/worker_inspect.rs`, `tests/bulletin.rs`, `tests/context_dump.rs`, `tests/detached_worker_bootstrap.rs`, `tests/maintenance.rs`, `tests/tool_nudge.rs`
✅ Files skipped from review due to trivial changes (1)
- docs/design-docs/tool-nudging.md
🚧 Files skipped from review as they are similar to previous changes (3)
- tests/bulletin.rs
- docs/design-docs/cortex-history.md
- docs/content/docs/(configuration)/config.mdx
```rust
Err(panic_payload) => {
    let panic_message = panic_payload
        .downcast_ref::<&str>()
        .map(|message| (*message).to_string())
        .or_else(|| panic_payload.downcast_ref::<String>().cloned())
        .unwrap_or_else(|| "unknown panic payload".to_string());
    tracing::error!(
        worker_id = %worker_id,
        panic_message = %panic_message,
        "worker task panicked"
    );
    Err(WorkerCompletionError::failed(format!(
        "worker task panicked: {panic_message}"
    )))
```
Scrub panic payload text before emitting worker completion result.
The panic branch builds panic_message and sends it downstream without passing through the secret scrubber, unlike the non-panic failure path.
🔒 Suggested fix
```diff
 Err(panic_payload) => {
     let panic_message = panic_payload
         .downcast_ref::<&str>()
         .map(|message| (*message).to_string())
         .or_else(|| panic_payload.downcast_ref::<String>().cloned())
         .unwrap_or_else(|| "unknown panic payload".to_string());
+    let panic_message = if let Some(store) = &secrets_store {
+        crate::secrets::scrub::scrub_with_store(&panic_message, store)
+    } else {
+        panic_message
+    };
     tracing::error!(
         worker_id = %worker_id,
         panic_message = %panic_message,
         "worker task panicked"
     );
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
Err(panic_payload) => {
    let panic_message = panic_payload
        .downcast_ref::<&str>()
        .map(|message| (*message).to_string())
        .or_else(|| panic_payload.downcast_ref::<String>().cloned())
        .unwrap_or_else(|| "unknown panic payload".to_string());
    let panic_message = if let Some(store) = &secrets_store {
        crate::secrets::scrub::scrub_with_store(&panic_message, store)
    } else {
        panic_message
    };
    tracing::error!(
        worker_id = %worker_id,
        panic_message = %panic_message,
        "worker task panicked"
    );
    Err(WorkerCompletionError::failed(format!(
        "worker task panicked: {panic_message}"
    )))
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/agent/channel_dispatch.rs` around lines 620 - 633, Panic payload text is
logged and returned without secret-scrubbing; pass the constructed panic_message
through the same secret scrubber used for non-panic errors before emitting it.
Locate the panic branch that builds panic_message (references: panic_payload,
panic_message, tracing::error, WorkerCompletionError::failed) and replace usages
with the scrubbed version (e.g., let scrubbed = scrubber.scrub(&panic_message))
— use the scrubbed string for the tracing::error fields and for the
WorkerCompletionError::failed payload so sensitive data is removed consistently.
```diff
 let mut workers = self.state.active_workers.write().await;
-workers.remove(worker_id);
+if workers.remove(worker_id).is_none() {
+    return Ok(());
+}
+drop(workers);

 run_logger.log_worker_completed(*worker_id, result, *success);

 self.state.worker_handles.write().await.remove(worker_id);
 self.state.worker_inputs.write().await.remove(worker_id);
```
WorkerComplete can be dropped before convergence/cleanup.
Line 1753 returns early when active_workers has no entry, which skips the rest of completion handling (run logging, retrigger enqueue, and handle/input cleanup) for that event.
🛠️ Suggested fix
```diff
-let mut workers = self.state.active_workers.write().await;
-if workers.remove(worker_id).is_none() {
-    return Ok(());
-}
-drop(workers);
+let removed_worker = self.state.active_workers.write().await.remove(worker_id).is_some();
+let removed_handle = self.state.worker_handles.write().await.remove(worker_id).is_some();
+let removed_input = self.state.worker_inputs.write().await.remove(worker_id).is_some();
+if !(removed_worker || removed_handle || removed_input) {
+    return Ok(());
+}
 run_logger.log_worker_completed(*worker_id, result, *success);
-self.state.worker_handles.write().await.remove(worker_id);
-self.state.worker_inputs.write().await.remove(worker_id);
```

Based on learnings: For changes in async/stateful paths (worker lifecycle, cancellation, retrigger, recall cache behavior), include explicit race/terminal-state reasoning in the PR summary and run targeted tests in addition to just gate-pr.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/agent/channel.rs` around lines 1752 - 1762, The early return when
self.state.active_workers.write().await.remove(worker_id) is None skips
necessary completion work; change the logic so removal result is checked but do
not return early — always call run_logger.log_worker_completed(*worker_id,
result, *success) and always attempt to remove entries from
self.state.worker_handles and self.state.worker_inputs (and perform any
retrigger/enqueue logic) even if the worker was not present in active_workers;
keep the drop(workers) before logging to avoid holding the write lock while
calling run_logger, and ensure removals are idempotent (i.e., ignore missing
entries) so repeated/late WorkerComplete events are safe.
```rust
self.state = WorkerState::Failed;
self.hook.send_status("failed");
follow_up_failure = Some(failure_reason);
break;
}

self.state = WorkerState::WaitingForInput;
self.hook.send_status("waiting for input");
}
```
🛠️ Refactor suggestion | 🟠 Major
Use validated state transitions in the new follow-up path.
Line 387 and Line 393 assign self.state directly, bypassing transition_to() and can_transition_to() checks.
♻️ Suggested change
```diff
-self.state = WorkerState::Failed;
+self.transition_to(WorkerState::Failed)?;
 self.hook.send_status("failed");
 follow_up_failure = Some(failure_reason);
 break;
 }
-self.state = WorkerState::WaitingForInput;
+self.transition_to(WorkerState::WaitingForInput)?;
 self.hook.send_status("waiting for input");
```

As per coding guidelines: Worker state machine must validate transitions with can_transition_to() using matches!; illegal transitions are runtime errors, not silent bugs.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
self.transition_to(WorkerState::Failed)?;
self.hook.send_status("failed");
follow_up_failure = Some(failure_reason);
break;
}
self.transition_to(WorkerState::WaitingForInput)?;
self.hook.send_status("waiting for input");
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/agent/worker.rs` around lines 387 - 395, The code directly assigns
self.state = WorkerState::Failed and self.state = WorkerState::WaitingForInput;
replace these direct assignments with the state-machine helpers: call
self.can_transition_to(...) (using matches! on current state as the other
transitions do) and then self.transition_to(...) so transitions are validated;
if can_transition_to returns false, propagate or surface a runtime error (panic
or return Err) consistent with other uses of transition_to()/can_transition_to()
to enforce the state-machine rules for WorkerState transitions instead of
mutating self.state directly.
```rust
let reload_targets = runtime_configs
    .iter()
    .filter_map(|(agent_id, runtime_config)| {
        mcp_managers.get(agent_id).map(|mcp_manager| {
            (
                agent_id.clone(),
                runtime_config.clone(),
                mcp_manager.clone(),
            )
        })
    })
```
Config reload currently skips some agents silently.
The filter_map drops agents that exist in runtime_configs but not in mcp_managers, so those runtimes never reload and there’s no signal this happened.
💡 Proposed fix
```diff
-let reload_targets = runtime_configs
-    .iter()
-    .filter_map(|(agent_id, runtime_config)| {
-        mcp_managers.get(agent_id).map(|mcp_manager| {
-            (
-                agent_id.clone(),
-                runtime_config.clone(),
-                mcp_manager.clone(),
-            )
-        })
-    })
-    .collect::<Vec<_>>();
+let mut reload_targets = Vec::new();
+for (agent_id, runtime_config) in runtime_configs.iter() {
+    match mcp_managers.get(agent_id) {
+        Some(mcp_manager) => reload_targets.push((
+            agent_id.clone(),
+            runtime_config.clone(),
+            mcp_manager.clone(),
+        )),
+        None => tracing::warn!(
+            agent_id = %agent_id,
+            "settings written but skipped runtime reload (missing MCP manager)"
+        ),
+    }
+}
```

As per coding guidelines: "Don't silently discard errors; use let _ = only on channel sends where the receiver may be dropped; handle, log, or propagate all other errors."
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
let mut reload_targets = Vec::new();
for (agent_id, runtime_config) in runtime_configs.iter() {
    match mcp_managers.get(agent_id) {
        Some(mcp_manager) => reload_targets.push((
            agent_id.clone(),
            runtime_config.clone(),
            mcp_manager.clone(),
        )),
        None => tracing::warn!(
            agent_id = %agent_id,
            "settings written but skipped runtime reload (missing MCP manager)"
        ),
    }
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/api/settings.rs` around lines 343 - 353, The current iterator over
runtime_configs uses filter_map and silently drops agents missing in
mcp_managers (symbols: runtime_configs, mcp_managers, reload_targets, agent_id),
so change the logic to explicitly handle the missing case: iterate over
runtime_configs with .map or .filter_map that first checks
mcp_managers.get(agent_id) and if missing emit a warning via the logger (or
return an Err/Result) so the missing agent is not discarded silently; only
collect the (agent_id.clone(), runtime_config.clone(), mcp_manager.clone())
tuples when a manager exists, and ensure you log the agent_id and reason when
skipping so the omission is observable.
src/config.rs
Outdated
```rust
pub opencode_server_pool: ArcSwap<crate::opencode::OpenCodeServerPool>,
/// Cron store, set after agent initialization.
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Locate OpenCodeServerPool definitions =="
fd 'opencode.*\.rs$' src
echo "== Check for Drop impl on OpenCodeServerPool =="
ast-grep --pattern 'impl Drop for OpenCodeServerPool { $$$ }'
echo "== Inspect lifecycle/cleanup methods and child process ownership =="
rg -n -C3 'OpenCodeServerPool|impl Drop|fn (shutdown|close|stop|terminate)|tokio::process::Command|Child' src
echo "== Inspect all pool swap/load call sites =="
rg -n -C2 'opencode_server_pool\.(store|load)\(' src
Repository: spacedriveapp/spacebot
Length of output: 173
🏁 Script executed:
#!/bin/bash
echo "== Read OpenCodeServerPool definition in src/opencode.rs =="
wc -l src/opencode.rs
echo "== Find OpenCodeServerPool struct definition =="
rg -n 'struct OpenCodeServerPool' src/opencode.rs -A 20
echo "== Search for Drop implementations in opencode.rs =="
rg -n 'impl Drop' src/opencode.rs -B 2 -A 10
echo "== Search for shutdown/cleanup methods on OpenCodeServerPool =="
rg -n 'fn (shutdown|close|stop|terminate|drop)' src/opencode.rs -B 2 -A 5
echo "== Check config.rs lines around 5763-5764 =="
sed -n '5755,5775p' src/config.rs
echo "== Check config.rs lines around 5824 =="
sed -n '5815,5835p' src/config.rs
echo "== Check config.rs lines around 5917-5927 =="
sed -n '5910,5935p' src/config.rs
Repository: spacedriveapp/spacebot
Length of output: 3889
🏁 Script executed:
cat -n src/opencode.rs
Repository: spacedriveapp/spacebot
Length of output: 755
🏁 Script executed:
echo "== Read src/opencode/server.rs =="
wc -l src/opencode/server.rs
echo "== Show first 100 lines of server.rs =="
head -100 src/opencode/server.rs
echo "== Search for OpenCodeServerPool struct and Drop impl =="
rg -n 'struct OpenCodeServerPool|impl Drop|impl.*OpenCodeServerPool' src/opencode/server.rs -A 15
Repository: spacedriveapp/spacebot
Length of output: 5031
🏁 Script executed:
echo "== Search for OpenCodeServerPool Drop impl =="
rg -n 'impl Drop for OpenCodeServerPool' src/opencode/server.rs -A 15
echo "== Show lines around end of file to check for Drop impl =="
tail -100 src/opencode/server.rs | cat -n
Repository: spacedriveapp/spacebot
Length of output: 4330
Call shutdown_all() on replaced OpenCodeServerPool before hot-swap, rather than relying on implicit Drop cleanup.
While OpenCodeServer processes are killed via kill_on_drop(true) when the pool is replaced, the code should explicitly call opencode_server_pool.shutdown_all() on the old pool during hot-swap (lines 5917-5927) for graceful shutdown. The pool provides an explicit shutdown_all() method (in src/opencode/server.rs) that drains servers and calls kill() on each, but the hot-swap only calls .store(), leaving cleanup to implicit Drop semantics. For robust resource management, store the old pool before replacement and call shutdown_all() on it:
```rust
let old_pool = self.opencode_server_pool.load().as_ref().clone();
let new_pool = crate::opencode::OpenCodeServerPool::new(...);
self.opencode_server_pool.store(Arc::new(new_pool));
// Explicit cleanup of old pool
old_pool.shutdown_all().await;
```

Also applies to: 5824-5824, 5917-5927
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/config.rs` around lines 5763 - 5764, The hot-swap of the ArcSwap field
opencode_server_pool should explicitly call shutdown_all() on the old
OpenCodeServerPool before replacing it: load the current pool via
self.opencode_server_pool.load().as_ref().clone() to capture the old Arc,
construct the new OpenCodeServerPool, call
self.opencode_server_pool.store(Arc::new(new_pool)), then await
old_pool.shutdown_all() to drain and kill servers instead of relying on Drop;
apply this pattern wherever opencode_server_pool is replaced (e.g., in the
blocks around the existing replacements).
```diff
 if let Some(leak) = crate::secrets::scrub::scan_for_leaks(&scrubbed) {
     tracing::warn!(
         worker_id = %self.id,
-        "LEAK DETECTED in OpenCode worker output — terminating"
+        leak_prefix = %&leak[..leak.len().min(8)],
+        "potential secret detected in OpenCode worker output"
     );
```
🧩 Analysis chain
🏁 Script executed:
# First, let's locate and examine the file
find . -name "worker.rs" -path "*/opencode/*" 2>/dev/null
Repository: spacedriveapp/spacebot
Length of output: 90
🏁 Script executed:
# Check the specific lines mentioned in the review
cat -n src/opencode/worker.rs | sed -n '340,395p'
Repository: spacedriveapp/spacebot
Length of output: 3586
🏁 Script executed:
# Let's also check the type of the `leak` variable to understand the context better
cat -n src/opencode/worker.rs | sed -n '330,360p'
Repository: spacedriveapp/spacebot
Length of output: 1626
🏁 Script executed:
# Find the secrets module
find . -path "*/secrets/*" -name "*.rs" 2>/dev/null | head -20
Repository: spacedriveapp/spacebot
Length of output: 137
🏁 Script executed:
# Look for the scan_for_leaks function definition
rg "fn scan_for_leaks" -A 5
Repository: spacedriveapp/spacebot
Length of output: 801
🏁 Script executed:
# Search for similar byte-slicing patterns on strings in logging contexts
rg '\[\.\..*\.len\(\)\.min' -A 2 -B 2
Repository: spacedriveapp/spacebot
Length of output: 4254
Fix unsafe byte-slicing of potential secret values in logging.
Lines 347 and 387 use &leak[..leak.len().min(8)] which can panic when byte 8 falls in the middle of a UTF-8 multi-byte character (emojis, accented chars, CJK, etc.). Since scan_for_leaks() returns String, use a char-safe method:
```diff
-leak_prefix = %&leak[..leak.len().min(8)],
+leak_prefix = %leak.chars().take(8).collect::<String>(),
```

This applies identically to both occurrences.
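A runnable illustration of the distinction: a byte index can land mid-character (where `&s[..8]` would panic), while taking characters always stays on a boundary. The helper names here are illustrative, not from the codebase:

```rust
/// Char-safe truncation: never splits a multi-byte character.
fn char_safe_prefix(text: &str, max_chars: usize) -> String {
    text.chars().take(max_chars).collect()
}

/// Reports whether slicing at `max_bytes` would be legal; `&text[..n]`
/// panics whenever `n` is not a char boundary.
fn byte_cut_is_char_boundary(text: &str, max_bytes: usize) -> bool {
    text.is_char_boundary(max_bytes.min(text.len()))
}
```

For ASCII the two approaches agree, but CJK text (3 bytes per character in UTF-8) makes byte offset 8 fall inside a character.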
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
if let Some(leak) = crate::secrets::scrub::scan_for_leaks(&scrubbed) {
    tracing::warn!(
        worker_id = %self.id,
        leak_prefix = %leak.chars().take(8).collect::<String>(),
        "potential secret detected in OpenCode worker output"
    );
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/opencode/worker.rs` around lines 344 - 349, The logging currently slices
the detected secret bytes unsafely using &leak[..leak.len().min(8)], which can
panic on UTF‑8 multi‑byte boundaries; replace both occurrences with a char‑safe
truncation so the prefix is taken by characters rather than bytes (e.g., build a
prefix with leak.chars().take(8).collect::<String>() or compute a safe byte
index via leak.char_indices() and use that). Update the calls around
scan_for_leaks(), the local variable leak, and the tracing::warn/tracing::info
invocations to use the new char-safe prefix value.
```rust
// When sandbox is enabled, identity files must go through the
// dedicated identity API to keep the update flow consistent.
// When sandbox is disabled, the user has opted into full access.
if self.sandbox.mode_enabled() {
    let file_name = path.file_name().and_then(|n| n.to_str()).unwrap_or("");
    const PROTECTED_FILES: &[&str] = &["SOUL.md", "IDENTITY.md", "USER.md"];
    if PROTECTED_FILES
        .iter()
        .any(|f| file_name.eq_ignore_ascii_case(f))
    {
        return Err(FileError(
            "ACCESS DENIED: Identity files are protected and cannot be modified \
             through file operations. Use the identity management API instead."
                .to_string(),
        ));
    }
}
```
Keep identity/memory write protection unconditional in the file tool.
Line 219 makes protection depend on sandbox mode, which allows direct writes to protected identity paths when sandbox is disabled. That bypasses the required tool-routing guard.
🛠️ Suggested fix
```diff
-if self.sandbox.mode_enabled() {
-    let file_name = path.file_name().and_then(|n| n.to_str()).unwrap_or("");
-    const PROTECTED_FILES: &[&str] = &["SOUL.md", "IDENTITY.md", "USER.md"];
-    if PROTECTED_FILES
-        .iter()
-        .any(|f| file_name.eq_ignore_ascii_case(f))
-    {
-        return Err(FileError(
-            "ACCESS DENIED: Identity files are protected and cannot be modified \
-             through file operations. Use the identity management API instead."
-                .to_string(),
-        ));
-    }
-}
+let file_name = path.file_name().and_then(|n| n.to_str()).unwrap_or("");
+const PROTECTED_FILES: &[&str] = &["SOUL.md", "IDENTITY.md", "USER.md"];
+if PROTECTED_FILES
+    .iter()
+    .any(|f| file_name.eq_ignore_ascii_case(f))
+{
+    return Err(FileError(
+        "ACCESS DENIED: Identity files are protected and cannot be modified \
+         through file operations. Use the identity management API instead."
+            .to_string(),
+    ));
+}
```

As per coding guidelines: Workspace path guard: File tools must reject writes to identity/memory paths with an error directing the LLM to the correct tool.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
let file_name = path.file_name().and_then(|n| n.to_str()).unwrap_or("");
const PROTECTED_FILES: &[&str] = &["SOUL.md", "IDENTITY.md", "USER.md"];
if PROTECTED_FILES
    .iter()
    .any(|f| file_name.eq_ignore_ascii_case(f))
{
    return Err(FileError(
        "ACCESS DENIED: Identity files are protected and cannot be modified \
         through file operations. Use the identity management API instead."
            .to_string(),
    ));
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/tools/file.rs` around lines 216 - 232, The protection for identity/memory
files should be unconditional: remove the sandbox-mode conditional around the
protected-file check and always enforce the guard in the file write path. Locate
the block that calls self.sandbox.mode_enabled() and the PROTECTED_FILES
constant (and the file_name extraction via path.file_name().and_then(|n|
n.to_str())), and change the logic so that PROTECTED_FILES.iter().any(...)
immediately returns the FileError directing the caller to use the identity
management API regardless of sandbox state; keep the existing error message and
FileError return shape.
```rust
if let Some(leak) = crate::secrets::scrub::scan_for_leaks(&converted_content) {
    tracing::error!(
        conversation_id = %self.conversation_id,
        leak_prefix = %&leak[..leak.len().min(8)],
        "reply tool blocked content matching secret pattern"
    );
    return Err(ReplyError(
        "blocked reply content: potential secret detected".into(),
    ));
}
```
🧩 Analysis chain
🏁 Script executed:
# Search for scan_for_leaks definition
fd . -type f -name "*.rs" | xargs rg -l "fn scan_for_leaks"
Repository: spacedriveapp/spacebot
Length of output: 296
🏁 Script executed:
# Look for the secrets/scrub module structure
fd . -type f -path "*secrets*" -name "*.rs"
Repository: spacedriveapp/spacebot
Length of output: 236
🏁 Script executed:
# Search for scan_for_leaks definition and usage
rg "fn scan_for_leaks" --type rust -A 10
Repository: spacedriveapp/spacebot
Length of output: 1492
🏁 Script executed:
# Check the scrub module to understand what scan_for_leaks returns
rg "scan_for_leaks" --type rust -B 2 -A 5
Repository: spacedriveapp/spacebot
Length of output: 8592
Avoid logging any portion of the detected secret.
The scan_for_leaks() function returns the actual matched secret content as a String, not pattern metadata. Logging the first 8 characters of line 378 (leak_prefix = %&leak[..leak.len().min(8)]) exposes partial secrets in logs, which:
- Can aid attackers in reconstructing full credentials
- Violates compliance policies that restrict secret retention
- Persists the leaked value indefinitely in log storage
Log the length or detection status instead of any portion of the matched content.
Proposed fix
```diff
 if let Some(leak) = crate::secrets::scrub::scan_for_leaks(&converted_content) {
     tracing::error!(
         conversation_id = %self.conversation_id,
-        leak_prefix = %&leak[..leak.len().min(8)],
+        leak_len = leak.len(),
         "reply tool blocked content matching secret pattern"
     );
     return Err(ReplyError(
         "blocked reply content: potential secret detected".into(),
     ));
 }
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
if let Some(leak) = crate::secrets::scrub::scan_for_leaks(&converted_content) {
    tracing::error!(
        conversation_id = %self.conversation_id,
        leak_len = leak.len(),
        "reply tool blocked content matching secret pattern"
    );
    return Err(ReplyError(
        "blocked reply content: potential secret detected".into(),
    ));
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/tools/reply.rs` around lines 375 - 384, The trace currently logs a
substring of the detected secret from scan_for_leaks (variable leak) via
leak_prefix, which exposes sensitive data; update the logging in the block that
calls crate::secrets::scrub::scan_for_leaks to never include any characters of
the leak string—use a non-sensitive indicator such as leak_len = leak.len() or a
boolean leak_detected = true (and keep conversation_id) in the tracing::error
call, then return the same ReplyError; modify the code around the leak variable
and the tracing::error invocation in src/tools/reply.rs to remove any substring
access (e.g., &leak[..]) and log only length or detection status instead.
87f1038 to 5d6b65c (Compare)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/config.rs (1)
4856-4906: ⚠️ Potential issue | 🟠 Major
Validate maintenance numeric bounds before accepting config.
New maintenance knobs are accepted without guardrails. Values like negative
`maintenance_min_age_days` or out-of-range thresholds can silently produce invalid maintenance behavior.
🛠️ Proposed fix
```diff
+fn validate_cortex_maintenance(scope: &str, cortex: &CortexConfig) -> Result<()> {
+    for (field, value) in [
+        ("maintenance_decay_rate", cortex.maintenance_decay_rate),
+        ("maintenance_prune_threshold", cortex.maintenance_prune_threshold),
+        (
+            "maintenance_merge_similarity_threshold",
+            cortex.maintenance_merge_similarity_threshold,
+        ),
+    ] {
+        if !(0.0..=1.0).contains(&value) {
+            return Err(ConfigError::Invalid(format!(
+                "{scope}.cortex.{field} must be within [0.0, 1.0], got {value}"
+            ))
+            .into());
+        }
+    }
+
+    if cortex.maintenance_min_age_days < 0 {
+        return Err(ConfigError::Invalid(format!(
+            "{scope}.cortex.maintenance_min_age_days must be >= 0, got {}",
+            cortex.maintenance_min_age_days
+        ))
+        .into());
+    }
+
+    Ok(())
+}
```

```diff
-    cortex: toml
+    cortex: {
+        let resolved_cortex = toml
             .defaults
             .cortex
             .map(|c| CortexConfig { ... })
-            .unwrap_or(base_defaults.cortex),
+            .unwrap_or(base_defaults.cortex);
+        validate_cortex_maintenance("defaults", &resolved_cortex)?;
+        resolved_cortex
+    },
```

```diff
-    cortex: a.cortex.map(|c| CortexConfig {
-        ...
-    }),
+    cortex: match a.cortex {
+        Some(c) => {
+            let resolved_cortex = CortexConfig { ... };
+            validate_cortex_maintenance(
+                &format!("agents[{}]", a.id),
+                &resolved_cortex,
+            )?;
+            Some(resolved_cortex)
+        }
+        None => None,
+    },
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/config.rs` around lines 4856 - 4906, The new cortex fields are accepted blindly; add validation in the config build (where these fields are set, e.g., the Cortex config constructor or TryFrom implementation) to check numeric bounds before accepting values: ensure integer durations/counters (maintenance_min_age_days, maintenance_interval_secs, worker_timeout_secs, branch_timeout_secs, supervisor_kill_budget_per_tick, bulletin_max_words, bulletin_max_turns, association_max_per_pass, association_updates_threshold, detached_worker_timeout_retry_limit) are non-negative (and >0 where applicable), ensure probability/ratio thresholds (association_similarity_threshold, maintenance_merge_similarity_threshold, maintenance_prune_threshold, maintenance_decay_rate, circuit_breaker_threshold, bulletin_interval_secs if considered ratio) are within sensible ranges (e.g., 0.0..=1.0 for similarity/decay thresholds), and either return a ConfigError/Err on invalid values or clamp to safe defaults using base_defaults.cortex; implement these checks adjacent to the current unwrap_or usage so invalid incoming values are rejected or corrected before assignment.src/agent/compactor.rs (1)
142-157: ⚠️ Potential issue | 🟠 Major
Add the required 3-strike circuit breaker for recurring compaction failures.
The current failure path only logs and clears
`is_compacting`, so persistent backend failures can retrigger on every check cycle without backoff/disable behavior. As per coding guidelines:
`src/{cron,agent/compactor,agent/cortex}.rs`: Implement circuit breaker: auto-disable recurring tasks after 3 consecutive failures. Apply to cron jobs, maintenance workers, and cortex routines.
♻️ Duplicate comments (10)
src/api/config.rs (2)
619-621: ⚠️ Potential issue | 🟠 Major
Unchecked narrowing on
`maintenance_interval_secs` can persist invalid values. Line 620 uses
`u64 as i64`; values above `i64::MAX` will wrap and write invalid negative TOML integers. Use checked conversion and reject with `400`.
Proposed fix
```diff
-        if let Some(v) = cortex.maintenance_interval_secs {
-            table["maintenance_interval_secs"] = toml_edit::value(v as i64);
+        if let Some(value) = cortex.maintenance_interval_secs {
+            table["maintenance_interval_secs"] =
+                toml_edit::value(i64::try_from(value).map_err(|_| StatusCode::BAD_REQUEST)?);
         }
```
Verify each finding against the current code and only fix it if needed. In `@src/api/config.rs` around lines 619 - 621, The code currently casts cortex.maintenance_interval_secs from u64 to i64 unsafely (in the block that assigns table["maintenance_interval_secs"]), which will wrap values > i64::MAX and produce invalid negative TOML integers; change this to perform a checked conversion (e.g., use i64::try_from or compare against i64::MAX) and only set table["maintenance_interval_secs"] when conversion succeeds, otherwise return a 400 Bad Request error from the surrounding handler/function indicating the value is out of range; update the logic that references maintenance_interval_secs and the error path so callers get a proper rejection instead of writing a wrapped value.
622-633: ⚠️ Potential issue | 🟠 Major
Missing bounds validation for maintenance knobs before persistence.
Lines 622-633 write raw values without range checks. Invalid values (e.g., negative
`maintenance_min_age_days`, or thresholds outside `0.0..=1.0`) should be rejected at write time.
Proposed fix
```diff
+fn validate_unit_interval(field: &'static str, value: f32) -> Result<(), StatusCode> {
+    if (0.0..=1.0).contains(&value) {
+        Ok(())
+    } else {
+        tracing::warn!(field, value, "out-of-range maintenance config value");
+        Err(StatusCode::BAD_REQUEST)
+    }
+}
@@
-        if let Some(v) = cortex.maintenance_decay_rate {
-            table["maintenance_decay_rate"] = toml_edit::value(v as f64);
+        if let Some(value) = cortex.maintenance_decay_rate {
+            validate_unit_interval("maintenance_decay_rate", value)?;
+            table["maintenance_decay_rate"] = toml_edit::value(value as f64);
         }
-        if let Some(v) = cortex.maintenance_prune_threshold {
-            table["maintenance_prune_threshold"] = toml_edit::value(v as f64);
+        if let Some(value) = cortex.maintenance_prune_threshold {
+            validate_unit_interval("maintenance_prune_threshold", value)?;
+            table["maintenance_prune_threshold"] = toml_edit::value(value as f64);
         }
-        if let Some(v) = cortex.maintenance_min_age_days {
-            table["maintenance_min_age_days"] = toml_edit::value(v);
+        if let Some(value) = cortex.maintenance_min_age_days {
+            if value < 0 {
+                tracing::warn!(value, "maintenance_min_age_days must be >= 0");
+                return Err(StatusCode::BAD_REQUEST);
+            }
+            table["maintenance_min_age_days"] = toml_edit::value(value);
         }
-        if let Some(v) = cortex.maintenance_merge_similarity_threshold {
-            table["maintenance_merge_similarity_threshold"] = toml_edit::value(v as f64);
+        if let Some(value) = cortex.maintenance_merge_similarity_threshold {
+            validate_unit_interval("maintenance_merge_similarity_threshold", value)?;
+            table["maintenance_merge_similarity_threshold"] = toml_edit::value(value as f64);
         }
```
Verify each finding against the current code and only fix it if needed. In `@src/api/config.rs` around lines 622 - 633, Validate each maintenance knob Option before writing to the TOML table: for cortex.maintenance_decay_rate, cortex.maintenance_prune_threshold, and cortex.maintenance_merge_similarity_threshold ensure the Some(v) value is within 0.0..=1.0; for cortex.maintenance_min_age_days ensure the Some(v) value is >= 0. Do the checks in the same scope that currently writes to table (i.e., before assigning table["maintenance_*"] = toml_edit::value(...)); if a value is out of bounds return or propagate a clear error (with which field failed and why) instead of persisting the invalid value so the invalid config is rejected at write time.src/agent/worker.rs (1)
386-395: ⚠️ Potential issue | 🟠 Major
Use validated transitions in the follow-up failure/waiting path.
This segment still mutates
`self.state` directly, which bypasses state-machine guards and can hide illegal transitions.
♻️ Suggested fix
```diff
-    if let Err(failure_reason) = follow_up_result {
-        self.state = WorkerState::Failed;
+    if let Err(failure_reason) = follow_up_result {
+        self.transition_to(WorkerState::Failed)?;
         self.hook.send_status("failed");
         follow_up_failure = Some(failure_reason);
         break;
     }
-    self.state = WorkerState::WaitingForInput;
+    self.transition_to(WorkerState::WaitingForInput)?;
     self.hook.send_status("waiting for input");
```
src/agent/worker.rs: Implement worker state machine usingcan_transition_to()withmatches!. Illegal transitions are runtime errors, not silent bugs.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agent/worker.rs` around lines 386 - 395, The code is directly assigning self.state = WorkerState::Failed / WorkerState::WaitingForInput after follow_up_result handling; instead validate transitions via the worker state machine before mutating state: check self.state.can_transition_to(WorkerState::Failed) (and similarly for WorkerState::WaitingForInput) using matches! as per guidelines, and only set self.state when the check passes; if the transition is illegal, convert it into a runtime error/panic or return an Err instead of silently mutating; update the block around follow_up_result, follow_up_failure, and the hook.send_status calls to perform these guarded transitions.src/memory/maintenance.rs (2)
324-377:⚠️ Potential issue | 🔴 Critical
`merge_pair` needs atomicity; current cancel/error points can persist partial rewires.
This block updates survivor content, deletes/recreates associations, writes an updates edge, forgets the loser, and deletes embeddings as separate operations. A mid-sequence cancellation or DB failure can leave the memory graph inconsistent.
251-270: ⚠️ Potential issue | 🔴 Critical
Refresh
`active_survivor` after each successful merge.
`active_survivor` stays stale after `merge_pair(...)`, so the next merge can recompute content from outdated text and overwrite previously merged additions.
♻️ Suggested fix
```diff
-            merge_pair(
+            merge_pair(
                 memory_store,
                 embedding_table,
                 &winner,
                 &loser,
                 maintenance_cancel_rx,
             )
             .await?;
             merged_memory_ids.insert(loser.id.clone());
             merged_count += 1;
@@
-            if winner.id != active_survivor.id {
-                active_survivor = winner;
-            }
+            active_survivor = memory_store
+                .load(&winner.id)
+                .await?
+                .ok_or_else(|| anyhow::anyhow!("merged survivor disappeared during maintenance"))?;
```
Verify each finding against the current code and only fix it if needed. In `@src/memory/maintenance.rs` around lines 251 - 270, active_survivor can become stale after merge_pair because the stored winner document is updated in-place; after each successful merge (where you currently call merge_pair and then use winner/loser), reload the latest survivor from the store and assign it to active_survivor so subsequent choose_merge_pair calls use up-to-date content — i.e., after merge_pair(...).await? insert merged_memory_ids/merged_count logic, then fetch the current winner by id from memory_store (using the winner.id) and set active_survivor = that fresh record (or change merge_pair to return the updated survivor and assign it) before continuing or breaking on source_id.src/api/channels.rs (1)
293-297: ⚠️ Potential issue | 🟠 Major
Don’t collapse cancellation errors into boolean/404 paths.
`is_ok()` and `map_err(|_| NOT_FOUND)` hide real cancellation failures and can misreport server/runtime issues as not found.
🛠️ Proposed fix
```diff
-    if let Some(channel_state) = channel_state
-        && channel_state
-            .cancel_worker_with_reason(worker_id, "cancelled via API")
-            .await
-            .is_ok()
-    {
-        return Ok(Json(CancelProcessResponse {
-            success: true,
-            message: format!("Worker {} cancelled", request.process_id),
-        }));
-    }
+    if let Some(channel_state) = channel_state {
+        match channel_state
+            .cancel_worker_with_reason(worker_id, "cancelled via API")
+            .await
+        {
+            Ok(()) => {
+                return Ok(Json(CancelProcessResponse {
+                    success: true,
+                    message: format!("Worker {} cancelled", request.process_id),
+                }));
+            }
+            Err(error) => {
+                tracing::warn!(
+                    %error,
+                    channel_id = %request.channel_id,
+                    process_id = %request.process_id,
+                    "live worker cancellation failed; trying detached reconciliation"
+                );
+            }
+        }
+    }
 ...
-    channel_state
-        .cancel_branch_with_reason(branch_id, "cancelled via API")
-        .await
-        .map_err(|_| StatusCode::NOT_FOUND)?;
+    channel_state
+        .cancel_branch_with_reason(branch_id, "cancelled via API")
+        .await
+        .map_err(|error| {
+            tracing::warn!(
+                %error,
+                channel_id = %request.channel_id,
+                process_id = %request.process_id,
+                "branch cancellation failed"
+            );
+            StatusCode::INTERNAL_SERVER_ERROR
+        })?;
```
let _ =on Results. Handle them, log them, or propagate them. The only exception is.ok()on channel sends where the receiver may be dropped”.Also applies to: 349-352
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/api/channels.rs` around lines 293 - 297, The current use of .is_ok() (and similar mapping to NOT_FOUND) on channel_state.cancel_worker_with_reason(worker_id, "cancelled via API").await silently discards real errors; instead match on the Result from cancel_worker_with_reason and handle errors explicitly (e.g., log the error with process/server logger and return an appropriate server error or propagate the error instead of converting everything to 404). Update the same pattern around the other occurrence (lines referencing cancel_worker_with_reason at 349-352) so neither path swallows failures — use match/if let Err(e) to log e and return a proper error response rather than .is_ok()/map_err(|_| NOT_FOUND).src/agent/channel_dispatch.rs (1)
620-632: ⚠️ Potential issue | 🟠 Major
Scrub panic payload text before logging/emitting completion error.
The panic branch logs and returns
`panic_message` directly. That bypasses secret scrubbing used in other failure paths and can leak sensitive text.
🔒 Proposed fix
```diff
         Err(panic_payload) => {
             let panic_message = panic_payload
                 .downcast_ref::<&str>()
                 .map(|message| (*message).to_string())
                 .or_else(|| panic_payload.downcast_ref::<String>().cloned())
                 .unwrap_or_else(|| "unknown panic payload".to_string());
+            let panic_message = if let Some(store) = &secrets_store {
+                crate::secrets::scrub::scrub_with_store(&panic_message, store)
+            } else {
+                panic_message
+            };
             tracing::error!(
                 worker_id = %worker_id,
                 panic_message = %panic_message,
                 "worker task panicked"
             );
             Err(WorkerCompletionError::failed(format!(
                 "worker task panicked: {panic_message}"
             )))
         }
```
Verify each finding against the current code and only fix it if needed. In `@src/agent/channel_dispatch.rs` around lines 620 - 632, The panic path builds panic_message from panic_payload and logs it directly (tracing::error) and returns it in WorkerCompletionError::failed, which can leak secrets; fix by passing the derived panic_message through the project's secret-scrubbing helper used elsewhere (the same scrub/redact function applied to other failure paths) before logging and before constructing WorkerCompletionError::failed, i.e., compute panic_message as now, then let scrubbed = scrub_secret_text(panic_message) (or the actual project scrubbing function name), use scrubbed in tracing::error and in WorkerCompletionError::failed.src/agent/channel.rs (1)
1774-1783:⚠️ Potential issue | 🔴 Critical
`WorkerComplete` still exits too early and skips convergence work.
active_workers, Line 1776 returns before completion logging, handle/input cleanup, and retrigger queuing. This can drop synthetic terminal events and late completions.🛠️ Suggested fix
- let mut workers = self.state.active_workers.write().await; - if workers.remove(worker_id).is_none() { - return Ok(()); - } - drop(workers); - - run_logger.log_worker_completed(*worker_id, result, *success); - - self.state.worker_handles.write().await.remove(worker_id); - self.state.worker_inputs.write().await.remove(worker_id); + let removed_worker = self.state.active_workers.write().await.remove(worker_id).is_some(); + let removed_handle = self.state.worker_handles.write().await.remove(worker_id).is_some(); + let removed_input = self.state.worker_inputs.write().await.remove(worker_id).is_some(); + if !(removed_worker || removed_handle || removed_input) { + return Ok(()); + } + + run_logger.log_worker_completed(*worker_id, result, *success);Based on learnings "For changes in async/stateful paths (worker lifecycle, cancellation, retrigger, recall cache behavior), include explicit race/terminal-state reasoning in the PR summary and run targeted tests in addition to
just gate-pr".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agent/channel.rs` around lines 1774 - 1783, The early return when active_workers.remove(worker_id).is_none() skips completion logging, cleanup, and retriggering; change the logic in the WorkerComplete handling so you do not return early: record whether the worker was removed but always call run_logger.log_worker_completed(*worker_id, result, *success), always remove entries from self.state.worker_handles and self.state.worker_inputs, and ensure any retrigger/queuing logic still runs for late or synthetic terminal events; reference active_workers, run_logger.log_worker_completed, worker_handles, worker_inputs and the WorkerComplete handling path when making the change.src/main.rs (1)
1662-1680: ⚠️ Potential issue | 🟠 Major
Channel termination cleanup still leaves stale
`active_channels` entries. After
`channel.run()` returns, cleanup unregisters registry/API state but never signals the main loop to remove `active_channels[conversation_id]`. The first subsequent inbound message for that conversation can still be dropped before retry.
💡 Suggested fix
```diff
-    let cleanup_channel_id = conversation_id.clone();
+    let cleanup_channel_id = conversation_id.clone();
+    let closed_channel_tx = closed_channel_tx.clone();
     let process_control_registry = agent.deps.process_control_registry.clone();
     let api_state_for_cleanup = api_state.clone();
     tokio::spawn(async move {
         if let Err(error) = channel.run().await {
             tracing::error!(%error, "channel event loop failed");
         }
         let scoped_channel_id: spacebot::ChannelId = Arc::from(cleanup_channel_id.as_str());
         process_control_registry
             .unregister_channel(&scoped_channel_id)
             .await;
         api_state_for_cleanup
             .unregister_channel_status(&cleanup_channel_id)
             .await;
         api_state_for_cleanup
             .unregister_channel_state(&cleanup_channel_id)
             .await;
+        let _ = closed_channel_tx.send(cleanup_channel_id).await;
     });
```
just gate-pr".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.rs` around lines 1662 - 1680, The spawned task cleans registry and API state after channel.run() returns but never notifies the main loop to remove the entry from active_channels for conversation_id, leaving a stale entry that can drop the next inbound message; after the existing cleanup steps (after process_control_registry.unregister_channel and api_state_for_cleanup.unregister_channel_*), send a clear/terminate signal to the component that owns active_channels (or call its removal API) to remove active_channels[&conversation_id] (or invoke the main-loop's channel removal method), using the same conversation_id/Arc scoped_channel_id to avoid races and ensure removal happens before returning from the spawned task.docs/design-docs/cortex-implementation.md (1)
19-20: ⚠️ Potential issue | 🟡 Minor
Resolve the contradictory maintenance status text.
Line 19 says “never called,” but Line 20 says these functions are wired into the cortex loop. Please make this section state one clear runtime truth.
📝 Suggested doc fix
```diff
-**Implemented but never called:**
-- `memory/maintenance.rs` — `apply_decay()`, `prune_memories()`, and `merge_similar_memories()` are implemented and wired into the cortex loop.
+**Implemented and wired:**
+- `memory/maintenance.rs` — `apply_decay()`, `prune_memories()`, and `merge_similar_memories()` are implemented and wired into the cortex loop.
```
Verify each finding against the current code and only fix it if needed. In `@docs/design-docs/cortex-implementation.md` around lines 19 - 20, The heading "Implemented but never called" contradicts the following sentence saying apply_decay(), prune_memories(), and merge_similar_memories() are wired into the cortex loop; update the text so it states the single runtime truth: either remove "never called" and say these functions are implemented and wired into the cortex loop, or if they truly aren't invoked, change the sentence to indicate they are implemented but not yet wired; ensure you update the header and the sentence about apply_decay(), prune_memories(), and merge_similar_memories() to match.
🧹 Nitpick comments (3)
src/config.rs (2)
7613-7660: Expand this test to cover all newly added Cortex fields.
The test verifies maintenance fields, but it does not assert
`detached_worker_timeout_retry_limit` and `supervisor_kill_budget_per_tick`. Please add assertions so partial wiring regressions get caught.
✅ Test extension example
```diff
 [defaults.cortex]
 maintenance_interval_secs = 1200
+detached_worker_timeout_retry_limit = 4
+supervisor_kill_budget_per_tick = 16
 maintenance_decay_rate = 0.11
 maintenance_prune_threshold = 0.21
 maintenance_min_age_days = 17
 maintenance_merge_similarity_threshold = 0.97
@@
     assert_eq!(resolved.cortex.maintenance_interval_secs, 1200);
+    assert_eq!(resolved.cortex.detached_worker_timeout_retry_limit, 4);
+    assert_eq!(resolved.cortex.supervisor_kill_budget_per_tick, 16);
     assert_eq!(resolved.cortex.maintenance_decay_rate, 0.11);
@@
-cortex = { maintenance_interval_secs = 300, maintenance_decay_rate = 0.33 }
+cortex = {
+    maintenance_interval_secs = 300,
+    detached_worker_timeout_retry_limit = 1,
+    supervisor_kill_budget_per_tick = 3,
+    maintenance_decay_rate = 0.33
+}
@@
     assert_eq!(resolved_override.cortex.maintenance_interval_secs, 300);
+    assert_eq!(resolved_override.cortex.detached_worker_timeout_retry_limit, 1);
+    assert_eq!(resolved_override.cortex.supervisor_kill_budget_per_tick, 3);
     assert_eq!(resolved_override.cortex.maintenance_decay_rate, 0.33);
```
Verify each finding against the current code and only fix it if needed. In `@src/config.rs` around lines 7613 - 7660, The test test_cortex_maintenance_config_resolution currently asserts several maintenance fields but omits the new cortex settings detached_worker_timeout_retry_limit and supervisor_kill_budget_per_tick; update the assertions to also check resolved.cortex.detached_worker_timeout_retry_limit and resolved.cortex.supervisor_kill_budget_per_tick after parsing the defaults TOML, and likewise assert resolved_override.cortex.detached_worker_timeout_retry_limit and resolved_override.cortex.supervisor_kill_budget_per_tick after the override TOML (using the expected default values for fields not overridden and the explicit override values if you add any to the override case).
4850-4908: Extract Cortex merge into one resolver to avoid drift.
The same field-by-field
`CortexConfig` merge now exists in both defaults and per-agent paths. Centralizing this into one resolver will prevent divergence as new fields keep being added.
As per coding guidelines
`src/config.rs`: “Use config resolution with precedence: env > DB > default, with per-subsystem `resolve()` methods”.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/config.rs` around lines 4850 - 4908, The field-by-field CortexConfig merge is duplicated; extract the merge logic into a single resolver on the CortexConfig type (e.g., implement CortexConfig::resolve(&self, base: &CortexConfig) -> CortexConfig or CortexConfig::merge_with_defaults(&self, base: &CortexConfig) -> CortexConfig) and call that resolver wherever you currently build CortexConfig by manually mapping each field (the current closures that call .map(|c| CortexConfig { ... }).unwrap_or(base_defaults.cortex) and the similar per-agent merge). Replace both inline merges with a call to the new resolve/merge method so future fields are handled in one place and follow the intended env > DB > default precedence.src/hooks/spacebot.rs (1)
477-491: Redundant leak detection — this code is unreachable.
`guard_tool_result` (line 468) already scans `result` for leaks and returns `Terminate` if one is found. The early return at lines 469-472 ensures we never reach this channel/reply-specific check when a leak exists.
`guard_tool_result` for channel/reply cases.
♻️ Proposed fix to remove redundant code
```diff
             self.record_tool_result_metrics(tool_name, internal_call_id);
             return guard_action;
         }
-        // Only enforce hard-stop leak blocking on channel egress (`reply`).
-        // Worker and branch tool outputs are internal and should not terminate
-        // long-running jobs.
-        if self.process_type == ProcessType::Channel
-            && tool_name == "reply"
-            && let Some(leak) = self.scan_for_leaks(result)
-        {
-            tracing::error!(
-                process_id = %self.process_id,
-                tool_name = %tool_name,
-                leak_prefix = %&leak[..leak.len().min(8)],
-                "secret leak detected in reply result, terminating channel turn"
-            );
-            self.record_tool_result_metrics(tool_name, internal_call_id);
-            return HookAction::Terminate {
-                reason: "Reply contained a secret. Channel turn terminated.".into(),
-            };
-        }
-
         // Cap the result stored in the broadcast event to avoid blowing up
```
Verify each finding against the current code and only fix it if needed. In `@src/hooks/spacebot.rs` around lines 477 - 491, The channel/reply-specific leak detection block is unreachable because guard_tool_result already checks result for leaks and returns Terminate; remove the redundant if-block that checks self.process_type == ProcessType::Channel && tool_name == "reply" && let Some(leak) = self.scan_for_leaks(result) (and its tracing::error, record_tool_result_metrics call, and HookAction::Terminate return) OR, if you intend a distinct termination reason for channel replies, move this check to run before guard_tool_result so it can short-circuit earlier; update any references to record_tool_result_metrics and the termination reason string accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/agent/channel_dispatch.rs`:
- Around line 19-31: reserve_worker_slot_local currently only performs a
non-atomic check against active_worker_count and returns Ok(()) so concurrent
spawns can both pass and oversubscribe workers; change it to perform an atomic
reservation instead (e.g., acquire a mutex/lock around the channel's active
worker state or use a semaphore/counter with fetch_update) so you
increment/reserve a slot before returning Ok and decrement/release on worker
exit/failure; keep the same error path using AgentError::WorkerLimitReached with
channel_id and max_workers when reservation fails, and apply the same
atomic-reservation fix to the analogous logic referenced around lines 309-313.
In `@src/agent/channel.rs`:
- Around line 1693-1702: The code removes branch_reply_targets too early, so the
later retrieval for retrigger metadata (the later access around where
branch_reply_targets is read) always misses; change the logic in the function
handling branch completion so you do not unconditionally remove from
self.branch_reply_targets before you read it for retrigger metadata—either move
the remove(branch_id) call to after the retrigger metadata is read/used or
replace the unconditional remove with a read-only access (e.g., get/peek) and
only remove when you truly want to drop the target; refer to the symbols
self.branch_reply_targets.remove(branch_id) and the later retrieval code around
the retrigger metadata access and adjust so the reply target is available when
needed.
In `@src/agent/cortex_chat.rs`:
- Around line 93-104: persist_and_emit_cortex_chat_error currently discards the
Result from store.save_message; change it to explicitly handle errors by
awaiting the save and matching its Result (e.g., if let Err(e) =
store.save_message(thread_id, "assistant", &message, channel_ref).await {
tracing::error!("persist_and_emit_cortex_chat_error: failed to save message for
thread_id={} channel_ref={:?}: {:?}", thread_id, channel_ref, e); } ), and then
still send the CortexChatEvent::Error (use .await.ok() on event_tx.send if the
receiver may be dropped). Reference the function
persist_and_emit_cortex_chat_error, the CortexChatStore::save_message call, and
the event_tx.send(CortexChatEvent::Error { message }) send to locate where to
add the match and logging.
In `@src/agent/ingestion.rs`:
- Around line 521-525: The tracing::debug call in ingestion.rs currently logs
the model's response content (response) which may expose sensitive ingested
text; update the log in the chunk processing path (the tracing::debug block that
references filename, chunk_number, total_chunks, and response) to avoid
including response text — log only metadata such as filename, chunk id/total
(chunk_number and total_chunks) and response length (e.g., response.len()) or
other non-textual attributes instead of the response string itself.
---
Outside diff comments:
In `@src/config.rs`:
- Around line 4856-4906: The new cortex fields are accepted blindly; add
validation in the config build (where these fields are set, e.g., the Cortex
config constructor or TryFrom implementation) to check numeric bounds before
accepting values: ensure integer durations/counters (maintenance_min_age_days,
maintenance_interval_secs, worker_timeout_secs, branch_timeout_secs,
supervisor_kill_budget_per_tick, bulletin_max_words, bulletin_max_turns,
association_max_per_pass, association_updates_threshold,
detached_worker_timeout_retry_limit) are non-negative (and >0 where applicable),
ensure probability/ratio thresholds (association_similarity_threshold,
maintenance_merge_similarity_threshold, maintenance_prune_threshold,
maintenance_decay_rate, circuit_breaker_threshold, bulletin_interval_secs if
considered ratio) are within sensible ranges (e.g., 0.0..=1.0 for
similarity/decay thresholds), and either return a ConfigError/Err on invalid
values or clamp to safe defaults using base_defaults.cortex; implement these
checks adjacent to the current unwrap_or usage so invalid incoming values are
rejected or corrected before assignment.
---
Duplicate comments:
In `@docs/design-docs/cortex-implementation.md`:
- Around line 19-20: The heading "Implemented but never called" contradicts the
following sentence saying apply_decay(), prune_memories(), and
merge_similar_memories() are wired into the cortex loop; update the text so it
states the single runtime truth: either remove "never called" and say these
functions are implemented and wired into the cortex loop, or if they truly
aren't invoked, change the sentence to indicate they are implemented but not yet
wired; ensure you update the header and the sentence about apply_decay(),
prune_memories(), and merge_similar_memories() to match.
In `@src/agent/channel_dispatch.rs`:
- Around line 620-632: The panic path builds panic_message from panic_payload
and logs it directly (tracing::error) and returns it in
WorkerCompletionError::failed, which can leak secrets; fix by passing the
derived panic_message through the project's secret-scrubbing helper used
elsewhere (the same scrub/redact function applied to other failure paths) before
logging and before constructing WorkerCompletionError::failed, i.e., compute
panic_message as now, then let scrubbed = scrub_secret_text(panic_message) (or
the actual project scrubbing function name), use scrubbed in tracing::error and
in WorkerCompletionError::failed.
In `@src/agent/channel.rs`:
- Around line 1774-1783: The early return when
active_workers.remove(worker_id).is_none() skips completion logging, cleanup,
and retriggering; change the logic in the WorkerComplete handling so you do not
return early: record whether the worker was removed but always call
run_logger.log_worker_completed(*worker_id, result, *success), always remove
entries from self.state.worker_handles and self.state.worker_inputs, and ensure
any retrigger/queuing logic still runs for late or synthetic terminal events;
reference active_workers, run_logger.log_worker_completed, worker_handles,
worker_inputs and the WorkerComplete handling path when making the change.
In `@src/agent/worker.rs`:
- Around line 386-395: The code is directly assigning self.state =
WorkerState::Failed / WorkerState::WaitingForInput after follow_up_result
handling; instead validate transitions via the worker state machine before
mutating state: check self.state.can_transition_to(WorkerState::Failed) (and
similarly for WorkerState::WaitingForInput) using matches! as per guidelines,
and only set self.state when the check passes; if the transition is illegal,
convert it into a runtime error/panic or return an Err instead of silently
mutating; update the block around follow_up_result, follow_up_failure, and the
hook.send_status calls to perform these guarded transitions.
In `@src/api/channels.rs`:
- Around line 293-297: The current use of .is_ok() (and similar mapping to
NOT_FOUND) on channel_state.cancel_worker_with_reason(worker_id, "cancelled via
API").await silently discards real errors; instead match on the Result from
cancel_worker_with_reason and handle errors explicitly (e.g., log the error with
process/server logger and return an appropriate server error or propagate the
error instead of converting everything to 404). Update the same pattern around
the other occurrence (lines referencing cancel_worker_with_reason at 349-352) so
neither path swallows failures — use match/if let Err(e) to log e and return a
proper error response rather than .is_ok()/map_err(|_| NOT_FOUND).
In `@src/api/config.rs`:
- Around line 619-621: The code currently casts cortex.maintenance_interval_secs
from u64 to i64 unsafely (in the block that assigns
table["maintenance_interval_secs"]), which will wrap values > i64::MAX and
produce invalid negative TOML integers; change this to perform a checked
conversion (e.g., use i64::try_from or compare against i64::MAX) and only set
table["maintenance_interval_secs"] when conversion succeeds, otherwise return a
400 Bad Request error from the surrounding handler/function indicating the value
is out of range; update the logic that references maintenance_interval_secs and
the error path so callers get a proper rejection instead of writing a wrapped
value.
- Around line 622-633: Validate each maintenance knob Option before writing to
the TOML table: for cortex.maintenance_decay_rate,
cortex.maintenance_prune_threshold, and
cortex.maintenance_merge_similarity_threshold ensure the Some(v) value is within
0.0..=1.0; for cortex.maintenance_min_age_days ensure the Some(v) value is >= 0.
Do the checks in the same scope that currently writes to table (i.e., before
assigning table["maintenance_*"] = toml_edit::value(...)); if a value is out of
bounds return or propagate a clear error (with which field failed and why)
instead of persisting the invalid value so the invalid config is rejected at
write time.
In `@src/main.rs`:
- Around line 1662-1680: The spawned task cleans registry and API state after
channel.run() returns but never notifies the main loop to remove the entry from
active_channels for conversation_id, leaving a stale entry that can drop the
next inbound message; after the existing cleanup steps (after
process_control_registry.unregister_channel and
api_state_for_cleanup.unregister_channel_*), send a clear/terminate signal to
the component that owns active_channels (or call its removal API) to remove
active_channels[&conversation_id] (or invoke the main-loop's channel removal
method), using the same conversation_id/Arc scoped_channel_id to avoid races and
ensure removal happens before returning from the spawned task.
In `@src/memory/maintenance.rs`:
- Around line 251-270: active_survivor can become stale after merge_pair because
the stored winner document is updated in-place; after each successful merge
(where you currently call merge_pair and then use winner/loser), reload the
latest survivor from the store and assign it to active_survivor so subsequent
choose_merge_pair calls use up-to-date content — i.e., after
merge_pair(...).await? insert merged_memory_ids/merged_count logic, then fetch
the current winner by id from memory_store (using the winner.id) and set
active_survivor = that fresh record (or change merge_pair to return the updated
survivor and assign it) before continuing or breaking on source_id.
---
Nitpick comments:
In `@src/config.rs`:
- Around line 7613-7660: The test test_cortex_maintenance_config_resolution
currently asserts several maintenance fields but omits the new cortex settings
detached_worker_timeout_retry_limit and supervisor_kill_budget_per_tick; update
the assertions to also check resolved.cortex.detached_worker_timeout_retry_limit
and resolved.cortex.supervisor_kill_budget_per_tick after parsing the defaults
TOML, and likewise assert
resolved_override.cortex.detached_worker_timeout_retry_limit and
resolved_override.cortex.supervisor_kill_budget_per_tick after the override TOML
(using the expected default values for fields not overridden and the explicit
override values if you add any to the override case).
- Around line 4850-4908: The field-by-field CortexConfig merge is duplicated;
extract the merge logic into a single resolver on the CortexConfig type (e.g.,
implement CortexConfig::resolve(&self, base: &CortexConfig) -> CortexConfig or
CortexConfig::merge_with_defaults(&self, base: &CortexConfig) -> CortexConfig)
and call that resolver wherever you currently build CortexConfig by manually
mapping each field (the current closures that call .map(|c| CortexConfig { ...
}).unwrap_or(base_defaults.cortex) and the similar per-agent merge). Replace
both inline merges with a call to the new resolve/merge method so future fields
are handled in one place and follow the intended env > DB > default precedence.
In `@src/hooks/spacebot.rs`:
- Around line 477-491: The channel/reply-specific leak detection block is
unreachable because guard_tool_result already checks result for leaks and
returns Terminate; remove the redundant if-block that checks self.process_type
== ProcessType::Channel && tool_name == "reply" && let Some(leak) =
self.scan_for_leaks(result) (and its tracing::error, record_tool_result_metrics
call, and HookAction::Terminate return) OR, if you intend a distinct termination
reason for channel replies, move this check to run before guard_tool_result so
it can short-circuit earlier; update any references to
record_tool_result_metrics and the termination reason string accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5941001f-4362-4d61-a385-f666bd620bf2
⛔ Files ignored due to path filters (15)
- `Cargo.lock` is excluded by `!**/*.lock`, `!**/*.lock`
- `Cargo.toml` is excluded by `!**/*.toml`
- `vendor/imap-proto-0.10.2/Cargo.toml` is excluded by `!**/*.toml`, `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/body.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/body_structure.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/builders/command.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/builders/mod.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/core.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/lib.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/macros.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/parser/mod.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/parser/rfc3501.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/parser/rfc4551.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/parser/rfc5464.rs` is excluded by `!**/vendor/**`
- `vendor/imap-proto-0.10.2/src/types.rs` is excluded by `!**/vendor/**`
📒 Files selected for processing (44)
- README.md
- docs/content/docs/(configuration)/config.mdx
- docs/content/docs/(core)/architecture.mdx
- docs/design-docs/cortex-chat.md
- docs/design-docs/cortex-history.md
- docs/design-docs/cortex-implementation.md
- docs/design-docs/tool-nudging.md
- src/agent.rs
- src/agent/branch.rs
- src/agent/channel.rs
- src/agent/channel_dispatch.rs
- src/agent/channel_history.rs
- src/agent/compactor.rs
- src/agent/cortex.rs
- src/agent/cortex_chat.rs
- src/agent/ingestion.rs
- src/agent/invariant_harness.rs
- src/agent/process_control.rs
- src/agent/status.rs
- src/agent/worker.rs
- src/api/agents.rs
- src/api/channels.rs
- src/api/config.rs
- src/api/cortex.rs
- src/api/state.rs
- src/api/system.rs
- src/config.rs
- src/hooks.rs
- src/hooks/spacebot.rs
- src/lib.rs
- src/llm/model.rs
- src/main.rs
- src/memory/maintenance.rs
- src/memory/store.rs
- src/telemetry/registry.rs
- src/tools.rs
- src/tools/cancel.rs
- src/tools/memory_save.rs
- src/tools/worker_inspect.rs
- tests/bulletin.rs
- tests/context_dump.rs
- tests/detached_worker_bootstrap.rs
- tests/maintenance.rs
- tests/tool_nudge.rs
🚧 Files skipped from review as they are similar to previous changes (10)
- src/api/system.rs
- README.md
- src/tools/worker_inspect.rs
- src/api/state.rs
- src/agent/status.rs
- tests/maintenance.rs
- docs/design-docs/cortex-chat.md
- docs/content/docs/(configuration)/config.mdx
- docs/content/docs/(core)/architecture.mdx
- src/tools.rs
```rust
pub(crate) fn reserve_worker_slot_local(
    active_worker_count: usize,
    channel_id: &Arc<str>,
    max_workers: usize,
) -> std::result::Result<(), AgentError> {
    if active_worker_count >= max_workers {
        return Err(AgentError::WorkerLimitReached {
            channel_id: channel_id.to_string(),
            max: max_workers,
        });
    }

    Ok(())
}
```
reserve_worker_slot_local is check-only; concurrent spawns can exceed max_workers.
This gate reads active_workers.len() and returns Ok(()) without reserving a slot atomically. Two concurrent spawns can both pass and oversubscribe worker capacity.
Based on learnings: “For changes in async/stateful paths (worker lifecycle, cancellation, retrigger, recall cache behavior), include explicit race/terminal-state reasoning…”.
Also applies to: 309-313
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/agent/channel_dispatch.rs` around lines 19 - 31,
reserve_worker_slot_local currently only performs a non-atomic check against
active_worker_count and returns Ok(()) so concurrent spawns can both pass and
oversubscribe workers; change it to perform an atomic reservation instead (e.g.,
acquire a mutex/lock around the channel's active worker state or use a
semaphore/counter with fetch_update) so you increment/reserve a slot before
returning Ok and decrement/release on worker exit/failure; keep the same error
path using AgentError::WorkerLimitReached with channel_id and max_workers when
reservation fails, and apply the same atomic-reservation fix to the analogous
logic referenced around lines 309-313.
```rust
async fn persist_and_emit_cortex_chat_error(
    store: &CortexChatStore,
    event_tx: &mpsc::Sender<CortexChatEvent>,
    thread_id: &str,
    channel_ref: Option<&str>,
    message: String,
) {
    let _ = store
        .save_message(thread_id, "assistant", &message, channel_ref)
        .await;
    let _ = event_tx.send(CortexChatEvent::Error { message }).await;
}
```
Don’t silently drop DB persistence failures in error reporting path.
persist_and_emit_cortex_chat_error ignores save_message errors, which can hide storage failures during the exact path meant to preserve failure context.
🛠️ Suggested fix
```diff
 async fn persist_and_emit_cortex_chat_error(
@@
 ) {
-    let _ = store
-        .save_message(thread_id, "assistant", &message, channel_ref)
-        .await;
-    let _ = event_tx.send(CortexChatEvent::Error { message }).await;
+    if let Err(error) = store
+        .save_message(thread_id, "assistant", &message, channel_ref)
+        .await
+    {
+        tracing::warn!(%error, thread_id, "failed to persist cortex chat error message");
+    }
+    event_tx.send(CortexChatEvent::Error { message }).await.ok();
 }
```

As per coding guidelines "Never silently discard errors. No let _ = on Results. Handle them, log them, or propagate them. The only exception is .ok() on channel sends where the receiver may be dropped".
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
async fn persist_and_emit_cortex_chat_error(
    store: &CortexChatStore,
    event_tx: &mpsc::Sender<CortexChatEvent>,
    thread_id: &str,
    channel_ref: Option<&str>,
    message: String,
) {
    if let Err(error) = store
        .save_message(thread_id, "assistant", &message, channel_ref)
        .await
    {
        tracing::warn!(%error, thread_id, "failed to persist cortex chat error message");
    }
    event_tx.send(CortexChatEvent::Error { message }).await.ok();
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/agent/cortex_chat.rs` around lines 93 - 104,
persist_and_emit_cortex_chat_error currently discards the Result from
store.save_message; change it to explicitly handle errors by awaiting the save
and matching its Result (e.g., if let Err(e) = store.save_message(thread_id,
"assistant", &message, channel_ref).await {
tracing::error!("persist_and_emit_cortex_chat_error: failed to save message for
thread_id={} channel_ref={:?}: {:?}", thread_id, channel_ref, e); } ), and then
still send the CortexChatEvent::Error (use .await.ok() on event_tx.send if the
receiver may be dropped). Reference the function
persist_and_emit_cortex_chat_error, the CortexChatStore::save_message call, and
the event_tx.send(CortexChatEvent::Error { message }) send to locate where to
add the match and logging.
```rust
tracing::debug!(
    file = %filename,
    chunk = %format!("{chunk_number}/{total_chunks}"),
    response = %response.chars().take(200).collect::<String>(),
    "chunk processed"
);
```
Avoid logging model response content for ingested chunks.
This logs text derived from ingested files, which can leak sensitive document content into logs. Prefer metadata-only logging (e.g., length/chunk id).
🔧 Suggested fix
```diff
 Ok(response) => {
     tracing::debug!(
         file = %filename,
         chunk = %format!("{chunk_number}/{total_chunks}"),
-        response = %response.chars().take(200).collect::<String>(),
+        response_chars = response.chars().count(),
         "chunk processed"
     );
     Ok(())
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
tracing::debug!(
    file = %filename,
    chunk = %format!("{chunk_number}/{total_chunks}"),
    response_chars = response.chars().count(),
    "chunk processed"
);
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/agent/ingestion.rs` around lines 521 - 525, The tracing::debug call in
ingestion.rs currently logs the model's response content (response) which may
expose sensitive ingested text; update the log in the chunk processing path (the
tracing::debug block that references filename, chunk_number, total_chunks, and
response) to avoid including response text — log only metadata such as filename,
chunk id/total (chunk_number and total_chunks) and response length (e.g.,
response.len()) or other non-textual attributes instead of the response string
itself.
Force-pushed 5d6b65c to 59ffec7 (Compare)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/agent/process_control.rs (1)
101-103: ⚠️ Potential issue | 🟠 Major
Stale unregister can remove a fresh channel registration.

Lines 101–103 remove by `channel_id` only without validating the registration. If an older channel instance unregisters after a newer instance re-registers the same `ChannelId` (e.g., user deletes/recreates a conversation), this silently evicts the live entry and breaks cancellation routing for the new channel.

The cleanup task in `src/main.rs:1668–1680` calls `unregister_channel` without a registration guard. Pair the `register_channel` return of `registration_id` with `unregister_channel` validation of it:

🛠️ Proposed hardening

```diff
-    pub async fn register_channel(&self, channel_id: ChannelId, handle: WeakChannelControlHandle) {
+    pub async fn register_channel(
+        &self,
+        channel_id: ChannelId,
+        handle: WeakChannelControlHandle,
+    ) -> u64 {
         let registration_id = self
             .next_channel_registration
             .fetch_add(1, Ordering::AcqRel);
         self.channels.write().await.insert(
             channel_id,
             ChannelControlEntry {
                 handle,
                 registration_id,
             },
         );
+        registration_id
     }

-    pub async fn unregister_channel(&self, channel_id: &ChannelId) -> bool {
-        self.channels.write().await.remove(channel_id).is_some()
+    pub async fn unregister_channel(
+        &self,
+        channel_id: &ChannelId,
+        expected_registration_id: u64,
+    ) -> bool {
+        let mut channels = self.channels.write().await;
+        if channels
+            .get(channel_id)
+            .is_some_and(|entry| entry.registration_id == expected_registration_id)
+        {
+            channels.remove(channel_id);
+            true
+        } else {
+            false
+        }
     }
```

Also update `src/main.rs:1613` to capture `registration_id` from `register_channel` and pass it to the cleanup task at lines 1673–1674.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agent/process_control.rs` around lines 101 - 103, The unregister routine currently removes entries by ChannelId only and can evict a newer registration; change register_channel to return a unique registration_id (e.g., UUID or incrementing token) when inserting a channel, then modify unregister_channel to accept both &ChannelId and the registration_id and only remove the map entry if the stored registration_id matches the provided one (preserving the live entry if it differs). Update the caller in the cleanup task to capture the registration_id returned from register_channel and pass that registration_id into unregister_channel when scheduling cleanup; ensure types and tests are updated to reflect the new unregister_channel signature and behavior.
♻️ Duplicate comments (5)
tests/maintenance.rs (1)
8-37: ⚠️ Potential issue | 🟠 Major
Keep the TempDir alive for the full fixture lifetime.

Line 27 creates a temp directory, but it is dropped when the fixture returns at Line 36. That can invalidate the LanceDB-backed `EmbeddingTable` used later in the tests.

🛠️ Proposed fix

```diff
 async fn make_memory_maintenance_fixture() -> (
     std::sync::Arc<MemoryStore>,
     spacebot::memory::EmbeddingTable,
+    tempfile::TempDir,
 ) {
@@
-    let dir = tempdir().expect("failed to create temp dir");
-    let lance_conn = lancedb::connect(dir.path().to_str().expect("temp path"))
+    let temp_directory = tempdir().expect("failed to create temp dir");
+    let lance_conn = lancedb::connect(temp_directory.path().to_str().expect("temp path"))
         .execute()
         .await
         .expect("failed to connect to lancedb");
@@
-    (store, embedding_table)
+    (store, embedding_table, temp_directory)
 }
@@
-    let (store, embedding_table) = make_memory_maintenance_fixture().await;
+    let (store, embedding_table, _temp_directory) = make_memory_maintenance_fixture().await;
@@
-    let (store, embedding_table) = make_memory_maintenance_fixture().await;
+    let (store, embedding_table, _temp_directory) = make_memory_maintenance_fixture().await;
```

```bash
#!/bin/bash
set -euo pipefail

echo "== Fixture signature, tempdir allocation, and return tuple =="
rg -n 'async fn make_memory_maintenance_fixture|tempdir\(|\(store,\s*embedding_table\)' tests/maintenance.rs -C2
echo
echo "== Fixture call sites that destructure the tuple =="
rg -n 'make_memory_maintenance_fixture\(\)\.await|let \(' tests/maintenance.rs -C2
echo
echo "== Lance embedding table storage lifecycle context =="
rg -n 'open_or_create|drop_table|connect' src/memory/lance.rs -C2
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/maintenance.rs` around lines 8 - 37, The tempdir created in make_memory_maintenance_fixture (dir via tempdir()) is dropped when the function returns, invalidating the LanceDB-backed EmbeddingTable; preserve the TempDir for the lifetime of the fixture by returning it (or an Arc-wrapped TempDir) alongside the MemoryStore and EmbeddingTable so the directory stays alive, e.g., change the function signature and return tuple to include the TempDir (or Arc<TempDir>) and ensure lance_conn/embedding_table keep owning references to that dir.
docs/design-docs/cortex-implementation.md (1)
19-20: ⚠️ Potential issue | 🟡 Minor
Resolve contradictory maintenance status wording.

Line 19 says “never called” while Line 20 says the same functions are wired into the cortex loop. Please make this status unambiguous.

📝 Suggested wording fix

```diff
-**Implemented but never called:**
+**Implemented and wired:**
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design-docs/cortex-implementation.md` around lines 19 - 20, The heading is contradictory: remove "never called" and make the status unambiguous by changing the heading to reflect that the functions in memory/maintenance.rs (apply_decay(), prune_memories(), merge_similar_memories()) are implemented and wired into the cortex loop — e.g., replace "Implemented but never called:" with "Implemented and wired into the cortex loop:" so the description matches the actual wiring.
src/api/config.rs (1)
646-660: ⚠️ Potential issue | 🟠 Major
Harden maintenance writes with checked conversion + range validation.

Line 647 still uses unchecked `u64 -> i64` narrowing, and Lines 649–660 persist float/int maintenance knobs without rejecting invalid ranges. This can store invalid values and destabilize runtime behavior.

💡 Suggested fix

```diff
+fn validate_unit_interval(value: f32) -> Result<(), StatusCode> {
+    if (0.0..=1.0).contains(&value) {
+        Ok(())
+    } else {
+        Err(StatusCode::BAD_REQUEST)
+    }
+}
+
 fn update_cortex_table(
     doc: &mut toml_edit::DocumentMut,
     agent_idx: usize,
     cortex: &CortexUpdate,
 ) -> Result<(), StatusCode> {
@@
-    if let Some(v) = cortex.maintenance_interval_secs {
-        table["maintenance_interval_secs"] = toml_edit::value(v as i64);
+    if let Some(maintenance_interval_secs) = cortex.maintenance_interval_secs {
+        table["maintenance_interval_secs"] = toml_edit::value(
+            i64::try_from(maintenance_interval_secs).map_err(|_| StatusCode::BAD_REQUEST)?,
+        );
     }
-    if let Some(v) = cortex.maintenance_decay_rate {
-        table["maintenance_decay_rate"] = toml_edit::value(v as f64);
+    if let Some(maintenance_decay_rate) = cortex.maintenance_decay_rate {
+        validate_unit_interval(maintenance_decay_rate)?;
+        table["maintenance_decay_rate"] = toml_edit::value(maintenance_decay_rate as f64);
     }
-    if let Some(v) = cortex.maintenance_prune_threshold {
-        table["maintenance_prune_threshold"] = toml_edit::value(v as f64);
+    if let Some(maintenance_prune_threshold) = cortex.maintenance_prune_threshold {
+        validate_unit_interval(maintenance_prune_threshold)?;
+        table["maintenance_prune_threshold"] = toml_edit::value(maintenance_prune_threshold as f64);
     }
-    if let Some(v) = cortex.maintenance_min_age_days {
-        table["maintenance_min_age_days"] = toml_edit::value(v);
+    if let Some(maintenance_min_age_days) = cortex.maintenance_min_age_days {
+        if maintenance_min_age_days < 0 {
+            return Err(StatusCode::BAD_REQUEST);
+        }
+        table["maintenance_min_age_days"] = toml_edit::value(maintenance_min_age_days);
     }
-    if let Some(v) = cortex.maintenance_merge_similarity_threshold {
-        table["maintenance_merge_similarity_threshold"] = toml_edit::value(v as f64);
+    if let Some(maintenance_merge_similarity_threshold) = cortex.maintenance_merge_similarity_threshold {
+        validate_unit_interval(maintenance_merge_similarity_threshold)?;
+        table["maintenance_merge_similarity_threshold"] =
+            toml_edit::value(maintenance_merge_similarity_threshold as f64);
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/api/config.rs` around lines 646 - 660, The maintenance fields are written without checked narrowing and range validation; update the cortex.maintenance_... handling to use safe conversions and explicit range checks: for integer fields (maintenance_interval_secs, maintenance_min_age_days) use i64::try_from(...) (or TryInto) and reject values that don't fit i64 (return Err or propagate a config validation error) before assigning to table; for float knobs (maintenance_decay_rate, maintenance_prune_threshold, maintenance_merge_similarity_threshold) ensure the f64 is finite and inside an acceptable range (e.g. 0.0..=1.0 for rates/thresholds) and only then call toml_edit::value(validated_value), otherwise return/propagate an error or skip writing the field; refer to the cortex.maintenance_interval_secs, maintenance_decay_rate, maintenance_prune_threshold, maintenance_min_age_days, and maintenance_merge_similarity_threshold identifiers when making these changes.
src/memory/maintenance.rs (1)
251-270: ⚠️ Potential issue | 🔴 Critical
`active_survivor` is stale after `merge_pair`, risking loss of earlier merged content.

After Line 252 (`merge_pair(...).await?`), `active_survivor` is not refreshed to the persisted merged survivor state. In subsequent candidate merges (Line 235+), merge content can be recomputed from pre-merge text and overwrite prior additions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/memory/maintenance.rs` around lines 251 - 270, The active_survivor is stale after calling merge_pair which can cause later merges to rebuild from pre-merge data; after merge_pair(...) completes, reload the persisted survivor state and set active_survivor to that refreshed record (e.g., fetch the memory by winner.id from memory_store) before continuing merges and before checking/using winner.id or setting merged_memory_ids/source_merged; keep using choose_merge_pair and merge_pair as-is but ensure active_survivor is updated from the authoritative store rather than relying on the pre-merge in-memory winner.
324-377: ⚠️ Potential issue | 🔴 Critical
Make the merge sequence atomic to prevent partial graph rewires.
From Line 324 through Line 377, this is a multi-step mutation across memories, associations, and embeddings with cancellation/error points in between. A mid-sequence failure can leave survivor content, edges, and forgotten/embedding state out of sync.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/memory/maintenance.rs` around lines 324 - 377, The current merge logic (calls to maintenance_cancelable_op with memory_store.update, get_associations, delete_associations_for_memory, create_association, forget and embedding_table.delete, and the created Association::new) must be made atomic: wrap the whole sequence in a single transactional/locked operation or implement an explicit two-phase commit so either all steps succeed or none are applied. Concretely, replace the series of independent maintenance_cancelable_op calls with a transaction API on memory_store/embedding_table (or acquire a per-memory lock for merged.id and updated_survivor.id), perform update(&updated_survivor), fetch and delete associations, re-create remapped associations, create the updates_assoc, and then forget(&merged.id) and embedding_table.delete(&merged.id) only on commit; if the store has no transaction support, implement rollback logic to undo create_association calls and restore deleted associations/embedding on any intermediate error and ensure check_maintenance_cancellation is still honored inside the locked/transactional block.
🧹 Nitpick comments (1)
src/memory/maintenance.rs (1)
483-619: Add a regression test for multi-candidate merge accumulation in one maintenance pass.

Current coverage validates a single duplicate merge, but not repeated merges onto the same survivor in one run. A test with one source + two near-duplicates would catch stale-survivor regressions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/memory/maintenance.rs` around lines 483 - 619, Add a new async test (similar to merges_near_duplicate_memories_and_transfers_associations) that creates one survivor memory and two distinct near-duplicate memories via create_memory_with_embedding, links associations from each duplicate to other memories, then runs run_maintenance once with a high merge_similarity_threshold so both duplicates should merge into the same survivor in a single pass; after run_maintenance assert merged == 2, the survivor content includes updates from both duplicates, both duplicates are marked forgotten and have no embeddings in EmbeddingTable::find_similar, and associations from both duplicates have been transferred/rewired to the survivor (use MemoryStore::get_associations, MemoryStore::load, EmbeddingTable::find_similar to verify).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/agent/cortex.rs`:
- Around line 1814-1851: Add a circuit-breaker: introduce a counter field (e.g.,
maintenance_consecutive_failures: usize) and a disabled flag/timestamp (e.g.,
maintenance_disabled_at: Option<Instant>) on the same struct that owns
maintenance_task, then in the maintenance task completion branch (the block
matching task.await in the maintenance_task handling code) increment
maintenance_consecutive_failures on any failure path (Ok(Err(_)) and Err(_)),
reset it to 0 on Ok(Ok(report)), and when the counter reaches the threshold (3)
set maintenance_disabled_at (or a boolean like maintenance_disabled = true) and
stop scheduling new maintenance tasks; also emit a clear log entry when the
circuit trips and when it is later reset. Apply the same pattern to the other
maintenance/cron/compactor routines referenced (around lines ~1932-1972) using
similarly named counters/flags so recurring retries are auto-disabled after 3
consecutive failures and reset on a successful run.
- Around line 1865-1867: Replace the ignored send result pattern "let _ =
cancel_tx.send(true);" with the canonical ".ok()" drop for channel send
failures: find the block that checks maintenance_task_cancel_tx.as_ref() and
where cancel_tx.send(true) is invoked (the variable names
maintenance_task_cancel_tx and cancel_tx), and change the send call to
cancel_tx.send(true).ok() so failures are intentionally ignored per repo policy.
---
Outside diff comments:
In `@src/agent/process_control.rs`:
- Around line 101-103: The unregister routine currently removes entries by
ChannelId only and can evict a newer registration; change register_channel to
return a unique registration_id (e.g., UUID or incrementing token) when
inserting a channel, then modify unregister_channel to accept both &ChannelId
and the registration_id and only remove the map entry if the stored
registration_id matches the provided one (preserving the live entry if it
differs). Update the caller in the cleanup task to capture the registration_id
returned from register_channel and pass that registration_id into
unregister_channel when scheduling cleanup; ensure types and tests are updated
to reflect the new unregister_channel signature and behavior.
---
Duplicate comments:
In `@docs/design-docs/cortex-implementation.md`:
- Around line 19-20: The heading is contradictory: remove "never called" and
make the status unambiguous by changing the heading to reflect that the
functions in memory/maintenance.rs (apply_decay(), prune_memories(),
merge_similar_memories()) are implemented and wired into the cortex loop — e.g.,
replace "Implemented but never called:" with "Implemented and wired into the
cortex loop:" so the description matches the actual wiring.
In `@src/api/config.rs`:
- Around line 646-660: The maintenance fields are written without checked
narrowing and range validation; update the cortex.maintenance_... handling to
use safe conversions and explicit range checks: for integer fields
(maintenance_interval_secs, maintenance_min_age_days) use i64::try_from(...) (or
TryInto) and reject values that don't fit i64 (return Err or propagate a config
validation error) before assigning to table; for float knobs
(maintenance_decay_rate, maintenance_prune_threshold,
maintenance_merge_similarity_threshold) ensure the f64 is finite and inside an
acceptable range (e.g. 0.0..=1.0 for rates/thresholds) and only then call
toml_edit::value(validated_value), otherwise return/propagate an error or skip
writing the field; refer to the cortex.maintenance_interval_secs,
maintenance_decay_rate, maintenance_prune_threshold, maintenance_min_age_days,
and maintenance_merge_similarity_threshold identifiers when making these
changes.
In `@src/memory/maintenance.rs`:
- Around line 251-270: The active_survivor is stale after calling merge_pair
which can cause later merges to rebuild from pre-merge data; after
merge_pair(...) completes, reload the persisted survivor state and set
active_survivor to that refreshed record (e.g., fetch the memory by winner.id
from memory_store) before continuing merges and before checking/using winner.id
or setting merged_memory_ids/source_merged; keep using choose_merge_pair and
merge_pair as-is but ensure active_survivor is updated from the authoritative
store rather than relying on the pre-merge in-memory winner.
- Around line 324-377: The current merge logic (calls to
maintenance_cancelable_op with memory_store.update, get_associations,
delete_associations_for_memory, create_association, forget and
embedding_table.delete, and the created Association::new) must be made atomic:
wrap the whole sequence in a single transactional/locked operation or implement
an explicit two-phase commit so either all steps succeed or none are applied.
Concretely, replace the series of independent maintenance_cancelable_op calls
with a transaction API on memory_store/embedding_table (or acquire a per-memory
lock for merged.id and updated_survivor.id), perform update(&updated_survivor),
fetch and delete associations, re-create remapped associations, create the
updates_assoc, and then forget(&merged.id) and
embedding_table.delete(&merged.id) only on commit; if the store has no
transaction support, implement rollback logic to undo create_association calls
and restore deleted associations/embedding on any intermediate error and ensure
check_maintenance_cancellation is still honored inside the locked/transactional
block.
In `@tests/maintenance.rs`:
- Around line 8-37: The tempdir created in make_memory_maintenance_fixture (dir
via tempdir()) is dropped when the function returns, invalidating the
LanceDB-backed EmbeddingTable; preserve the TempDir for the lifetime of the
fixture by returning it (or an Arc-wrapped TempDir) alongside the MemoryStore
and EmbeddingTable so the directory stays alive, e.g., change the function
signature and return tuple to include the TempDir (or Arc<TempDir>) and ensure
lance_conn/embedding_table keep owning references to that dir.
---
Nitpick comments:
In `@src/memory/maintenance.rs`:
- Around line 483-619: Add a new async test (similar to
merges_near_duplicate_memories_and_transfers_associations) that creates one
survivor memory and two distinct near-duplicate memories via
create_memory_with_embedding, links associations from each duplicate to other
memories, then runs run_maintenance once with a high merge_similarity_threshold
so both duplicates should merge into the same survivor in a single pass; after
run_maintenance assert merged == 2, the survivor content includes updates from
both duplicates, both duplicates are marked forgotten and have no embeddings in
EmbeddingTable::find_similar, and associations from both duplicates have been
transferred/rewired to the survivor (use MemoryStore::get_associations,
MemoryStore::load, EmbeddingTable::find_similar to verify).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 71e55097-3fa0-4b23-b396-dcb19dac8ee0
📒 Files selected for processing (12)
docs/design-docs/cortex-implementation.md
src/agent/cortex.rs
src/agent/cortex_chat.rs
src/agent/process_control.rs
src/api/config.rs
src/config.rs
src/config/load.rs
src/config/toml_schema.rs
src/config/types.rs
src/memory/maintenance.rs
src/memory/store.rs
tests/maintenance.rs
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/config/load.rs (1)
130-188: ⚠️ Potential issue | 🟠 Major

Missing lower-bound validation for `maintenance_interval_secs`.

Line 187 validates maintenance bounds, but this path still needs an explicit `maintenance_interval_secs >= 1` guard at load/update resolution time. Without it, a zero interval can pass resolution and lead to pathological maintenance scheduling behavior.

🔧 Proposed fix

```diff
 fn resolve(overrides: TomlCortexConfig, defaults: CortexConfig) -> Result<CortexConfig> {
     let config = CortexConfig {
         tick_interval_secs: overrides
             .tick_interval_secs
             .unwrap_or(defaults.tick_interval_secs),
@@
         association_max_per_pass: overrides
             .association_max_per_pass
             .unwrap_or(defaults.association_max_per_pass),
     };
+    if config.maintenance_interval_secs < 1 {
+        return Err(
+            ConfigError::Invalid("maintenance_interval_secs must be >= 1".into()).into()
+        );
+    }
     config.validate_maintenance_bounds()?;
     Ok(config)
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/config/load.rs` around lines 130 - 188, In resolve(overrides: TomlCortexConfig, defaults: CortexConfig) add an explicit guard that ensures the resolved maintenance_interval_secs is at least 1 before constructing/returning the CortexConfig: compute the resolved value for maintenance_interval_secs from overrides.maintenance_interval_secs.unwrap_or(defaults.maintenance_interval_secs) and if it is < 1 return an Err with a clear validation error (consistent with the function's Result error type) instead of allowing zero; keep the subsequent call to config.validate_maintenance_bounds() for the other checks.
🧹 Nitpick comments (1)
src/config.rs (1)
1028-1065: Extend invalid-range test to cover all maintenance validation branches.

Nice negative coverage already. To fully lock the validator contract, also assert rejection for zero `maintenance_interval_secs` and out-of-range `maintenance_merge_similarity_threshold`.

Proposed test extension

```diff
     let invalid_agent_override = r#"
 [[agents]]
 id = "main"

 [agents.cortex]
 maintenance_decay_rate = -0.1
 "#;
     let parsed: TomlConfig =
         toml::from_str(invalid_agent_override).expect("failed to parse invalid agent TOML");
     assert!(
         Config::from_toml(parsed, PathBuf::from(".")).is_err(),
         "expected invalid agent maintenance_decay_rate to be rejected"
     );
+
+    let invalid_interval = r#"
+[defaults.cortex]
+maintenance_interval_secs = 0
+"#;
+    let parsed: TomlConfig =
+        toml::from_str(invalid_interval).expect("failed to parse invalid interval TOML");
+    assert!(
+        Config::from_toml(parsed, PathBuf::from(".")).is_err(),
+        "expected zero maintenance_interval_secs to be rejected"
+    );
+
+    let invalid_merge_similarity = r#"
+[defaults.cortex]
+maintenance_merge_similarity_threshold = 1.1
+"#;
+    let parsed: TomlConfig = toml::from_str(invalid_merge_similarity)
+        .expect("failed to parse invalid merge similarity TOML");
+    assert!(
+        Config::from_toml(parsed, PathBuf::from(".")).is_err(),
+        "expected invalid maintenance_merge_similarity_threshold to be rejected"
+    );
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/config.rs` around lines 1028 - 1065, Add additional negative test cases inside test_cortex_maintenance_config_rejects_invalid_ranges to cover the remaining validation branches: create a TOML snippet with maintenance_interval_secs = 0 under [defaults.cortex] and assert Config::from_toml(parsed, PathBuf::from(".")).is_err(); then add snippets for maintenance_merge_similarity_threshold values out of range (e.g., -0.1 and 1.1) under [defaults.cortex] and assert they are rejected as well. Locate the existing test function test_cortex_maintenance_config_rejects_invalid_ranges, reuse TomlConfig and Config::from_toml for parsing/validation, and add assertions mirroring the existing style for the new invalid inputs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/memory/maintenance.rs`:
- Around line 336-346: The survivor's content is updated (updated_survivor,
merged_memory_content) but we never re-compute its embedding; after awaiting
memory_store.merge_memories_atomic(...) you must run the same
re-embedding/upsert code used for normal memory updates to compute and store the
new embedding for updated_survivor (i.e., call the project’s embed-and-upsert
routine for the memory or embedding_table.upsert-equivalent) before calling
embedding_table.delete(&merged.id) and returning; place this re-embedding step
between maintenance_cancelable_op(...merge_memories_atomic...) and the
embedding_table.delete(...) call so the survivor embedding is fresh.
---
Outside diff comments:
In `@src/config/load.rs`:
- Around line 130-188: In resolve(overrides: TomlCortexConfig, defaults:
CortexConfig) add an explicit guard that ensures the resolved
maintenance_interval_secs is at least 1 before constructing/returning the
CortexConfig: compute the resolved value for maintenance_interval_secs from
overrides.maintenance_interval_secs.unwrap_or(defaults.maintenance_interval_secs)
and if it is < 1 return an Err with a clear validation error (consistent with
the function's Result error type) instead of allowing zero; keep the subsequent
call to config.validate_maintenance_bounds() for the other checks.
---
Nitpick comments:
In `@src/config.rs`:
- Around line 1028-1065: Add additional negative test cases inside
test_cortex_maintenance_config_rejects_invalid_ranges to cover the remaining
validation branches: create a TOML snippet with maintenance_interval_secs = 0
under [defaults.cortex] and assert Config::from_toml(parsed,
PathBuf::from(".")).is_err(); then add snippets for
maintenance_merge_similarity_threshold values out of range (e.g., -0.1 and 1.1)
under [defaults.cortex] and assert they are rejected as well. Locate the
existing test function test_cortex_maintenance_config_rejects_invalid_ranges,
reuse TomlConfig and Config::from_toml for parsing/validation, and add
assertions mirroring the existing style for the new invalid inputs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 25de05ab-141b-48f2-99e3-61e87a80653f
📒 Files selected for processing (11)
docs/design-docs/cortex-implementation.md
src/agent/cortex.rs
src/agent/process_control.rs
src/api/config.rs
src/config.rs
src/config/load.rs
src/config/types.rs
src/main.rs
src/memory/maintenance.rs
src/memory/store.rs
tests/maintenance.rs
…anics Fixes spacedriveapp#273 — cortex_chat.rs panics on CJK memory content because `&result[..200]` can land inside a multi-byte character. Same issue existed in discord.rs reply content truncation. Replace bare byte-index slices with `floor_char_boundary(200)` which rounds down to the nearest char boundary.
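The byte-index panic fixed here is easy to reproduce. As a minimal stable-Rust sketch (std's `str::floor_char_boundary` is nightly-only at the time of writing, so this hand-rolls the equivalent rounding with `is_char_boundary`; it is an illustration, not the repo's actual helper):

```rust
/// Round `idx` down to the nearest UTF-8 char boundary in `s`
/// (a stable-Rust stand-in for the unstable `str::floor_char_boundary`).
fn floor_char_boundary(s: &str, mut idx: usize) -> usize {
    if idx >= s.len() {
        return s.len();
    }
    while !s.is_char_boundary(idx) {
        idx -= 1;
    }
    idx
}

fn main() {
    // Four CJK chars, 3 bytes each: 12 bytes total.
    let content = "记忆内容";
    assert_eq!(content.len(), 12);

    // A bare `&content[..4]` would panic: byte 4 sits inside the second char.
    assert!(!content.is_char_boundary(4));

    // Rounding down lands on a boundary, so slicing is safe.
    let cut = floor_char_boundary(content, 4);
    assert_eq!(cut, 3);
    assert_eq!(&content[..cut], "记");

    // Indices past the end clamp to the full string.
    assert_eq!(floor_char_boundary(content, 200), content.len());
}
```

The same pattern applies to any fixed-length preview truncation (the 200-byte slices in cortex_chat.rs and discord.rs): round the cut point down first, then slice.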
641765b to 6d649e7
Overview
This PR lands the Phase 3 execution pass of the Cortex health supervision work (stacked on the Phase 2 baseline in PR-305), with a focus on bounded maintenance execution, deterministic cancellation, and detached-worker correctness.
It carries through required Phase 2 plan items from:
- docs/design-docs/cortex-implementation.md
- docs/design-docs/cortex-phase2-health-supervision-plan.md

Scope
- `run_maintenance` with bounded execution and explicit two-step cancellation (CancelRequest -> Abort), preventing hung maintenance passes.
- Detached-worker lifecycle hardening (`select!` biased to completion over cancel receive).
- Touched: src/agent/cortex.rs, src/memory/maintenance.rs, tests/maintenance.rs

PR relationship
- cortex: implement phase 2 health supervision

Verification run

- cargo test agent::cortex::tests::run_maintenance_with_cancel_stops_when_cancel_requested --lib -- --nocapture
- cargo test agent::cortex::tests::detached_worker_completion_takes_priority_when_cancel_signal_and_worker_finish_simultaneously --lib -- --nocapture
- cargo test agent::cortex::tests::maintenance_timeout_action_progresses_from_none_to_cancel_to_abort --lib -- --nocapture
- cargo test agent::cortex::tests::maintenance_task_timeout_bounds --lib -- --nocapture
- cargo test --test maintenance -- --nocapture
- cargo test --lib
- cargo test --tests --no-run
- just preflight
- just gate-pr
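The escalation these tests exercise (request cooperative cancellation first, stop waiting and treat the worker as abandoned only after a grace period, with worker completion winning a simultaneous race) can be sketched with plain std threads. The actual implementation uses tokio's `select!` with `biased`; every name below is illustrative:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical stand-in for the CancelRequest -> Abort escalation.
fn supervise<F: FnOnce(Arc<AtomicBool>) + Send + 'static>(
    work: F,
    grace: Duration,
) -> &'static str {
    let cancel = Arc::new(AtomicBool::new(false));
    let flag = cancel.clone();
    let handle = thread::spawn(move || work(flag));

    // Step 1: CancelRequest — flip the cooperative cancellation flag.
    cancel.store(true, Ordering::SeqCst);

    // Step 2: wait up to `grace`; completion is checked before the deadline,
    // so a worker that finishes "simultaneously" still resolves as cancelled.
    let deadline = Instant::now() + grace;
    while Instant::now() < deadline {
        if handle.is_finished() {
            let _ = handle.join();
            return "cancelled"; // worker observed the flag and exited
        }
        thread::sleep(Duration::from_millis(5));
    }
    "aborted" // worker never yielded; supervisor stops waiting
}

fn main() {
    // Cooperative worker: polls the flag and exits promptly.
    let outcome = supervise(
        |cancel| {
            while !cancel.load(Ordering::SeqCst) {
                thread::sleep(Duration::from_millis(1));
            }
        },
        Duration::from_millis(500),
    );
    assert_eq!(outcome, "cancelled");

    // Unresponsive worker: ignores the flag, so the supervisor escalates.
    let outcome = supervise(
        |_cancel| thread::sleep(Duration::from_secs(2)),
        Duration::from_millis(100),
    );
    assert_eq!(outcome, "aborted");
}
```

The polling loop stands in for the `select!` bias: completion is always observed before the deadline fires, which is the deterministic ordering the detached-worker test asserts.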
cargo test agent::cortex::tests::run_maintenance_with_cancel_stops_when_cancel_requested --lib -- --nocapturecargo test agent::cortex::tests::detached_worker_completion_takes_priority_when_cancel_signal_and_worker_finish_simultaneously --lib -- --nocapturecargo test agent::cortex::tests::maintenance_timeout_action_progresses_from_none_to_cancel_to_abort --lib -- --nocapturecargo test agent::cortex::tests::maintenance_task_timeout_bounds --lib -- --nocapturecargo test --test maintenance -- --nocapturecargo test --libcargo test --tests --no-runjust preflightjust gate-prResidual risk