
Enable metrics in Docker build, add comprehensive Prometheus instrumentation#274

Open
l33t0 wants to merge 12 commits into spacedriveapp:main from l33t0:feat/metrics-build-enable

Conversation

l33t0 (Contributor) commented Mar 1, 2026

Enables the metrics cargo feature in the Docker build and adds Prometheus instrumentation across every major subsystem. All metrics are compile-time gated behind #[cfg(feature = "metrics")] — zero overhead when building without the feature locally.
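The compile-time gating pattern can be sketched as follows. This is a minimal illustration, not the PR's actual code: the `Metrics::global()` accessor and the `messages_received_total` handle are assumptions based on the description above.

```rust
// Sketch of the `#[cfg(feature = "metrics")]` gating pattern. Returns true
// when a metric was recorded, purely for demonstration.
fn record_message_received(agent_id: &str) -> bool {
    #[cfg(feature = "metrics")]
    {
        crate::telemetry::Metrics::global()
            .messages_received_total
            .with_label_values(&[agent_id])
            .inc();
        return true;
    }
    // With the feature disabled, the block above is removed at compile time,
    // so the instrumentation costs nothing in local builds.
    #[cfg(not(feature = "metrics"))]
    {
        let _ = agent_id;
        false
    }
}
```

Because the gated block is stripped before name resolution, builds without the feature never even see the telemetry paths.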

Issue

#110

What changed

Docker — cargo build --release --features metrics is now hardcoded in the Dockerfile. Port 9090 is exposed for the metrics endpoint.

Registry (src/telemetry/registry.rs) — 21 new metrics added, 10 existing metrics upgraded with additional labels (process_type, worker_type, agent_id) for per-use-case breakdowns across worker/channel/branch.

New metrics by subsystem:

| Subsystem | Metrics | What they cover |
|---|---|---|
| MCP | 7 | connections, duration, state, tools registered, tool calls, tool call duration, reconnects |
| Channel | 4 | messages received/sent, handling duration, errors by type |
| Memory | 3 | operation duration (save/update/delete), search result counts, embedding duration |
| API | 2 | request count, request duration (with path normalization to prevent cardinality explosion) |
| Lifecycle | 2 | branches spawned, context overflows |
| Cost | 1 | worker_cost_dollars — cost per worker type for optimization visibility |
| Cron | 1 | execution count by job and outcome |
| Ingestion | 1 | files processed by outcome |

Label upgrades on existing metrics:

  • tool_calls_total / tool_call_duration_seconds — added process_type
  • memory_reads_total / memory_writes_total — converted to IntCounterVec with agent_id
  • All LLM metrics — added worker_type (values: builtin, opencode, ingestion, empty)
  • process_errors_total — added worker_type

Docs — METRICS.md and docs/metrics.md updated with full metric inventory, label descriptions, cardinality estimates (~800–9,200 series), and PromQL examples.

Files touched

  • Dockerfile — enable metrics feature, expose 9090
  • src/telemetry/registry.rs — metric definitions
  • src/llm/model.rs — LLM label upgrades, cost tracking, context overflow
  • src/hooks/spacebot.rs — tool metric label upgrades
  • src/agent/channel.rs — channel message/error metrics
  • src/agent/channel_dispatch.rs — branch spawn tracking, worker_type plumbing
  • src/agent/worker.rs, src/agent/ingestion.rs — worker_type on model creation
  • src/mcp.rs — MCP connection/tool metrics
  • src/memory/store.rs, src/memory/search.rs, src/memory/embedding.rs — memory operation metrics
  • src/api/server.rs — API middleware with path normalization
  • src/cron/scheduler.rs — cron execution metrics
  • src/tools/memory_recall.rs, src/tools/memory_save.rs — agent_id label upgrades
  • METRICS.md, docs/metrics.md — documentation

Testing

  • cargo fmt --all -- --check clean
  • cargo clippy -- -D warnings clean (both with and without --features metrics)
  • cargo test — 300 passed, 0 failed
  • just gate-pr passes all checks


coderabbitai bot commented Mar 1, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds comprehensive Prometheus metrics (behind a metrics feature), threads worker_type/process_type labels through worker/LLM/tool paths, expands the telemetry registry and docs, and updates the Dockerfile to build with metrics and expose port 9090.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Build & Docs: Dockerfile, METRICS.md, docs/metrics.md | Dockerfile now hardcodes --features metrics in build stages and exposes port 9090; documentation expanded for new/renamed metrics, labels, buckets, PromQL examples, and Docker guidance. |
| Telemetry Registry: src/telemetry/registry.rs | Major registry overhaul: many new metrics (MCP, channel/messaging, memory ops, API, cost, cron, ingestion), converted scalars to Vecs, added worker_type/process_type labels, and updated Metrics::new to register all metrics. |
| LLM Model: src/llm/model.rs | Added worker_type: Option<String> and with_worker_type(...) builder; propagates worker_type into LLM metrics, token/cost metrics, and error/context-overflow telemetry. |
| Agent & Dispatch: src/agent/channel.rs, src/agent/channel_dispatch.rs, src/agent/worker.rs | Instrumented message handling (received/sent/durations/errors), incremented branch/worker counters, threaded worker_type into worker spawn calls, and updated the spawn_worker_task signature to accept worker_type. |
| MCP: src/mcp.rs | Added per-server and per-tool MCP metrics and durations, connection attempts/reconnects, and a helper to set per-server state; all instrumentation behind the metrics feature. |
| API Server: src/api/server.rs | Added conditional metrics middleware recording http_requests_total and http_request_duration_seconds, plus a path-normalization helper to reduce label cardinality. |
| Memory & Tools: src/memory/store.rs, src/memory/search.rs, src/memory/embedding.rs, src/tools/memory_recall.rs, src/tools/memory_save.rs | Exposed MemoryStore::agent_id(), added per-operation timers/counters, changed memory reads/writes to per-agent labeled metrics, and instrumented embedding/search durations. |
| Background & Hooks: src/cron/scheduler.rs, src/agent/ingestion.rs, src/hooks/spacebot.rs | Instrumented cron executions and ingestion file processing, set with_worker_type("ingestion") in ingestion flows, and added process_type labels to tool-call metrics. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • jamiepine
🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'Enable metrics in Docker build, add comprehensive Prometheus instrumentation' accurately summarizes the main change: enabling metrics in Docker and adding comprehensive Prometheus instrumentation across subsystems. |
| Description check | ✅ Passed | The description is thorough and directly related to the changeset. It clearly explains what changed (Docker build, registry metrics, new subsystem metrics, label upgrades, files touched, and testing). |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, above the required 80.00% threshold. |



coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
src/api/server.rs (1)

339-358: Consider adding "links" to the exclusion list.

The route /agents/{id}/links (line 213) may be incorrectly normalized. When processing /api/agents/some-uuid/links, the segment links follows an agent ID, and since "links" isn't in the exclusion list at lines 342-354, it could be normalized to {id} instead of being preserved.

Proposed fix
                     "agents"
                         if !matches!(
                             *part,
                             "mcp"
                                 | "warmup"
                                 | "overview"
                                 | "workers"
                                 | "memories"
                                 | "profile"
                                 | "identity"
                                 | "config"
                                 | "cron"
                                 | "tasks"
                                 | "ingest"
                                 | "skills"
                                 | "tools"
+                                | "links"
                         ) =>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/api/server.rs` around lines 339 - 358, The route segment matcher that
normalizes agent routes is missing "links" in its exclusion list, causing
"/agents/{id}/links" to be mis-normalized; update the match arm that checks
*part (the list with "mcp" | "warmup" ... | "tools") to also include "links" so
that the code path that currently pushes "{id}" (where normalized.push("{id}")
is called) preserves "links" instead of treating it as an ID.
src/mcp.rs (1)

342-382: Tool call metrics only recorded on success.

The metrics are only recorded when result.is_ok(). Failed tool calls won't be reflected in mcp_tool_calls_total or mcp_tool_call_duration_seconds. Consider whether tracking failures with an additional label would be valuable for debugging MCP server issues.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcp.rs` around lines 342 - 382, Metrics are only emitted when
result.is_ok(); change the logic so you always record m.mcp_tool_calls_total and
m.mcp_tool_call_duration_seconds (or a separate failure metric) after the call
completes, using the already-captured call_start and the result variable to
compute elapsed time and add a status label (e.g., "ok"/"err") or a separate
metric for failures. Locate the block around call_start, the call to
client.call_tool(...).await (result), and the metrics usage
(crate::telemetry::Metrics::global(), m.mcp_tool_calls_total,
m.mcp_tool_call_duration_seconds) and move/extend the metrics emission to run
unconditionally after result is ready, using result.is_ok() only to set the
label value.
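The suggested fix can be sketched like this: time the call, then emit count and duration unconditionally with the outcome encoded as a label. The metric handles in the comment are stand-ins for the real registry fields, so this sketch keeps the timing logic stdlib-only.

```rust
use std::time::Instant;

// Outcome label for the status dimension suggested in the review.
fn call_outcome_label<T, E>(result: &Result<T, E>) -> &'static str {
    if result.is_ok() { "ok" } else { "err" }
}

// Wraps a fallible call and returns (result, elapsed seconds, status label);
// in the real code the last two would feed mcp_tool_calls_total and
// mcp_tool_call_duration_seconds regardless of success or failure.
fn timed_call<T, E>(call: impl FnOnce() -> Result<T, E>) -> (Result<T, E>, f64, &'static str) {
    let call_start = Instant::now();
    let result = call();
    let elapsed = call_start.elapsed().as_secs_f64();
    let status = call_outcome_label(&result);
    (result, elapsed, status)
}
```

Recording both outcomes keeps failure latency visible, which is often where MCP server problems first show up.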
src/agent/channel.rs (1)

1329-1356: Consider extracting fallback-send + metrics/error accounting into a helper.

The same send/match/increment block is repeated three times, which makes future label changes easy to miss in one branch.

♻️ Suggested refactor sketch
+async fn send_text_with_metrics(
+    &self,
+    final_text: String,
+    metrics_agent_id: &str,
+    metrics_channel_type: &str,
+) {
+    match self.response_tx.send(OutboundResponse::Text(final_text)).await {
+        Ok(()) => {
+            #[cfg(feature = "metrics")]
+            crate::telemetry::Metrics::global()
+                .messages_sent_total
+                .with_label_values(&[metrics_agent_id, metrics_channel_type])
+                .inc();
+        }
+        Err(error) => {
+            #[cfg(feature = "metrics")]
+            crate::telemetry::Metrics::global()
+                .channel_errors_total
+                .with_label_values(&[metrics_agent_id, metrics_channel_type, "send_failed"])
+                .inc();
+            tracing::error!(%error, channel_id = %self.id, "failed to send fallback reply");
+        }
+    }
+}

Also applies to: 1411-1438, 1484-1511

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agent/channel.rs` around lines 1329 - 1356, Extract the repeated
send/match/metrics block into a single async helper (e.g., send_fallback_reply
or send_outbound_with_metrics) that takes the response payload
(OutboundResponse::Text or other variants), references to metrics labels
(metrics_agent_id, metrics_channel_type) and self.id (or &self) so it can
perform the send on self.response_tx, increment messages_sent_total on Ok and
channel_errors_total with "send_failed" on Err, and log the tracing::error with
%error and channel_id = %self.id; then replace the three duplicated match blocks
(the one shown plus the blocks at the other reported locations) with calls to
this helper to ensure label/error accounting is centralized and consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/metrics.md`:
- Around line 169-172: The PromQL example for API latency uses the wrong label
name: replace the `handler` label with `path` in the histogram_quantile query
that references spacebot_http_request_duration_seconds_bucket (i.e., update the
sum by clause from "handler, le" to "path, le" so the query groups by the actual
metric label used in the code).
- Around line 115-118: Update the metrics documentation to match the
implementation: change the label name "handler" to "path" for the HTTP metrics
`spacebot_http_requests_total` and `spacebot_http_request_duration_seconds` so
the Labels column lists "method, path, status" (for the counter) and "method,
path" (for the histogram) to align with the `with_label_values(&[&method, &path,
&status])` calls in src/api/server.rs.


ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 32a35ef and 7e9fffc.

📒 Files selected for processing (18)
  • Dockerfile
  • METRICS.md
  • docs/metrics.md
  • src/agent/channel.rs
  • src/agent/channel_dispatch.rs
  • src/agent/ingestion.rs
  • src/agent/worker.rs
  • src/api/server.rs
  • src/cron/scheduler.rs
  • src/hooks/spacebot.rs
  • src/llm/model.rs
  • src/mcp.rs
  • src/memory/embedding.rs
  • src/memory/search.rs
  • src/memory/store.rs
  • src/telemetry/registry.rs
  • src/tools/memory_recall.rs
  • src/tools/memory_save.rs

coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
src/mcp.rs (2)

148-148: Use a descriptive metrics handle name instead of m.

The repeated let m = ... alias makes these blocks harder to scan and violates the repo naming rule for non-abbreviated variable names.

Proposed rename pattern (apply consistently)
- let m = crate::telemetry::Metrics::global();
- m.mcp_connections
+ let metrics = crate::telemetry::Metrics::global();
+ metrics.mcp_connections
     .with_label_values(&[&self.name, "connected"])
     .set(0);

As per coding guidelines, "Do not abbreviate variable names; use queue not q, message not msg, channel not ch; common abbreviations like config are acceptable."

Also applies to: 183-183, 223-223, 262-262, 317-317, 378-378, 592-592

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcp.rs` at line 148, Multiple places in mcp.rs create an aliased metrics
handle using "let m = crate::telemetry::Metrics::global();" which uses a
non-descriptive, abbreviated name; change these to a clear, non-abbreviated
variable (for example "let metrics = crate::telemetry::Metrics::global();") and
update all subsequent uses in the same scope to use that name (replace
references to "m" with "metrics"); apply the same rename consistently for each
occurrence mentioned so the variable name follows the repository guideline
against abbreviations.

146-164: Consolidate repeated MCP state metric updates into one helper.

The same connected/disconnected/failed/tool-count update logic is duplicated across multiple branches. A small helper will reduce divergence risk and keep lifecycle semantics consistent.

Refactor sketch
+#[cfg(feature = "metrics")]
+fn set_mcp_state_metrics(
+    metrics: &crate::telemetry::Metrics,
+    server_name: &str,
+    connected: i64,
+    disconnected: i64,
+    failed: i64,
+    tools_registered: i64,
+) {
+    metrics.mcp_connections.with_label_values(&[server_name, "connected"]).set(connected);
+    metrics.mcp_connections.with_label_values(&[server_name, "disconnected"]).set(disconnected);
+    metrics.mcp_connections.with_label_values(&[server_name, "failed"]).set(failed);
+    metrics.mcp_tools_registered.with_label_values(&[server_name]).set(tools_registered);
+}

Also applies to: 181-204, 221-239, 315-330

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcp.rs` around lines 146 - 164, Extract the repeated MCP metric updates
into a single helper (e.g., a free fn or impl method like
update_mcp_lifecycle_metrics or set_mcp_metrics) that takes &self.name and the
lifecycle values (connection_attempts status, connected/disconnected/failed
ints, tools_registered count) and calls crate::telemetry::Metrics::global() once
and updates m.mcp_connection_attempts_total, m.mcp_connections (with labels
"connected"/"disconnected"/"failed") and m.mcp_tools_registered accordingly;
then replace each duplicated block (the blocks around the references to
self.name at the shown sites) with calls to that helper to ensure a single
source of truth for MCP state updates.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/metrics.md`:
- Line 54: Update the metrics text to explicitly state that for non-worker tiers
the worker_type label is an empty string; modify the sentence referencing the
`tier` and `worker_type` labels in docs/metrics.md so it reads that `tier` is
one of `channel`, `branch`, `worker`, `compactor`, or `cortex` and that
`worker_type` identifies the worker variant (`builtin`, `opencode`, or
`ingestion`) and is emitted as an empty string for non-worker tiers so
dashboards/filters reflect runtime behavior.



📥 Commits

Reviewing files that changed from the base of the PR and between 7e9fffc and 657e549.

📒 Files selected for processing (4)
  • docs/metrics.md
  • src/agent/channel.rs
  • src/llm/model.rs
  • src/mcp.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/llm/model.rs

coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/metrics.md`:
- Around line 164-167: The PromQL in the docs uses histogram_quantile(0.99,
rate(spacebot_memory_operation_duration_seconds_bucket[5m])) which is invalid
because histogram_quantile requires the bucket series to be aggregated by the le
label; update the query that references
spacebot_memory_operation_duration_seconds_bucket to wrap the rate(...) with sum
by (le) (rate(...)) so it becomes histogram_quantile(0.99, sum by (le)
(rate(spacebot_memory_operation_duration_seconds_bucket[5m]))); if you need
per-operation or per-agent breakdowns, include those labels in the aggregation
(e.g., sum by (operation, le) or sum by (agent, operation, le)).
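Written out, the corrected query from the comment above reads as follows (label names per this PR's docs; adjust the grouping labels to your breakdown needs):

```promql
histogram_quantile(
  0.99,
  sum by (operation, le) (
    rate(spacebot_memory_operation_duration_seconds_bucket[5m])
  )
)
```

The `le` label must survive the aggregation, since histogram_quantile interpolates across bucket boundaries using it.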


📥 Commits

Reviewing files that changed from the base of the PR and between 657e549 and 2cf6410.

📒 Files selected for processing (1)
  • docs/metrics.md

coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
src/agent/channel.rs (1)

685-690: Record message_handling_duration_seconds on error paths too.

Both handlers emit duration only at tail, so early ? returns skip histogram updates and bias latency reporting toward successful turns.

Also applies to: 945-950

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agent/channel.rs` around lines 685 - 690, The histogram observation is
only recorded at the end of successful handlers so early `?` returns skip it;
wrap the timing/observation in a scope-safe guard or explicit finally-style
block so message_handling_duration_seconds is observed on all exit paths. Create
a small RAII timer struct (e.g., MetricsTimer) that captures metrics_turn_start,
self.deps.agent_id and metrics_channel_type on creation and calls
.observe(elapsed) in Drop, or alternatively call the observe in a single
`match`/`map_err`/`and_then` chain or a `let res = (|| { ... })();` followed by
a final observe using metrics_turn_start.elapsed() before returning; apply this
change around the two places shown (the block using
Metrics::global().message_handling_duration_seconds at the snippet and the
similar block near lines 945-950) so the histogram is updated regardless of
early returns or errors.
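The RAII timer suggested above can be sketched with stdlib only; the `observe` closure stands in for the histogram's `observe()` call, and the struct name follows the comment's example.

```rust
use std::time::Instant;

// Drop-based timer: observes elapsed seconds on every exit path.
struct MetricsTimer<F: FnMut(f64)> {
    start: Instant,
    observe: F,
}

impl<F: FnMut(f64)> MetricsTimer<F> {
    fn new(observe: F) -> Self {
        Self { start: Instant::now(), observe }
    }
}

impl<F: FnMut(f64)> Drop for MetricsTimer<F> {
    fn drop(&mut self) {
        // Runs on normal return, early `?` return, and unwind alike.
        (self.observe)(self.start.elapsed().as_secs_f64());
    }
}
```

Constructing one at the top of a handler guarantees the histogram is updated even when an error propagates out mid-turn.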
src/llm/model.rs (1)

368-370: Avoid abbreviated binding name wt.

Please use a descriptive binding for readability and consistency.

✏️ Proposed rename
-            let worker_label = match (tier_label, self.worker_type.as_deref()) {
-                ("worker", Some(wt)) => wt,
+            let worker_label = match (tier_label, self.worker_type.as_deref()) {
+                ("worker", Some(worker_type)) => worker_type,
                 ("worker", None) => "unknown",
                 _ => "",
             };
As per coding guidelines, "Do not abbreviate variable names; use `queue` not `q`, `message` not `msg`, `channel` not `ch`; common abbreviations like `config` are acceptable".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llm/model.rs` around lines 368 - 370, Rename the abbreviated match
binding `wt` to a descriptive name (e.g., `worker_type_str` or `worker_type`) in
the match expression that computes `worker_label` (the match on `(tier_label,
self.worker_type.as_deref())`) so the arms read ("worker",
Some(worker_type_str)) => worker_type_str and ("worker", None) => "unknown";
update any references in that scope accordingly to maintain readability and
consistency with naming guidelines.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/metrics.md`:
- Around line 69-72: The metric description for spacebot_mcp_reconnects_total is
misleading because the code increments it only on successful reconnects; update
the table row for spacebot_mcp_reconnects_total in metrics.md to explicitly
state it counts successful MCP reconnections (e.g., "MCP successful
reconnections" or "successful MCP reconnection attempts") so dashboards reflect
the actual behavior of the counter.

In `@src/mcp.rs`:
- Around line 194-211: The write guard from self.state.write() is held across an
await (self.tools.read()) which can deadlock; set *state =
McpConnectionState::Connected then immediately drop the write guard before any
await by scoping the write lock or calling drop(state) so the guard is released,
then obtain self.tools.read().await and call
set_mcp_connection_state(&self.name, 1, 0, 0, tool_count); ensure the order:
acquire write lock only to set state (use a short scope), release it, then await
self.tools.read().
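The lock-scoping fix can be illustrated with std's RwLock; the real code uses an async lock, where the same scoping matters even more because holding a write guard across an `.await` can deadlock the executor. The function and state type here are illustrative.

```rust
use std::sync::RwLock;

// Set the connection state inside a short scope so the write guard is
// released before any other lock (self.tools in the real code) is taken.
fn set_connected(state: &RwLock<String>) {
    {
        let mut guard = state.write().unwrap();
        *guard = "connected".to_string();
    } // write guard dropped here
    // ...safe now to acquire the tools lock and report the tool count
}
```

In the async version, the same braces (or an explicit `drop(guard)`) before `self.tools.read().await` achieve the ordering the review asks for.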



📥 Commits

Reviewing files that changed from the base of the PR and between 2cf6410 and b2e17a6.

📒 Files selected for processing (6)
  • docs/metrics.md
  • src/agent/channel.rs
  • src/agent/channel_dispatch.rs
  • src/api/server.rs
  • src/llm/model.rs
  • src/mcp.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/api/server.rs

l33t0 and others added 7 commits March 7, 2026 08:36
…tems

Enable metrics feature in Docker build and add 21 new metrics with 10
upgraded existing metrics covering LLM, tools, MCP, channels, memory,
API, cron, ingestion, and agent lifecycle. All instrumentation is
compile-time gated behind #[cfg(feature = "metrics")].

New metrics: MCP connections/tools/reconnects (7), channel
messages/errors (4), memory operations/search/embeddings (3), API
requests/duration (2), lifecycle branches/overflow (2), worker cost (1),
cron executions (1), ingestion files (1).

Upgraded metrics add process_type, worker_type, and agent_id labels for
per-usecase visibility across worker/channel/branch splits.
- Fix docs/metrics.md: label `handler` → `path` for API metrics and
  PromQL examples to match actual implementation
- Fix MCP failure paths: reset `disconnected` gauge to 0 alongside
  `connected` to keep state gauges mutually exclusive
- Record MCP tool call count and duration on both success and failure
- Use empty string for worker_type on non-worker tiers instead of
  "unknown" to avoid confusing extra series
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
- Fix PromQL histogram_quantile query missing sum by (le) aggregation
- Add "links" to route normalization exclusion list (prevents
  /agents/{id}/links from being mis-normalized)
- Rename abbreviated `m` variable to `metrics` in mcp.rs for
  consistency with codebase conventions
- Extract set_mcp_connection_state helper to consolidate repeated
  MCP gauge updates across connect/disconnect/failure paths
- Extract send_outbound_text helper in channel.rs to deduplicate
  send/match/metrics blocks
After rebase, two call sites in channel_dispatch.rs were missing the
worker_type parameter required by the metrics-enhanced spawn_worker_task.
l33t0 force-pushed the feat/metrics-build-enable branch from b2e17a6 to 91ae167 on March 6, 2026 at 22:49
coderabbitai bot left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/memory/store.rs (1)

131-179: ⚠️ Potential issue | 🟡 Minor

Only count successful updates in memory_updates_total.

This counter increments even when the UPDATE touches zero rows. delete() already checks rows_affected() > 0, so update() will currently over-report mutations for missing IDs or stale writes.

🩹 Suggested fix
-        sqlx::query(
+        let result = sqlx::query(
             r#"
             UPDATE memories
             SET content = ?, memory_type = ?, importance = ?, updated_at = ?,
                 last_accessed_at = ?, access_count = ?, source = ?, channel_id = ?,
                 forgotten = ?
             WHERE id = ?
             "#,
         )
         .bind(&memory.content)
         .bind(memory.memory_type.to_string())
         .bind(memory.importance)
         .bind(memory.updated_at)
         .bind(memory.last_accessed_at)
         .bind(memory.access_count)
         .bind(&memory.source)
         .bind(memory.channel_id.as_ref().map(|id| id.as_ref()))
         .bind(memory.forgotten)
         .bind(&memory.id)
         .execute(&self.pool)
         .await
         .with_context(|| format!("failed to update memory {}", memory.id))?;
 
         #[cfg(feature = "metrics")]
-        {
+        if result.rows_affected() > 0 {
             let agent_label = if self.agent_id.is_empty() {
                 "unknown"
             } else {
                 &self.agent_id
             };
             crate::telemetry::Metrics::global()
                 .memory_updates_total
                 .with_label_values(&[agent_label, "update"])
                 .inc();
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/memory/store.rs` around lines 131 - 179, The update() function must only
increment memory_updates_total when the UPDATE actually affected rows: capture
the result of the execute call into a variable (e.g. let result =
sqlx::query(...).bind(...)... .execute(&self.pool).await.with_context(||
format!("failed to update memory {}", memory.id))?;), then check
result.rows_affected() > 0 before incrementing the memory_updates_total metric
(mirroring delete()'s behavior); keep the with_context on the execute call so
errors still propagate but only increment the counter on successful row updates.
src/agent/channel_dispatch.rs (1)

644-651: ⚠️ Potential issue | 🔴 Critical

Update the local tests for the new worker_type argument.

This signature now requires a seventh parameter, but the two unit tests below still call the old six-argument form. CI is already failing with E0061 at Line 1021. Pass "builtin" in both test invocations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agent/channel_dispatch.rs` around lines 644 - 651, The unit tests calling
spawn_worker_task use the old six-argument signature and must be updated to pass
the new seventh parameter worker_type; modify both test invocations of
spawn_worker_task to add the string "builtin" as the final argument so the call
matches pub(crate) fn spawn_worker_task(..., worker_type: &'static str, future:
F), which will resolve the E0061 missing argument errors.
♻️ Duplicate comments (2)
docs/metrics.md (1)

65-73: ⚠️ Potential issue | 🟡 Minor

Describe spacebot_mcp_reconnects_total as successful reconnects.

src/mcp.rs only increments this counter after a reconnect succeeds, so calling it “attempts” overstates what the series means in dashboards and alerts.

📝 Doc fix
-| `spacebot_mcp_reconnects_total`                   | Counter   | server_name            | MCP reconnection attempts           |
+| `spacebot_mcp_reconnects_total`                   | Counter   | server_name            | Successful MCP reconnections        |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/metrics.md` around lines 65 - 73, Update the metrics docs to accurately
describe the metric `spacebot_mcp_reconnects_total` as counting successful
reconnects (not attempts); change the Description cell in the `metrics.md` table
row for `spacebot_mcp_reconnects_total` to something like "MCP successful
reconnections" or "Successful MCP reconnects" to match the behavior in
`src/mcp.rs` where the counter is only incremented after a reconnect succeeds.
src/mcp.rs (1)

194-211: ⚠️ Potential issue | 🔴 Critical

Drop the state write guard before the next await.

Line 194 acquires self.state.write(), and Line 209 awaits self.tools.read() while that guard is still alive. That recreates the lock-ordering deadlock that was already called out earlier. Keep the state assignment in a short scope and reuse tools.len() instead of re-reading through self.tools.

🔧 Proposed fix
-                {
-                    let mut cached_tools = self.tools.write().await;
-                    *cached_tools = tools;
-                }
+                let tool_count = tools.len() as i64;
+                {
+                    let mut cached_tools = self.tools.write().await;
+                    *cached_tools = tools;
+                }
                 self.tool_list_changed.store(false, Ordering::SeqCst);

-                let mut state = self.state.write().await;
-                *state = McpConnectionState::Connected;
+                {
+                    let mut state = self.state.write().await;
+                    *state = McpConnectionState::Connected;
+                }

                 #[cfg(feature = "metrics")]
                 {
                     let metrics = crate::telemetry::Metrics::global();
                     let elapsed = connect_start.elapsed().as_secs_f64();
@@
                     metrics
                         .mcp_connection_duration_seconds
                         .with_label_values(&[&self.name])
                         .observe(elapsed);
-                    let tool_count = self.tools.read().await.len() as i64;
                     set_mcp_connection_state(&self.name, 1, 0, 0, tool_count);
                 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcp.rs` around lines 194 - 211, The write guard from self.state.write()
is held across an await which causes a lock-order deadlock; after setting *state
= McpConnectionState::Connected drop the guard (e.g., limit the scope or
explicitly drop(state)) before doing any awaits, and when recording metrics
reuse the already-read tools collection instead of calling
self.tools.read().await again (use the existing tools variable and tools.len())
to avoid re-acquiring locks while the state lock is released.
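The fix above boils down to two habits: capture any values you need from other locks up front, and confine the write guard to its own scope so it is dropped before the next lock acquisition (in async code, before the next `.await`). A minimal sketch with std's synchronous `RwLock` as a stand-in for the tokio locks in `src/mcp.rs` — `ConnectionState` and `mark_connected` are illustrative names, not the real API:

```rust
use std::sync::RwLock;

#[derive(Debug, PartialEq)]
enum ConnectionState {
    Connecting,
    Connected,
}

// Flip the state to Connected and return the tool count, dropping the
// state write guard before any further lock use.
fn mark_connected(state: &RwLock<ConnectionState>, tools: &RwLock<Vec<String>>) -> i64 {
    // Capture the count up front so the metrics block never has to
    // re-acquire the tools lock while another guard may be live.
    let tool_count = tools.read().unwrap().len() as i64;

    // Keep the write guard in its own scope so it is dropped before any
    // later lock acquisition (in async code: before the next .await).
    {
        let mut guard = state.write().unwrap();
        *guard = ConnectionState::Connected;
    } // guard dropped here

    tool_count
}

fn main() {
    let state = RwLock::new(ConnectionState::Connecting);
    let tools = RwLock::new(vec!["search".to_string(), "fetch".to_string()]);
    let tool_count = mark_connected(&state, &tools);
    assert_eq!(tool_count, 2);
    assert_eq!(*state.read().unwrap(), ConnectionState::Connected);
    println!("connected with {tool_count} tools");
}
```

With tokio's async `RwLock` the same scoping matters even more: a guard held across an `.await` can deadlock against another task acquiring the locks in the opposite order.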
🧹 Nitpick comments (1)
src/llm/model.rs (1)

376-380: Rename wt to worker_type.

Tiny style nit, but this local abbreviation breaks the repo's Rust naming rule and stands out in a metrics block that otherwise uses fully spelled-out labels.

As per coding guidelines, "Use non-abbreviated variable names in Rust code: queue not q, message not msg, channel not ch; common abbreviations like config are acceptable".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llm/model.rs` around lines 376 - 380, Rename the short binding `wt` to
the full `worker_type` in the match that assigns `worker_label` so the match
reads match (tier_label, self.worker_type.as_deref()) { ("worker",
Some(worker_type)) => worker_type, ("worker", None) => "unknown", _ => "", } —
update the local pattern name and any dependent uses to `worker_type` to comply
with the non-abbreviated naming guideline.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@METRICS.md`:
- Around line 332-349: Update the METRICS.md entries for the API middleware
metrics to use the actual label name `path` instead of `handler`: change the
Labels row for `spacebot_http_requests_total` from `method`, `handler`, `status`
to `method`, `path`, `status` and for `spacebot_http_request_duration_seconds`
from `method`, `handler` to `method`, `path`; mention they are instrumented in
`src/api/server.rs` (middleware layer) so readers can find the normalized path
label there.

In `@src/agent/channel.rs`:
- Around line 1973-1998: Built-in reply sends (send_builtin_text and any direct
writes to response_tx for commands like /status, /quiet, /active, /help,
/agent-id and ping acks) bypass the metrics increments; update send_builtin_text
(and replace direct response_tx.send usages for those built-ins) to call the
existing send_outbound_text(text, error_context) helper so messages_sent_total
and channel_errors_total are incremented consistently, or extract the
metrics/err-handling block from send_outbound_text into a shared helper used by
both send_outbound_text and send_builtin_text (ensure you reference response_tx,
send_outbound_text, send_builtin_text, messages_sent_total and
channel_errors_total when changing call sites).
- Around line 1083-1103: The metrics code only records
message_handling_duration_seconds on the normal end path; create a scope guard
that records the histogram on Drop so duration is observed on every exit
(success, early return, or error). Initialize metrics_turn_start and
metrics_channel_type early in the message-handling functions (e.g., inside
handle_message(), the /agent-id and ping fast-paths, built-in ops handlers) and
instantiate a small RAII guard (e.g., MetricTurnGuard) which on drop reads
Instant::now() - metrics_turn_start and calls
crate::telemetry::Metrics::global().message_handling_duration_seconds.with_label_values(&[&self.deps.agent_id,
&metrics_channel_type]).observe(...). Replace or supplement the existing
bottom-of-handler observation with this guard so every code path records
latency.
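The RAII guard suggested above can be sketched with only the standard library; an atomic counter stands in for the Prometheus histogram, and `MessageDurationGuard` / `handle_message` are illustrative names, not the actual code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

// Stand-in sink for the real histogram: counts observations.
static OBSERVATIONS: AtomicU64 = AtomicU64::new(0);

// RAII guard: observes elapsed time when dropped, so every exit path
// (normal return, early return, `?` error propagation) records latency.
struct MessageDurationGuard {
    start: Instant,
}

impl MessageDurationGuard {
    fn new() -> Self {
        Self { start: Instant::now() }
    }
}

impl Drop for MessageDurationGuard {
    fn drop(&mut self) {
        let elapsed = self.start.elapsed().as_secs_f64();
        // In the real code this would be something like:
        // metrics.message_handling_duration_seconds
        //     .with_label_values(&[agent_id, channel_type]).observe(elapsed);
        let _ = elapsed;
        OBSERVATIONS.fetch_add(1, Ordering::Relaxed);
    }
}

fn handle_message(fail: bool) -> Result<(), &'static str> {
    let _guard = MessageDurationGuard::new();
    if fail {
        // Early exit still records the histogram when _guard drops.
        return Err("bad input");
    }
    Ok(())
}

fn main() {
    let _ = handle_message(false);
    let _ = handle_message(true);
    assert_eq!(OBSERVATIONS.load(Ordering::Relaxed), 2);
    println!("recorded {} observations", OBSERVATIONS.load(Ordering::Relaxed));
}
```

The guard must be bound to a named variable (`_guard`, not `_`) so it lives until the end of the enclosing scope rather than being dropped immediately.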

In `@src/agent/ingestion.rs`:
- Around line 479-482: SpacebotModel is built with with_worker_type("ingestion")
but SpacebotModel::completion() only emits worker_type when the tier is
"worker", so ingestion requests end up with worker_type="". Update the code so
completion reports worker_type for ingestion: either change the builder call
that creates the model (SpacebotModel::make -> ensure the model's tier is
"worker" if that aligns with metrics rules) or, better, modify
SpacebotModel::completion() to include/emits the with_worker_type value for
non-worker tiers (e.g., treat "ingestion" the same as "worker" for emitting
worker_type) so worker_cost_dollars and metrics pick up the "ingestion"
worker_type. Ensure you adjust SpacebotModel::completion() logic to read and
emit the worker_type set by with_worker_type().

In `@src/api/server.rs`:
- Around line 327-375: normalize_api_path currently leaves user-controlled
segments unnormalized (e.g., /api/mcp/servers/{name}, /api/providers/{provider},
/api/opencode/{port}/{*path}); update normalize_api_path to collapse those into
stable tokens: detect patterns like parent "mcp" with child "servers" (or when
parent == "servers" and grandparent == "mcp") and replace the following segment
with "{name}"; replace any direct child segment under "providers" with
"{provider}"; for "opencode" detect the wildcard tail (when parent == "opencode"
and i >= 2) and replace the rest of the path segments with a single "{tail}"
token; add these checks into the existing parent-match branch in
normalize_api_path so proxy and user-defined names are normalized consistently.
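The normalization described above can be sketched as a segment walk that consults the already-normalized parent (and grandparent) to decide whether the next segment is user-controlled. This is an illustrative sketch only; the real `normalize_api_path` in `src/api/server.rs` covers more routes and keeps an exclusion list:

```rust
/// Collapse user-controlled path segments into stable placeholder
/// tokens so each route yields a bounded set of metric label values.
fn normalize_api_path(path: &str) -> String {
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    let mut out: Vec<String> = Vec::new();
    let mut i = 0;
    while i < segments.len() {
        // Clone so the match below can push into `out` freely.
        let parent = out.last().cloned().unwrap_or_default();
        let grandparent = if out.len() >= 2 {
            out[out.len() - 2].clone()
        } else {
            String::new()
        };
        match (grandparent.as_str(), parent.as_str()) {
            // /api/mcp/servers/{name}
            ("mcp", "servers") => out.push("{name}".to_string()),
            // /api/providers/{provider}
            (_, "providers") => out.push("{provider}".to_string()),
            // /api/opencode/{port}/{*tail}: collapse everything after
            // "opencode" into two stable tokens.
            (_, "opencode") => {
                out.push("{port}".to_string());
                if i + 1 < segments.len() {
                    out.push("{tail}".to_string());
                }
                break;
            }
            _ => out.push(segments[i].to_string()),
        }
        i += 1;
    }
    format!("/{}", out.join("/"))
}

fn main() {
    assert_eq!(normalize_api_path("/api/mcp/servers/github"), "/api/mcp/servers/{name}");
    assert_eq!(normalize_api_path("/api/providers/openai"), "/api/providers/{provider}");
    assert_eq!(
        normalize_api_path("/api/opencode/4096/session/abc/messages"),
        "/api/opencode/{port}/{tail}"
    );
    println!("normalized routes ok");
}
```

Matching on the normalized output rather than the raw input keeps the placeholders stable even when earlier segments were themselves rewritten.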

In `@src/mcp.rs`:
- Around line 88-113: The helper set_mcp_connection_state currently only sets
"connected", "disconnected", and "failed" but omits the "connecting" state from
McpConnectionState; update set_mcp_connection_state(server_name: &str,
connected: i64, disconnected: i64, failed: i64, tools_registered: i64) to accept
an additional connecting: i64 parameter and add a
metrics.mcp_connections.with_label_values(&[server_name,
"connecting"]).set(connecting) call, and update any callers of
set_mcp_connection_state to pass the current connecting count so the
"spacebot_mcp_connections" gauge exposes the connecting state.
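The helper's shape is a one-hot encoding: every state label gets a 0/1 value on each call, so stale states never linger at 1. A stand-in sketch using a `HashMap` in place of the Prometheus gauge vec — the signature and the combined tools-registered entry are illustrative, not the real helper:

```rust
use std::collections::HashMap;

/// Stand-in for the Prometheus gauge vec, keyed by (server_name, state).
type GaugeVec = HashMap<(String, String), i64>;

/// One-hot set every connection state, including the previously missing
/// "connecting". Sketch only: the real helper writes to the
/// spacebot_mcp_connections gauge (and a separate tools gauge).
fn set_mcp_connection_state(
    gauges: &mut GaugeVec,
    server_name: &str,
    connecting: i64,
    connected: i64,
    disconnected: i64,
    failed: i64,
    tools_registered: i64,
) {
    for (state, value) in [
        ("connecting", connecting),
        ("connected", connected),
        ("disconnected", disconnected),
        ("failed", failed),
        ("tools_registered", tools_registered),
    ] {
        gauges.insert((server_name.to_string(), state.to_string()), value);
    }
}

fn main() {
    let mut gauges = GaugeVec::new();
    // A server that just finished connecting and registered 5 tools.
    set_mcp_connection_state(&mut gauges, "github", 0, 1, 0, 0, 5);
    assert_eq!(gauges[&("github".to_string(), "connected".to_string())], 1);
    assert_eq!(gauges[&("github".to_string(), "connecting".to_string())], 0);
    println!("gauge states recorded");
}
```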


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2a07ef49-8f6d-4641-8a5c-663ee09c2a49

📥 Commits

Reviewing files that changed from the base of the PR and between b2e17a6 and 91ae167.

📒 Files selected for processing (18)
  • Dockerfile
  • METRICS.md
  • docs/metrics.md
  • src/agent/channel.rs
  • src/agent/channel_dispatch.rs
  • src/agent/ingestion.rs
  • src/agent/worker.rs
  • src/api/server.rs
  • src/cron/scheduler.rs
  • src/hooks/spacebot.rs
  • src/llm/model.rs
  • src/mcp.rs
  • src/memory/embedding.rs
  • src/memory/search.rs
  • src/memory/store.rs
  • src/telemetry/registry.rs
  • src/tools/memory_recall.rs
  • src/tools/memory_save.rs
🚧 Files skipped from review as they are similar to previous changes (5)
  • src/tools/memory_save.rs
  • Dockerfile
  • src/cron/scheduler.rs
  • src/hooks/spacebot.rs
  • src/agent/worker.rs

Comment on lines 479 to 482
let model = SpacebotModel::make(&deps.llm_manager, &model_name)
.with_context(&*deps.agent_id, "branch")
.with_worker_type("ingestion")
.with_routing((**routing).clone());
⚠️ Potential issue | 🟠 Major

worker_type="ingestion" is currently dropped by the LLM metrics path.

This model is still tagged as process_type = "branch", but SpacebotModel::completion() only emits worker_type for the "worker" tier. As written, ingestion requests will still report worker_type="" and won't contribute to worker_cost_dollars.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agent/ingestion.rs` around lines 479 - 482, SpacebotModel is built with
with_worker_type("ingestion") but SpacebotModel::completion() only emits
worker_type when the tier is "worker", so ingestion requests end up with
worker_type="". Update the code so completion reports worker_type for ingestion:
either change the builder call that creates the model (SpacebotModel::make ->
ensure the model's tier is "worker" if that aligns with metrics rules) or,
better, modify SpacebotModel::completion() to emit the with_worker_type
value for non-worker tiers (e.g., treat "ingestion" the same as "worker" for
emitting worker_type) so worker_cost_dollars and metrics pick up the "ingestion"
worker_type. Ensure you adjust SpacebotModel::completion() logic to read and
emit the worker_type set by with_worker_type().
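The label-selection logic behind the suggested fix can be sketched as a small pure function: an explicitly set worker_type wins on any tier, the worker tier without one reports "unknown", and everything else stays empty. `worker_label` is an illustrative name under those assumptions, not the actual code in `src/llm/model.rs`:

```rust
/// Pick the worker_type metric label for an LLM request, so that a
/// process_type = "branch" ingestion request still reports
/// worker_type = "ingestion".
fn worker_label<'a>(tier_label: &str, worker_type: Option<&'a str>) -> &'a str {
    match (tier_label, worker_type) {
        // An explicitly set worker_type wins on any tier.
        (_, Some(worker_type)) => worker_type,
        // Worker tier without an explicit type is unknown, not empty.
        ("worker", None) => "unknown",
        // Other tiers without a worker_type stay empty.
        _ => "",
    }
}

fn main() {
    assert_eq!(worker_label("worker", Some("builtin")), "builtin");
    assert_eq!(worker_label("branch", Some("ingestion")), "ingestion");
    assert_eq!(worker_label("worker", None), "unknown");
    assert_eq!(worker_label("channel", None), "");
    println!("labels ok");
}
```

Keeping this as one match makes the precedence explicit, so worker_cost_dollars picks up ingestion spend without special-casing each tier at the call sites.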

l33t0 and others added 5 commits March 7, 2026 09:03
- Add missing worker_type arg to spawn_worker_task in test call sites
- Rename abbreviated `wt` binding to `worker_type` in model.rs
- Fix mcp_reconnects_total doc description to say "Successful MCP
  reconnections" instead of "reconnection attempts"
- mcp.rs: drop state write lock before tools.read() await (deadlock fix)
- mcp.rs: expose "connecting" state in mcp_connections gauge
- mcp.rs: add connecting parameter to set_mcp_connection_state helper
- channel.rs: use RAII MessageDurationGuard for message_handling_duration
  so early returns and ? errors still record the metric
- channel.rs: add metrics to send_builtin_text (messages_sent_total,
  channel_errors_total) so built-in replies are counted
- server.rs: normalize /mcp/servers/{name}, /providers/{name}, and
  /opencode/{port}/{proxy_path} to prevent high-cardinality metric labels
- model.rs: emit worker_type for any tier when explicitly set, so
  ingestion (process_type=branch) correctly reports worker_type=ingestion
- METRICS.md: fix handler→path label references
…unnecessary lock reacquisition

- memory/store.rs: check rows_affected() > 0 before incrementing
  memory_updates_total, matching delete()'s existing behavior
- mcp.rs: capture tools.len() before moving into cache instead of
  re-acquiring self.tools.read() lock in the metrics block