Enable metrics in Docker build, add comprehensive Prometheus instrumentation#274
l33t0 wants to merge 12 commits into `spacedriveapp:main`
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. This behavior can be configured in the CodeRabbit settings.
Walkthrough: Adds comprehensive Prometheus metrics behind a `metrics` feature flag.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 2
🧹 Nitpick comments (3)
src/api/server.rs (1)
339-358: Consider adding "links" to the exclusion list.

The route `/agents/{id}/links` (line 213) may be incorrectly normalized. When processing `/api/agents/some-uuid/links`, the segment `links` follows an agent ID, and since "links" isn't in the exclusion list at lines 342-354, it could be normalized to `{id}` instead of being preserved.

Proposed fix

```diff
     "agents"
         if !matches!(
             *part,
             "mcp" | "warmup" | "overview" | "workers" | "memories" | "profile"
                 | "identity" | "config" | "cron" | "tasks" | "ingest" | "skills" | "tools"
+                | "links"
         ) =>
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/api/server.rs` around lines 339 - 358, The route segment matcher that normalizes agent routes is missing "links" in its exclusion list, causing "/agents/{id}/links" to be mis-normalized; update the match arm that checks *part (the list with "mcp" | "warmup" ... | "tools") to also include "links" so that the code path that currently pushes "{id}" (where normalized.push("{id}") is called) preserves "links" instead of treating it as an ID.

src/mcp.rs (1)
342-382: Tool call metrics only recorded on success.

The metrics are only recorded when `result.is_ok()`. Failed tool calls won't be reflected in `mcp_tool_calls_total` or `mcp_tool_call_duration_seconds`. Consider whether tracking failures with an additional label would be valuable for debugging MCP server issues.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcp.rs` around lines 342 - 382, Metrics are only emitted when result.is_ok(); change the logic so you always record m.mcp_tool_calls_total and m.mcp_tool_call_duration_seconds (or a separate failure metric) after the call completes, using the already-captured call_start and the result variable to compute elapsed time and add a status label (e.g., "ok"/"err") or a separate metric for failures. Locate the block around call_start, the call to client.call_tool(...).await (result), and the metrics usage (crate::telemetry::Metrics::global(), m.mcp_tool_calls_total, m.mcp_tool_call_duration_seconds) and move/extend the metrics emission to run unconditionally after result is ready, using result.is_ok() only to set the label value.

src/agent/channel.rs (1)
Verify each finding against the current code and only fix it if needed. In `@src/mcp.rs` around lines 342 - 382, Metrics are only emitted when result.is_ok(); change the logic so you always record m.mcp_tool_calls_total and m.mcp_tool_call_duration_seconds (or a separate failure metric) after the call completes, using the already-captured call_start and the result variable to compute elapsed time and add a status label (e.g., "ok"/"err") or a separate metric for failures. Locate the block around call_start, the call to client.call_tool(...).await (result), and the metrics usage (crate::telemetry::Metrics::global(), m.mcp_tool_calls_total, m.mcp_tool_call_duration_seconds) and move/extend the metrics emission to run unconditionally after result is ready, using result.is_ok() only to set the label value.src/agent/channel.rs (1)
1329-1356: Consider extracting fallback-send + metrics/error accounting into a helper.The same send/match/increment block is repeated three times, which makes future label changes easy to miss in one branch.
♻️ Suggested refactor sketch
```diff
+async fn send_text_with_metrics(
+    &self,
+    final_text: String,
+    metrics_agent_id: &str,
+    metrics_channel_type: &str,
+) {
+    match self.response_tx.send(OutboundResponse::Text(final_text)).await {
+        Ok(()) => {
+            #[cfg(feature = "metrics")]
+            crate::telemetry::Metrics::global()
+                .messages_sent_total
+                .with_label_values(&[metrics_agent_id, metrics_channel_type])
+                .inc();
+        }
+        Err(error) => {
+            #[cfg(feature = "metrics")]
+            crate::telemetry::Metrics::global()
+                .channel_errors_total
+                .with_label_values(&[metrics_agent_id, metrics_channel_type, "send_failed"])
+                .inc();
+            tracing::error!(%error, channel_id = %self.id, "failed to send fallback reply");
+        }
+    }
+}
```

Also applies to: 1411-1438, 1484-1511
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agent/channel.rs` around lines 1329 - 1356, Extract the repeated send/match/metrics block into a single async helper (e.g., send_fallback_reply or send_outbound_with_metrics) that takes the response payload (OutboundResponse::Text or other variants), references to metrics labels (metrics_agent_id, metrics_channel_type) and self.id (or &self) so it can perform the send on self.response_tx, increment messages_sent_total on Ok and channel_errors_total with "send_failed" on Err, and log the tracing::error with %error and channel_id = %self.id; then replace the three duplicated match blocks (the one shown plus the blocks at the other reported locations) with calls to this helper to ensure label/error accounting is centralized and consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/metrics.md`:
- Around line 169-172: The PromQL example for API latency uses the wrong label
name: replace the `handler` label with `path` in the histogram_quantile query
that references spacebot_http_request_duration_seconds_bucket (i.e., update the
sum by clause from "handler, le" to "path, le" so the query groups by the actual
metric label used in the code).
- Around line 115-118: Update the metrics documentation to match the
implementation: change the label name "handler" to "path" for the HTTP metrics
`spacebot_http_requests_total` and `spacebot_http_request_duration_seconds` so
the Labels column lists "method, path, status" (for the counter) and "method,
path" (for the histogram) to align with the `with_label_values(&[&method, &path,
&status])` calls in src/api/server.rs.
---
Nitpick comments:
In `@src/agent/channel.rs`:
- Around line 1329-1356: Extract the repeated send/match/metrics block into a
single async helper (e.g., send_fallback_reply or send_outbound_with_metrics)
that takes the response payload (OutboundResponse::Text or other variants),
references to metrics labels (metrics_agent_id, metrics_channel_type) and
self.id (or &self) so it can perform the send on self.response_tx, increment
messages_sent_total on Ok and channel_errors_total with "send_failed" on Err,
and log the tracing::error with %error and channel_id = %self.id; then replace
the three duplicated match blocks (the one shown plus the blocks at the other
reported locations) with calls to this helper to ensure label/error accounting
is centralized and consistent.
In `@src/api/server.rs`:
- Around line 339-358: The route segment matcher that normalizes agent routes is
missing "links" in its exclusion list, causing "/agents/{id}/links" to be
mis-normalized; update the match arm that checks *part (the list with "mcp" |
"warmup" ... | "tools") to also include "links" so that the code path that
currently pushes "{id}" (where normalized.push("{id}") is called) preserves
"links" instead of treating it as an ID.
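To make the normalization behavior concrete, here is a simplified, hypothetical sketch of the fix: any segment under the `agents` subtree that is not a known sub-route collapses to `{id}`, and `links` is included in the exclusion set so it survives. The function name `normalize_agent_path` and the flag-based traversal are assumptions for this illustration, not the actual `src/api/server.rs` code.

```rust
// Hypothetical, simplified route normalizer. Segments under the "agents"
// subtree that are not known sub-route names are collapsed to "{id}";
// "links" is in the exclusion set so /agents/<uuid>/links is preserved.
fn normalize_agent_path(path: &str) -> String {
    let mut normalized: Vec<&str> = Vec::new();
    let mut under_agents = false;
    for part in path.split('/').filter(|segment| !segment.is_empty()) {
        let is_known_subroute = matches!(
            part,
            "mcp" | "warmup" | "overview" | "workers" | "memories" | "profile"
                | "identity" | "config" | "cron" | "tasks" | "ingest" | "skills"
                | "tools" | "links"
        );
        if under_agents && !is_known_subroute {
            normalized.push("{id}");
        } else {
            normalized.push(part);
        }
        if part == "agents" {
            under_agents = true;
        }
    }
    format!("/{}", normalized.join("/"))
}
```

With this version, `/api/agents/some-uuid/links` normalizes to `/api/agents/{id}/links` instead of `/api/agents/{id}/{id}`.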
In `@src/mcp.rs`:
- Around line 342-382: Metrics are only emitted when result.is_ok(); change the
logic so you always record m.mcp_tool_calls_total and
m.mcp_tool_call_duration_seconds (or a separate failure metric) after the call
completes, using the already-captured call_start and the result variable to
compute elapsed time and add a status label (e.g., "ok"/"err") or a separate
metric for failures. Locate the block around call_start, the call to
client.call_tool(...).await (result), and the metrics usage
(crate::telemetry::Metrics::global(), m.mcp_tool_calls_total,
m.mcp_tool_call_duration_seconds) and move/extend the metrics emission to run
unconditionally after result is ready, using result.is_ok() only to set the
label value.
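A minimal sketch of the "record on both paths" pattern the comment asks for, with the Prometheus handles replaced by a plain in-memory stand-in so the example is self-contained (the `ToolCallMetrics` type and `record` method are illustrative, not the real `crate::telemetry::Metrics` API):

```rust
use std::collections::HashMap;
use std::time::Instant;

// Stand-in for the Prometheus counter/histogram family.
#[derive(Default)]
struct ToolCallMetrics {
    calls_total: HashMap<(String, String), u64>, // (tool, status) -> count
    durations: Vec<f64>,                         // observed seconds
}

impl ToolCallMetrics {
    // Always record the call; the Result is used only to derive the status label.
    fn record<T, E>(&mut self, tool: &str, started: Instant, result: &Result<T, E>) {
        let status = if result.is_ok() { "ok" } else { "err" };
        *self
            .calls_total
            .entry((tool.to_string(), status.to_string()))
            .or_insert(0) += 1;
        self.durations.push(started.elapsed().as_secs_f64());
    }
}
```

The key point is that the metric emission runs unconditionally after the result is ready; failed tool calls are still counted and timed, just under a different status label.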
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
- Dockerfile
- METRICS.md
- docs/metrics.md
- src/agent/channel.rs
- src/agent/channel_dispatch.rs
- src/agent/ingestion.rs
- src/agent/worker.rs
- src/api/server.rs
- src/cron/scheduler.rs
- src/hooks/spacebot.rs
- src/llm/model.rs
- src/mcp.rs
- src/memory/embedding.rs
- src/memory/search.rs
- src/memory/store.rs
- src/telemetry/registry.rs
- src/tools/memory_recall.rs
- src/tools/memory_save.rs
Actionable comments posted: 1
🧹 Nitpick comments (2)
src/mcp.rs (2)
148-148: Use a descriptive metrics handle name instead of `m`.

The repeated `let m = ...` alias makes these blocks harder to scan and violates the repo naming rule for non-abbreviated variable names.

Proposed rename pattern (apply consistently)

```diff
-    let m = crate::telemetry::Metrics::global();
-    m.mcp_connections
+    let metrics = crate::telemetry::Metrics::global();
+    metrics.mcp_connections
         .with_label_values(&[&self.name, "connected"])
         .set(0);
```

As per coding guidelines, "Do not abbreviate variable names; use `queue` not `q`, `message` not `msg`, `channel` not `ch`; common abbreviations like `config` are acceptable."

Also applies to: 183-183, 223-223, 262-262, 317-317, 378-378, 592-592
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcp.rs` at line 148, Multiple places in mcp.rs create an aliased metrics handle using "let m = crate::telemetry::Metrics::global();" which uses a non-descriptive, abbreviated name; change these to a clear, non-abbreviated variable (for example "let metrics = crate::telemetry::Metrics::global();") and update all subsequent uses in the same scope to use that name (replace references to "m" with "metrics"); apply the same rename consistently for each occurrence mentioned so the variable name follows the repository guideline against abbreviations.
146-164: Consolidate repeated MCP state metric updates into one helper.

The same connected/disconnected/failed/tool-count update logic is duplicated across multiple branches. A small helper will reduce divergence risk and keep lifecycle semantics consistent.
Refactor sketch
```diff
+#[cfg(feature = "metrics")]
+fn set_mcp_state_metrics(
+    metrics: &crate::telemetry::Metrics,
+    server_name: &str,
+    connected: i64,
+    disconnected: i64,
+    failed: i64,
+    tools_registered: i64,
+) {
+    metrics.mcp_connections.with_label_values(&[server_name, "connected"]).set(connected);
+    metrics.mcp_connections.with_label_values(&[server_name, "disconnected"]).set(disconnected);
+    metrics.mcp_connections.with_label_values(&[server_name, "failed"]).set(failed);
+    metrics.mcp_tools_registered.with_label_values(&[server_name]).set(tools_registered);
+}
```

Also applies to: 181-204, 221-239, 315-330
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcp.rs` around lines 146 - 164, Extract the repeated MCP metric updates into a single helper (e.g., a free fn or impl method like update_mcp_lifecycle_metrics or set_mcp_metrics) that takes &self.name and the lifecycle values (connection_attempts status, connected/disconnected/failed ints, tools_registered count) and calls crate::telemetry::Metrics::global() once and updates m.mcp_connection_attempts_total, m.mcp_connections (with labels "connected"/"disconnected"/"failed") and m.mcp_tools_registered accordingly; then replace each duplicated block (the blocks around the references to self.name at the shown sites) with calls to that helper to ensure a single source of truth for MCP state updates.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/metrics.md`:
- Line 54: Update the metrics text to explicitly state that for non-worker tiers
the worker_type label is an empty string; modify the sentence referencing the
`tier` and `worker_type` labels in docs/metrics.md so it reads that `tier` is
one of `channel`, `branch`, `worker`, `compactor`, or `cortex` and that
`worker_type` identifies the worker variant (`builtin`, `opencode`, or
`ingestion`) and is emitted as an empty string for non-worker tiers so
dashboards/filters reflect runtime behavior.
---
Nitpick comments:
In `@src/mcp.rs`:
- Line 148: Multiple places in mcp.rs create an aliased metrics handle using
"let m = crate::telemetry::Metrics::global();" which uses a non-descriptive,
abbreviated name; change these to a clear, non-abbreviated variable (for example
"let metrics = crate::telemetry::Metrics::global();") and update all subsequent
uses in the same scope to use that name (replace references to "m" with
"metrics"); apply the same rename consistently for each occurrence mentioned so
the variable name follows the repository guideline against abbreviations.
- Around line 146-164: Extract the repeated MCP metric updates into a single
helper (e.g., a free fn or impl method like update_mcp_lifecycle_metrics or
set_mcp_metrics) that takes &self.name and the lifecycle values
(connection_attempts status, connected/disconnected/failed ints,
tools_registered count) and calls crate::telemetry::Metrics::global() once and
updates m.mcp_connection_attempts_total, m.mcp_connections (with labels
"connected"/"disconnected"/"failed") and m.mcp_tools_registered accordingly;
then replace each duplicated block (the blocks around the references to
self.name at the shown sites) with calls to that helper to ensure a single
source of truth for MCP state updates.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- docs/metrics.md
- src/agent/channel.rs
- src/llm/model.rs
- src/mcp.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- src/llm/model.rs
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/metrics.md`:
- Around line 164-167: The PromQL in the docs uses histogram_quantile(0.99,
rate(spacebot_memory_operation_duration_seconds_bucket[5m])) which is invalid
because histogram_quantile requires the bucket series to be aggregated by the le
label; update the query that references
spacebot_memory_operation_duration_seconds_bucket to wrap the rate(...) with sum
by (le) (rate(...)) so it becomes histogram_quantile(0.99, sum by (le)
(rate(spacebot_memory_operation_duration_seconds_bucket[5m]))); if you need
per-operation or per-agent breakdowns, include those labels in the aggregation
(e.g., sum by (operation, le) or sum by (agent, operation, le)).
Actionable comments posted: 2
🧹 Nitpick comments (2)
src/agent/channel.rs (1)
685-690: Record `message_handling_duration_seconds` on error paths too.

Both handlers emit duration only at tail, so early `?` returns skip histogram updates and bias latency reporting toward successful turns.

Also applies to: 945-950
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agent/channel.rs` around lines 685 - 690, The histogram observation is only recorded at the end of successful handlers so early `?` returns skip it; wrap the timing/observation in a scope-safe guard or explicit finally-style block so message_handling_duration_seconds is observed on all exit paths. Create a small RAII timer struct (e.g., MetricsTimer) that captures metrics_turn_start, self.deps.agent_id and metrics_channel_type on creation and calls .observe(elapsed) in Drop, or alternatively call the observe in a single `match`/`map_err`/`and_then` chain or a `let res = (|| { ... })();` followed by a final observe using metrics_turn_start.elapsed() before returning; apply this change around the two places shown (the block using Metrics::global().message_handling_duration_seconds at the snippet and the similar block near lines 945-950) so the histogram is updated regardless of early returns or errors.

src/llm/model.rs (1)
368-370: Avoid abbreviated binding name `wt`.

Please use a descriptive binding for readability and consistency. As per coding guidelines, "Do not abbreviate variable names; use `queue` not `q`, `message` not `msg`, `channel` not `ch`; common abbreviations like `config` are acceptable".

✏️ Proposed rename

```diff
-let worker_label = match (tier_label, self.worker_type.as_deref()) {
-    ("worker", Some(wt)) => wt,
+let worker_label = match (tier_label, self.worker_type.as_deref()) {
+    ("worker", Some(worker_type)) => worker_type,
     ("worker", None) => "unknown",
     _ => "",
 };
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/llm/model.rs` around lines 368 - 370, Rename the abbreviated match binding `wt` to a descriptive name (e.g., `worker_type_str` or `worker_type`) in the match expression that computes `worker_label` (the match on `(tier_label, self.worker_type.as_deref())`) so the arms read ("worker", Some(worker_type_str)) => worker_type_str and ("worker", None) => "unknown"; update any references in that scope accordingly to maintain readability and consistency with naming guidelines.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/metrics.md`:
- Around line 69-72: The metric description for spacebot_mcp_reconnects_total is
misleading because the code increments it only on successful reconnects; update
the table row for spacebot_mcp_reconnects_total in metrics.md to explicitly
state it counts successful MCP reconnections (e.g., "MCP successful
reconnections" or "successful MCP reconnection attempts") so dashboards reflect
the actual behavior of the counter.
In `@src/mcp.rs`:
- Around line 194-211: The write guard from self.state.write() is held across an
await (self.tools.read()) which can deadlock; set *state =
McpConnectionState::Connected then immediately drop the write guard before any
await by scoping the write lock or calling drop(state) so the guard is released,
then obtain self.tools.read().await and call
set_mcp_connection_state(&self.name, 1, 0, 0, tool_count); ensure the order:
acquire write lock only to set state (use a short scope), release it, then await
self.tools.read().
---
Nitpick comments:
In `@src/agent/channel.rs`:
- Around line 685-690: The histogram observation is only recorded at the end of
successful handlers so early `?` returns skip it; wrap the timing/observation in
a scope-safe guard or explicit finally-style block so
message_handling_duration_seconds is observed on all exit paths. Create a small
RAII timer struct (e.g., MetricsTimer) that captures metrics_turn_start,
self.deps.agent_id and metrics_channel_type on creation and calls
.observe(elapsed) in Drop, or alternatively call the observe in a single
`match`/`map_err`/`and_then` chain or a `let res = (|| { ... })();` followed by
a final observe using metrics_turn_start.elapsed() before returning; apply this
change around the two places shown (the block using
Metrics::global().message_handling_duration_seconds at the snippet and the
similar block near lines 945-950) so the histogram is updated regardless of
early returns or errors.
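The Drop-based guard suggested above can be sketched as follows, with the Prometheus histogram replaced by a simple callback so the example is self-contained (the names `TurnTimer` and `handle_message` are illustrative, not the actual `src/agent/channel.rs` code):

```rust
use std::time::Instant;

// RAII timer: records elapsed time when dropped, on every exit path.
struct TurnTimer<F: FnMut(f64)> {
    started: Instant,
    observe: F,
}

impl<F: FnMut(f64)> TurnTimer<F> {
    fn new(observe: F) -> Self {
        Self { started: Instant::now(), observe }
    }
}

impl<F: FnMut(f64)> Drop for TurnTimer<F> {
    fn drop(&mut self) {
        // Runs on success, early `?`/return, and panic unwind alike.
        let elapsed = self.started.elapsed().as_secs_f64();
        (self.observe)(elapsed);
    }
}

// Stand-in handler showing that an early error return still observes duration.
fn handle_message(fail_early: bool, sink: &mut Vec<f64>) -> Result<(), &'static str> {
    let _timer = TurnTimer::new(|seconds| sink.push(seconds));
    if fail_early {
        return Err("bad input"); // duration is still recorded via Drop
    }
    Ok(())
}
```

In the real code the closure would call `.observe(...)` on the `message_handling_duration_seconds` histogram with the agent-id and channel-type labels captured at construction time.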
In `@src/llm/model.rs`:
- Around line 368-370: Rename the abbreviated match binding `wt` to a
descriptive name (e.g., `worker_type_str` or `worker_type`) in the match
expression that computes `worker_label` (the match on `(tier_label,
self.worker_type.as_deref())`) so the arms read ("worker",
Some(worker_type_str)) => worker_type_str and ("worker", None) => "unknown";
update any references in that scope accordingly to maintain readability and
consistency with naming guidelines.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- docs/metrics.md
- src/agent/channel.rs
- src/agent/channel_dispatch.rs
- src/api/server.rs
- src/llm/model.rs
- src/mcp.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- src/api/server.rs
…tems Enable metrics feature in Docker build and add 21 new metrics with 10 upgraded existing metrics covering LLM, tools, MCP, channels, memory, API, cron, ingestion, and agent lifecycle. All instrumentation is compile-time gated behind #[cfg(feature = "metrics")]. New metrics: MCP connections/tools/reconnects (7), channel messages/errors (4), memory operations/search/embeddings (3), API requests/duration (2), lifecycle branches/overflow (2), worker cost (1), cron executions (1), ingestion files (1). Upgraded metrics add process_type, worker_type, and agent_id labels for per-usecase visibility across worker/channel/branch splits.
- Fix docs/metrics.md: label `handler` → `path` for API metrics and PromQL examples to match actual implementation - Fix MCP failure paths: reset `disconnected` gauge to 0 alongside `connected` to keep state gauges mutually exclusive - Record MCP tool call count and duration on both success and failure - Use empty string for worker_type on non-worker tiers instead of "unknown" to avoid confusing extra series
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
- Fix PromQL histogram_quantile query missing sum by (le) aggregation
- Add "links" to route normalization exclusion list (prevents
/agents/{id}/links from being mis-normalized)
- Rename abbreviated `m` variable to `metrics` in mcp.rs for
consistency with codebase conventions
- Extract set_mcp_connection_state helper to consolidate repeated
MCP gauge updates across connect/disconnect/failure paths
- Extract send_outbound_text helper in channel.rs to deduplicate
send/match/metrics blocks
After rebase, two call sites in channel_dispatch.rs were missing the worker_type parameter required by the metrics-enhanced spawn_worker_task.
b2e17a6 to 91ae167
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/memory/store.rs (1)
131-179: ⚠️ Potential issue | 🟡 Minor

Only count successful updates in `memory_updates_total`.

This counter increments even when the `UPDATE` touches zero rows. `delete()` already checks `rows_affected() > 0`, so `update()` will currently over-report mutations for missing IDs or stale writes.

🩹 Suggested fix

```diff
-    sqlx::query(
+    let result = sqlx::query(
         r#"
         UPDATE memories
         SET content = ?, memory_type = ?, importance = ?, updated_at = ?,
             last_accessed_at = ?, access_count = ?, source = ?, channel_id = ?, forgotten = ?
         WHERE id = ?
         "#,
     )
     .bind(&memory.content)
     .bind(memory.memory_type.to_string())
     .bind(memory.importance)
     .bind(memory.updated_at)
     .bind(memory.last_accessed_at)
     .bind(memory.access_count)
     .bind(&memory.source)
     .bind(memory.channel_id.as_ref().map(|id| id.as_ref()))
     .bind(memory.forgotten)
     .bind(&memory.id)
     .execute(&self.pool)
     .await
     .with_context(|| format!("failed to update memory {}", memory.id))?;

     #[cfg(feature = "metrics")]
-    {
+    if result.rows_affected() > 0 {
         let agent_label = if self.agent_id.is_empty() {
             "unknown"
         } else {
             &self.agent_id
         };
         crate::telemetry::Metrics::global()
             .memory_updates_total
             .with_label_values(&[agent_label, "update"])
             .inc();
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/memory/store.rs` around lines 131 - 179, The update() function must only increment memory_updates_total when the UPDATE actually affected rows: capture the result of the execute call into a variable (e.g. let result = sqlx::query(...).bind(...)... .execute(&self.pool).await.with_context(|| format!("failed to update memory {}", memory.id))?;), then check result.rows_affected() > 0 before incrementing the memory_updates_total metric (mirroring delete()'s behavior); keep the with_context on the execute call so errors still propagate but only increment the counter on successful row updates.

src/agent/channel_dispatch.rs (1)
644-651: ⚠️ Potential issue | 🔴 Critical

Update the local tests for the new `worker_type` argument.

This signature now requires a seventh parameter, but the two unit tests below still call the old six-argument form. CI is already failing with E0061 at line 1021. Pass `"builtin"` in both test invocations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agent/channel_dispatch.rs` around lines 644 - 651, The unit tests calling spawn_worker_task use the old six-argument signature and must be updated to pass the new seventh parameter worker_type; modify both test invocations of spawn_worker_task to add the string "builtin" as the final argument so the call matches pub(crate) fn spawn_worker_task(..., worker_type: &'static str, future: F), which will resolve the E0061 missing argument errors.
♻️ Duplicate comments (2)
docs/metrics.md (1)
65-73: ⚠️ Potential issue | 🟡 Minor

Describe `spacebot_mcp_reconnects_total` as successful reconnects.

`src/mcp.rs` only increments this counter after a reconnect succeeds, so calling it "attempts" overstates what the series means in dashboards and alerts.

📝 Doc fix

```diff
-| `spacebot_mcp_reconnects_total` | Counter | server_name | MCP reconnection attempts |
+| `spacebot_mcp_reconnects_total` | Counter | server_name | Successful MCP reconnections |
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/metrics.md` around lines 65 - 73, Update the metrics docs to accurately describe the metric `spacebot_mcp_reconnects_total` as counting successful reconnects (not attempts); change the Description cell in the `metrics.md` table row for `spacebot_mcp_reconnects_total` to something like "MCP successful reconnections" or "Successful MCP reconnects" to match the behavior in `src/mcp.rs` where the counter is only incremented after a reconnect succeeds.

src/mcp.rs (1)
194-211: ⚠️ Potential issue | 🔴 Critical

Drop the state write guard before the next await.

Line 194 acquires `self.state.write()`, and line 209 awaits `self.tools.read()` while that guard is still alive. That recreates the lock-ordering deadlock that was already called out earlier. Keep the state assignment in a short scope and reuse `tools.len()` instead of re-reading through `self.tools`.

🔧 Proposed fix

```diff
-    {
-        let mut cached_tools = self.tools.write().await;
-        *cached_tools = tools;
-    }
+    let tool_count = tools.len() as i64;
+    {
+        let mut cached_tools = self.tools.write().await;
+        *cached_tools = tools;
+    }
     self.tool_list_changed.store(false, Ordering::SeqCst);
-    let mut state = self.state.write().await;
-    *state = McpConnectionState::Connected;
+    {
+        let mut state = self.state.write().await;
+        *state = McpConnectionState::Connected;
+    }
     #[cfg(feature = "metrics")]
     {
         let metrics = crate::telemetry::Metrics::global();
         let elapsed = connect_start.elapsed().as_secs_f64();
@@
         metrics
             .mcp_connection_duration_seconds
             .with_label_values(&[&self.name])
             .observe(elapsed);
-        let tool_count = self.tools.read().await.len() as i64;
         set_mcp_connection_state(&self.name, 1, 0, 0, tool_count);
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mcp.rs` around lines 194 - 211, The write guard from self.state.write() is held across an await which causes a lock-order deadlock; after setting *state = McpConnectionState::Connected drop the guard (e.g., limit the scope or explicitly drop(state)) before doing any awaits, and when recording metrics reuse the already-read tools collection instead of calling self.tools.read().await again (use the existing tools variable and tools.len()) to avoid re-acquiring locks while the state lock is released.
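The scoping pattern behind this fix can be illustrated with a self-contained example. The real code uses `tokio::sync::RwLock` and the second acquisition happens across an `.await`; this sketch uses `std::sync::RwLock` (and an assumed `Connection` type) so it runs standalone, but the principle is the same: release the first guard in a short scope before touching any other lock.

```rust
use std::sync::RwLock;

// Hypothetical stand-in for the MCP connection state holder.
struct Connection {
    state: RwLock<&'static str>,
    tools: RwLock<Vec<String>>,
}

impl Connection {
    fn mark_connected(&self) -> usize {
        {
            // Short scope: the write guard is dropped at the closing brace,
            // before any other lock is acquired.
            let mut state = self.state.write().unwrap();
            *state = "connected";
        }
        // Safe to take the second lock now; no guard is held across it.
        self.tools.read().unwrap().len()
    }
}
```

Holding the `state` write guard while acquiring `tools` would be the async-deadlock shape the review flags; scoping it away removes the hazard without changing behavior.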
🧹 Nitpick comments (1)
src/llm/model.rs (1)
376-380: Rename `wt` to `worker_type`.

Tiny style nit, but this local abbreviation breaks the repo's Rust naming rule and stands out in a metrics block that otherwise uses fully spelled-out labels. As per coding guidelines, "Use non-abbreviated variable names in Rust code: `queue` not `q`, `message` not `msg`, `channel` not `ch`; common abbreviations like `config` are acceptable".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/llm/model.rs` around lines 376 - 380, Rename the short binding `wt` to the full `worker_type` in the match that assigns `worker_label` so the match reads match (tier_label, self.worker_type.as_deref()) { ("worker", Some(worker_type)) => worker_type, ("worker", None) => "unknown", _ => "", } — update the local pattern name and any dependent uses to `worker_type` to comply with the non-abbreviated naming guideline.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@METRICS.md`:
- Around line 332-349: Update the METRICS.md entries for the API middleware
metrics to use the actual label name `path` instead of `handler`: change the
Labels row for `spacebot_http_requests_total` from `method`, `handler`, `status`
to `method`, `path`, `status` and for `spacebot_http_request_duration_seconds`
from `method`, `handler` to `method`, `path`; mention they are instrumented in
`src/api/server.rs` (middleware layer) so readers can find the normalized path
label there.
In `@src/agent/channel.rs`:
- Around line 1973-1998: Built-in reply sends (send_builtin_text and any direct
writes to response_tx for commands like /status, /quiet, /active, /help,
/agent-id and ping acks) bypass the metrics increments; update send_builtin_text
(and replace direct response_tx.send usages for those built-ins) to call the
existing send_outbound_text(text, error_context) helper so messages_sent_total
and channel_errors_total are incremented consistently, or extract the
metrics/err-handling block from send_outbound_text into a shared helper used by
both send_outbound_text and send_builtin_text (ensure you reference response_tx,
send_outbound_text, send_builtin_text, messages_sent_total and
channel_errors_total when changing call sites).
- Around line 1083-1103: The metrics code only records
message_handling_duration_seconds on the normal end path; create a scope guard
that records the histogram on Drop so duration is observed on every exit
(success, early return, or error). Initialize metrics_turn_start and
metrics_channel_type early in the message-handling functions (e.g., inside
handle_message(), the /agent-id and ping fast-paths, built-in ops handlers) and
instantiate a small RAII guard (e.g., MetricTurnGuard) which on drop reads
Instant::now() - metrics_turn_start and calls
crate::telemetry::Metrics::global().message_handling_duration_seconds.with_label_values(&[&self.deps.agent_id,
&metrics_channel_type]).observe(...). Replace or supplement the existing
bottom-of-handler observation with this guard so every code path records
latency.
In `@src/agent/ingestion.rs`:
- Around line 479-482: SpacebotModel is built with with_worker_type("ingestion")
but SpacebotModel::completion() only emits worker_type when the tier is
"worker", so ingestion requests end up with worker_type="". Update the code so
completion reports worker_type for ingestion: either change the builder call
that creates the model (SpacebotModel::make -> ensure the model's tier is
"worker" if that aligns with metrics rules) or, better, modify
SpacebotModel::completion() to include/emits the with_worker_type value for
non-worker tiers (e.g., treat "ingestion" the same as "worker" for emitting
worker_type) so worker_cost_dollars and metrics pick up the "ingestion"
worker_type. Ensure you adjust SpacebotModel::completion() logic to read and
emit the worker_type set by with_worker_type().
In `@src/api/server.rs`:
- Around line 327-375: normalize_api_path currently leaves user-controlled
segments unnormalized (e.g., /api/mcp/servers/{name}, /api/providers/{provider},
/api/opencode/{port}/{*path}); update normalize_api_path to collapse those into
stable tokens: detect patterns like parent "mcp" with child "servers" (or when
parent == "servers" and grandparent == "mcp") and replace the following segment
with "{name}"; replace any direct child segment under "providers" with
"{provider}"; for "opencode" detect the wildcard tail (when parent == "opencode"
and i >= 2) and replace the rest of the path segments with a single "{tail}"
token; add these checks into the existing parent-match branch in
normalize_api_path so proxy and user-defined names are normalized consistently.
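A hedged sketch of the suggested extra rules, assuming the same segment-walking shape as normalize_api_path in src/api/server.rs; the function name here is hypothetical and the token names ({name}, {provider}, {tail}) follow the review suggestion:

```rust
// Collapse user-controlled segments under mcp/servers, providers, and
// opencode into fixed tokens to keep Prometheus label cardinality bounded.
fn normalize_proxy_segments(path: &str) -> String {
    let parts: Vec<&str> = path.split('/').filter(|p| !p.is_empty()).collect();
    let mut normalized: Vec<&str> = Vec::new();
    for (i, part) in parts.iter().enumerate() {
        let parent = if i > 0 { parts[i - 1] } else { "" };
        let grandparent = if i > 1 { parts[i - 2] } else { "" };
        if parent == "servers" && grandparent == "mcp" {
            normalized.push("{name}"); // /api/mcp/servers/<user-defined name>
        } else if parent == "providers" {
            normalized.push("{provider}");
        } else if parent == "opencode" && i >= 2 {
            // Collapse the port and the entire proxied tail into one token.
            normalized.push("{tail}");
            break;
        } else {
            normalized.push(*part);
        }
    }
    format!("/{}", normalized.join("/"))
}

fn main() {
    assert_eq!(normalize_proxy_segments("/api/mcp/servers/local-fs"), "/api/mcp/servers/{name}");
    assert_eq!(normalize_proxy_segments("/api/providers/openrouter"), "/api/providers/{provider}");
    assert_eq!(normalize_proxy_segments("/api/opencode/3000/v1/session"), "/api/opencode/{tail}");
    println!("ok");
}
```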
In `@src/mcp.rs`:
- Around line 88-113: The helper set_mcp_connection_state currently only sets
"connected", "disconnected", and "failed" but omits the "connecting" state from
McpConnectionState; update set_mcp_connection_state(server_name: &str,
connected: i64, disconnected: i64, failed: i64, tools_registered: i64) to accept
an additional connecting: i64 parameter and add a
metrics.mcp_connections.with_label_values(&[server_name,
"connecting"]).set(connecting) call, and update any callers of
set_mcp_connection_state to pass the current connecting count so the
"spacebot_mcp_connections" gauge exposes the connecting state.
---
Outside diff comments:
In `@src/agent/channel_dispatch.rs`:
- Around line 644-651: The unit tests calling spawn_worker_task use the old
six-argument signature and must be updated to pass the new seventh parameter
worker_type; modify both test invocations of spawn_worker_task to add the string
"builtin" as the final argument so the call matches pub(crate) fn
spawn_worker_task(..., worker_type: &'static str, future: F), which will resolve
the E0061 missing argument errors.
In `@src/memory/store.rs`:
- Around line 131-179: The update() function must only increment
memory_updates_total when the UPDATE actually affected rows: capture the result
of the execute call into a variable (e.g. let result =
sqlx::query(...).bind(...)... .execute(&self.pool).await.with_context(||
format!("failed to update memory {}", memory.id))?;), then check
result.rows_affected() > 0 before incrementing the memory_updates_total metric
(mirroring delete()'s behavior); keep the with_context on the execute call so
errors still propagate but only increment the counter on successful row updates.
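The guarded increment can be sketched with plain integers; in the real update(), `rows_affected` comes from the sqlx QueryResult of the executed UPDATE and `memory_updates_total` is a Prometheus counter:

```rust
// Count only updates that actually changed a row, mirroring delete().
fn record_update(rows_affected: u64, memory_updates_total: &mut u64) -> bool {
    if rows_affected > 0 {
        *memory_updates_total += 1; // row changed: increment the metric
        true
    } else {
        false // unknown id: leave the counter untouched
    }
}

fn main() {
    let mut counter = 0;
    assert!(record_update(1, &mut counter));
    assert!(!record_update(0, &mut counter));
    assert_eq!(counter, 1);
    println!("ok");
}
```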
---
Duplicate comments:
In `@docs/metrics.md`:
- Around line 65-73: Update the metrics docs to accurately describe the metric
`spacebot_mcp_reconnects_total` as counting successful reconnects (not
attempts); change the Description cell in the `metrics.md` table row for
`spacebot_mcp_reconnects_total` to something like "MCP successful reconnections"
or "Successful MCP reconnects" to match the behavior in `src/mcp.rs` where the
counter is only incremented after a reconnect succeeds.
In `@src/mcp.rs`:
- Around line 194-211: The write guard from self.state.write() is held across an
await which causes a lock-order deadlock; after setting *state =
McpConnectionState::Connected drop the guard (e.g., limit the scope or
explicitly drop(state)) before doing any awaits, and when recording metrics
reuse the already-read tools collection instead of calling
self.tools.read().await again (use the existing tools variable and tools.len())
to avoid re-acquiring locks while the state lock is released.
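The fix can be illustrated synchronously: the write guard lives in its own block so it is dropped before the code that, in the async original, crosses an `.await`. Types and names below are simplified stand-ins for the real ones in src/mcp.rs:

```rust
use std::sync::RwLock;

#[derive(Debug, PartialEq)]
enum McpConnectionState {
    Connecting,
    Connected,
}

fn mark_connected(state: &RwLock<McpConnectionState>, tools: &[&str]) -> usize {
    {
        let mut guard = state.write().unwrap();
        *guard = McpConnectionState::Connected;
    } // guard dropped here, before any await point

    // Reuse the tools collection already in hand instead of re-acquiring
    // self.tools.read().await just to count entries for the metrics.
    tools.len()
}

fn main() {
    let state = RwLock::new(McpConnectionState::Connecting);
    let registered = mark_connected(&state, &["read_file", "write_file"]);
    assert_eq!(registered, 2);
    assert_eq!(*state.read().unwrap(), McpConnectionState::Connected);
    println!("ok");
}
```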
---
Nitpick comments:
In `@src/llm/model.rs`:
- Around line 376-380: Rename the short binding `wt` to the full `worker_type`
in the match that assigns `worker_label` so the match reads match (tier_label,
self.worker_type.as_deref()) { ("worker", Some(worker_type)) => worker_type,
("worker", None) => "unknown", _ => "", } — update the local pattern name and
any dependent uses to `worker_type` to comply with the non-abbreviated naming
guideline.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2a07ef49-8f6d-4641-8a5c-663ee09c2a49
📒 Files selected for processing (18)
- Dockerfile
- METRICS.md
- docs/metrics.md
- src/agent/channel.rs
- src/agent/channel_dispatch.rs
- src/agent/ingestion.rs
- src/agent/worker.rs
- src/api/server.rs
- src/cron/scheduler.rs
- src/hooks/spacebot.rs
- src/llm/model.rs
- src/mcp.rs
- src/memory/embedding.rs
- src/memory/search.rs
- src/memory/store.rs
- src/telemetry/registry.rs
- src/tools/memory_recall.rs
- src/tools/memory_save.rs
🚧 Files skipped from review as they are similar to previous changes (5)
- src/tools/memory_save.rs
- Dockerfile
- src/cron/scheduler.rs
- src/hooks/spacebot.rs
- src/agent/worker.rs
```rust
let model = SpacebotModel::make(&deps.llm_manager, &model_name)
    .with_context(&*deps.agent_id, "branch")
    .with_worker_type("ingestion")
    .with_routing((**routing).clone());
```
worker_type="ingestion" is currently dropped by the LLM metrics path.
This model is still tagged as process_type = "branch", but SpacebotModel::completion() only emits worker_type for the "worker" tier. As written, ingestion requests will still report worker_type="" and won't contribute to worker_cost_dollars.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/agent/ingestion.rs` around lines 479 - 482, SpacebotModel is built with
with_worker_type("ingestion") but SpacebotModel::completion() only emits
worker_type when the tier is "worker", so ingestion requests end up with
worker_type="". Update the code so completion reports worker_type for ingestion:
either change the builder call that creates the model (SpacebotModel::make)
so its tier is "worker" (if that aligns with the metrics rules) or, better,
modify SpacebotModel::completion() to emit the with_worker_type value for
non-worker tiers too (e.g., treat "ingestion" the same as "worker" when
emitting worker_type) so worker_cost_dollars and related metrics pick up
the "ingestion" worker_type. Ensure SpacebotModel::completion() reads and
emits the worker_type set by with_worker_type().
- Add missing worker_type arg to spawn_worker_task in test call sites
- Rename abbreviated `wt` binding to `worker_type` in model.rs
- Fix mcp_reconnects_total doc description to say "Successful MCP reconnections" instead of "reconnection attempts"
- mcp.rs: drop state write lock before tools.read() await (deadlock fix)
- mcp.rs: expose "connecting" state in mcp_connections gauge
- mcp.rs: add connecting parameter to set_mcp_connection_state helper
- channel.rs: use RAII MessageDurationGuard for message_handling_duration
so early returns and ? errors still record the metric
- channel.rs: add metrics to send_builtin_text (messages_sent_total,
channel_errors_total) so built-in replies are counted
- server.rs: normalize /mcp/servers/{name}, /providers/{name}, and
/opencode/{port}/{proxy_path} to prevent high-cardinality metric labels
- model.rs: emit worker_type for any tier when explicitly set, so
ingestion (process_type=branch) correctly reports worker_type=ingestion
- METRICS.md: fix handler→path label references
…unnecessary lock reacquisition
- memory/store.rs: check rows_affected() > 0 before incrementing memory_updates_total, matching delete()'s existing behavior
- mcp.rs: capture tools.len() before moving into cache instead of re-acquiring self.tools.read() lock in the metrics block
Enables the `metrics` cargo feature in the Docker build and adds Prometheus instrumentation across every major subsystem. All metrics are compile-time gated behind `#[cfg(feature = "metrics")]` — zero overhead when building without the feature locally.

Issue

#110
What changed
Docker — `cargo build --release --features metrics` is now hardcoded in the Dockerfile. Port 9090 is exposed for the metrics endpoint.

Registry (`src/telemetry/registry.rs`) — 21 new metrics added, 10 existing metrics upgraded with additional labels (`process_type`, `worker_type`, `agent_id`) for per-use-case breakdowns across worker/channel/branch.

New metrics by subsystem:
- `worker_cost_dollars` — cost per worker type for optimization visibility

Label upgrades on existing metrics:
- `tool_calls_total` / `tool_call_duration_seconds` — added `process_type`
- `memory_reads_total` / `memory_writes_total` — converted to `IntCounterVec` with `agent_id`
- `worker_type` (values: `builtin`, `opencode`, `ingestion`, empty)
- `process_errors_total` — added `worker_type`

Docs —
`METRICS.md` and `docs/metrics.md` updated with full metric inventory, label descriptions, cardinality estimates (~800–9,200 series), and PromQL examples.

Files touched
- `Dockerfile` — enable metrics feature, expose 9090
- `src/telemetry/registry.rs` — metric definitions
- `src/llm/model.rs` — LLM label upgrades, cost tracking, context overflow
- `src/hooks/spacebot.rs` — tool metric label upgrades
- `src/agent/channel.rs` — channel message/error metrics
- `src/agent/channel_dispatch.rs` — branch spawn tracking, worker_type plumbing
- `src/agent/worker.rs`, `src/agent/ingestion.rs` — worker_type on model creation
- `src/mcp.rs` — MCP connection/tool metrics
- `src/memory/store.rs`, `src/memory/search.rs`, `src/memory/embedding.rs` — memory operation metrics
- `src/api/server.rs` — API middleware with path normalization
- `src/cron/scheduler.rs` — cron execution metrics
- `src/tools/memory_recall.rs`, `src/tools/memory_save.rs` — agent_id label upgrades
- `METRICS.md`, `docs/metrics.md` — documentation

Testing
- `cargo fmt --all -- --check` clean
- `cargo clippy -- -D warnings` clean (both with and without `--features metrics`)
- `cargo test` — 300 passed, 0 failed
- `just gate-pr` passes all checks