feat: add lightweight observability and metrics service #230

ksapru wants to merge 10 commits into NVIDIA:main from
Conversation
|
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
📝 Walkthrough

This PR adds an optional observability layer: an in-process metrics registry (counters, histograms, latency helper), an optional background HTTP metrics endpoint at /metrics (controlled by env vars, default port 9090, disabled by default), README docs, and tests.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant App as Application
    participant Registry as Metrics Registry
    participant HttpServer as Metrics HTTP Server
    participant Client as External Client
    Note over App,Registry: Application records metrics during operation
    App->>Registry: incrementCounter(name, labels)
    Registry-->>Registry: update counter state
    App->>Registry: observeHistogram(name, labels, value)
    Registry-->>Registry: update buckets, sum, count
    Client->>HttpServer: GET /metrics
    HttpServer->>Registry: getPrometheusMetrics()
    Registry-->>HttpServer: formatted Prometheus text
    HttpServer-->>Client: 200 OK (text/plain) + metrics
```
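The recording half of this flow can be sketched with a minimal registry. This is a simplification based on the walkthrough above: the method names (`incrementCounter`, `getPrometheusMetrics`) come from the diagram, but the internals shown here are assumptions, not the PR's actual implementation.

```typescript
// Minimal counter-only registry sketch; the real one also tracks histograms.
class MetricsRegistry {
  private counters = new Map<string, number>();

  // Build a Prometheus-style series key, e.g. `name{a="1",b="2"}`.
  private key(name: string, labels: Record<string, string>): string {
    const labelStr = Object.entries(labels)
      .map(([k, v]) => `${k}="${v}"`)
      .join(",");
    return labelStr ? `${name}{${labelStr}}` : name;
  }

  incrementCounter(name: string, labels: Record<string, string> = {}): void {
    const k = this.key(name, labels);
    this.counters.set(k, (this.counters.get(k) ?? 0) + 1);
  }

  getPrometheusMetrics(): string {
    let out = "";
    this.counters.forEach((v, k) => {
      out += `${k} ${v}\n`;
    });
    return out;
  }
}

const metrics = new MetricsRegistry();
metrics.incrementCounter("blueprint_exec_total", { status: "success" });
console.log(metrics.getPrometheusMetrics());
// → blueprint_exec_total{status="success"} 1
```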
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 4
🧹 Nitpick comments (3)
test/metrics.test.js (2)
15-57: Tests share a singleton — accumulated state may cause flaky results.

All tests share the same `metrics` singleton without reset between tests. Counters and histograms accumulate across the test run:

- Line 20 asserts `test_counter{foo="bar"} 1`, but if this test runs twice or another test uses the same metric name, it will fail.
- Line 41 asserts `test_op_total{...} 1` — same concern.

This could cause flaky tests in CI if test order or parallelism changes.
💡 Suggested improvement: Add a reset method or use unique metric names

Option 1: Add a `reset()` method to `MetricsRegistry` for test use:

```typescript
// In metrics.ts
public reset(): void {
  this.counters.clear();
  this.histograms.clear();
}
```

Option 2: Use unique metric names per test (e.g., include a timestamp or UUID).
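Option 2 can be sketched as follows. The helper name is hypothetical; it only assumes Node's `crypto.randomUUID`. Hyphens are stripped because Prometheus metric names are limited to `[a-zA-Z0-9_:]`.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical helper: a per-test metric name so repeated or parallel
// test runs never collide on accumulated registry state.
function uniqueMetricName(base: string): string {
  return `${base}_${randomUUID().replace(/-/g, "")}`;
}

const name = uniqueMetricName("test_counter");
console.log(name); // e.g. test_counter_9f3b... (32 hex chars appended)
```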
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/metrics.test.js` around lines 15 - 57, Tests share the same metrics singleton (metrics) causing accumulated counters/histograms and flaky assertions; add a reset mechanism to MetricsRegistry (e.g., MetricsRegistry.reset() that clears counters and histograms) and call it at the start of each test (or beforeEach) in the test file, or alternatively ensure tests use unique metric names (e.g., include a UUID/timestamp) when invoking metrics.incrementCounter, metrics.observeHistogram, and observeLatency to avoid cross-test contamination; update tests to call metrics.reset() (or use unique names) so assertions like test_counter and test_op_total always start from zero.
45-52: Error test doesn't fail if no error is thrown.

If `observeLatency` fails to rethrow the error (a regression), this test would silently pass because there's no assertion that the catch block was actually entered.

💡 More robust error assertion
```diff
-test("observeLatency tracks error metrics", async () => {
-  try {
-    await observeLatency("test_op_err", { op: "fail" }, async () => {
-      throw new Error("oops");
-    });
-  } catch (err) {
-    assert.strictEqual(err.message, "oops");
-  }
+test("observeLatency tracks error metrics", async () => {
+  await assert.rejects(
+    () => observeLatency("test_op_err", { op: "fail" }, async () => {
+      throw new Error("oops");
+    }),
+    { message: "oops" }
+  );
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/metrics.test.js` around lines 45 - 52, The test "observeLatency tracks error metrics" doesn't assert that the error path was executed, so add a guard after calling observeLatency to fail the test if no error is thrown (e.g., call assert.fail or equivalent) and keep the existing catch verifying err.message === "oops"; locate the test block for observeLatency in test/metrics.test.js to update the test case named "observeLatency tracks error metrics" so it explicitly fails when the awaited call does not throw.

nemoclaw/src/observability/metrics.ts (1)
69-100: Prometheus format emits duplicate HELP/TYPE per label combination.

When multiple label combinations exist for the same metric name (e.g., `foo{a="1"}` and `foo{a="2"}`), `getPrometheusMetrics()` will emit separate `# HELP` and `# TYPE` lines for each. The Prometheus exposition format expects these metadata lines only once per metric name, with all samples following.

Most parsers tolerate this, but for strict compliance:
💡 Group metrics by name before emitting HELP/TYPE

```typescript
public getPrometheusMetrics(): string {
  let output = "";
  // Group counters by metric name
  const countersByName = new Map<string, string[]>();
  for (const [key, value] of this.counters.entries()) {
    const [name, labelStr] = this.parseKey(key);
    if (!countersByName.has(name)) countersByName.set(name, []);
    countersByName.get(name)!.push(`${name}${labelStr} ${value}`);
  }
  for (const [name, samples] of countersByName.entries()) {
    output += `# HELP ${name} Total count of ${name}\n`;
    output += `# TYPE ${name} counter\n`;
    output += samples.join("\n") + "\n\n";
  }
  // Similar grouping for histograms...
  return output;
}
```
Verify each finding against the current code and only fix it if needed. In `@nemoclaw/src/observability/metrics.ts` around lines 69 - 100, getPrometheusMetrics currently emits # HELP and # TYPE for every metric sample causing duplicates when the same metric name has multiple label sets; change it to group samples by metric name before emitting metadata. For counters: iterate this.counters, use parseKey to get name and labelStr, collect samples per name into a Map<string, string[]> and after collection emit one `# HELP` and `# TYPE` per name followed by all samples. For histograms: similarly group entries from this.histograms by parsed name, then for each name emit a single HELP/TYPE and then output all bucket/_sum/_count lines for each label combination (ensure bucket ordering and labelsBase logic from the current histogram loop is preserved when writing each sample group). Keep function name getPrometheusMetrics, and data sources this.counters, this.histograms and parseKey to locate and implement the change.
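For reference, a strictly compliant exposition for the `foo` example above emits the metadata once, followed by all label combinations (sample values here are illustrative):

```text
# HELP foo Total count of foo
# TYPE foo counter
foo{a="1"} 3
foo{a="2"} 5
```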
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemoclaw/src/blueprint/exec.ts`:
- Around line 35-44: The metrics wrapper observeLatency around execBlueprint is
marking failures as successes because execBlueprintInternal (like
validateApiKey) never throws—fix execBlueprintInternal to reject the Promise on
real failures (non-zero exit codes, missing Python, runner errors) instead of
always resolving, then keep execBlueprint calling observeLatency as-is so
observeLatency can record status="error" when the wrapped call throws; locate
and update error paths inside execBlueprintInternal (and any helper functions it
uses that swallow errors) to throw or return a rejected Promise with an Error
describing the failure so metrics reflect actual outcomes.
In `@nemoclaw/src/index.ts`:
- Around line 259-287: The metrics service block calls createServer (used in the
type annotation ReturnType<typeof createServer> and in server =
createServer(...)) but createServer is not imported; add an import for
createServer from "node:http" at the top of the file (alongside the other
imports) so the symbol is defined before api.registerService runs when
metrics.isEnabled() is true.
- Around line 260-270: The file references the metrics singleton
(metrics.isEnabled() and metrics.getPrometheusMetrics()) but never imports it;
add an import for the metrics export used by those calls (import the metrics
symbol from the observability metrics module) alongside the other top-level
imports so metrics is defined before being used in the api.registerService/start
block (where createServer and metrics.getPrometheusMetrics() are called).
In `@nemoclaw/src/onboard/validate.ts`:
- Around line 12-21: The observeLatency wrapper currently only marks errors when
the wrapped function throws, but validateApiKeyInternal returns { valid:
boolean, error?: string } on failures; update validateApiKey to call
validateApiKeyInternal inside observeLatency as you do, then inspect the
returned ValidationResult and emit a separate metric or set status based on
result.valid (or rethrow when valid is false) so failures
(validateApiKeyInternal returning valid: false) are recorded as errors;
reference the validateApiKey and validateApiKeyInternal symbols and ensure you
add the counter/instrumentation update immediately after the observeLatency call
to reflect validation success/failure.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3b594b22-cd47-46df-9cc3-68d18744a6cb
📒 Files selected for processing (7)
- README.md
- nemoclaw/src/blueprint/exec.ts
- nemoclaw/src/commands/launch.ts
- nemoclaw/src/index.ts
- nemoclaw/src/observability/metrics.ts
- nemoclaw/src/onboard/validate.ts
- test/metrics.test.js
```typescript
export async function validateApiKey(
  apiKey: string,
  endpointUrl: string,
): Promise<ValidationResult> {
  return observeLatency(
    "tool_api_validate",
    { endpoint: endpointUrl },
    () => validateApiKeyInternal(apiKey, endpointUrl),
  );
}
```
Metric status may not reflect validation outcome.
The observeLatency wrapper records status="error" only when the wrapped function throws. However, validateApiKeyInternal catches all errors internally and returns { valid: false, error: message } instead of throwing. This means:
- HTTP 401/403 responses → recorded as `status="success"`
- Request timeouts → recorded as `status="success"`
- Network errors → recorded as `status="success"`
If the intent is to track whether the API call completed (regardless of validity), this is fine. If you want metrics to reflect validation success/failure, consider either:

- Having `validateApiKeyInternal` throw on invalid results, or
- Using a separate counter increment based on `result.valid` after the call.
… feat/observability-metrics
Actionable comments posted: 1
🧹 Nitpick comments (1)
nemoclaw/src/index.ts (1)
255-255: Consider validating the metrics port.

If `NEMOCLAW_METRICS_PORT` is set to a non-numeric value, `Number()` returns `NaN`, which will cause `server.listen()` to throw. Adding validation provides a clearer error message.

💡 Proposed fix to validate the port
```diff
-  const port = Number(process.env.NEMOCLAW_METRICS_PORT || 9090);
+  const port = Number(process.env.NEMOCLAW_METRICS_PORT) || 9090;
+  if (!Number.isInteger(port) || port < 1 || port > 65535) {
+    logger.error(`Invalid NEMOCLAW_METRICS_PORT: ${process.env.NEMOCLAW_METRICS_PORT}`);
+    return;
+  }
```
Verify each finding against the current code and only fix it if needed. In `@nemoclaw/src/index.ts` at line 255, The current code converts process.env.NEMOCLAW_METRICS_PORT to a Number for const port but doesn't validate it, which can produce NaN and cause server.listen() to throw; update the logic that sets port (the const port declaration using process.env.NEMOCLAW_METRICS_PORT) to explicitly parse and validate the value (e.g., parseInt/Number and check isFinite/Number.isInteger and range 1–65535), and if invalid either fall back to the default 9090 or throw a clear error that mentions NEMOCLAW_METRICS_PORT and the invalid value before calling server.listen().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemoclaw/src/index.ts`:
- Around line 256-267: Add an 'error' event handler on the created HTTP server
(the server returned by createServer and used with server.listen) to catch
binding/startup errors (e.g., EADDRINUSE); in the handler call logger.error with
the error and context (including the port) and exit or cleanly fail the process
as appropriate. Ensure the handler is attached before or immediately after
server.listen so createServer/server and server.listen locations are updated
(references: createServer, server, server.listen, logger.info).
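The suggested `'error'` handler could look like this. It is a hedged sketch: the function name, the `getMetrics` callback, and the use of `console.error` in place of the project's logger are assumptions based on the surrounding review.

```typescript
import { createServer, Server } from "node:http";

// Start a /metrics-style HTTP server with an 'error' handler attached
// before listen, so bind failures (e.g. EADDRINUSE) are logged instead
// of crashing the process with an unhandled 'error' event.
function startMetricsServer(
  getMetrics: () => string,
  port: number = Number(process.env.NEMOCLAW_METRICS_PORT) || 9090,
): Server {
  const server = createServer((_req, res) => {
    res.writeHead(200, { "Content-Type": "text/plain" });
    res.end(getMetrics());
  });
  server.on("error", (err: NodeJS.ErrnoException) => {
    console.error(
      `metrics server failed to start on port ${port}: ${err.code ?? err.message}`,
    );
  });
  server.listen(port);
  return server;
}
```

Passing port 0 lets the OS pick a free port, which is handy in tests.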
---
Nitpick comments:
In `@nemoclaw/src/index.ts`:
- Line 255: The current code converts process.env.NEMOCLAW_METRICS_PORT to a
Number for const port but doesn't validate it, which can produce NaN and cause
server.listen() to throw; update the logic that sets port (the const port
declaration using process.env.NEMOCLAW_METRICS_PORT) to explicitly parse and
validate the value (e.g., parseInt/Number and check isFinite/Number.isInteger
and range 1–65535), and if invalid either fall back to the default 9090 or throw
a clear error that mentions NEMOCLAW_METRICS_PORT and the invalid value before
calling server.listen().
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 81396cb1-2af9-400e-ae75-5d3123e24521
📒 Files selected for processing (2)
- README.md
- nemoclaw/src/index.ts
✅ Files skipped from review due to trivial changes (1)
- README.md
…IA#230) Replace the clear_process_identity + maybe_infer_sandbox_user workaround with strict enforcement: run_as_user and run_as_group must always be 'sandbox'. If omitted from YAML, they are defaulted server-side via ensure_sandbox_process_identity(). Any other value is rejected. This fixes 'policy set' failing with 'process policy cannot be changed on a live sandbox' for custom-image sandboxes, where the CLI was clearing identity fields but the YAML had 'sandbox' explicitly set.

- Remove clear_process_identity() from CLI and TUI
- Add ensure_sandbox_process_identity() to default missing fields
- Rename RootProcessIdentity -> InvalidProcessIdentity
- Replace maybe_infer_sandbox_user with validate_sandbox_user (fail-fast)
- Update OPA test policies to use run_as_user: sandbox
Summary
This PR proposes a lightweight, optional observability layer for NemoClaw to provide visibility into execution latency and success/failure rates across key runtime paths.
Metrics are exposed in a Prometheus-compatible format and are fully gated behind NEMOCLAW_METRICS_ENABLED=true, ensuring zero impact to existing behavior by default.

Motivation
Recent issues around sandbox creation and readiness (e.g., #140, #152, #229) highlight the difficulty of debugging failures and latency in the current system.
Currently, NemoClaw does not expose:
This PR adds minimal instrumentation to improve observability and help diagnose these behaviors without modifying core execution logic. This is fully optional and does not change default behavior.
Changes
Lightweight metrics registry (no external dependencies)
Optional instrumentation (disabled by default)
NEMOCLAW_METRICS_ENABLED=true

Instrumented execution paths

execBlueprint)

Optional metrics endpoint

/metrics over HTTP (default port 9090)

Example Output

blueprint_exec_total{status="success"} 1
blueprint_exec_latency_seconds_count{status="success"} 1

Why this approach
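A scrape of such an endpoint can be exercised end to end with Node's built-in HTTP client. This is illustrative only: the response body simply mirrors the example lines above, and port 0 lets the OS pick a free port.

```typescript
import { createServer, get } from "node:http";

// Sample body shaped like the example output above.
const body =
  'blueprint_exec_total{status="success"} 1\n' +
  'blueprint_exec_latency_seconds_count{status="success"} 1\n';

const server = createServer((req, res) => {
  if (req.url === "/metrics") {
    res.writeHead(200, { "Content-Type": "text/plain" });
    res.end(body);
  } else {
    res.writeHead(404);
    res.end();
  }
});

server.listen(0, () => {
  const { port } = server.address() as { port: number };
  // Scrape the endpoint the way a Prometheus server would.
  get(`http://127.0.0.1:${port}/metrics`, (res) => {
    let data = "";
    res.on("data", (chunk) => (data += chunk));
    res.on("end", () => {
      console.log(data.trim());
      server.close();
    });
  });
});
```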
Future Extensions
Testing
/metrics endpoint

Summary by CodeRabbit
New Features
Documentation
Tests