
feat: add lightweight observability and metrics service#230

Open
ksapru wants to merge 10 commits into NVIDIA:main from ksapru:feat/observability-metrics

Conversation


@ksapru ksapru commented Mar 17, 2026

Summary

This PR proposes a lightweight, optional observability layer for NemoClaw to provide visibility into execution latency and success/failure rates across key runtime paths.

Metrics are exposed in a Prometheus-compatible format and are fully gated behind NEMOCLAW_METRICS_ENABLED=true, ensuring zero impact to existing behavior by default.

Motivation

Recent issues around sandbox creation and readiness (e.g., #140, #152, #229) highlight the difficulty of debugging failures and latency in the current system.

Currently, NemoClaw does not expose:

  • execution latency for agent workflows
  • success/failure counters for key operations
  • visibility into slow or failing onboarding and execution paths

This PR adds minimal instrumentation to improve observability and help diagnose these behaviors without modifying core execution logic. This is fully optional and does not change default behavior.

Changes

  • Lightweight metrics registry (no external dependencies)

    • Counters and histograms
    • Prometheus-compatible output
  • Optional instrumentation (disabled by default)

    • Enabled via NEMOCLAW_METRICS_ENABLED=true
    • No runtime overhead when disabled
  • Instrumented execution paths

    • Blueprint execution (execBlueprint)
    • API validation during onboarding
  • Optional metrics endpoint

    • Exposes /metrics over HTTP (default port 9090)
    • Intended for local debugging and integration with Prometheus/Grafana
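The latency wrapper described in the changes above might look roughly like the following minimal sketch. The names `observeLatency`, `incrementCounter`, and the map-backed registry are assumptions for illustration, not the PR's actual code:

```typescript
type Labels = Record<string, string>;

const counters = new Map<string, number>();
const histograms = new Map<string, number[]>();

// Render a series key like: name{k="v",k2="v2"}
function key(name: string, labels: Labels): string {
  const parts = Object.entries(labels).map(([k, v]) => `${k}="${v}"`);
  return `${name}{${parts.join(",")}}`;
}

export function incrementCounter(name: string, labels: Labels): void {
  const k = key(name, labels);
  counters.set(k, (counters.get(k) ?? 0) + 1);
}

// Time an async operation, count success/error, and record latency.
// Rethrows so callers still see the original failure.
export async function observeLatency<T>(
  name: string,
  labels: Labels,
  fn: () => Promise<T>,
): Promise<T> {
  const start = Date.now();
  try {
    const result = await fn();
    incrementCounter(`${name}_total`, { ...labels, status: "success" });
    return result;
  } catch (err) {
    incrementCounter(`${name}_total`, { ...labels, status: "error" });
    throw err;
  } finally {
    const seconds = (Date.now() - start) / 1000;
    const k = key(`${name}_latency_seconds`, labels);
    histograms.set(k, [...(histograms.get(k) ?? []), seconds]);
  }
}
```

A wrapper of this shape keeps instrumentation out of the instrumented function's body, which matches the PR's goal of not modifying core execution logic.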

Example Output

blueprint_exec_total{status="success"} 1
blueprint_exec_latency_seconds_count{status="success"} 1

Why this approach

  • No additional dependencies
  • Minimal code footprint
  • Fully opt-in and non-breaking
  • Low operational and review risk

Future Extensions

  • Sandbox lifecycle observability (creation attempts, failures, readiness latency)
  • Improved debugging for onboarding and sandbox initialization issues
  • Integration with production monitoring systems

Testing

  • Added unit tests for metrics registry and latency observer
  • Manually validated via /metrics endpoint

Summary by CodeRabbit

  • New Features

    • Optional observability metrics layer exposing Prometheus-format metrics, including counters, latency histograms, and success/error status labels
    • Prometheus-compatible /metrics endpoint (disabled by default); port configurable (default 9090)
  • Documentation

    • README updated with metrics configuration, usage examples and sample output
  • Tests

    • Added tests validating metrics emission, histograms, and latency observation for success and error cases

@wscurran wscurran added labels: observability (improve NemoClaw logging, metrics, and tracing), enhancement (new feature or request). Mar 18, 2026

coderabbitai bot commented Mar 20, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e4b0e876-1297-4401-899c-37615a4c476e

📥 Commits

Reviewing files that changed from the base of the PR and between e53dcdb and a9a5887.

📒 Files selected for processing (1)
  • README.md
✅ Files skipped from review due to trivial changes (1)
  • README.md

📝 Walkthrough

Walkthrough

This PR adds an optional observability layer: an in-process metrics registry (counters, histograms, latency helper), an optional background HTTP metrics endpoint at /metrics (controlled by env vars, default port 9090, disabled by default), README docs, and tests.

Changes

  • Documentation: README.md
    Added "Observability" section describing optional metrics, NEMOCLAW_METRICS_ENABLED and NEMOCLAW_METRICS_PORT, /metrics endpoint usage, and sample Prometheus output.
  • Core Metrics Implementation: nemoclaw/src/observability/metrics.ts
    New MetricsRegistry with counters and histograms, Prometheus text exposition rendering, label handling, exported metrics singleton, and observeLatency() helper that records latency and success/error counts.
  • Service Integration: nemoclaw/src/index.ts
    When metrics are enabled, registers a background HTTP service bound to 0.0.0.0:NEMOCLAW_METRICS_PORT that serves GET /metrics with Prometheus output and returns 404 for other paths; logs startup/shutdown and closes the server on stop.
  • Test Coverage: test/metrics.test.js
    New tests enabling metrics via env var; validate counter increments, histogram observations, Prometheus output formatting, and observeLatency() behavior for success and error cases.
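A registry along the lines summarized above could be sketched as follows. The class shape, bucket bounds, and accessor names are assumptions, not the code under review:

```typescript
class MetricsRegistry {
  private counters = new Map<string, number>();
  private histograms = new Map<
    string,
    { buckets: Map<number, number>; sum: number; count: number }
  >();
  private readonly bounds = [0.05, 0.1, 0.5, 1, 5, 10];

  incrementCounter(name: string): void {
    this.counters.set(name, (this.counters.get(name) ?? 0) + 1);
  }

  observeHistogram(name: string, value: number): void {
    let h = this.histograms.get(name);
    if (!h) {
      h = {
        buckets: new Map(this.bounds.map((b) => [b, 0] as [number, number])),
        sum: 0,
        count: 0,
      };
      this.histograms.set(name, h);
    }
    for (const b of this.bounds) {
      // Cumulative buckets, as in Prometheus: a value counts toward every le >= value.
      if (value <= b) h.buckets.set(b, (h.buckets.get(b) ?? 0) + 1);
    }
    h.sum += value;
    h.count += 1;
  }

  counterValue(name: string): number {
    return this.counters.get(name) ?? 0;
  }

  histogramCount(name: string): number {
    return this.histograms.get(name)?.count ?? 0;
  }
}
```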

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant Registry as Metrics Registry
    participant HttpServer as Metrics HTTP Server
    participant Client as External Client

    Note over App,Registry: Application records metrics during operation
    App->>Registry: incrementCounter(name, labels)
    Registry-->>Registry: update counter state

    App->>Registry: observeHistogram(name, labels, value)
    Registry-->>Registry: update buckets, sum, count

    Client->>HttpServer: GET /metrics
    HttpServer->>Registry: getPrometheusMetrics()
    Registry-->>HttpServer: formatted Prometheus text
    HttpServer-->>Client: 200 OK (text/plain) + metrics
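The request path in the diagram can be sketched as a small pure handler plus a node:http wrapper. The handler split and function names here are assumptions for illustration:

```typescript
import { createServer, IncomingMessage, ServerResponse } from "node:http";

// Pure routing decision: GET /metrics returns the rendered exposition,
// everything else returns 404.
function handleMetricsRequest(
  method: string | undefined,
  url: string | undefined,
  render: () => string,
): { status: number; body: string } {
  if (method === "GET" && url === "/metrics") {
    return { status: 200, body: render() };
  }
  return { status: 404, body: "Not Found\n" };
}

function startMetricsServer(port: number, render: () => string) {
  const server = createServer((req: IncomingMessage, res: ServerResponse) => {
    const { status, body } = handleMetricsRequest(req.method, req.url, render);
    res.writeHead(status, { "Content-Type": "text/plain; version=0.0.4" });
    res.end(body);
  });
  server.listen(port, "0.0.0.0");
  return server;
}
```

Keeping the routing decision pure makes the 200/404 behavior testable without binding a socket.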

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped and tallied every trace,
Counters, buckets — snug in place.
/metrics hums at nine-oh-nine-oh,
I watch latencies ebb and flow.
Tiny rabbit, big observability glow.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 28.57%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the PR title 'feat: add lightweight observability and metrics service' accurately summarizes the main objective of introducing an optional metrics layer, which aligns with the primary changes across README, index.ts, metrics.ts, and test files.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (3)
test/metrics.test.js (2)

15-57: Tests share a singleton—accumulated state may cause flaky results.

All tests share the same metrics singleton without reset between tests. Counters and histograms accumulate across the test run:

  • Line 20 asserts test_counter{foo="bar"} 1, but if this test runs twice or another test uses the same metric name, it will fail.
  • Line 41 asserts test_op_total{...} 1—same concern.

This could cause flaky tests in CI if test order or parallelism changes.

💡 Suggested improvement: Add a reset method or use unique metric names

Option 1: Add a reset() method to MetricsRegistry for test use:

// In metrics.ts
public reset(): void {
  this.counters.clear();
  this.histograms.clear();
}

Option 2: Use unique metric names per test (e.g., include a timestamp or UUID).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/metrics.test.js` around lines 15 - 57, Tests share the same metrics
singleton (metrics) causing accumulated counters/histograms and flaky
assertions; add a reset mechanism to MetricsRegistry (e.g.,
MetricsRegistry.reset() that clears counters and histograms) and call it at the
start of each test (or beforeEach) in the test file, or alternatively ensure
tests use unique metric names (e.g., include a UUID/timestamp) when invoking
metrics.incrementCounter, metrics.observeHistogram, and observeLatency to avoid
cross-test contamination; update tests to call metrics.reset() (or use unique
names) so assertions like test_counter and test_op_total always start from zero.

45-52: Error test doesn't fail if no error is thrown.

If observeLatency fails to rethrow the error (a regression), this test would silently pass because there's no assertion that the catch block was actually entered.

💡 More robust error assertion
-test("observeLatency tracks error metrics", async () => {
-  try {
-    await observeLatency("test_op_err", { op: "fail" }, async () => {
-      throw new Error("oops");
-    });
-  } catch (err) {
-    assert.strictEqual(err.message, "oops");
-  }
+test("observeLatency tracks error metrics", async () => {
+  await assert.rejects(
+    () => observeLatency("test_op_err", { op: "fail" }, async () => {
+      throw new Error("oops");
+    }),
+    { message: "oops" }
+  );
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/metrics.test.js` around lines 45 - 52, The test "observeLatency tracks
error metrics" doesn't assert that the error path was executed, so add a guard
after calling observeLatency to fail the test if no error is thrown (e.g., call
assert.fail or equivalent) and keep the existing catch verifying err.message ===
"oops"; locate the test block for observeLatency in test/metrics.test.js to
update the test case named "observeLatency tracks error metrics" so it
explicitly fails when the awaited call does not throw.
nemoclaw/src/observability/metrics.ts (1)

69-100: Prometheus format emits duplicate HELP/TYPE per label combination.

When multiple label combinations exist for the same metric name (e.g., foo{a="1"} and foo{a="2"}), getPrometheusMetrics() will emit separate # HELP and # TYPE lines for each. Prometheus exposition format expects these metadata lines only once per metric name, with all samples following.

Most parsers tolerate this, but for strict compliance:

💡 Group metrics by name before emitting HELP/TYPE
public getPrometheusMetrics(): string {
  let output = "";
  
  // Group counters by metric name
  const countersByName = new Map<string, string[]>();
  for (const [key, value] of this.counters.entries()) {
    const [name, labelStr] = this.parseKey(key);
    if (!countersByName.has(name)) countersByName.set(name, []);
    countersByName.get(name)!.push(`${name}${labelStr} ${value}`);
  }
  
  for (const [name, samples] of countersByName.entries()) {
    output += `# HELP ${name} Total count of ${name}\n`;
    output += `# TYPE ${name} counter\n`;
    output += samples.join("\n") + "\n\n";
  }
  
  // Similar grouping for histograms...
  return output;
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemoclaw/src/observability/metrics.ts` around lines 69 - 100,
getPrometheusMetrics currently emits # HELP and # TYPE for every metric sample
causing duplicates when the same metric name has multiple label sets; change it
to group samples by metric name before emitting metadata. For counters: iterate
this.counters, use parseKey to get name and labelStr, collect samples per name
into a Map<string, string[]> and after collection emit one `# HELP` and `# TYPE`
per name followed by all samples. For histograms: similarly group entries from
this.histograms by parsed name, then for each name emit a single HELP/TYPE and
then output all bucket/_sum/_count lines for each label combination (ensure
bucket ordering and labelsBase logic from the current histogram loop is
preserved when writing each sample group). Keep function name
getPrometheusMetrics, and data sources this.counters, this.histograms and
parseKey to locate and implement the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemoclaw/src/blueprint/exec.ts`:
- Around line 35-44: The metrics wrapper observeLatency around execBlueprint is
marking failures as successes because execBlueprintInternal (like
validateApiKey) never throws—fix execBlueprintInternal to reject the Promise on
real failures (non-zero exit codes, missing Python, runner errors) instead of
always resolving, then keep execBlueprint calling observeLatency as-is so
observeLatency can record status="error" when the wrapped call throws; locate
and update error paths inside execBlueprintInternal (and any helper functions it
uses that swallow errors) to throw or return a rejected Promise with an Error
describing the failure so metrics reflect actual outcomes.

In `@nemoclaw/src/index.ts`:
- Around line 259-287: The metrics service block calls createServer (used in the
type annotation ReturnType<typeof createServer> and in server =
createServer(...)) but createServer is not imported; add an import for
createServer from "node:http" at the top of the file (alongside the other
imports) so the symbol is defined before api.registerService runs when
metrics.isEnabled() is true.
- Around line 260-270: The file references the metrics singleton
(metrics.isEnabled() and metrics.getPrometheusMetrics()) but never imports it;
add an import for the metrics export used by those calls (import the metrics
symbol from the observability metrics module) alongside the other top-level
imports so metrics is defined before being used in the api.registerService/start
block (where createServer and metrics.getPrometheusMetrics() are called).

In `@nemoclaw/src/onboard/validate.ts`:
- Around line 12-21: The observeLatency wrapper currently only marks errors when
the wrapped function throws, but validateApiKeyInternal returns { valid:
boolean, error?: string } on failures; update validateApiKey to call
validateApiKeyInternal inside observeLatency as you do, then inspect the
returned ValidationResult and emit a separate metric or set status based on
result.valid (or rethrow when valid is false) so failures
(validateApiKeyInternal returning valid: false) are recorded as errors;
reference the validateApiKey and validateApiKeyInternal symbols and ensure you
add the counter/instrumentation update immediately after the observeLatency call
to reflect validation success/failure.
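For the exec.ts item above, rejecting the promise on a non-zero exit code might look like the following sketch. The spawn wrapper and the `exitToOutcome` helper are illustrative assumptions, not the PR's code:

```typescript
import { spawn } from "node:child_process";

// Map a child-process exit code to a success/failure outcome.
function exitToOutcome(code: number | null): { ok: boolean; error?: string } {
  if (code === 0) return { ok: true };
  return { ok: false, error: `blueprint exited with code ${code}` };
}

// Reject on runner errors or non-zero exits so a latency wrapper higher up
// can record status="error" instead of always seeing success.
function runBlueprint(cmd: string, args: string[]): Promise<void> {
  return new Promise((resolve, reject) => {
    const child = spawn(cmd, args);
    child.on("error", reject); // e.g. missing Python binary
    child.on("close", (code) => {
      const outcome = exitToOutcome(code);
      if (outcome.ok) resolve();
      else reject(new Error(outcome.error));
    });
  });
}
```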


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3b594b22-cd47-46df-9cc3-68d18744a6cb

📥 Commits

Reviewing files that changed from the base of the PR and between 17a4067 and ceb38d1.

📒 Files selected for processing (7)
  • README.md
  • nemoclaw/src/blueprint/exec.ts
  • nemoclaw/src/commands/launch.ts
  • nemoclaw/src/index.ts
  • nemoclaw/src/observability/metrics.ts
  • nemoclaw/src/onboard/validate.ts
  • test/metrics.test.js

Comment on lines +12 to +21
export async function validateApiKey(
  apiKey: string,
  endpointUrl: string,
): Promise<ValidationResult> {
  return observeLatency(
    "tool_api_validate",
    { endpoint: endpointUrl },
    () => validateApiKeyInternal(apiKey, endpointUrl),
  );
}

⚠️ Potential issue | 🟡 Minor

Metric status may not reflect validation outcome.

The observeLatency wrapper records status="error" only when the wrapped function throws. However, validateApiKeyInternal catches all errors internally and returns { valid: false, error: message } instead of throwing. This means:

  • HTTP 401/403 responses → recorded as status="success"
  • Request timeouts → recorded as status="success"
  • Network errors → recorded as status="success"

If the intent is to track whether the API call completed (regardless of validity), this is fine. If you want metrics to reflect validation success/failure, consider either:

  1. Having validateApiKeyInternal throw on invalid results, or
  2. Using a separate counter increment based on result.valid after the call.
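Option 2 could be sketched roughly as follows. The counter name, `statusCounts` map, and wrapper function are assumptions for illustration:

```typescript
type ValidationResult = { valid: boolean; error?: string };

const statusCounts = new Map<string, number>();

function incrementCounter(name: string, labels: Record<string, string>): void {
  const series = `${name}{${Object.entries(labels)
    .map(([k, v]) => `${k}="${v}"`)
    .join(",")}}`;
  statusCounts.set(series, (statusCounts.get(series) ?? 0) + 1);
}

// Record validation outcome from result.valid after the call completes,
// so 401s, timeouts, and network errors surfaced as { valid: false }
// are counted as invalid rather than success.
async function validateApiKeyInstrumented(
  validate: () => Promise<ValidationResult>,
): Promise<ValidationResult> {
  const result = await validate();
  incrementCounter("tool_api_validate_result_total", {
    status: result.valid ? "valid" : "invalid",
  });
  return result;
}
```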

ksapru added 4 commits March 21, 2026 12:30

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
nemoclaw/src/index.ts (1)

255-255: Consider validating the metrics port.

If NEMOCLAW_METRICS_PORT is set to a non-numeric value, Number() returns NaN, which will cause server.listen() to throw. Adding validation provides a clearer error message.

💡 Proposed fix to validate the port
-        const port = Number(process.env.NEMOCLAW_METRICS_PORT || 9090);
+        const port = Number(process.env.NEMOCLAW_METRICS_PORT) || 9090;
+        if (!Number.isInteger(port) || port < 1 || port > 65535) {
+          logger.error(`Invalid NEMOCLAW_METRICS_PORT: ${process.env.NEMOCLAW_METRICS_PORT}`);
+          return;
+        }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemoclaw/src/index.ts` at line 255, The current code converts
process.env.NEMOCLAW_METRICS_PORT to a Number for const port but doesn't
validate it, which can produce NaN and cause server.listen() to throw; update
the logic that sets port (the const port declaration using
process.env.NEMOCLAW_METRICS_PORT) to explicitly parse and validate the value
(e.g., parseInt/Number and check isFinite/Number.isInteger and range 1–65535),
and if invalid either fall back to the default 9090 or throw a clear error that
mentions NEMOCLAW_METRICS_PORT and the invalid value before calling
server.listen().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemoclaw/src/index.ts`:
- Around line 256-267: Add an 'error' event handler on the created HTTP server
(the server returned by createServer and used with server.listen) to catch
binding/startup errors (e.g., EADDRINUSE); in the handler call logger.error with
the error and context (including the port) and exit or cleanly fail the process
as appropriate. Ensure the handler is attached before or immediately after
server.listen so createServer/server and server.listen locations are updated
(references: createServer, server, server.listen, logger.info).
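A sketch of that 'error' handler, with the log-message format and helper names assumed:

```typescript
import { createServer, type Server } from "node:http";

// Build the log line for a bind/startup failure.
function formatBindError(
  port: number,
  err: { code?: string; message: string },
): string {
  return `metrics server failed on port ${port}: ${err.code ?? err.message}`;
}

// Attach the 'error' handler before listen so EADDRINUSE and similar
// bind failures are logged rather than crashing as unhandled errors.
function listenWithErrorLogging(port: number, log: (msg: string) => void): Server {
  const server = createServer((_req, res) => res.end("ok"));
  server.on("error", (err: NodeJS.ErrnoException) => {
    log(formatBindError(port, err));
  });
  server.listen(port);
  return server;
}
```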

---

Nitpick comments:
In `@nemoclaw/src/index.ts`:
- Line 255: The current code converts process.env.NEMOCLAW_METRICS_PORT to a
Number for const port but doesn't validate it, which can produce NaN and cause
server.listen() to throw; update the logic that sets port (the const port
declaration using process.env.NEMOCLAW_METRICS_PORT) to explicitly parse and
validate the value (e.g., parseInt/Number and check isFinite/Number.isInteger
and range 1–65535), and if invalid either fall back to the default 9090 or throw
a clear error that mentions NEMOCLAW_METRICS_PORT and the invalid value before
calling server.listen().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 81396cb1-2af9-400e-ae75-5d3123e24521

📥 Commits

Reviewing files that changed from the base of the PR and between ceb38d1 and e53dcdb.

📒 Files selected for processing (2)
  • README.md
  • nemoclaw/src/index.ts
✅ Files skipped from review due to trivial changes (1)
  • README.md

mafueee pushed a commit to mafueee/NemoClaw that referenced this pull request Mar 28, 2026
…IA#230)

Replace the clear_process_identity + maybe_infer_sandbox_user workaround
with strict enforcement: run_as_user and run_as_group must always be
'sandbox'. If omitted from YAML, they are defaulted server-side via
ensure_sandbox_process_identity(). Any other value is rejected.

This fixes 'policy set' failing with 'process policy cannot be changed
on a live sandbox' for custom-image sandboxes, where the CLI was
clearing identity fields but the YAML had 'sandbox' explicitly set.

- Remove clear_process_identity() from CLI and TUI
- Add ensure_sandbox_process_identity() to default missing fields
- Rename RootProcessIdentity -> InvalidProcessIdentity
- Replace maybe_infer_sandbox_user with validate_sandbox_user (fail-fast)
- Update OPA test policies to use run_as_user: sandbox
