
feat: add lightweight observability and metrics service#230

Open
ksapru wants to merge 10 commits into NVIDIA:main from ksapru:feat/observability-metrics

Conversation


@ksapru ksapru commented Mar 17, 2026

Summary

This PR proposes a lightweight, optional observability layer for NemoClaw to provide visibility into execution latency and success/failure rates across key runtime paths.

Metrics are exposed in a Prometheus-compatible format and are fully gated behind NEMOCLAW_METRICS_ENABLED=true, ensuring zero impact to existing behavior by default.

Motivation

Recent issues around sandbox creation and readiness (e.g., #140, #152, #229) highlight the difficulty of debugging failures and latency in the current system.

Currently, NemoClaw does not expose:

  • execution latency for agent workflows
  • success/failure counters for key operations
  • visibility into slow or failing onboarding and execution paths

This PR adds minimal instrumentation to improve observability and help diagnose these behaviors without modifying core execution logic. This is fully optional and does not change default behavior.

Changes

  • Lightweight metrics registry (no external dependencies)

    • Counters and histograms
    • Prometheus-compatible output
  • Optional instrumentation (disabled by default)

    • Enabled via NEMOCLAW_METRICS_ENABLED=true
    • No runtime overhead when disabled
  • Instrumented execution paths

    • Blueprint execution (execBlueprint)
    • API validation during onboarding
  • Optional metrics endpoint

    • Exposes /metrics over HTTP (default port 9090)
    • Intended for local debugging and integration with Prometheus/Grafana
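The latency wrapper described in the changes above might look roughly like the following minimal sketch. The names `observeLatency`, `incrementCounter`, and the map-backed registry are assumptions for illustration, not the PR's actual code:

```typescript
type Labels = Record<string, string>;

const counters = new Map<string, number>();
const histograms = new Map<string, number[]>();

// Render a series key like: name{k="v",k2="v2"}
function key(name: string, labels: Labels): string {
  const parts = Object.entries(labels).map(([k, v]) => `${k}="${v}"`);
  return `${name}{${parts.join(",")}}`;
}

export function incrementCounter(name: string, labels: Labels): void {
  const k = key(name, labels);
  counters.set(k, (counters.get(k) ?? 0) + 1);
}

// Time an async operation, count success/error, and record latency.
// Rethrows so callers still see the original failure.
export async function observeLatency<T>(
  name: string,
  labels: Labels,
  fn: () => Promise<T>,
): Promise<T> {
  const start = Date.now();
  try {
    const result = await fn();
    incrementCounter(`${name}_total`, { ...labels, status: "success" });
    return result;
  } catch (err) {
    incrementCounter(`${name}_total`, { ...labels, status: "error" });
    throw err;
  } finally {
    const seconds = (Date.now() - start) / 1000;
    const k = key(`${name}_latency_seconds`, labels);
    histograms.set(k, [...(histograms.get(k) ?? []), seconds]);
  }
}
```

A wrapper of this shape keeps instrumentation out of the instrumented function's body, which matches the PR's goal of not modifying core execution logic.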

Example Output

blueprint_exec_total{status="success"} 1
blueprint_exec_latency_seconds_count{status="success"} 1

Why this approach

  • No additional dependencies
  • Minimal code footprint
  • Fully opt-in and non-breaking
  • Low operational and review risk

Future Extensions

  • Sandbox lifecycle observability (creation attempts, failures, readiness latency)
  • Improved debugging for onboarding and sandbox initialization issues
  • Integration with production monitoring systems

Testing

  • Added unit tests for metrics registry and latency observer
  • Manually validated via /metrics endpoint

Summary by CodeRabbit

  • New Features

    • Optional observability metrics layer exposing Prometheus-format metrics, including counters, latency histograms, and success/error status labels
    • Prometheus-compatible /metrics endpoint (disabled by default); port configurable (default 9090)
  • Documentation

    • README updated with metrics configuration, usage examples and sample output
  • Tests

    • Added tests validating metrics emission, histograms, and latency observation for success and error cases

@wscurran wscurran added labels: observability (improve NemoClaw logging, metrics, and tracing), enhancement (new feature or request). Mar 18, 2026

coderabbitai bot commented Mar 20, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e4b0e876-1297-4401-899c-37615a4c476e

📥 Commits

Reviewing files that changed from the base of the PR and between e53dcdb and a9a5887.

📒 Files selected for processing (1)
  • README.md
✅ Files skipped from review due to trivial changes (1)
  • README.md

📝 Walkthrough

Walkthrough

This PR adds an optional observability layer: an in-process metrics registry (counters, histograms, latency helper), an optional background HTTP metrics endpoint at /metrics (controlled by env vars, default port 9090, disabled by default), README docs, and tests.

Changes

  • Documentation: README.md
    Added "Observability" section describing optional metrics, NEMOCLAW_METRICS_ENABLED and NEMOCLAW_METRICS_PORT, /metrics endpoint usage, and sample Prometheus output.
  • Core Metrics Implementation: nemoclaw/src/observability/metrics.ts
    New MetricsRegistry with counters and histograms, Prometheus text exposition rendering, label handling, exported metrics singleton, and observeLatency() helper that records latency and success/error counts.
  • Service Integration: nemoclaw/src/index.ts
    When metrics are enabled, registers a background HTTP service bound to 0.0.0.0:NEMOCLAW_METRICS_PORT that serves GET /metrics with Prometheus output and returns 404 for other paths; logs startup/shutdown and closes the server on stop.
  • Test Coverage: test/metrics.test.js
    New tests enabling metrics via env var; validate counter increments, histogram observations, Prometheus output formatting, and observeLatency() behavior for success and error cases.
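A registry along the lines summarized above could be sketched as follows. The class shape, bucket bounds, and accessor names are assumptions, not the code under review:

```typescript
class MetricsRegistry {
  private counters = new Map<string, number>();
  private histograms = new Map<
    string,
    { buckets: Map<number, number>; sum: number; count: number }
  >();
  private readonly bounds = [0.05, 0.1, 0.5, 1, 5, 10];

  incrementCounter(name: string): void {
    this.counters.set(name, (this.counters.get(name) ?? 0) + 1);
  }

  observeHistogram(name: string, value: number): void {
    let h = this.histograms.get(name);
    if (!h) {
      h = {
        buckets: new Map(this.bounds.map((b) => [b, 0] as [number, number])),
        sum: 0,
        count: 0,
      };
      this.histograms.set(name, h);
    }
    for (const b of this.bounds) {
      // Cumulative buckets, as in Prometheus: a value counts toward every le >= value.
      if (value <= b) h.buckets.set(b, (h.buckets.get(b) ?? 0) + 1);
    }
    h.sum += value;
    h.count += 1;
  }

  counterValue(name: string): number {
    return this.counters.get(name) ?? 0;
  }

  histogramCount(name: string): number {
    return this.histograms.get(name)?.count ?? 0;
  }
}
```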

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant Registry as Metrics Registry
    participant HttpServer as Metrics HTTP Server
    participant Client as External Client

    Note over App,Registry: Application records metrics during operation
    App->>Registry: incrementCounter(name, labels)
    Registry-->>Registry: update counter state

    App->>Registry: observeHistogram(name, labels, value)
    Registry-->>Registry: update buckets, sum, count

    Client->>HttpServer: GET /metrics
    HttpServer->>Registry: getPrometheusMetrics()
    Registry-->>HttpServer: formatted Prometheus text
    HttpServer-->>Client: 200 OK (text/plain) + metrics
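The request path in the diagram can be sketched as a small pure handler plus a node:http wrapper. The handler split and function names here are assumptions for illustration:

```typescript
import { createServer, IncomingMessage, ServerResponse } from "node:http";

// Pure routing decision: GET /metrics returns the rendered exposition,
// everything else returns 404.
function handleMetricsRequest(
  method: string | undefined,
  url: string | undefined,
  render: () => string,
): { status: number; body: string } {
  if (method === "GET" && url === "/metrics") {
    return { status: 200, body: render() };
  }
  return { status: 404, body: "Not Found\n" };
}

function startMetricsServer(port: number, render: () => string) {
  const server = createServer((req: IncomingMessage, res: ServerResponse) => {
    const { status, body } = handleMetricsRequest(req.method, req.url, render);
    res.writeHead(status, { "Content-Type": "text/plain; version=0.0.4" });
    res.end(body);
  });
  server.listen(port, "0.0.0.0");
  return server;
}
```

Keeping the routing decision pure makes the 200/404 behavior testable without binding a socket.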

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped and tallied every trace,
Counters, buckets — snug in place.
/metrics hums at nine-oh-nine-oh,
I watch latencies ebb and flow.
Tiny rabbit, big observability glow.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 28.57%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the PR title 'feat: add lightweight observability and metrics service' accurately summarizes the main objective of introducing an optional metrics layer, which aligns with the primary changes across README, index.ts, metrics.ts, and test files.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (3)
test/metrics.test.js (2)

15-57: Tests share a singleton—accumulated state may cause flaky results.

All tests share the same metrics singleton without reset between tests. Counters and histograms accumulate across the test run:

  • Line 20 asserts test_counter{foo="bar"} 1, but if this test runs twice or another test uses the same metric name, it will fail.
  • Line 41 asserts test_op_total{...} 1—same concern.

This could cause flaky tests in CI if test order or parallelism changes.

💡 Suggested improvement: Add a reset method or use unique metric names

Option 1: Add a reset() method to MetricsRegistry for test use:

// In metrics.ts
public reset(): void {
  this.counters.clear();
  this.histograms.clear();
}

Option 2: Use unique metric names per test (e.g., include a timestamp or UUID).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/metrics.test.js` around lines 15 - 57, Tests share the same metrics
singleton (metrics) causing accumulated counters/histograms and flaky
assertions; add a reset mechanism to MetricsRegistry (e.g.,
MetricsRegistry.reset() that clears counters and histograms) and call it at the
start of each test (or beforeEach) in the test file, or alternatively ensure
tests use unique metric names (e.g., include a UUID/timestamp) when invoking
metrics.incrementCounter, metrics.observeHistogram, and observeLatency to avoid
cross-test contamination; update tests to call metrics.reset() (or use unique
names) so assertions like test_counter and test_op_total always start from zero.

45-52: Error test doesn't fail if no error is thrown.

If observeLatency fails to rethrow the error (a regression), this test would silently pass because there's no assertion that the catch block was actually entered.

💡 More robust error assertion
-test("observeLatency tracks error metrics", async () => {
-  try {
-    await observeLatency("test_op_err", { op: "fail" }, async () => {
-      throw new Error("oops");
-    });
-  } catch (err) {
-    assert.strictEqual(err.message, "oops");
-  }
+test("observeLatency tracks error metrics", async () => {
+  await assert.rejects(
+    () => observeLatency("test_op_err", { op: "fail" }, async () => {
+      throw new Error("oops");
+    }),
+    { message: "oops" }
+  );
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/metrics.test.js` around lines 45 - 52, The test "observeLatency tracks
error metrics" doesn't assert that the error path was executed, so add a guard
after calling observeLatency to fail the test if no error is thrown (e.g., call
assert.fail or equivalent) and keep the existing catch verifying err.message ===
"oops"; locate the test block for observeLatency in test/metrics.test.js to
update the test case named "observeLatency tracks error metrics" so it
explicitly fails when the awaited call does not throw.
nemoclaw/src/observability/metrics.ts (1)

69-100: Prometheus format emits duplicate HELP/TYPE per label combination.

When multiple label combinations exist for the same metric name (e.g., foo{a="1"} and foo{a="2"}), getPrometheusMetrics() will emit separate # HELP and # TYPE lines for each. Prometheus exposition format expects these metadata lines only once per metric name, with all samples following.

Most parsers tolerate this, but for strict compliance:

💡 Group metrics by name before emitting HELP/TYPE
public getPrometheusMetrics(): string {
  let output = "";
  
  // Group counters by metric name
  const countersByName = new Map<string, string[]>();
  for (const [key, value] of this.counters.entries()) {
    const [name, labelStr] = this.parseKey(key);
    if (!countersByName.has(name)) countersByName.set(name, []);
    countersByName.get(name)!.push(`${name}${labelStr} ${value}`);
  }
  
  for (const [name, samples] of countersByName.entries()) {
    output += `# HELP ${name} Total count of ${name}\n`;
    output += `# TYPE ${name} counter\n`;
    output += samples.join("\n") + "\n\n";
  }
  
  // Similar grouping for histograms...
  return output;
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemoclaw/src/observability/metrics.ts` around lines 69 - 100,
getPrometheusMetrics currently emits # HELP and # TYPE for every metric sample
causing duplicates when the same metric name has multiple label sets; change it
to group samples by metric name before emitting metadata. For counters: iterate
this.counters, use parseKey to get name and labelStr, collect samples per name
into a Map<string, string[]> and after collection emit one `# HELP` and `# TYPE`
per name followed by all samples. For histograms: similarly group entries from
this.histograms by parsed name, then for each name emit a single HELP/TYPE and
then output all bucket/_sum/_count lines for each label combination (ensure
bucket ordering and labelsBase logic from the current histogram loop is
preserved when writing each sample group). Keep function name
getPrometheusMetrics, and data sources this.counters, this.histograms and
parseKey to locate and implement the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemoclaw/src/blueprint/exec.ts`:
- Around line 35-44: The metrics wrapper observeLatency around execBlueprint is
marking failures as successes because execBlueprintInternal (like
validateApiKey) never throws—fix execBlueprintInternal to reject the Promise on
real failures (non-zero exit codes, missing Python, runner errors) instead of
always resolving, then keep execBlueprint calling observeLatency as-is so
observeLatency can record status="error" when the wrapped call throws; locate
and update error paths inside execBlueprintInternal (and any helper functions it
uses that swallow errors) to throw or return a rejected Promise with an Error
describing the failure so metrics reflect actual outcomes.

In `@nemoclaw/src/index.ts`:
- Around line 259-287: The metrics service block calls createServer (used in the
type annotation ReturnType<typeof createServer> and in server =
createServer(...)) but createServer is not imported; add an import for
createServer from "node:http" at the top of the file (alongside the other
imports) so the symbol is defined before api.registerService runs when
metrics.isEnabled() is true.
- Around line 260-270: The file references the metrics singleton
(metrics.isEnabled() and metrics.getPrometheusMetrics()) but never imports it;
add an import for the metrics export used by those calls (import the metrics
symbol from the observability metrics module) alongside the other top-level
imports so metrics is defined before being used in the api.registerService/start
block (where createServer and metrics.getPrometheusMetrics() are called).

In `@nemoclaw/src/onboard/validate.ts`:
- Around line 12-21: The observeLatency wrapper currently only marks errors when
the wrapped function throws, but validateApiKeyInternal returns { valid:
boolean, error?: string } on failures; update validateApiKey to call
validateApiKeyInternal inside observeLatency as you do, then inspect the
returned ValidationResult and emit a separate metric or set status based on
result.valid (or rethrow when valid is false) so failures
(validateApiKeyInternal returning valid: false) are recorded as errors;
reference the validateApiKey and validateApiKeyInternal symbols and ensure you
add the counter/instrumentation update immediately after the observeLatency call
to reflect validation success/failure.
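For the exec.ts item above, rejecting the promise on a non-zero exit code might look like the following sketch. The spawn wrapper and the `exitToOutcome` helper are illustrative assumptions, not the PR's code:

```typescript
import { spawn } from "node:child_process";

// Map a child-process exit code to a success/failure outcome.
function exitToOutcome(code: number | null): { ok: boolean; error?: string } {
  if (code === 0) return { ok: true };
  return { ok: false, error: `blueprint exited with code ${code}` };
}

// Reject on runner errors or non-zero exits so a latency wrapper higher up
// can record status="error" instead of always seeing success.
function runBlueprint(cmd: string, args: string[]): Promise<void> {
  return new Promise((resolve, reject) => {
    const child = spawn(cmd, args);
    child.on("error", reject); // e.g. missing Python binary
    child.on("close", (code) => {
      const outcome = exitToOutcome(code);
      if (outcome.ok) resolve();
      else reject(new Error(outcome.error));
    });
  });
}
```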


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3b594b22-cd47-46df-9cc3-68d18744a6cb

📥 Commits

Reviewing files that changed from the base of the PR and between 17a4067 and ceb38d1.

📒 Files selected for processing (7)
  • README.md
  • nemoclaw/src/blueprint/exec.ts
  • nemoclaw/src/commands/launch.ts
  • nemoclaw/src/index.ts
  • nemoclaw/src/observability/metrics.ts
  • nemoclaw/src/onboard/validate.ts
  • test/metrics.test.js

Comment on lines +12 to +21
export async function validateApiKey(
  apiKey: string,
  endpointUrl: string,
): Promise<ValidationResult> {
  return observeLatency(
    "tool_api_validate",
    { endpoint: endpointUrl },
    () => validateApiKeyInternal(apiKey, endpointUrl),
  );
}

⚠️ Potential issue | 🟡 Minor

Metric status may not reflect validation outcome.

The observeLatency wrapper records status="error" only when the wrapped function throws. However, validateApiKeyInternal catches all errors internally and returns { valid: false, error: message } instead of throwing. This means:

  • HTTP 401/403 responses → recorded as status="success"
  • Request timeouts → recorded as status="success"
  • Network errors → recorded as status="success"

If the intent is to track whether the API call completed (regardless of validity), this is fine. If you want metrics to reflect validation success/failure, consider either:

  1. Having validateApiKeyInternal throw on invalid results, or
  2. Using a separate counter increment based on result.valid after the call.
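Option 2 could be sketched roughly as follows. The counter name, `statusCounts` map, and wrapper function are assumptions for illustration:

```typescript
type ValidationResult = { valid: boolean; error?: string };

const statusCounts = new Map<string, number>();

function incrementCounter(name: string, labels: Record<string, string>): void {
  const series = `${name}{${Object.entries(labels)
    .map(([k, v]) => `${k}="${v}"`)
    .join(",")}}`;
  statusCounts.set(series, (statusCounts.get(series) ?? 0) + 1);
}

// Record validation outcome from result.valid after the call completes,
// so 401s, timeouts, and network errors surfaced as { valid: false }
// are counted as invalid rather than success.
async function validateApiKeyInstrumented(
  validate: () => Promise<ValidationResult>,
): Promise<ValidationResult> {
  const result = await validate();
  incrementCounter("tool_api_validate_result_total", {
    status: result.valid ? "valid" : "invalid",
  });
  return result;
}
```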

ksapru added 4 commits March 21, 2026 12:30

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
nemoclaw/src/index.ts (1)

255-255: Consider validating the metrics port.

If NEMOCLAW_METRICS_PORT is set to a non-numeric value, Number() returns NaN, which will cause server.listen() to throw. Adding validation provides a clearer error message.

💡 Proposed fix to validate the port
-        const port = Number(process.env.NEMOCLAW_METRICS_PORT || 9090);
+        const port = Number(process.env.NEMOCLAW_METRICS_PORT) || 9090;
+        if (!Number.isInteger(port) || port < 1 || port > 65535) {
+          logger.error(`Invalid NEMOCLAW_METRICS_PORT: ${process.env.NEMOCLAW_METRICS_PORT}`);
+          return;
+        }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemoclaw/src/index.ts` at line 255, The current code converts
process.env.NEMOCLAW_METRICS_PORT to a Number for const port but doesn't
validate it, which can produce NaN and cause server.listen() to throw; update
the logic that sets port (the const port declaration using
process.env.NEMOCLAW_METRICS_PORT) to explicitly parse and validate the value
(e.g., parseInt/Number and check isFinite/Number.isInteger and range 1–65535),
and if invalid either fall back to the default 9090 or throw a clear error that
mentions NEMOCLAW_METRICS_PORT and the invalid value before calling
server.listen().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemoclaw/src/index.ts`:
- Around line 256-267: Add an 'error' event handler on the created HTTP server
(the server returned by createServer and used with server.listen) to catch
binding/startup errors (e.g., EADDRINUSE); in the handler call logger.error with
the error and context (including the port) and exit or cleanly fail the process
as appropriate. Ensure the handler is attached before or immediately after
server.listen so createServer/server and server.listen locations are updated
(references: createServer, server, server.listen, logger.info).
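A sketch of that 'error' handler, with the log-message format and helper names assumed:

```typescript
import { createServer, type Server } from "node:http";

// Build the log line for a bind/startup failure.
function formatBindError(
  port: number,
  err: { code?: string; message: string },
): string {
  return `metrics server failed on port ${port}: ${err.code ?? err.message}`;
}

// Attach the 'error' handler before listen so EADDRINUSE and similar
// bind failures are logged rather than crashing as unhandled errors.
function listenWithErrorLogging(port: number, log: (msg: string) => void): Server {
  const server = createServer((_req, res) => res.end("ok"));
  server.on("error", (err: NodeJS.ErrnoException) => {
    log(formatBindError(port, err));
  });
  server.listen(port);
  return server;
}
```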

---

Nitpick comments:
In `@nemoclaw/src/index.ts`:
- Line 255: The current code converts process.env.NEMOCLAW_METRICS_PORT to a
Number for const port but doesn't validate it, which can produce NaN and cause
server.listen() to throw; update the logic that sets port (the const port
declaration using process.env.NEMOCLAW_METRICS_PORT) to explicitly parse and
validate the value (e.g., parseInt/Number and check isFinite/Number.isInteger
and range 1–65535), and if invalid either fall back to the default 9090 or throw
a clear error that mentions NEMOCLAW_METRICS_PORT and the invalid value before
calling server.listen().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 81396cb1-2af9-400e-ae75-5d3123e24521

📥 Commits

Reviewing files that changed from the base of the PR and between ceb38d1 and e53dcdb.

📒 Files selected for processing (2)
  • README.md
  • nemoclaw/src/index.ts
✅ Files skipped from review due to trivial changes (1)
  • README.md

mafueee pushed a commit to mafueee/NemoClaw that referenced this pull request Mar 28, 2026
…IA#230)

Replace the clear_process_identity + maybe_infer_sandbox_user workaround
with strict enforcement: run_as_user and run_as_group must always be
'sandbox'. If omitted from YAML, they are defaulted server-side via
ensure_sandbox_process_identity(). Any other value is rejected.

This fixes 'policy set' failing with 'process policy cannot be changed
on a live sandbox' for custom-image sandboxes, where the CLI was
clearing identity fields but the YAML had 'sandbox' explicitly set.

- Remove clear_process_identity() from CLI and TUI
- Add ensure_sandbox_process_identity() to default missing fields
- Rename RootProcessIdentity -> InvalidProcessIdentity
- Replace maybe_infer_sandbox_user with validate_sandbox_user (fail-fast)
- Update OPA test policies to use run_as_user: sandbox
