Problem
Currently, there is limited visibility into the performance and health of NemoClaw operations. Specifically, tracking blueprint execution latency, API validation success rates, and sandbox lifecycle operations (such as launch) requires manual log parsing or external wrappers. This lack of telemetry makes it difficult to monitor NemoClaw in automated CI/CD pipelines or production-like environments where performance regressions or intermittent API failures need to be surfaced programmatically.
Proposed Solution
Introduce an optional, lightweight metrics service built directly into the plugin. When enabled via the NEMOCLAW_METRICS_ENABLED environment variable, NemoClaw will:
- Maintain an internal registry of counters and histograms for key operations.
- Start a minimal HTTP server (defaulting to port 9090) to export these metrics in Prometheus text format at a
/metrics endpoint.
- Instrument critical paths, including
execBlueprint (renamed to blueprint_execution for clarity) and API key validation.
Design Goals
- Zero Overhead: When disabled (default), the metrics logic is bypassed to ensure no performance impact for standard CLI users.
- Zero Dependencies: The implementation uses native
node:http and process.hrtime to maintain a minimal footprint without adding to the dependency tree.
- Prometheus Compatibility: Adheres to standard exposition formats for immediate integration with existing monitoring stacks.
Open Questions
- Does this built-in approach align with the project's long-term roadmap, or is there a preference for moving toward OpenTelemetry despite the additional dependency weight?
- Are there specific sandbox lifecycle events (e.g.,
eject, migrate) that should be prioritized for initial instrumentation?
I have opened PR #230 with a working implementation of this proposal. I look forward to your feedback on the architectural direction.
Problem
Currently, there is limited visibility into the performance and health of NemoClaw operations. Specifically, tracking blueprint execution latency, API validation success rates, and sandbox lifecycle operations (such as
launch) requires manual log parsing or external wrappers. This lack of telemetry makes it difficult to monitor NemoClaw in automated CI/CD pipelines or production-like environments where performance regressions or intermittent API failures need to be surfaced programmatically.Proposed Solution
Introduce an optional, lightweight metrics service built directly into the plugin. When enabled via the
NEMOCLAW_METRICS_ENABLEDenvironment variable, NemoClaw will:/metricsendpoint.execBlueprint(renamed toblueprint_executionfor clarity) and API key validation.Design Goals
node:httpandprocess.hrtimeto maintain a minimal footprint without adding to the dependency tree.Open Questions
eject,migrate) that should be prioritized for initial instrumentation?I have opened PR #230 with a working implementation of this proposal. I look forward to your feedback on the architectural direction.