fix(cli): show inference health in sandbox status output by ericksoa · Pull Request #2002 · NVIDIA/NemoClaw

ericksoa · 2026-04-17T04:10:49Z

Summary

Adds remote provider health probing to nemoclaw <name> status so all providers (not just local) show an Inference line
Local probing (vllm-local, ollama-local) already worked — this fills the gap for remote providers (nvidia-prod, openai-api, anthropic-prod, gemini-api)
Creates a unified probeProviderHealth() dispatcher in new inference-health.ts module that handles both local and remote providers
Remote probes use lightweight reachability checks (any HTTP response including 401/403 = reachable, no API keys sent)
compatible-* providers show "not probed" since their endpoint URLs aren't known

Fixes #995

Test plan

23 new unit tests in inference-health.test.ts covering endpoint mapping, reachability semantics, timeouts, and unified dispatch
All 1832 existing tests continue to pass
Manual: nemoclaw <sandbox> status with a remote provider shows new Inference line
Manual: nemoclaw <sandbox> status with a local provider output is unchanged

Summary by CodeRabbit

New Features
- Unified health checks for inference providers (local and remote) with three-state reporting: not probed, healthy, and unreachable. Status now shows endpoint and diagnostic details when available.
- Remote probe delegation uses short reachability checks; certain compatible endpoints are reported as "not probed".
Tests
- Added comprehensive tests covering provider-to-endpoint mapping, reachable/unreachable cases, timeouts/connection failures, auth responses, and delegation behavior.

sandboxStatus() already probed local providers (vllm-local, ollama-local) but showed no Inference line for remote providers. Add a unified probeProviderHealth() dispatcher that performs lightweight reachability checks for remote cloud endpoints (nvidia-prod, openai-api, anthropic-prod, gemini-api) and a "not probed" fallback for compatible-* providers whose URLs are unknown. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-04-17T04:11:02Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: f470cd35-e709-4269-9197-8b4a27e18c88

📥 Commits

Reviewing files that changed from the base of the PR and between c657560 and 225ca04.

📒 Files selected for processing (1)

src/nemoclaw.ts

📝 Walkthrough

Walkthrough

Adds a unified inference-provider health probe (local + remote), provider→endpoint mapping, curl-based reachability checks with short timeouts, comprehensive Vitest tests, and integrates probe results into nemoclaw's sandbox status output.

Changes

Cohort / File(s)	Summary
Health probing module & tests `src/lib/inference-health.ts`, `src/lib/inference-health.test.ts`	New unified probe API: `probeProviderHealth`, `probeRemoteProviderHealth`, `getRemoteProviderHealthEndpoint`. Implements provider→endpoint mapping, curl-based probes (`--connect-timeout 3`, `--max-time 5`), treats HTTP 401/403 as reachable, special-cases compatible endpoints as `probed: false`, and adds extensive Vitest coverage for mappings, reachable/unreachable cases, timeouts, and curl argument passing.
Integration (status rendering) `src/nemoclaw.ts`	Replaced local-only probe call with `probeProviderHealth`. `sandboxStatus` now renders three inference states (`not probed`, `healthy`, `unreachable`) and surfaces `endpoint` and `detail` from probe results.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant ProbeLayer as probeProviderHealth
    participant LocalProbe as probeLocalProviderHealth
    participant RemoteProbe as probeRemoteProviderHealth
    participant Mapper as getRemoteProviderHealthEndpoint
    participant Curl as runCurlProbeImpl

    Caller->>ProbeLayer: probeProviderHealth(provider, options)
    alt local provider recognized
        ProbeLayer->>LocalProbe: probeLocalProviderHealth(...)
        LocalProbe-->>ProbeLayer: ProviderHealthStatus | null
    else remote provider
        ProbeLayer->>RemoteProbe: probeRemoteProviderHealth(...)
        RemoteProbe->>Mapper: getRemoteProviderHealthEndpoint(provider)
        Mapper-->>RemoteProbe: endpoint | null
        alt compatible endpoint
            RemoteProbe-->>ProbeLayer: {probed:false, ok:true, endpoint, detail}
        else endpoint found
            RemoteProbe->>Curl: runCurlProbeImpl(["curl", "--connect-timeout", "3", "--max-time", "5", endpoint, ...])
            Curl-->>RemoteProbe: CurlProbeResult (status, stdout, stderr)
            RemoteProbe-->>ProbeLayer: {probed:true, ok:boolean, endpoint, detail}
        end
    else unknown provider
        ProbeLayer-->>Caller: null
    end
    ProbeLayer-->>Caller: ProviderHealthStatus | null

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐇 I sniffed each endpoint with a curious hop,
Curl-timed my whiskers at every stop,
Found local burrows and cloud doors ajar,
Noted which replied and which stayed far,
Now the sandbox hums — I made it clear, hop!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title 'fix(cli): show inference health in sandbox status output' directly reflects the main change: adding inference health visibility to the status command for all providers.
Linked Issues check	✅ Passed	The PR introduces unified provider health probing covering both local and remote providers with reachability checks, addressing issue `#995`'s core objective of improving visibility and error messaging for inference backend health.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to implementing provider health probing: new inference-health module, comprehensive tests, and integration into nemoclaw status—no extraneous modifications detected.
Docstring Coverage	✅ Passed	Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/995-status-inference-health

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (2)

src/nemoclaw.ts (1)

1214-1226: Extract inference rendering to keep sandboxStatus complexity in check.

Line 1200’s function is already complexity-suppressed, and this new branch block adds more decision paths. Consider moving this rendering logic to a small helper.

♻️ Proposed refactor

+function printInferenceHealthStatus(inferenceHealth) {
+  if (!inferenceHealth) return;
+  if (!inferenceHealth.probed) {
+    console.log(`    Inference: ${D}not probed${R} (${inferenceHealth.detail})`);
+    return;
+  }
+  if (inferenceHealth.ok) {
+    console.log(`    Inference: ${G}healthy${R} (${inferenceHealth.endpoint})`);
+    return;
+  }
+  console.log(`    Inference: ${_RD}unreachable${R} (${inferenceHealth.endpoint})`);
+  console.log(`      ${inferenceHealth.detail}`);
+}
...
-    if (inferenceHealth) {
-      if (!inferenceHealth.probed) {
-        console.log(`    Inference: ${D}not probed${R} (${inferenceHealth.detail})`);
-      } else if (inferenceHealth.ok) {
-        console.log(
-          `    Inference: ${G}healthy${R} (${inferenceHealth.endpoint})`,
-        );
-      } else {
-        console.log(
-          `    Inference: ${_RD}unreachable${R} (${inferenceHealth.endpoint})`,
-        );
-        console.log(`      ${inferenceHealth.detail}`);
-      }
-    }
+    printInferenceHealthStatus(inferenceHealth);

As per coding guidelines, **/*.{js,ts,tsx,jsx}: Limit cyclomatic complexity to 20 in JavaScript/TypeScript files, with target of 15.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/nemoclaw.ts` around lines 1214 - 1226, The inference rendering block
inside the sandboxStatus function is increasing cyclomatic complexity; extract
it into a small helper named something like
renderInferenceHealth(inferenceHealth) that takes the existing inferenceHealth
object and the color constants (D, R, G, _RD) and returns or prints the exact
same lines (handle !probed, ok, and unreachable cases including detail and
endpoint) and replace the inline branch in sandboxStatus with a single call to
that helper to preserve behavior and reduce complexity.

src/lib/inference-health.ts (1)

92-95: Prefer not probed over null for recognized-but-unmapped providers.

If a provider is recognized by config but missing endpoint mapping, returning null drops the Inference line entirely. Returning a probed: false status is safer and keeps output stable as providers evolve.

♻️ Proposed refactor

   const endpoint = getRemoteProviderHealthEndpoint(provider);
   if (!endpoint) {
-    return null;
+    if (config) {
+      return {
+        ok: true,
+        probed: false,
+        providerLabel,
+        endpoint: "",
+        detail: "Health probe endpoint is not defined for this provider.",
+      };
+    }
+    return null;
   }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lib/inference-health.ts` around lines 92 - 95, The current code in
inference-health.ts calls getRemoteProviderHealthEndpoint(provider) and returns
null when endpoint is missing, which removes the provider from output; change
the behavior so that when endpoint is falsy you return an object indicating the
provider is recognized but not probed (e.g., { provider, probed: false, status:
'not probed' } or matching the existing Inference/Health shape) instead of null.
Update the branch that checks `if (!endpoint)` (the code referencing endpoint
from getRemoteProviderHealthEndpoint) to construct and return the non-probed
status object so downstream consumers still see the provider entry.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/lib/inference-health.ts`:
- Around line 92-95: The current code in inference-health.ts calls
getRemoteProviderHealthEndpoint(provider) and returns null when endpoint is
missing, which removes the provider from output; change the behavior so that
when endpoint is falsy you return an object indicating the provider is
recognized but not probed (e.g., { provider, probed: false, status: 'not probed'
} or matching the existing Inference/Health shape) instead of null. Update the
branch that checks `if (!endpoint)` (the code referencing endpoint from
getRemoteProviderHealthEndpoint) to construct and return the non-probed status
object so downstream consumers still see the provider entry.

In `@src/nemoclaw.ts`:
- Around line 1214-1226: The inference rendering block inside the sandboxStatus
function is increasing cyclomatic complexity; extract it into a small helper
named something like renderInferenceHealth(inferenceHealth) that takes the
existing inferenceHealth object and the color constants (D, R, G, _RD) and
returns or prints the exact same lines (handle !probed, ok, and unreachable
cases including detail and endpoint) and replace the inline branch in
sandboxStatus with a single call to that helper to preserve behavior and reduce
complexity.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: b11baa0d-7e08-4b7b-a922-78b0e2db8c65

📥 Commits

Reviewing files that changed from the base of the PR and between 56ee83f and efd5a8f.

📒 Files selected for processing (3)

src/lib/inference-health.test.ts
src/lib/inference-health.ts
src/nemoclaw.ts

## Summary Catches up the user-facing reference and troubleshooting docs with the CLI and policy behavior changes that landed in v0.0.21. Drafted via the `nemoclaw-contributor-update-docs` skill against commits in `v0.0.20..v0.0.21`, filtered through `docs/.docs-skip`. ## Changes - **`docs/reference/commands.md`** - `nemoclaw list`: session indicator (●) for connected sandboxes (#2117). - `nemoclaw <name> connect`: active-session note; auto-recovery from SSH identity drift after a host reboot (#2117, #2064). - `nemoclaw <name> status`: three-state Inference line (`healthy` / `unreachable` / `not probed`) covering both local and remote providers; new `Connected` line (#2002, #2117). - `nemoclaw <name> destroy` and `rebuild`: active-session warning with second confirm; rebuild reapplies policy presets to the recreated sandbox (#2117, #2026). - `nemoclaw <name> policy-add` and `policy-remove`: positional preset argument and non-interactive flow via `--yes`/`--force`/`NEMOCLAW_NON_INTERACTIVE=1` (#2070). - `nemoclaw <name> policy-list`: registry-vs-gateway desync detection (#2089). - **`docs/reference/troubleshooting.md`** - `Reconnect after a host reboot`: now reflects automatic stale `known_hosts` pruning on `connect` (#2064). - `Running multiple sandboxes simultaneously`: onboard's forward-port collision guard (#2086). - New section: `openclaw config set` or `unset` is blocked inside the sandbox (#2081). - **`docs/network-policy/customize-network-policy.md`**: non-interactive `policy-add`/`policy-remove` form; preset preservation across rebuild (#2070, #2026). - **`docs/inference/use-local-inference.md`**: NIM section now covers the NGC API key prompt with masked input and `docker login nvcr.io --password-stdin` behavior (#2043). - **Generated skills regenerated** to pick up the source changes (`.agents/skills/nemoclaw-user-reference/references/{commands,troubleshooting}.md`, plus minor heading-flow deltas elsewhere). The pre-commit `Regenerate agent skills from docs` hook ran and confirmed source ↔ generated parity. Commits skipped per `docs/.docs-skip` or no doc impact: `bbbaa0fb` (skip-features), `7cb482cb` (skip-features), `8dee23fd` (skip-terms), plus the usual CI / test / refactor / install-plumbing churn. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes for the modified files (the one failing test, `test/cli.test.ts > unknown command exits 1`, also fails on `origin/main` and is unrelated to these markdown-only changes) - [ ] `npm test` passes — skipped; same pre-existing CLI-dispatch test failure unrelated to docs - [ ] Tests added or updated for new or changed behavior — n/a, doc-only - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) — not run locally - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) — n/a, no new pages ## AI Disclosure - [x] AI-assisted — tool: Claude Code --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>  ## Summary by CodeRabbit * **New Features** * Multi-session SSH connections with concurrent session support. * Three-state inference health reporting (healthy/unreachable/not probed) across all providers. * Automatic SSH host key rotation detection and recovery. * Non-interactive policy preset management via positional arguments. * Session indicators in sandbox list view. * **Bug Fixes** * Protected destructive operations with active-session warnings. * Policy presets now preserved during sandbox rebuilds. * **Documentation** * NGC registry authentication requirements for container images. * Multi-sandbox onboarding and reconnection guidance. * Troubleshooting updates for port conflicts and SSH issues.  Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai Bot reviewed Apr 17, 2026

View reviewed changes

ericksoa added 4 commits April 16, 2026 22:45

Merge branch 'main' into fix/995-status-inference-health

e723a1a

Merge branch 'main' into fix/995-status-inference-health

1237f69

Merge branch 'main' into fix/995-status-inference-health

604d167

Merge branch 'main' into fix/995-status-inference-health

e85e2cd

ericksoa self-assigned this Apr 20, 2026

wscurran added NemoClaw CLI Use this label to identify issues with the NemoClaw command-line interface (CLI). fix labels Apr 20, 2026

ericksoa added 4 commits April 20, 2026 09:32

Merge branch 'main' into fix/995-status-inference-health

c657560

Merge branch 'main' into fix/995-status-inference-health

1f5c1af

Merge branch 'main' into fix/995-status-inference-health

52be198

Merge branch 'main' into fix/995-status-inference-health

225ca04

cv approved these changes Apr 21, 2026

View reviewed changes

ericksoa merged commit 470d0b4 into main Apr 21, 2026
12 checks passed

miyoungc mentioned this pull request Apr 21, 2026

docs: catch up reference and troubleshooting docs for v0.0.21 #2193

Merged

13 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(cli): show inference health in sandbox status output#2002

fix(cli): show inference health in sandbox status output#2002
ericksoa merged 9 commits intomainfrom
fix/995-status-inference-health

ericksoa commented Apr 17, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 17, 2026 •

edited

Loading

Reviews paused

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ericksoa commented Apr 17, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ericksoa commented Apr 17, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 17, 2026 •

edited

Loading