fix(cli): show inference health in sandbox status output#2002

Merged
ericksoa merged 9 commits into main from
fix/995-status-inference-health
Apr 21, 2026

Conversation

@ericksoa
Contributor

@ericksoa ericksoa commented Apr 17, 2026

Summary

  • Adds remote provider health probing to nemoclaw <name> status so all providers (not just local) show an Inference line
  • Local probing (vllm-local, ollama-local) already worked — this fills the gap for remote providers (nvidia-prod, openai-api, anthropic-prod, gemini-api)
  • Creates a unified probeProviderHealth() dispatcher in new inference-health.ts module that handles both local and remote providers
  • Remote probes use lightweight reachability checks (any HTTP response including 401/403 = reachable, no API keys sent)
  • compatible-* providers show "not probed" since their endpoint URLs aren't known
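
The reachability rule in the summary can be sketched as a small classifier. This is a minimal illustration, not code from inference-health.ts: the name `classifyHttpCode` and its return shape are hypothetical, and it assumes the probe captures curl's `%{http_code}` write-out, which is `000` when no HTTP response arrived.

```typescript
// Hypothetical helper; name and return shape are illustrative only.
type Reachability = { ok: boolean; detail: string };

// Input is the text produced by curl's %{http_code} write-out.
// curl prints "000" when no HTTP response was received at all.
function classifyHttpCode(httpCode: string): Reachability {
  const code = Number.parseInt(httpCode.trim(), 10);
  if (Number.isNaN(code) || code === 0) {
    return { ok: false, detail: "no HTTP response received" };
  }
  // Any real HTTP status, including 401/403, proves the endpoint answered,
  // so no API key is needed to establish reachability.
  return { ok: true, detail: `HTTP ${code}` };
}
```

Under this rule an unauthenticated 401 still counts as reachable, which is why the probe does not need to send credentials.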

Fixes #995

Test plan

  • 23 new unit tests in inference-health.test.ts covering endpoint mapping, reachability semantics, timeouts, and unified dispatch
  • All 1832 existing tests continue to pass
  • Manual: nemoclaw <sandbox> status with a remote provider shows new Inference line
  • Manual: nemoclaw <sandbox> status with a local provider output is unchanged

Summary by CodeRabbit

  • New Features

    • Unified health checks for inference providers (local and remote) with three-state reporting: not probed, healthy, and unreachable. Status now shows endpoint and diagnostic details when available.
    • Remote probe delegation uses short reachability checks; certain compatible endpoints are reported as "not probed".
  • Tests

    • Added comprehensive tests covering provider-to-endpoint mapping, reachable/unreachable cases, timeouts/connection failures, auth responses, and delegation behavior.

sandboxStatus() already probed local providers (vllm-local, ollama-local)
but showed no Inference line for remote providers. Add a unified
probeProviderHealth() dispatcher that performs lightweight reachability
checks for remote cloud endpoints (nvidia-prod, openai-api, anthropic-prod,
gemini-api) and a "not probed" fallback for compatible-* providers whose
URLs are unknown.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai Bot commented Apr 17, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: f470cd35-e709-4269-9197-8b4a27e18c88

📥 Commits

Reviewing files that changed from the base of the PR and between c657560 and 225ca04.

📒 Files selected for processing (1)
  • src/nemoclaw.ts

📝 Walkthrough

Adds a unified inference-provider health probe (local + remote), provider→endpoint mapping, curl-based reachability checks with short timeouts, comprehensive Vitest tests, and integrates probe results into nemoclaw's sandbox status output.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Health probing module & tests: src/lib/inference-health.ts, src/lib/inference-health.test.ts | New unified probe API: probeProviderHealth, probeRemoteProviderHealth, getRemoteProviderHealthEndpoint. Implements provider→endpoint mapping, curl-based probes (--connect-timeout 3, --max-time 5), treats HTTP 401/403 as reachable, special-cases compatible endpoints as probed: false, and adds extensive Vitest coverage for mappings, reachable/unreachable cases, timeouts, and curl argument passing. |
| Integration (status rendering): src/nemoclaw.ts | Replaced local-only probe call with probeProviderHealth. sandboxStatus now renders three inference states (not probed, healthy, unreachable) and surfaces endpoint and detail from probe results. |
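
A curl-based probe with the two timeouts named above might look roughly like the sketch below. Only `--connect-timeout 3` and `--max-time 5` come from the walkthrough; the other curl flags (`-sS`, `-o /dev/null`, `-w "%{http_code}"`), the type names, and `probeEndpoint` itself are assumptions made for illustration. The runner is injected so the logic can be exercised without a network.

```typescript
// Illustrative types; the real CurlProbeResult may have a different shape.
interface CurlResult {
  status: number | null; // curl exit code (0 = transfer completed)
  stdout: string;        // receives the %{http_code} write-out
  stderr: string;        // curl's error message, if any
}
type CurlRunner = (argv: string[]) => CurlResult;

interface ProbeOutcome {
  ok: boolean;
  endpoint: string;
  detail: string;
}

// Build the argv for a lightweight reachability check. No auth header is added.
function buildProbeArgv(endpoint: string): string[] {
  return [
    "curl",
    "-sS",                  // quiet, but still print errors (assumed flag choice)
    "-o", "/dev/null",      // discard the body (POSIX path; assumption)
    "-w", "%{http_code}",   // print only the status code (assumed flag choice)
    "--connect-timeout", "3",
    "--max-time", "5",
    endpoint,
  ];
}

// Hypothetical probe: any HTTP status counts as reachable.
function probeEndpoint(endpoint: string, run: CurlRunner): ProbeOutcome {
  const result = run(buildProbeArgv(endpoint));
  const code = Number.parseInt(result.stdout.trim(), 10);
  if (result.status === 0 && code > 0) {
    return { ok: true, endpoint, detail: `HTTP ${code}` };
  }
  const reason = result.stderr.trim() || `curl exited with status ${result.status}`;
  return { ok: false, endpoint, detail: reason };
}
```

In a Node CLI the runner could wrap `spawnSync` from `node:child_process`; injecting it keeps the unit tests free of real network calls.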

Sequence Diagram

sequenceDiagram
    participant Caller
    participant ProbeLayer as probeProviderHealth
    participant LocalProbe as probeLocalProviderHealth
    participant RemoteProbe as probeRemoteProviderHealth
    participant Mapper as getRemoteProviderHealthEndpoint
    participant Curl as runCurlProbeImpl

    Caller->>ProbeLayer: probeProviderHealth(provider, options)
    alt local provider recognized
        ProbeLayer->>LocalProbe: probeLocalProviderHealth(...)
        LocalProbe-->>ProbeLayer: ProviderHealthStatus | null
    else remote provider
        ProbeLayer->>RemoteProbe: probeRemoteProviderHealth(...)
        RemoteProbe->>Mapper: getRemoteProviderHealthEndpoint(provider)
        Mapper-->>RemoteProbe: endpoint | null
        alt compatible endpoint
            RemoteProbe-->>ProbeLayer: {probed:false, ok:true, endpoint, detail}
        else endpoint found
            RemoteProbe->>Curl: runCurlProbeImpl(["curl", "--connect-timeout", "3", "--max-time", "5", endpoint, ...])
            Curl-->>RemoteProbe: CurlProbeResult (status, stdout, stderr)
            RemoteProbe-->>ProbeLayer: {probed:true, ok:boolean, endpoint, detail}
        end
    else unknown provider
        ProbeLayer-->>Caller: null
    end
    ProbeLayer-->>Caller: ProviderHealthStatus | null
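
The dispatch flow in the diagram could be sketched as follows. This is a simplified reading, not the merged implementation: the factory functions, the `compatible-` prefix test, the endpoint `Map`, and the detail strings are all hypothetical, and the compatible check is done before the endpoint lookup for brevity.

```typescript
// Field names follow the diagram; anything beyond them is an assumption.
interface ProviderHealthStatus {
  ok: boolean;
  probed: boolean;
  endpoint: string;
  detail: string;
}
type Probe = (provider: string) => ProviderHealthStatus | null;
type EndpointCheck = (endpoint: string) => { ok: boolean; detail: string };

// Local provider names are taken from the PR summary.
const LOCAL_PROVIDERS = new Set(["vllm-local", "ollama-local"]);

// Hypothetical remote probe: maps provider to endpoint, then checks reachability.
function makeRemoteProbe(endpoints: Map<string, string>, check: EndpointCheck): Probe {
  return (provider) => {
    if (provider.startsWith("compatible-")) {
      // Endpoint URL is unknown, so report "not probed" rather than guessing.
      return { ok: true, probed: false, endpoint: "", detail: "endpoint URL is not known" };
    }
    const endpoint = endpoints.get(provider);
    if (!endpoint) return null; // unknown provider: no Inference line
    const { ok, detail } = check(endpoint);
    return { ok, probed: true, endpoint, detail };
  };
}

// Hypothetical unified dispatcher: local providers first, everything else remote.
function makeDispatcher(probeLocal: Probe, probeRemote: Probe): Probe {
  return (provider) =>
    LOCAL_PROVIDERS.has(provider) ? probeLocal(provider) : probeRemote(provider);
}
```

Returning `null` for unknown providers matches the diagram's last branch; the review comment further down discusses whether recognized but unmapped providers should get a `probed: false` status instead.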

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐇 I sniffed each endpoint with a curious hop,
Curl-timed my whiskers at every stop,
Found local burrows and cloud doors ajar,
Noted which replied and which stayed far,
Now the sandbox hums — I made it clear, hop!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title 'fix(cli): show inference health in sandbox status output' directly reflects the main change: adding inference health visibility to the status command for all providers. |
| Linked Issues check | ✅ Passed | The PR introduces unified provider health probing covering both local and remote providers with reachability checks, addressing issue #995's core objective of improving visibility and error messaging for inference backend health. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to implementing provider health probing: new inference-health module, comprehensive tests, and integration into nemoclaw status; no extraneous modifications detected. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (2)
src/nemoclaw.ts (1)

1214-1226: Extract inference rendering to keep sandboxStatus complexity in check.

Line 1200’s function is already complexity-suppressed, and this new branch block adds more decision paths. Consider moving this rendering logic to a small helper.

♻️ Proposed refactor
+function printInferenceHealthStatus(inferenceHealth) {
+  if (!inferenceHealth) return;
+  if (!inferenceHealth.probed) {
+    console.log(`    Inference: ${D}not probed${R} (${inferenceHealth.detail})`);
+    return;
+  }
+  if (inferenceHealth.ok) {
+    console.log(`    Inference: ${G}healthy${R} (${inferenceHealth.endpoint})`);
+    return;
+  }
+  console.log(`    Inference: ${_RD}unreachable${R} (${inferenceHealth.endpoint})`);
+  console.log(`      ${inferenceHealth.detail}`);
+}
...
-    if (inferenceHealth) {
-      if (!inferenceHealth.probed) {
-        console.log(`    Inference: ${D}not probed${R} (${inferenceHealth.detail})`);
-      } else if (inferenceHealth.ok) {
-        console.log(
-          `    Inference: ${G}healthy${R} (${inferenceHealth.endpoint})`,
-        );
-      } else {
-        console.log(
-          `    Inference: ${_RD}unreachable${R} (${inferenceHealth.endpoint})`,
-        );
-        console.log(`      ${inferenceHealth.detail}`);
-      }
-    }
+    printInferenceHealthStatus(inferenceHealth);

As per coding guidelines, **/*.{js,ts,tsx,jsx}: Limit cyclomatic complexity to 20 in JavaScript/TypeScript files, with target of 15.
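
One possible variant of the extracted helper returns the lines instead of printing them, which keeps it easy to unit test. This is a sketch only: `formatInferenceLines` and the `InferenceHealth` interface are hypothetical names, and the color constants (D, G, _RD, R) from src/nemoclaw.ts are left out because their definitions are not shown here.

```typescript
// Hypothetical shape, limited to the fields the rendering code reads.
interface InferenceHealth {
  ok: boolean;
  probed: boolean;
  endpoint: string;
  detail: string;
}

// Pure formatter for the three inference states; colors omitted in this sketch.
function formatInferenceLines(health: InferenceHealth | null): string[] {
  if (!health) return []; // no status: print nothing
  if (!health.probed) {
    return [`    Inference: not probed (${health.detail})`];
  }
  if (health.ok) {
    return [`    Inference: healthy (${health.endpoint})`];
  }
  return [
    `    Inference: unreachable (${health.endpoint})`,
    `      ${health.detail}`,
  ];
}
```

The caller in sandboxStatus would then reduce to a single loop that logs each returned line, adding no branches of its own.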

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/nemoclaw.ts` around lines 1214 - 1226, The inference rendering block
inside the sandboxStatus function is increasing cyclomatic complexity; extract
it into a small helper named something like
renderInferenceHealth(inferenceHealth) that takes the existing inferenceHealth
object and the color constants (D, R, G, _RD) and returns or prints the exact
same lines (handle !probed, ok, and unreachable cases including detail and
endpoint) and replace the inline branch in sandboxStatus with a single call to
that helper to preserve behavior and reduce complexity.
src/lib/inference-health.ts (1)

92-95: Prefer not probed over null for recognized-but-unmapped providers.

If a provider is recognized by config but missing endpoint mapping, returning null drops the Inference line entirely. Returning a probed: false status is safer and keeps output stable as providers evolve.

♻️ Proposed refactor
   const endpoint = getRemoteProviderHealthEndpoint(provider);
   if (!endpoint) {
-    return null;
+    if (config) {
+      return {
+        ok: true,
+        probed: false,
+        providerLabel,
+        endpoint: "",
+        detail: "Health probe endpoint is not defined for this provider.",
+      };
+    }
+    return null;
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/inference-health.ts` around lines 92 - 95, The current code in
inference-health.ts calls getRemoteProviderHealthEndpoint(provider) and returns
null when endpoint is missing, which removes the provider from output; change
the behavior so that when endpoint is falsy you return an object indicating the
provider is recognized but not probed (e.g., { provider, probed: false, status:
'not probed' } or matching the existing Inference/Health shape) instead of null.
Update the branch that checks `if (!endpoint)` (the code referencing endpoint
from getRemoteProviderHealthEndpoint) to construct and return the non-probed
status object so downstream consumers still see the provider entry.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: b11baa0d-7e08-4b7b-a922-78b0e2db8c65

📥 Commits

Reviewing files that changed from the base of the PR and between 56ee83f and efd5a8f.

📒 Files selected for processing (3)
  • src/lib/inference-health.test.ts
  • src/lib/inference-health.ts
  • src/nemoclaw.ts

@ericksoa ericksoa self-assigned this Apr 20, 2026
@wscurran wscurran added the NemoClaw CLI and fix labels Apr 20, 2026
@ericksoa ericksoa merged commit 470d0b4 into main Apr 21, 2026
12 checks passed
ericksoa pushed a commit that referenced this pull request Apr 21, 2026
## Summary

Catches up the user-facing reference and troubleshooting docs with the
CLI and policy behavior changes that landed in v0.0.21. Drafted via the
`nemoclaw-contributor-update-docs` skill against commits in
`v0.0.20..v0.0.21`, filtered through `docs/.docs-skip`.

## Changes

- **`docs/reference/commands.md`**
  - `nemoclaw list`: session indicator (●) for connected sandboxes
    (#2117).
  - `nemoclaw <name> connect`: active-session note; auto-recovery from SSH
    identity drift after a host reboot (#2117, #2064).
  - `nemoclaw <name> status`: three-state Inference line (`healthy` /
    `unreachable` / `not probed`) covering both local and remote providers;
    new `Connected` line (#2002, #2117).
  - `nemoclaw <name> destroy` and `rebuild`: active-session warning with
    second confirm; rebuild reapplies policy presets to the recreated
    sandbox (#2117, #2026).
  - `nemoclaw <name> policy-add` and `policy-remove`: positional preset
    argument and non-interactive flow via
    `--yes`/`--force`/`NEMOCLAW_NON_INTERACTIVE=1` (#2070).
  - `nemoclaw <name> policy-list`: registry-vs-gateway desync detection
    (#2089).
- **`docs/reference/troubleshooting.md`**
  - `Reconnect after a host reboot`: now reflects automatic stale
    `known_hosts` pruning on `connect` (#2064).
  - `Running multiple sandboxes simultaneously`: onboard's forward-port
    collision guard (#2086).
  - New section: `openclaw config set` or `unset` is blocked inside the
    sandbox (#2081).
- **`docs/network-policy/customize-network-policy.md`**: non-interactive
`policy-add`/`policy-remove` form; preset preservation across rebuild
(#2070, #2026).
- **`docs/inference/use-local-inference.md`**: NIM section now covers
the NGC API key prompt with masked input and `docker login nvcr.io
--password-stdin` behavior (#2043).
- **Generated skills regenerated** to pick up the source changes
(`.agents/skills/nemoclaw-user-reference/references/{commands,troubleshooting}.md`,
plus minor heading-flow deltas elsewhere). The pre-commit `Regenerate
agent skills from docs` hook ran and confirmed source ↔ generated
parity.

Commits skipped per `docs/.docs-skip` or no doc impact: `bbbaa0fb`
(skip-features), `7cb482cb` (skip-features), `8dee23fd` (skip-terms),
plus the usual CI / test / refactor / install-plumbing churn.

## Type of Change

- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification

- [x] `npx prek run --all-files` passes for the modified files (the one
failing test, `test/cli.test.ts > unknown command exits 1`, also fails
on `origin/main` and is unrelated to these markdown-only changes)
- [ ] `npm test` passes — skipped; same pre-existing CLI-dispatch test
failure unrelated to docs
- [ ] Tests added or updated for new or changed behavior — n/a, doc-only
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [ ] `make docs` builds without warnings (doc changes only) — not run
locally
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)
— n/a, no new pages

## AI Disclosure

- [x] AI-assisted — tool: Claude Code

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
  * Multi-session SSH connections with concurrent session support.
  * Three-state inference health reporting (healthy/unreachable/not probed) across all providers.
  * Automatic SSH host key rotation detection and recovery.
  * Non-interactive policy preset management via positional arguments.
  * Session indicators in sandbox list view.

* **Bug Fixes**
  * Protected destructive operations with active-session warnings.
  * Policy presets now preserved during sandbox rebuilds.

* **Documentation**
  * NGC registry authentication requirements for container images.
  * Multi-sandbox onboarding and reconnection guidance.
  * Troubleshooting updates for port conflicts and SSH issues.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Labels

fix, NemoClaw CLI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[MacOS] No clear error message when Ollama backend is stopped

3 participants