fix(cli): show inference health in sandbox status output#2002
Conversation
sandboxStatus() already probed local providers (vllm-local, ollama-local) but showed no Inference line for remote providers. Add a unified probeProviderHealth() dispatcher that performs lightweight reachability checks for remote cloud endpoints (nvidia-prod, openai-api, anthropic-prod, gemini-api) and a "not probed" fallback for compatible-* providers whose URLs are unknown. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds a unified inference-provider health probe (local + remote), provider→endpoint mapping, curl-based reachability checks with short timeouts, comprehensive Vitest tests, and integrates probe results into nemoclaw's sandbox status output. Changes
Sequence DiagramsequenceDiagram
participant Caller
participant ProbeLayer as probeProviderHealth
participant LocalProbe as probeLocalProviderHealth
participant RemoteProbe as probeRemoteProviderHealth
participant Mapper as getRemoteProviderHealthEndpoint
participant Curl as runCurlProbeImpl
Caller->>ProbeLayer: probeProviderHealth(provider, options)
alt local provider recognized
ProbeLayer->>LocalProbe: probeLocalProviderHealth(...)
LocalProbe-->>ProbeLayer: ProviderHealthStatus | null
else remote provider
ProbeLayer->>RemoteProbe: probeRemoteProviderHealth(...)
RemoteProbe->>Mapper: getRemoteProviderHealthEndpoint(provider)
Mapper-->>RemoteProbe: endpoint | null
alt compatible endpoint
RemoteProbe-->>ProbeLayer: {probed:false, ok:true, endpoint, detail}
else endpoint found
RemoteProbe->>Curl: runCurlProbeImpl(["curl", "--connect-timeout", "3", "--max-time", "5", endpoint, ...])
Curl-->>RemoteProbe: CurlProbeResult (status, stdout, stderr)
RemoteProbe-->>ProbeLayer: {probed:true, ok:boolean, endpoint, detail}
end
else unknown provider
ProbeLayer-->>Caller: null
end
ProbeLayer-->>Caller: ProviderHealthStatus | null
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
src/nemoclaw.ts (1)
1214-1226: Extract inference rendering to keepsandboxStatuscomplexity in check.Line 1200’s function is already complexity-suppressed, and this new branch block adds more decision paths. Consider moving this rendering logic to a small helper.
♻️ Proposed refactor
+function printInferenceHealthStatus(inferenceHealth) { + if (!inferenceHealth) return; + if (!inferenceHealth.probed) { + console.log(` Inference: ${D}not probed${R} (${inferenceHealth.detail})`); + return; + } + if (inferenceHealth.ok) { + console.log(` Inference: ${G}healthy${R} (${inferenceHealth.endpoint})`); + return; + } + console.log(` Inference: ${_RD}unreachable${R} (${inferenceHealth.endpoint})`); + console.log(` ${inferenceHealth.detail}`); +} ... - if (inferenceHealth) { - if (!inferenceHealth.probed) { - console.log(` Inference: ${D}not probed${R} (${inferenceHealth.detail})`); - } else if (inferenceHealth.ok) { - console.log( - ` Inference: ${G}healthy${R} (${inferenceHealth.endpoint})`, - ); - } else { - console.log( - ` Inference: ${_RD}unreachable${R} (${inferenceHealth.endpoint})`, - ); - console.log(` ${inferenceHealth.detail}`); - } - } + printInferenceHealthStatus(inferenceHealth);As per coding guidelines,
**/*.{js,ts,tsx,jsx}: Limit cyclomatic complexity to 20 in JavaScript/TypeScript files, with target of 15.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/nemoclaw.ts` around lines 1214 - 1226, The inference rendering block inside the sandboxStatus function is increasing cyclomatic complexity; extract it into a small helper named something like renderInferenceHealth(inferenceHealth) that takes the existing inferenceHealth object and the color constants (D, R, G, _RD) and returns or prints the exact same lines (handle !probed, ok, and unreachable cases including detail and endpoint) and replace the inline branch in sandboxStatus with a single call to that helper to preserve behavior and reduce complexity.src/lib/inference-health.ts (1)
92-95: Prefernot probedovernullfor recognized-but-unmapped providers.If a provider is recognized by config but missing endpoint mapping, returning
nulldrops the Inference line entirely. Returning aprobed: falsestatus is safer and keeps output stable as providers evolve.♻️ Proposed refactor
const endpoint = getRemoteProviderHealthEndpoint(provider); if (!endpoint) { - return null; + if (config) { + return { + ok: true, + probed: false, + providerLabel, + endpoint: "", + detail: "Health probe endpoint is not defined for this provider.", + }; + } + return null; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/inference-health.ts` around lines 92 - 95, The current code in inference-health.ts calls getRemoteProviderHealthEndpoint(provider) and returns null when endpoint is missing, which removes the provider from output; change the behavior so that when endpoint is falsy you return an object indicating the provider is recognized but not probed (e.g., { provider, probed: false, status: 'not probed' } or matching the existing Inference/Health shape) instead of null. Update the branch that checks `if (!endpoint)` (the code referencing endpoint from getRemoteProviderHealthEndpoint) to construct and return the non-probed status object so downstream consumers still see the provider entry.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/lib/inference-health.ts`:
- Around line 92-95: The current code in inference-health.ts calls
getRemoteProviderHealthEndpoint(provider) and returns null when endpoint is
missing, which removes the provider from output; change the behavior so that
when endpoint is falsy you return an object indicating the provider is
recognized but not probed (e.g., { provider, probed: false, status: 'not probed'
} or matching the existing Inference/Health shape) instead of null. Update the
branch that checks `if (!endpoint)` (the code referencing endpoint from
getRemoteProviderHealthEndpoint) to construct and return the non-probed status
object so downstream consumers still see the provider entry.
In `@src/nemoclaw.ts`:
- Around line 1214-1226: The inference rendering block inside the sandboxStatus
function is increasing cyclomatic complexity; extract it into a small helper
named something like renderInferenceHealth(inferenceHealth) that takes the
existing inferenceHealth object and the color constants (D, R, G, _RD) and
returns or prints the exact same lines (handle !probed, ok, and unreachable
cases including detail and endpoint) and replace the inline branch in
sandboxStatus with a single call to that helper to preserve behavior and reduce
complexity.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: b11baa0d-7e08-4b7b-a922-78b0e2db8c65
📒 Files selected for processing (3)
src/lib/inference-health.test.tssrc/lib/inference-health.tssrc/nemoclaw.ts
## Summary Catches up the user-facing reference and troubleshooting docs with the CLI and policy behavior changes that landed in v0.0.21. Drafted via the `nemoclaw-contributor-update-docs` skill against commits in `v0.0.20..v0.0.21`, filtered through `docs/.docs-skip`. ## Changes - **`docs/reference/commands.md`** - `nemoclaw list`: session indicator (●) for connected sandboxes (#2117). - `nemoclaw <name> connect`: active-session note; auto-recovery from SSH identity drift after a host reboot (#2117, #2064). - `nemoclaw <name> status`: three-state Inference line (`healthy` / `unreachable` / `not probed`) covering both local and remote providers; new `Connected` line (#2002, #2117). - `nemoclaw <name> destroy` and `rebuild`: active-session warning with second confirm; rebuild reapplies policy presets to the recreated sandbox (#2117, #2026). - `nemoclaw <name> policy-add` and `policy-remove`: positional preset argument and non-interactive flow via `--yes`/`--force`/`NEMOCLAW_NON_INTERACTIVE=1` (#2070). - `nemoclaw <name> policy-list`: registry-vs-gateway desync detection (#2089). - **`docs/reference/troubleshooting.md`** - `Reconnect after a host reboot`: now reflects automatic stale `known_hosts` pruning on `connect` (#2064). - `Running multiple sandboxes simultaneously`: onboard's forward-port collision guard (#2086). - New section: `openclaw config set` or `unset` is blocked inside the sandbox (#2081). - **`docs/network-policy/customize-network-policy.md`**: non-interactive `policy-add`/`policy-remove` form; preset preservation across rebuild (#2070, #2026). - **`docs/inference/use-local-inference.md`**: NIM section now covers the NGC API key prompt with masked input and `docker login nvcr.io --password-stdin` behavior (#2043). - **Generated skills regenerated** to pick up the source changes (`.agents/skills/nemoclaw-user-reference/references/{commands,troubleshooting}.md`, plus minor heading-flow deltas elsewhere). The pre-commit `Regenerate agent skills from docs` hook ran and confirmed source ↔ generated parity. Commits skipped per `docs/.docs-skip` or no doc impact: `bbbaa0fb` (skip-features), `7cb482cb` (skip-features), `8dee23fd` (skip-terms), plus the usual CI / test / refactor / install-plumbing churn. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes for the modified files (the one failing test, `test/cli.test.ts > unknown command exits 1`, also fails on `origin/main` and is unrelated to these markdown-only changes) - [ ] `npm test` passes — skipped; same pre-existing CLI-dispatch test failure unrelated to docs - [ ] Tests added or updated for new or changed behavior — n/a, doc-only - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) — not run locally - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) — n/a, no new pages ## AI Disclosure - [x] AI-assisted — tool: Claude Code --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Multi-session SSH connections with concurrent session support. * Three-state inference health reporting (healthy/unreachable/not probed) across all providers. * Automatic SSH host key rotation detection and recovery. * Non-interactive policy preset management via positional arguments. * Session indicators in sandbox list view. * **Bug Fixes** * Protected destructive operations with active-session warnings. * Policy presets now preserved during sandbox rebuilds. * **Documentation** * NGC registry authentication requirements for container images. * Multi-sandbox onboarding and reconnection guidance. * Troubleshooting updates for port conflicts and SSH issues. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
nemoclaw <name> statusso all providers (not just local) show an Inference linevllm-local,ollama-local) already worked — this fills the gap for remote providers (nvidia-prod,openai-api,anthropic-prod,gemini-api)probeProviderHealth()dispatcher in newinference-health.tsmodule that handles both local and remote providerscompatible-*providers show "not probed" since their endpoint URLs aren't knownFixes #995
Test plan
inference-health.test.tscovering endpoint mapping, reachability semantics, timeouts, and unified dispatchnemoclaw <sandbox> statuswith a remote provider shows new Inference linenemoclaw <sandbox> statuswith a local provider output is unchangedSummary by CodeRabbit
New Features
Tests