Conversation
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: ⛔ Files ignored due to path filters (2)
📒 Files selected for processing (155)
📝 WalkthroughWalkthroughLarge, multi-area update adding new network policy preset for Homebrew, extensive CI/workflow and action additions, many docs and scripts changes, numerous new/rewritten TypeScript modules and tests (including preflight, inference, nim, onboarding/session, runtime-recovery, validation), multiple bin shims to dist, registry and credentials hardening, k8s manifest, and many e2e/test script additions and revisions. Changes
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
Suggested reviewers
Poem
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemoclaw-blueprint/policies/presets/brew.yaml`:
- Around line 12-54: The brew preset currently configures
registry/package-manager hosts (hosts: formulae.brew.sh, github.com, ghcr.io,
pkg-containers.githubusercontent.com, objects.githubusercontent.com,
raw.githubusercontent.com) with protocol: rest, tls: terminate and REST
method/path rules; change these entries to use access: full (remove protocol:
rest, tls: terminate and the rules block) so CONNECT/TLS tunneling is allowed
for package manager and registry traffic, ensuring auth/downloads work
correctly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 801a2016-7512-4d4d-bc95-132c616fd565
📒 Files selected for processing (1)
nemoclaw-blueprint/policies/presets/brew.yaml
| - host: formulae.brew.sh | ||
| port: 443 | ||
| protocol: rest | ||
| enforcement: enforce | ||
| tls: terminate | ||
| rules: | ||
| - allow: { method: GET, path: "/**" } | ||
| - host: github.com | ||
| port: 443 | ||
| protocol: rest | ||
| enforcement: enforce | ||
| tls: terminate | ||
| rules: | ||
| - allow: { method: GET, path: "/Homebrew/**" } | ||
| - allow: { method: POST, path: "/Homebrew/**" } | ||
| - host: ghcr.io | ||
| port: 443 | ||
| protocol: rest | ||
| enforcement: enforce | ||
| tls: terminate | ||
| rules: | ||
| - allow: { method: GET, path: "/**" } | ||
| - host: pkg-containers.githubusercontent.com | ||
| port: 443 | ||
| protocol: rest | ||
| enforcement: enforce | ||
| tls: terminate | ||
| rules: | ||
| - allow: { method: GET, path: "/**" } | ||
| - host: objects.githubusercontent.com | ||
| port: 443 | ||
| protocol: rest | ||
| enforcement: enforce | ||
| tls: terminate | ||
| rules: | ||
| - allow: { method: GET, path: "/**" } | ||
| - host: raw.githubusercontent.com | ||
| port: 443 | ||
| protocol: rest | ||
| enforcement: enforce | ||
| tls: terminate | ||
| rules: | ||
| - allow: { method: GET, path: "/**" } |
There was a problem hiding this comment.
Use access: full for brew endpoints instead of REST + TLS termination.
For package-manager/registry traffic, tls: terminate with REST path rules will break CONNECT tunneling and can cause auth/download failures. Switch these endpoints to access: full and remove REST method/path filtering.
Suggested fix
endpoints:
- host: formulae.brew.sh
port: 443
- protocol: rest
enforcement: enforce
- tls: terminate
- rules:
- - allow: { method: GET, path: "/**" }
+ access: full
- host: github.com
port: 443
- protocol: rest
enforcement: enforce
- tls: terminate
- rules:
- - allow: { method: GET, path: "/Homebrew/**" }
- - allow: { method: POST, path: "/Homebrew/**" }
+ access: full
- host: ghcr.io
port: 443
- protocol: rest
enforcement: enforce
- tls: terminate
- rules:
- - allow: { method: GET, path: "/**" }
+ access: full
- host: pkg-containers.githubusercontent.com
port: 443
- protocol: rest
enforcement: enforce
- tls: terminate
- rules:
- - allow: { method: GET, path: "/**" }
+ access: full
- host: objects.githubusercontent.com
port: 443
- protocol: rest
enforcement: enforce
- tls: terminate
- rules:
- - allow: { method: GET, path: "/**" }
+ access: full
- host: raw.githubusercontent.com
port: 443
- protocol: rest
enforcement: enforce
- tls: terminate
- rules:
- - allow: { method: GET, path: "/**" }
+ access: full📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - host: formulae.brew.sh | |
| port: 443 | |
| protocol: rest | |
| enforcement: enforce | |
| tls: terminate | |
| rules: | |
| - allow: { method: GET, path: "/**" } | |
| - host: github.com | |
| port: 443 | |
| protocol: rest | |
| enforcement: enforce | |
| tls: terminate | |
| rules: | |
| - allow: { method: GET, path: "/Homebrew/**" } | |
| - allow: { method: POST, path: "/Homebrew/**" } | |
| - host: ghcr.io | |
| port: 443 | |
| protocol: rest | |
| enforcement: enforce | |
| tls: terminate | |
| rules: | |
| - allow: { method: GET, path: "/**" } | |
| - host: pkg-containers.githubusercontent.com | |
| port: 443 | |
| protocol: rest | |
| enforcement: enforce | |
| tls: terminate | |
| rules: | |
| - allow: { method: GET, path: "/**" } | |
| - host: objects.githubusercontent.com | |
| port: 443 | |
| protocol: rest | |
| enforcement: enforce | |
| tls: terminate | |
| rules: | |
| - allow: { method: GET, path: "/**" } | |
| - host: raw.githubusercontent.com | |
| port: 443 | |
| protocol: rest | |
| enforcement: enforce | |
| tls: terminate | |
| rules: | |
| - allow: { method: GET, path: "/**" } | |
| - host: formulae.brew.sh | |
| port: 443 | |
| enforcement: enforce | |
| access: full | |
| - host: github.com | |
| port: 443 | |
| enforcement: enforce | |
| access: full | |
| - host: ghcr.io | |
| port: 443 | |
| enforcement: enforce | |
| access: full | |
| - host: pkg-containers.githubusercontent.com | |
| port: 443 | |
| enforcement: enforce | |
| access: full | |
| - host: objects.githubusercontent.com | |
| port: 443 | |
| enforcement: enforce | |
| access: full | |
| - host: raw.githubusercontent.com | |
| port: 443 | |
| enforcement: enforce | |
| access: full |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemoclaw-blueprint/policies/presets/brew.yaml` around lines 12 - 54, The brew
preset currently configures registry/package-manager hosts (hosts:
formulae.brew.sh, github.com, ghcr.io, pkg-containers.githubusercontent.com,
objects.githubusercontent.com, raw.githubusercontent.com) with protocol: rest,
tls: terminate and REST method/path rules; change these entries to use access:
full (remove protocol: rest, tls: terminate and the rules block) so CONNECT/TLS
tunneling is allowed for package manager and registry traffic, ensuring
auth/downloads work correctly.
* chore: add cyclomatic complexity rule (ratchet from 95) Add ESLint complexity rule to bin/ and scripts/ to prevent new functions from accumulating excessive branching. Starting threshold is 95 (current worst offender: setupNim in onboard.js). Ratchet plan: 95 → 40 → 25 → 15. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: ratchet complexity to 20, suppress existing violations Suppress 6 functions that exceed the threshold with eslint-disable comments so we can start enforcing at 20 instead of 95: - setupNim (95), setupPolicies (41), setupInference (22) in onboard.js - deploy (22), main IIFE (27) in nemoclaw.js - applyPreset (24) in policies.js Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: suppress complexity for 3 missed functions preflight (23), getReconciledSandboxGatewayState (25), sandboxStatus (27) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: peteryuqin <peter.yuqin@gmail.com>
* chore: add tsconfig.cli.json, root execa, TS coverage ratchet Foundation for the CLI TypeScript migration (PR 0 of the shell consolidation plan). No runtime changes — config, tooling, and dependency only. - tsconfig.cli.json: strict TS type-checking for bin/ and scripts/ (noEmit, module: preserve — tsx handles the runtime) - scripts/check-coverage-ratchet.ts: pure TS replacement for the bash+python coverage ratchet script (same logic, same tolerance) - execa ^9.6.1 added to root devDependencies (used by PR 1+) - pr.yaml: coverage ratchet step now runs the TS version via tsx - .pre-commit-config.yaml: SPDX headers cover scripts/*.ts, new tsc-check-cli pre-push hook - CONTRIBUTING.md: document typecheck:cli task and CLI pre-push hook - Delete scripts/check-coverage-ratchet.sh Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Apply suggestion from @brandonpelfrey * chore: address PR feedback — use types_or, add tsx devDep - Use `types_or: [ts, tsx]` instead of file glob for tsc-check-cli hook per @brandonpelfrey's suggestion. - Add `tsx` to devDependencies so CI doesn't re-fetch it on every run per CodeRabbit's suggestion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(ci): ignore GitHub "Apply suggestion" commits in commitlint * fix(ci): lint only PR title since repo is squash-merge only Reverts the commitlint ignores rule from the previous commit and instead removes the per-commit lint step entirely. Individual commit messages are discarded at merge time — only the squash-merged PR title lands in main and drives changelog generation. Drop the per-commit lint, keep the PR title check, and remove the now-unnecessary fetch-depth: 0. * Revert "fix(ci): lint only PR title since repo is squash-merge only" This reverts commit 1257a47. * Revert "fix(ci): ignore GitHub "Apply suggestion" commits in commitlint" This reverts commit c395657. * docs: fix markdownlint MD032 in README (blank line before list) * refactor: make coverage ratchet script idiomatic TypeScript - Wrap in main() with process.exitCode instead of scattered process.exit() - Replace mutable flags with .map()/.some() over typed MetricResult[] - Separate pure logic (checkMetrics) from formatting (formatReport) - Throw with { cause } chaining instead of exit-in-helpers - Derive CoverageThresholds from METRICS tuple (single source of truth) - Exhaustive switch on CheckStatus discriminated union * refactor: remove duplication in coverage ratchet script - Drop STATUS_LABELS map; inline labels in exhaustive switch - Extract common 'metric coverage is N%' preamble in formatResult - Simplify ratchetedThresholds: use results directly (already in METRICS order) instead of re-scanning with .find() per metric - Compute 'failed' once in main, pass into formatReport to avoid duplicate .some() scan * refactor: simplify coverage ratchet with FP patterns - Extract classify() as a named pure function (replaces nested ternary) - loadJSON takes repo-relative paths, eliminating THRESHOLD_PATH and SUMMARY_PATH constants (DRY the join-with-REPO_ROOT pattern) - Drop CoverageMetric/CoverageSummary interfaces (only pct is read); use structural type at the call site instead - Inline ratchetedThresholds (one-liner, used once) - formatReport derives fail/improved from results instead of taking a pre-computed boolean (let functions derive from data, don't thread derived state) - sections.join("\n\n") replaces manual empty-string pushing - Shorter type names (Thresholds, Status, Result) — no ambiguity in a single-purpose script * refactor: strip coverage ratchet to failure-only output prek hides output from commands that exit 0, so ok/improved reporting was dead code. Remove Status, Result, classify, formatResult, formatReport, and the ratcheted-thresholds suggestion block. The script now just filters for regressions and prints actionable errors on failure. --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Brandon Pelfrey <bpelfrey@nvidia.com>
…ets (#438) * fix: use CONNECT tunnel for WebSocket endpoints in Discord/Slack presets The egress proxy's HTTP idle timeout (~2 min) kills long-lived WebSocket connections when endpoints are configured with protocol:rest + tls:terminate. Switch WebSocket endpoints to access:full (CONNECT tunnel) which bypasses HTTP-level timeouts entirely. Discord: - gateway.discord.gg → access:full (WebSocket gateway) - Add PUT/PATCH/DELETE methods for discord.com (message editing, reactions) - Add media.discordapp.net for attachment access Slack: - Add wss-primary.slack.com and wss-backup.slack.com → access:full (Socket Mode WebSocket endpoints) Partially addresses #409 — the policy-level fix enables WebSocket connections to survive. The hardcoded 2-min timeout in openshell-sandbox still affects any protocol:rest endpoints with long-lived connections. Related: #361 (WhatsApp Web, same root cause) * fix: correct comment wording for media endpoint and YAML formatting
* fix: remove unused RECOMMENDED_NODE_MAJOR from scripts/install.sh Shellcheck flagged it as unused after the min/recommended merge. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: enforce full semver >=22.16.0 in installer scripts The runtime checks only compared the major Node.js version, allowing 22.0–22.15 to pass despite package.json requiring >=22.16.0. Use the version_gte() helper for full semver comparison in both installers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: harden version_gte and align fallback message Guard version_gte() against prerelease suffixes (e.g. "22.16.0-rc.1") that would crash bash arithmetic. Also update the manual-install fallback message to reference MIN_NODE_VERSION instead of hardcoded "22". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: update test stubs for Node.js 22.16 minimum and add Node 20 rejection test - Bump node stub in 'succeeds with acceptable Node.js' from v20.0.0 to v22.16.0 - Bump node stub in buildCurlPipeEnv from v22.14.0 to v22.16.0 - Add new test asserting Node.js 20 is rejected by ensure_supported_runtime --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: harden installer and onboard resiliency * fix: address installer and debug review follow-ups * fix: harden onboard resume across later setup steps * test: simplify payload extraction in onboard tests * test: keep onboard payload extraction target-compatible * chore: align onboard session lint with complexity rule * fix: harden onboard session safety and lock handling * fix: tighten onboard session redaction and metadata handling
… blueprint digest (#769) Reconciles #156 and #743 into a single comprehensive solution: - Filter auth-profiles.json at copy time via cpSync filter (from #743) - Recursive stripCredentials() with pattern-based field detection for deep config sanitization (from #156: CREDENTIAL_FIELDS set + CREDENTIAL_FIELD_PATTERN regex) - Remove gateway config section (contains auth tokens) from sandbox openclaw.json - Blueprint digest verification (SHA-256): recorded at snapshot time, validated on restore, empty/missing digest is a hard failure - computeFileDigest() throws when blueprint file is missing instead of silently returning null - Sanitize both snapshot-level and sandbox-bundle openclaw.json copies - Backward compatible: old snapshots without blueprintDigest skip validation - Bump SNAPSHOT_VERSION 2 → 3 Supersedes #156 and #743.
…ross reconnects (#1025) * fix(sandbox): export proxy env vars with full NO_PROXY and persist across reconnects OpenShell injects NO_PROXY=127.0.0.1,localhost,::1 into the sandbox, missing inference.local and the gateway IP (10.200.0.1). This causes LLM inference requests to route through the egress proxy instead of going direct, and the proxy gateway IP itself gets proxied. Add proxy configuration block to nemoclaw-start.sh that: - Exports HTTP_PROXY, HTTPS_PROXY, and NO_PROXY with inference.local and the gateway IP included - Persists via /etc/profile.d/nemoclaw-proxy.sh (root) or ~/.profile (non-root fallback) so values survive OpenShell reconnect injection - Supports NEMOCLAW_PROXY_HOST / NEMOCLAW_PROXY_PORT overrides The non-root fallback ensures the fix works in environments like Brev where containers run without root privileges. Tested on DGX Spark (ARM64) and Brev VM (x86_64). Verified NO_PROXY contains inference.local and 10.200.0.1 inside the live sandbox after connect. Ref: #626, #704 Ref: #704 (comment) * fix(sandbox): write proxy config to ~/.bashrc for interactive reconnect sessions OpenShell's `sandbox connect` spawns `/bin/bash -i` (interactive, non-login), which sources ~/.bashrc — not ~/.profile or /etc/profile.d/*. The previous approach wrote to ~/.profile and /etc/profile.d/, neither of which is sourced by `bash -i`, so the narrow OpenShell-injected NO_PROXY persisted in live interactive sessions. Changes: - Write proxy snippet to ~/.bashrc (primary) and ~/.profile (login fallback) - Export both uppercase and lowercase proxy variants (NO_PROXY + no_proxy, HTTP_PROXY + http_proxy, etc.) — Node.js undici prefers lowercase no_proxy over uppercase NO_PROXY when both are set - Add idempotency guard to prevent duplicate blocks on container restart - Update tests: verify .bashrc writing, idempotency, bash -i override behavior, and lowercase variant correctness Tested on DGX Spark (ARM64) and Brev VM (x86_64) with full destroy + re-onboard + live `env | grep proxy` verification inside the sandbox shell via `openshell sandbox connect`. Ref: #626 * fix(sandbox): replace stale proxy values on restart with begin/end markers Use begin/end markers in .bashrc/.profile proxy snippet so _write_proxy_snippet replaces the block when PROXY_HOST/PORT change instead of silently keeping stale values. Adds test coverage for the replacement path. Addresses CodeRabbit review feedback on idempotency gap. * fix(sandbox): resolve sandbox user home dynamically when running as root When the entrypoint runs as root, $HOME is /root — the proxy snippet was written to /root/.bashrc instead of the sandbox user's home. Use getent passwd to look up the sandbox user's home when running as UID 0; fall back to /sandbox if the user entry is missing. Addresses CodeRabbit review feedback on _SANDBOX_HOME resolution. --------- Co-authored-by: Carlos Villela <cvillela@nvidia.com>
#101) * fix(policies): allow preset application for versionless policies (Fixes #35) Fixes #35 Signed-off-by: Deepak Jain <deepujain@gmail.com> * fix: remove stale complexity suppression in policies --------- Signed-off-by: Deepak Jain <deepujain@gmail.com> Co-authored-by: Kevin Jones <kejones@nvidia.com>
* fix: restore routed inference and connect UX * fix: simplify detected local inference hint * fix: remove stale local inference hint * test: relax connect forward assertion
… (#1062) * fix(sandbox): add DNS forwarder so web_fetch resolves hostnames (fixes #626) The sandbox runs in an isolated network namespace (10.200.0.0/24) where OpenShell's iptables rules reject all non-proxy traffic including UDP. This causes getaddrinfo EAI_AGAIN for every outbound DNS lookup, breaking web_fetch, web_search, and all Node.js HTTP tools. Fix (three steps in new scripts/setup-dns-proxy.sh): 1. Run a Python UDP DNS forwarder on the pod-side veth gateway (10.200.0.1:53), forwarding to the real CoreDNS pod IP 2. Add an iptables rule in the sandbox namespace allowing UDP to the gateway on port 53 (the only non-proxy firewall exception) 3. Update the sandbox's /etc/resolv.conf to point to 10.200.0.1 Also broadens the CoreDNS patch (fix-coredns.sh / shouldPatchCoredns) to run on all Docker-based runtimes, not just Colima — k3s-inside-Docker has broken DNS forwarding on Linux hosts with systemd-resolved too. Inspired by PR #732's DNS forwarder approach. Key differences: - Uses kubectl exec instead of nsenter (fixes PR #732's launch bug) - Handles the sandbox iptables constraint (all UDP blocked) - Resolves systemd-resolved upstreams via resolvectl before falling back to 8.8.8.8 - Uses grep -F for fixed-string sandbox name matching Tested on DGX Spark (ARM64): fresh destroy + onboard, getent hosts and node dns.lookup both resolve from inside the sandbox. * fix(sandbox): add runtime DNS verification after setup setup-dns-proxy.sh now verifies all three DNS bridge layers from inside the sandbox namespace post-deployment: resolv.conf target, iptables UDP rule, and actual getent hosts resolution. Reports [PASS]/[FAIL] per check so failures are immediately visible in onboard logs.
…1070) * fix(install): add trap handler to spin() for clean Ctrl+C teardown spin() runs commands in the background with a spinner animation but had no signal handler. Pressing Ctrl+C during installation would kill the parent shell while leaving the background process running and the temp log file on disk. Add INT/TERM trap inside spin() that kills the background process and removes the temp file, then restore default signal handling after the background process exits normally. Closes #1020 * fix(install): add global EXIT trap for robust cleanup on all exit paths The function-scoped INT/TERM trap in spin() handles Ctrl+C well, but set -e triggering between trap clearance and explicit rm -f could leak temp files or orphan background processes. Add global _cleanup_pids and _cleanup_files arrays with an EXIT trap that catches all exit paths. spin() registers its PID and temp file on entry and deregisters on normal completion. This defense-in-depth ensures cleanup even on unexpected exits. Addresses CodeRabbit review suggestion. * fix(install): defer cleanup array deregistration until after rm/kill Move _cleanup_pids and _cleanup_files deregistration to after the actual cleanup actions (rm -f, pid termination) so the global EXIT trap still covers these resources if a signal arrives mid-cleanup. Closes CodeRabbit review comment on #1070. Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com> --------- Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com> Co-authored-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
…nimum (#1079) * fix(install): upgrade Node.js via nvm when system version is below minimum install_nodejs() returned early when any Node.js was found on PATH, regardless of version. On systems with Node 20 pre-installed (e.g. GitHub Actions ubuntu-24.04 runners), the installer skipped the nvm upgrade path and then ensure_supported_runtime() rejected the old version. Additionally, ensure_nvm_loaded() skipped sourcing nvm.sh when any node was on PATH, preventing the newly-installed Node 22 from being activated in the parent shell. Fixes #1078 * fix(test): update preflight test for nvm upgrade behavior The installer now attempts to upgrade Node.js via nvm when the system version is below minimum, instead of failing immediately. Update the test to verify the new warn-and-upgrade path, with a fake curl stub to keep the test fast. * fix(install): also check npm version before skipping nvm upgrade The early return in install_nodejs() only checked the Node.js version, but ensure_supported_runtime() also requires npm >= 10. Gate the early return on both Node.js and npm meeting the minimum requirements. Addresses CodeRabbit review feedback on #1079. * fix(onboard): increase endpoint probe timeout for large model inference The onboard endpoint validation sends a full inference request to verify the provider is reachable. The 20s max-time was too tight for large models like nemotron-3-super-120b-a12b on NVIDIA Endpoints, causing the probe to time out and onboard to fail in non-interactive mode. Increase connect-timeout from 5s to 10s and max-time from 20s to 60s. This only runs once during onboard, so the longer timeout is acceptable. * fix(install): set nvm default after installing Node.js 22 When upgrading from an existing nvm with a different default version, nvm use 22 only affects the current shell. Set the nvm default alias so that subsequent shells (e.g. the e2e test harness) that source nvm.sh also get Node 22. * fix(install): always create nemoclaw shim and fix nvm alias default Two issues preventing nemoclaw from being found after install.sh exits: 1. verify_nemoclaw() skipped shim creation when nemoclaw was found on nvm's PATH, but that PATH only exists inside install.sh's subshell. Always call ensure_nemoclaw_shim() to create ~/.local/bin/nemoclaw. 2. nvm alias default had an invalid --silent flag that silently failed, so the nvm default wasn't set for subsequent shells.
* chore(docs): add debug and uninstall to commands reference check-docs.sh found 15 commands in --help but only 13 headings in docs/reference/commands.md. Add the missing nemoclaw debug and nemoclaw uninstall entries. * chore(docs): add missing --sandbox flag to debug entry
* fix(onboard): retry gateway start with exponential backoff On some hosts (Horde VMs, first-run environments), the embedded k3s inside the OpenShell gateway needs more time to initialize than the gateway's internal health-check window allows. The first attempt fails with misleading orphaned-cgroup cleanup messages, but a second attempt typically succeeds because container images are cached and cgroup state is cleaner. Replace the single-shot gateway start + separate health check with a retry loop (up to 3 attempts, 10s/30s backoff). Each failed attempt gets a clean destroyGateway() before retry. On final failure, the error message now includes openshell doctor troubleshooting commands. Upstream tracking: NVIDIA/OpenShell#433 * fix(onboard): skip retry for recovery path, preserve health-check timing The recovery path (nemoclaw status) should not retry — only the onboard wizard benefits from retries. Also restore the original behavior of not sleeping after the final health-check iteration, matching the timing the cli.test.js tests depend on. * refactor(onboard): use p-retry for gateway start retry logic Replace hand-rolled retry loop with p-retry library for gateway start backoff. Cleaner, more conventional, and delegates retry/backoff concerns to a well-tested library. * fix(onboard): correct openshell doctor flags in troubleshooting output * fix(onboard): use --name flag for doctor logs (targets container directly) * fix(onboard): skip gateway destroy on recovery path failure The recovery path (exitOnFailure=false) should preserve gateway state for diagnostics. Gate destroyGateway() behind exitOnFailure so only the onboard path (which will retry) tears down on failure. * fix(onboard): run health checks regardless of gateway start exit code The slow-start case (where gateway start exits non-zero because OpenShell's internal health check timed out) is exactly the scenario this retry logic targets. Always run the grace-period health-check loop so a gateway that's still coming up gets sampled before teardown. * fix(onboard): preserve gateway state on final failure for doctor diagnostics Move destroyGateway() from the pRetry callback into onFailedAttempt so it only runs between retries, not on the terminal failure. The final failed gateway state is now preserved so openshell doctor logs can inspect container logs for troubleshooting.
The pre-push hooks only ran vitest for the nemoclaw/ plugin project. Root-level unit tests (test/*.test.js) were only caught by CI's test-unit job, not locally. Add a vitest-root hook that runs --project cli on pre-push so test failures are caught before push.
* feat: add Kubernetes testing infrastructure Add k8s-testing/ directory with scripts and manifests for testing NemoClaw on Kubernetes with Dynamo vLLM inference. Includes: - test-installer.sh: Public installer test (requires unattended install support) - setup.sh: Manual setup from source for development - Pod manifests for Docker-in-Docker execution Architecture: OpenShell runs k3s inside Docker, so we use DinD pods to provide the Docker daemon on Kubernetes. Signed-off-by: rwipfelnv * fix: add socat proxy for K8s DNS isolation workaround OpenShell's nested k3s cluster cannot resolve Kubernetes DNS names, so inference requests fail with 502 Bad Gateway. This adds: - socat TCP proxy setup in setup.sh to forward localhost:8000 to the K8s vLLM service endpoint - Provider configuration using host.openshell.internal:8000 which resolves to the workspace container from inside k3s - Documentation explaining the network architecture and workaround - Updated env var names to match PR #318 (NEMOCLAW_NON_INTERACTIVE) - cgroup v2 compatibility fix for Docker daemon - Removed memory limits that caused OOM Tested: Inference requests from sandboxes now route correctly through the socat proxy to the Dynamo vLLM endpoint. Depends on: #318 (non-interactive mode), #365 (Dynamo provider) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * feat: NemoKlaw - NemoClaw on Kubernetes with Dynamo support Complete K8s deployment solution for NemoClaw: - nemoklaw.yaml: Pod manifest with DinD, init containers, hostPath storage - install.sh: Interactive installer with preflight checks - Rename k8s-testing -> k8s, move old files to dev/ Key learnings: - hostPath storage (/mnt/k8s-disks) avoids ephemeral storage eviction - Init containers for docker config, openshell CLI, NemoClaw build - Workspace container installs apt packages at runtime (can't share via volumes) - socat proxy bridges K8s DNS to nested k3s (host.openshell.internal) Tested successfully with Dynamo vLLM backend on EKS. Signed-off-by: Robert Wipfel <rwipfel@nvidia.com> * fix: rename NemoKlaw to NemoClaw and document known limitations Address PR feedback: - Rename NemoKlaw -> NemoClaw (avoid confusing naming) - Rename nemoklaw.yaml -> nemoclaw-k8s.yaml - Fix hardcoded endpoint to use generic example - Remove log file from repo - Document known limitations (HTTPS proxy issue) - Update README with accurate status of what works/doesn't work Signed-off-by: rwipfelnv Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * fix: update DYNAMO_HOST to vllm-agg-frontend The aggregated frontend service is the correct endpoint for Dynamo vLLM inference. Signed-off-by: Robert Wipfel <rwipfel@nvidia.com> * docs: add Using NemoClaw section with CLI commands - Add workspace shell access command - Add sandbox status/logs/list commands - Add chat completion test example - Rename section from "What Can You Do?" to "Using NemoClaw" Signed-off-by: Robert Wipfel <rwipfel@nvidia.com> * refactor(k8s): simplify deployment to use official installer - Use official NemoClaw installer (`curl | bash`) instead of git clone/build - Switch to `custom` provider from PR #648 (supersedes dynamo-specific provider) - Remove k8s/dev/ directory (no longer needed for testing) - Use emptyDir volumes for portability across clusters - Add /etc/hosts workaround for endpoint validation during onboarding - Update README with verification steps for local inference Tested end-to-end with Dynamo vLLM backend. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * fix(k8s): resolve lint errors in yaml and markdown - Remove multi-document YAML (move namespace creation to README) - Add language specifier to fenced code block (```text) - Add blank lines before lists per markdownlint rules Signed-off-by: Robert Wipfel <rwipfel@nvidia.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> * docs(k8s): add experimental warning and clarify requirements - Add explicit experimental warning at top of README - Clarify this is for trying NemoClaw on k8s, not production - Document privileged pod and DinD requirements upfront - Add resource requirements to prerequisites Signed-off-by: Robert Wipfel <rwipfel@nvidia.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> --------- Signed-off-by: rwipfelnv Signed-off-by: Robert Wipfel <rwipfel@nvidia.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: Carlos Villela <cvillela@nvidia.com> Co-authored-by: KJ <kejones@nvidia.com>
…json (#1033) The save() function now writes to a temp file and renames it into place, preventing half-written JSON from corrupting the registry. All mutating operations (registerSandbox, updateSandbox, removeSandbox, setDefault) are wrapped in a mkdir-based advisory lock so concurrent nemoclaw processes cannot produce lost updates. Fixes #1011 Co-authored-by: KJ <kejones@nvidia.com>
…ders (#980) * fix(onboard): force chat completions API for vLLM and NIM-local providers vLLM's /v1/responses endpoint does not run the --tool-call-parser, so tool calls arrive as raw text in the response content instead of the structured tool_calls array. The probe picks openai-responses because vLLM accepts the request, but parsing only works on /v1/chat/completions. Override preferredInferenceApi to openai-completions after validation for both vllm and nim-local provider paths. Fixes #976 * fix: log when overriding probe result to chat completions Surface the API override during onboard so users see why responses API was not selected. * test: add regression test for vLLM chat completions override Verifies that setupNim() forces preferredInferenceApi to openai-completions for the vLLM provider path even when the probe detects openai-responses first. This locks down the fix for #976 so the tool-call-parser override cannot silently regress. * test: add NIM-local chat completions override regression test Companion test for the vLLM case: verifies that setupNim() forces openai-completions for the NIM-local path too, since NIM uses vLLM internally and has the same tool-call-parser limitation. --------- Co-authored-by: Benedikt Schackenberg <benedikt@users.noreply.github.com> Co-authored-by: Benedikt Schackenberg <69834303+BenediktSchackenberg@users.noreply.github.com>
#1041) nemoclaw list and the global status display only read model/provider from the local registry, which stores null when inference is configured after onboarding (e.g. via openshell inference set). This makes both commands show 'unknown' even when the gateway has a configured inference route. Query openshell inference get once per invocation and prefer the live values over stale registry entries, matching the existing behavior of the per-sandbox status command. Fixes #986 Co-authored-by: KJ <kejones@nvidia.com>
#1090) * fix: improve setup.sh and brev-setup.sh for CI and non-interactive use setup.sh: - Add timestamped log output ([HH:MM:SS] prefix) for CI visibility - Support NEMOCLAW_SANDBOX_NAME env var for custom sandbox names - Add --no-tty flag to openshell sandbox create for non-interactive mode - Pre-build base image locally when GHCR image is unavailable (forks) - Add background progress reporter during sandbox build (heartbeat every 30s with Docker step info, filters secrets from output) - Show build elapsed time on completion brev-setup.sh: - Add timestamped log output matching setup.sh format - Support SKIP_VLLM=1 to skip vLLM install on CPU-only instances * fix(test): update sandbox name test for NEMOCLAW_SANDBOX_NAME env var fallback
- Add mermaid diagram showing NemoClaw/OpenShell relationship - Explain what each layer provides (onboarding, blueprint, state, bridges) - Clarify that OpenShell is the runtime, NemoClaw is the opinionated stack
…1103) Add project-level configuration files for AI coding agents: - CLAUDE.md: Project overview, architecture map, development commands, code style conventions, test structure, and PR requirements. - AGENTS.md: Agent-specific guidance including quick reference commands, key architecture decisions, common patterns for making changes, and gotchas for working with the dual-language stack. These files help AI agents (Claude, Cursor, Codex, etc.) understand the project structure, conventions, and development workflow without needing to discover them from scratch each session.
…1104) Ollama's Go runtime creates a dual-stack (AF_INET6) socket when given OLLAMA_HOST=0.0.0.0, but WSL2's port relay only forwards IPv4 sockets to the Windows host. Docker's host-gateway cannot reach the dual-stack socket, causing inference requests to fail with connection refused. On WSL2, let Ollama bind to the default 127.0.0.1 instead, which creates an IPv4-only socket that the WSL2 relay forwards correctly.
…al sanitization (#1092) * test(security): add E2E tests for command injection and credential sanitization Adds two new Brev E2E test suites targeting the vulnerabilities fixed by PR #119 (Telegram bridge command injection) and PR #156 (credential exposure in migration snapshots + blueprint digest bypass). Test suites: - test-telegram-injection.sh: 8 tests covering command substitution, backtick injection, quote-breakout, parameter expansion, process table leaks, and SANDBOX_NAME validation - test-credential-sanitization.sh: 13 tests covering auth-profiles.json deletion, credential field stripping, non-credential preservation, symlink safety, blueprint digest verification, and pattern-based field detection These tests are expected to FAIL on main (unfixed code) and PASS once PR #119 and #156 are merged. Refs: #118, #119, #156, #813 * ci: temporarily disable repo guard for fork testing * ci: bump bootstrap timeout, skip vLLM on CPU E2E runs - Add SKIP_VLLM=1 support to brev-setup.sh - Use SKIP_VLLM=1 in brev-e2e.test.js bootstrap - Bump beforeAll timeout to 30 min for CPU instances - Bump workflow timeout to 60 min for 3 test suites * ci: bump bootstrap timeout to 40 min for sandbox image build * ci: bump Brev instance to 8x32 for faster Docker builds * ci: add real-time progress streaming for E2E bootstrap and tests - Stream SSH output to CI log during bootstrap (no more silence) - Add timestamps to brev-setup.sh and setup.sh info/warn/fail messages - Add background progress reporter during sandbox Docker build (heartbeat every 30s showing elapsed time, current Docker step, and last log line) - Stream test script output to CI log via tee + capture for assertions - Filter potential secrets from progress heartbeat output * ci: use NemoClaw launchable for E2E bootstrap Replace bare 'brev create' + brev-setup.sh with 'brev start' using the OpenShell-Community launch-nemoclaw.sh setup script. This installs Docker, OpenShell CLI, and Node.js via the launchable's proven path, then runs 'nemoclaw onboard --non-interactive' to build the sandbox (testing whether this path is faster than our manual setup.sh). Changes: - Default CPU back to 4x16 (8x32 didn't help — bottleneck was I/O) - Launchable path: brev start + setup-script URL, poll for completion, rsync PR branch, npm ci, nemoclaw onboard - Legacy path preserved (USE_LAUNCHABLE=0) - Timestamped logging throughout for timing comparison - New use_launchable workflow input (default: true) * fix: prevent openshell sandbox create from hanging in non-interactive mode openshell sandbox create without a command defaults to opening an interactive shell inside the sandbox. In CI (non-interactive SSH), this hangs forever — the sandbox goes Ready but the command never returns. The [?2004h] terminal escape codes in CI logs were bash enabling bracketed paste mode, waiting for input. Add --no-tty -- true so the command exits immediately after the sandbox is created and Ready. * fix: source nvm in non-interactive SSH for launchable path The launchable setup script installs Node.js via nvm, which sets up PATH in ~/.nvm/nvm.sh. Non-interactive SSH doesn't source .bashrc, so npm/node commands fail with 'command not found'. Source nvm.sh before running npm in the launchable path and runRemoteTest. * fix: setup.sh respects NEMOCLAW_SANDBOX_NAME env var setup.sh defaulted to 'nemoclaw' ignoring the NEMOCLAW_SANDBOX_NAME env var set by the CI test harness (e2e-test). Now uses $1 > $NEMOCLAW_SANDBOX_NAME > nemoclaw. * ci: bump full E2E test timeout to 15 min for install + sandbox build * ci: don't run full E2E alongside security tests (it destroys the sandbox) The full E2E test runs install.sh --non-interactive which destroys and rebuilds the sandbox. When TEST_SUITE=all, this kills the sandbox that beforeAll created, causing credential-sanitization and telegram-injection to fail with 'sandbox not running'. Only run full E2E when TEST_SUITE=full. * ci: pre-build base image locally when GHCR image unavailable On forks or before the first base-image workflow run, the GHCR base image (ghcr.io/nvidia/nemoclaw/sandbox-base:latest) doesn't exist. This causes the Dockerfile's FROM to fail. Now setup.sh checks for the base image and builds Dockerfile.base locally if needed. On subsequent builds, Docker layer cache makes this near-instant. Once the GHCR base image is available, this becomes a no-op (docker pull succeeds and the local build is skipped). * ci: install nemoclaw CLI after bootstrap in non-launchable path brev-setup.sh creates the sandbox but doesn't install the host-side nemoclaw CLI that test scripts need for 'nemoclaw <name> status'. Add npm install + build + link step after bootstrap. * fix: use npm_config_prefix for nemoclaw CLI install so it lands on PATH * fix: npm link from repo root where bin.nemoclaw is defined * fix(ci): register sandbox in nemoclaw registry after setup.sh bootstrap setup.sh creates the sandbox via openshell directly but never writes ~/.nemoclaw/sandboxes.json. The security test scripts check `nemoclaw <name> status` which reads the registry, causing all E2E runs to fail with 'Sandbox e2e-test not running'. Write the registry entry after nemoclaw CLI install so the test scripts can find the sandbox. * style: shfmt formatting fix in setup.sh * fix(test): exclude policy presets from C7 secret pattern scan C7 greps for 'npm_' inside the sandbox and false-positives on nemoclaw-blueprint/policies/presets/npm.yaml which contains rule names like 'npm_yarn', not actual credentials. Filter out /policies/ paths from all three pattern checks. * docs(ci): add test suite descriptions to e2e-brev workflow header Document what each test_suite option runs so maintainers can make an informed choice from the Actions UI without reading the test scripts. * ci: re-enable repo guard for e2e-brev workflow Re-enable the github.repository check so the workflow only runs on NVIDIA/NemoClaw, not on forks. * fix(test): update setup-sandbox-name test for NEMOCLAW_SANDBOX_NAME env var setup.sh now uses ${1:-${NEMOCLAW_SANDBOX_NAME:-nemoclaw}} instead of ${1:-nemoclaw}. Update the test to match and add coverage for the env var fallback path. * fix(lint): add shellcheck directives for injection test payloads and fix stdio type * fix(lint): suppress SC2034 for status_output in credential sanitization test * fix: address CodeRabbit review — timeout, pipefail, fail-closed probes, shell injection in test - Bump e2e-brev workflow timeout-minutes from 60 to 90 - Add fail-fast when launchable setup exceeds 40-min wait - Add pipefail to remote pipeline commands in runRemoteTest and npm ci - Fix backtick shell injection in validateName test loop (use process.argv) - Make sandbox_exec fail closed with __PROBE_FAILED__ sentinel - Add probe failure checks in C6/C7 sandbox assertions --------- Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Replace manual fd lifecycle (`openSync`/`writeFileSync`/`closeSync`) in `acquireOnboardLock` with a single `writeFileSync` call. If the write threw (e.g. `ENOSPC`), the catch block re-threw because the error code is not `EEXIST`, bypassing `closeSync` and leaking the descriptor. Repeated failures accumulate leaked fds until the process crashes with `EMFILE`.
…l job (#1085) * ci(nightly-e2e): remove environment NVIDIA_API_KEY from cloud-experimental job Avoid spurious deployment records; secrets still come from secrets.NVIDIA_API_KEY. Made-with: Cursor * fix(ci): match cloud-experimental nightly artifact to default install log path Upload /tmp/nemoclaw-e2e-cloud-experimental-install.log on failure; script uses the same fixed default without UID. * ci(test): split nightly cloud-experimental, shared cleanup, integration + Spark E2E * test(cli): extend timeout for install fallback and uninstall --yes * ci(test): skip huggingface policy add and check --------- Co-authored-by: Carlos Villela <cvillela@nvidia.com>
## Summary - stop requiring `NVIDIA_API_KEY` for local-only `nemoclaw start` and only gate the Telegram bridge when that bridge actually needs the key - clean up the dashboard forward, `nemoclaw` gateway, and `openshell-cluster-nemoclaw` Docker volumes when the last sandbox is destroyed - broaden unified-memory NVIDIA GPU detection beyond `GB10` while keeping `spark: true` specific to GB10 - harden policy merge/retry behavior so truncated or error-like current-policy reads rebuild from a clean `version: 1` scaffold instead of producing malformed YAML ## Issue Mapping Fixes #1191 Fixes #1160 Fixes #1182 Fixes #1162 Related #991 ## Notes - `#1188` was investigated but is not included in this PR. - The current evidence still points to a deeper runtime / proxy reachability problem on macOS + Colima rather than a bounded NemoClaw-only fix. - Keeping it out of this branch avoids speculative networking changes without strong reproduction and cross-platform coverage. ## Validation ```bash npx vitest run npx eslint bin/nemoclaw.js bin/lib/nim.js bin/lib/policies.js test/cli.test.js test/nim.test.js test/policies.test.js test/service-env.test.js npx tsc -p jsconfig.json --noEmit ``` ## References Reviewed - #1106 - #308 - #95 - #770 Signed-off-by: Kevin Jones <kejones@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** - Core services can start without an NVIDIA API key. - Enhanced unified‑memory GPU detection with more accurate capability reporting. * **Bug Fixes** - Gateway and forwarded‑port cleanup only runs when the last sandbox is removed and no live sandboxes remain. - Telegram bridge now starts only when both required tokens are present; clearer startup warnings. - Policy parsing/merge more robust for metadata‑only or malformed inputs; consistent version header formatting. * **Tests** - Added tests covering GPU detection, policy parsing/merge, CLI sandbox/gateway flows, and service startup. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary When a user enters an invalid sandbox name (e.g. starting with a digit like `7racii`), `promptValidatedSandboxName()` now shows the error and **re-prompts** instead of calling `process.exit(1)`. This prevents users from getting stuck in a loop where rerunning the onboard script reuses the invalid name from the session state, requiring manual deletion of `~/.nemoclaw` to recover. ## Changes - **`bin/lib/onboard.js`**: Replace exit-on-invalid with a `while(true)` retry loop in interactive mode. Non-interactive mode (CI/CD) still exits with status 1. - **`test/onboard.test.js`**: Source-level test verifying the retry loop exists and non-interactive fallback is preserved. ## Before ``` Sandbox name: 7racii Invalid sandbox name: '7racii' Names must be lowercase... # process.exit(1) — rerun uses same name from session, stuck ``` ## After ``` Sandbox name: 7racii Invalid sandbox name: '7racii' Names must be lowercase... Please try again. Sandbox name: my-racii ✓ Valid name accepted ``` Closes #1120 <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Sandbox name validation now allows interactive users to retry on invalid input with a clear "Please try again" prompt; non-interactive runs still terminate on invalid names. * **Tests** * Added and updated tests to verify interactive retry vs non-interactive exit and to strengthen temporary-file handling assertions (more flexible script naming and safer cleanup). <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Benedikt Schackenberg <benedikt@schackenberg.com> --------- Signed-off-by: Benedikt Schackenberg <benedikt@schackenberg.com> Co-authored-by: Carlos Villela <cv@lixo.org> Co-authored-by: Carlos Villela <cvillela@nvidia.com>
<!-- markdownlint-disable MD041 --> ## Summary Preview: https://nvidia.github.io/NemoClaw/pr-preview/pr-1209/ ## Related Issue <!-- Link to the issue: Fixes #NNN or Closes #NNN. Remove this section if none. --> ## Changes <!-- Bullet list of key changes. --> ## Type of Change <!-- Check the one that applies. --> - [ ] Code change for a new feature, bug fix, or refactor. - [ ] Code change with doc updates. - [x] Doc only. Prose changes without code sample modifications. - [ ] Doc only. Includes code sample changes. ## Testing <!-- What testing was done? --> - [ ] `npx prek run --all-files` passes (or equivalently `make check`). - [ ] `npm test` passes. - [x] `make docs` builds without warnings. (for doc-only changes) ## Checklist ### General - [ ] I have read and followed the [contributing guide](https://github.com/NVIDIA/NemoClaw/blob/main/CONTRIBUTING.md). - [ ] I have read and followed the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md). (for doc-only changes) ### Code Changes <!-- Skip if this is a doc-only PR. --> - [ ] Formatters applied — `npx prek run --all-files` auto-fixes formatting (or `make format` for targeted runs). - [ ] Tests added or updated for new or changed behavior. - [ ] No secrets, API keys, or credentials committed. - [ ] Doc pages updated for any user-facing behavior changes (new commands, changed defaults, new features, bug fixes that contradict existing docs). ### Doc Changes <!-- Skip if this PR has no doc changes. --> - [ ] Follows the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md). Try running the `update-docs` agent skill to draft changes while complying with the style guide. For example, prompt your agent with "`/update-docs` catch up the docs for the new changes I made in this PR." - [ ] New pages include SPDX license header and frontmatter, if creating a new page. - [ ] Cross-references and links verified. --- <!-- DCO sign-off (required by CI). Replace with your real name and email. --> Signed-off-by: Your Name <your-email@example.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Updated README and site docs with a new Notice & Disclaimer clarifying the software’s alpha status, advising against production use, and adding warranty and liability disclaimers. * Added a prominent announcement/warning on the docs homepage and release notes, removed the prior embedded alpha include, and noted that the software may access external materials subject to separate terms—users must verify suitability and compliance. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary The NVIDIA provider (`build`) in `REMOTE_PROVIDER_CONFIG` is the only remote provider missing the `skipVerify` flag. This causes `openshell inference set` to attempt endpoint verification during onboarding, which fails even when the API is reachable from the host — blocking onboarding completely at step 4/7. All other providers (OpenAI, Anthropic, Gemini, custom) already set `skipVerify: true` so that `--no-verify` is passed to the inference setup command. This adds the same flag to the NVIDIA provider for consistent behavior. ## Changes - Added `skipVerify: true` to the `build` entry in `REMOTE_PROVIDER_CONFIG` (`bin/lib/onboard.js`) ## Test Plan - [x] Verified all other providers already include `skipVerify: true` - [x] Confirmed `skipVerify` controls the `--no-verify` flag at line 2420 - [x] Ran full test suite: 650/653 passed (3 pre-existing failures in `service-env.test.js` unrelated to this change) Closes #1130 Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Enabled skip verification support for the NVIDIA Endpoints provider during onboarding, aligning its behavior with other remote providers for a more consistent setup experience. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com> Co-authored-by: latenighthackathon <latenighthackathon@users.noreply.github.com> Co-authored-by: Carlos Villela <cvillela@nvidia.com>
…1123) ## Summary Improves test determinism, consistency, and reliability across CLI, uninstall, and blueprint test suites by standardizing shell invocation, tightening execution patterns, and removing redundant or outdated test code. --- ## Related Issue Fixes #977 (part 1) --- ## Changes - Normalize shell invocation: - Replace `bash -lc` with `bash -c` in uninstall tests to avoid shell initialization side effects - Improve CLI test stability: - Increase timeouts for long-running commands - Standardize usage of `runWithEnv(..., timeout)` - Remove redundant / outdated test code: - Clean up unused or deprecated test logic in `runner.test.ts` - Improve test consistency: - Align execution patterns across CLI and uninstall tests - Preserve security coverage: - Maintain regression protections (e.g., path validation and credential handling) --- ## Verification - `npm test` passes locally - `npx prek run --all-files` passes in CI - No changes to CLI behavior or runtime logic - Existing security and regression tests continue to pass --- ## Rationale Some tests relied on shell initialization behavior (`bash -lc`) and inconsistent execution patterns, leading to flakiness and non-deterministic outcomes. These updates: - eliminate shell-dependent variability - standardize execution across test suites - improve reliability without impacting functionality Additionally, minor cleanup removes redundant or outdated test code to improve maintainability. --- ## Risk Assessment **Low risk** - Changes are limited to test code and execution behavior - No production code paths modified - Security and regression coverage preserved **Rollback** - Fully reversible by reverting test changes --- ## Type of Change - [x] Test / infrastructure improvement (no behavioral change) - [x] Code cleanup / maintenance --- ## Testing - [x] `npm test` passes - [x] `npx prek run --all-files` passes (CI) --- ## Checklist ### General - [x] Contributing guide followed ### Code Changes - [x] Formatters applied - [x] No user-facing behavior changes - [x] No secrets committed --- ## Summary by CodeRabbit (updated) * **Tests** * Improved CLI and uninstall test determinism by standardizing shell invocation * Increased timeouts to reduce flakiness in long-running test cases * Removed redundant or outdated test logic for improved maintainability <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Tests** * Improved TypeScript typing inside test mocks for safer compilation. * Added a shared snapshot constant and strengthened a null-check assertion. * Simplified test environment handling by explicitly setting HOME and adjusting shell invocation semantics. * Removed redundant inline comment assertions and tightened output checks. * Minor formatting and clarity tweaks across test suites for easier maintenance. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Krish Sapru <ksapru@bu.edu> --------- Co-authored-by: Carlos Villela <cvillela@nvidia.com>
## Problem
After pulling latest main, `prek run -a` fails with:
1. **ESLint error** in `snapshot.test.ts` —
`@typescript-eslint/consistent-type-imports` forbids `import()` type
annotations in generic parameters
2. **Slow tests** — gateway recovery tests sleep 8–19s each waiting for
health polls that will never succeed against stub `openshell` scripts,
adding ~40s to the suite
## Changes
### ESLint fix
- `snapshot.test.ts`: replace `importOriginal<typeof
import("node:fs")>()` with a top-level `import type fs from "node:fs"`
and use `importOriginal<typeof fs>()`
### Test performance (70s → 28s)
- `onboard.js`: make health-poll count and interval configurable via
`NEMOCLAW_HEALTH_POLL_COUNT` and `NEMOCLAW_HEALTH_POLL_INTERVAL` env
vars (production defaults unchanged)
- `cli.test.js`: set `POLL_COUNT=1`, `POLL_INTERVAL=0` in the
`runWithEnv` test helper so gateway recovery tests skip unnecessary
sleep loops
- `cli.test.js`: reduce test timeouts from 25s → 10s accordingly
### Formatting (pre-commit hooks)
- Prettier reformatting in `runner.test.ts`, `state.test.ts`,
`uninstall.test.js`
## Verification
- All 740 tests pass
- All pre-commit hooks pass (ESLint, Prettier, TypeScript, tests)
- No production behavior change (env vars default to existing values)
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Chores**
* Made gateway health polling configurable via environment variables
(poll count and interval).
* **Tests**
* Reduced CLI test timeouts and injected health-check env vars to speed
CI runs.
* Improved test typing/mocking and cleaned up test formatting for
stability and readability.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…son (#1221) ## Problem `nemoclaw -v` reported `v0.1.0` — a hard-coded value in `package.json` that was never updated when new `v*` git tags were created. The `install.sh` banner had the same issue. ## Solution Derive the version from git tags at runtime, with a layered fallback chain: | Install path | Primary source | Fallback | |---|---|---| | Dev clone (full repo) | `git describe --tags --match "v*"` | `package.json` | | `install.sh` remote (shallow clone) | `git describe` after explicit `v*` tag fetch | `.version` file stamped during install | | `npm install -g` (published tarball) | `.version` stamped by `prepublishOnly` | `package.json` | ### Pre-push hook Added `scripts/check-version-tag-sync.sh` as a prek pre-push hook. When pushing a `v*` tag, it verifies `package.json` at the tagged commit has a matching version. Regular branch pushes are unaffected. ## Files changed - **`bin/lib/version.js`** (new) — version resolver: git describe → `.version` → `package.json` - **`bin/nemoclaw.js`** — uses `getVersion()` instead of `pkg.version` - **`install.sh`** — `resolve_installer_version()` uses same fallback chain; fetches `v*` tags into shallow clones - **`package.json`** — `prepublishOnly` stamps `.version`; `files` includes it in tarball - **`.gitignore`** — ignores generated `.version` - **`scripts/check-version-tag-sync.sh`** (new) — pre-push guard - **`.pre-commit-config.yaml`** — registers the pre-push hook - **`test/install-preflight.test.js`** — updated for new version resolution ## Testing - All 740 tests pass - Verified all install paths (dev clone, shallow clone, tag divergence) - Verified pre-push hook blocks/passes correctly <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * CLI and installer now determine and display releases using git tags, stamped version files, or package metadata for more reliable version reporting. * Health poll retry counts and intervals for gateway operations are configurable via environment variables. * **Chores** * Added a pre-push check to ensure git tag and package.json version stay in sync. * `.version` files are now ignored. * **Tests** * Tests updated to accept git-describe style versions, use shorter CLI timeouts, and include minor typing/formatting refactors. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Six functions in `bin/lib/onboard.js` create temporary files using `Date.now()` and `Math.random().toString(36)`, both of which are predictable. A local attacker can win a TOCTOU race to pre-create a symlink at the predicted path in `/tmp`, enabling: 1. **Data exfiltration** — redirect curl output (API responses with model data) to an attacker-controlled location 2. **Script injection** — via `writeSandboxConfigSyncFile`, inject a malicious script that gets piped into `openshell sandbox connect` ## Changes **`bin/lib/onboard.js`:** - Added `secureTempFile(prefix, ext)` helper that uses `fs.mkdtempSync()` (OS-level `mkdtemp` syscall with cryptographically random suffix) - Replaced all 6 predictable temp file constructions: - `probeOpenAiLikeEndpoint` (line ~667) - `probeAnthropicEndpoint` (line ~711) - `fetchNvidiaEndpointModels` (line ~857) - `fetchOpenAiLikeModels` (line ~911) - `fetchAnthropicModels` (line ~947) - `writeSandboxConfigSyncFile` (line ~527) **`test/onboard.test.js`:** - Updated the sync file test to validate the new mkdtemp-based path structure instead of the old predictable pattern ## Why this approach The file already uses `fs.mkdtempSync()` securely in two other places (lines 1781 and 2710), making this fix consistent with existing code patterns rather than introducing new dependencies. The remaining `Date.now()` usage in `patchStagedDockerfile` is a build identifier written into the Dockerfile, not a temp filename — left unchanged. Fixes #1093 <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Refactor** * Improved temporary-file handling: sync scripts and downloaded responses are now written to securely named temp subdirectories and cleaned up at the directory level to reduce leftover artifacts. * **Tests** * Updated tests to validate new temp-file placement and naming pattern, exact script contents, and tightened directory-level cleanup checks. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>
…#1227) ## Problem The port-availability preflight check in `checkPortAvailable` retries with `sudo lsof` to detect root-owned listeners when unprivileged `lsof` returns empty output. Without `-n` (non-interactive mode), `sudo` blocks on a `Password:` prompt, stalling non-interactive installs: - `curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash` - `nemoclaw onboard --non-interactive` - `NEMOCLAW_NON_INTERACTIVE=1` - CI pipelines ## Fix Use `sudo -n` so it fails immediately when passwordless sudo is unavailable, falling through to the TCP bind probe instead of hanging. This is the only `sudo` call in the codebase that should degrade gracefully — the other `sudo` calls (swap creation) intentionally require elevated privileges to succeed.
…1055) ## Summary Fixes #1010 - Replace text-based string manipulation in `mergePresetIntoPolicy()` with structured YAML parsing via the `yaml` package - Merge `network_policies` by name: preset entries override existing on name collision (prevents duplicates on re-apply) - Preserve all non-network sections (`filesystem_policy`, `process`, `landlock`, etc.) - Falls back to text-based approach for non-standard entry formats (backward compatibility) ## Problem The previous implementation used regex and line splitting to inject preset entries into existing policy YAML. This produced invalid YAML when: - The same preset was applied twice (duplicate entries) - Indentation varied between the current policy and preset content - `network_policies:` appeared at unexpected positions in the document ## Changes - `bin/lib/policies.js` — Rewrite `mergePresetIntoPolicy()` to parse YAML, merge objects, serialize back. Add `yaml` as dependency. - `test/policies.test.js` — Add 3 tests with realistic preset data: structured merge, name collision dedup, non-network section preservation. Relax existing string-format assertions to check correctness instead of exact formatting. ## Test plan - [x] All 25 policy tests pass - [x] Full suite: 603/605 pass (2 pre-existing failures in `install-preflight.test.js`) - [x] Tested with real presets (npm, pypi, slack) on a live sandbox <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Improved policy merging to prefer structured YAML parsing, preserve other top-level sections, set a version field, and fall back to a tolerant text-based merge when structured parsing or merging isn’t possible. * **Tests** * Expanded tests for structured merges, preset overrides/deduplication, preservation of non-policy sections, and relaxed header-placement checks. * **Chores** * Added a YAML parsing library as a runtime dependency. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: KJ <kejones@nvidia.com>
…ints (#1220) ## Summary - Add missing `protocol: rest`, `enforcement: enforce`, and `tls: terminate` to `statsig.anthropic.com` and `sentry.io` endpoints in `openclaw-sandbox.yaml` ## Related Issue Closes #1214 ## Changes Both endpoints define GET/POST method rules, but without `protocol: rest` the proxy treats them as L4-only connections — the rules are never evaluated and any HTTP method is allowed through. The fix adds the same three fields that `api.anthropic.com` (the adjacent endpoint in the same policy group) already has: ```yaml protocol: rest enforcement: enforce tls: terminate ``` No new endpoints, no rule changes — just enabling L7 inspection on two endpoints that already have rules written for it. ## Testing - Verified the YAML structure matches the working `api.anthropic.com` endpoint pattern - No schema changes — `protocol`, `enforcement`, and `tls` are existing fields used across the policy file ## Checklist - [x] Conventional commit format - [x] Scoped to issue, no unrelated changes - [x] No secrets or credentials <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** * Updated network policy configurations for external service endpoints, including enhanced security enforcement and TLS termination settings. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: latenighthackathon <latenighthackathon@users.noreply.github.com> Co-authored-by: KJ <kejones@nvidia.com>
…ript modules (#1240) ## Summary - Extract ~210 lines of pure, side-effect-free functions from the 3,800-line `onboard.js` into **5 typed TypeScript modules** under `src/lib/`: - `gateway-state.ts` — gateway/sandbox state classification from openshell output - `validation.ts` — failure classification, API key validation, model ID checks - `url-utils.ts` — URL normalization, text compaction, env formatting - `build-context.ts` — Docker build context filtering, recovery hints - `dashboard.ts` — dashboard URL resolution and construction - Add **56 co-located unit tests** (`src/lib/*.test.ts`) for the extracted modules - Set up CLI TypeScript compilation: `tsconfig.src.json` compiles `src/` → `dist/` as CJS - `onboard.js` imports from compiled `dist/lib/` output — transparent to callers - Pre-commit hook updated to build TS and include `dist/lib/` in coverage These functions are **not touched by any #924 blocker PR** (#781, #782, #819, #672, #634, #890), so this extraction is safe to land immediately. ## Test plan - [x] 598 CLI tests pass (542 existing + 56 new) - [x] `tsc -p tsconfig.src.json` compiles cleanly - [x] `tsc -p tsconfig.cli.json` type-checks cleanly - [x] `tsc -p jsconfig.json` type-checks cleanly - [x] Coverage ratchet passes with `dist/lib/` included Closes #1237. Relates to #924 (shell consolidation). 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Improved sandbox-creation recovery hints and targeted remediation commands. * Smarter dashboard URL resolution and control-UI URL construction. * **Bug Fixes** * More accurate gateway and sandbox state detection. * Enhanced classification of validation/apply failures and safer model/key validation. * Better provider URL normalization and loopback handling. * **Tests** * Added comprehensive tests covering new utilities. * **Chores** * CLI now builds before CLI tests; CI/commit hooks updated to run the CLI build. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ard (#1235) ## Summary - Delete `scripts/setup.sh` (324 lines) — every step it performed is already handled by `nemoclaw onboard` with better error recovery (exponential backoff via pRetry), interactive prompts, registry management, and WSL2-aware Ollama binding - `nemoclaw setup` now delegates to `onboard` instead of shelling out to the deleted script - `brev-setup.sh` installs the nemoclaw CLI and runs `nemoclaw onboard --non-interactive` instead of calling setup.sh - Remove setup.sh-specific tests and assertions (3 test removals across 3 files); all equivalent assertions against `onboard.js` already exist ## Test plan - [x] 541 CLI tests pass (`vitest run --project cli`) - [ ] Verify `nemoclaw setup` prints deprecation notice and runs onboard - [ ] Verify `brev-setup.sh` bootstraps correctly on a fresh Brev VM (E2E) Relates to #924 (shell consolidation). Supersedes #1230. 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Refactor** * Migrated legacy setup to a unified onboarding flow and removed the old top-level setup orchestration. * **Chores** * Updated bootstrap to build/install the CLI locally and run onboarding non-interactively. * Adjusted setup walkthrough instructions to reference onboarding. * **Tests** * Removed tests tied to the removed setup script and sandbox-name behavior. * Added CLI dispatch tests verifying setup argument forwarding into onboarding. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…1217) ## Summary Fixes #117 — Insecure authentication configuration in gateway setup. > **Supersedes PR #123**, which targeted the correct security issues but patched code paths that have since moved. Recommend closing #123 in favor of this PR. --- ## Background NemoClaw runs an AI assistant inside a locked-down sandbox. The **gateway** is the front door to that sandbox, and the **Control UI** is a web dashboard users open in their browser to interact with the assistant. There were three security problems with that front door: ### Problem 1: Device auth was permanently disabled 🔓 The config had `dangerouslyDisableDeviceAuth: True` hardcoded. This meant any device could connect to the gateway without proving its identity. It was added as a temporary workaround (commit `5fb629f`) and never reverted. While the sandbox network isolation mitigates this today, it becomes dangerous if combined with LAN-bind changes or cloudflared tunnels in remote deployments — resulting in an unauthenticated, publicly reachable dashboard. **Fix:** Default to `false` (device auth enabled). Developers who genuinely need it disabled for headless/dev environments can set `NEMOCLAW_DISABLE_DEVICE_AUTH=1` as a build arg. ### Problem 2: Insecure auth was always allowed, even over HTTPS 🌐 `allowInsecureAuth: True` was hardcoded, meaning insecure (non-HTTPS) authentication was permitted regardless of deployment context. For local development on `http://127.0.0.1` this is fine, but for production/remote deployments over `https://` it defeats the purpose of TLS. **Fix:** Derive it from the `CHAT_UI_URL` scheme. If the URL is `http://` → allow insecure auth (local dev). If `https://` → block it (production). ### Problem 3: The auto-pair bouncer let everyone in 🚪 The `start_auto_pair` watcher automatically approved *every* pending device pairing request with zero validation. A rogue or unexpected client type could pair with the gateway unchallenged. **Fix:** Only approve devices with recognized `clientId` (`openclaw-control-ui`) or `clientMode` (`webchat`). Unknown clients are rejected and logged. --- ## Why this PR exists (instead of merging #123) PR #123 (opened Mar 17) correctly identified all three issues and proposed the right fixes. However, since then, commit `0f8eedd` ("make openclaw.json immutable at runtime, move config to build time") moved the gateway `controlUi` configuration from `scripts/nemoclaw-start.sh` into the **Dockerfile**. The config is now baked at image build time and verified via SHA-256 hash at startup. PR #123's changes to the start script's Python config block would patch dead code — that block no longer exists. This PR applies the same security logic where the code actually lives now: | Issue | PR #123 patched | This PR patches | |-------|----------------|----------------| | `dangerouslyDisableDeviceAuth` | `scripts/nemoclaw-start.sh` (removed) | `Dockerfile` (build-time config) | | `allowInsecureAuth` | `scripts/nemoclaw-start.sh` (removed) | `Dockerfile` (build-time config) | | Auto-pair whitelisting | `scripts/nemoclaw-start.sh` ✅ | `scripts/nemoclaw-start.sh` ✅ | --- ## Changes ### `Dockerfile` (+8/-2) - `dangerouslyDisableDeviceAuth` → derived from `NEMOCLAW_DISABLE_DEVICE_AUTH` env var (default `0` = secure) - `allowInsecureAuth` → derived from `CHAT_UI_URL` scheme (`false` when `https://`) - New `ARG NEMOCLAW_DISABLE_DEVICE_AUTH=0` with `ENV` promotion (follows existing C-2 safe pattern — no ARG interpolation into Python source) ### `scripts/nemoclaw-start.sh` (+12/-4) - Auto-pair only approves `ALLOWED_CLIENTS = {'openclaw-control-ui'}` and `ALLOWED_MODES = {'webchat'}` - Validates `isinstance(device, dict)` before field access - Rejects unknown clients with `[auto-pair] rejected unknown client=... mode=...` log - Logs `client=` on approval for audit trail - Documents `NEMOCLAW_DISABLE_DEVICE_AUTH` in script header ### `test/security-c2-dockerfile-injection.test.js` (+6 tests) - No hardcoded `True` for `dangerouslyDisableDeviceAuth` - No hardcoded `True` for `allowInsecureAuth` - Env var derivation for both settings - `ARG` defaults to `0` - `ENV` promotion before `RUN` layer ### `test/nemoclaw-start.test.js` (+7 tests) - Whitelist definitions (`ALLOWED_CLIENTS`, `ALLOWED_MODES`) - Rejection logic for unknown clients - Dict type validation - Client identity in approval logs - No unconditional approval pattern - `NEMOCLAW_DISABLE_DEVICE_AUTH` documented --- ## Why no Brev E2E test These changes are **deterministic config values and simple conditional logic**, not runtime behavior with environmental ambiguity: 1. **Dockerfile config** — Two Python booleans evaluated at build time: `os.environ.get(...) == '1'` and `parsed.scheme != 'https'`. If the source is correct, the output is correct. 2. **Auto-pair whitelisting** — Python set membership check. No shell interpretation, no SSH pipeline, no quoting edge cases. This is fundamentally different from something like the telegram bridge injection fix, where you *must* send real payloads through a real shell on a real sandbox to prove metacharacters don't get interpreted. Our changes have no gap between "looks right in source" and "works right at runtime." The 13 source-pattern tests provide sufficient confidence. If deeper verification is desired in the future, a **build-time unit test** that runs the Dockerfile's Python snippet and asserts the JSON output would be the appropriate next step — a regular CI test, not a Brev instance. --- ## Test plan - [x] All 752 tests pass (739 existing + 13 new) - [x] `shellcheck` ✅ - [x] `hadolint` ✅ - [x] `shfmt` / `prettier` ✅ - [x] `gitleaks` ✅ - [ ] Default onboarding: gateway starts with device auth enabled, Control UI pairing works via auto-pair - [ ] `NEMOCLAW_DISABLE_DEVICE_AUTH=1` build: device auth is disabled (backward-compatible escape hatch) - [ ] `CHAT_UI_URL=https://...` build: `allowInsecureAuth` is `false` - [ ] `CHAT_UI_URL=http://127.0.0.1:18789` (default): `allowInsecureAuth` is `true` - [ ] Unknown device type connecting to gateway: auto-pair rejects and logs - [ ] `openclaw-control-ui` / `webchat` devices: auto-pair approves as before <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Security** * Device authentication can be toggled at build time; UI insecure-auth is allowed only when the control UI URL uses http. * **Bug Fixes** * Auto-pair processing now validates inputs, rejects unknown client/mode, logs identifiers, and deduplicates requests to avoid repeated approvals/rejections. * **Tests** * Added tests for auto-pair validation, logging, de-duplication, and configurable auth behavior. * **Chores** * Sandbox/setup flow now enforces the build-time device-auth setting for images. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- markdownlint-disable MD041 --> ## Summary Follow up of #1207 to add unit tests ## Changes Add tests verifying CoreDNS patching is skipped on WSL2 and still applied on non-WSL runtimes. Existing tests updated to pass explicit platform opts so they work correctly when run from a WSL2 host. ## Type of Change <!-- Check the one that applies. --> - [ ] Code change for a new feature, bug fix, or refactor. - [ ] Code change with doc updates. - [ ] Doc only. Prose changes without code sample modifications. - [ ] Doc only. Includes code sample changes. ## Testing <!-- What testing was done? --> - [X] `npx prek run --all-files` passes (or equivalently `make check`). - [X] `npm test` passes. - [ ] `make docs` builds without warnings. (for doc-only changes) ## Checklist ### General - [X] I have read and followed the [contributing guide](https://github.com/NVIDIA/NemoClaw/blob/main/CONTRIBUTING.md). - [ ] I have read and followed the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md). (for doc-only changes) ### Code Changes <!-- Skip if this is a doc-only PR. --> - [X] Formatters applied — `npx prek run --all-files` auto-fixes formatting (or `make format` for targeted runs). - [X] Tests added or updated for new or changed behavior. - [X] No secrets, API keys, or credentials committed. - [ ] Doc pages updated for any user-facing behavior changes (new commands, changed defaults, new features, bug fixes that contradict existing docs). ### Doc Changes <!-- Skip if this PR has no doc changes. --> - [ ] Follows the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md). Try running the `update-docs` agent skill to draft changes while complying with the style guide. For example, prompt your agent with "`/update-docs` catch up the docs for the new changes I made in this PR." - [ ] New pages include SPDX license header and frontmatter, if creating a new page. - [ ] Cross-references and links verified. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Tests** * Improved platform runtime tests to pass full context options and validate behavior for specific known runtimes. * Added a coverage case ensuring WSL detection prevents the patch in the appropriate scenario. * Retained the existing test for skipping unknown runtimes. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Carlos Villela <cvillela@nvidia.com>
## Summary - **install.sh missing build:cli**: PR #1240 replaced `bin/lib/` CJS modules with shims that re-export from `dist/lib/`, but `install.sh` never ran `npm run build:cli`. Every fresh install crashed on `nemoclaw onboard` with `Cannot find module '../../dist/lib/preflight'` - **E2E destroy verification**: The cleanup phase used `nemoclaw list` to verify a sandbox was gone, but `list` triggers gateway recovery which can restart a destroyed gateway and re-import the sandbox. Now checks the registry file directly instead - Uses `--if-present` so installs targeting older tags without `build:cli` don't break ## Test plan - [ ] Run `install.sh` on a clean VM and verify `nemoclaw onboard` no longer crashes - [ ] Verify `dist/lib/preflight.js` exists after install completes - [ ] Nightly E2E cloud-e2e passes the destroy verification phase --------- Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
## Summary Fix incorrect `-f` shorthand in `docs/monitoring/monitor-sandbox-activity.md`. The CLI only accepts `--follow` — the `-f` flag is silently ignored, causing users to miss real-time log output. ## Related Issue Fixes #1021 ## Changes - Line 57: `nemoclaw <name> logs -f` → `nemoclaw <name> logs --follow` - Line 90: `nemoclaw <name> logs -f` → `nemoclaw <name> logs --follow` ## Type of Change - [ ] Code change for a new feature, bug fix, or refactor. - [ ] Code change with doc updates. - [x] Doc only. Prose changes without code sample modifications. - [ ] Doc only. Includes code sample changes. ## Testing - [ ] `npx prek run --all-files` passes (or equivalently `make check`). - [ ] `npm test` passes. - [x] `make docs` builds without warnings. (for doc-only changes) ## Checklist ### General - [x] I have read and followed the [contributing guide](https://github.com/NVIDIA/NemoClaw/blob/main/CONTRIBUTING.md). - [x] I have read and followed the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md). (for doc-only changes) ### Doc Changes - [x] Follows the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md). - [ ] New pages include SPDX license header and frontmatter, if creating a new page. - [x] Cross-references and links verified. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Standardized log-following examples to use the long-form --follow flag in monitoring and troubleshooting instructions for consistent real-time log streaming. * Improved formatting by splitting a status/list command across lines to enhance readability without changing command semantics. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com> --------- Co-authored-by: latenighthackathon <latenighthackathon@users.noreply.github.com> Co-authored-by: Carlos Villela <cvillela@nvidia.com>
## Summary - Convert `bin/lib/runtime-recovery.js` (84 lines) to `src/lib/runtime-recovery.ts` - Typed `StateClassification` interface for sandbox/gateway state - 5 pure classifiers/parsers + 1 I/O function (`getRecoveryCommand`) - Co-locate tests with additional edge cases Stacked on #1240. 617 CLI tests pass. Coverage ratchet passes. Relates to #924 (shell consolidation). 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Improved runtime recovery logic to better parse CLI output and determine recovery actions, including more reliable sandbox/gateway state classification and recovery command selection. * **Bug Fixes** * Correctly treats "Disconnected" gateway status as inactive and handles error/empty lines when listing sandboxes. * **Tests** * Expanded test coverage for edge cases, malformed input, and empty states. * **Refactor** * Runtime recovery helpers consolidated to a shared compiled module. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary - Convert `bin/lib/local-inference.js` (228 lines) to `src/lib/local-inference.ts` - Typed interfaces: `ValidationResult`, `GpuInfo`, `RunCaptureFn` - 9 pure functions + 4 with DI-injected `runCapture` - Co-locate tests (30 tests) Stacked on #1240. 616 CLI tests pass. Coverage ratchet passes. Relates to #924 (shell consolidation). 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added local inference provider validation and connectivity diagnostics for vllm-local and ollama-local. * Implemented Ollama model discovery, warmup/probe commands, and intelligent model selection based on GPU memory. * Added health-check and model readiness checks to surface configuration or networking issues earlier. * **Tests** * Enhanced test coverage for provider validation, model detection, and unknown-provider handling; updated test targets accordingly. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary - Convert `bin/lib/nim.js` (259 lines) to `src/lib/nim.ts` - Typed interfaces: `NimModel`, `GpuDetection`, `NimStatus` - 4 pure functions + subprocess-heavy I/O functions (docker, nvidia-smi) - Co-locate tests (16 tests) with updated mock pattern for dist/ paths Stacked on #1240. 615 CLI tests pass. Coverage ratchet passes. Relates to #924 (shell consolidation). 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * GPU detection with capability assessment for NIM containers * Container lifecycle management (start, stop, status monitoring) with automatic health checks * Model enumeration and image selection for launching containers * **Refactor** * Simplified module surface and improved internal architecture for maintainability * **Tests** * Updated test suite to align with the refactored module behavior and stronger typings <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixes: #1163 Now, install.sh does onboarding step. Hence, there is no need to "run nemoclaw onboard" in the setup-spark.sh This also aligns with the spark setup instructions in the spark-install.md <!-- markdownlint-disable MD041 --> ## Summary <!-- 1-3 sentences: what this PR does and why. --> ## Related Issue <!-- Link to the issue: Fixes #NNN or Closes #NNN. Remove this section if none. --> ## Changes <!-- Bullet list of key changes. --> ## Type of Change <!-- Check the one that applies. --> - [ ] Code change for a new feature, bug fix, or refactor. - [ ] Code change with doc updates. - [ ] Doc only. Prose changes without code sample modifications. - [ ] Doc only. Includes code sample changes. ## Testing <!-- What testing was done? --> - [ ] `npx prek run --all-files` passes (or equivalently `make check`). - [ ] `npm test` passes. - [ ] `make docs` builds without warnings. (for doc-only changes) ## Checklist ### General - [ ] I have read and followed the [contributing guide](https://github.com/NVIDIA/NemoClaw/blob/main/CONTRIBUTING.md). - [ ] I have read and followed the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md). (for doc-only changes) ### Code Changes <!-- Skip if this is a doc-only PR. --> - [ ] Formatters applied — `npx prek run --all-files` auto-fixes formatting (or `make format` for targeted runs). - [ ] Tests added or updated for new or changed behavior. - [ ] No secrets, API keys, or credentials committed. - [ ] Doc pages updated for any user-facing behavior changes (new commands, changed defaults, new features, bug fixes that contradict existing docs). ### Doc Changes <!-- Skip if this PR has no doc changes. --> - [ ] Follows the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md). Try running the `update-docs` agent skill to draft changes while complying with the style guide. For example, prompt your agent with "`/update-docs` catch up the docs for the new changes I made in this PR." - [ ] New pages include SPDX license header and frontmatter, if creating a new page. - [ ] Cross-references and links verified. --- <!-- DCO sign-off (required by CI). Replace with your real name and email. --> Signed-off-by: Your Name <your-email@example.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** * Streamlined DGX Spark Docker setup instructions by removing a post-configuration guidance prompt. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Paritosh Dixit <paritoshd@nvidia.com> Co-authored-by: Aaron Erickson 🦞 <aerickson@nvidia.com>
## Summary - Convert `bin/lib/onboard-session.js` (440 lines) to `src/lib/onboard-session.ts` - Typed interfaces: `Session`, `StepState`, `SessionFailure`, `LockResult`, `SessionUpdates`, `LockInfo`, `SessionMetadata` - Full migration including fs I/O (session persistence, file locking) and pure helpers (redaction, validation, normalization) - Co-locate tests (14 tests) with updated cache-clearing for dist/ path Stacked on #1240. 615 CLI tests pass. Coverage ratchet passes. Relates to #924 (shell consolidation). 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Refactor** * Reorganized onboarding session management module structure and improved distribution of the codebase. Session tracking, step-level state management, lock handling, and sensitive data redaction functionality have been consolidated and optimized for better maintainability. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary - Convert `bin/lib/inference-config.js` (143 lines) to `src/lib/inference-config.ts` - Typed interfaces: `ProviderSelectionConfig`, `GatewayInference` - All 3 exports are pure — straightforward full conversion - Co-locate tests: `test/inference-config.test.js` → `src/lib/inference-config.test.ts` Stacked on #1240. 614 CLI tests pass. Coverage ratchet passes. Relates to #924 (shell consolidation). 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Refactor** * Centralized inference configuration, provider selection, and gateway output parsing into a single consolidated module for more consistent behavior. * **Tests** * Updated tests to match the consolidated module and improved assertion patterns. * **Chores** * Install/prep process now includes a CLI build step during project setup (silently handled). <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary - remove `Qwen3.5 397B A17B` from the curated NVIDIA Endpoints cloud model picker - keep the rest of the routed inference model handling unchanged ## Why QA has been using `Qwen3.5 397B A17B` as the default validation target for NVIDIA Endpoints onboarding, but the endpoint team has confirmed that this model is currently not working. This PR does **not** claim that the rest of the NVIDIA Endpoints catalog is broken. It only removes the specific known-bad curated picker entry so onboarding defaults and manual QA runs stop selecting it from the built-in list. Related #1161 ## Validation ```bash npx vitest run test/inference-config.test.js npx eslint bin/lib/inference-config.js test/inference-config.test.js npx tsc -p jsconfig.json --noEmit ``` Signed-off-by: Kevin Jones <kejones@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Removed Qwen3.5 397B A17B model from available cloud model options. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Aaron Erickson <aerickson@nvidia.com> Co-authored-by: Aaron Erickson <aerickson@nvidia.com>
## Summary - harden sandbox startup on Spark/arm64 by unwrapping the `env ... nemoclaw-start` bootstrap self-invocation inside the entrypoint instead of trying to `gosu` into it - make `nemoclaw destroy` truthful: fail on real OpenShell delete errors, but treat already-missing live sandboxes as safe stale-registry cleanup - add regression coverage for both the Spark bootstrap wrapper path and the destroy error-handling paths ## Issues - Fixes #1054 - Fixes #1028 - Fixes #1243 ## Investigated But Not Included - #1063 - investigated during this branch - still appears to be the same upstream trusted-proxy/local-DNS behavior tracked by #396 - direct `curl` inside the sandbox works, which points away from a bounded NemoClaw-only fix in this PR ## Validation Local: ```bash npm run build:cli npx vitest run test/nemoclaw-start.test.js test/cli.test.js test/onboard.test.js npx eslint bin/nemoclaw.js test/nemoclaw-start.test.js test/cli.test.js npx tsc -p jsconfig.json --noEmit ``` Spark (`spark-d8c8`), commit `0efbd66` in a disposable worktree: ```bash npm install --include=dev cd nemoclaw && npm install --include=dev cd .. npm run build:cli npx vitest run test/nemoclaw-start.test.js test/cli.test.js test/onboard.test.js npx eslint bin/nemoclaw.js test/nemoclaw-start.test.js test/cli.test.js npx tsc -p jsconfig.json --noEmit ``` Result: - passed on Spark - `3` files, `92` tests passed - eslint passed - typecheck passed Spark caveat: - the existing Spark checkout had incomplete root JS dev dependencies (`vitest/config` missing), so the disposable worktree required a local `npm install --include=dev` repair before validation ## Notes Plan note: - `/Users/kejones/nemoclaw-notes/spark-good-blockers-plan-2026-04-01.md` Signed-off-by: Kevin Jones <kejones@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Destroy now treats an already-absent sandbox as successful, prints a clear "already absent" message, proceeds with cleanup, and only fails for genuine delete errors (printing delete output). * Start-wrapper handling improved so env ... nemoclaw-start invocations correctly export and pass through environment assignments. * **Tests** * Added tests for destroy behavior (failure vs already-missing) and for the start-script argument-unwrapping. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Aaron Erickson <aerickson@nvidia.com>
…d" (#1288) ## Summary The `classifyGatewayStatus` regex matched "Not connected" as connected because `\bConnected\b` matches the substring. This was partially fixed in #1268 (added Disconnected guard) but "Not connected" still slipped through. Fix: - Move unavailable-pattern check before connected check - Use strict line-anchored regex: `/^\s*(?:Status:\s*)?Connected\s*$/im` - Add regression tests for "Status: Not connected" → inactive This fix was intended for #1268 but the auto-merge squashed before the commit landed. Fixes #1279. 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Improved gateway connection status detection to handle edge cases and properly classify various status formats, ensuring more accurate connection state reporting. * **Tests** * Added test coverage for additional gateway status variations. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary Adds a `/cut-release-tag` agent skill that walks through creating annotated semver tags on main. The skill: - Fetches the latest semver tag and computes patch/minor/major bump options - Prompts for which bump (patch default) - Shows the changelog since the last tag - Creates an annotated tag + moves the `latest` floating tag - Pushes both Tested live — used it to cut `v0.0.3` just now. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Added release tagging workflow documentation with step-by-step instructions for semantic version management, including tag creation, changelog reference points, and verification checks. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Summary
brewnetwork policy preset that allows scoped HTTPS access to theHomebrew/Linuxbrew package registry domains (formulae.brew.sh, github.com/Homebrew,
ghcr.io, and associated GitHub CDN hosts).
Homebrew into the sandbox image. The sandbox image is immutable by design;
embedding a full Linuxbrew installation would add significant bloat for a
tool that not every sandbox needs.
approach is to customise the OpenShell sandbox policy to allow the
necessary network access, rather than baking package managers into the
base image.
will refuse to run. There is an undocumented workaround for non-root
installations discussed in Homebrew on linux without root access Homebrew/discussions#3386.
Related Issue
Fixes #491
Changes
Type of Change
Testing
npx prek run --all-filespasses (or equivalentlymake check).npm testpasses.make docsbuilds without warnings. (for doc-only changes)Checklist
General
Code Changes
npx prek run --all-filesauto-fixes formatting (ormake formatfor targeted runs).Doc Changes
update-docsagent skill to draft changes while complying with the style guide. For example, prompt your agent with "/update-docscatch up the docs for the new changes I made in this PR."Signed-off-by: Tinson Lai tinsonl@nvidia.com
Summary by CodeRabbit
New Features
Documentation
CI / Workflow