Skip to content

fix: add policy preset for brew#1292

Closed
laitingsheng wants to merge 115 commits intomainfrom
fix/brew-policy
Closed

fix: add policy preset for brew#1292
laitingsheng wants to merge 115 commits intomainfrom
fix/brew-policy

Conversation

@laitingsheng
Copy link
Copy Markdown
Contributor

@laitingsheng laitingsheng commented Apr 2, 2026

Summary

  • Add a brew network policy preset that allows scoped HTTPS access to the
    Homebrew/Linuxbrew package registry domains (formulae.brew.sh, github.com/Homebrew,
    ghcr.io, and associated GitHub CDN hosts).
  • This preset only provides default network endpoints — it does not install
    Homebrew into the sandbox image. The sandbox image is immutable by design;
    embedding a full Linuxbrew installation would add significant bloat for a
    tool that not every sandbox needs.
  • For workflows that require installing system packages, the recommended
    approach is to customise the OpenShell sandbox policy to allow the
    necessary network access, rather than baking package managers into the
    base image.
  • Since the sandbox runs as a non-root user, the standard Homebrew installer
    will refuse to run. There is an undocumented workaround for non-root
    installations discussed in Homebrew on linux without root access Homebrew/discussions#3386.

Related Issue

Fixes #491

Changes

Type of Change

  • Code change for a new feature, bug fix, or refactor.
  • Code change with doc updates.
  • Doc only. Prose changes without code sample modifications.
  • Doc only. Includes code sample changes.

Testing

  • npx prek run --all-files passes (or equivalently make check).
  • npm test passes.
  • make docs builds without warnings. (for doc-only changes)

Checklist

General

Code Changes

  • Formatters applied — npx prek run --all-files auto-fixes formatting (or make format for targeted runs).
  • Tests added or updated for new or changed behavior.
  • No secrets, API keys, or credentials committed.
  • Doc pages updated for any user-facing behavior changes (new commands, changed defaults, new features, bug fixes that contradict existing docs).

Doc Changes

  • Follows the style guide. Try running the update-docs agent skill to draft changes while complying with the style guide. For example, prompt your agent with "/update-docs catch up the docs for the new changes I made in this PR."
  • New pages include SPDX license header and frontmatter, if creating a new page.
  • Cross-references and links verified.

Signed-off-by: Tinson Lai tinsonl@nvidia.com

Summary by CodeRabbit

  • New Features

    • Homebrew/Linuxbrew network preset added; Homebrew installs allowed and Linuxbrew binaries recognized.
    • Onboarding resume/recovery: interrupted onboard sessions can be resumed and missing sandboxes auto-reconciled.
    • Installer improvements: robust Node/version checks, PATH/profile fixes, and safer install verification.
  • Documentation

    • Updated getting-started, troubleshooting, security best-practices, and reference docs; Node minimum now 22.16.0.
  • CI / Workflow

    • Added DCO check and consolidated CI checks/workflows.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 2, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 792c0fc3-5e87-46d2-9793-f1fefb0e0012

📥 Commits

Reviewing files that changed from the base of the PR and between 1646090 and 4591280.

⛔ Files ignored due to path filters (2)
  • nemoclaw/package-lock.json is excluded by !**/package-lock.json
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (155)
  • .agents/skills/cut-release-tag/SKILL.md
  • .agents/skills/nemoclaw-get-started/SKILL.md
  • .agents/skills/nemoclaw-reference/references/troubleshooting.md
  • .dockerignore
  • .github/PULL_REQUEST_TEMPLATE.md
  • .github/actions/basic-checks/action.yaml
  • .github/actions/resolve-sandbox-base-image/action.yaml
  • .github/dco-bypass.txt
  • .github/workflows/base-image.yaml
  • .github/workflows/dco-check.yaml
  • .github/workflows/docs-preview-deploy.yaml
  • .github/workflows/e2e-brev.yaml
  • .github/workflows/main.yaml
  • .github/workflows/nightly-e2e.yaml
  • .github/workflows/pr.yaml
  • .github/workflows/sandbox-images-and-e2e.yaml
  • .gitignore
  • .pre-commit-config.yaml
  • .prettierignore
  • .prettierrc
  • AGENTS.md
  • CLAUDE.md
  • CONTRIBUTING.md
  • Dockerfile
  • Dockerfile.base
  • Makefile
  • README.md
  • SECURITY.md
  • bin/lib/credentials.js
  • bin/lib/inference-config.js
  • bin/lib/local-inference.js
  • bin/lib/nim.js
  • bin/lib/onboard-session.js
  • bin/lib/onboard.js
  • bin/lib/platform.js
  • bin/lib/policies.js
  • bin/lib/preflight.js
  • bin/lib/registry.js
  • bin/lib/resolve-openshell.js
  • bin/lib/runner.js
  • bin/lib/runtime-recovery.js
  • bin/lib/version.js
  • bin/nemoclaw.js
  • ci/coverage-threshold-cli.json
  • ci/coverage-threshold-plugin.json
  • docs/_includes/alpha-statement.md
  • docs/about/how-it-works.md
  • docs/about/overview.md
  • docs/about/release-notes.md
  • docs/conf.py
  • docs/deployment/deploy-to-remote-gpu.md
  • docs/get-started/quickstart.md
  • docs/index.md
  • docs/monitoring/monitor-sandbox-activity.md
  • docs/reference/architecture.md
  • docs/reference/commands.md
  • docs/reference/network-policies.md
  • docs/reference/troubleshooting.md
  • docs/security/best-practices.md
  • eslint.config.mjs
  • install.sh
  • k8s/README.md
  • k8s/nemoclaw-k8s.yaml
  • nemoclaw-blueprint/policies/openclaw-sandbox.yaml
  • nemoclaw-blueprint/policies/presets/discord.yaml
  • nemoclaw-blueprint/policies/presets/slack.yaml
  • nemoclaw/package.json
  • nemoclaw/src/blueprint/runner.test.ts
  • nemoclaw/src/blueprint/snapshot.test.ts
  • nemoclaw/src/blueprint/ssrf.test.ts
  • nemoclaw/src/blueprint/ssrf.ts
  • nemoclaw/src/blueprint/state.test.ts
  • nemoclaw/src/commands/migration-state.test.ts
  • nemoclaw/src/commands/migration-state.ts
  • package.json
  • scripts/brev-setup.sh
  • scripts/check-coverage-ratchet.sh
  • scripts/check-coverage-ratchet.ts
  • scripts/check-version-tag-sync.sh
  • scripts/debug.sh
  • scripts/fix-coredns.sh
  • scripts/install-openshell.sh
  • scripts/install.sh
  • scripts/nemoclaw-start.sh
  • scripts/setup-dns-proxy.sh
  • scripts/setup-spark.sh
  • scripts/setup.sh
  • scripts/start-services.sh
  • scripts/telegram-bridge.js
  • scripts/walkthrough.sh
  • src/lib/build-context.ts
  • src/lib/dashboard.test.ts
  • src/lib/dashboard.ts
  • src/lib/gateway-state.ts
  • src/lib/inference-config.test.ts
  • src/lib/inference-config.ts
  • src/lib/local-inference.test.ts
  • src/lib/local-inference.ts
  • src/lib/nim.test.ts
  • src/lib/nim.ts
  • src/lib/onboard-session.test.ts
  • src/lib/onboard-session.ts
  • src/lib/preflight.test.ts
  • src/lib/preflight.ts
  • src/lib/runtime-recovery.test.ts
  • src/lib/runtime-recovery.ts
  • src/lib/url-utils.test.ts
  • src/lib/url-utils.ts
  • src/lib/validation.test.ts
  • src/lib/validation.ts
  • test/cli.test.js
  • test/credential-exposure.test.js
  • test/credentials.test.js
  • test/dns-proxy.test.js
  • test/e2e/brev-e2e.test.js
  • test/e2e/e2e-cloud-experimental/check-docs.sh
  • test/e2e/e2e-cloud-experimental/cleanup.sh
  • test/e2e/e2e-cloud-experimental/skip/01-onboard-completion.sh
  • test/e2e/e2e-cloud-experimental/skip/04-nemoclaw-openshell-status-parity.sh
  • test/e2e/e2e-cloud-experimental/skip/05-network-policy.sh
  • test/e2e/test-credential-sanitization.sh
  • test/e2e/test-e2e-cloud-experimental.sh
  • test/e2e/test-full-e2e.sh
  • test/e2e/test-gpu-e2e.sh
  • test/e2e/test-onboard-repair.sh
  • test/e2e/test-onboard-resume.sh
  • test/e2e/test-spark-install.sh
  • test/e2e/test-telegram-injection.sh
  • test/gateway-cleanup.test.js
  • test/install-preflight.test.js
  • test/local-inference.test.js
  • test/nemoclaw-cli-recovery.test.js
  • test/nemoclaw-start.test.js
  • test/nim.test.js
  • test/onboard-readiness.test.js
  • test/onboard-selection.test.js
  • test/onboard.test.js
  • test/platform.test.js
  • test/policies.test.js
  • test/preflight.test.js
  • test/registry.test.js
  • test/resolve-openshell.test.js
  • test/runner.test.js
  • test/runtime-shell.test.js
  • test/security-binaries-restriction.test.js
  • test/security-c2-dockerfile-injection.test.js
  • test/security-c4-manifest-traversal.test.js
  • test/security-method-wildcards.test.js
  • test/service-env.test.js
  • test/setup-sandbox-name.test.js
  • test/uninstall.test.js
  • tsconfig.cli.json
  • tsconfig.src.json
  • uninstall.sh
  • vitest.config.ts

📝 Walkthrough

Walkthrough

Large, multi-area update adding new network policy preset for Homebrew, extensive CI/workflow and action additions, many docs and scripts changes, numerous new/rewritten TypeScript modules and tests (including preflight, inference, nim, onboarding/session, runtime-recovery, validation), multiple bin shims to dist, registry and credentials hardening, k8s manifest, and many e2e/test script additions and revisions.

Changes

Cohort / File(s) Summary
Network policy presets
nemoclaw-blueprint/policies/presets/brew.yaml, nemoclaw-blueprint/policies/presets/discord.yaml, nemoclaw-blueprint/policies/presets/slack.yaml, nemoclaw-blueprint/policies/openclaw-sandbox.yaml
Added new brew preset and updated Discord/Slack/openclaw sandbox policy entries: host-specific REST/access rules, method/path restrictions, TLS/termination flags, and binary allowlists.
CI / GitHub Actions & Workflows
.github/actions/basic-checks/action.yaml, .github/actions/resolve-sandbox-base-image/action.yaml, .github/workflows/* (main.yaml, pr.yaml, sandbox-images-and-e2e.yaml, dco-check.yaml, docs-preview-deploy.yaml, e2e-brev.yaml, nightly-e2e.yaml, etc.)
Added composite actions, reusable workflows, new main/pr workflow structure, DCO enforcement workflow, workflow updates (timeouts, inputs, action pins), and sandbox-image/E2E orchestration changes.
CLI / bin shims & libs
bin/lib/*.js (many: inference-config.js, local-inference.js, nim.js, preflight.js, onboard-session.js, runtime-recovery.js, etc.)
Replaced multiple bin implementations with thin re-export shims to dist/ or added new shim modules; notable full rewrites moved logic into src/lib/*.ts and dist.
New/Updated TypeScript libraries
src/lib/{preflight,local-inference,inference-config,nim,validation,build-context,onboard-session,runtime-recovery,gateway-state,dashboard,url-utils,...}.ts
Many new modules added implementing preflight checks, local inference utilities, NIM management, inference-config, validation/classification, onboarding session management, runtime recovery, dashboard URL helpers, URL utils, build-context helpers, and gateway-state classification.
Registry / Credentials / Registry locking
bin/lib/credentials.js, bin/lib/registry.js
Credentials: lazy home resolution, normalization, stricter permissions, prompt improvements, execFileSync usage; Registry: cross-process lock for mutating ops, atomic saves, and exported lock helpers.
CLI behavior & nemoclaw runtime
bin/nemoclaw.js, bin/lib/version.js, bin/lib/platform.js
Added registry recovery/reconciliation, improved uninstall/logging/compat checks, added getVersion(), broadened shouldPatchCoredns signature.
Pre-commit / formatting / linting
.pre-commit-config.yaml, .prettierrc, .prettierignore, eslint.config.mjs, Makefile, package.json, nemoclaw/package.json
Pre-commit hooks reorganized and extended, Prettier config/ignore added, ESLint complexity rule added for bin scripts, Makefile format targets updated, package.json scripts/deps adjusted including engine bump to Node 22.16.0+.
Installer & scripts
install.sh, scripts/* (many: setup.sh removed, install-openshell.sh, nemoclaw-start.sh, brev-setup.sh, setup-dns-proxy.sh, setup-spark.sh, fix-coredns.sh, start-services.sh, etc.)
Major installer refactor: improved version resolution, PATH/profile fixes, nvm/Node checks, download verification, swap/setup, expanded/rewritten helper scripts, new DNS-proxy helper, removal of legacy setup script, and numerous reliability/security improvements.
Docker / k8s
Dockerfile, Dockerfile.base, k8s/nemoclaw-k8s.yaml, k8s/README.md
Dockerfiles: new build arg for device-auth toggle and state dir/symlink; Dockerfile.base adds writable memory path; new Kubernetes pod manifest and k8s docs for experimental deployment.
Docs & guides
AGENTS.md, CLAUDE.md, CONTRIBUTING.md, README.md, SECURITY.md, docs/**/*
Extensive documentation additions and rewrites: new agent guides, security best-practices, architecture diagrams, updated quickstart, troubleshooting, and many other doc updates.
Tests & e2e suites
src/**/* .test.ts, test/**/*, test/e2e/**/*, scripts/*
Large test expansion and updates: many new unit tests for new libs, updated CLI tests, new/rewritten E2E test scripts and harnesses (onboard resume/repair, credential sanitization, dns-proxy, spark, cloud-experimental), and coverage ratchet tooling.
Misc config & housekeeping
.dockerignore, .gitignore, .github/dco-bypass.txt, .github/PULL_REQUEST_TEMPLATE.md, ci/coverage-threshold-cli.json
Updated ignore lists, added DCO bypass list, PR template DCO sign-off, new CI coverage threshold file, and small metadata changes.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Suggested reviewers

  • cv

Poem

🐰 A brew-tiful preset hops to view,
I mapped the hosts and rules anew,
Binaries guarded, endpoints dressed,
Sandbox safe, the policy blessed —
Tap, fetch, and sip your package brew! 🍻

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/brew-policy

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemoclaw-blueprint/policies/presets/brew.yaml`:
- Around line 12-54: The brew preset currently configures
registry/package-manager hosts (hosts: formulae.brew.sh, github.com, ghcr.io,
pkg-containers.githubusercontent.com, objects.githubusercontent.com,
raw.githubusercontent.com) with protocol: rest, tls: terminate and REST
method/path rules; change these entries to use access: full (remove protocol:
rest, tls: terminate and the rules block) so CONNECT/TLS tunneling is allowed
for package manager and registry traffic, ensuring auth/downloads work
correctly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 801a2016-7512-4d4d-bc95-132c616fd565

📥 Commits

Reviewing files that changed from the base of the PR and between 98e1b98 and 1646090.

📒 Files selected for processing (1)
  • nemoclaw-blueprint/policies/presets/brew.yaml

Comment on lines +12 to +54
- host: formulae.brew.sh
port: 443
protocol: rest
enforcement: enforce
tls: terminate
rules:
- allow: { method: GET, path: "/**" }
- host: github.com
port: 443
protocol: rest
enforcement: enforce
tls: terminate
rules:
- allow: { method: GET, path: "/Homebrew/**" }
- allow: { method: POST, path: "/Homebrew/**" }
- host: ghcr.io
port: 443
protocol: rest
enforcement: enforce
tls: terminate
rules:
- allow: { method: GET, path: "/**" }
- host: pkg-containers.githubusercontent.com
port: 443
protocol: rest
enforcement: enforce
tls: terminate
rules:
- allow: { method: GET, path: "/**" }
- host: objects.githubusercontent.com
port: 443
protocol: rest
enforcement: enforce
tls: terminate
rules:
- allow: { method: GET, path: "/**" }
- host: raw.githubusercontent.com
port: 443
protocol: rest
enforcement: enforce
tls: terminate
rules:
- allow: { method: GET, path: "/**" }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Use access: full for brew endpoints instead of REST + TLS termination.

For package-manager/registry traffic, tls: terminate with REST path rules will break CONNECT tunneling and can cause auth/download failures. Switch these endpoints to access: full and remove REST method/path filtering.

Suggested fix
     endpoints:
       - host: formulae.brew.sh
         port: 443
-        protocol: rest
         enforcement: enforce
-        tls: terminate
-        rules:
-          - allow: { method: GET, path: "/**" }
+        access: full
       - host: github.com
         port: 443
-        protocol: rest
         enforcement: enforce
-        tls: terminate
-        rules:
-          - allow: { method: GET, path: "/Homebrew/**" }
-          - allow: { method: POST, path: "/Homebrew/**" }
+        access: full
       - host: ghcr.io
         port: 443
-        protocol: rest
         enforcement: enforce
-        tls: terminate
-        rules:
-          - allow: { method: GET, path: "/**" }
+        access: full
       - host: pkg-containers.githubusercontent.com
         port: 443
-        protocol: rest
         enforcement: enforce
-        tls: terminate
-        rules:
-          - allow: { method: GET, path: "/**" }
+        access: full
       - host: objects.githubusercontent.com
         port: 443
-        protocol: rest
         enforcement: enforce
-        tls: terminate
-        rules:
-          - allow: { method: GET, path: "/**" }
+        access: full
       - host: raw.githubusercontent.com
         port: 443
-        protocol: rest
         enforcement: enforce
-        tls: terminate
-        rules:
-          - allow: { method: GET, path: "/**" }
+        access: full
Based on learnings: In `nemoclaw-blueprint/policies/presets/`, package manager and registry presets that use CONNECT tunneling should use `access: full`; `tls: terminate` breaks authentication and can cause connection drops.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- host: formulae.brew.sh
port: 443
protocol: rest
enforcement: enforce
tls: terminate
rules:
- allow: { method: GET, path: "/**" }
- host: github.com
port: 443
protocol: rest
enforcement: enforce
tls: terminate
rules:
- allow: { method: GET, path: "/Homebrew/**" }
- allow: { method: POST, path: "/Homebrew/**" }
- host: ghcr.io
port: 443
protocol: rest
enforcement: enforce
tls: terminate
rules:
- allow: { method: GET, path: "/**" }
- host: pkg-containers.githubusercontent.com
port: 443
protocol: rest
enforcement: enforce
tls: terminate
rules:
- allow: { method: GET, path: "/**" }
- host: objects.githubusercontent.com
port: 443
protocol: rest
enforcement: enforce
tls: terminate
rules:
- allow: { method: GET, path: "/**" }
- host: raw.githubusercontent.com
port: 443
protocol: rest
enforcement: enforce
tls: terminate
rules:
- allow: { method: GET, path: "/**" }
- host: formulae.brew.sh
port: 443
enforcement: enforce
access: full
- host: github.com
port: 443
enforcement: enforce
access: full
- host: ghcr.io
port: 443
enforcement: enforce
access: full
- host: pkg-containers.githubusercontent.com
port: 443
enforcement: enforce
access: full
- host: objects.githubusercontent.com
port: 443
enforcement: enforce
access: full
- host: raw.githubusercontent.com
port: 443
enforcement: enforce
access: full
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemoclaw-blueprint/policies/presets/brew.yaml` around lines 12 - 54, The brew
preset currently configures registry/package-manager hosts (hosts:
formulae.brew.sh, github.com, ghcr.io, pkg-containers.githubusercontent.com,
objects.githubusercontent.com, raw.githubusercontent.com) with protocol: rest,
tls: terminate and REST method/path rules; change these entries to use access:
full (remove protocol: rest, tls: terminate and the rules block) so CONNECT/TLS
tunneling is allowed for package manager and registry traffic, ensuring
auth/downloads work correctly.

cv and others added 27 commits April 2, 2026 01:51
* chore: add cyclomatic complexity rule (ratchet from 95)

Add ESLint complexity rule to bin/ and scripts/ to prevent new
functions from accumulating excessive branching. Starting threshold
is 95 (current worst offender: setupNim in onboard.js). Ratchet
plan: 95 → 40 → 25 → 15.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: ratchet complexity to 20, suppress existing violations

Suppress 6 functions that exceed the threshold with eslint-disable
comments so we can start enforcing at 20 instead of 95:

- setupNim (95), setupPolicies (41), setupInference (22) in onboard.js
- deploy (22), main IIFE (27) in nemoclaw.js
- applyPreset (24) in policies.js

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: suppress complexity for 3 missed functions

preflight (23), getReconciledSandboxGatewayState (25), sandboxStatus (27)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: peteryuqin <peter.yuqin@gmail.com>
* chore: add tsconfig.cli.json, root execa, TS coverage ratchet

Foundation for the CLI TypeScript migration (PR 0 of the shell
consolidation plan). No runtime changes — config, tooling, and
dependency only.

- tsconfig.cli.json: strict TS type-checking for bin/ and scripts/
  (noEmit, module: preserve — tsx handles the runtime)
- scripts/check-coverage-ratchet.ts: pure TS replacement for the
  bash+python coverage ratchet script (same logic, same tolerance)
- execa ^9.6.1 added to root devDependencies (used by PR 1+)
- pr.yaml: coverage ratchet step now runs the TS version via tsx
- .pre-commit-config.yaml: SPDX headers cover scripts/*.ts,
  new tsc-check-cli pre-push hook
- CONTRIBUTING.md: document typecheck:cli task and CLI pre-push hook
- Delete scripts/check-coverage-ratchet.sh

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Apply suggestion from @brandonpelfrey

* chore: address PR feedback — use types_or, add tsx devDep

- Use `types_or: [ts, tsx]` instead of file glob for tsc-check-cli
  hook per @brandonpelfrey's suggestion.
- Add `tsx` to devDependencies so CI doesn't re-fetch it on every run
  per CodeRabbit's suggestion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(ci): ignore GitHub "Apply suggestion" commits in commitlint

* fix(ci): lint only PR title since repo is squash-merge only

Reverts the commitlint ignores rule from the previous commit and
instead removes the per-commit lint step entirely.

Individual commit messages are discarded at merge time — only the
squash-merged PR title lands in main and drives changelog generation.
Drop the per-commit lint, keep the PR title check, and remove the
now-unnecessary fetch-depth: 0.

* Revert "fix(ci): lint only PR title since repo is squash-merge only"

This reverts commit 1257a47.

* Revert "fix(ci): ignore GitHub "Apply suggestion" commits in commitlint"

This reverts commit c395657.

* docs: fix markdownlint MD032 in README (blank line before list)

* refactor: make coverage ratchet script idiomatic TypeScript

- Wrap in main() with process.exitCode instead of scattered process.exit()
- Replace mutable flags with .map()/.some() over typed MetricResult[]
- Separate pure logic (checkMetrics) from formatting (formatReport)
- Throw with { cause } chaining instead of exit-in-helpers
- Derive CoverageThresholds from METRICS tuple (single source of truth)
- Exhaustive switch on CheckStatus discriminated union

* refactor: remove duplication in coverage ratchet script

- Drop STATUS_LABELS map; inline labels in exhaustive switch
- Extract common 'metric coverage is N%' preamble in formatResult
- Simplify ratchetedThresholds: use results directly (already in
  METRICS order) instead of re-scanning with .find() per metric
- Compute 'failed' once in main, pass into formatReport to avoid
  duplicate .some() scan

* refactor: simplify coverage ratchet with FP patterns

- Extract classify() as a named pure function (replaces nested ternary)
- loadJSON takes repo-relative paths, eliminating THRESHOLD_PATH and
  SUMMARY_PATH constants (DRY the join-with-REPO_ROOT pattern)
- Drop CoverageMetric/CoverageSummary interfaces (only pct is read);
  use structural type at the call site instead
- Inline ratchetedThresholds (one-liner, used once)
- formatReport derives fail/improved from results instead of taking
  a pre-computed boolean (let functions derive from data, don't
  thread derived state)
- sections.join("\n\n") replaces manual empty-string pushing
- Shorter type names (Thresholds, Status, Result) — no ambiguity
  in a single-purpose script

* refactor: strip coverage ratchet to failure-only output

prek hides output from commands that exit 0, so ok/improved
reporting was dead code. Remove Status, Result, classify,
formatResult, formatReport, and the ratcheted-thresholds
suggestion block. The script now just filters for regressions
and prints actionable errors on failure.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Brandon Pelfrey <bpelfrey@nvidia.com>
…ets (#438)

* fix: use CONNECT tunnel for WebSocket endpoints in Discord/Slack presets

The egress proxy's HTTP idle timeout (~2 min) kills long-lived WebSocket
connections when endpoints are configured with protocol:rest + tls:terminate.
Switch WebSocket endpoints to access:full (CONNECT tunnel) which bypasses
HTTP-level timeouts entirely.

Discord:
- gateway.discord.gg → access:full (WebSocket gateway)
- Add PUT/PATCH/DELETE methods for discord.com (message editing, reactions)
- Add media.discordapp.net for attachment access

Slack:
- Add wss-primary.slack.com and wss-backup.slack.com → access:full
  (Socket Mode WebSocket endpoints)

Partially addresses #409 — the policy-level fix enables WebSocket
connections to survive. The hardcoded 2-min timeout in openshell-sandbox
still affects any protocol:rest endpoints with long-lived connections.

Related: #361 (WhatsApp Web, same root cause)

* fix: correct comment wording for media endpoint and YAML formatting
* fix: remove unused RECOMMENDED_NODE_MAJOR from scripts/install.sh

Shellcheck flagged it as unused after the min/recommended merge.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: enforce full semver >=22.16.0 in installer scripts

The runtime checks only compared the major Node.js version, allowing
22.0–22.15 to pass despite package.json requiring >=22.16.0. Use the
version_gte() helper for full semver comparison in both installers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: harden version_gte and align fallback message

Guard version_gte() against prerelease suffixes (e.g. "22.16.0-rc.1")
that would crash bash arithmetic. Also update the manual-install
fallback message to reference MIN_NODE_VERSION instead of hardcoded "22".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: update test stubs for Node.js 22.16 minimum and add Node 20 rejection test

- Bump node stub in 'succeeds with acceptable Node.js' from v20.0.0 to v22.16.0
- Bump node stub in buildCurlPipeEnv from v22.14.0 to v22.16.0
- Add new test asserting Node.js 20 is rejected by ensure_supported_runtime

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: harden installer and onboard resiliency

* fix: address installer and debug review follow-ups

* fix: harden onboard resume across later setup steps

* test: simplify payload extraction in onboard tests

* test: keep onboard payload extraction target-compatible

* chore: align onboard session lint with complexity rule

* fix: harden onboard session safety and lock handling

* fix: tighten onboard session redaction and metadata handling
… blueprint digest (#769)

Reconciles #156 and #743 into a single comprehensive solution:

- Filter auth-profiles.json at copy time via cpSync filter (from #743)
- Recursive stripCredentials() with pattern-based field detection for
  deep config sanitization (from #156: CREDENTIAL_FIELDS set +
  CREDENTIAL_FIELD_PATTERN regex)
- Remove gateway config section (contains auth tokens) from sandbox
  openclaw.json
- Blueprint digest verification (SHA-256): recorded at snapshot time,
  validated on restore, empty/missing digest is a hard failure
- computeFileDigest() throws when blueprint file is missing instead of
  silently returning null
- Sanitize both snapshot-level and sandbox-bundle openclaw.json copies
- Backward compatible: old snapshots without blueprintDigest skip
  validation
- Bump SNAPSHOT_VERSION 2 → 3

Supersedes #156 and #743.
…ross reconnects (#1025)

* fix(sandbox): export proxy env vars with full NO_PROXY and persist across reconnects

OpenShell injects NO_PROXY=127.0.0.1,localhost,::1 into the sandbox, missing
inference.local and the gateway IP (10.200.0.1). This causes LLM inference
requests to route through the egress proxy instead of going direct, and the
proxy gateway IP itself gets proxied.

Add proxy configuration block to nemoclaw-start.sh that:
- Exports HTTP_PROXY, HTTPS_PROXY, and NO_PROXY with inference.local and
  the gateway IP included
- Persists via /etc/profile.d/nemoclaw-proxy.sh (root) or ~/.profile
  (non-root fallback) so values survive OpenShell reconnect injection
- Supports NEMOCLAW_PROXY_HOST / NEMOCLAW_PROXY_PORT overrides

The non-root fallback ensures the fix works in environments like Brev where
containers run without root privileges.

Tested on DGX Spark (ARM64) and Brev VM (x86_64). Verified NO_PROXY contains
inference.local and 10.200.0.1 inside the live sandbox after connect.

Ref: #626, #704
Ref: #704 (comment)

* fix(sandbox): write proxy config to ~/.bashrc for interactive reconnect sessions

OpenShell's `sandbox connect` spawns `/bin/bash -i` (interactive, non-login),
which sources ~/.bashrc — not ~/.profile or /etc/profile.d/*.  The previous
approach wrote to ~/.profile and /etc/profile.d/, neither of which is sourced
by `bash -i`, so the narrow OpenShell-injected NO_PROXY persisted in live
interactive sessions.

Changes:
- Write proxy snippet to ~/.bashrc (primary) and ~/.profile (login fallback)
- Export both uppercase and lowercase proxy variants (NO_PROXY + no_proxy,
  HTTP_PROXY + http_proxy, etc.) — Node.js undici prefers lowercase no_proxy
  over uppercase NO_PROXY when both are set
- Add idempotency guard to prevent duplicate blocks on container restart
- Update tests: verify .bashrc writing, idempotency, bash -i override
  behavior, and lowercase variant correctness

Tested on DGX Spark (ARM64) and Brev VM (x86_64) with full destroy +
re-onboard + live `env | grep proxy` verification inside the sandbox shell
via `openshell sandbox connect`.

Ref: #626

* fix(sandbox): replace stale proxy values on restart with begin/end markers

Use begin/end markers in .bashrc/.profile proxy snippet so
_write_proxy_snippet replaces the block when PROXY_HOST/PORT change
instead of silently keeping stale values. Adds test coverage for the
replacement path.

Addresses CodeRabbit review feedback on idempotency gap.

* fix(sandbox): resolve sandbox user home dynamically when running as root

When the entrypoint runs as root, $HOME is /root — the proxy snippet
was written to /root/.bashrc instead of the sandbox user's home.
Use getent passwd to look up the sandbox user's home when running as
UID 0; fall back to /sandbox if the user entry is missing.

Addresses CodeRabbit review feedback on _SANDBOX_HOME resolution.

---------

Co-authored-by: Carlos Villela <cvillela@nvidia.com>
#101)

* fix(policies): allow preset application for versionless policies (Fixes #35)

Fixes #35

Signed-off-by: Deepak Jain <deepujain@gmail.com>

* fix: remove stale complexity suppression in policies

---------

Signed-off-by: Deepak Jain <deepujain@gmail.com>
Co-authored-by: Kevin Jones <kejones@nvidia.com>
* fix: restore routed inference and connect UX

* fix: simplify detected local inference hint

* fix: remove stale local inference hint

* test: relax connect forward assertion
… (#1062)

* fix(sandbox): add DNS forwarder so web_fetch resolves hostnames (fixes #626)

The sandbox runs in an isolated network namespace (10.200.0.0/24) where
OpenShell's iptables rules reject all non-proxy traffic including UDP.
This causes getaddrinfo EAI_AGAIN for every outbound DNS lookup,
breaking web_fetch, web_search, and all Node.js HTTP tools.

Fix (three steps in new scripts/setup-dns-proxy.sh):
1. Run a Python UDP DNS forwarder on the pod-side veth gateway
   (10.200.0.1:53), forwarding to the real CoreDNS pod IP
2. Add an iptables rule in the sandbox namespace allowing UDP to
   the gateway on port 53 (the only non-proxy firewall exception)
3. Update the sandbox's /etc/resolv.conf to point to 10.200.0.1

Also broadens the CoreDNS patch (fix-coredns.sh / shouldPatchCoredns)
to run on all Docker-based runtimes, not just Colima — k3s-inside-Docker
has broken DNS forwarding on Linux hosts with systemd-resolved too.

Inspired by PR #732's DNS forwarder approach. Key differences:
- Uses kubectl exec instead of nsenter (fixes PR #732's launch bug)
- Handles the sandbox iptables constraint (all UDP blocked)
- Resolves systemd-resolved upstreams via resolvectl before falling
  back to 8.8.8.8
- Uses grep -F for fixed-string sandbox name matching

Tested on DGX Spark (ARM64): fresh destroy + onboard, getent hosts
and node dns.lookup both resolve from inside the sandbox.

* fix(sandbox): add runtime DNS verification after setup

setup-dns-proxy.sh now verifies all three DNS bridge layers from inside
the sandbox namespace post-deployment: resolv.conf target, iptables UDP
rule, and actual getent hosts resolution. Reports [PASS]/[FAIL] per
check so failures are immediately visible in onboard logs.
…1070)

* fix(install): add trap handler to spin() for clean Ctrl+C teardown

spin() runs commands in the background with a spinner animation but
had no signal handler. Pressing Ctrl+C during installation would
kill the parent shell while leaving the background process running
and the temp log file on disk.

Add INT/TERM trap inside spin() that kills the background process
and removes the temp file, then restore default signal handling
after the background process exits normally.

Closes #1020

* fix(install): add global EXIT trap for robust cleanup on all exit paths

The function-scoped INT/TERM trap in spin() handles Ctrl+C well, but
set -e triggering between trap clearance and explicit rm -f could
leak temp files or orphan background processes.

Add global _cleanup_pids and _cleanup_files arrays with an EXIT trap
that catches all exit paths. spin() registers its PID and temp file
on entry and deregisters on normal completion. This defense-in-depth
ensures cleanup even on unexpected exits.

Addresses CodeRabbit review suggestion.

* fix(install): defer cleanup array deregistration until after rm/kill

Move _cleanup_pids and _cleanup_files deregistration to after the
actual cleanup actions (rm -f, pid termination) so the global EXIT
trap still covers these resources if a signal arrives mid-cleanup.

Closes CodeRabbit review comment on #1070.

Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>

---------

Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
Co-authored-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
…nimum (#1079)

* fix(install): upgrade Node.js via nvm when system version is below minimum

install_nodejs() returned early when any Node.js was found on PATH,
regardless of version. On systems with Node 20 pre-installed (e.g.
GitHub Actions ubuntu-24.04 runners), the installer skipped the nvm
upgrade path and then ensure_supported_runtime() rejected the old
version.

Additionally, ensure_nvm_loaded() skipped sourcing nvm.sh when any
node was on PATH, preventing the newly-installed Node 22 from being
activated in the parent shell.

Fixes #1078

* fix(test): update preflight test for nvm upgrade behavior

The installer now attempts to upgrade Node.js via nvm when the system
version is below minimum, instead of failing immediately. Update the
test to verify the new warn-and-upgrade path, with a fake curl stub
to keep the test fast.

* fix(install): also check npm version before skipping nvm upgrade

The early return in install_nodejs() only checked the Node.js version,
but ensure_supported_runtime() also requires npm >= 10. Gate the early
return on both Node.js and npm meeting the minimum requirements.

Addresses CodeRabbit review feedback on #1079.

* fix(onboard): increase endpoint probe timeout for large model inference

The onboard endpoint validation sends a full inference request to verify
the provider is reachable. The 20s max-time was too tight for large
models like nemotron-3-super-120b-a12b on NVIDIA Endpoints, causing
the probe to time out and onboard to fail in non-interactive mode.

Increase connect-timeout from 5s to 10s and max-time from 20s to 60s.
This only runs once during onboard, so the longer timeout is acceptable.

* fix(install): set nvm default after installing Node.js 22

When upgrading from an existing nvm with a different default version,
nvm use 22 only affects the current shell. Set the nvm default alias
so that subsequent shells (e.g. the e2e test harness) that source
nvm.sh also get Node 22.

* fix(install): always create nemoclaw shim and fix nvm alias default

Two issues preventing nemoclaw from being found after install.sh exits:

1. verify_nemoclaw() skipped shim creation when nemoclaw was found on
   nvm's PATH, but that PATH only exists inside install.sh's subshell.
   Always call ensure_nemoclaw_shim() to create ~/.local/bin/nemoclaw.

2. nvm alias default had an invalid --silent flag that silently failed,
   so the nvm default wasn't set for subsequent shells.
* chore(docs): add debug and uninstall to commands reference

check-docs.sh found 15 commands in --help but only 13 headings in
docs/reference/commands.md. Add the missing nemoclaw debug and
nemoclaw uninstall entries.

* chore(docs): add missing --sandbox flag to debug entry
* fix(onboard): retry gateway start with exponential backoff

On some hosts (Horde VMs, first-run environments), the embedded k3s
inside the OpenShell gateway needs more time to initialize than the
gateway's internal health-check window allows. The first attempt fails
with misleading orphaned-cgroup cleanup messages, but a second attempt
typically succeeds because container images are cached and cgroup state
is cleaner.

Replace the single-shot gateway start + separate health check with a
retry loop (up to 3 attempts, 10s/30s backoff). Each failed attempt
gets a clean destroyGateway() before retry. On final failure, the
error message now includes openshell doctor troubleshooting commands.

Upstream tracking: NVIDIA/OpenShell#433

* fix(onboard): skip retry for recovery path, preserve health-check timing

The recovery path (nemoclaw status) should not retry — only the onboard
wizard benefits from retries. Also restore the original behavior of not
sleeping after the final health-check iteration, matching the timing
the cli.test.js tests depend on.

* refactor(onboard): use p-retry for gateway start retry logic

Replace hand-rolled retry loop with p-retry library for gateway start
backoff. Cleaner, more conventional, and delegates retry/backoff
concerns to a well-tested library.

* fix(onboard): correct openshell doctor flags in troubleshooting output

* fix(onboard): use --name flag for doctor logs (targets container directly)

* fix(onboard): skip gateway destroy on recovery path failure

The recovery path (exitOnFailure=false) should preserve gateway state
for diagnostics. Gate destroyGateway() behind exitOnFailure so only
the onboard path (which will retry) tears down on failure.

* fix(onboard): run health checks regardless of gateway start exit code

The slow-start case (where gateway start exits non-zero because
OpenShell's internal health check timed out) is exactly the scenario
this retry logic targets. Always run the grace-period health-check
loop so a gateway that's still coming up gets sampled before teardown.

* fix(onboard): preserve gateway state on final failure for doctor diagnostics

Move destroyGateway() from the pRetry callback into onFailedAttempt
so it only runs between retries, not on the terminal failure. The
final failed gateway state is now preserved so openshell doctor logs
can inspect container logs for troubleshooting.
The pre-push hooks only ran vitest for the nemoclaw/ plugin project.
Root-level unit tests (test/*.test.js) were only caught by CI's
test-unit job, not locally. Add a vitest-root hook that runs
--project cli on pre-push so test failures are caught before push.
* feat: add Kubernetes testing infrastructure

Add k8s-testing/ directory with scripts and manifests for testing NemoClaw
on Kubernetes with Dynamo vLLM inference.

Includes:
- test-installer.sh: Public installer test (requires unattended install support)
- setup.sh: Manual setup from source for development
- Pod manifests for Docker-in-Docker execution

Architecture: OpenShell runs k3s inside Docker, so we use DinD pods
to provide the Docker daemon on Kubernetes.

Signed-off-by: rwipfelnv

* fix: add socat proxy for K8s DNS isolation workaround

OpenShell's nested k3s cluster cannot resolve Kubernetes DNS names,
so inference requests fail with 502 Bad Gateway. This adds:

- socat TCP proxy setup in setup.sh to forward localhost:8000 to the
  K8s vLLM service endpoint
- Provider configuration using host.openshell.internal:8000 which
  resolves to the workspace container from inside k3s
- Documentation explaining the network architecture and workaround
- Updated env var names to match PR #318 (NEMOCLAW_NON_INTERACTIVE)
- cgroup v2 compatibility fix for Docker daemon
- Removed memory limits that caused OOM

Tested: Inference requests from sandboxes now route correctly through
the socat proxy to the Dynamo vLLM endpoint.

Depends on: #318 (non-interactive mode), #365 (Dynamo provider)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: NemoKlaw - NemoClaw on Kubernetes with Dynamo support

Complete K8s deployment solution for NemoClaw:
- nemoklaw.yaml: Pod manifest with DinD, init containers, hostPath storage
- install.sh: Interactive installer with preflight checks
- Rename k8s-testing -> k8s, move old files to dev/

Key learnings:
- hostPath storage (/mnt/k8s-disks) avoids ephemeral storage eviction
- Init containers for docker config, openshell CLI, NemoClaw build
- Workspace container installs apt packages at runtime (can't share via volumes)
- socat proxy bridges K8s DNS to nested k3s (host.openshell.internal)

Tested successfully with Dynamo vLLM backend on EKS.

Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>

* fix: rename NemoKlaw to NemoClaw and document known limitations

Address PR feedback:
- Rename NemoKlaw -> NemoClaw (avoid confusing naming)
- Rename nemoklaw.yaml -> nemoclaw-k8s.yaml
- Fix hardcoded endpoint to use generic example
- Remove log file from repo
- Document known limitations (HTTPS proxy issue)
- Update README with accurate status of what works/doesn't work

Signed-off-by: rwipfelnv
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: update DYNAMO_HOST to vllm-agg-frontend

The aggregated frontend service is the correct endpoint for
Dynamo vLLM inference.

Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>

* docs: add Using NemoClaw section with CLI commands

- Add workspace shell access command
- Add sandbox status/logs/list commands
- Add chat completion test example
- Rename section from "What Can You Do?" to "Using NemoClaw"

Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>

* refactor(k8s): simplify deployment to use official installer

- Use official NemoClaw installer (`curl | bash`) instead of git clone/build
- Switch to `custom` provider from PR #648 (supersedes dynamo-specific provider)
- Remove k8s/dev/ directory (no longer needed for testing)
- Use emptyDir volumes for portability across clusters
- Add /etc/hosts workaround for endpoint validation during onboarding
- Update README with verification steps for local inference

Tested end-to-end with Dynamo vLLM backend.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(k8s): resolve lint errors in yaml and markdown

- Remove multi-document YAML (move namespace creation to README)
- Add language specifier to fenced code block (```text)
- Add blank lines before lists per markdownlint rules

Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

* docs(k8s): add experimental warning and clarify requirements

- Add explicit experimental warning at top of README
- Clarify this is for trying NemoClaw on k8s, not production
- Document privileged pod and DinD requirements upfront
- Add resource requirements to prerequisites

Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

---------

Signed-off-by: rwipfelnv
Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Co-authored-by: KJ <kejones@nvidia.com>
…json (#1033)

The save() function now writes to a temp file and renames it into place,
preventing half-written JSON from corrupting the registry.  All mutating
operations (registerSandbox, updateSandbox, removeSandbox, setDefault) are
wrapped in a mkdir-based advisory lock so concurrent nemoclaw processes
cannot produce lost updates.

Fixes #1011

Co-authored-by: KJ <kejones@nvidia.com>
…ders (#980)

* fix(onboard): force chat completions API for vLLM and NIM-local providers

vLLM's /v1/responses endpoint does not run the --tool-call-parser,
so tool calls arrive as raw text in the response content instead of
the structured tool_calls array. The probe picks openai-responses
because vLLM accepts the request, but parsing only works on
/v1/chat/completions.

Override preferredInferenceApi to openai-completions after validation
for both vllm and nim-local provider paths.

Fixes #976

* fix: log when overriding probe result to chat completions

Surface the API override during onboard so users see why responses
API was not selected.

* test: add regression test for vLLM chat completions override

Verifies that setupNim() forces preferredInferenceApi to openai-completions
for the vLLM provider path even when the probe detects openai-responses first.
This locks down the fix for #976 so the tool-call-parser override cannot
silently regress.

* test: add NIM-local chat completions override regression test

Companion test for the vLLM case: verifies that setupNim() forces
openai-completions for the NIM-local path too, since NIM uses vLLM
internally and has the same tool-call-parser limitation.

---------

Co-authored-by: Benedikt Schackenberg <benedikt@users.noreply.github.com>
Co-authored-by: Benedikt Schackenberg <69834303+BenediktSchackenberg@users.noreply.github.com>
#1041)

nemoclaw list and the global status display only read model/provider from the
local registry, which stores null when inference is configured after onboarding
(e.g. via openshell inference set). This makes both commands show 'unknown'
even when the gateway has a configured inference route.

Query openshell inference get once per invocation and prefer the live values
over stale registry entries, matching the existing behavior of the per-sandbox
status command.

Fixes #986

Co-authored-by: KJ <kejones@nvidia.com>
#1090)

* fix: improve setup.sh and brev-setup.sh for CI and non-interactive use

setup.sh:
- Add timestamped log output ([HH:MM:SS] prefix) for CI visibility
- Support NEMOCLAW_SANDBOX_NAME env var for custom sandbox names
- Add --no-tty flag to openshell sandbox create for non-interactive mode
- Pre-build base image locally when GHCR image is unavailable (forks)
- Add background progress reporter during sandbox build (heartbeat
  every 30s with Docker step info, filters secrets from output)
- Show build elapsed time on completion

brev-setup.sh:
- Add timestamped log output matching setup.sh format
- Support SKIP_VLLM=1 to skip vLLM install on CPU-only instances

* fix(test): update sandbox name test for NEMOCLAW_SANDBOX_NAME env var fallback
- Add mermaid diagram showing NemoClaw/OpenShell relationship
- Explain what each layer provides (onboarding, blueprint, state, bridges)
- Clarify that OpenShell is the runtime, NemoClaw is the opinionated stack
…1103)

Add project-level configuration files for AI coding agents:

- CLAUDE.md: Project overview, architecture map, development commands,
  code style conventions, test structure, and PR requirements.
- AGENTS.md: Agent-specific guidance including quick reference commands,
  key architecture decisions, common patterns for making changes,
  and gotchas for working with the dual-language stack.

These files help AI agents (Claude, Cursor, Codex, etc.) understand
the project structure, conventions, and development workflow without
needing to discover them from scratch each session.
…1104)

Ollama's Go runtime creates a dual-stack (AF_INET6) socket when given
OLLAMA_HOST=0.0.0.0, but WSL2's port relay only forwards IPv4 sockets
to the Windows host. Docker's host-gateway cannot reach the dual-stack
socket, causing inference requests to fail with connection refused.

On WSL2, let Ollama bind to the default 127.0.0.1 instead, which
creates an IPv4-only socket that the WSL2 relay forwards correctly.
…al sanitization (#1092)

* test(security): add E2E tests for command injection and credential sanitization

Adds two new Brev E2E test suites targeting the vulnerabilities fixed by
PR #119 (Telegram bridge command injection) and PR #156 (credential
exposure in migration snapshots + blueprint digest bypass).

Test suites:
- test-telegram-injection.sh: 8 tests covering command substitution,
  backtick injection, quote-breakout, parameter expansion, process
  table leaks, and SANDBOX_NAME validation
- test-credential-sanitization.sh: 13 tests covering auth-profiles.json
  deletion, credential field stripping, non-credential preservation,
  symlink safety, blueprint digest verification, and pattern-based
  field detection

These tests are expected to FAIL on main (unfixed code) and PASS
once PR #119 and #156 are merged.

Refs: #118, #119, #156, #813

* ci: temporarily disable repo guard for fork testing

* ci: bump bootstrap timeout, skip vLLM on CPU E2E runs

- Add SKIP_VLLM=1 support to brev-setup.sh
- Use SKIP_VLLM=1 in brev-e2e.test.js bootstrap
- Bump beforeAll timeout to 30 min for CPU instances
- Bump workflow timeout to 60 min for 3 test suites

* ci: bump bootstrap timeout to 40 min for sandbox image build

* ci: bump Brev instance to 8x32 for faster Docker builds

* ci: add real-time progress streaming for E2E bootstrap and tests

- Stream SSH output to CI log during bootstrap (no more silence)
- Add timestamps to brev-setup.sh and setup.sh info/warn/fail messages
- Add background progress reporter during sandbox Docker build (heartbeat every 30s showing elapsed time, current Docker step, and last log line)
- Stream test script output to CI log via tee + capture for assertions
- Filter potential secrets from progress heartbeat output

* ci: use NemoClaw launchable for E2E bootstrap

Replace bare 'brev create' + brev-setup.sh with 'brev start' using the
OpenShell-Community launch-nemoclaw.sh setup script. This installs Docker,
OpenShell CLI, and Node.js via the launchable's proven path, then runs
'nemoclaw onboard --non-interactive' to build the sandbox (testing whether
this path is faster than our manual setup.sh).

Changes:
- Default CPU back to 4x16 (8x32 didn't help — bottleneck was I/O)
- Launchable path: brev start + setup-script URL, poll for completion,
  rsync PR branch, npm ci, nemoclaw onboard
- Legacy path preserved (USE_LAUNCHABLE=0)
- Timestamped logging throughout for timing comparison
- New use_launchable workflow input (default: true)

* fix: prevent openshell sandbox create from hanging in non-interactive mode

openshell sandbox create without a command defaults to opening an interactive
shell inside the sandbox. In CI (non-interactive SSH), this hangs forever —
the sandbox goes Ready but the command never returns. The [?2004h] terminal
escape codes in CI logs were bash enabling bracketed paste mode, waiting for
input.

Add --no-tty -- true so the command exits immediately after the sandbox is
created and Ready.

* fix: source nvm in non-interactive SSH for launchable path

The launchable setup script installs Node.js via nvm, which sets up PATH
in ~/.nvm/nvm.sh. Non-interactive SSH doesn't source .bashrc, so npm/node
commands fail with 'command not found'. Source nvm.sh before running npm
in the launchable path and runRemoteTest.

* fix: setup.sh respects NEMOCLAW_SANDBOX_NAME env var

setup.sh defaulted to 'nemoclaw' ignoring the NEMOCLAW_SANDBOX_NAME env
var set by the CI test harness (e2e-test). Now uses $1 > $NEMOCLAW_SANDBOX_NAME > nemoclaw.

* ci: bump full E2E test timeout to 15 min for install + sandbox build

* ci: don't run full E2E alongside security tests (it destroys the sandbox)

The full E2E test runs install.sh --non-interactive which destroys and
rebuilds the sandbox. When TEST_SUITE=all, this kills the sandbox that
beforeAll created, causing credential-sanitization and telegram-injection
to fail with 'sandbox not running'. Only run full E2E when TEST_SUITE=full.

* ci: pre-build base image locally when GHCR image unavailable

On forks or before the first base-image workflow run, the GHCR base image
(ghcr.io/nvidia/nemoclaw/sandbox-base:latest) doesn't exist. This causes
the Dockerfile's FROM to fail. Now setup.sh checks for the base image
and builds Dockerfile.base locally if needed.

On subsequent builds, Docker layer cache makes this near-instant.
Once the GHCR base image is available, this becomes a no-op (docker pull
succeeds and the local build is skipped).

* ci: install nemoclaw CLI after bootstrap in non-launchable path

brev-setup.sh creates the sandbox but doesn't install the host-side
nemoclaw CLI that test scripts need for 'nemoclaw <name> status'.
Add npm install + build + link step after bootstrap.

* fix: use npm_config_prefix for nemoclaw CLI install so it lands on PATH

* fix: npm link from repo root where bin.nemoclaw is defined

* fix(ci): register sandbox in nemoclaw registry after setup.sh bootstrap

setup.sh creates the sandbox via openshell directly but never writes
~/.nemoclaw/sandboxes.json. The security test scripts check
`nemoclaw <name> status` which reads the registry, causing all E2E
runs to fail with 'Sandbox e2e-test not running'.

Write the registry entry after nemoclaw CLI install so the test
scripts can find the sandbox.

* style: shfmt formatting fix in setup.sh

* fix(test): exclude policy presets from C7 secret pattern scan

C7 greps for 'npm_' inside the sandbox and false-positives on
nemoclaw-blueprint/policies/presets/npm.yaml which contains rule
names like 'npm_yarn', not actual credentials. Filter out /policies/
paths from all three pattern checks.

* docs(ci): add test suite descriptions to e2e-brev workflow header

Document what each test_suite option runs so maintainers can make an
informed choice from the Actions UI without reading the test scripts.

* ci: re-enable repo guard for e2e-brev workflow

Re-enable the github.repository check so the workflow only runs on
NVIDIA/NemoClaw, not on forks.

* fix(test): update setup-sandbox-name test for NEMOCLAW_SANDBOX_NAME env var

setup.sh now uses ${1:-${NEMOCLAW_SANDBOX_NAME:-nemoclaw}} instead of
${1:-nemoclaw}. Update the test to match and add coverage for the env
var fallback path.

* fix(lint): add shellcheck directives for injection test payloads and fix stdio type

* fix(lint): suppress SC2034 for status_output in credential sanitization test

* fix: address CodeRabbit review — timeout, pipefail, fail-closed probes, shell injection in test

- Bump e2e-brev workflow timeout-minutes from 60 to 90
- Add fail-fast when launchable setup exceeds 40-min wait
- Add pipefail to remote pipeline commands in runRemoteTest and npm ci
- Fix backtick shell injection in validateName test loop (use process.argv)
- Make sandbox_exec fail closed with __PROBE_FAILED__ sentinel
- Add probe failure checks in C6/C7 sandbox assertions

---------

Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Replace manual fd lifecycle (`openSync`/`writeFileSync`/`closeSync`) in `acquireOnboardLock` with a single `writeFileSync` call. If the write threw (e.g. `ENOSPC`), the catch block re-threw because the error code is not `EEXIST`, bypassing `closeSync` and leaking the descriptor. Repeated failures accumulate leaked fds until the process crashes with `EMFILE`.
…l job (#1085)

* ci(nightly-e2e): remove environment NVIDIA_API_KEY from cloud-experimental job

Avoid spurious deployment records; secrets still come from secrets.NVIDIA_API_KEY.

Made-with: Cursor

* fix(ci): match cloud-experimental nightly artifact to default install log path

Upload /tmp/nemoclaw-e2e-cloud-experimental-install.log on failure; script uses
the same fixed default without UID.

* ci(test): split nightly cloud-experimental, shared cleanup, integration + Spark E2E

* test(cli): extend timeout for install fallback and uninstall --yes

* ci(test): skip huggingface policy add and check

---------

Co-authored-by: Carlos Villela <cvillela@nvidia.com>
kjw3 and others added 28 commits April 2, 2026 01:52
## Summary
- stop requiring `NVIDIA_API_KEY` for local-only `nemoclaw start` and
only gate the Telegram bridge when that bridge actually needs the key
- clean up the dashboard forward, `nemoclaw` gateway, and
`openshell-cluster-nemoclaw` Docker volumes when the last sandbox is
destroyed
- broaden unified-memory NVIDIA GPU detection beyond `GB10` while
keeping `spark: true` specific to GB10
- harden policy merge/retry behavior so truncated or error-like
current-policy reads rebuild from a clean `version: 1` scaffold instead
of producing malformed YAML

## Issue Mapping
Fixes #1191
Fixes #1160
Fixes #1182
Fixes #1162
Related #991

## Notes
- `#1188` was investigated but is not included in this PR.
- The current evidence still points to a deeper runtime / proxy
reachability problem on macOS + Colima rather than a bounded
NemoClaw-only fix.
- Keeping it out of this branch avoids speculative networking changes
without strong reproduction and cross-platform coverage.

## Validation
```bash
npx vitest run
npx eslint bin/nemoclaw.js bin/lib/nim.js bin/lib/policies.js test/cli.test.js test/nim.test.js test/policies.test.js test/service-env.test.js
npx tsc -p jsconfig.json --noEmit
```

## References Reviewed
- #1106
- #308
- #95
- #770

Signed-off-by: Kevin Jones <kejones@nvidia.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
  - Core services can start without an NVIDIA API key.
- Enhanced unified‑memory GPU detection with more accurate capability
reporting.

* **Bug Fixes**
- Gateway and forwarded‑port cleanup only runs when the last sandbox is
removed and no live sandboxes remain.
- Telegram bridge now starts only when both required tokens are present;
clearer startup warnings.
- Policy parsing/merge more robust for metadata‑only or malformed
inputs; consistent version header formatting.

* **Tests**
- Added tests covering GPU detection, policy parsing/merge, CLI
sandbox/gateway flows, and service startup.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

When a user enters an invalid sandbox name (e.g. starting with a digit
like `7racii`), `promptValidatedSandboxName()` now shows the error and
**re-prompts** instead of calling `process.exit(1)`.

This prevents users from getting stuck in a loop where rerunning the
onboard script reuses the invalid name from the session state, requiring
manual deletion of `~/.nemoclaw` to recover.

## Changes

- **`bin/lib/onboard.js`**: Replace exit-on-invalid with a `while(true)`
retry loop in interactive mode. Non-interactive mode (CI/CD) still exits
with status 1.
- **`test/onboard.test.js`**: Source-level test verifying the retry loop
exists and non-interactive fallback is preserved.

## Before

```
Sandbox name: 7racii
Invalid sandbox name: '7racii'
Names must be lowercase...
# process.exit(1) — rerun uses same name from session, stuck
```

## After

```
Sandbox name: 7racii
Invalid sandbox name: '7racii'
Names must be lowercase...
Please try again.

Sandbox name: my-racii
✓ Valid name accepted
```

Closes #1120

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Sandbox name validation now allows interactive users to retry on
invalid input with a clear "Please try again" prompt; non-interactive
runs still terminate on invalid names.

* **Tests**
* Added and updated tests to verify interactive retry vs non-interactive
exit and to strengthen temporary-file handling assertions (more flexible
script naming and safer cleanup).
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Benedikt Schackenberg <benedikt@schackenberg.com>

---------

Signed-off-by: Benedikt Schackenberg <benedikt@schackenberg.com>
Co-authored-by: Carlos Villela <cv@lixo.org>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
<!-- markdownlint-disable MD041 -->
## Summary
Preview: https://nvidia.github.io/NemoClaw/pr-preview/pr-1209/

## Related Issue
<!-- Link to the issue: Fixes #NNN or Closes #NNN. Remove this section
if none. -->

## Changes
<!-- Bullet list of key changes. -->

## Type of Change
<!-- Check the one that applies. -->
- [ ] Code change for a new feature, bug fix, or refactor.
- [ ] Code change with doc updates.
- [x] Doc only. Prose changes without code sample modifications.
- [ ] Doc only. Includes code sample changes.

## Testing
<!-- What testing was done? -->
- [ ] `npx prek run --all-files` passes (or equivalently `make check`).
- [ ] `npm test` passes.
- [x] `make docs` builds without warnings. (for doc-only changes)

## Checklist

### General

- [ ] I have read and followed the [contributing
guide](https://github.com/NVIDIA/NemoClaw/blob/main/CONTRIBUTING.md).
- [ ] I have read and followed the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md).
(for doc-only changes)

### Code Changes
<!-- Skip if this is a doc-only PR. -->
- [ ] Formatters applied — `npx prek run --all-files` auto-fixes
formatting (or `make format` for targeted runs).
- [ ] Tests added or updated for new or changed behavior.
- [ ] No secrets, API keys, or credentials committed.
- [ ] Doc pages updated for any user-facing behavior changes (new
commands, changed defaults, new features, bug fixes that contradict
existing docs).

### Doc Changes
<!-- Skip if this PR has no doc changes. -->
- [ ] Follows the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md).
Try running the `update-docs` agent skill to draft changes while
complying with the style guide. For example, prompt your agent with
"`/update-docs` catch up the docs for the new changes I made in this
PR."
- [ ] New pages include SPDX license header and frontmatter, if creating
a new page.
- [ ] Cross-references and links verified.

---
<!-- DCO sign-off (required by CI). Replace with your real name and
email. -->
Signed-off-by: Your Name <your-email@example.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Documentation**
* Updated README and site docs with a new Notice & Disclaimer clarifying
the software’s alpha status, advising against production use, and adding
warranty and liability disclaimers.
* Added a prominent announcement/warning on the docs homepage and
release notes, removed the prior embedded alpha include, and noted that
the software may access external materials subject to separate
terms—users must verify suitability and compliance.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

The NVIDIA provider (`build`) in `REMOTE_PROVIDER_CONFIG` is the only
remote provider missing the `skipVerify` flag. This causes `openshell
inference set` to attempt endpoint verification during onboarding, which
fails even when the API is reachable from the host — blocking onboarding
completely at step 4/7.

All other providers (OpenAI, Anthropic, Gemini, custom) already set
`skipVerify: true` so that `--no-verify` is passed to the inference
setup command. This adds the same flag to the NVIDIA provider for
consistent behavior.

## Changes

- Added `skipVerify: true` to the `build` entry in
`REMOTE_PROVIDER_CONFIG` (`bin/lib/onboard.js`)

## Test Plan

- [x] Verified all other providers already include `skipVerify: true`
- [x] Confirmed `skipVerify` controls the `--no-verify` flag at line
2420
- [x] Ran full test suite: 650/653 passed (3 pre-existing failures in
`service-env.test.js` unrelated to this change)

Closes #1130

Signed-off-by: latenighthackathon
<latenighthackathon@users.noreply.github.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* Enabled skip verification support for the NVIDIA Endpoints provider
during onboarding, aligning its behavior with other remote providers for
a more consistent setup experience.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
Co-authored-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
…1123)

## Summary
Improves test determinism, consistency, and reliability across CLI,
uninstall, and blueprint test suites by standardizing shell invocation,
tightening execution patterns, and removing redundant or outdated test
code.

---

## Related Issue
Fixes #977 (part 1)

---

## Changes
- Normalize shell invocation:
- Replace `bash -lc` with `bash -c` in uninstall tests to avoid shell
initialization side effects
- Improve CLI test stability:
  - Increase timeouts for long-running commands
  - Standardize usage of `runWithEnv(..., timeout)`
- Remove redundant / outdated test code:
  - Clean up unused or deprecated test logic in `runner.test.ts`
- Improve test consistency:
  - Align execution patterns across CLI and uninstall tests
- Preserve security coverage:
- Maintain regression protections (e.g., path validation and credential
handling)

---

## Verification
- `npm test` passes locally
- `npx prek run --all-files` passes in CI
- No changes to CLI behavior or runtime logic
- Existing security and regression tests continue to pass

---

## Rationale
Some tests relied on shell initialization behavior (`bash -lc`) and
inconsistent execution patterns, leading to flakiness and
non-deterministic outcomes.

These updates:
- eliminate shell-dependent variability  
- standardize execution across test suites  
- improve reliability without impacting functionality  

Additionally, minor cleanup removes redundant or outdated test code to
improve maintainability.

---

## Risk Assessment
**Low risk**

- Changes are limited to test code and execution behavior  
- No production code paths modified  
- Security and regression coverage preserved  

**Rollback**
- Fully reversible by reverting test changes

---

## Type of Change
- [x] Test / infrastructure improvement (no behavioral change)
- [x] Code cleanup / maintenance

---

## Testing
- [x] `npm test` passes  
- [x] `npx prek run --all-files` passes (CI)  

---

## Checklist

### General
- [x] Contributing guide followed  

### Code Changes
- [x] Formatters applied  
- [x] No user-facing behavior changes  
- [x] No secrets committed  

---

## Summary by CodeRabbit (updated)
* **Tests**
* Improved CLI and uninstall test determinism by standardizing shell
invocation
  * Increased timeouts to reduce flakiness in long-running test cases
* Removed redundant or outdated test logic for improved maintainability


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
  * Improved TypeScript typing inside test mocks for safer compilation.
* Added a shared snapshot constant and strengthened a null-check
assertion.
* Simplified test environment handling by explicitly setting HOME and
adjusting shell invocation semantics.
* Removed redundant inline comment assertions and tightened output
checks.
* Minor formatting and clarity tweaks across test suites for easier
maintenance.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Krish Sapru <ksapru@bu.edu>

---------

Co-authored-by: Carlos Villela <cvillela@nvidia.com>
## Problem

After pulling latest main, `prek run -a` fails with:

1. **ESLint error** in `snapshot.test.ts` —
`@typescript-eslint/consistent-type-imports` forbids `import()` type
annotations in generic parameters
2. **Slow tests** — gateway recovery tests sleep 8–19s each waiting for
health polls that will never succeed against stub `openshell` scripts,
adding ~40s to the suite

## Changes

### ESLint fix
- `snapshot.test.ts`: replace `importOriginal<typeof
import("node:fs")>()` with a top-level `import type fs from "node:fs"`
and use `importOriginal<typeof fs>()`

### Test performance (70s → 28s)
- `onboard.js`: make health-poll count and interval configurable via
`NEMOCLAW_HEALTH_POLL_COUNT` and `NEMOCLAW_HEALTH_POLL_INTERVAL` env
vars (production defaults unchanged)
- `cli.test.js`: set `POLL_COUNT=1`, `POLL_INTERVAL=0` in the
`runWithEnv` test helper so gateway recovery tests skip unnecessary
sleep loops
- `cli.test.js`: reduce test timeouts from 25s → 10s accordingly

### Formatting (pre-commit hooks)
- Prettier reformatting in `runner.test.ts`, `state.test.ts`,
`uninstall.test.js`

## Verification

- All 740 tests pass
- All pre-commit hooks pass (ESLint, Prettier, TypeScript, tests)
- No production behavior change (env vars default to existing values)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Chores**
* Made gateway health polling configurable via environment variables
(poll count and interval).

* **Tests**
* Reduced CLI test timeouts and injected health-check env vars to speed
CI runs.
* Improved test typing/mocking and cleaned up test formatting for
stability and readability.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…son (#1221)

## Problem

`nemoclaw -v` reported `v0.1.0` — a hard-coded value in `package.json`
that was never updated when new `v*` git tags were created. The
`install.sh` banner had the same issue.

## Solution

Derive the version from git tags at runtime, with a layered fallback
chain:

| Install path | Primary source | Fallback |
|---|---|---|
| Dev clone (full repo) | `git describe --tags --match "v*"` |
`package.json` |
| `install.sh` remote (shallow clone) | `git describe` after explicit
`v*` tag fetch | `.version` file stamped during install |
| `npm install -g` (published tarball) | `.version` stamped by
`prepublishOnly` | `package.json` |

### Pre-push hook

Added `scripts/check-version-tag-sync.sh` as a prek pre-push hook. When
pushing a `v*` tag, it verifies `package.json` at the tagged commit has
a matching version. Regular branch pushes are unaffected.

## Files changed

- **`bin/lib/version.js`** (new) — version resolver: git describe →
`.version` → `package.json`
- **`bin/nemoclaw.js`** — uses `getVersion()` instead of `pkg.version`
- **`install.sh`** — `resolve_installer_version()` uses same fallback
chain; fetches `v*` tags into shallow clones
- **`package.json`** — `prepublishOnly` stamps `.version`; `files`
includes it in tarball
- **`.gitignore`** — ignores generated `.version`
- **`scripts/check-version-tag-sync.sh`** (new) — pre-push guard
- **`.pre-commit-config.yaml`** — registers the pre-push hook
- **`test/install-preflight.test.js`** — updated for new version
resolution

## Testing

- All 740 tests pass
- Verified all install paths (dev clone, shallow clone, tag divergence)
- Verified pre-push hook blocks/passes correctly


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* CLI and installer now determine and display releases using git tags,
stamped version files, or package metadata for more reliable version
reporting.
* Health poll retry counts and intervals for gateway operations are
configurable via environment variables.

* **Chores**
* Added a pre-push check to ensure git tag and package.json version stay
in sync.
  * `.version` files are now ignored.

* **Tests**
* Tests updated to accept git-describe style versions, use shorter CLI
timeouts, and include minor typing/formatting refactors.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

Six functions in `bin/lib/onboard.js` create temporary files using
`Date.now()` and `Math.random().toString(36)`, both of which are
predictable. A local attacker can win a TOCTOU race to pre-create a
symlink at the predicted path in `/tmp`, enabling:

1. **Data exfiltration** — redirect curl output (API responses with
model data) to an attacker-controlled location
2. **Script injection** — via `writeSandboxConfigSyncFile`, inject a
malicious script that gets piped into `openshell sandbox connect`

## Changes

**`bin/lib/onboard.js`:**
- Added `secureTempFile(prefix, ext)` helper that uses
`fs.mkdtempSync()` (OS-level `mkdtemp` syscall with cryptographically
random suffix)
- Replaced all 6 predictable temp file constructions:
  - `probeOpenAiLikeEndpoint` (line ~667)
  - `probeAnthropicEndpoint` (line ~711)
  - `fetchNvidiaEndpointModels` (line ~857)
  - `fetchOpenAiLikeModels` (line ~911)
  - `fetchAnthropicModels` (line ~947)
  - `writeSandboxConfigSyncFile` (line ~527)

**`test/onboard.test.js`:**
- Updated the sync file test to validate the new mkdtemp-based path
structure instead of the old predictable pattern

## Why this approach

The file already uses `fs.mkdtempSync()` securely in two other places
(lines 1781 and 2710), making this fix consistent with existing code
patterns rather than introducing new dependencies.

The remaining `Date.now()` usage in `patchStagedDockerfile` is a build
identifier written into the Dockerfile, not a temp filename — left
unchanged.

Fixes #1093

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Refactor**
* Improved temporary-file handling: sync scripts and downloaded
responses are now written to securely named temp subdirectories and
cleaned up at the directory level to reduce leftover artifacts.

* **Tests**
* Updated tests to validate new temp-file placement and naming pattern,
exact script contents, and tightened directory-level cleanup checks.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Benedikt Schackenberg
<6381261+BenediktSchackenberg@users.noreply.github.com>
…#1227)

## Problem

The port-availability preflight check in `checkPortAvailable` retries
with `sudo lsof` to detect root-owned listeners when unprivileged `lsof`
returns empty output. Without `-n` (non-interactive mode), `sudo` blocks
on a `Password:` prompt, stalling non-interactive installs:

- `curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash`
- `nemoclaw onboard --non-interactive`
- `NEMOCLAW_NON_INTERACTIVE=1`
- CI pipelines

## Fix

Use `sudo -n` so it fails immediately when passwordless sudo is
unavailable, falling through to the TCP bind probe instead of hanging.

This is the only `sudo` call in the codebase that should degrade
gracefully — the other `sudo` calls (swap creation) intentionally
require elevated privileges to succeed.
…1055)

## Summary

Fixes #1010

- Replace text-based string manipulation in `mergePresetIntoPolicy()`
with structured YAML parsing via the `yaml` package
- Merge `network_policies` by name: preset entries override existing on
name collision (prevents duplicates on re-apply)
- Preserve all non-network sections (`filesystem_policy`, `process`,
`landlock`, etc.)
- Falls back to text-based approach for non-standard entry formats
(backward compatibility)

## Problem

The previous implementation used regex and line splitting to inject
preset entries into existing policy YAML. This produced invalid YAML
when:
- The same preset was applied twice (duplicate entries)
- Indentation varied between the current policy and preset content
- `network_policies:` appeared at unexpected positions in the document

## Changes

- `bin/lib/policies.js` — Rewrite `mergePresetIntoPolicy()` to parse
YAML, merge objects, serialize back. Add `yaml` as dependency.
- `test/policies.test.js` — Add 3 tests with realistic preset data:
structured merge, name collision dedup, non-network section
preservation. Relax existing string-format assertions to check
correctness instead of exact formatting.

## Test plan

- [x] All 25 policy tests pass
- [x] Full suite: 603/605 pass (2 pre-existing failures in
`install-preflight.test.js`)
- [x] Tested with real presets (npm, pypi, slack) on a live sandbox

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Improved policy merging to prefer structured YAML parsing, preserve
other top-level sections, set a version field, and fall back to a
tolerant text-based merge when structured parsing or merging isn’t
possible.
* **Tests**
* Expanded tests for structured merges, preset overrides/deduplication,
preservation of non-policy sections, and relaxed header-placement
checks.
* **Chores**
  * Added a YAML parsing library as a runtime dependency.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: KJ <kejones@nvidia.com>
…ints (#1220)

## Summary

- Add missing `protocol: rest`, `enforcement: enforce`, and `tls:
terminate` to `statsig.anthropic.com` and `sentry.io` endpoints in
`openclaw-sandbox.yaml`

## Related Issue

Closes #1214

## Changes

Both endpoints define GET/POST method rules, but without `protocol:
rest` the proxy treats them as L4-only connections — the rules are never
evaluated and any HTTP method is allowed through.

The fix adds the same three fields that `api.anthropic.com` (the
adjacent endpoint in the same policy group) already has:

```yaml
protocol: rest
enforcement: enforce
tls: terminate
```

No new endpoints, no rule changes — just enabling L7 inspection on two
endpoints that already have rules written for it.

## Testing

- Verified the YAML structure matches the working `api.anthropic.com`
endpoint pattern
- No schema changes — `protocol`, `enforcement`, and `tls` are existing
fields used across the policy file

## Checklist

- [x] Conventional commit format
- [x] Scoped to issue, no unrelated changes
- [x] No secrets or credentials

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
* Updated network policy configurations for external service endpoints,
including enhanced security enforcement and TLS termination settings.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
Co-authored-by: KJ <kejones@nvidia.com>
…ript modules (#1240)

## Summary

- Extract ~210 lines of pure, side-effect-free functions from the
3,800-line `onboard.js` into **5 typed TypeScript modules** under
`src/lib/`:
- `gateway-state.ts` — gateway/sandbox state classification from
openshell output
- `validation.ts` — failure classification, API key validation, model ID
checks
  - `url-utils.ts` — URL normalization, text compaction, env formatting
  - `build-context.ts` — Docker build context filtering, recovery hints
  - `dashboard.ts` — dashboard URL resolution and construction
- Add **56 co-located unit tests** (`src/lib/*.test.ts`) for the
extracted modules
- Set up CLI TypeScript compilation: `tsconfig.src.json` compiles `src/`
→ `dist/` as CJS
- `onboard.js` imports from compiled `dist/lib/` output — transparent to
callers
- Pre-commit hook updated to build TS and include `dist/lib/` in
coverage

These functions are **not touched by any #924 blocker PR** (#781, #782,
#819, #672, #634, #890), so this extraction is safe to land immediately.

## Test plan

- [x] 598 CLI tests pass (542 existing + 56 new)
- [x] `tsc -p tsconfig.src.json` compiles cleanly
- [x] `tsc -p tsconfig.cli.json` type-checks cleanly
- [x] `tsc -p jsconfig.json` type-checks cleanly
- [x] Coverage ratchet passes with `dist/lib/` included

Closes #1237. Relates to #924 (shell consolidation).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Improved sandbox-creation recovery hints and targeted remediation
commands.
  * Smarter dashboard URL resolution and control-UI URL construction.

* **Bug Fixes**
  * More accurate gateway and sandbox state detection.
* Enhanced classification of validation/apply failures and safer
model/key validation.
  * Better provider URL normalization and loopback handling.

* **Tests**
  * Added comprehensive tests covering new utilities.

* **Chores**
* CLI now builds before CLI tests; CI/commit hooks updated to run the
CLI build.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ard (#1235)

## Summary

- Delete `scripts/setup.sh` (324 lines) — every step it performed is
already handled by `nemoclaw onboard` with better error recovery
(exponential backoff via pRetry), interactive prompts, registry
management, and WSL2-aware Ollama binding
- `nemoclaw setup` now delegates to `onboard` instead of shelling out to
the deleted script
- `brev-setup.sh` installs the nemoclaw CLI and runs `nemoclaw onboard
--non-interactive` instead of calling setup.sh
- Remove setup.sh-specific tests and assertions (3 test removals across
3 files); all equivalent assertions against `onboard.js` already exist

## Test plan

- [x] 541 CLI tests pass (`vitest run --project cli`)
- [ ] Verify `nemoclaw setup` prints deprecation notice and runs onboard
- [ ] Verify `brev-setup.sh` bootstraps correctly on a fresh Brev VM
(E2E)

Relates to #924 (shell consolidation). Supersedes #1230.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Refactor**
* Migrated legacy setup to a unified onboarding flow and removed the old
top-level setup orchestration.

* **Chores**
* Updated bootstrap to build/install the CLI locally and run onboarding
non-interactively.
  * Adjusted setup walkthrough instructions to reference onboarding.

* **Tests**
* Removed tests tied to the removed setup script and sandbox-name
behavior.
* Added CLI dispatch tests verifying setup argument forwarding into
onboarding.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…1217)

## Summary

Fixes #117 — Insecure authentication configuration in gateway setup.

> **Supersedes PR #123**, which targeted the correct security issues but
patched code paths that have since moved. Recommend closing #123 in
favor of this PR.

---

## Background 

NemoClaw runs an AI assistant inside a locked-down sandbox. The
**gateway** is the front door to that sandbox, and the **Control UI** is
a web dashboard users open in their browser to interact with the
assistant.

There were three security problems with that front door:

### Problem 1: Device auth was permanently disabled 🔓

The config had `dangerouslyDisableDeviceAuth: True` hardcoded. This
meant any device could connect to the gateway without proving its
identity. It was added as a temporary workaround (commit `5fb629f`) and
never reverted. While the sandbox network isolation mitigates this
today, it becomes dangerous if combined with LAN-bind changes or
cloudflared tunnels in remote deployments — resulting in an
unauthenticated, publicly reachable dashboard.

**Fix:** Default to `false` (device auth enabled). Developers who
genuinely need it disabled for headless/dev environments can set
`NEMOCLAW_DISABLE_DEVICE_AUTH=1` as a build arg.

### Problem 2: Insecure auth was always allowed, even over HTTPS 🌐

`allowInsecureAuth: True` was hardcoded, meaning insecure (non-HTTPS)
authentication was permitted regardless of deployment context. For local
development on `http://127.0.0.1` this is fine, but for
production/remote deployments over `https://` it defeats the purpose of
TLS.

**Fix:** Derive it from the `CHAT_UI_URL` scheme. If the URL is
`http://` → allow insecure auth (local dev). If `https://` → block it
(production).

### Problem 3: The auto-pair bouncer let everyone in 🚪

The `start_auto_pair` watcher automatically approved *every* pending
device pairing request with zero validation. A rogue or unexpected
client type could pair with the gateway unchallenged.

**Fix:** Only approve devices with recognized `clientId`
(`openclaw-control-ui`) or `clientMode` (`webchat`). Unknown clients are
rejected and logged.

---

## Why this PR exists (instead of merging #123)

PR #123 (opened Mar 17) correctly identified all three issues and
proposed the right fixes. However, since then, commit `0f8eedd` ("make
openclaw.json immutable at runtime, move config to build time") moved
the gateway `controlUi` configuration from `scripts/nemoclaw-start.sh`
into the **Dockerfile**. The config is now baked at image build time and
verified via SHA-256 hash at startup.

PR #123's changes to the start script's Python config block would patch
dead code — that block no longer exists. This PR applies the same
security logic where the code actually lives now:

| Issue | PR #123 patched | This PR patches |
|-------|----------------|----------------|
| `dangerouslyDisableDeviceAuth` | `scripts/nemoclaw-start.sh` (removed)
| `Dockerfile` (build-time config) |
| `allowInsecureAuth` | `scripts/nemoclaw-start.sh` (removed) |
`Dockerfile` (build-time config) |
| Auto-pair whitelisting | `scripts/nemoclaw-start.sh` ✅ |
`scripts/nemoclaw-start.sh` ✅ |

---

## Changes

### `Dockerfile` (+8/-2)
- `dangerouslyDisableDeviceAuth` → derived from
`NEMOCLAW_DISABLE_DEVICE_AUTH` env var (default `0` = secure)
- `allowInsecureAuth` → derived from `CHAT_UI_URL` scheme (`false` when
`https://`)
- New `ARG NEMOCLAW_DISABLE_DEVICE_AUTH=0` with `ENV` promotion (follows
existing C-2 safe pattern — no ARG interpolation into Python source)

### `scripts/nemoclaw-start.sh` (+12/-4)
- Auto-pair only approves `ALLOWED_CLIENTS = {'openclaw-control-ui'}`
and `ALLOWED_MODES = {'webchat'}`
- Validates `isinstance(device, dict)` before field access
- Rejects unknown clients with `[auto-pair] rejected unknown client=...
mode=...` log
- Logs `client=` on approval for audit trail
- Documents `NEMOCLAW_DISABLE_DEVICE_AUTH` in script header

### `test/security-c2-dockerfile-injection.test.js` (+6 tests)
- No hardcoded `True` for `dangerouslyDisableDeviceAuth`
- No hardcoded `True` for `allowInsecureAuth`
- Env var derivation for both settings
- `ARG` defaults to `0`
- `ENV` promotion before `RUN` layer

### `test/nemoclaw-start.test.js` (+7 tests)
- Whitelist definitions (`ALLOWED_CLIENTS`, `ALLOWED_MODES`)
- Rejection logic for unknown clients
- Dict type validation
- Client identity in approval logs
- No unconditional approval pattern
- `NEMOCLAW_DISABLE_DEVICE_AUTH` documented

---

## Why no Brev E2E test

These changes are **deterministic config values and simple conditional
logic**, not runtime behavior with environmental ambiguity:

1. **Dockerfile config** — Two Python booleans evaluated at build time:
`os.environ.get(...) == '1'` and `parsed.scheme != 'https'`. If the
source is correct, the output is correct.
2. **Auto-pair whitelisting** — Python set membership check. No shell
interpretation, no SSH pipeline, no quoting edge cases.

This is fundamentally different from something like the telegram bridge
injection fix, where you *must* send real payloads through a real shell
on a real sandbox to prove metacharacters don't get interpreted. Our
changes have no gap between "looks right in source" and "works right at
runtime." The 13 source-pattern tests provide sufficient confidence.

If deeper verification is desired in the future, a **build-time unit
test** that runs the Dockerfile's Python snippet and asserts the JSON
output would be the appropriate next step — a regular CI test, not a
Brev instance.

---

## Test plan

- [x] All 752 tests pass (739 existing + 13 new)
- [x] `shellcheck` ✅
- [x] `hadolint` ✅
- [x] `shfmt` / `prettier` ✅
- [x] `gitleaks` ✅
- [ ] Default onboarding: gateway starts with device auth enabled,
Control UI pairing works via auto-pair
- [ ] `NEMOCLAW_DISABLE_DEVICE_AUTH=1` build: device auth is disabled
(backward-compatible escape hatch)
- [ ] `CHAT_UI_URL=https://...` build: `allowInsecureAuth` is `false`
- [ ] `CHAT_UI_URL=http://127.0.0.1:18789` (default):
`allowInsecureAuth` is `true`
- [ ] Unknown device type connecting to gateway: auto-pair rejects and
logs
- [ ] `openclaw-control-ui` / `webchat` devices: auto-pair approves as
before

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Security**
* Device authentication can be toggled at build time; UI insecure-auth
is allowed only when the control UI URL uses http.
* **Bug Fixes**
* Auto-pair processing now validates inputs, rejects unknown
client/mode, logs identifiers, and deduplicates requests to avoid
repeated approvals/rejections.
* **Tests**
* Added tests for auto-pair validation, logging, de-duplication, and
configurable auth behavior.
* **Chores**
* Sandbox/setup flow now enforces the build-time device-auth setting for
images.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- markdownlint-disable MD041 -->
## Summary
Follow up of #1207 to add unit tests

## Changes
Add tests verifying CoreDNS patching is skipped on WSL2 and still
applied on non-WSL runtimes. Existing tests updated to pass explicit
platform opts so they work correctly when run from a WSL2 host.

## Type of Change
<!-- Check the one that applies. -->
- [ ] Code change for a new feature, bug fix, or refactor.
- [ ] Code change with doc updates.
- [ ] Doc only. Prose changes without code sample modifications.
- [ ] Doc only. Includes code sample changes.

## Testing
<!-- What testing was done? -->
- [X] `npx prek run --all-files` passes (or equivalently `make check`).
- [X] `npm test` passes.
- [ ] `make docs` builds without warnings. (for doc-only changes)

## Checklist

### General

- [X] I have read and followed the [contributing
guide](https://github.com/NVIDIA/NemoClaw/blob/main/CONTRIBUTING.md).
- [ ] I have read and followed the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md).
(for doc-only changes)

### Code Changes
<!-- Skip if this is a doc-only PR. -->
- [X] Formatters applied — `npx prek run --all-files` auto-fixes
formatting (or `make format` for targeted runs).
- [X] Tests added or updated for new or changed behavior.
- [X] No secrets, API keys, or credentials committed.
- [ ] Doc pages updated for any user-facing behavior changes (new
commands, changed defaults, new features, bug fixes that contradict
existing docs).

### Doc Changes
<!-- Skip if this PR has no doc changes. -->
- [ ] Follows the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md).
Try running the `update-docs` agent skill to draft changes while
complying with the style guide. For example, prompt your agent with
"`/update-docs` catch up the docs for the new changes I made in this
PR."
- [ ] New pages include SPDX license header and frontmatter, if creating
a new page.
- [ ] Cross-references and links verified.


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
* Improved platform runtime tests to pass full context options and
validate behavior for specific known runtimes.
* Added a coverage case ensuring WSL detection prevents the patch in the
appropriate scenario.
  * Retained the existing test for skipping unknown runtimes.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Carlos Villela <cvillela@nvidia.com>
## Summary
- **install.sh missing build:cli**: PR #1240 replaced `bin/lib/` CJS
modules with shims that re-export from `dist/lib/`, but `install.sh`
never ran `npm run build:cli`. Every fresh install crashed on `nemoclaw
onboard` with `Cannot find module '../../dist/lib/preflight'`
- **E2E destroy verification**: The cleanup phase used `nemoclaw list`
to verify a sandbox was gone, but `list` triggers gateway recovery which
can restart a destroyed gateway and re-import the sandbox. Now checks
the registry file directly instead
- Uses `--if-present` so installs targeting older tags without
`build:cli` don't break

## Test plan
- [ ] Run `install.sh` on a clean VM and verify `nemoclaw onboard` no
longer crashes
- [ ] Verify `dist/lib/preflight.js` exists after install completes
- [ ] Nightly E2E cloud-e2e passes the destroy verification phase

---------

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
## Summary

Fix incorrect `-f` shorthand in
`docs/monitoring/monitor-sandbox-activity.md`. The CLI only accepts
`--follow` — the `-f` flag is silently ignored, causing users to miss
real-time log output.

## Related Issue

Fixes #1021

## Changes

- Line 57: `nemoclaw <name> logs -f` → `nemoclaw <name> logs --follow`
- Line 90: `nemoclaw <name> logs -f` → `nemoclaw <name> logs --follow`

## Type of Change

- [ ] Code change for a new feature, bug fix, or refactor.
- [ ] Code change with doc updates.
- [x] Doc only. Prose changes without code sample modifications.
- [ ] Doc only. Includes code sample changes.

## Testing

- [ ] `npx prek run --all-files` passes (or equivalently `make check`).
- [ ] `npm test` passes.
- [x] `make docs` builds without warnings. (for doc-only changes)

## Checklist

### General

- [x] I have read and followed the [contributing
guide](https://github.com/NVIDIA/NemoClaw/blob/main/CONTRIBUTING.md).
- [x] I have read and followed the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md).
(for doc-only changes)

### Doc Changes

- [x] Follows the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md).
- [ ] New pages include SPDX license header and frontmatter, if creating
a new page.
- [x] Cross-references and links verified.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Documentation**
* Standardized log-following examples to use the long-form --follow flag
in monitoring and troubleshooting instructions for consistent real-time
log streaming.
* Improved formatting by splitting a status/list command across lines to
enhance readability without changing command semantics.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: latenighthackathon
<latenighthackathon@users.noreply.github.com>

---------

Co-authored-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
## Summary
- Convert `bin/lib/runtime-recovery.js` (84 lines) to
`src/lib/runtime-recovery.ts`
- Typed `StateClassification` interface for sandbox/gateway state
- 5 pure classifiers/parsers + 1 I/O function (`getRecoveryCommand`)
- Co-locate tests with additional edge cases

Stacked on #1240. 617 CLI tests pass. Coverage ratchet passes.

Relates to #924 (shell consolidation).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Improved runtime recovery logic to better parse CLI output and
determine recovery actions, including more reliable sandbox/gateway
state classification and recovery command selection.

* **Bug Fixes**
* Correctly treats "Disconnected" gateway status as inactive and handles
error/empty lines when listing sandboxes.

* **Tests**
* Expanded test coverage for edge cases, malformed input, and empty
states.

* **Refactor**
  * Runtime recovery helpers consolidated to a shared compiled module.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
- Convert `bin/lib/local-inference.js` (228 lines) to
`src/lib/local-inference.ts`
- Typed interfaces: `ValidationResult`, `GpuInfo`, `RunCaptureFn`
- 9 pure functions + 4 with DI-injected `runCapture`
- Co-locate tests (30 tests)

Stacked on #1240. 616 CLI tests pass. Coverage ratchet passes.

Relates to #924 (shell consolidation).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added local inference provider validation and connectivity diagnostics
for vllm-local and ollama-local.
* Implemented Ollama model discovery, warmup/probe commands, and
intelligent model selection based on GPU memory.
* Added health-check and model readiness checks to surface configuration
or networking issues earlier.

* **Tests**
* Enhanced test coverage for provider validation, model detection, and
unknown-provider handling; updated test targets accordingly.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
- Convert `bin/lib/nim.js` (259 lines) to `src/lib/nim.ts`
- Typed interfaces: `NimModel`, `GpuDetection`, `NimStatus`
- 4 pure functions + subprocess-heavy I/O functions (docker, nvidia-smi)
- Co-locate tests (16 tests) with updated mock pattern for dist/ paths

Stacked on #1240. 615 CLI tests pass. Coverage ratchet passes.

Relates to #924 (shell consolidation).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
  * GPU detection with capability assessment for NIM containers
* Container lifecycle management (start, stop, status monitoring) with
automatic health checks
  * Model enumeration and image selection for launching containers

* **Refactor**
* Simplified module surface and improved internal architecture for
maintainability

* **Tests**
* Updated test suite to align with the refactored module behavior and
stronger typings
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixes: #1163

Now, install.sh does onboarding step. Hence, there is no need to "run
nemoclaw onboard" in the setup-spark.sh
This also aligns with the spark setup instructions in the
spark-install.md

<!-- markdownlint-disable MD041 -->
## Summary
<!-- 1-3 sentences: what this PR does and why. -->

## Related Issue
<!-- Link to the issue: Fixes #NNN or Closes #NNN. Remove this section
if none. -->

## Changes
<!-- Bullet list of key changes. -->

## Type of Change
<!-- Check the one that applies. -->
- [ ] Code change for a new feature, bug fix, or refactor.
- [ ] Code change with doc updates.
- [ ] Doc only. Prose changes without code sample modifications.
- [ ] Doc only. Includes code sample changes.

## Testing
<!-- What testing was done? -->
- [ ] `npx prek run --all-files` passes (or equivalently `make check`).
- [ ] `npm test` passes.
- [ ] `make docs` builds without warnings. (for doc-only changes)

## Checklist

### General

- [ ] I have read and followed the [contributing
guide](https://github.com/NVIDIA/NemoClaw/blob/main/CONTRIBUTING.md).
- [ ] I have read and followed the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md).
(for doc-only changes)

### Code Changes
<!-- Skip if this is a doc-only PR. -->
- [ ] Formatters applied — `npx prek run --all-files` auto-fixes
formatting (or `make format` for targeted runs).
- [ ] Tests added or updated for new or changed behavior.
- [ ] No secrets, API keys, or credentials committed.
- [ ] Doc pages updated for any user-facing behavior changes (new
commands, changed defaults, new features, bug fixes that contradict
existing docs).

### Doc Changes
<!-- Skip if this PR has no doc changes. -->
- [ ] Follows the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md).
Try running the `update-docs` agent skill to draft changes while
complying with the style guide. For example, prompt your agent with
"`/update-docs` catch up the docs for the new changes I made in this
PR."
- [ ] New pages include SPDX license header and frontmatter, if creating
a new page.
- [ ] Cross-references and links verified.

---
<!-- DCO sign-off (required by CI). Replace with your real name and
email. -->
Signed-off-by: Your Name <your-email@example.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
* Streamlined DGX Spark Docker setup instructions by removing a
post-configuration guidance prompt.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Paritosh Dixit <paritoshd@nvidia.com>
Co-authored-by: Aaron Erickson 🦞 <aerickson@nvidia.com>
## Summary
- Convert `bin/lib/onboard-session.js` (440 lines) to
`src/lib/onboard-session.ts`
- Typed interfaces: `Session`, `StepState`, `SessionFailure`,
`LockResult`, `SessionUpdates`, `LockInfo`, `SessionMetadata`
- Full migration including fs I/O (session persistence, file locking)
and pure helpers (redaction, validation, normalization)
- Co-locate tests (14 tests) with updated cache-clearing for dist/ path

Stacked on #1240. 615 CLI tests pass. Coverage ratchet passes.

Relates to #924 (shell consolidation).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Refactor**
* Reorganized onboarding session management module structure and
improved distribution of the codebase. Session tracking, step-level
state management, lock handling, and sensitive data redaction
functionality have been consolidated and optimized for better
maintainability.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
- Convert `bin/lib/inference-config.js` (143 lines) to
`src/lib/inference-config.ts`
- Typed interfaces: `ProviderSelectionConfig`, `GatewayInference`
- All 3 exports are pure — straightforward full conversion
- Co-locate tests: `test/inference-config.test.js` →
`src/lib/inference-config.test.ts`

Stacked on #1240. 614 CLI tests pass. Coverage ratchet passes.

Relates to #924 (shell consolidation).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Refactor**
* Centralized inference configuration, provider selection, and gateway
output parsing into a single consolidated module for more consistent
behavior.

* **Tests**
* Updated tests to match the consolidated module and improved assertion
patterns.

* **Chores**
* Install/prep process now includes a CLI build step during project
setup (silently handled).
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
- remove `Qwen3.5 397B A17B` from the curated NVIDIA Endpoints cloud
model picker
- keep the rest of the routed inference model handling unchanged

## Why
QA has been using `Qwen3.5 397B A17B` as the default validation target
for NVIDIA Endpoints onboarding, but the endpoint team has confirmed
that this model is currently not working.

This PR does **not** claim that the rest of the NVIDIA Endpoints catalog
is broken. It only removes the specific known-bad curated picker entry
so onboarding defaults and manual QA runs stop selecting it from the
built-in list.

Related #1161

## Validation
```bash
npx vitest run test/inference-config.test.js
npx eslint bin/lib/inference-config.js test/inference-config.test.js
npx tsc -p jsconfig.json --noEmit
```

Signed-off-by: Kevin Jones <kejones@nvidia.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
  * Removed Qwen3.5 397B A17B model from available cloud model options.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Co-authored-by: Aaron Erickson <aerickson@nvidia.com>
## Summary

- harden sandbox startup on Spark/arm64 by unwrapping the `env ...
nemoclaw-start` bootstrap self-invocation inside the entrypoint instead
of trying to `gosu` into it
- make `nemoclaw destroy` truthful: fail on real OpenShell delete
errors, but treat already-missing live sandboxes as safe stale-registry
cleanup
- add regression coverage for both the Spark bootstrap wrapper path and
the destroy error-handling paths

## Issues

- Fixes #1054
- Fixes #1028
- Fixes #1243

## Investigated But Not Included

- #1063
  - investigated during this branch
- still appears to be the same upstream trusted-proxy/local-DNS behavior
tracked by #396
- direct `curl` inside the sandbox works, which points away from a
bounded NemoClaw-only fix in this PR

## Validation

Local:

```bash
npm run build:cli
npx vitest run test/nemoclaw-start.test.js test/cli.test.js test/onboard.test.js
npx eslint bin/nemoclaw.js test/nemoclaw-start.test.js test/cli.test.js
npx tsc -p jsconfig.json --noEmit
```

Spark (`spark-d8c8`), commit `0efbd66` in a disposable worktree:

```bash
npm install --include=dev
cd nemoclaw && npm install --include=dev
cd ..
npm run build:cli
npx vitest run test/nemoclaw-start.test.js test/cli.test.js test/onboard.test.js
npx eslint bin/nemoclaw.js test/nemoclaw-start.test.js test/cli.test.js
npx tsc -p jsconfig.json --noEmit
```

Result:
- passed on Spark
- `3` files, `92` tests passed
- eslint passed
- typecheck passed

Spark caveat:
- the existing Spark checkout had incomplete root JS dev dependencies
(`vitest/config` missing), so the disposable worktree required a local
`npm install --include=dev` repair before validation

## Notes

Plan note:
- `/Users/kejones/nemoclaw-notes/spark-good-blockers-plan-2026-04-01.md`

Signed-off-by: Kevin Jones <kejones@nvidia.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Destroy now treats an already-absent sandbox as successful, prints a
clear "already absent" message, proceeds with cleanup, and only fails
for genuine delete errors (printing delete output).
* Start-wrapper handling improved so env ... nemoclaw-start invocations
correctly export and pass through environment assignments.

* **Tests**
* Added tests for destroy behavior (failure vs already-missing) and for
the start-script argument-unwrapping.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Aaron Erickson <aerickson@nvidia.com>
…d" (#1288)

## Summary

The `classifyGatewayStatus` regex matched "Not connected" as connected
because `\bConnected\b` matches the substring. This was partially fixed
in #1268 (added Disconnected guard) but "Not connected" still slipped
through.

Fix:
- Move unavailable-pattern check before connected check
- Use strict line-anchored regex: `/^\s*(?:Status:\s*)?Connected\s*$/im`
- Add regression tests for "Status: Not connected" → inactive

This fix was intended for #1268 but the auto-merge squashed before the
commit landed.

Fixes #1279.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Improved gateway connection status detection to handle edge cases and
properly classify various status formats, ensuring more accurate
connection state reporting.

* **Tests**
  * Added test coverage for additional gateway status variations.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary

Adds a `/cut-release-tag` agent skill that walks through creating
annotated semver tags on main. The skill:

- Fetches the latest semver tag and computes patch/minor/major bump
options
- Prompts for which bump (patch default)
- Shows the changelog since the last tag
- Creates an annotated tag + moves the `latest` floating tag
- Pushes both

Tested live — used it to cut `v0.0.3` just now.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
* Added release tagging workflow documentation with step-by-step
instructions for semantic version management, including tag creation,
changelog reference points, and verification checks.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@laitingsheng laitingsheng deleted the fix/brew-policy branch April 2, 2026 01:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

brew installer support or byo package guidance