fix: improve gateway lifecycle recovery by kjw3 · Pull Request #953 · NVIDIA/NemoClaw

kjw3 · 2026-03-26T01:57:19Z

Supersedes #952 and the closed #908. This branch rebuilds the same work as a single verified signed commit so it can satisfy the branch signature requirements.

Summary

- preserve a healthy shared nemoclaw gateway across repeat onboarding
- reconcile live OpenShell sandbox state during connect and status instead of trusting stale local registry entries
- classify restart/rebuild lifecycle failures so users get deterministic guidance instead of generic transport errors
- extend double-onboard coverage so creating a second sandbox does not break the first

Issues

- addresses #849
- addresses part of #859
- improves #888 by distinguishing:
  - gateway trust rotation / handshake failure
  - gateway metadata exists but the restarted API still refuses connections
  - gateway rebuilt and the old sandbox no longer exists
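
The three cases above could be modelled as a small classifier. This is a minimal sketch only: the function name `classifyGatewayFailure`, the returned labels, and the matched substrings are all hypothetical and are not taken from the PR or from actual OpenShell error text.

```javascript
// Hypothetical sketch: map raw CLI error text to one of the three lifecycle
// failure classes listed above. The substrings matched here are assumptions,
// not the real OpenShell error messages.
const FAILURE_CLASSES = Object.freeze({
  TRUST_ROTATED: "trust-rotated",
  API_REFUSING: "api-refusing-connections",
  SANDBOX_GONE: "sandbox-missing-after-rebuild",
  UNKNOWN: "unknown",
});

function classifyGatewayFailure(output = "") {
  const text = String(output).toLowerCase();
  if (/handshake|certificate|tls/.test(text)) return FAILURE_CLASSES.TRUST_ROTATED;
  if (/connection refused/.test(text)) return FAILURE_CLASSES.API_REFUSING;
  if (/sandbox.*not found|no such sandbox/.test(text)) return FAILURE_CLASSES.SANDBOX_GONE;
  // Anything unrecognised stays generic rather than being guessed at.
  return FAILURE_CLASSES.UNKNOWN;
}

console.log(classifyGatewayFailure("error: Connection refused (os error 111)"));
```

Returning an explicit `unknown` class keeps the guidance deterministic: callers can print targeted advice for the three known cases and fall back to the raw error otherwise.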

Security

- no secret persistence added
- no TLS downgrade or bypass added
- no destructive auto-recovery of healthy sandboxes
- no new shell-out paths beyond resolved openshell wrappers

Validation

npx vitest run test/cli.test.js test/onboard.test.js test/onboard-readiness.test.js test/registry.test.js
npx eslint bin/nemoclaw.js bin/lib/onboard.js test/cli.test.js test/onboard.test.js
npx tsc -p jsconfig.json --noEmit
bash -n test/e2e/test-double-onboard.sh
shellcheck test/e2e/test-double-onboard.sh

Brev CPU Validation

Environment:

brev-cpu
instance: kj-nemoclaw-cpu-20260325-155447
branch commit tested: 43cf8eb

Validated on a real disposable Linux host:

- onboard sandbox A / onboard sandbox B path preserves the shared gateway and keeps the first sandbox reachable
- after `openshell gateway stop` + `openshell gateway start --name nemoclaw`, NemoClaw now surfaces a precise post-restart classification instead of a generic transport failure
- after destructive gateway rebuild, NemoClaw removes the stale local sandbox entry when the old sandbox is gone
- rerunning onboard after that rebuild recreates the sandbox cleanly and returns to Ready

Residual

This PR does not make OpenShell gateway restarts durable. It makes the failure modes explicit, safer, and easier to recover from.
On the tested Brev CPU host, I did not find a safe non-destructive recovery once the restarted gateway API entered the persistent Connection refused state. That remains a gateway/runtime limitation, not something this PR tries to bypass.

Summary by CodeRabbit

New Features
- Reuse healthy gateways automatically, verify live sandbox/gateway state before actions, and add a non-fatal gateway startup mode for recovery.
Bug Fixes
- Improved gateway health detection, smarter port/forward handling, enhanced stale registry reconciliation, and consolidated dashboard/control UI routing and URL/token behavior.
Tests
- Expanded unit, integration, and e2e tests for lifecycle recovery, ANSI-safe outputs, logs forwarding, and registry reconciliation.
Chores
- Clean staged build context (remove Python venv/cache) and add helper cleanup script.
Documentation
- Minor macOS first-run checklist formatting and simplified NVIDIA Endpoints label.

coderabbitai · 2026-03-26T01:57:31Z

Walkthrough

Detects and reuses a healthy shared NemoClaw OpenShell gateway, adds live sandbox reconciliation for connect/status flows, introduces recovery-aware gateway startup modes, runs staged build cleanup for blueprint artifacts, and expands unit/CLI/e2e tests covering gateway lifecycle and registry reconciliation.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Core gateway lifecycle & onboard helpers `bin/lib/onboard.js` | Added ANSI-tolerant parsing (`stripAnsi`, `getActiveGatewayName`) and `isGatewayHealthy()`. Refactored startup into `startGatewayWithOptions(..., { exitOnFailure })`, added `startGatewayForRecovery()`, updated `preflight()` to reuse healthy gateways (skip 8080 check) and destroy stale state only when detected. Exported new functions. |
| CLI — openshell invocation, sandbox reconciliation & recovery `bin/nemoclaw.js` | Replaced generic runner usage with `getOpenshellBinary`, `runOpenshell`, `captureOpenshell`. Introduced sandbox/gateway reconciliation helpers (`getSandboxGatewayState`, `getNamedGatewayLifecycleState`, `getReconciledSandboxGatewayState`, `ensureLiveSandboxOrExit`) and `recoverNamedGatewayRuntime`. Made sandbox commands async and recovery-aware; updated connect/status/logs/destroy flows. |
| Staged build cleanup & setup `scripts/clean-staged-tree.sh`, `scripts/setup.sh` | Added `clean-staged-tree.sh` to remove Python artifacts (`.venv`, `.pytest_cache`, `__pycache__`) and invoked it from `scripts/setup.sh` for the `nemoclaw-blueprint` build context. |
| Tests — unit, CLI, gateway-cleanup, e2e `test/cli.test.js`, `test/onboard.test.js`, `test/gateway-cleanup.test.js`, `test/e2e/test-double-onboard.sh`, `test/onboard-readiness.test.js` | Expanded CLI test harness (`runWithEnv`) and many integration tests for logs forwarding, ANSI handling, registry mutation/reconciliation, recovery flows, and lifecycle messaging. Added `isGatewayHealthy` unit tests and mocked `startGateway()` integration test. Updated gateway-cleanup test expectations and rewrote double-onboard e2e into lifecycle-recovery flow with fake OpenAI endpoint and timeout handling. |
| Misc docs `README.md` | Small formatting tweak in macOS first-run checklist (inserted an empty bullet). |
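
The `{ exitOnFailure }` option mentioned in the table suggests a startup path that can either terminate the process or hand the failure back to a recovery caller. The following is a rough sketch of that pattern under stated assumptions: `startWithOptions`, `startForRecovery`, the injectable `exit` parameter, and the result shape are hypothetical names for illustration and do not reproduce the code in `bin/lib/onboard.js`.

```javascript
// Hypothetical sketch of an "exit on failure" switch. `start` stands in for
// whatever actually launches the gateway; `exit` is injectable so the fatal
// path can be exercised without killing the process.
function startWithOptions(
  start,
  { exitOnFailure = true, exit = (code) => process.exit(code) } = {},
) {
  try {
    start();
    return { ok: true };
  } catch (error) {
    if (exitOnFailure) {
      console.error(`Gateway failed to start: ${error.message}`);
      exit(1);
    }
    // Non-fatal mode: report the failure and let the caller decide.
    return { ok: false, error };
  }
}

// Recovery callers opt out of the hard exit.
function startForRecovery(start) {
  return startWithOptions(start, { exitOnFailure: false });
}

const result = startForRecovery(() => {
  throw new Error("simulated startup failure");
});
console.log(result.ok, result.error.message);
```

Keeping the fatal and non-fatal paths in one function means onboarding and recovery share the same startup logic and differ only in how a failure is surfaced.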

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant CLI as CLI (bin/nemoclaw.js)
    participant Onboard as Onboard (bin/lib/onboard.js)
    participant OpenShell as OpenShell
    participant Registry as Local Registry

    rect rgba(200,150,100,0.5)
    Note over User,Registry: Initial onboarding may recreate gateway
    User->>CLI: onboard
    CLI->>Onboard: startGateway()
    Onboard->>OpenShell: openshell status / gateway info
    Onboard->>OpenShell: destroy/start gateway (if stale)
    OpenShell-->>Onboard: gateway ready
    Onboard-->>CLI: success (new gateway)
    end

    rect rgba(100,150,200,0.5)
    Note over User,Registry: Reuse healthy shared gateway
    User->>CLI: onboard
    CLI->>Onboard: startGateway()
    Onboard->>OpenShell: openshell status / gateway info
    OpenShell-->>Onboard: Connected & named
    Onboard->>OpenShell: select existing gateway
    Onboard-->>CLI: success (reused gateway)
    end

    rect rgba(150,200,100,0.5)
    Note over User,Registry: Sandbox connect/status reconciliation
    User->>CLI: sandbox connect / status
    CLI->>Onboard: ensureLiveSandboxOrExit
    Onboard->>OpenShell: openshell sandbox get / logs
    OpenShell-->>Onboard: live sandbox state
    Onboard->>Registry: compare with local registry
    alt sandbox live
        Onboard->>OpenShell: forward/connect
    else registry stale
        Onboard->>Registry: remove stale entry
        Onboard-->>User: guidance / recovery messaging
    end
    end
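
The third block of the diagram (compare live state with the local registry, then either connect or remove the stale entry) can be sketched as follows. The data shapes (`Map` registry, `Set` of live names), the `gatewayReachable` flag, and the function name `reconcileSandbox` are assumptions made for illustration, not the PR's actual helpers.

```javascript
// Hypothetical sketch: reconcile one local registry entry against live state.
// `registry` is a Map of sandbox name -> metadata; `liveNames` is the set of
// sandboxes the gateway currently reports. Both shapes are assumptions.
function reconcileSandbox(registry, liveNames, name, { gatewayReachable = true } = {}) {
  if (!registry.has(name)) {
    return { state: "unknown", message: `No local entry for '${name}'.` };
  }
  if (!gatewayReachable) {
    // Never prune while the gateway cannot be queried: the sandbox may be fine.
    return { state: "gateway-unreachable", message: "Gateway is not reachable; registry left untouched." };
  }
  if (liveNames.has(name)) {
    return { state: "live", message: `Sandbox '${name}' is live.` };
  }
  // The gateway answered and does not know this sandbox: the entry is stale.
  registry.delete(name);
  return {
    state: "stale-removed",
    message: `Removed stale local registry entry for '${name}'. Re-run onboard to recreate it.`,
  };
}

const registry = new Map([["alpha", {}], ["beta", {}]]);
const live = new Set(["alpha"]);
console.log(reconcileSandbox(registry, live, "alpha").state);
console.log(reconcileSandbox(registry, live, "beta").state);
console.log([...registry.keys()]);
```

The `gatewayReachable` guard reflects the PR's stated goal of no destructive auto-recovery of healthy sandboxes: an entry is pruned only when the gateway positively reports the sandbox as gone.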

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🐰
I sniffed the gateways, found one strong,
Reused the path and hopped along.
Two sandboxes, both intact,
No stale traces left to track,
Onboard sings a lively song!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 19.05% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix: improve gateway lifecycle recovery' directly aligns with the main objective of the PR, which is to improve how the system handles gateway lifecycle recovery and reuse. |
| Linked Issues check | ✅ Passed | The PR comprehensively addresses all coding objectives from issue `#908`: preserves healthy gateways via isGatewayHealthy() and startGatewayWithOptions(), reconciles live sandbox state via getSandboxGatewayState() and ensureLiveSandboxOrExit(), classifies failure modes in sandboxStatus(), and extends test coverage for gateway reuse and recovery flows. |
| Out of Scope Changes check | ✅ Passed | All changes are in-scope: gateway health detection, lifecycle classification, recovery flows, async sandbox operations, test coverage expansion, and supporting infrastructure like clean-staged-tree.sh align with stated objectives. |

github-actions · 2026-03-26T02:06:38Z

🚀 Docs preview ready!

https://NVIDIA.github.io/NemoClaw/pr-preview/pr-953/

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

bin/lib/onboard.js (1)

1487-1488: Centralize this staged-tree scrubber.

The same .venv / .pytest_cache / __pycache__ cleanup now lives here and in scripts/setup.sh, so the next artifact tweak has to land in two places. A shared helper/script would keep the legacy setup path aligned with onboard.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1487 - 1488, Duplicate cleanup commands for
removing .venv, .pytest_cache and __pycache__ are present in onboard.js
(run(...) calls) and scripts/setup.sh; extract them into a single centralized
"staged-tree scrubber" (either a small shell script like
scripts/clean-staged-tree.sh or a JS helper exported from
bin/lib/cleanStagedTree.js) and replace the inline run(...) invocations in
onboard.js and the matching snippet in scripts/setup.sh to call that central
helper. Ensure the central helper performs the same operations (rm -rf for .venv
and .pytest_cache and find ... -name __pycache__ -prune -exec rm -rf {} +) and
that onboard.js's call uses the existing run(...) wrapper semantics (preserve {
ignoreError: true } behavior) or that scripts/setup.sh invokes the script with
identical error-tolerant behavior.
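
For the JS-helper variant of this suggestion, a centralized scrubber might look like the sketch below. It is an assumption-laden illustration: the function name `cleanStagedTree`, the choice to remove `.venv` and `.pytest_cache` only at the root of the tree, and the module layout are hypothetical, and the PR itself ultimately added a shell script (`scripts/clean-staged-tree.sh`) instead.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper: remove Python build artifacts from a staged tree.
// Assumes .venv and .pytest_cache live at the root, while __pycache__ can
// appear at any depth (the find ... -prune step described above).
const TOP_LEVEL_ARTIFACTS = [".venv", ".pytest_cache"];

function cleanStagedTree(root) {
  for (const name of TOP_LEVEL_ARTIFACTS) {
    // force: true keeps this error-tolerant when the path is absent.
    fs.rmSync(path.join(root, name), { recursive: true, force: true });
  }
  removePycache(root);
}

function removePycache(dir) {
  let entries;
  try {
    entries = fs.readdirSync(dir, { withFileTypes: true });
  } catch {
    return; // missing or unreadable directory: nothing to clean
  }
  for (const entry of entries) {
    if (!entry.isDirectory()) continue; // symlinks are not followed
    const full = path.join(dir, entry.name);
    if (entry.name === "__pycache__") {
      fs.rmSync(full, { recursive: true, force: true }); // prune, do not descend
    } else {
      removePycache(full);
    }
  }
}

module.exports = { cleanStagedTree };
```

Swallowing missing-path errors mirrors the `{ ignoreError: true }` semantics the comment asks to preserve, so a clean tree is a no-op rather than a failure.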

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/nemoclaw.js`:
- Around line 100-120: The connected check in getNamedGatewayLifecycleState uses
/Connected/i which also matches "Disconnected"; update the connected detection
to match the exact status token/line in cleanStatus (e.g. replace the
/Connected/i.test(cleanStatus) usage with a more specific regex such as one that
anchors to the status line like /^Status:\s*Connected\b/im.test(cleanStatus) or
a whole-word match for "Connected" so "Disconnected" does not match), leaving
the rest of the function logic unchanged and still using the cleanStatus
variable.

In `@test/e2e/test-double-onboard.sh`:
- Around line 411-418: The current acceptance regex for gateway_status_output
incorrectly treats "Removed stale local registry entry" as a valid gateway-stop
outcome; modify the grep pattern in the gateway status check to remove that
phrase so only lifecycle/recovery messages ("Recovered NemoClaw gateway
runtime", "gateway is no longer configured after restart/rebuild", "gateway is
still refusing connections after restart", "gateway trust material rotated after
restart") count as passes, and keep the existing pass/fail behavior
(pass(...)/fail(...)) around gateway_status_output. After this check, add an
explicit assertion that the registry still contains SANDBOX_B by calling the
registry query used elsewhere in the test (the same registry listing helper that
references SANDBOX_B) and fail the test if SANDBOX_B is missing; reference
gateway_status_output, SANDBOX_B, and the pass/fail helpers when implementing
these changes.
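
The `/Connected/i` finding above is easy to reproduce in isolation. The snippet below is a small demonstration, assuming a `Status: <state>` line format for illustration; the real OpenShell status output may be shaped differently.

```javascript
// Assumed status output format ("Status: <state>"); the real OpenShell
// output may differ.
const samples = ["Status: Connected", "Status: Disconnected"];

const loose = /Connected/i;                   // substring match
const wholeWord = /\bConnected\b/i;           // whole-word match
const anchored = /^Status:\s*Connected\b/im;  // anchored to the status line

for (const line of samples) {
  console.log(line, loose.test(line), wholeWord.test(line), anchored.test(line));
}
// Status: Connected true true true
// Status: Disconnected true false false
```

The anchored form is the stricter of the two fixes, since it also ignores the word "Connected" appearing on unrelated lines, but it depends on the status label staying stable.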

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 17d70415-acbc-418a-b174-fa1a4f01e638

📥 Commits

Reviewing files that changed from the base of the PR and between 0acb2e4 and 70fabaa.

📒 Files selected for processing (7)

bin/lib/onboard.js
bin/nemoclaw.js
scripts/setup.sh
test/cli.test.js
test/e2e/test-double-onboard.sh
test/gateway-cleanup.test.js
test/onboard.test.js

coderabbitai

🧹 Nitpick comments (1)

bin/lib/onboard.js (1)

188-203: Consider using regex for ANSI stripping.

The manual character-by-character approach works but is harder to maintain, and building the result through repeated string concatenation can degrade toward O(n²) on engines that do not optimize it. The existing isSandboxReady function at line 171 already uses a regex for ANSI stripping.

♻️ Suggested simplification using regex

 function stripAnsi(value = "") {
-  let cleaned = "";
-  for (let i = 0; i < value.length; i += 1) {
-    if (value.charCodeAt(i) === 27 && value[i + 1] === "[") {
-      i += 2;
-      while (i < value.length && /[0-9;]/.test(value[i])) {
-        i += 1;
-      }
-      if (value[i] === "m") {
-        continue;
-      }
-    }
-    cleaned += value[i] || "";
-  }
-  return cleaned;
+  // eslint-disable-next-line no-control-regex
+  return value.replace(/\x1b\[[0-9;]*m/g, "");
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 188 - 203, The stripAnsi function currently
iterates character-by-character and builds a string which is harder to maintain
and can be O(n²); replace its implementation with a single regex-based removal
(same approach used in isSandboxReady) to strip ANSI escape sequences and return
value.replace(ANSI_REGEX, '') where ANSI_REGEX matches CSI/escape sequences;
update stripAnsi to use that regex and ensure it defaults value to an empty
string as before.
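
As a standalone illustration of the regex approach, the snippet below uses the same pattern as the suggested diff. One point worth noting, which is an observation about the pattern rather than a statement about the PR's final code: it strips only SGR (colour/style) sequences ending in `m`, so other CSI sequences such as erase-line pass through unchanged.

```javascript
// Regex-based SGR (colour/style) stripping, using the pattern from the
// suggested diff. It removes only sequences of the form ESC [ <digits ;> m.
function stripAnsi(value = "") {
  // eslint-disable-next-line no-control-regex
  return value.replace(/\x1b\[[0-9;]*m/g, "");
}

const colored = "\x1b[1;32mConnected\x1b[0m";
console.log(JSON.stringify(stripAnsi(colored))); // "Connected"

// Non-SGR control sequences (here: erase line) are left as they are.
console.log(stripAnsi("\x1b[2K") === "\x1b[2K"); // true
```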

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0b3e9fc7-003d-423c-b21f-81102d9fc253

📥 Commits

Reviewing files that changed from the base of the PR and between e560a17 and fdb20b9.

📒 Files selected for processing (1)

bin/lib/onboard.js

* fix: improve gateway lifecycle recovery (NVIDIA#953)
* fix: improve gateway lifecycle recovery
* docs: fix readme markdown list spacing
* fix: tighten gateway lifecycle review follow-ups
* fix: simplify tokenized control ui output
* fix: restore chat route in control ui urls
* refactor: simplify ansi stripping in onboard
* fix: shorten control ui url output
* fix: move control ui below cli next steps
* fix: swap hard/soft ulimit settings in start script (NVIDIA#951)

Fixes NVIDIA#949

Co-authored-by: KJ <kejones@nvidia.com>

---------

Co-authored-by: KJ <kejones@nvidia.com>
Co-authored-by: Emily Wilkins <80470879+epwilkins@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

@brandonpelfrey

* fix: improve gateway lifecycle recovery (NVIDIA#953) * fix: improve gateway lifecycle recovery * docs: fix readme markdown list spacing * fix: tighten gateway lifecycle review follow-ups * fix: simplify tokenized control ui output * fix: restore chat route in control ui urls * refactor: simplify ansi stripping in onboard * fix: shorten control ui url output * fix: move control ui below cli next steps * fix: swap hard/soft ulimit settings in start script (NVIDIA#951) Fixes NVIDIA#949 Co-authored-by: KJ <kejones@nvidia.com> * chore: add cyclomatic complexity lint rule (NVIDIA#875) * chore: add cyclomatic complexity rule (ratchet from 95) Add ESLint complexity rule to bin/ and scripts/ to prevent new functions from accumulating excessive branching. Starting threshold is 95 (current worst offender: setupNim in onboard.js). Ratchet plan: 95 → 40 → 25 → 15. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: ratchet complexity to 20, suppress existing violations Suppress 6 functions that exceed the threshold with eslint-disable comments so we can start enforcing at 20 instead of 95: - setupNim (95), setupPolicies (41), setupInference (22) in onboard.js - deploy (22), main IIFE (27) in nemoclaw.js - applyPreset (24) in policies.js Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: suppress complexity for 3 missed functions preflight (23), getReconciledSandboxGatewayState (25), sandboxStatus (27) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add host-side config and state file locations to README (NVIDIA#903) Signed-off-by: peteryuqin <peter.yuqin@gmail.com> * chore: add tsconfig.cli.json, root execa, TS coverage ratchet (NVIDIA#913) * chore: add tsconfig.cli.json, root execa, TS coverage ratchet Foundation for the CLI TypeScript migration (PR 0 of the shell consolidation plan). No runtime changes — config, tooling, and dependency only. 
- tsconfig.cli.json: strict TS type-checking for bin/ and scripts/ (noEmit, module: preserve — tsx handles the runtime) - scripts/check-coverage-ratchet.ts: pure TS replacement for the bash+python coverage ratchet script (same logic, same tolerance) - execa ^9.6.1 added to root devDependencies (used by PR 1+) - pr.yaml: coverage ratchet step now runs the TS version via tsx - .pre-commit-config.yaml: SPDX headers cover scripts/*.ts, new tsc-check-cli pre-push hook - CONTRIBUTING.md: document typecheck:cli task and CLI pre-push hook - Delete scripts/check-coverage-ratchet.sh Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Apply suggestion from @brandonpelfrey * chore: address PR feedback — use types_or, add tsx devDep - Use `types_or: [ts, tsx]` instead of file glob for tsc-check-cli hook per @brandonpelfrey's suggestion. - Add `tsx` to devDependencies so CI doesn't re-fetch it on every run per CodeRabbit's suggestion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(ci): ignore GitHub "Apply suggestion" commits in commitlint * fix(ci): lint only PR title since repo is squash-merge only Reverts the commitlint ignores rule from the previous commit and instead removes the per-commit lint step entirely. Individual commit messages are discarded at merge time — only the squash-merged PR title lands in main and drives changelog generation. Drop the per-commit lint, keep the PR title check, and remove the now-unnecessary fetch-depth: 0. * Revert "fix(ci): lint only PR title since repo is squash-merge only" This reverts commit 1257a47. * Revert "fix(ci): ignore GitHub "Apply suggestion" commits in commitlint" This reverts commit c395657. 
* docs: fix markdownlint MD032 in README (blank line before list) * refactor: make coverage ratchet script idiomatic TypeScript - Wrap in main() with process.exitCode instead of scattered process.exit() - Replace mutable flags with .map()/.some() over typed MetricResult[] - Separate pure logic (checkMetrics) from formatting (formatReport) - Throw with { cause } chaining instead of exit-in-helpers - Derive CoverageThresholds from METRICS tuple (single source of truth) - Exhaustive switch on CheckStatus discriminated union * refactor: remove duplication in coverage ratchet script - Drop STATUS_LABELS map; inline labels in exhaustive switch - Extract common 'metric coverage is N%' preamble in formatResult - Simplify ratchetedThresholds: use results directly (already in METRICS order) instead of re-scanning with .find() per metric - Compute 'failed' once in main, pass into formatReport to avoid duplicate .some() scan * refactor: simplify coverage ratchet with FP patterns - Extract classify() as a named pure function (replaces nested ternary) - loadJSON takes repo-relative paths, eliminating THRESHOLD_PATH and SUMMARY_PATH constants (DRY the join-with-REPO_ROOT pattern) - Drop CoverageMetric/CoverageSummary interfaces (only pct is read); use structural type at the call site instead - Inline ratchetedThresholds (one-liner, used once) - formatReport derives fail/improved from results instead of taking a pre-computed boolean (let functions derive from data, don't thread derived state) - sections.join("\n\n") replaces manual empty-string pushing - Shorter type names (Thresholds, Status, Result) — no ambiguity in a single-purpose script * refactor: strip coverage ratchet to failure-only output prek hides output from commands that exit 0, so ok/improved reporting was dead code. Remove Status, Result, classify, formatResult, formatReport, and the ratcheted-thresholds suggestion block. The script now just filters for regressions and prints actionable errors on failure. 
--------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Brandon Pelfrey <bpelfrey@nvidia.com> * fix: use CONNECT tunnel for WebSocket endpoints in Discord/Slack presets (NVIDIA#438) * fix: use CONNECT tunnel for WebSocket endpoints in Discord/Slack presets The egress proxy's HTTP idle timeout (~2 min) kills long-lived WebSocket connections when endpoints are configured with protocol:rest + tls:terminate. Switch WebSocket endpoints to access:full (CONNECT tunnel) which bypasses HTTP-level timeouts entirely. Discord: - gateway.discord.gg → access:full (WebSocket gateway) - Add PUT/PATCH/DELETE methods for discord.com (message editing, reactions) - Add media.discordapp.net for attachment access Slack: - Add wss-primary.slack.com and wss-backup.slack.com → access:full (Socket Mode WebSocket endpoints) Partially addresses NVIDIA#409 — the policy-level fix enables WebSocket connections to survive. The hardcoded 2-min timeout in openshell-sandbox still affects any protocol:rest endpoints with long-lived connections. Related: NVIDIA#361 (WhatsApp Web, same root cause) * fix: correct comment wording for media endpoint and YAML formatting * fix: standardize Node.js minimum version to 22.16 (NVIDIA#840) * fix: remove unused RECOMMENDED_NODE_MAJOR from scripts/install.sh Shellcheck flagged it as unused after the min/recommended merge. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: enforce full semver >=22.16.0 in installer scripts The runtime checks only compared the major Node.js version, allowing 22.0–22.15 to pass despite package.json requiring >=22.16.0. Use the version_gte() helper for full semver comparison in both installers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: harden version_gte and align fallback message Guard version_gte() against prerelease suffixes (e.g. "22.16.0-rc.1") that would crash bash arithmetic. 
Also update the manual-install fallback message to reference MIN_NODE_VERSION instead of hardcoded "22". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: update test stubs for Node.js 22.16 minimum and add Node 20 rejection test - Bump node stub in 'succeeds with acceptable Node.js' from v20.0.0 to v22.16.0 - Bump node stub in buildCurlPipeEnv from v22.14.0 to v22.16.0 - Add new test asserting Node.js 20 is rejected by ensure_supported_runtime --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Signed-off-by: peteryuqin <peter.yuqin@gmail.com> Co-authored-by: KJ <kejones@nvidia.com> Co-authored-by: Emily Wilkins <80470879+epwilkins@users.noreply.github.com> Co-authored-by: Carlos Villela <cvillela@nvidia.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Peter <peter.yuqin@gmail.com> Co-authored-by: Brandon Pelfrey <bpelfrey@nvidia.com> Co-authored-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

@brandonpelfrey

* fix: improve gateway lifecycle recovery (NVIDIA#953) * fix: improve gateway lifecycle recovery * docs: fix readme markdown list spacing * fix: tighten gateway lifecycle review follow-ups * fix: simplify tokenized control ui output * fix: restore chat route in control ui urls * refactor: simplify ansi stripping in onboard * fix: shorten control ui url output * fix: move control ui below cli next steps * fix: swap hard/soft ulimit settings in start script (NVIDIA#951) Fixes NVIDIA#949 Co-authored-by: KJ <kejones@nvidia.com> * chore: add cyclomatic complexity lint rule (NVIDIA#875) * chore: add cyclomatic complexity rule (ratchet from 95) Add ESLint complexity rule to bin/ and scripts/ to prevent new functions from accumulating excessive branching. Starting threshold is 95 (current worst offender: setupNim in onboard.js). Ratchet plan: 95 → 40 → 25 → 15. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: ratchet complexity to 20, suppress existing violations Suppress 6 functions that exceed the threshold with eslint-disable comments so we can start enforcing at 20 instead of 95: - setupNim (95), setupPolicies (41), setupInference (22) in onboard.js - deploy (22), main IIFE (27) in nemoclaw.js - applyPreset (24) in policies.js Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: suppress complexity for 3 missed functions preflight (23), getReconciledSandboxGatewayState (25), sandboxStatus (27) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add host-side config and state file locations to README (NVIDIA#903) Signed-off-by: peteryuqin <peter.yuqin@gmail.com> * chore: add tsconfig.cli.json, root execa, TS coverage ratchet (NVIDIA#913) * chore: add tsconfig.cli.json, root execa, TS coverage ratchet Foundation for the CLI TypeScript migration (PR 0 of the shell consolidation plan). No runtime changes — config, tooling, and dependency only. 
- tsconfig.cli.json: strict TS type-checking for bin/ and scripts/ (noEmit, module: preserve — tsx handles the runtime) - scripts/check-coverage-ratchet.ts: pure TS replacement for the bash+python coverage ratchet script (same logic, same tolerance) - execa ^9.6.1 added to root devDependencies (used by PR 1+) - pr.yaml: coverage ratchet step now runs the TS version via tsx - .pre-commit-config.yaml: SPDX headers cover scripts/*.ts, new tsc-check-cli pre-push hook - CONTRIBUTING.md: document typecheck:cli task and CLI pre-push hook - Delete scripts/check-coverage-ratchet.sh Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Apply suggestion from @brandonpelfrey * chore: address PR feedback — use types_or, add tsx devDep - Use `types_or: [ts, tsx]` instead of file glob for tsc-check-cli hook per @brandonpelfrey's suggestion. - Add `tsx` to devDependencies so CI doesn't re-fetch it on every run per CodeRabbit's suggestion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(ci): ignore GitHub "Apply suggestion" commits in commitlint * fix(ci): lint only PR title since repo is squash-merge only Reverts the commitlint ignores rule from the previous commit and instead removes the per-commit lint step entirely. Individual commit messages are discarded at merge time — only the squash-merged PR title lands in main and drives changelog generation. Drop the per-commit lint, keep the PR title check, and remove the now-unnecessary fetch-depth: 0. * Revert "fix(ci): lint only PR title since repo is squash-merge only" This reverts commit 1257a47. * Revert "fix(ci): ignore GitHub "Apply suggestion" commits in commitlint" This reverts commit c395657. 
* docs: fix markdownlint MD032 in README (blank line before list)

* refactor: make coverage ratchet script idiomatic TypeScript

  - Wrap in main() with process.exitCode instead of scattered process.exit()
  - Replace mutable flags with .map()/.some() over typed MetricResult[]
  - Separate pure logic (checkMetrics) from formatting (formatReport)
  - Throw with { cause } chaining instead of exit-in-helpers
  - Derive CoverageThresholds from METRICS tuple (single source of truth)
  - Exhaustive switch on CheckStatus discriminated union

* refactor: remove duplication in coverage ratchet script

  - Drop STATUS_LABELS map; inline labels in exhaustive switch
  - Extract common 'metric coverage is N%' preamble in formatResult
  - Simplify ratchetedThresholds: use results directly (already in METRICS order) instead of re-scanning with .find() per metric
  - Compute 'failed' once in main, pass into formatReport to avoid duplicate .some() scan

* refactor: simplify coverage ratchet with FP patterns

  - Extract classify() as a named pure function (replaces nested ternary)
  - loadJSON takes repo-relative paths, eliminating THRESHOLD_PATH and SUMMARY_PATH constants (DRY the join-with-REPO_ROOT pattern)
  - Drop CoverageMetric/CoverageSummary interfaces (only pct is read); use structural type at the call site instead
  - Inline ratchetedThresholds (one-liner, used once)
  - formatReport derives fail/improved from results instead of taking a pre-computed boolean (let functions derive from data, don't thread derived state)
  - sections.join("\n\n") replaces manual empty-string pushing
  - Shorter type names (Thresholds, Status, Result); no ambiguity in a single-purpose script

* refactor: strip coverage ratchet to failure-only output

  prek hides output from commands that exit 0, so ok/improved reporting was dead code. Remove Status, Result, classify, formatResult, formatReport, and the ratcheted-thresholds suggestion block. The script now just filters for regressions and prints actionable errors on failure.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Brandon Pelfrey <bpelfrey@nvidia.com>

* fix: use CONNECT tunnel for WebSocket endpoints in Discord/Slack presets (NVIDIA#438)

* fix: use CONNECT tunnel for WebSocket endpoints in Discord/Slack presets

  The egress proxy's HTTP idle timeout (~2 min) kills long-lived WebSocket connections when endpoints are configured with protocol:rest + tls:terminate. Switch WebSocket endpoints to access:full (CONNECT tunnel) which bypasses HTTP-level timeouts entirely.

  Discord:
  - gateway.discord.gg → access:full (WebSocket gateway)
  - Add PUT/PATCH/DELETE methods for discord.com (message editing, reactions)
  - Add media.discordapp.net for attachment access

  Slack:
  - Add wss-primary.slack.com and wss-backup.slack.com → access:full (Socket Mode WebSocket endpoints)

  Partially addresses NVIDIA#409: the policy-level fix enables WebSocket connections to survive. The hardcoded 2-min timeout in openshell-sandbox still affects any protocol:rest endpoints with long-lived connections.

  Related: NVIDIA#361 (WhatsApp Web, same root cause)

* fix: correct comment wording for media endpoint and YAML formatting

* fix: standardize Node.js minimum version to 22.16 (NVIDIA#840)

* fix: remove unused RECOMMENDED_NODE_MAJOR from scripts/install.sh

  Shellcheck flagged it as unused after the min/recommended merge.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: enforce full semver >=22.16.0 in installer scripts

  The runtime checks only compared the major Node.js version, allowing 22.0–22.15 to pass despite package.json requiring >=22.16.0. Use the version_gte() helper for full semver comparison in both installers.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: harden version_gte and align fallback message

  Guard version_gte() against prerelease suffixes (e.g. "22.16.0-rc.1") that would crash bash arithmetic. Also update the manual-install fallback message to reference MIN_NODE_VERSION instead of hardcoded "22".

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: update test stubs for Node.js 22.16 minimum and add Node 20 rejection test

  - Bump node stub in 'succeeds with acceptable Node.js' from v20.0.0 to v22.16.0
  - Bump node stub in buildCurlPipeEnv from v22.14.0 to v22.16.0
  - Add new test asserting Node.js 20 is rejected by ensure_supported_runtime

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: harden installer and onboard resiliency (NVIDIA#961)
* fix: harden installer and onboard resiliency
* fix: address installer and debug review follow-ups
* fix: harden onboard resume across later setup steps
* test: simplify payload extraction in onboard tests
* test: keep onboard payload extraction target-compatible
* chore: align onboard session lint with complexity rule
* fix: harden onboard session safety and lock handling
* fix: tighten onboard session redaction and metadata handling

* fix(security): strip credentials from migration snapshots and enforce blueprint digest (NVIDIA#769)

  Reconciles NVIDIA#156 and NVIDIA#743 into a single comprehensive solution:
  - Filter auth-profiles.json at copy time via cpSync filter (from NVIDIA#743)
  - Recursive stripCredentials() with pattern-based field detection for deep config sanitization (from NVIDIA#156: CREDENTIAL_FIELDS set + CREDENTIAL_FIELD_PATTERN regex)
  - Remove gateway config section (contains auth tokens) from sandbox openclaw.json
  - Blueprint digest verification (SHA-256): recorded at snapshot time, validated on restore, empty/missing digest is a hard failure
  - computeFileDigest() throws when blueprint file is missing instead of silently returning null
  - Sanitize both snapshot-level and sandbox-bundle openclaw.json copies
  - Backward compatible: old snapshots without blueprintDigest skip validation
  - Bump SNAPSHOT_VERSION 2 → 3

  Supersedes NVIDIA#156 and NVIDIA#743.

* fix(sandbox): export proxy env vars with full NO_PROXY and persist across reconnects (NVIDIA#1025)

* fix(sandbox): export proxy env vars with full NO_PROXY and persist across reconnects

  OpenShell injects NO_PROXY=127.0.0.1,localhost,::1 into the sandbox, missing inference.local and the gateway IP (10.200.0.1). This causes LLM inference requests to route through the egress proxy instead of going direct, and the proxy gateway IP itself gets proxied.

  Add proxy configuration block to nemoclaw-start.sh that:
  - Exports HTTP_PROXY, HTTPS_PROXY, and NO_PROXY with inference.local and the gateway IP included
  - Persists via /etc/profile.d/nemoclaw-proxy.sh (root) or ~/.profile (non-root fallback) so values survive OpenShell reconnect injection
  - Supports NEMOCLAW_PROXY_HOST / NEMOCLAW_PROXY_PORT overrides

  The non-root fallback ensures the fix works in environments like Brev where containers run without root privileges.

  Tested on DGX Spark (ARM64) and Brev VM (x86_64). Verified NO_PROXY contains inference.local and 10.200.0.1 inside the live sandbox after connect.

  Ref: NVIDIA#626, NVIDIA#704
  Ref: NVIDIA#704 (comment)

* fix(sandbox): write proxy config to ~/.bashrc for interactive reconnect sessions

  OpenShell's `sandbox connect` spawns `/bin/bash -i` (interactive, non-login), which sources ~/.bashrc, not ~/.profile or /etc/profile.d/*. The previous approach wrote to ~/.profile and /etc/profile.d/, neither of which is sourced by `bash -i`, so the narrow OpenShell-injected NO_PROXY persisted in live interactive sessions.

  Changes:
  - Write proxy snippet to ~/.bashrc (primary) and ~/.profile (login fallback)
  - Export both uppercase and lowercase proxy variants (NO_PROXY + no_proxy, HTTP_PROXY + http_proxy, etc.); Node.js undici prefers lowercase no_proxy over uppercase NO_PROXY when both are set
  - Add idempotency guard to prevent duplicate blocks on container restart
  - Update tests: verify .bashrc writing, idempotency, bash -i override behavior, and lowercase variant correctness

  Tested on DGX Spark (ARM64) and Brev VM (x86_64) with full destroy + re-onboard + live `env | grep proxy` verification inside the sandbox shell via `openshell sandbox connect`.

  Ref: NVIDIA#626

* fix(sandbox): replace stale proxy values on restart with begin/end markers

  Use begin/end markers in .bashrc/.profile proxy snippet so _write_proxy_snippet replaces the block when PROXY_HOST/PORT change instead of silently keeping stale values. Adds test coverage for the replacement path.

  Addresses CodeRabbit review feedback on idempotency gap.

* fix(sandbox): resolve sandbox user home dynamically when running as root

  When the entrypoint runs as root, $HOME is /root, so the proxy snippet was written to /root/.bashrc instead of the sandbox user's home. Use getent passwd to look up the sandbox user's home when running as UID 0; fall back to /sandbox if the user entry is missing.

  Addresses CodeRabbit review feedback on _SANDBOX_HOME resolution.
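The begin/end marker approach described in the proxy commits above can be illustrated with a small sketch. This is not the code from nemoclaw-start.sh (which is a shell script and is not shown in this log); the marker strings, the function names `renderBlock` and `upsertBlock`, and the port value in the usage note are all hypothetical. It only shows the general idea: find a managed block by its markers and replace it in place, so that a changed proxy host or port does not leave stale values behind and a repeated write does not duplicate the block.

```typescript
// Hypothetical marker strings; the real markers used by the entrypoint may differ.
const BEGIN = "# >>> proxy-config begin >>>";
const END = "# <<< proxy-config end <<<";

/** Render the managed block for a given proxy host and port. */
function renderBlock(host: string, port: number): string {
  const url = `http://${host}:${port}`;
  return [
    BEGIN,
    // Both uppercase and lowercase variants, as the commit log describes.
    `export HTTP_PROXY="${url}"`,
    `export http_proxy="${url}"`,
    `export HTTPS_PROXY="${url}"`,
    `export https_proxy="${url}"`,
    END,
  ].join("\n");
}

/** Replace an existing managed block, or append one if none is present. */
function upsertBlock(rcContent: string, block: string): string {
  const start = rcContent.indexOf(BEGIN);
  if (start !== -1) {
    const end = rcContent.indexOf(END, start + BEGIN.length);
    if (end !== -1) {
      // Keep everything outside the markers untouched.
      return rcContent.slice(0, start) + block + rcContent.slice(end + END.length);
    }
  }
  // No complete block found: append, making sure it starts on its own line.
  const sep = rcContent === "" || rcContent.endsWith("\n") ? "" : "\n";
  return rcContent + sep + block + "\n";
}
```

Calling `upsertBlock(rc, renderBlock("10.200.0.1", 3128))` twice in a row leaves the content unchanged after the first call, and calling it again with a different host or port swaps the values inside the markers. The port 3128 is an assumed placeholder, not a value taken from the PR.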
---------

Co-authored-by: Carlos Villela <cvillela@nvidia.com>

* fix(policies): preset application for versionless policies (Fixes NVIDIA#35) (NVIDIA#101)

* fix(policies): allow preset application for versionless policies (Fixes NVIDIA#35)

  Fixes NVIDIA#35

  Signed-off-by: Deepak Jain <deepujain@gmail.com>

* fix: remove stale complexity suppression in policies

---------

Signed-off-by: Deepak Jain <deepujain@gmail.com>
Co-authored-by: Kevin Jones <kejones@nvidia.com>

* fix: restore routed inference and connect UX (NVIDIA#1037)
* fix: restore routed inference and connect UX
* fix: simplify detected local inference hint
* fix: remove stale local inference hint
* test: relax connect forward assertion

---------

Signed-off-by: peteryuqin <peter.yuqin@gmail.com>
Signed-off-by: Deepak Jain <deepujain@gmail.com>
Co-authored-by: KJ <kejones@nvidia.com>
Co-authored-by: Emily Wilkins <80470879+epwilkins@users.noreply.github.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Peter <peter.yuqin@gmail.com>
Co-authored-by: Brandon Pelfrey <bpelfrey@nvidia.com>
Co-authored-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>
Co-authored-by: Lucas Wang <lucas_wang@lucas-futures.com>
Co-authored-by: senthilr-nv <senthilr@nvidia.com>
Co-authored-by: Deepak Jain <deepujain@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
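The Node.js version commits in the log above describe moving from a major-only check to a full comparison against 22.16.0, with a guard for prerelease suffixes such as "22.16.0-rc.1". The installer's own version_gte() is a bash helper and is not reproduced here; the TypeScript below is only a sketch of the same idea, and the names `parseVersion` and `versionGte` are hypothetical. It simplifies by dropping any prerelease or build suffix before comparing, so "22.16.0-rc.1" is treated as equal to "22.16.0", which is looser than strict semver precedence.

```typescript
/** Parse "v22.16.0-rc.1" into [22, 16, 0]; missing or non-numeric parts become 0. */
function parseVersion(version: string): [number, number, number] {
  // Drop a leading "v" and anything after "-" (prerelease) or "+" (build metadata).
  const core = version.trim().replace(/^v/, "").split(/[-+]/)[0];
  const parts = core
    .split(".")
    .map((p) => Number.parseInt(p, 10))
    .map((n) => (Number.isNaN(n) ? 0 : n));
  const [major = 0, minor = 0, patch = 0] = parts;
  return [major, minor, patch];
}

/** Return true when `actual` is greater than or equal to `minimum`. */
function versionGte(actual: string, minimum: string): boolean {
  const a = parseVersion(actual);
  const b = parseVersion(minimum);
  for (let i = 0; i < 3; i++) {
    // The first differing component decides the result.
    if (a[i] !== b[i]) return a[i] > b[i];
  }
  return true; // all three components are equal
}
```

With this sketch, `versionGte("v22.15.9", "22.16.0")` is false and `versionGte("v22.16.0", "22.16.0")` is true, which is the distinction a major-only check would miss.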

* fix: improve gateway lifecycle recovery
* docs: fix readme markdown list spacing
* fix: tighten gateway lifecycle review follow-ups
* fix: simplify tokenized control ui output
* fix: restore chat route in control ui urls
* refactor: simplify ansi stripping in onboard
* fix: shorten control ui url output
* fix: move control ui below cli next steps

fix: improve gateway lifecycle recovery

70fabaa

kjw3 mentioned this pull request Mar 26, 2026

fix: improve gateway lifecycle recovery #952

Closed

cv approved these changes Mar 26, 2026

docs: fix readme markdown list spacing

c46e937

coderabbitai bot reviewed Mar 26, 2026

Comment thread bin/nemoclaw.js

Comment thread test/e2e/test-double-onboard.sh

kjw3 added 2 commits March 25, 2026 22:29

fix: tighten gateway lifecycle review follow-ups

e560a17

fix: simplify tokenized control ui output

fdb20b9

coderabbitai bot reviewed Mar 26, 2026

kjw3 added 5 commits March 25, 2026 22:46

fix: restore chat route in control ui urls

ef9cb72

refactor: simplify ansi stripping in onboard

ad7f6ab

fix: shorten control ui url output

c224b96

Merge branch 'main' into fix/gateway-lifecycle-recovery-final

6e32240

fix: move control ui below cli next steps

fa72dc5

kjw3 merged commit 6ae809a into main Mar 26, 2026
8 checks passed

kjw3 deleted the fix/gateway-lifecycle-recovery-final branch March 26, 2026 03:23

coderabbitai bot mentioned this pull request Mar 30, 2026

Fix/exclude venv from build context #1110

Closed

18 tasks

senthilr-nv mentioned this pull request Mar 31, 2026

Regression: nemoclaw logs --follow broken again (was fixed in #424) #1146

Closed

This was referenced Apr 2, 2026

fix: add policy preset for brew #1292

Closed

fix(security): use providers for messaging credential injection #1081

Merged

fix(security): bundle sandbox, Telegram, and update hardening #1416

Open

kjw3 merged 9 commits into main from fix/gateway-lifecycle-recovery-final

kjw3 commented Mar 26, 2026 (edited by coderabbitai bot)

coderabbitai bot commented Mar 26, 2026 (edited)

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

❌ Failed checks (1 warning)

github-actions bot commented Mar 26, 2026

coderabbitai bot left a comment

coderabbitai bot left a comment

2 participants
