fix: improve gateway lifecycle recovery by kjw3 · Pull Request #908 · NVIDIA/NemoClaw

kjw3 · 2026-03-25T19:31:52Z

Summary

- preserve a healthy shared nemoclaw gateway across repeat onboarding
- reconcile live OpenShell sandbox state during connect and status instead of trusting stale local registry entries
- classify restart/rebuild lifecycle failures so users get deterministic guidance instead of generic transport errors
- extend double-onboard coverage so creating a second sandbox does not break the first

Issues

- addresses #849
- addresses part of #859
- improves #888 by distinguishing:
  - gateway trust rotation / handshake failure
  - gateway metadata exists but the restarted API still refuses connections
  - gateway rebuilt and the old sandbox no longer exists
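
The three cases above could be told apart by a small classifier over captured `openshell` output. The following is a minimal sketch, not the PR's implementation: the function name `classifyGatewayFailure` and the returned labels are hypothetical, and the `handshake` pattern is an assumption. The "Connection refused" and "Missing gateway auth token" patterns are the ones quoted in the review comments further down this thread.

```javascript
// Hypothetical classifier: the name, labels, and handshake pattern are assumptions.
function classifyGatewayFailure(output = "") {
  const text = String(output);
  // The sandbox no longer exists, for example after a destructive gateway rebuild.
  if (/NotFound/i.test(text)) return "sandbox_missing";
  // Trust rotation shows up as a handshake failure (assumed wording).
  if (/handshake/i.test(text)) return "trust_rotation";
  // Gateway metadata exists but the restarted API refuses connections.
  if (/Connection refused|transport error/i.test(text)) return "gateway_unreachable";
  // Identity drift between the local client and the gateway.
  if (/Missing gateway auth token|device identity required/i.test(text)) return "identity_drift";
  return "unknown";
}

console.log(classifyGatewayFailure("Error: Connection refused"));
console.log(classifyGatewayFailure("status: NotFound, sandbox missing"));
```

Checking the most specific condition first keeps the guidance deterministic when a single error message happens to match more than one pattern.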

Security

- no secret persistence added
- no TLS downgrade or bypass added
- no destructive auto-recovery of healthy sandboxes
- no new shell-out paths beyond resolved openshell wrappers

Validation

npx vitest run test/cli.test.js test/onboard.test.js test/onboard-readiness.test.js test/registry.test.js
npx eslint bin/nemoclaw.js bin/lib/onboard.js test/cli.test.js test/onboard.test.js
npx tsc -p jsconfig.json --noEmit
bash -n test/e2e/test-double-onboard.sh
shellcheck test/e2e/test-double-onboard.sh

Brev CPU Validation

Environment:

brev-cpu
instance: kj-nemoclaw-cpu-20260325-155447
branch commit tested: 43cf8eb

Validated on a real disposable Linux host:

- onboard sandbox A / onboard sandbox B path preserves the shared gateway and keeps the first sandbox reachable
- after openshell gateway stop + openshell gateway start --name nemoclaw, NemoClaw now surfaces a precise post-restart classification instead of a generic transport failure
- after destructive gateway rebuild, NemoClaw removes the stale local sandbox entry when the old sandbox is gone
- rerunning onboard after that rebuild recreates the sandbox cleanly and returns to Ready

Residual

This PR does not make OpenShell gateway restarts durable. It makes the failure modes explicit, safer, and easier to recover from.
On the tested Brev CPU host, I did not find a safe non-destructive recovery once the restarted gateway API entered the persistent Connection refused state. That remains a gateway/runtime limitation, not something this PR tries to bypass.

Summary by CodeRabbit

New Features
- Onboard/startup now detect and reuse a healthy shared gateway, preserving existing sandboxes and speeding repeated setups.
- Added a recovery-aware startup mode that tries recovery without forcing process exit.
Bug Fixes
- Preflight no longer fails the port 8080 availability check when an active gateway is present and being reused.
- Clearer status/connect messaging and more predictable handling of stale registry entries; guidance updated for unrecoverable gateway states.
Tests
- Expanded unit, CLI, integration, and e2e tests covering gateway reuse, recovery flows, logging, and registry behaviors.

coderabbitai · 2026-03-25T19:31:59Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.


Walkthrough

Adds gateway health-check and reuse logic to onboarding, implements sandbox↔gateway reconciliation using OpenShell (capture, recover, select/start), updates CLI sandbox commands to be gateway-aware, and expands unit, integration, and e2e tests to verify reuse, recovery, and stale-registry handling.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Gateway Health & Onboard `bin/lib/onboard.js` | Added `isGatewayHealthy()`; preflight now detects/reuses healthy NemoClaw gateways and skips port 8080 availability failure when reusing; refactored gateway startup into `startGatewayWithOptions(exitOnFailure)`, `startGatewayForRecovery()`, and `startGateway()`; exported new helpers. |
| OpenShell Integration & Reconciliation `bin/nemoclaw.js` | Added OpenShell resolution/execution helpers (`resolveOpenshell`, `runOpenshell`, `captureOpenshell`) and ANSI/output parsing; introduced sandbox↔gateway reconciliation and recovery flows (`getReconciledSandboxGatewayState`, `ensureLiveSandboxOrExit`, `recoverNamedGatewayRuntime`, etc.); updated sandbox commands to be gateway-aware and use argument-array invocation for openshell calls. |
| Test Harness & CLI Tests `test/cli.test.js` | Added `runWithEnv()` wrapper with per-invocation env/timeouts and unique HOME handling; added integration tests that stub `openshell` to exercise NotFound stale-registry removal, gateway transport/identity drift messaging, successful recovery flows, and unrecoverable guidance messaging. |
| Unit & Integration Tests for Onboard `test/onboard.test.js` | Imported `isGatewayHealthy()`; added unit tests for various `openshell status` + gateway-info permutations and an integration-style test ensuring a healthy gateway is selected (no destroy/start). |
| End-to-End Test Update `test/e2e/test-double-onboard.sh` | Updated e2e expectations to assert reuse of existing NemoClaw gateway on repeated `onboard` runs; added assertion ensuring the first sandbox persists after creating a second (regression `#849`). |

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Onboard
    participant OpenShell
    participant Gateway
    User->>Onboard: run `nemoclaw onboard`
    Onboard->>OpenShell: capture `openshell status`
    OpenShell-->>Onboard: status output
    Onboard->>OpenShell: capture gateway info
    OpenShell-->>Onboard: gateway metadata
    alt gateway healthy & connected
        Onboard->>Gateway: select existing gateway (set OPENSHELL_GATEWAY)
        Gateway-->>Onboard: selected
        Onboard-->>User: reuse gateway, skip destroy/start
    else gateway stale or unhealthy
        Onboard->>OpenShell: destroy stale gateway (if stale)
        OpenShell-->>Onboard: destroyed
        Onboard->>OpenShell: start gateway
        OpenShell-->>Onboard: started
        Onboard->>OpenShell: verify gateway health
        OpenShell-->>Onboard: health result
        Onboard-->>User: complete (or fail)
    end
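
The reuse-or-restart branch in the diagram above can be sketched as a pure decision function. This is an illustration under stated assumptions: `planGatewayPreflight` and its step labels are hypothetical, `GATEWAY_NAME` is assumed to be `"nemoclaw"`, and the two predicates mirror the string checks quoted in the `isGatewayHealthy` review comment later in this thread.

```javascript
// Assumed constant: the PR text refers to a gateway named "nemoclaw".
const GATEWAY_NAME = "nemoclaw";

// True when the gateway-info output mentions the NemoClaw gateway.
function isNemoClawGateway(gwInfoOutput = "") {
  return typeof gwInfoOutput === "string" && gwInfoOutput.includes(GATEWAY_NAME);
}

// Healthy means: status reports "Connected" and the gateway is NemoClaw-owned.
function isGatewayHealthy(statusOutput = "", gwInfoOutput = "") {
  const connected = typeof statusOutput === "string" && statusOutput.includes("Connected");
  return connected && isNemoClawGateway(gwInfoOutput);
}

// Hypothetical planner: returns the steps onboarding would take, without running them.
function planGatewayPreflight(statusOutput, gwInfoOutput) {
  if (isGatewayHealthy(statusOutput, gwInfoOutput)) {
    return { action: "reuse", steps: ["select"] };
  }
  const stale = isNemoClawGateway(gwInfoOutput);
  const steps = stale ? ["destroy", "start", "verify"] : ["start", "verify"];
  return { action: stale ? "replace" : "start", steps };
}

console.log(planGatewayPreflight("Status: Connected", "Gateway: nemoclaw"));
console.log(planGatewayPreflight("Status: Disconnected", "Gateway: nemoclaw"));
```

Keeping the decision separate from the `openshell` calls makes each permutation of status and gateway-info output easy to unit test.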

sequenceDiagram
    actor User
    participant CLI as NemoClaw CLI
    participant Reconciler
    participant OpenShell
    participant Registry
    User->>CLI: `sandbox status` / `sandbox connect`
    CLI->>Reconciler: ensureLiveSandboxOrExit(sandbox)
    Reconciler->>OpenShell: `openshell sandbox get` / `gateway` queries
    alt sandbox present & gateway healthy
        OpenShell-->>Reconciler: sandbox + gateway ok
        Reconciler-->>CLI: healthy state (may select gateway)
    else gateway transport error or identity drift
        OpenShell-->>Reconciler: transport/identity error
        Reconciler->>OpenShell: attempt select/start recovery (startGatewayForRecovery)
        alt recovery succeeds
            OpenShell-->>Reconciler: sandbox available after recovery
            Reconciler-->>CLI: recovered state (message)
        else recovery fails
            Reconciler-->>CLI: failure with guidance
        end
    else sandbox missing (NotFound)
        OpenShell-->>Reconciler: NotFound
        Reconciler->>Registry: remove stale `sandboxes.json` entry
        Registry-->>Reconciler: removed
        Reconciler-->>CLI: missing state (message)
    end
    CLI-->>User: status/connect result and messaging
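
The reconciliation flow in the diagram above can be sketched with injected dependencies so it runs without `openshell`. Everything here is an assumption made for illustration: the name `reconcileSandbox`, the `{ ok, error }` result shape, the state labels, and the `Map` standing in for the `sandboxes.json` registry.

```javascript
// Hypothetical reconciler: dependencies are injected so the flow is testable in isolation.
function reconcileSandbox(name, { getSandbox, recoverGateway, registry }) {
  const first = getSandbox(name);
  if (first.ok) return { state: "present" };

  if (first.error === "NotFound") {
    // The live gateway says the sandbox is gone: drop the stale local entry.
    registry.delete(name);
    return { state: "missing" };
  }

  // Transport or identity error: attempt one recovery, then re-check.
  if (recoverGateway()) {
    const second = getSandbox(name);
    if (second.ok) return { state: "recovered" };
    if (second.error === "NotFound") {
      registry.delete(name);
      return { state: "missing" };
    }
  }
  return { state: "unrecoverable" };
}

// Usage with stubs: the first lookup fails with a transport error, the second succeeds.
const registry = new Map([["demo", {}]]);
const replies = [{ ok: false, error: "transport" }, { ok: true }];
const outcome = reconcileSandbox("demo", {
  getSandbox: () => replies.shift(),
  recoverGateway: () => true,
  registry,
});
console.log(outcome.state, registry.has("demo"));
```

Note that the registry entry is only removed on an explicit NotFound, never on a transport error, which matches the PR's stated goal of avoiding destructive recovery of healthy sandboxes.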

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I sniff the gateway, soft and bright,
I check the status in the moonlit night.
If "Connected" sings, I hop inside with glee,
No needless destroy — I leave it be.
Reuse, recover, gentle rabbit decree.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix: improve gateway lifecycle recovery' accurately summarizes the main change: enhancing how NemoClaw handles OpenShell gateway and sandbox lifecycle recovery across repeated operations. |



coderabbitai

🧹 Nitpick comments (1)

bin/nemoclaw.js (1)
115-137: Consider delegating to onboard's startGateway to avoid divergence.

recoverNamedGatewayRuntime re-implements gateway startup logic (gateway start --name nemoclaw) without the health-check retry loop, version pinning, or CoreDNS patching present in bin/lib/onboard.js:startGateway. If startGateway is updated (e.g., new startup flags, different health verification), this function won't inherit those changes.

Since startGateway is now exported, consider importing and delegating to it for the actual gateway start operation, or at minimum extract the shared gateway-start logic into a reusable helper.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/nemoclaw.js` around lines 115 - 137, The recoverNamedGatewayRuntime
function duplicates gateway startup logic; import and call the exported
startGateway from bin/lib/onboard.js (or extract common logic into a shared
helper) instead of directly running runOpenshell(["gateway","start",...]) so you
inherit health-check retries, version pinning and CoreDNS patches; update
recoverNamedGatewayRuntime to call startGateway when needed, then re-check
getNamedGatewayLifecycleState() and set process.env.OPENSHELL_GATEWAY =
"nemoclaw" on success, preserving the existing return shape
(recovered/before/after/attempted/via).
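
One way the suggested delegation could look is sketched below. This is an assumption-heavy illustration, not the PR's code: the lifecycle check and the start function are injected as parameters (in the real module they would presumably be `getNamedGatewayLifecycleState()` and one of the helpers exported from `bin/lib/onboard.js`), the `"healthy"` state label is hypothetical, and the meaning given to each field of the return shape is a guess based only on the field names in the comment above.

```javascript
// Hypothetical sketch: start logic is delegated to an injected startGateway function.
function recoverNamedGatewayRuntime({ getLifecycleState, startGateway, env = process.env }) {
  const before = getLifecycleState();
  if (before === "healthy") {
    // Nothing to recover: just make sure the named gateway is selected.
    env.OPENSHELL_GATEWAY = "nemoclaw";
    return { recovered: true, before, after: before, attempted: false, via: null };
  }

  try {
    // Delegating here is what inherits retries and patches from the shared start path.
    startGateway();
  } catch (err) {
    // A failed start is not fatal; the re-check below decides the outcome.
  }

  const after = getLifecycleState();
  const recovered = after === "healthy";
  if (recovered) env.OPENSHELL_GATEWAY = "nemoclaw";
  return { recovered, before, after, attempted: true, via: "startGateway" };
}

// Usage with stubs: the gateway is unreachable, then healthy after the start call.
const states = ["unreachable", "healthy"];
const fakeEnv = {};
const result = recoverNamedGatewayRuntime({
  getLifecycleState: () => states.shift(),
  startGateway: () => {},
  env: fakeEnv,
});
console.log(result, fakeEnv.OPENSHELL_GATEWAY);
```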


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d1106af7-673d-44bd-9650-57b992f3829d

📥 Commits

Reviewing files that changed from the base of the PR and between 0ef5dd2 and 43cf8eb.

📒 Files selected for processing (5)

bin/lib/onboard.js
bin/nemoclaw.js
test/cli.test.js
test/e2e/test-double-onboard.sh
test/onboard.test.js

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

bin/lib/onboard.js (1)

188-191: Consider clarifying the semantic reuse of hasStaleGateway.

The isGatewayHealthy function correctly checks that the gateway is both connected and NemoClaw-owned. However, reusing hasStaleGateway here is semantically confusing—the name implies checking for staleness, but in this context it verifies ownership.

Consider extracting a helper with clearer naming:

♻️ Optional refactor for clarity

+function isNemoClawGateway(gwInfoOutput) {
+  return typeof gwInfoOutput === "string" && gwInfoOutput.length > 0 && gwInfoOutput.includes(GATEWAY_NAME);
+}
+
 function hasStaleGateway(gwInfoOutput) {
-  return typeof gwInfoOutput === "string" && gwInfoOutput.length > 0 && gwInfoOutput.includes(GATEWAY_NAME);
+  return isNemoClawGateway(gwInfoOutput);
 }

 function isGatewayHealthy(statusOutput = "", gwInfoOutput = "") {
   const connected = typeof statusOutput === "string" && statusOutput.includes("Connected");
-  return connected && hasStaleGateway(gwInfoOutput);
+  return connected && isNemoClawGateway(gwInfoOutput);
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 188 - 191, isGatewayHealthy currently uses
hasStaleGateway which is misleading because that name implies checking staleness
rather than ownership; introduce a new helper (e.g., isGatewayOwned or
isNemoClawOwned) that inspects gwInfoOutput for NemoClaw/ownership and use that
in isGatewayHealthy (connected && isGatewayOwned(gwInfoOutput)); keep
hasStaleGateway for actual staleness checks elsewhere so behavior doesn't
change, and ensure the new helper accepts the same gwInfoOutput parameter as
hasStaleGateway.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/nemoclaw.js`:
- Around line 623-627: The sandboxLogs function currently pushes the wrong flag
for streaming logs: replace the call that pushes "--tail" (inside sandboxLogs,
where args.push("--tail") is used) with "--follow" so the follow parameter
enables streaming; ensure the code uses args.push("--follow") when follow is
truthy so it matches the CLI help and the caller's expected behavior.


ℹ️ Review info


📥 Commits

Reviewing files that changed from the base of the PR and between 43cf8eb and 24878fc.

📒 Files selected for processing (2)

bin/lib/onboard.js
bin/nemoclaw.js

coderabbitai

♻️ Duplicate comments (1)

bin/nemoclaw.js (1)
635-639: ⚠️ Potential issue | 🟡 Minor

Use --follow instead of --tail to enable log streaming.

The function parameter is follow but the code passes --tail. The --tail flag is typically for showing the last N lines, whereas --follow enables streaming. This matches the help text on line 717 which documents logs [--follow].
🐛 Proposed fix
 function sandboxLogs(sandboxName, follow) {
   const args = ["logs", sandboxName];
-  if (follow) args.push("--tail");
+  if (follow) args.push("--follow");
   runOpenshell(args);
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/nemoclaw.js` around lines 635 - 639, The sandboxLogs function currently
pushes the wrong flag ("--tail") when the parameter is named follow; change the
flag to "--follow" so log streaming works as intended: update sandboxLogs (which
builds args and calls runOpenshell) to push "--follow" when follow is truthy and
remove/replace "--tail".


ℹ️ Review info


📥 Commits

Reviewing files that changed from the base of the PR and between 24878fc and d47a7e9.

📒 Files selected for processing (3)

bin/lib/onboard.js
bin/nemoclaw.js
test/cli.test.js

🚧 Files skipped from review as they are similar to previous changes (1)

test/cli.test.js

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

bin/nemoclaw.js (2)
97-97: Unused ignoreError option in captureOpenshell calls.

captureOpenshell() (lines 73-84) doesn't use the ignoreError option—it always captures output without throwing. The option here is a no-op and may mislead future maintainers into thinking it affects behavior.

Consider removing { ignoreError: true } from captureOpenshell calls, or document that captureOpenshell intentionally ignores errors by design.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/nemoclaw.js` at line 97, The call to captureOpenshell in the variable
gatewayInfo passes a no-op option { ignoreError: true } which is not consumed by
the function; remove the unused option to avoid misleading callers or update
captureOpenshell to honor ignoreError. Locate the captureOpenshell references
(e.g., the call that assigns gatewayInfo) and either (A) remove the second
parameter from that call (and other similar calls) so it becomes
captureOpenshell(["gateway", "info", "-g", "nemoclaw"]) or (B) implement
handling of an ignoreError flag inside the captureOpenshell function so it
conditionally suppresses exceptions—pick one approach and make all calls and the
function implementation consistent. Ensure gatewayInfo and other callers compile
and behavior remains unchanged after the change.
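
Option (B) above, a capture helper that actually honors `ignoreError`, might look like the following. This is a sketch under assumptions: `captureCommand` is a hypothetical name, the `bin` parameter stands in for the resolved `openshell` path, and the `{ status, output }` return shape is inferred from the `.output` access in the proposed refactor below rather than taken from the PR's code. The demo runs the Node binary itself so it works without `openshell` installed.

```javascript
const { spawnSync } = require("node:child_process");

// Hypothetical helper: throws on failure unless ignoreError is set.
function captureCommand(bin, args, { ignoreError = false } = {}) {
  const result = spawnSync(bin, args, {
    encoding: "utf8",
    stdio: ["ignore", "pipe", "pipe"],
  });
  // status is null when the process could not be spawned or was killed by a signal.
  const status = result.status ?? 1;
  const output = `${result.stdout || ""}${result.stderr || ""}`.trim();
  if (!ignoreError && (result.error || status !== 0)) {
    throw new Error(`${bin} ${args.join(" ")} exited with status ${status}`);
  }
  return { status, output };
}

// Usage: a failing command is tolerated only when the caller opts in.
const tolerated = captureCommand(process.execPath, ["-e", "process.exit(3)"], { ignoreError: true });
console.log(tolerated.status);
```

With this shape the option is no longer a no-op: callers that omit it get an exception, and callers that pass it get the status code to inspect.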
568-570: Consider using captureOpenshell for consistency.

This is the only place still using the legacy _runCapture with a shell command string. Consider refactoring to use captureOpenshell(["inference", "get"]) for consistency with the new array-based approach and to avoid shell parsing.
♻️ Proposed refactor
   const live = parseGatewayInference(
-    _runCapture("openshell inference get 2>/dev/null", { ignoreError: true })
+    captureOpenshell(["inference", "get"], { ignoreError: true }).output
   );
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/nemoclaw.js` around lines 568 - 570, Replace the legacy _runCapture call
with the new array-based captureOpenshell to avoid shell parsing: change the
argument passed to parseGatewayInference from _runCapture("openshell inference
get 2>/dev/null", { ignoreError: true }) to captureOpenshell(["inference",
"get"], { ignoreError: true }); keep parseGatewayInference(...) unchanged, and
ensure captureOpenshell behavior preserves the previous ignoreError semantics
(and suppresses or ignores stderr similarly to the removed "2>/dev/null"
redirection).
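
To illustrate why the array form avoids shell parsing, and how the `2>/dev/null` redirection can be reproduced without a shell, here is a minimal sketch. The helper name `captureStdout` is hypothetical; `spawnSync` and its `stdio` option are standard Node.js `child_process` APIs. The demo uses the Node binary in place of `openshell`.

```javascript
const { spawnSync } = require("node:child_process");

// Hypothetical helper: runs a binary with an argument array and discards stderr,
// which is the effect "2>/dev/null" had in the shell-string version.
function captureStdout(bin, args) {
  const result = spawnSync(bin, args, {
    encoding: "utf8",
    stdio: ["ignore", "pipe", "ignore"],
  });
  return (result.stdout || "").trim();
}

// The last argument contains shell metacharacters. Because no shell is involved,
// it reaches the child process as a single literal argument.
const echoed = captureStdout(process.execPath, [
  "-e",
  "console.error('noise'); console.log(process.argv[1])",
  "a b; echo x",
]);
console.log(echoed);
```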

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/nemoclaw.js`:
- Line 577: The console.log line that prints policies uses 6 spaces of
indentation and should be aligned with surrounding 4-space indenting; locate the
line containing console.log(`    Policies: ${(sb.policies || []).join(", ") ||
"none"}`) and change its indentation to 4 spaces so it matches the surrounding
block (keep the string content unchanged, only adjust leading whitespace).
- Around line 181-186: The conditional that detects the handshake error uses the
raw variable output instead of the ANSI-stripped cleanOutput, causing the regex
to miss matches when ANSI codes are present; update the check to test
cleanOutput (i.e., change the regex condition to use cleanOutput) so it matches
consistently with the other checks (see the writer calls and the surrounding
handshake/error handling logic) and ensure cleanOutput is defined before this
conditional.
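
The reason to match against `cleanOutput` can be shown with a short example. The `stripAnsi` body below is a simplified stand-in that only removes CSI sequences such as color codes; it is an assumption, not the implementation in this PR or in `onboard.js`. The regex is the one quoted in the review of lines 187-195.

```javascript
// Simplified stripper (assumption): removes CSI escape sequences like "\x1b[31m".
function stripAnsi(value = "") {
  return String(value).replace(/\x1b\[[0-9;]*[A-Za-z]/g, "");
}

const unreachable = /Connection refused|transport error/i;

// Colorized output: an escape sequence sits between "Connection " and "refused".
const output = "Connection \x1b[31mrefused\x1b[0m";
const cleanOutput = stripAnsi(output);

console.log(unreachable.test(output));      // false: the escape code breaks the match
console.log(unreachable.test(cleanOutput)); // true: matches once the codes are stripped
```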


ℹ️ Review info


📥 Commits

Reviewing files that changed from the base of the PR and between d47a7e9 and 5868c85.

📒 Files selected for processing (2)

bin/nemoclaw.js
test/cli.test.js

🚧 Files skipped from review as they are similar to previous changes (1)

test/cli.test.js

…into fix/gateway-lifecycle-recovery

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/nemoclaw.js`:
- Around line 187-195: Replace uses of the raw output variable with the
ANSI-stripped cleanOutput in the error-matching branches so the regex checks are
robust to ANSI escape codes: change the two conditions testing /Connection
refused|transport error/i.test(output) and /Missing gateway auth token|device
identity required/i.test(output) to use cleanOutput instead, keeping the
existing writer(...) calls and sandboxName reference unchanged.

ℹ️ Review info


📥 Commits

Reviewing files that changed from the base of the PR and between 5868c85 and a698469.

📒 Files selected for processing (1)

bin/nemoclaw.js

…into fix/gateway-lifecycle-recovery

patelspratik · 2026-03-26T00:10:08Z

+  };
+}
+
+function stripAnsi(value = "") {


why a different stripAnsi than the one in onboard?

# Conflicts: # bin/lib/onboard.js # test/e2e-gateway-isolation.sh

…into fix/gateway-lifecycle-recovery

# Conflicts: # test/e2e-gateway-isolation.sh

kjw3 · 2026-03-26T01:51:58Z

Closing this in favor of #952. This branch had to be rewritten onto signed commits to satisfy verified-signature requirements, so the signed replacement PR now carries the same work forward.

kjw3 added 3 commits March 25, 2026 15:08

fix: preserve healthy gateway across sandbox lifecycle

40d7ad7

fix: reconcile live sandbox state during connect

f443e49

test: cover gateway reuse across double onboard

a7f74f3

kjw3 added 4 commits March 25, 2026 15:38

fix: classify gateway trust rotation on reconnect

8af243d

fix: classify unreachable gateway after restart

76f5ea7

fix: detect unreachable restarted gateway from status

5322096

fix: distinguish missing gateway after rebuild

43cf8eb

kjw3 marked this pull request as ready for review March 25, 2026 22:04