fix(onboard): retry gateway start with exponential backoff by ericksoa · Pull Request #1051 · NVIDIA/NemoClaw

ericksoa · 2026-03-28T17:47:06Z

Summary

- Retry openshell gateway start up to 3 times with exponential backoff during onboard, using p-retry
- Each failed attempt gets a clean destroyGateway() (including Docker volume cleanup) before retry
- Recovery path (nemoclaw status) remains single-attempt to keep CLI responsiveness
- Final failure message now includes openshell doctor troubleshooting commands

Root Cause

On first-run environments (Horde VMs, fresh Ubuntu 24.04), the embedded k3s inside the OpenShell gateway can exceed the gateway's internal health-check timeout during initialization. The first attempt fails, but a second attempt typically succeeds because container images are cached and cgroup state is cleaner. Previously, NemoClaw made a single attempt and gave up immediately.

Upstream tracking: NVIDIA/OpenShell#433

Changes

bin/lib/onboard.js — Replace single-shot gateway start with p-retry (3 attempts, exponential backoff). Only the onboard path retries; the recovery path (startGatewayForRecovery) stays single-attempt.

package.json — Add p-retry@^4.6.2 as a direct dependency (v4 is CJS-compatible).

test/gateway-cleanup.test.js — Update pattern-matching test to reflect that destroyGateway() now lives inside the retry callback.

Testing

- npm test: all unit tests pass, including cli.test.js (recovery path timing preserved)
- Manual: on macOS the retry path is not exercised (gateway starts on first attempt), confirming no regression for the happy path

Summary by CodeRabbit

Improvements
- Startup now automatically retries with exponential backoff, performs repeated health checks before declaring success, logs progress during attempts, and prints a single clear "Gateway is healthy" on success.
- On final failure, messaging is consolidated with concise troubleshooting guidance; transient cleanup runs between retry attempts.
Chores
- Added a retry helper dependency to improve startup reliability.
Tests
- Updated gateway startup tests to reflect the new retry and cleanup behavior.

On some hosts (Horde VMs, first-run environments), the embedded k3s inside the OpenShell gateway needs more time to initialize than the gateway's internal health-check window allows. The first attempt fails with misleading orphaned-cgroup cleanup messages, but a second attempt typically succeeds because container images are cached and cgroup state is cleaner. Replace the single-shot gateway start + separate health check with a retry loop (up to 3 attempts, 10s/30s backoff). Each failed attempt gets a clean destroyGateway() before retry. On final failure, the error message now includes openshell doctor troubleshooting commands. Upstream tracking: NVIDIA/OpenShell#433
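To make the 10s/30s schedule concrete, here is a minimal sketch of the arithmetic, assuming the usual exponential rule (delay = minTimeout × factor^n, no jitter). `backoffDelays` is a hypothetical helper for illustration only, not code from this PR; the 10 s base and factor of 3 are the values quoted in this thread.

```javascript
// Hypothetical helper (not from the PR): waits between attempts under
// exponential backoff without jitter, i.e. minTimeoutMs * factor^n.
function backoffDelays(attempts, minTimeoutMs, factor) {
  const delays = [];
  // N attempts leave N - 1 gaps to wait in.
  for (let n = 0; n < attempts - 1; n += 1) {
    delays.push(minTimeoutMs * factor ** n);
  }
  return delays;
}

const delays = backoffDelays(3, 10000, 3);
console.log(delays); // [ 10000, 30000 ]
```

Under these assumptions a fully failing onboard run spends 40 s in backoff, on top of the time each attempt itself takes.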

coderabbitai · 2026-03-28T17:47:17Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 00d1cf46-7f9a-4441-bcba-88a29591c9fb

📥 Commits

Reviewing files that changed from the base of the PR and between 9a68f51 and 32a845f.

📒 Files selected for processing (1)

bin/lib/onboard.js

📝 Walkthrough

Walkthrough

Wraps gateway start in a p-retry loop with exponential backoff. Each attempt runs openshell gateway start (ignoring errors), performs up to 5 external health checks (2s apart), invokes destroyGateway() on failed attempts when configured, and centralizes final failure reporting. Adds the p-retry dependency and updates a test.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Onboard runtime logic `bin/lib/onboard.js` | Replace single-start + fixed health-check loop with a `p-retry`-wrapped retry flow. Compute retry count from `exitOnFailure`. Each attempt runs `openshell gateway start` (`ignoreError:true`), polls health up to 5 times (2s apart), throws on health failure to trigger retry, and uses `onFailedAttempt` to log progress and call `destroyGateway()` when configured. Consolidates final success/failure messaging; removes prior per-iteration "Stale state removed... please rerun" behavior. |
| Tests `test/gateway-cleanup.test.js` | Update assertions to reflect retry-based behavior: no longer assert strict invocation counts of `destroyGateway()`; instead check that `startGatewayWithOptions` contains `destroyGateway()` usage and adjust comment to note retry-loop cleanup behavior. |
| Dependencies / manifest `package.json` | Add runtime dependency `p-retry@^4.6.2`. Minor devDependency reordering/formatting and small formatting tweak to `dependencies.openclaw` entry. |

Sequence Diagram(s)

sequenceDiagram
  participant CLI as "Nemoclaw CLI"
  participant OS_CLI as "OpenShell CLI\n(openshell gateway start)"
  participant Gateway as "Gateway Container\n(k3s)"

  CLI->>OS_CLI: run "openshell gateway start" (ignoreError:true)
  OS_CLI->>Gateway: start container & internal health checks
  alt start returns success
    OS_CLI-->>CLI: start returned 0
    CLI->>CLI: perform up to 5 external health samples (2s interval)
    alt isGatewayHealthy == true
      CLI-->>Gateway: confirmed healthy
    else not healthy after samples
      CLI->>CLI: throw to trigger retry
    end
  else start returned non-zero / immediate failure
    OS_CLI-->>CLI: non-success
    CLI->>CLI: throw to trigger retry
  end
  Note over CLI: p-retry schedules next attempt (minTimeout:10s, factor:3) and calls onFailedAttempt which may run destroyGateway() before next attempt

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

"I hop, I nudge, I try once more,
Destroy and wait, then open the door.
Backoff counts the patient beats,
Until k3s hums and routes its fleets.
🐇✨"

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding retry logic with exponential backoff to the gateway start function, which directly addresses the linked issue. |
| Linked Issues check | ✅ Passed | The changes implement all core coding requirements from issue `#1050`: retry logic up to 3 times with exponential backoff [`#1050`], destroyGateway() called between retries [`#1050`], and improved diagnostics with openshell doctor commands [`#1050`]. |
| Out of Scope Changes check | ✅ Passed | All changes are in-scope: retry mechanism in onboard.js, p-retry dependency addition, and test assertion updates directly support the PR objectives with no extraneous modifications. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

The recovery path (nemoclaw status) should not retry — only the onboard wizard benefits from retries. Also restore the original behavior of not sleeping after the final health-check iteration, matching the timing the cli.test.js tests depend on.
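The timing point can be illustrated with a small polling loop. This is a sketch under stated assumptions, not the code in `bin/lib/onboard.js`: `pollGatewayHealth`, `isHealthy`, and `sleep` are hypothetical names, injected so the loop runs without a real gateway; the 5 samples and 2 s interval are the values mentioned in this thread.

```javascript
// Hypothetical polling helper: isHealthy and sleep are injected stand-ins.
function pollGatewayHealth(isHealthy, sleep, samples = 5, intervalSeconds = 2) {
  for (let i = 0; i < samples; i += 1) {
    if (isHealthy()) return true;
    // Pause between samples only; nothing follows the last one, so do not wait.
    if (i < samples - 1) sleep(intervalSeconds);
  }
  return false;
}

// A gateway that never becomes healthy costs 4 sleeps (8 s), not 5 (10 s).
let sleptSeconds = 0;
const healthy = pollGatewayHealth(() => false, (s) => { sleptSeconds += s; });
console.log(healthy, sleptSeconds); // false 8
```

Skipping the trailing sleep keeps the failure path 2 s shorter, which is the kind of difference a timing-sensitive test would notice.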

Replace hand-rolled retry loop with p-retry library for gateway start backoff. Cleaner, more conventional, and delegates retry/backoff concerns to a well-tested library.

…ctly)

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 1623-1642: The current recovery path always calls destroyGateway()
on a single failure even when retries is 0, which destroys gateway volumes
needed for diagnostics; change the cleanup so destroyGateway() is only invoked
when teardown is intended (e.g., when exitOnFailure is true or retries > 0 /
another retry will occur). Locate the pRetry callback where retries is defined
and the throw new Error("Gateway failed to start") branch, and wrap or gate the
destroyGateway() call so it runs only for the onboarding/teardown path
(exitOnFailure === true or retries > 0) rather than unconditionally during the
recovery/startGatewayForRecovery flow.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 1d6e04d and dfdef21.

📒 Files selected for processing (1)

bin/lib/onboard.js

The recovery path (exitOnFailure=false) should preserve gateway state for diagnostics. Gate destroyGateway() behind exitOnFailure so only the onboard path (which will retry) tears down on failure.
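One way to picture the gating is a single flag that drives both the retry budget and the teardown decision. The sketch below is illustrative only: `gatewayStartPolicy` and its field names are hypothetical, and the value of 2 retries is an assumption derived from the "up to 3 attempts" figure in the PR description.

```javascript
// Hypothetical helper (names are illustrative, not from the PR): one flag
// decides both how often to retry and whether failed state is torn down.
function gatewayStartPolicy(exitOnFailure) {
  return {
    retries: exitOnFailure ? 2 : 0, // onboard: 3 attempts; recovery: 1 attempt
    destroyOnFailure: exitOnFailure, // recovery keeps state for diagnostics
  };
}

const onboardPolicy = gatewayStartPolicy(true);
const recoveryPolicy = gatewayStartPolicy(false);
console.log(onboardPolicy, recoveryPolicy);
```

Deriving both values from the same flag keeps the recovery path from ever destroying state it will not get a chance to rebuild.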

…emoClaw into fix/gateway-start-retry

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

bin/lib/onboard.js (1)
1640-1659: ⚠️ Potential issue | 🟠 Major

Keep the last failed gateway intact for the new doctor commands.

destroyGateway() runs on every failed onboarding attempt, including the terminal one, and Line 1655 still tells users the state was removed. That tears down the very gateway/container state the new openshell doctor logs --name nemoclaw guidance needs to inspect. Please only clean up when another retry will run, and update the final message accordingly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1640 - 1659, The current catch block always
calls destroyGateway() on failure and then prints that stale state was removed,
which destroys the gateway logs the new doctor commands need; change logic so
destroyGateway() is only called when another retry will run (e.g., call
destroyGateway() inside the retry handler or check err.attemptNumber < retries+1
before destroying), and in the final catch path do NOT call destroyGateway() and
update the final console.error text from "Stale state removed. Please rerun:
nemoclaw onboard" to a message that preserves state and directs users to run the
diagnostic commands (openshell doctor logs --name nemoclaw and openshell doctor
check) for inspection; keep references to destroyGateway(), the retry mechanism
(retries/minTimeout/factor/onFailedAttempt) and the catch block when making the
change.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 1626-1638: The code currently only runs the 5-attempt health-check
loop when startResult.status === 0, skipping the grace-period sampling for
slow-start cases where runOpenshell(["gateway","start", ...]) exits non-zero;
move or duplicate the for-loop so the runCaptureOpenshell
status/namedInfo/currentInfo checks with isGatewayHealthy(...) are executed
regardless of startResult.status (i.e., run the same 5-iteration retry loop
after calling runOpenshell), so a gateway that initially exits non-zero still
gets the extra sampling attempts to become healthy before teardown.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 1d6e04d and 9a3b9c7.

📒 Files selected for processing (1)

bin/lib/onboard.js

The slow-start case (where gateway start exits non-zero because OpenShell's internal health check timed out) is exactly the scenario this retry logic targets. Always run the grace-period health-check loop so a gateway that's still coming up gets sampled before teardown.
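A simulated slow start shows why the sampling should not depend on the exit code. This is a hedged sketch, not the PR's implementation: `attemptGatewayStart` and its injected `start`, `isHealthy`, and `sleep` functions are hypothetical stand-ins for the real OpenShell calls.

```javascript
// Hypothetical single attempt: start, isHealthy and sleep are injected stand-ins.
function attemptGatewayStart({ start, isHealthy, sleep }) {
  const startResult = start(); // a non-zero status is tolerated, not treated as final
  for (let i = 0; i < 5; i += 1) {
    if (isHealthy()) return { startStatus: startResult.status, samples: i + 1 };
    if (i < 4) sleep(2);
  }
  throw new Error("Gateway failed to start"); // lets a retry wrapper take over
}

// Simulated slow start: the start command exits 1, the third sample is healthy.
let probes = 0;
const result = attemptGatewayStart({
  start: () => ({ status: 1 }),
  isHealthy: () => { probes += 1; return probes >= 3; },
  sleep: () => {},
});
console.log(result); // { startStatus: 1, samples: 3 }
```

In this simulation the attempt succeeds despite the non-zero exit, so no teardown or retry is needed.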

coderabbitai

♻️ Duplicate comments (1)

bin/lib/onboard.js (1)

1628-1638: ⚠️ Potential issue | 🟠 Major

Run grace-period health checks regardless of exit status.

The startResult.status === 0 guard skips the health sampling when openshell gateway start exits non-zero—but the slow-start scenario this retry path targets is precisely when OpenShell's internal health-check times out (non-zero exit) while k3s is still initializing. The gateway may actually be healthy moments after that timeout, so destroying it immediately without sampling wastes a viable attempt.

Suggested fix

-      if (startResult.status === 0) {
-        for (let i = 0; i < 5; i++) {
-          const status = runCaptureOpenshell(["status"], { ignoreError: true });
-          const namedInfo = runCaptureOpenshell(["gateway", "info", "-g", GATEWAY_NAME], { ignoreError: true });
-          const currentInfo = runCaptureOpenshell(["gateway", "info"], { ignoreError: true });
-          if (isGatewayHealthy(status, namedInfo, currentInfo)) {
-            return; // success
-          }
-          if (i < 4) sleep(2);
-        }
+      for (let i = 0; i < 5; i += 1) {
+        const status = runCaptureOpenshell(["status"], { ignoreError: true });
+        const namedInfo = runCaptureOpenshell(["gateway", "info", "-g", GATEWAY_NAME], { ignoreError: true });
+        const currentInfo = runCaptureOpenshell(["gateway", "info"], { ignoreError: true });
+        if (isGatewayHealthy(status, namedInfo, currentInfo)) {
+          return; // success
+        }
+        if (i < 4) sleep(2);
       }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1628 - 1638, The health-sampling loop is
currently gated by startResult.status === 0 which skips retrying when "openshell
gateway start" returns non-zero; move or remove that guard so the for-loop that
calls runCaptureOpenshell(["status"]),
runCaptureOpenshell(["gateway","info","-g", GATEWAY_NAME]) and
runCaptureOpenshell(["gateway","info"]) and checks isGatewayHealthy(...) runs
regardless of startResult.status, keeping the same loop/backoff (sleep(2)) and
early return on success; ensure the existing behavior (return on
isGatewayHealthy success and the 5-attempt retry) is preserved but executed even
when startResult.status !== 0.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 1d6e04d and 9a3b9c7.

📒 Files selected for processing (1)

bin/lib/onboard.js

…nostics Move destroyGateway() from the pRetry callback into onFailedAttempt so it only runs between retries, not on the terminal failure. The final failed gateway state is now preserved so openshell doctor logs can inspect container logs for troubleshooting.
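The intended "clean up between retries, keep the last failure" behavior can be sketched with a guard on the remaining retry count. The helper below is a hand-rolled, synchronous stand-in and NOT the p-retry API; `retrySync` and `destroyGatewayStub` are hypothetical names, and backoff sleeps are omitted. It assumes the failure hook also fires for the final attempt, which is worth verifying against the p-retry version in use.

```javascript
// Hand-rolled, synchronous stand-in for a retry helper (NOT the p-retry API).
function retrySync(attempt, { retries, onFailedAttempt }) {
  for (let attemptNumber = 1; ; attemptNumber += 1) {
    try {
      return attempt(attemptNumber);
    } catch (error) {
      const retriesLeft = retries - (attemptNumber - 1);
      onFailedAttempt({ error, attemptNumber, retriesLeft });
      if (retriesLeft === 0) throw error; // terminal failure: give up
    }
  }
}

// Hypothetical cleanup spy standing in for destroyGateway().
let cleanups = 0;
const destroyGatewayStub = () => { cleanups += 1; };

let outcome = "healthy";
try {
  retrySync(
    () => { throw new Error("Gateway failed to start"); },
    {
      retries: 2,
      onFailedAttempt: ({ retriesLeft }) => {
        // Clean up only when another attempt will follow.
        if (retriesLeft > 0) destroyGatewayStub();
      },
    }
  );
} catch (error) {
  outcome = error.message;
}
console.log(cleanups, outcome); // prints: 2 Gateway failed to start
```

Three failed attempts trigger two cleanups, so the state left by the third attempt remains available for inspection.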

`npm install -g .` from a fresh clone creates a symlink (equivalent to `npm link`) and does NOT install dependencies. This causes `require('p-retry')` to fail at runtime since node_modules doesn't exist locally.
- Add dep bootstrap to `prepare` script: installs production deps if missing, which runs during `npm link` / `npm install -g .` but is a no-op during normal `npm install` (deps already present)
- Add `bundleDependencies` for tarball-based install paths (npm pack/publish)

Fixes regression from #1051.

…1112) * fix(install): ensure p-retry is available for npm install -g from source `npm install -g .` from a fresh clone creates a symlink (equivalent to `npm link`) and does NOT install dependencies. This causes `require('p-retry')` to fail at runtime since node_modules doesn't exist locally. - Add dep bootstrap to `prepare` script: installs production deps if missing, which runs during `npm link` / `npm install -g .` but is a no-op during normal `npm install` (deps already present) - Add `bundleDependencies` for tarball-based install paths (npm pack/publish) Fixes regression from #1051. * fix(install): always install production deps in prepare script Address review feedback: remove conditional node_modules/p-retry check and unconditionally run npm install --omit=dev --ignore-scripts. This is a no-op when deps are already present and ensures they resolve in the global install case. --------- Co-authored-by: Carlos Villela <cvillela@nvidia.com>

* fix(onboard): retry gateway start with exponential backoff

  On some hosts (Horde VMs, first-run environments), the embedded k3s inside the OpenShell gateway needs more time to initialize than the gateway's internal health-check window allows. The first attempt fails with misleading orphaned-cgroup cleanup messages, but a second attempt typically succeeds because container images are cached and cgroup state is cleaner. Replace the single-shot gateway start + separate health check with a retry loop (up to 3 attempts, 10s/30s backoff). Each failed attempt gets a clean destroyGateway() before retry. On final failure, the error message now includes openshell doctor troubleshooting commands. Upstream tracking: NVIDIA/OpenShell#433

* fix(onboard): skip retry for recovery path, preserve health-check timing

  The recovery path (nemoclaw status) should not retry; only the onboard wizard benefits from retries. Also restore the original behavior of not sleeping after the final health-check iteration, matching the timing the cli.test.js tests depend on.

* refactor(onboard): use p-retry for gateway start retry logic

  Replace hand-rolled retry loop with p-retry library for gateway start backoff. Cleaner, more conventional, and delegates retry/backoff concerns to a well-tested library.

* fix(onboard): correct openshell doctor flags in troubleshooting output

* fix(onboard): use --name flag for doctor logs (targets container directly)

* fix(onboard): skip gateway destroy on recovery path failure

  The recovery path (exitOnFailure=false) should preserve gateway state for diagnostics. Gate destroyGateway() behind exitOnFailure so only the onboard path (which will retry) tears down on failure.

* fix(onboard): run health checks regardless of gateway start exit code

  The slow-start case (where gateway start exits non-zero because OpenShell's internal health check timed out) is exactly the scenario this retry logic targets. Always run the grace-period health-check loop so a gateway that's still coming up gets sampled before teardown.

* fix(onboard): preserve gateway state on final failure for doctor diagnostics

  Move destroyGateway() from the pRetry callback into onFailedAttempt so it only runs between retries, not on the terminal failure. The final failed gateway state is now preserved so openshell doctor logs can inspect container logs for troubleshooting.

ericksoa added the bug label Mar 28, 2026

ericksoa self-assigned this Mar 28, 2026

ericksoa added Getting Started, OpenShell, fix and removed OpenShell labels Mar 28, 2026

ericksoa added 2 commits March 28, 2026 11:00

refactor(onboard): use p-retry for gateway start retry logic

4ab7ac4

Replace hand-rolled retry loop with p-retry library for gateway start backoff. Cleaner, more conventional, and delegates retry/backoff concerns to a well-tested library.

ericksoa requested a review from cv March 28, 2026 18:31

ericksoa added 2 commits March 28, 2026 11:37

fix(onboard): correct openshell doctor flags in troubleshooting output

49507a8

fix(onboard): use --name flag for doctor logs (targets container dire…

1d6e04d

…ctly)

BenediktSchackenberg mentioned this pull request Mar 28, 2026

fix(onboard): gateway start fails on first attempt due to k3s startup timeout #1050

Closed

ericksoa added 2 commits March 29, 2026 20:25

Merge remote-tracking branch 'origin/main' into fix/gateway-start-retry

5979776

Merge branch 'main' into fix/gateway-start-retry

dfdef21

coderabbitai bot reviewed Mar 30, 2026

Comment thread bin/lib/onboard.js

ericksoa added 3 commits March 29, 2026 20:46

Merge branch 'main' into fix/gateway-start-retry

5bef78a

fix(onboard): skip gateway destroy on recovery path failure

a07d37c

The recovery path (exitOnFailure=false) should preserve gateway state for diagnostics. Gate destroyGateway() behind exitOnFailure so only the onboard path (which will retry) tears down on failure.

Merge branch 'fix/gateway-start-retry' of https://github.com/NVIDIA/N…

9a3b9c7

…emoClaw into fix/gateway-start-retry

coderabbitai bot reviewed Mar 30, 2026

Comment thread bin/lib/onboard.js Outdated

coderabbitai bot reviewed Mar 30, 2026

cv approved these changes Mar 30, 2026

cv merged commit 97c889c into main Mar 30, 2026
9 checks passed

ericksoa mentioned this pull request Mar 30, 2026

fix(install): ensure p-retry resolves for npm install -g from source #1112

Merged

3 tasks

tommylin-signalpro mentioned this pull request Apr 7, 2026

fix(onboard): increase gateway health-poll window and preserve state on first retry #1325

Closed

3 tasks

cv merged 12 commits into main from fix/gateway-start-retry
