fix(onboard): retry gateway start with exponential backoff#1051

Merged
cv merged 12 commits into main from fix/gateway-start-retry
Mar 30, 2026
Conversation


@ericksoa ericksoa commented Mar 28, 2026

Summary

  • Retry openshell gateway start up to 3 times with exponential backoff during onboard, using p-retry
  • Each failed attempt gets a clean destroyGateway() (including Docker volume cleanup) before retry
  • Recovery path (nemoclaw status) remains single-attempt to keep CLI responsiveness
  • Final failure message now includes openshell doctor troubleshooting commands

Fixes #1050

Root Cause

On first-run environments (Horde VMs, fresh Ubuntu 24.04), the embedded k3s inside the OpenShell gateway can exceed the gateway's internal health-check timeout during initialization. The first attempt fails, but a second attempt typically succeeds because container images are cached and cgroup state is cleaner. Previously, NemoClaw made a single attempt and gave up immediately.

Upstream tracking: NVIDIA/OpenShell#433

Changes

bin/lib/onboard.js — Replace single-shot gateway start with p-retry (3 attempts, exponential backoff). Only the onboard path retries; the recovery path (startGatewayForRecovery) stays single-attempt.

package.json — Add p-retry@^4.6.2 as a direct dependency (v4 is CJS-compatible).

test/gateway-cleanup.test.js — Update pattern-matching test to reflect that destroyGateway() now lives inside the retry callback.
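The retry shape described above can be sketched without the dependency. The actual change uses p-retry@^4 (CJS), whose call shape is `pRetry(fn, { retries, minTimeout, factor, onFailedAttempt })`; the hand-rolled equivalent below only illustrates those semantics, and the numbers (retries: 2 for 3 total attempts, 10s minimum timeout, factor 3 giving 10s/30s delays) come from the PR. The error's `retriesLeft` field, mirrored here, is what lets the cleanup hook skip `destroyGateway()` on the terminal failure:

```javascript
// Dependency-free approximation of the p-retry call shape used by this PR.
// Mirrors pRetry(fn, { retries, minTimeout, factor, onFailedAttempt });
// helper names here are illustrative, not the PR's actual identifiers.
async function retryWithBackoff(fn, { retries = 2, minTimeout = 10_000, factor = 3, onFailedAttempt } = {}) {
  for (let attempt = 1; ; attempt += 1) {
    try {
      return await fn(attempt);
    } catch (err) {
      // Mirror p-retry's error decoration so hooks can inspect progress.
      err.attemptNumber = attempt;
      err.retriesLeft = retries - (attempt - 1);
      if (onFailedAttempt) await onFailedAttempt(err); // e.g. log + destroyGateway() when retriesLeft > 0
      if (err.retriesLeft <= 0) throw err;
      // Exponential backoff: with the PR's settings, 10s after the first
      // failure, then 30s after the second.
      const delay = minTimeout * factor ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

With `retries: 2` this makes three total attempts, matching the "up to 3 times" behavior in the summary.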

Testing

  • npm test — all unit tests pass including cli.test.js (recovery path timing preserved)
  • Manual: on macOS the retry path is not exercised (gateway starts on first attempt), confirming no regression for the happy path

Summary by CodeRabbit

  • Improvements

    • Startup now automatically retries with exponential backoff, performs repeated health checks before declaring success, logs progress during attempts, and prints a single clear "Gateway is healthy" on success.
    • On final failure, messaging is consolidated with concise troubleshooting guidance; transient cleanup runs between retry attempts.
  • Chores

    • Added a retry helper dependency to improve startup reliability.
  • Tests

    • Updated gateway startup tests to reflect the new retry and cleanup behavior.

On some hosts (Horde VMs, first-run environments), the embedded k3s
inside the OpenShell gateway needs more time to initialize than the
gateway's internal health-check window allows. The first attempt fails
with misleading orphaned-cgroup cleanup messages, but a second attempt
typically succeeds because container images are cached and cgroup state
is cleaner.

Replace the single-shot gateway start + separate health check with a
retry loop (up to 3 attempts, 10s/30s backoff). Each failed attempt
gets a clean destroyGateway() before retry. On final failure, the
error message now includes openshell doctor troubleshooting commands.

Upstream tracking: NVIDIA/OpenShell#433

coderabbitai bot commented Mar 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 00d1cf46-7f9a-4441-bcba-88a29591c9fb

📥 Commits

Reviewing files that changed from the base of the PR and between 9a68f51 and 32a845f.

📒 Files selected for processing (1)
  • bin/lib/onboard.js

📝 Walkthrough

Walkthrough

Wraps gateway start in a p-retry loop with exponential backoff. Each attempt runs openshell gateway start (ignore errors), performs up-to-5 external health checks (2s apart), invokes destroyGateway() on failed attempts when configured, and centralizes final failure reporting. Adds p-retry dependency and updates a test.

Changes

Cohort / File(s) Summary
Onboard runtime logic
bin/lib/onboard.js
Replace single-start + fixed health-check loop with a p-retry-wrapped retry flow. Compute retry count from exitOnFailure. Each attempt runs openshell gateway start (ignoreError:true), polls health up to 5 times (2s apart), throws on health failure to trigger retry, and uses onFailedAttempt to log progress and call destroyGateway() when configured. Consolidates final success/failure messaging; removes prior per-iteration "Stale state removed... please rerun" behavior.
Tests
test/gateway-cleanup.test.js
Update assertions to reflect retry-based behavior: no longer assert strict invocation counts of destroyGateway(); instead check that startGatewayWithOptions contains destroyGateway() usage and adjust comment to note retry-loop cleanup behavior.
Dependencies / manifest
package.json
Add runtime dependency p-retry@^4.6.2. Minor devDependency reordering/formatting and small formatting tweak to dependencies.openclaw entry.
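The per-attempt health sampling described in the walkthrough (up to 5 checks, 2 seconds apart, no sleep after the final check) can be sketched as below. The function and parameter names are hypothetical, not the PR's actual helpers; the real code calls `runCaptureOpenshell` and `isGatewayHealthy` inline rather than through an injected check function:

```javascript
// Sketch of the per-attempt health polling: sample up to `attempts` times,
// `intervalMs` apart, returning as soon as one check reports healthy.
// checkFn / sleepFn are injected for illustration and testability.
async function pollHealthy(checkFn, { attempts = 5, intervalMs = 2000, sleepFn } = {}) {
  const sleep = sleepFn ?? ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  for (let i = 0; i < attempts; i += 1) {
    if (await checkFn()) return true;              // healthy: stop sampling
    if (i < attempts - 1) await sleep(intervalMs); // no sleep after the final check
  }
  return false; // caller throws, which triggers the outer p-retry attempt
}
```

Skipping the sleep after the last iteration preserves the timing that cli.test.js depends on, per the second commit in this PR.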

Sequence Diagram(s)

sequenceDiagram
  participant CLI as "Nemoclaw CLI"
  participant OS_CLI as "OpenShell CLI\n(openshell gateway start)"
  participant Gateway as "Gateway Container\n(k3s)"

  CLI->>OS_CLI: run "openshell gateway start" (ignoreError:true)
  OS_CLI->>Gateway: start container & internal health checks
  alt start returns success
    OS_CLI-->>CLI: start returned 0
    CLI->>CLI: perform up to 5 external health samples (2s interval)
    alt isGatewayHealthy == true
      CLI-->>Gateway: confirmed healthy
    else not healthy after samples
      CLI->>CLI: throw to trigger retry
    end
  else start returned non-zero / immediate failure
    OS_CLI-->>CLI: non-success
    CLI->>CLI: throw to trigger retry
  end
  Note over CLI: p-retry schedules next attempt (minTimeout:10s, factor:3) and calls onFailedAttempt which may run destroyGateway() before next attempt

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

"I hop, I nudge, I try once more,
Destroy and wait, then open the door.
Backoff counts the patient beats,
Until k3s hums and routes its fleets.
🐇✨"

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: adding retry logic with exponential backoff to the gateway start function, which directly addresses the linked issue.
Linked Issues check ✅ Passed The changes implement all core coding requirements from issue #1050: retry logic up to 3 times with exponential backoff [#1050], destroyGateway() called between retries [#1050], and improved diagnostics with openshell doctor commands [#1050].
Out of Scope Changes check ✅ Passed All changes are in-scope: retry mechanism in onboard.js, p-retry dependency addition, and test assertion updates directly support the PR objectives with no extraneous modifications.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.


@ericksoa ericksoa added the bug Something isn't working label Mar 28, 2026
@ericksoa ericksoa self-assigned this Mar 28, 2026
@ericksoa ericksoa added the Getting Started, OpenShell, and fix labels and removed the OpenShell label Mar 28, 2026
The recovery path (nemoclaw status) should not retry — only the onboard
wizard benefits from retries. Also restore the original behavior of not
sleeping after the final health-check iteration, matching the timing
the cli.test.js tests depend on.

Replace hand-rolled retry loop with p-retry library for gateway start
backoff. Cleaner, more conventional, and delegates retry/backoff
concerns to a well-tested library.
@ericksoa ericksoa requested a review from cv March 28, 2026 18:31

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 1623-1642: The current recovery path always calls destroyGateway()
on a single failure even when retries is 0, which destroys gateway volumes
needed for diagnostics; change the cleanup so destroyGateway() is only invoked
when teardown is intended (e.g., when exitOnFailure is true or retries > 0 /
another retry will occur). Locate the pRetry callback where retries is defined
and the throw new Error("Gateway failed to start") branch, and wrap or gate the
destroyGateway() call so it runs only for the onboarding/teardown path
(exitOnFailure === true or retries > 0) rather than unconditionally during the
recovery/startGatewayForRecovery flow.


Comment thread bin/lib/onboard.js
The recovery path (exitOnFailure=false) should preserve gateway state
for diagnostics. Gate destroyGateway() behind exitOnFailure so only
the onboard path (which will retry) tears down on failure.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
bin/lib/onboard.js (1)

1640-1659: ⚠️ Potential issue | 🟠 Major

Keep the last failed gateway intact for the new doctor commands.

destroyGateway() runs on every failed onboarding attempt, including the terminal one, and Line 1655 still tells users the state was removed. That tears down the very gateway/container state the new openshell doctor logs --name nemoclaw guidance needs to inspect. Please only clean up when another retry will run, and update the final message accordingly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1640 - 1659, The current catch block always
calls destroyGateway() on failure and then prints that stale state was removed,
which destroys the gateway logs the new doctor commands need; change logic so
destroyGateway() is only called when another retry will run (e.g., call
destroyGateway() inside the retry handler or check err.attemptNumber < retries+1
before destroying), and in the final catch path do NOT call destroyGateway() and
update the final console.error text from "Stale state removed. Please rerun:
nemoclaw onboard" to a message that preserves state and directs users to run the
diagnostic commands (openshell doctor logs --name nemoclaw and openshell doctor
check) for inspection; keep references to destroyGateway(), the retry mechanism
(retries/minTimeout/factor/onFailedAttempt) and the catch block when making the
change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 1626-1638: The code currently only runs the 5-attempt health-check
loop when startResult.status === 0, skipping the grace-period sampling for
slow-start cases where runOpenshell(["gateway","start", ...]) exits non-zero;
move or duplicate the for-loop so the runCaptureOpenshell
status/namedInfo/currentInfo checks with isGatewayHealthy(...) are executed
regardless of startResult.status (i.e., run the same 5-iteration retry loop
after calling runOpenshell), so a gateway that initially exits non-zero still
gets the extra sampling attempts to become healthy before teardown.

---

Duplicate comments:
In `@bin/lib/onboard.js`:
- Around line 1640-1659: The current catch block always calls destroyGateway()
on failure and then prints that stale state was removed, which destroys the
gateway logs the new doctor commands need; change logic so destroyGateway() is
only called when another retry will run (e.g., call destroyGateway() inside the
retry handler or check err.attemptNumber < retries+1 before destroying), and in
the final catch path do NOT call destroyGateway() and update the final
console.error text from "Stale state removed. Please rerun: nemoclaw onboard" to
a message that preserves state and directs users to run the diagnostic commands
(openshell doctor logs --name nemoclaw and openshell doctor check) for
inspection; keep references to destroyGateway(), the retry mechanism
(retries/minTimeout/factor/onFailedAttempt) and the catch block when making the
change.


Comment thread bin/lib/onboard.js Outdated
The slow-start case (where gateway start exits non-zero because
OpenShell's internal health check timed out) is exactly the scenario
this retry logic targets. Always run the grace-period health-check
loop so a gateway that's still coming up gets sampled before teardown.

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
bin/lib/onboard.js (1)

1628-1638: ⚠️ Potential issue | 🟠 Major

Run grace-period health checks regardless of exit status.

The startResult.status === 0 guard skips the health sampling when openshell gateway start exits non-zero—but the slow-start scenario this retry path targets is precisely when OpenShell's internal health-check times out (non-zero exit) while k3s is still initializing. The gateway may actually be healthy moments after that timeout, so destroying it immediately without sampling wastes a viable attempt.

Suggested fix
-      if (startResult.status === 0) {
-        for (let i = 0; i < 5; i++) {
-          const status = runCaptureOpenshell(["status"], { ignoreError: true });
-          const namedInfo = runCaptureOpenshell(["gateway", "info", "-g", GATEWAY_NAME], { ignoreError: true });
-          const currentInfo = runCaptureOpenshell(["gateway", "info"], { ignoreError: true });
-          if (isGatewayHealthy(status, namedInfo, currentInfo)) {
-            return; // success
-          }
-          if (i < 4) sleep(2);
-        }
+      for (let i = 0; i < 5; i += 1) {
+        const status = runCaptureOpenshell(["status"], { ignoreError: true });
+        const namedInfo = runCaptureOpenshell(["gateway", "info", "-g", GATEWAY_NAME], { ignoreError: true });
+        const currentInfo = runCaptureOpenshell(["gateway", "info"], { ignoreError: true });
+        if (isGatewayHealthy(status, namedInfo, currentInfo)) {
+          return; // success
+        }
+        if (i < 4) sleep(2);
       }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1628 - 1638, The health-sampling loop is
currently gated by startResult.status === 0 which skips retrying when "openshell
gateway start" returns non-zero; move or remove that guard so the for-loop that
calls runCaptureOpenshell(["status"]),
runCaptureOpenshell(["gateway","info","-g", GATEWAY_NAME]) and
runCaptureOpenshell(["gateway","info"]) and checks isGatewayHealthy(...) runs
regardless of startResult.status, keeping the same loop/backoff (sleep(2)) and
early return on success; ensure the existing behavior (return on
isGatewayHealthy success and the 5-attempt retry) is preserved but executed even
when startResult.status !== 0.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@bin/lib/onboard.js`:
- Around line 1628-1638: The health-sampling loop is currently gated by
startResult.status === 0 which skips retrying when "openshell gateway start"
returns non-zero; move or remove that guard so the for-loop that calls
runCaptureOpenshell(["status"]), runCaptureOpenshell(["gateway","info","-g",
GATEWAY_NAME]) and runCaptureOpenshell(["gateway","info"]) and checks
isGatewayHealthy(...) runs regardless of startResult.status, keeping the same
loop/backoff (sleep(2)) and early return on success; ensure the existing
behavior (return on isGatewayHealthy success and the 5-attempt retry) is
preserved but executed even when startResult.status !== 0.


…nostics

Move destroyGateway() from the pRetry callback into onFailedAttempt
so it only runs between retries, not on the terminal failure. The
final failed gateway state is now preserved so openshell doctor logs
can inspect container logs for troubleshooting.
@cv cv merged commit 97c889c into main Mar 30, 2026
9 checks passed
ericksoa added a commit that referenced this pull request Mar 30, 2026
`npm install -g .` from a fresh clone creates a symlink (equivalent to
`npm link`) and does NOT install dependencies. This causes `require('p-retry')`
to fail at runtime since node_modules doesn't exist locally.

- Add dep bootstrap to `prepare` script: installs production deps if missing,
  which runs during `npm link` / `npm install -g .` but is a no-op during
  normal `npm install` (deps already present)
- Add `bundleDependencies` for tarball-based install paths (npm pack/publish)

Fixes regression from #1051.
ericksoa added a commit that referenced this pull request Mar 30, 2026
…1112)

* fix(install): ensure p-retry is available for npm install -g from source

`npm install -g .` from a fresh clone creates a symlink (equivalent to
`npm link`) and does NOT install dependencies. This causes `require('p-retry')`
to fail at runtime since node_modules doesn't exist locally.

- Add dep bootstrap to `prepare` script: installs production deps if missing,
  which runs during `npm link` / `npm install -g .` but is a no-op during
  normal `npm install` (deps already present)
- Add `bundleDependencies` for tarball-based install paths (npm pack/publish)

Fixes regression from #1051.

* fix(install): always install production deps in prepare script

Address review feedback: remove conditional node_modules/p-retry check
and unconditionally run npm install --omit=dev --ignore-scripts. This is
a no-op when deps are already present and ensures they resolve in the
global install case.

---------

Co-authored-by: Carlos Villela <cvillela@nvidia.com>
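Based on the commit messages above, the resulting package.json changes might look roughly like this. This is a sketch, not the repository's actual manifest: surrounding fields are omitted, and listing p-retry as the sole bundleDependencies entry is an assumption beyond what the commit text states.

```json
{
  "scripts": {
    "prepare": "npm install --omit=dev --ignore-scripts"
  },
  "bundleDependencies": ["p-retry"],
  "dependencies": {
    "p-retry": "^4.6.2"
  }
}
```

The prepare script runs during `npm link` / `npm install -g .` from a clone, while bundleDependencies covers the tarball-based install paths (npm pack/publish).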
quanticsoul4772 pushed a commit to quanticsoul4772/NemoClaw that referenced this pull request Mar 30, 2026
* fix(onboard): retry gateway start with exponential backoff

On some hosts (Horde VMs, first-run environments), the embedded k3s
inside the OpenShell gateway needs more time to initialize than the
gateway's internal health-check window allows. The first attempt fails
with misleading orphaned-cgroup cleanup messages, but a second attempt
typically succeeds because container images are cached and cgroup state
is cleaner.

Replace the single-shot gateway start + separate health check with a
retry loop (up to 3 attempts, 10s/30s backoff). Each failed attempt
gets a clean destroyGateway() before retry. On final failure, the
error message now includes openshell doctor troubleshooting commands.

Upstream tracking: NVIDIA/OpenShell#433

* fix(onboard): skip retry for recovery path, preserve health-check timing

The recovery path (nemoclaw status) should not retry — only the onboard
wizard benefits from retries. Also restore the original behavior of not
sleeping after the final health-check iteration, matching the timing
the cli.test.js tests depend on.

* refactor(onboard): use p-retry for gateway start retry logic

Replace hand-rolled retry loop with p-retry library for gateway start
backoff. Cleaner, more conventional, and delegates retry/backoff
concerns to a well-tested library.

* fix(onboard): correct openshell doctor flags in troubleshooting output

* fix(onboard): use --name flag for doctor logs (targets container directly)

* fix(onboard): skip gateway destroy on recovery path failure

The recovery path (exitOnFailure=false) should preserve gateway state
for diagnostics. Gate destroyGateway() behind exitOnFailure so only
the onboard path (which will retry) tears down on failure.

* fix(onboard): run health checks regardless of gateway start exit code

The slow-start case (where gateway start exits non-zero because
OpenShell's internal health check timed out) is exactly the scenario
this retry logic targets. Always run the grace-period health-check
loop so a gateway that's still coming up gets sampled before teardown.

* fix(onboard): preserve gateway state on final failure for doctor diagnostics

Move destroyGateway() from the pRetry callback into onFailedAttempt
so it only runs between retries, not on the terminal failure. The
final failed gateway state is now preserved so openshell doctor logs
can inspect container logs for troubleshooting.
quanticsoul4772 pushed a commit to quanticsoul4772/NemoClaw that referenced this pull request Mar 30, 2026
laitingsheng pushed a commit that referenced this pull request Apr 2, 2026
laitingsheng pushed a commit that referenced this pull request Apr 2, 2026
lakamsani pushed a commit to lakamsani/NemoClaw that referenced this pull request Apr 4, 2026
lakamsani pushed a commit to lakamsani/NemoClaw that referenced this pull request Apr 4, 2026
gemini2026 pushed a commit to gemini2026/NemoClaw that referenced this pull request Apr 14, 2026
gemini2026 pushed a commit to gemini2026/NemoClaw that referenced this pull request Apr 14, 2026

Labels

bug, fix, Getting Started


Development

Successfully merging this pull request may close these issues.

fix(onboard): gateway start fails on first attempt due to k3s startup timeout
