fix(onboard): retry gateway start with exponential backoff#1051
Conversation
On some hosts (Horde VMs, first-run environments), the embedded k3s inside the OpenShell gateway needs more time to initialize than the gateway's internal health-check window allows. The first attempt fails with misleading orphaned-cgroup cleanup messages, but a second attempt typically succeeds because container images are cached and cgroup state is cleaner. Replace the single-shot gateway start + separate health check with a retry loop (up to 3 attempts, 10s/30s backoff). Each failed attempt gets a clean destroyGateway() before retry. On final failure, the error message now includes openshell doctor troubleshooting commands. Upstream tracking: NVIDIA/OpenShell#433
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughWraps gateway start in a p-retry loop with exponential backoff. Each attempt runs Changes
Sequence Diagram(s)sequenceDiagram
participant CLI as "Nemoclaw CLI"
participant OS_CLI as "OpenShell CLI\n(openshell gateway start)"
participant Gateway as "Gateway Container\n(k3s)"
CLI->>OS_CLI: run "openshell gateway start" (ignoreError:true)
OS_CLI->>Gateway: start container & internal health checks
alt start returns success
OS_CLI-->>CLI: start returned 0
CLI->>CLI: perform up to 5 external health samples (2s interval)
alt isGatewayHealthy == true
CLI-->>Gateway: confirmed healthy
else not healthy after samples
CLI->>CLI: throw to trigger retry
end
else start returned non-zero / immediate failure
OS_CLI-->>CLI: non-success
CLI->>CLI: throw to trigger retry
end
Note over CLI: p-retry schedules next attempt (minTimeout:10s, factor:3) and calls onFailedAttempt which may run destroyGateway() before next attempt
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
The recovery path (nemoclaw status) should not retry — only the onboard wizard benefits from retries. Also restore the original behavior of not sleeping after the final health-check iteration, matching the timing the cli.test.js tests depend on.
Replace hand-rolled retry loop with p-retry library for gateway start backoff. Cleaner, more conventional, and delegates retry/backoff concerns to a well-tested library.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/onboard.js`:
- Around line 1623-1642: The current recovery path always calls destroyGateway()
on a single failure even when retries is 0, which destroys gateway volumes
needed for diagnostics; change the cleanup so destroyGateway() is only invoked
when teardown is intended (e.g., when exitOnFailure is true or retries > 0 /
another retry will occur). Locate the pRetry callback where retries is defined
and the throw new Error("Gateway failed to start") branch, and wrap or gate the
destroyGateway() call so it runs only for the onboarding/teardown path
(exitOnFailure === true or retries > 0) rather than unconditionally during the
recovery/startGatewayForRecovery flow.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f07a35c3-d0a1-4a34-a63d-b1578f964058
📒 Files selected for processing (1)
bin/lib/onboard.js
The recovery path (exitOnFailure=false) should preserve gateway state for diagnostics. Gate destroyGateway() behind exitOnFailure so only the onboard path (which will retry) tears down on failure.
…emoClaw into fix/gateway-start-retry
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
bin/lib/onboard.js (1)
1640-1659:⚠️ Potential issue | 🟠 MajorKeep the last failed gateway intact for the new doctor commands.
destroyGateway()runs on every failed onboarding attempt, including the terminal one, and Line 1655 still tells users the state was removed. That tears down the very gateway/container state the newopenshell doctor logs --name nemoclawguidance needs to inspect. Please only clean up when another retry will run, and update the final message accordingly.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard.js` around lines 1640 - 1659, The current catch block always calls destroyGateway() on failure and then prints that stale state was removed, which destroys the gateway logs the new doctor commands need; change logic so destroyGateway() is only called when another retry will run (e.g., call destroyGateway() inside the retry handler or check err.attemptNumber < retries+1 before destroying), and in the final catch path do NOT call destroyGateway() and update the final console.error text from "Stale state removed. Please rerun: nemoclaw onboard" to a message that preserves state and directs users to run the diagnostic commands (openshell doctor logs --name nemoclaw and openshell doctor check) for inspection; keep references to destroyGateway(), the retry mechanism (retries/minTimeout/factor/onFailedAttempt) and the catch block when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/onboard.js`:
- Around line 1626-1638: The code currently only runs the 5-attempt health-check
loop when startResult.status === 0, skipping the grace-period sampling for
slow-start cases where runOpenshell(["gateway","start", ...]) exits non-zero;
move or duplicate the for-loop so the runCaptureOpenshell
status/namedInfo/currentInfo checks with isGatewayHealthy(...) are executed
regardless of startResult.status (i.e., run the same 5-iteration retry loop
after calling runOpenshell), so a gateway that initially exits non-zero still
gets the extra sampling attempts to become healthy before teardown.
---
Duplicate comments:
In `@bin/lib/onboard.js`:
- Around line 1640-1659: The current catch block always calls destroyGateway()
on failure and then prints that stale state was removed, which destroys the
gateway logs the new doctor commands need; change logic so destroyGateway() is
only called when another retry will run (e.g., call destroyGateway() inside the
retry handler or check err.attemptNumber < retries+1 before destroying), and in
the final catch path do NOT call destroyGateway() and update the final
console.error text from "Stale state removed. Please rerun: nemoclaw onboard" to
a message that preserves state and directs users to run the diagnostic commands
(openshell doctor logs --name nemoclaw and openshell doctor check) for
inspection; keep references to destroyGateway(), the retry mechanism
(retries/minTimeout/factor/onFailedAttempt) and the catch block when making the
change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 90377827-1035-4286-9a60-31a4d9d73064
📒 Files selected for processing (1)
bin/lib/onboard.js
The slow-start case (where gateway start exits non-zero because OpenShell's internal health check timed out) is exactly the scenario this retry logic targets. Always run the grace-period health-check loop so a gateway that's still coming up gets sampled before teardown.
There was a problem hiding this comment.
♻️ Duplicate comments (1)
bin/lib/onboard.js (1)
1628-1638:⚠️ Potential issue | 🟠 MajorRun grace-period health checks regardless of exit status.
The
startResult.status === 0guard skips the health sampling whenopenshell gateway startexits non-zero—but the slow-start scenario this retry path targets is precisely when OpenShell's internal health-check times out (non-zero exit) while k3s is still initializing. The gateway may actually be healthy moments after that timeout, so destroying it immediately without sampling wastes a viable attempt.Suggested fix
- if (startResult.status === 0) { - for (let i = 0; i < 5; i++) { - const status = runCaptureOpenshell(["status"], { ignoreError: true }); - const namedInfo = runCaptureOpenshell(["gateway", "info", "-g", GATEWAY_NAME], { ignoreError: true }); - const currentInfo = runCaptureOpenshell(["gateway", "info"], { ignoreError: true }); - if (isGatewayHealthy(status, namedInfo, currentInfo)) { - return; // success - } - if (i < 4) sleep(2); - } + for (let i = 0; i < 5; i += 1) { + const status = runCaptureOpenshell(["status"], { ignoreError: true }); + const namedInfo = runCaptureOpenshell(["gateway", "info", "-g", GATEWAY_NAME], { ignoreError: true }); + const currentInfo = runCaptureOpenshell(["gateway", "info"], { ignoreError: true }); + if (isGatewayHealthy(status, namedInfo, currentInfo)) { + return; // success + } + if (i < 4) sleep(2); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard.js` around lines 1628 - 1638, The health-sampling loop is currently gated by startResult.status === 0 which skips retrying when "openshell gateway start" returns non-zero; move or remove that guard so the for-loop that calls runCaptureOpenshell(["status"]), runCaptureOpenshell(["gateway","info","-g", GATEWAY_NAME]) and runCaptureOpenshell(["gateway","info"]) and checks isGatewayHealthy(...) runs regardless of startResult.status, keeping the same loop/backoff (sleep(2)) and early return on success; ensure the existing behavior (return on isGatewayHealthy success and the 5-attempt retry) is preserved but executed even when startResult.status !== 0.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@bin/lib/onboard.js`:
- Around line 1628-1638: The health-sampling loop is currently gated by
startResult.status === 0 which skips retrying when "openshell gateway start"
returns non-zero; move or remove that guard so the for-loop that calls
runCaptureOpenshell(["status"]), runCaptureOpenshell(["gateway","info","-g",
GATEWAY_NAME]) and runCaptureOpenshell(["gateway","info"]) and checks
isGatewayHealthy(...) runs regardless of startResult.status, keeping the same
loop/backoff (sleep(2)) and early return on success; ensure the existing
behavior (return on isGatewayHealthy success and the 5-attempt retry) is
preserved but executed even when startResult.status !== 0.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4523112d-d108-4d23-9e00-fc25e540e161
📒 Files selected for processing (1)
bin/lib/onboard.js
…nostics Move destroyGateway() from the pRetry callback into onFailedAttempt so it only runs between retries, not on the terminal failure. The final failed gateway state is now preserved so openshell doctor logs can inspect container logs for troubleshooting.
`npm install -g .` from a fresh clone creates a symlink (equivalent to
`npm link`) and does NOT install dependencies. This causes `require('p-retry')`
to fail at runtime since node_modules doesn't exist locally.
- Add dep bootstrap to `prepare` script: installs production deps if missing,
which runs during `npm link` / `npm install -g .` but is a no-op during
normal `npm install` (deps already present)
- Add `bundleDependencies` for tarball-based install paths (npm pack/publish)
Fixes regression from #1051.
…1112) * fix(install): ensure p-retry is available for npm install -g from source `npm install -g .` from a fresh clone creates a symlink (equivalent to `npm link`) and does NOT install dependencies. This causes `require('p-retry')` to fail at runtime since node_modules doesn't exist locally. - Add dep bootstrap to `prepare` script: installs production deps if missing, which runs during `npm link` / `npm install -g .` but is a no-op during normal `npm install` (deps already present) - Add `bundleDependencies` for tarball-based install paths (npm pack/publish) Fixes regression from #1051. * fix(install): always install production deps in prepare script Address review feedback: remove conditional node_modules/p-retry check and unconditionally run npm install --omit=dev --ignore-scripts. This is a no-op when deps are already present and ensures they resolve in the global install case. --------- Co-authored-by: Carlos Villela <cvillela@nvidia.com>
* fix(onboard): retry gateway start with exponential backoff On some hosts (Horde VMs, first-run environments), the embedded k3s inside the OpenShell gateway needs more time to initialize than the gateway's internal health-check window allows. The first attempt fails with misleading orphaned-cgroup cleanup messages, but a second attempt typically succeeds because container images are cached and cgroup state is cleaner. Replace the single-shot gateway start + separate health check with a retry loop (up to 3 attempts, 10s/30s backoff). Each failed attempt gets a clean destroyGateway() before retry. On final failure, the error message now includes openshell doctor troubleshooting commands. Upstream tracking: NVIDIA/OpenShell#433 * fix(onboard): skip retry for recovery path, preserve health-check timing The recovery path (nemoclaw status) should not retry — only the onboard wizard benefits from retries. Also restore the original behavior of not sleeping after the final health-check iteration, matching the timing the cli.test.js tests depend on. * refactor(onboard): use p-retry for gateway start retry logic Replace hand-rolled retry loop with p-retry library for gateway start backoff. Cleaner, more conventional, and delegates retry/backoff concerns to a well-tested library. * fix(onboard): correct openshell doctor flags in troubleshooting output * fix(onboard): use --name flag for doctor logs (targets container directly) * fix(onboard): skip gateway destroy on recovery path failure The recovery path (exitOnFailure=false) should preserve gateway state for diagnostics. Gate destroyGateway() behind exitOnFailure so only the onboard path (which will retry) tears down on failure. * fix(onboard): run health checks regardless of gateway start exit code The slow-start case (where gateway start exits non-zero because OpenShell's internal health check timed out) is exactly the scenario this retry logic targets. Always run the grace-period health-check loop so a gateway that's still coming up gets sampled before teardown. * fix(onboard): preserve gateway state on final failure for doctor diagnostics Move destroyGateway() from the pRetry callback into onFailedAttempt so it only runs between retries, not on the terminal failure. The final failed gateway state is now preserved so openshell doctor logs can inspect container logs for troubleshooting.
…VIDIA#1112) * fix(install): ensure p-retry is available for npm install -g from source `npm install -g .` from a fresh clone creates a symlink (equivalent to `npm link`) and does NOT install dependencies. This causes `require('p-retry')` to fail at runtime since node_modules doesn't exist locally. - Add dep bootstrap to `prepare` script: installs production deps if missing, which runs during `npm link` / `npm install -g .` but is a no-op during normal `npm install` (deps already present) - Add `bundleDependencies` for tarball-based install paths (npm pack/publish) Fixes regression from NVIDIA#1051. * fix(install): always install production deps in prepare script Address review feedback: remove conditional node_modules/p-retry check and unconditionally run npm install --omit=dev --ignore-scripts. This is a no-op when deps are already present and ensures they resolve in the global install case. --------- Co-authored-by: Carlos Villela <cvillela@nvidia.com>
* fix(onboard): retry gateway start with exponential backoff On some hosts (Horde VMs, first-run environments), the embedded k3s inside the OpenShell gateway needs more time to initialize than the gateway's internal health-check window allows. The first attempt fails with misleading orphaned-cgroup cleanup messages, but a second attempt typically succeeds because container images are cached and cgroup state is cleaner. Replace the single-shot gateway start + separate health check with a retry loop (up to 3 attempts, 10s/30s backoff). Each failed attempt gets a clean destroyGateway() before retry. On final failure, the error message now includes openshell doctor troubleshooting commands. Upstream tracking: NVIDIA/OpenShell#433 * fix(onboard): skip retry for recovery path, preserve health-check timing The recovery path (nemoclaw status) should not retry — only the onboard wizard benefits from retries. Also restore the original behavior of not sleeping after the final health-check iteration, matching the timing the cli.test.js tests depend on. * refactor(onboard): use p-retry for gateway start retry logic Replace hand-rolled retry loop with p-retry library for gateway start backoff. Cleaner, more conventional, and delegates retry/backoff concerns to a well-tested library. * fix(onboard): correct openshell doctor flags in troubleshooting output * fix(onboard): use --name flag for doctor logs (targets container directly) * fix(onboard): skip gateway destroy on recovery path failure The recovery path (exitOnFailure=false) should preserve gateway state for diagnostics. Gate destroyGateway() behind exitOnFailure so only the onboard path (which will retry) tears down on failure. * fix(onboard): run health checks regardless of gateway start exit code The slow-start case (where gateway start exits non-zero because OpenShell's internal health check timed out) is exactly the scenario this retry logic targets. Always run the grace-period health-check loop so a gateway that's still coming up gets sampled before teardown. * fix(onboard): preserve gateway state on final failure for doctor diagnostics Move destroyGateway() from the pRetry callback into onFailedAttempt so it only runs between retries, not on the terminal failure. The final failed gateway state is now preserved so openshell doctor logs can inspect container logs for troubleshooting.
…1112) * fix(install): ensure p-retry is available for npm install -g from source `npm install -g .` from a fresh clone creates a symlink (equivalent to `npm link`) and does NOT install dependencies. This causes `require('p-retry')` to fail at runtime since node_modules doesn't exist locally. - Add dep bootstrap to `prepare` script: installs production deps if missing, which runs during `npm link` / `npm install -g .` but is a no-op during normal `npm install` (deps already present) - Add `bundleDependencies` for tarball-based install paths (npm pack/publish) Fixes regression from #1051. * fix(install): always install production deps in prepare script Address review feedback: remove conditional node_modules/p-retry check and unconditionally run npm install --omit=dev --ignore-scripts. This is a no-op when deps are already present and ensures they resolve in the global install case. --------- Co-authored-by: Carlos Villela <cvillela@nvidia.com>
* fix(onboard): retry gateway start with exponential backoff On some hosts (Horde VMs, first-run environments), the embedded k3s inside the OpenShell gateway needs more time to initialize than the gateway's internal health-check window allows. The first attempt fails with misleading orphaned-cgroup cleanup messages, but a second attempt typically succeeds because container images are cached and cgroup state is cleaner. Replace the single-shot gateway start + separate health check with a retry loop (up to 3 attempts, 10s/30s backoff). Each failed attempt gets a clean destroyGateway() before retry. On final failure, the error message now includes openshell doctor troubleshooting commands. Upstream tracking: NVIDIA/OpenShell#433 * fix(onboard): skip retry for recovery path, preserve health-check timing The recovery path (nemoclaw status) should not retry — only the onboard wizard benefits from retries. Also restore the original behavior of not sleeping after the final health-check iteration, matching the timing the cli.test.js tests depend on. * refactor(onboard): use p-retry for gateway start retry logic Replace hand-rolled retry loop with p-retry library for gateway start backoff. Cleaner, more conventional, and delegates retry/backoff concerns to a well-tested library. * fix(onboard): correct openshell doctor flags in troubleshooting output * fix(onboard): use --name flag for doctor logs (targets container directly) * fix(onboard): skip gateway destroy on recovery path failure The recovery path (exitOnFailure=false) should preserve gateway state for diagnostics. Gate destroyGateway() behind exitOnFailure so only the onboard path (which will retry) tears down on failure. * fix(onboard): run health checks regardless of gateway start exit code The slow-start case (where gateway start exits non-zero because OpenShell's internal health check timed out) is exactly the scenario this retry logic targets. Always run the grace-period health-check loop so a gateway that's still coming up gets sampled before teardown. * fix(onboard): preserve gateway state on final failure for doctor diagnostics Move destroyGateway() from the pRetry callback into onFailedAttempt so it only runs between retries, not on the terminal failure. The final failed gateway state is now preserved so openshell doctor logs can inspect container logs for troubleshooting.
…VIDIA#1112) * fix(install): ensure p-retry is available for npm install -g from source `npm install -g .` from a fresh clone creates a symlink (equivalent to `npm link`) and does NOT install dependencies. This causes `require('p-retry')` to fail at runtime since node_modules doesn't exist locally. - Add dep bootstrap to `prepare` script: installs production deps if missing, which runs during `npm link` / `npm install -g .` but is a no-op during normal `npm install` (deps already present) - Add `bundleDependencies` for tarball-based install paths (npm pack/publish) Fixes regression from NVIDIA#1051. * fix(install): always install production deps in prepare script Address review feedback: remove conditional node_modules/p-retry check and unconditionally run npm install --omit=dev --ignore-scripts. This is a no-op when deps are already present and ensures they resolve in the global install case. --------- Co-authored-by: Carlos Villela <cvillela@nvidia.com>
* fix(onboard): retry gateway start with exponential backoff On some hosts (Horde VMs, first-run environments), the embedded k3s inside the OpenShell gateway needs more time to initialize than the gateway's internal health-check window allows. The first attempt fails with misleading orphaned-cgroup cleanup messages, but a second attempt typically succeeds because container images are cached and cgroup state is cleaner. Replace the single-shot gateway start + separate health check with a retry loop (up to 3 attempts, 10s/30s backoff). Each failed attempt gets a clean destroyGateway() before retry. On final failure, the error message now includes openshell doctor troubleshooting commands. Upstream tracking: NVIDIA/OpenShell#433 * fix(onboard): skip retry for recovery path, preserve health-check timing The recovery path (nemoclaw status) should not retry — only the onboard wizard benefits from retries. Also restore the original behavior of not sleeping after the final health-check iteration, matching the timing the cli.test.js tests depend on. * refactor(onboard): use p-retry for gateway start retry logic Replace hand-rolled retry loop with p-retry library for gateway start backoff. Cleaner, more conventional, and delegates retry/backoff concerns to a well-tested library. * fix(onboard): correct openshell doctor flags in troubleshooting output * fix(onboard): use --name flag for doctor logs (targets container directly) * fix(onboard): skip gateway destroy on recovery path failure The recovery path (exitOnFailure=false) should preserve gateway state for diagnostics. Gate destroyGateway() behind exitOnFailure so only the onboard path (which will retry) tears down on failure. * fix(onboard): run health checks regardless of gateway start exit code The slow-start case (where gateway start exits non-zero because OpenShell's internal health check timed out) is exactly the scenario this retry logic targets. Always run the grace-period health-check loop so a gateway that's still coming up gets sampled before teardown. * fix(onboard): preserve gateway state on final failure for doctor diagnostics Move destroyGateway() from the pRetry callback into onFailedAttempt so it only runs between retries, not on the terminal failure. The final failed gateway state is now preserved so openshell doctor logs can inspect container logs for troubleshooting.
…VIDIA#1112) * fix(install): ensure p-retry is available for npm install -g from source `npm install -g .` from a fresh clone creates a symlink (equivalent to `npm link`) and does NOT install dependencies. This causes `require('p-retry')` to fail at runtime since node_modules doesn't exist locally. - Add dep bootstrap to `prepare` script: installs production deps if missing, which runs during `npm link` / `npm install -g .` but is a no-op during normal `npm install` (deps already present) - Add `bundleDependencies` for tarball-based install paths (npm pack/publish) Fixes regression from NVIDIA#1051. * fix(install): always install production deps in prepare script Address review feedback: remove conditional node_modules/p-retry check and unconditionally run npm install --omit=dev --ignore-scripts. This is a no-op when deps are already present and ensures they resolve in the global install case. --------- Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Summary
openshell gateway startup to 3 times with exponential backoff during onboard, usingp-retrydestroyGateway()(including Docker volume cleanup) before retrynemoclaw status) remains single-attempt to keep CLI responsivenessopenshell doctortroubleshooting commandsFixes #1050
Root Cause
On first-run environments (Horde VMs, fresh Ubuntu 24.04), the embedded k3s inside the OpenShell gateway can exceed the gateway's internal health-check timeout during initialization. The first attempt fails, but a second attempt typically succeeds because container images are cached and cgroup state is cleaner. Previously, NemoClaw made a single attempt and gave up immediately.
Upstream tracking: NVIDIA/OpenShell#433
Changes
bin/lib/onboard.js— Replace single-shot gateway start withp-retry(3 attempts, exponential backoff). Only the onboard path retries; the recovery path (startGatewayForRecovery) stays single-attempt.package.json— Addp-retry@^4.6.2as a direct dependency (v4 is CJS-compatible).test/gateway-cleanup.test.js— Update pattern-matching test to reflect thatdestroyGateway()now lives inside the retry callback.Testing
npm test— all unit tests pass including cli.test.js (recovery path timing preserved)Summary by CodeRabbit
Improvements
Chores
Tests