fix: gate sandbox registration on creation readiness and guard WSL name truncation by brianwtaylor · Pull Request #229 · NVIDIA/NemoClaw

brianwtaylor · 2026-03-17T17:57:19Z

Summary

- Remove the `awk` pipe from the sandbox create command that silently swallowed non-zero exit codes
- Add a readiness polling loop that waits for the sandbox to reach `Ready` state in k3s before registering locally
- Clean up orphaned sandboxes on readiness timeout so retries with the same name work cleanly
- Move build-context cleanup before the exit-code check so temp files are always cleaned up
- Guard against truncated sandbox names on WSL with RFC 1123 validation (including 63-char limit) in `applyPreset()`
- Extract `isSandboxReady()` as a shared function so tests validate the production parser
- Clean up stale NemoClaw gateway and port forward before the preflight port check so re-running `nemoclaw onboard` works without manual intervention

Fixes #21
Fixes #22
Fixes #140
Fixes #152
Closes #397

Root cause

`createSandbox()` piped `openshell sandbox create` through `awk` to deduplicate log lines. In bash, the exit status of a pipeline is the status of the last command — `awk`, which always exits 0. If sandbox creation fails, `run()` sees exit 0 and continues to register a phantom sandbox.
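To illustrate the masking, here is a minimal Node sketch that compares the exit status of a failing command run directly and through a pipe. It is not the project's `run()` helper, whose implementation is not shown in this thread. It assumes a POSIX `sh` is on the PATH, and it uses `cat` as a stand-in for the `awk` filter; the `exitStatus` name is made up for the example.

```javascript
const { spawnSync } = require("node:child_process");

// Run a shell command line and return its exit status.
// (Hypothetical helper for this example, not the project's run().)
function exitStatus(commandLine) {
  return spawnSync("sh", ["-c", commandLine], { encoding: "utf8" }).status;
}

// `false` always exits with status 1.
const direct = exitStatus("false");

// Piped through `cat` (standing in for awk), the pipeline reports
// the status of the last command instead of the failing one.
const piped = exitStatus("false | cat");

console.log({ direct, piped });
```

On a POSIX shell the direct run reports 1 and the piped run reports 0, which is why removing the pipe lets the real status reach the caller.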

WSL sandbox name truncation (#21)

On WSL, hyphenated sandbox names like `my-assistant` can be truncated during shell argument parsing (e.g. to `m`), causing `openshell policy set --wait` to receive a garbage name. Added RFC 1123 validation in `applyPreset()` as defense-in-depth and quoted sandbox names in error output.
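A rough sketch of such a guard follows. The regular expression and the 1 to 63 character bound are the ones quoted later in the review thread; the `isValidSandboxName` helper name and the sample inputs are assumptions for illustration, not the code in `bin/lib/policies.js`.

```javascript
// Pattern quoted in the review thread for RFC 1123-style labels.
const RFC1123_LABEL = /^[a-z0-9]([a-z0-9-]*[a-z0-9])?$/;

// Hypothetical helper: true only for 1-63 character lowercase labels
// that start and end with a letter or digit.
function isValidSandboxName(name) {
  return (
    typeof name === "string" &&
    name.length >= 1 &&
    name.length <= 63 &&
    RFC1123_LABEL.test(name)
  );
}

console.log(isValidSandboxName("my-assistant"));
console.log(isValidSandboxName("-assistant"));
console.log(isValidSandboxName("a".repeat(64)));
```

Note that a name truncated to a single valid character, such as `m`, still matches this pattern, so a check like this catches empty or malformed names rather than every possible truncation.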

Stale port conflicts on re-onboard (#397)

A previous `nemoclaw onboard` session leaves the OpenShell gateway container (port 8080) and port forward (18789) running after exit. The next `onboard` invocation fails at the preflight port check before reaching the existing cleanup code. Moved detection and teardown of a NemoClaw-owned gateway into `preflight()`, before the port availability check.

Test plan

Automated Tests

```
node --test test/onboard-readiness.test.js
```

23 tests across 3 suites:

- 11 sandbox readiness parsing (ANSI stripping, exact match, multi-sandbox, tab-separated)
- 6 WSL regression (hyphenated names, truncation detection, command quoting, RFC 1123 validation)
- 6 stale gateway detection (real output, ANSI output, null/empty/error, non-NemoClaw gateway)

Hardware Validation

| Environment | Full Suite | PR Tests |
| --- | --- | --- |
| DGX Spark (aarch64, Ubuntu 24.04, Node 22) | 162/167 pass (5 pre-existing) | 23/23 pass |
| Windows WSL2 (x86_64, Ubuntu 22.04, Docker Desktop, Node 22) | 165/167 pass (2 pre-existing) | 23/23 pass |

coderabbitai · 2026-03-18T00:08:51Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

Replaces piped/awk sandbox-create detection with direct openshell exit-status checks, adds ANSI-stripped readiness polling (≈60s), removes orphaned sandboxes on timeout, moves build-context cleanup earlier, defers registry registration/policy apply until sandbox is Ready, and adds validation and unit tests for name/readiness handling.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Onboarding flow `bin/lib/onboard.js` | Use direct `openshell sandbox create` with captured exit status; remove ignoreError masking; move build-context cleanup earlier; poll `openshell sandbox list` (strip ANSI) for exact-name `Ready` up to ~60s; delete orphaned sandbox on timeout; register sandbox and apply policies only after readiness; surface failures with guidance. |
| Readiness & regression tests `test/onboard-readiness.test.js` | Add unit tests for `isSandboxReady` parsing (ANSI stripping, exact-first-column match, tabs, multiple entries) and regression tests protecting against WSL/truncated-name issues and policy-command quoting. |
| Preset validation `bin/lib/policies.js` | Validate `sandboxName` in `applyPreset` (non-empty, ≤63 chars, RFC1123-style lowercase alnum/hyphen) and throw descriptive error on invalid/truncated names before applying presets. |

Sequence Diagram(s)

sequenceDiagram
    participant CLI as NemoClaw CLI
    participant OS as OpenShell CLI
    participant REG as Local Registry
    participant FS as Build Context

    CLI->>OS: openshell sandbox create ... (capture exit status)
    alt exit 0
        CLI->>FS: remove build context
        CLI->>OS: poll "openshell sandbox list" (strip ANSI) up to 60s
        alt sandbox Ready (exact name, not "NotReady")
            CLI->>REG: register sandbox
            CLI->>OS: apply policies / port-forward
        else timeout / not Ready
            CLI->>OS: openshell sandbox delete <name> || true
            CLI-->>CLI: log diagnostics and exit non-zero
        end
    else non-zero exit
        CLI-->>CLI: surface error, log diagnostics, exit non-zero
    end
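
The ordering in the diagram can be sketched roughly as follows. Every name here (`createAndRegister` and the injected `create`, `listSandboxes`, `isReady`, `register`, `deleteSandbox`, `sleep`) is hypothetical, chosen so the flow can be exercised with fakes and without OpenShell; it is not the code in `bin/lib/onboard.js`. The default of 30 polls with a 2 second pause mirrors the roughly 60 s budget described in the walkthrough.

```javascript
// Hypothetical orchestration: all dependencies are injected so the ordering
// can be tested without a real OpenShell install.
function createAndRegister(deps) {
  const { name, create, listSandboxes, isReady, register, deleteSandbox, sleep } = deps;
  const attempts = deps.attempts || 30; // 30 polls x 2 s is about 60 s

  // A non-zero create status stops everything before registration.
  if (create(name) !== 0) {
    return { ok: false, reason: "create-failed" };
  }

  for (let i = 0; i < attempts; i++) {
    if (isReady(listSandboxes(), name)) {
      register(name); // only reached once the sandbox reports Ready
      return { ok: true };
    }
    sleep(2);
  }

  // Timed out: remove the orphan so a retry with the same name can succeed.
  const orphanDeleted = deleteSandbox(name);
  return { ok: false, reason: "timeout", orphanDeleted };
}

// Usage with fakes: the sandbox becomes Ready on the third poll.
let polls = 0;
const registered = [];
const result = createAndRegister({
  name: "demo",
  create: () => 0,
  listSandboxes: () => (++polls >= 3 ? "demo Ready" : "demo Provisioning"),
  isReady: (output, n) => output === `${n} Ready`,
  register: (n) => registered.push(n),
  deleteSandbox: () => true,
  sleep: () => {}, // no real waiting in this demo
});
console.log(result, registered);
```

Injecting the dependencies also makes it straightforward to assert that registration never happens on the timeout or create-failure paths.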

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

"I hopped through ANSI trails and chased the stray,
Matched names exactly, chased the ghosts away.
I cleaned the build, then waited for the light—
When Ready gleamed, I set the policies right.
🐇 Ready, steady, bound!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | Title accurately summarizes the main changes: gating sandbox registration on readiness and adding name validation guards for WSL truncation. |
| Linked Issues check | ✅ Passed | All objectives from `#21`, `#22`, `#140`, `#152` are met: readiness polling [`#140`, `#152`], failure surfacing [`#22`], orphaned sandbox cleanup [`#22`], RFC1123 validation [`#21`], and regression test coverage. |
| Out of Scope Changes check | ✅ Passed | All changes directly address linked issue objectives: readiness gate, failure handling, cleanup, name validation, and test coverage. No unrelated or tangential modifications detected. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

test/onboard-readiness.test.js (1)
7-16: Test the production readiness logic, not a copied helper.

isSandboxReady() is reimplemented locally, so this suite can stay green even if bin/lib/onboard.js drifts or if registry.registerSandbox() is accidentally moved above the wait loop again. Extract the parser from onboarding and import it here; then add one mocked create/onboard test that asserts registration happens only after Ready and never on timeout.

Also applies to: 18-80
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/onboard-readiness.test.js` around lines 7 - 16, Replace the local
isSandboxReady() copy with the canonical parser exported from
bin/lib/onboard.js: export the readiness-parsing function (e.g.,
parseSandboxReadiness or isSandboxReady) from onboard.js, import that function
into test/onboard-readiness.test.js and remove the duplicated implementation;
then add a mocked integration test that uses the real create/onboard flow with a
fake registry (spy/stub registry.registerSandbox) to assert registerSandbox is
called only after the parser observes "Ready" in output and is never called when
the parser times out (simulate output with and without "Ready" and assert
registerSandbox call counts/ordering accordingly).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 352-357: When the ready check fails for sandboxName (the branch
where ready is false and you call process.exit(1)), ensure we clean up the
orphaned remote sandbox so future runs that call registry.getSandbox() won't be
blocked: either invoke the OpenShell deletion flow for sandboxName before
exiting (the same mechanism used elsewhere to remove sandboxes) or at minimum
emit a concrete cleanup command in the error output that users can copy (e.g.,
"openshell sandbox delete <sandboxName>"); perform the delete call and only exit
after it completes or log the explicit delete command along with the existing
log lines so the orphan is removed or easily cleaned up.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8473049c-d6e3-48d6-9737-799f92480a0d

📥 Commits

Reviewing files that changed from the base of the PR and between 2b5febe and 04b0d98.

📒 Files selected for processing (2)

bin/lib/onboard.js
test/onboard-readiness.test.js

coderabbitai

🧹 Nitpick comments (1)

bin/lib/onboard.js (1)
349-349: Minor: Consider using the existing sleep() helper.

Line 349 inlines require("child_process").spawnSync("sleep", ["2"]) while a sleep() helper already exists at line 75. Using sleep(2) would improve consistency.
♻️ Suggested change
-    require("child_process").spawnSync("sleep", ["2"]);
+    sleep(2);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` at line 349, Replace the inline blocking call
require("child_process").spawnSync("sleep", ["2"]) with the existing sleep
helper (sleep(2)) to maintain consistency; locate the inline call in onboard.js
(the spawnSync usage) and swap it to call the sleep function defined earlier
(the sleep() helper at line ~75) and remove the unnecessary require usage so the
module uses the shared helper instead of a new child_process invocation.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e3933dd4-7f69-47f7-9405-9f33bb7ec3cb

📥 Commits

Reviewing files that changed from the base of the PR and between 04b0d98 and 0d01452.

📒 Files selected for processing (3)

bin/lib/onboard.js
bin/lib/policies.js
test/onboard-readiness.test.js

🚧 Files skipped from review as they are similar to previous changes (1)

test/onboard-readiness.test.js

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/policies.js`:
- Around line 87-94: The guard for sandboxName currently checks shape with the
regex but not length, so add an explicit max-length check (RFC 1123 limit of 63
characters) before/alongside the regex validation: check sandboxName.length <=
63 and throw the same Error (or an updated message) if length exceeds 63;
reference the sandboxName variable and the existing regex validation block to
locate where to add this deterministic length check and update the error text to
mention the 63-character limit.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: dafb3791-daad-4247-b9bf-146ca6b4def2

📥 Commits

Reviewing files that changed from the base of the PR and between 0d01452 and 3d4bbb0.

📒 Files selected for processing (2)

bin/lib/policies.js
test/onboard-readiness.test.js

🚧 Files skipped from review as they are similar to previous changes (1)

test/onboard-readiness.test.js

…ures

The sandbox create command was piped through awk to deduplicate log lines. In bash, the exit status of a pipeline is the status of the last command (awk, always 0), so creation failures were silently swallowed. NemoClaw then registered a phantom sandbox in ~/.nemoclaw/sandboxes.json that caused "sandbox not found" on every subsequent connect/status call.

This is the root cause of the WSL2 + Docker Desktop failures reported in #140 and #152: sandbox creation fails due to Docker networking issues, but onboarding completes as if it succeeded.

Three changes:

1. Remove the awk pipe so the real exit code flows through to run()
2. Poll openshell sandbox list for Ready state before registering (matches the gateway health check pattern at lines 121-132)
3. Move build-context cleanup before the exit-code check so temp files are always cleaned up, even on failure

Signed-off-by: Brian Taylor <brian.taylor818@gmail.com>

includes("Ready") falsely matched "NotReady" because "Ready" is a substring. Use a word-boundary regex with a NotReady exclusion so sandboxes stuck in error states are not registered as healthy. Also remove the off-by-one break at i=29 so the loop sleeps the full 60s before timing out. Signed-off-by: Brian Taylor <brian.taylor818@gmail.com>

Use column-based matching (split on whitespace, check cols[0]) instead of substring includes(). Prevents false positives when one sandbox name is a prefix of another (e.g. "my" matching "my-assistant"). Signed-off-by: Brian Taylor <brian.taylor818@gmail.com>
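
The matching rules from these commits (strip ANSI codes, compare the first column exactly, and do not treat `NotReady` as `Ready`) can be sketched as below. This is not the exported `isSandboxReady()` from `bin/lib/onboard.js`, which is not shown in this thread. The sample `openshell sandbox list` output is an assumed layout, with whitespace-separated columns and the sandbox name first, and the sketch uses exact token comparison for the status rather than the word-boundary regex mentioned earlier.

```javascript
// Matches ANSI color/style escape sequences such as "\x1b[32m".
const ANSI_PATTERN = /\x1b\[[0-9;]*m/g;

// Sketch of a readiness check over assumed list output:
// one sandbox per line, columns separated by spaces or tabs, name first.
function isSandboxReady(listOutput, sandboxName) {
  const clean = String(listOutput || "").replace(ANSI_PATTERN, "");
  return clean.split("\n").some((line) => {
    const cols = line.trim().split(/\s+/);
    // Exact name match in column 0, exact "Ready" token elsewhere,
    // so "my" does not match "my-assistant" and "NotReady" does not match.
    return cols[0] === sandboxName && cols.slice(1).includes("Ready");
  });
}

// Assumed sample output, for illustration only.
const sample = "NAME          STATUS\nmy-assistant  Ready\nother         NotReady\n";
console.log(isSandboxReady(sample, "my-assistant"));
console.log(isSandboxReady(sample, "my"));
console.log(isSandboxReady(sample, "other"));
```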

Signed-off-by: Brian Taylor <brian.taylor818@gmail.com>

On WSL, hyphenated sandbox names like "my-assistant" can be truncated to "m" during shell argument parsing, causing "sandbox not found" failures when applying policy presets.

- Add RFC 1123 validation in applyPreset() to catch truncated names early with a clear error message
- Quote sandbox name in error output (was unquoted on line 356)
- Add 6 WSL-specific regression tests covering hyphenated names, multi-hyphen names, truncation detection, and command quoting

When the sandbox is created but never reaches Ready within 60s, delete it before exiting so the next onboard retry with the same name doesn't fail on "sandbox already exists".

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

bin/lib/onboard.js (2)

380-387: ⚠️ Potential issue | 🟡 Minor

RFC 1123 validation is missing the label-length bound.

The current check allows overly long names; RFC 1123 labels should be 1–63 chars. Catch this early to avoid later openshell failure.

Proposed fix

-  if (!/^[a-z0-9]([a-z0-9-]*[a-z0-9])?$/.test(sandboxName)) {
+  const isValidSandboxName =
+    sandboxName.length >= 1 &&
+    sandboxName.length <= 63 &&
+    /^[a-z0-9]([a-z0-9-]*[a-z0-9])?$/.test(sandboxName);
+  if (!isValidSandboxName) {
     console.error(`  Invalid sandbox name: '${sandboxName}'`);
-    console.error("  Names must be lowercase, contain only letters, numbers, and hyphens,");
+    console.error("  Names must be 1-63 chars, lowercase, contain only letters, numbers, and hyphens,");
     console.error("  and must start and end with a letter or number.");
     process.exit(1);
   }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 380 - 387, The RFC1123 validation for
sandboxName currently only checks characters but not label length; update the
validation in the sandboxName check (the if that tests
/^[a-z0-9]([a-z0-9-]*[a-z0-9])?$/) to enforce labels are 1–63 characters long by
either augmenting the regex to include the {1,63} bound or by adding an explicit
length check (ensure sandboxName.length >=1 && sandboxName.length <=63) before
calling process.exit(1), and keep the same error messages for consistency.

433-443: ⚠️ Potential issue | 🟠 Major

Environment values are shell-interpolated without quoting.

Lines 433-436 and Line 443 currently pass raw env values into a shell command. If values contain spaces or shell metacharacters, command behavior can break or be injected.

Proposed fix

   const chatUiUrl = process.env.CHAT_UI_URL || 'http://127.0.0.1:18789';
   const envArgs = [`CHAT_UI_URL=${chatUiUrl}`];
   if (process.env.NVIDIA_API_KEY) {
     envArgs.push(`NVIDIA_API_KEY=${process.env.NVIDIA_API_KEY}`);
   }
+  const quotedEnvArgs = envArgs.map((entry) => shellQuote(entry));
@@
   const createResult = run(
-    `openshell sandbox create ${createArgs.join(" ")} -- env ${envArgs.join(" ")} nemoclaw-start 2>&1`,
+    `openshell sandbox create ${createArgs.join(" ")} -- env ${quotedEnvArgs.join(" ")} nemoclaw-start 2>&1`,
     { ignoreError: true }
   );
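
The suggestion above calls a `shellQuote` helper without showing it. One common way to write such a helper for POSIX shells is single-quote wrapping, sketched below as an assumption rather than the project's implementation. The environment values are placeholders for illustration.

```javascript
// Wrap a value in single quotes for a POSIX shell. Embedded single quotes
// are closed, escaped, and reopened: ' becomes '\''
function shellQuote(value) {
  return "'" + String(value).replace(/'/g, "'\\''") + "'";
}

// Placeholder values; the second contains a space and a shell metacharacter.
const envArgs = [
  "CHAT_UI_URL=http://127.0.0.1:18789",
  "NVIDIA_API_KEY=abc def; echo oops",
];
const quotedEnvArgs = envArgs.map((entry) => shellQuote(entry));
console.log(quotedEnvArgs.join(" "));
```

Passing an argument array to the process API instead of building a command string, as the review also suggests, avoids the need for quoting altogether.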

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 433 - 443, The env values (chatUiUrl and
process.env.NVIDIA_API_KEY) are being interpolated directly into the shell
command via envArgs and createArgs passed to run(), which allows spaces or shell
metacharacters to break or inject commands; update the code to properly
shell-escape or quote each environment value before joining into envArgs (or,
better, switch to invoking run/openshell with an argument array instead of a
single interpolated command) so that envArgs contains safely-escaped entries
(handle existing quotes by escaping single quotes or using a proven sh-escaping
helper) and then call run() with the escaped/array form so createResult receives
a safe command string/argv list.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 483-487: The timeout path currently always prints that the
orphaned sandbox was removed despite suppressing errors in the run call; modify
the handling around the run(`openshell sandbox delete "${sandboxName}" ...`, {
ignoreError: true }) invocation to capture its success/failure (e.g., check its
exit status or thrown error) and branch the messages accordingly: if delete
succeeded print the existing "has been removed" message, otherwise print that
automatic deletion failed and include a suggested manual cleanup command (e.g.,
`openshell sandbox delete "${sandboxName}"`) and where to find logs. Update
references to run and sandboxName in onboard.js accordingly so the output
accurately reflects the deletion result.

In `@test/onboard-readiness.test.js`:
- Around line 12-18: The test contains a duplicated parser (isSandboxReady) that
can drift from the production onboarding parser; remove this local copy and
instead import and use the shared readiness parser exported from the onboarding
module (the same parser used in bin/lib/onboard.js). Update the onboarding
module to export its readiness-parsing function (the function used to parse
readiness lines in onboard.js) if not already exported, then replace the test's
isSandboxReady implementation with an import of that exported helper and call it
in the test so production and test parsing stay in sync.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4ba383ff-b3ab-4241-87e3-8c68ec125569

📥 Commits

Reviewing files that changed from the base of the PR and between 682a94f and 6d617be.

📒 Files selected for processing (3)

bin/lib/onboard.js
bin/lib/policies.js
test/onboard-readiness.test.js

✅ Files skipped from review due to trivial changes (1)

bin/lib/policies.js

…saging

- Extract readiness parser from inline code to exported isSandboxReady() so tests validate the production code, not a duplicated copy
- Branch cleanup messaging on delete result: report manual cleanup command if the orphan delete fails

A previous onboard session may leave the OpenShell gateway container and port forward running, causing port 8080/18789 conflicts on the next invocation. Detect a NemoClaw-owned gateway before the port availability check and tear it down automatically. Closes #397

Runs `nemoclaw onboard` three times without cleanup between runs, verifying that each subsequent onboard recovers automatically from stale gateway, port forward, and registry entries left behind. Regression test for #397.

…me truncation (NVIDIA#229)

* fix: gate sandbox registration on readiness and surface creation failures

  The sandbox create command was piped through awk to deduplicate log lines. In bash, the exit status of a pipeline is the status of the last command (awk, always 0), so creation failures were silently swallowed. NemoClaw then registered a phantom sandbox in ~/.nemoclaw/sandboxes.json that caused "sandbox not found" on every subsequent connect/status call.

  This is the root cause of the WSL2 + Docker Desktop failures reported in NVIDIA#140 and NVIDIA#152: sandbox creation fails due to Docker networking issues, but onboarding completes as if it succeeded.

  Three changes:

  1. Remove the awk pipe so the real exit code flows through to run()
  2. Poll openshell sandbox list for Ready state before registering (matches the gateway health check pattern at lines 121-132)
  3. Move build-context cleanup before the exit-code check so temp files are always cleaned up, even on failure

  Signed-off-by: Brian Taylor <brian.taylor818@gmail.com>

* fix: use word-boundary match for Ready status and fix timeout

  includes("Ready") falsely matched "NotReady" because "Ready" is a substring. Use a word-boundary regex with a NotReady exclusion so sandboxes stuck in error states are not registered as healthy. Also remove the off-by-one break at i=29 so the loop sleeps the full 60s before timing out.

  Signed-off-by: Brian Taylor <brian.taylor818@gmail.com>

* fix: exact-match sandbox name in readiness check

  Use column-based matching (split on whitespace, check cols[0]) instead of substring includes(). Prevents false positives when one sandbox name is a prefix of another (e.g. "my" matching "my-assistant").

  Signed-off-by: Brian Taylor <brian.taylor818@gmail.com>

* test: add readiness gate parsing tests for sandbox creation

  Signed-off-by: Brian Taylor <brian.taylor818@gmail.com>

* fix: guard against truncated sandbox names on WSL (fixes NVIDIA#21)

  On WSL, hyphenated sandbox names like "my-assistant" can be truncated to "m" during shell argument parsing, causing "sandbox not found" failures when applying policy presets.

  - Add RFC 1123 validation in applyPreset() to catch truncated names early with a clear error message
  - Quote sandbox name in error output (was unquoted on line 356)
  - Add 6 WSL-specific regression tests covering hyphenated names, multi-hyphen names, truncation detection, and command quoting

* fix: clean up orphaned sandbox on readiness timeout

  When the sandbox is created but never reaches Ready within 60s, delete it before exiting so the next onboard retry with the same name doesn't fail on "sandbox already exists".

* chore: remove issue references from code comments

* fix: enforce RFC 1123 63-char limit in sandbox name validation

* fix: extract isSandboxReady as shared function and branch cleanup messaging

  - Extract readiness parser from inline code to exported isSandboxReady() so tests validate the production code, not a duplicated copy
  - Branch cleanup messaging on delete result: report manual cleanup command if the orphan delete fails

* fix: clean up stale gateway and port forward before preflight check

  A previous onboard session may leave the OpenShell gateway container and port forward running, causing port 8080/18789 conflicts on the next invocation. Detect a NemoClaw-owned gateway before the port availability check and tear it down automatically.

  Closes NVIDIA#397

* test: add double-onboard e2e test for stale state recovery

  Runs `nemoclaw onboard` three times without cleanup between runs, verifying that each subsequent onboard recovers automatically from stale gateway, port forward, and registry entries left behind. Regression test for NVIDIA#397.

---------

Signed-off-by: Brian Taylor <brian.taylor818@gmail.com>

brianwtaylor mentioned this pull request Mar 17, 2026

[WSL2 + Docker Desktop] Sandbox not found after GPU allocation failure prevents policy setup #140

Closed

brianwtaylor force-pushed the fix/sandbox-creation-readiness-gate branch from e4b1dad to 5e5847d Compare March 17, 2026 18:01

mattezell mentioned this pull request Mar 17, 2026

fix: skip --gpu on WSL2 where GPU passthrough to k3s is unsupported #209

Closed

ksapru mentioned this pull request Mar 17, 2026

feat: add lightweight observability and metrics service #230

Open

brianwtaylor mentioned this pull request Mar 17, 2026

Do not record a sandbox as created when createSandbox() fails during onboarding #22

Closed

brianwtaylor force-pushed the fix/sandbox-creation-readiness-gate branch from bda881b to 588e54b Compare March 18, 2026 00:08

brianwtaylor force-pushed the fix/sandbox-creation-readiness-gate branch from 588e54b to 04b0d98 Compare March 18, 2026 20:36

coderabbitai bot reviewed Mar 18, 2026

Comment thread bin/lib/onboard.js

brianwtaylor changed the title ~~fix: gate sandbox registration on creation readiness~~ fix: gate sandbox registration on creation readiness and guard WSL name truncation Mar 18, 2026

coderabbitai bot reviewed Mar 18, 2026

Comment thread bin/lib/policies.js

wscurran added the Platform: Windows/WSL Support for Windows Subsystem for Linux label Mar 18, 2026

brianwtaylor added 8 commits March 18, 2026 18:12

test: add readiness gate parsing tests for sandbox creation

7a01e35

Signed-off-by: Brian Taylor <brian.taylor818@gmail.com>

fix: clean up orphaned sandbox on readiness timeout

7357372

When the sandbox is created but never reaches Ready within 60s, delete it before exiting so the next onboard retry with the same name doesn't fail on "sandbox already exists".

chore: remove issue references from code comments

8d4a23b

fix: enforce RFC 1123 63-char limit in sandbox name validation

6d617be

brianwtaylor force-pushed the fix/sandbox-creation-readiness-gate branch from 682a94f to 6d617be Compare March 19, 2026 01:12

coderabbitai bot reviewed Mar 19, 2026

Comment thread bin/lib/onboard.js Outdated

Comment thread test/onboard-readiness.test.js Outdated

brianwtaylor added 2 commits March 18, 2026 18:34

brianwtaylor force-pushed the fix/sandbox-creation-readiness-gate branch from 2dd8ab0 to 6a1da48 Compare March 19, 2026 08:51

test: add double-onboard e2e test for stale state recovery

c43c66a

Runs `nemoclaw onboard` three times without cleanup between runs, verifying that each subsequent onboard recovers automatically from stale gateway, port forward, and registry entries left behind. Regression test for #397.

brianwtaylor force-pushed the fix/sandbox-creation-readiness-gate branch from 17e574d to c43c66a Compare March 19, 2026 10:14

kjw3 self-assigned this Mar 19, 2026

kjw3 approved these changes Mar 19, 2026

kjw3 merged commit 3dafa3a into NVIDIA:main Mar 19, 2026
3 checks passed

miyoungc mentioned this pull request Mar 20, 2026

fix: make onboard robust when rerun after a failed install #84

Closed

3 tasks

mafueee pushed a commit to mafueee/NemoClaw that referenced this pull request Mar 28, 2026

fix(cluster): skip DNS probe for IP-literal registry hosts (NVIDIA#229)

1d33d4c

kjw3 merged 11 commits into NVIDIA:main from brianwtaylor:fix/sandbox-creation-readiness-gate
