fix: gate sandbox registration on creation readiness and guard WSL name truncation#229
Conversation
e4b1dad to
5e5847d
Compare
bda881b to
588e54b
Compare
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughReplaces piped/awk sandbox-create detection with direct openshell exit-status checks, adds ANSI-stripped readiness polling (≈60s), removes orphaned sandboxes on timeout, moves build-context cleanup earlier, defers registry registration/policy apply until sandbox is Ready, and adds validation and unit tests for name/readiness handling. Changes
Sequence Diagram(s)sequenceDiagram
participant CLI as NemoClaw CLI
participant OS as OpenShell CLI
participant REG as Local Registry
participant FS as Build Context
CLI->>OS: openshell sandbox create ... (capture exit status)
alt exit 0
CLI->>FS: remove build context
CLI->>OS: poll "openshell sandbox list" (strip ANSI) up to 60s
alt sandbox Ready (exact name, not "NotReady")
CLI->>REG: register sandbox
CLI->>OS: apply policies / port-forward
else timeout / not Ready
CLI->>OS: openshell sandbox delete <name> || true
CLI-->>CLI: log diagnostics and exit non-zero
end
else non-zero exit
CLI-->>CLI: surface error, log diagnostics, exit non-zero
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Comment |
588e54b to
04b0d98
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
test/onboard-readiness.test.js (1)
7-16: Test the production readiness logic, not a copied helper.
isSandboxReady()is reimplemented locally, so this suite can stay green even ifbin/lib/onboard.jsdrifts or ifregistry.registerSandbox()is accidentally moved above the wait loop again. Extract the parser from onboarding and import it here; then add one mocked create/onboard test that asserts registration happens only afterReadyand never on timeout.Also applies to: 18-80
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/onboard-readiness.test.js` around lines 7 - 16, Replace the local isSandboxReady() copy with the canonical parser exported from bin/lib/onboard.js: export the readiness-parsing function (e.g., parseSandboxReadiness or isSandboxReady) from onboard.js, import that function into test/onboard-readiness.test.js and remove the duplicated implementation; then add a mocked integration test that uses the real create/onboard flow with a fake registry (spy/stub registry.registerSandbox) to assert registerSandbox is called only after the parser observes "Ready" in output and is never called when the parser times out (simulate output with and without "Ready" and assert registerSandbox call counts/ordering accordingly).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/onboard.js`:
- Around line 352-357: When the ready check fails for sandboxName (the branch
where ready is false and you call process.exit(1)), ensure we clean up the
orphaned remote sandbox so future runs that call registry.getSandbox() won't be
blocked: either invoke the OpenShell deletion flow for sandboxName before
exiting (the same mechanism used elsewhere to remove sandboxes) or at minimum
emit a concrete cleanup command in the error output that users can copy (e.g.,
"openshell sandbox delete <sandboxName>"); perform the delete call and only exit
after it completes or log the explicit delete command along with the existing
log lines so the orphan is removed or easily cleaned up.
---
Nitpick comments:
In `@test/onboard-readiness.test.js`:
- Around line 7-16: Replace the local isSandboxReady() copy with the canonical
parser exported from bin/lib/onboard.js: export the readiness-parsing function
(e.g., parseSandboxReadiness or isSandboxReady) from onboard.js, import that
function into test/onboard-readiness.test.js and remove the duplicated
implementation; then add a mocked integration test that uses the real
create/onboard flow with a fake registry (spy/stub registry.registerSandbox) to
assert registerSandbox is called only after the parser observes "Ready" in
output and is never called when the parser times out (simulate output with and
without "Ready" and assert registerSandbox call counts/ordering accordingly).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8473049c-d6e3-48d6-9737-799f92480a0d
📒 Files selected for processing (2)
bin/lib/onboard.jstest/onboard-readiness.test.js
There was a problem hiding this comment.
🧹 Nitpick comments (1)
bin/lib/onboard.js (1)
349-349: Minor: Consider using the existingsleep()helper.Line 349 inlines
require("child_process").spawnSync("sleep", ["2"])while asleep()helper already exists at line 75. Usingsleep(2)would improve consistency.♻️ Suggested change
- require("child_process").spawnSync("sleep", ["2"]); + sleep(2);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard.js` at line 349, Replace the inline blocking call require("child_process").spawnSync("sleep", ["2"]) with the existing sleep helper (sleep(2)) to maintain consistency; locate the inline call in onboard.js (the spawnSync usage) and swap it to call the sleep function defined earlier (the sleep() helper at line ~75) and remove the unnecessary require usage so the module uses the shared helper instead of a new child_process invocation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@bin/lib/onboard.js`:
- Line 349: Replace the inline blocking call
require("child_process").spawnSync("sleep", ["2"]) with the existing sleep
helper (sleep(2)) to maintain consistency; locate the inline call in onboard.js
(the spawnSync usage) and swap it to call the sleep function defined earlier
(the sleep() helper at line ~75) and remove the unnecessary require usage so the
module uses the shared helper instead of a new child_process invocation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e3933dd4-7f69-47f7-9405-9f33bb7ec3cb
📒 Files selected for processing (3)
bin/lib/onboard.jsbin/lib/policies.jstest/onboard-readiness.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
- test/onboard-readiness.test.js
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/policies.js`:
- Around line 87-94: The guard for sandboxName currently checks shape with the
regex but not length, so add an explicit max-length check (RFC 1123 limit of 63
characters) before/alongside the regex validation: check sandboxName.length <=
63 and throw the same Error (or an updated message) if length exceeds 63;
reference the sandboxName variable and the existing regex validation block to
locate where to add this deterministic length check and update the error text to
mention the 63-character limit.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: dafb3791-daad-4247-b9bf-146ca6b4def2
📒 Files selected for processing (2)
bin/lib/policies.jstest/onboard-readiness.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
- test/onboard-readiness.test.js
…ures The sandbox create command was piped through awk to deduplicate log lines. In bash, the exit status of a pipeline is the status of the last command (awk, always 0), so creation failures were silently swallowed. NemoClaw then registered a phantom sandbox in ~/.nemoclaw/sandboxes.json that caused "sandbox not found" on every subsequent connect/status call. This is the root cause of the WSL2 + Docker Desktop failures reported in #140 and #152 — sandbox creation fails due to Docker networking issues, but onboarding completes as if it succeeded. Three changes: 1. Remove the awk pipe so the real exit code flows through to run() 2. Poll openshell sandbox list for Ready state before registering (matches the gateway health check pattern at lines 121-132) 3. Move build-context cleanup before the exit-code check so temp files are always cleaned up, even on failure Signed-off-by: Brian Taylor <brian.taylor818@gmail.com>
includes("Ready") falsely matched "NotReady" because "Ready" is a
substring. Use a word-boundary regex with a NotReady exclusion so
sandboxes stuck in error states are not registered as healthy.
Also remove the off-by-one break at i=29 so the loop sleeps the
full 60s before timing out.
Signed-off-by: Brian Taylor <brian.taylor818@gmail.com>
Use column-based matching (split on whitespace, check cols[0]) instead of substring includes(). Prevents false positives when one sandbox name is a prefix of another (e.g. "my" matching "my-assistant"). Signed-off-by: Brian Taylor <brian.taylor818@gmail.com>
Signed-off-by: Brian Taylor <brian.taylor818@gmail.com>
On WSL, hyphenated sandbox names like "my-assistant" can be truncated to "m" during shell argument parsing, causing "sandbox not found" failures when applying policy presets. - Add RFC 1123 validation in applyPreset() to catch truncated names early with a clear error message - Quote sandbox name in error output (was unquoted on line 356) - Add 6 WSL-specific regression tests covering hyphenated names, multi-hyphen names, truncation detection, and command quoting
When the sandbox is created but never reaches Ready within 60s, delete it before exiting so the next onboard retry with the same name doesn't fail on "sandbox already exists".
682a94f to
6d617be
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
bin/lib/onboard.js (2)
380-387:⚠️ Potential issue | 🟡 MinorRFC 1123 validation is missing the label-length bound.
The current check allows overly long names; RFC 1123 labels should be 1–63 chars. Catch this early to avoid later
openshellfailure.Proposed fix
- if (!/^[a-z0-9]([a-z0-9-]*[a-z0-9])?$/.test(sandboxName)) { + const isValidSandboxName = + sandboxName.length >= 1 && + sandboxName.length <= 63 && + /^[a-z0-9]([a-z0-9-]*[a-z0-9])?$/.test(sandboxName); + if (!isValidSandboxName) { console.error(` Invalid sandbox name: '${sandboxName}'`); - console.error(" Names must be lowercase, contain only letters, numbers, and hyphens,"); + console.error(" Names must be 1-63 chars, lowercase, contain only letters, numbers, and hyphens,"); console.error(" and must start and end with a letter or number."); process.exit(1); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard.js` around lines 380 - 387, The RFC1123 validation for sandboxName currently only checks characters but not label length; update the validation in the sandboxName check (the if that tests /^[a-z0-9]([a-z0-9-]*[a-z0-9])?$/) to enforce labels are 1–63 characters long by either augmenting the regex to include the {1,63} bound or by adding an explicit length check (ensure sandboxName.length >=1 && sandboxName.length <=63) before calling process.exit(1), and keep the same error messages for consistency.
433-443:⚠️ Potential issue | 🟠 MajorEnvironment values are shell-interpolated without quoting.
Lines 433-436 and Line 443 currently pass raw env values into a shell command. If values contain spaces or shell metacharacters, command behavior can break or be injected.
Proposed fix
const chatUiUrl = process.env.CHAT_UI_URL || 'http://127.0.0.1:18789'; const envArgs = [`CHAT_UI_URL=${chatUiUrl}`]; if (process.env.NVIDIA_API_KEY) { envArgs.push(`NVIDIA_API_KEY=${process.env.NVIDIA_API_KEY}`); } + const quotedEnvArgs = envArgs.map((entry) => shellQuote(entry)); @@ const createResult = run( - `openshell sandbox create ${createArgs.join(" ")} -- env ${envArgs.join(" ")} nemoclaw-start 2>&1`, + `openshell sandbox create ${createArgs.join(" ")} -- env ${quotedEnvArgs.join(" ")} nemoclaw-start 2>&1`, { ignoreError: true } );🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard.js` around lines 433 - 443, The env values (chatUiUrl and process.env.NVIDIA_API_KEY) are being interpolated directly into the shell command via envArgs and createArgs passed to run(), which allows spaces or shell metacharacters to break or inject commands; update the code to properly shell-escape or quote each environment value before joining into envArgs (or, better, switch to invoking run/openshell with an argument array instead of a single interpolated command) so that envArgs contains safely-escaped entries (handle existing quotes by escaping single quotes or using a proven sh-escaping helper) and then call run() with the escaped/array form so createResult receives a safe command string/argv list.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/onboard.js`:
- Around line 483-487: The timeout path currently always prints that the
orphaned sandbox was removed despite suppressing errors in the run call; modify
the handling around the run(`openshell sandbox delete "${sandboxName}" ...`, {
ignoreError: true }) invocation to capture its success/failure (e.g., check its
exit status or thrown error) and branch the messages accordingly: if delete
succeeded print the existing "has been removed" message, otherwise print that
automatic deletion failed and include a suggested manual cleanup command (e.g.,
`openshell sandbox delete "${sandboxName}"`) and where to find logs. Update
references to run and sandboxName in onboard.js accordingly so the output
accurately reflects the deletion result.
In `@test/onboard-readiness.test.js`:
- Around line 12-18: The test contains a duplicated parser (isSandboxReady) that
can drift from the production onboarding parser; remove this local copy and
instead import and use the shared readiness parser exported from the onboarding
module (the same parser used in bin/lib/onboard.js). Update the onboarding
module to export its readiness-parsing function (the function used to parse
readiness lines in onboard.js) if not already exported, then replace the test's
isSandboxReady implementation with an import of that exported helper and call it
in the test so production and test parsing stay in sync.
---
Outside diff comments:
In `@bin/lib/onboard.js`:
- Around line 380-387: The RFC1123 validation for sandboxName currently only
checks characters but not label length; update the validation in the sandboxName
check (the if that tests /^[a-z0-9]([a-z0-9-]*[a-z0-9])?$/) to enforce labels
are 1–63 characters long by either augmenting the regex to include the {1,63}
bound or by adding an explicit length check (ensure sandboxName.length >=1 &&
sandboxName.length <=63) before calling process.exit(1), and keep the same error
messages for consistency.
- Around line 433-443: The env values (chatUiUrl and process.env.NVIDIA_API_KEY)
are being interpolated directly into the shell command via envArgs and
createArgs passed to run(), which allows spaces or shell metacharacters to break
or inject commands; update the code to properly shell-escape or quote each
environment value before joining into envArgs (or, better, switch to invoking
run/openshell with an argument array instead of a single interpolated command)
so that envArgs contains safely-escaped entries (handle existing quotes by
escaping single quotes or using a proven sh-escaping helper) and then call run()
with the escaped/array form so createResult receives a safe command string/argv
list.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4ba383ff-b3ab-4241-87e3-8c68ec125569
📒 Files selected for processing (3)
bin/lib/onboard.jsbin/lib/policies.jstest/onboard-readiness.test.js
✅ Files skipped from review due to trivial changes (1)
- bin/lib/policies.js
…saging - Extract readiness parser from inline code to exported isSandboxReady() so tests validate the production code, not a duplicated copy - Branch cleanup messaging on delete result — report manual cleanup command if the orphan delete fails
A previous onboard session may leave the OpenShell gateway container and port forward running, causing port 8080/18789 conflicts on the next invocation. Detect a NemoClaw-owned gateway before the port availability check and tear it down automatically. Closes #397
2dd8ab0 to
6a1da48
Compare
Runs `nemoclaw onboard` three times without cleanup between runs, verifying that each subsequent onboard recovers automatically from stale gateway, port forward, and registry entries left behind. Regression test for #397.
17e574d to
c43c66a
Compare
…me truncation (NVIDIA#229) * fix: gate sandbox registration on readiness and surface creation failures The sandbox create command was piped through awk to deduplicate log lines. In bash, the exit status of a pipeline is the status of the last command (awk, always 0), so creation failures were silently swallowed. NemoClaw then registered a phantom sandbox in ~/.nemoclaw/sandboxes.json that caused "sandbox not found" on every subsequent connect/status call. This is the root cause of the WSL2 + Docker Desktop failures reported in NVIDIA#140 and NVIDIA#152 — sandbox creation fails due to Docker networking issues, but onboarding completes as if it succeeded. Three changes: 1. Remove the awk pipe so the real exit code flows through to run() 2. Poll openshell sandbox list for Ready state before registering (matches the gateway health check pattern at lines 121-132) 3. Move build-context cleanup before the exit-code check so temp files are always cleaned up, even on failure Signed-off-by: Brian Taylor <brian.taylor818@gmail.com> * fix: use word-boundary match for Ready status and fix timeout includes("Ready") falsely matched "NotReady" because "Ready" is a substring. Use a word-boundary regex with a NotReady exclusion so sandboxes stuck in error states are not registered as healthy. Also remove the off-by-one break at i=29 so the loop sleeps the full 60s before timing out. Signed-off-by: Brian Taylor <brian.taylor818@gmail.com> * fix: exact-match sandbox name in readiness check Use column-based matching (split on whitespace, check cols[0]) instead of substring includes(). Prevents false positives when one sandbox name is a prefix of another (e.g. "my" matching "my-assistant"). Signed-off-by: Brian Taylor <brian.taylor818@gmail.com> * test: add readiness gate parsing tests for sandbox creation Signed-off-by: Brian Taylor <brian.taylor818@gmail.com> * fix: guard against truncated sandbox names on WSL (fixes NVIDIA#21) On WSL, hyphenated sandbox names like "my-assistant" can be truncated to "m" during shell argument parsing, causing "sandbox not found" failures when applying policy presets. - Add RFC 1123 validation in applyPreset() to catch truncated names early with a clear error message - Quote sandbox name in error output (was unquoted on line 356) - Add 6 WSL-specific regression tests covering hyphenated names, multi-hyphen names, truncation detection, and command quoting * fix: clean up orphaned sandbox on readiness timeout When the sandbox is created but never reaches Ready within 60s, delete it before exiting so the next onboard retry with the same name doesn't fail on "sandbox already exists". * chore: remove issue references from code comments * fix: enforce RFC 1123 63-char limit in sandbox name validation * fix: extract isSandboxReady as shared function and branch cleanup messaging - Extract readiness parser from inline code to exported isSandboxReady() so tests validate the production code, not a duplicated copy - Branch cleanup messaging on delete result — report manual cleanup command if the orphan delete fails * fix: clean up stale gateway and port forward before preflight check A previous onboard session may leave the OpenShell gateway container and port forward running, causing port 8080/18789 conflicts on the next invocation. Detect a NemoClaw-owned gateway before the port availability check and tear it down automatically. Closes NVIDIA#397 * test: add double-onboard e2e test for stale state recovery Runs `nemoclaw onboard` three times without cleanup between runs, verifying that each subsequent onboard recovers automatically from stale gateway, port forward, and registry entries left behind. Regression test for NVIDIA#397. --------- Signed-off-by: Brian Taylor <brian.taylor818@gmail.com>
…me truncation (NVIDIA#229) * fix: gate sandbox registration on readiness and surface creation failures The sandbox create command was piped through awk to deduplicate log lines. In bash, the exit status of a pipeline is the status of the last command (awk, always 0), so creation failures were silently swallowed. NemoClaw then registered a phantom sandbox in ~/.nemoclaw/sandboxes.json that caused "sandbox not found" on every subsequent connect/status call. This is the root cause of the WSL2 + Docker Desktop failures reported in NVIDIA#140 and NVIDIA#152 — sandbox creation fails due to Docker networking issues, but onboarding completes as if it succeeded. Three changes: 1. Remove the awk pipe so the real exit code flows through to run() 2. Poll openshell sandbox list for Ready state before registering (matches the gateway health check pattern at lines 121-132) 3. Move build-context cleanup before the exit-code check so temp files are always cleaned up, even on failure Signed-off-by: Brian Taylor <brian.taylor818@gmail.com> * fix: use word-boundary match for Ready status and fix timeout includes("Ready") falsely matched "NotReady" because "Ready" is a substring. Use a word-boundary regex with a NotReady exclusion so sandboxes stuck in error states are not registered as healthy. Also remove the off-by-one break at i=29 so the loop sleeps the full 60s before timing out. Signed-off-by: Brian Taylor <brian.taylor818@gmail.com> * fix: exact-match sandbox name in readiness check Use column-based matching (split on whitespace, check cols[0]) instead of substring includes(). Prevents false positives when one sandbox name is a prefix of another (e.g. "my" matching "my-assistant"). Signed-off-by: Brian Taylor <brian.taylor818@gmail.com> * test: add readiness gate parsing tests for sandbox creation Signed-off-by: Brian Taylor <brian.taylor818@gmail.com> * fix: guard against truncated sandbox names on WSL (fixes NVIDIA#21) On WSL, hyphenated sandbox names like "my-assistant" can be truncated to "m" during shell argument parsing, causing "sandbox not found" failures when applying policy presets. - Add RFC 1123 validation in applyPreset() to catch truncated names early with a clear error message - Quote sandbox name in error output (was unquoted on line 356) - Add 6 WSL-specific regression tests covering hyphenated names, multi-hyphen names, truncation detection, and command quoting * fix: clean up orphaned sandbox on readiness timeout When the sandbox is created but never reaches Ready within 60s, delete it before exiting so the next onboard retry with the same name doesn't fail on "sandbox already exists". * chore: remove issue references from code comments * fix: enforce RFC 1123 63-char limit in sandbox name validation * fix: extract isSandboxReady as shared function and branch cleanup messaging - Extract readiness parser from inline code to exported isSandboxReady() so tests validate the production code, not a duplicated copy - Branch cleanup messaging on delete result — report manual cleanup command if the orphan delete fails * fix: clean up stale gateway and port forward before preflight check A previous onboard session may leave the OpenShell gateway container and port forward running, causing port 8080/18789 conflicts on the next invocation. Detect a NemoClaw-owned gateway before the port availability check and tear it down automatically. Closes NVIDIA#397 * test: add double-onboard e2e test for stale state recovery Runs `nemoclaw onboard` three times without cleanup between runs, verifying that each subsequent onboard recovers automatically from stale gateway, port forward, and registry entries left behind. Regression test for NVIDIA#397. --------- Signed-off-by: Brian Taylor <brian.taylor818@gmail.com>
Summary
Fixes #21
Fixes #22
Fixes #140
Fixes #152
Closes #397
Root cause
`createSandbox()` piped `openshell sandbox create` through `awk` to deduplicate log lines. In bash, the exit status of a pipeline is the status of the last command — `awk`, which always exits 0. If sandbox creation fails, `run()` sees exit 0 and continues to register a phantom sandbox.
WSL sandbox name truncation (#21)
On WSL, hyphenated sandbox names like `my-assistant` can be truncated during shell argument parsing (e.g. to `m`), causing `openshell policy set --wait` to receive a garbage name. Added RFC 1123 validation in `applyPreset()` as defense-in-depth and quoted sandbox names in error output.
Stale port conflicts on re-onboard (#397)
A previous `nemoclaw onboard` session leaves the OpenShell gateway container (port 8080) and port forward (18789) running after exit. The next `onboard` invocation fails at the preflight port check before reaching the existing cleanup code. Moved detection and teardown of a NemoClaw-owned gateway into `preflight()`, before the port availability check.
Test plan
Automated Tests
```
node --test test/onboard-readiness.test.js
```
23 tests across 3 suites:
Hardware Validation