fix: harden spark startup and destroy handling by kjw3 · Pull Request #1273 · NVIDIA/NemoClaw

kjw3 · 2026-04-01T20:18:30Z

Summary

- harden sandbox startup on Spark/arm64 by unwrapping the `env ... nemoclaw-start` bootstrap self-invocation inside the entrypoint instead of trying to `gosu` into it
- make `nemoclaw destroy` truthful: fail on real OpenShell delete errors, but treat already-missing live sandboxes as safe stale-registry cleanup
- add regression coverage for both the Spark bootstrap wrapper path and the destroy error-handling paths

Issues

Fixes gosu seccomp failure on arm64 DGX OS — failed switching to 'sandbox': operation not permitted #1054
Fixes onboard fails with "failed switching to sandbox: operation not permitted" on DGX Spark #1028
Fixes [All Platform] [v0.0.2] can't destroy sandbox #1243

Investigated But Not Included

Tavily web_search fails with EAI_AGAIN inside NemoClaw sandbox on DGX Spark — same root cause as #396 #1063
- investigated during this branch
- still appears to be the same upstream trusted-proxy/local-DNS behavior tracked by #396 ("NemoClaw sandbox: Gemini web_search fails with EAI_AGAIN until Google host + node are allowed, and trusted proxy still does local DNS lookup")
- direct curl inside the sandbox works, which points away from a bounded NemoClaw-only fix in this PR

Validation

Local:

npm run build:cli
npx vitest run test/nemoclaw-start.test.js test/cli.test.js test/onboard.test.js
npx eslint bin/nemoclaw.js test/nemoclaw-start.test.js test/cli.test.js
npx tsc -p jsconfig.json --noEmit

Spark (spark-d8c8), commit 0efbd66 in a disposable worktree:

npm install --include=dev
cd nemoclaw && npm install --include=dev
cd ..
npm run build:cli
npx vitest run test/nemoclaw-start.test.js test/cli.test.js test/onboard.test.js
npx eslint bin/nemoclaw.js test/nemoclaw-start.test.js test/cli.test.js
npx tsc -p jsconfig.json --noEmit

Result:

- passed on Spark
- `3` files, `92` tests passed
- eslint passed
- typecheck passed

Spark caveat:

- the existing Spark checkout had incomplete root JS dev dependencies (`vitest/config` missing), so the disposable worktree required a local `npm install --include=dev` repair before validation

Notes

Plan note:

/Users/kejones/nemoclaw-notes/spark-good-blockers-plan-2026-04-01.md

Signed-off-by: Kevin Jones <kejones@nvidia.com>

Summary by CodeRabbit

Bug Fixes
- Destroy now treats an already-absent sandbox as successful, prints a clear "already absent" message, proceeds with cleanup, and only fails for genuine delete errors (printing delete output).
- Start-wrapper handling improved so `env ... nemoclaw-start` invocations correctly export and pass through environment assignments.
Tests
- Added tests for destroy behavior (failure vs already-missing) and for the start-script argument-unwrapping.

coderabbitai · 2026-04-01T20:18:37Z

📝 Walkthrough

Updates sandbox-destroy flow to treat "already absent" delete results as non-fatal, capture and inspect OpenShell delete output, and proceed with gateway cleanup when appropriate; adds env-wrapper unwrapping to the start script and tests covering both behaviors.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Sandbox Destruction Logic: `bin/nemoclaw.js` | Capture `openshell sandbox delete` stdout/stderr, strip ANSI and case-insensitively classify "missing/Not Found" as already-gone; treat non-zero exit as fatal only if not already-gone, print delete output on fatal failure, allow registry/gateway cleanup when delete succeeded or sandbox already absent, and log already-absent case. |
| Startup Script Argument Handling: `scripts/nemoclaw-start.sh` | Add handling for `env ... nemoclaw-start` wrappers: scan args until the self token, export preceding `KEY=VALUE` pairs into the environment, and `set --` to strip the wrapper before building `NEMOCLAW_CMD`; keep prior direct-self invocation branch. |
| Tests: destroy & start script: `test/cli.test.js`, `test/nemoclaw-start.test.js` | Add CLI tests for (a) failing `openshell sandbox delete` with non-handled stderr (expect exit 1, registry unchanged, gateway teardown skipped) and (b) delete exiting with "sandbox not found" (expect exit 0, registry cleared, gateway teardown performed). Add Vitest assertions that `scripts/nemoclaw-start.sh` contains the `env`-wrapper unwrapping patterns. |
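
The startup-script row above describes a small argument-scanning algorithm. The real logic lives in shell in `scripts/nemoclaw-start.sh` and is not shown on this page, so the following is only a JavaScript model of the described steps. The function name `unwrapEnvSelfInvocation`, its return shape, the basename matching of the self token, and the sample path and arguments are all assumptions made for illustration.

```javascript
// JavaScript model of the unwrapping steps described for scripts/nemoclaw-start.sh.
// Hypothetical: the function name, return shape, and self-token matching are assumptions.
function unwrapEnvSelfInvocation(argv, selfName = "nemoclaw-start") {
  // Only `env ... <self>` wrappers are candidates for unwrapping.
  if (argv[0] !== "env") return { exported: {}, args: argv };

  const exported = {};
  for (let i = 1; i < argv.length; i += 1) {
    const token = argv[i];
    // Assumption: the self token is matched by bare name or by path basename.
    if (token === selfName || token.endsWith(`/${selfName}`)) {
      // Equivalent of `set --`: keep only what follows the self token.
      return { exported, args: argv.slice(i + 1) };
    }
    const eq = token.indexOf("=");
    if (eq <= 0) break; // not KEY=VALUE, so this is not the wrapper shape
    exported[token.slice(0, eq)] = token.slice(eq + 1);
  }
  // Self token never found: leave the arguments untouched.
  return { exported: {}, args: argv };
}

// Made-up invocations, not taken from the PR.
const wrapped = unwrapEnvSelfInvocation([
  "env", "FOO=bar", "BAZ=a=b", "/usr/local/bin/nemoclaw-start", "some-command", "--flag",
]);
const direct = unwrapEnvSelfInvocation(["some-command", "--flag"]);
console.log(wrapped, direct);
```

Splitting on the first `=` only keeps values that themselves contain `=` intact, which is the behavior a shell `export KEY=VALUE` would give.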

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant CLI as nemoclaw CLI
  participant OpenShell
  participant Registry
  participant Gateway

  User->>CLI: request destroy 'alpha'
  CLI->>OpenShell: openshell sandbox delete alpha (capture stdout/stderr, exit)
  OpenShell-->>CLI: exit code + stdout/stderr
  CLI->>CLI: strip ANSI, case-insensitive regex -> alreadyGone?
  alt delete succeeded or alreadyGone
    CLI->>Registry: remove sandboxes.alpha
    CLI->>Gateway: forward stop / gateway destroy -g nemoclaw
    Gateway-->>CLI: teardown logs
    CLI-->>User: exit 0, "Sandbox 'alpha' destroyed"
  else genuine delete error
    CLI-->>User: print captured delete output
    CLI-->>User: exit 1, "Failed to destroy sandbox 'alpha'."
  end
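
The decision in the diagram above can be sketched as a small pure function. The name `getSandboxDeleteOutcome` appears in the review comments, but its body is not shown on this page, so the signature, the returned fields, and the stand-in matcher below are assumptions rather than the code in `bin/nemoclaw.js`.

```javascript
// Illustrative only: the function name comes from the review thread,
// while the parameters and returned fields are assumptions.
function getSandboxDeleteOutcome(result, isMissing) {
  const output = [result.stdout, result.stderr].filter(Boolean).join("\n").trim();
  // A non-zero exit only counts as "already gone" when the output says so.
  const alreadyGone = result.status !== 0 && isMissing(output);
  const shouldCleanUp = result.status === 0 || alreadyGone;
  return { output, alreadyGone, shouldCleanUp, exitCode: shouldCleanUp ? 0 : 1 };
}

// Stand-in matcher for this example; the pattern is deliberately simplified.
const isMissing = (text) => /sandbox .* not found/i.test(text);

// Made-up delete results, not captured OpenShell output.
const gone = getSandboxDeleteOutcome(
  { status: 1, stdout: "", stderr: "Error: sandbox alpha not found" },
  isMissing,
);
const broken = getSandboxDeleteOutcome(
  { status: 1, stdout: "", stderr: "transport error: connection refused" },
  isMissing,
);
console.log(gone, broken);
```

Keeping the classification separate from the side effects (registry removal, gateway teardown) makes both branches of the `alt` block easy to test without a live OpenShell.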

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 I nibbled through args and stripped the shell,

Matched missing sandboxes where shadows dwell.
I logged the vanish, then tidy and hop—
Cleanup continued, no hiccup, no stop.
🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix: harden spark startup and destroy handling' directly summarizes the main changes: hardening startup via env wrapper unwrapping and improving destroy error handling. |
| Linked Issues check | ✅ Passed | Code changes comprehensively address all three linked issues: `#1054`/`#1028` solved by env wrapper unwrapping in nemoclaw-start.sh avoiding gosu, `#1243` solved by improved destroy error handling distinguishing real failures from already-missing sandboxes. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to the stated objectives: env wrapper handling, destroy error handling, and regression tests. No unrelated modifications detected. |

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/nemoclaw.js`:
- Around line 131-143: The delete-result matcher in isMissingSandboxDeleteResult
is missing the bare "sandbox not found" and "NotFound" forms; update the regex
used by isMissingSandboxDeleteResult to align with the pattern used by
getSandboxGatewayState (include /NotFound|sandbox not found/i alongside the
existing variants) so getSandboxDeleteOutcome correctly flags alreadyGone in the
same cases the state checker does (refer to functions
isMissingSandboxDeleteResult and getSandboxDeleteOutcome).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e04a3ec2-d546-4502-804e-6508eea7eab1

📥 Commits

Reviewing files that changed from the base of the PR and between 3f923a4 and 0efbd66.

📒 Files selected for processing (4)

bin/nemoclaw.js
scripts/nemoclaw-start.sh
test/cli.test.js
test/nemoclaw-start.test.js

coderabbitai

♻️ Duplicate comments (1)

bin/nemoclaw.js (1)

131-134: ⚠️ Potential issue | 🟠 Major

Expand missing-sandbox matching to include `NotFound` and bare `sandbox not found`.

Line 132 misses common OpenShell not-found forms, so `alreadyGone` can be a false negative and make destroy fail incorrectly for an already-absent sandbox.

Suggested fix

 function isMissingSandboxDeleteResult(output = "") {
-  return /sandbox .* not found|sandbox .* not present|sandbox does not exist|no such sandbox/i.test(
+  return /NotFound|sandbox(?:\s+\S+)?\s+not\s+(?:found|present)|sandbox does not exist|no such sandbox/i.test(
     stripAnsi(output),
   );
 }
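
To see what the suggested pattern changes, the two regexes from the diff above can be run side by side. The sample strings are made up for illustration and are not captured OpenShell output.

```javascript
// Patterns copied from the "-" and "+" lines of the suggested fix.
const before = /sandbox .* not found|sandbox .* not present|sandbox does not exist|no such sandbox/i;
const after = /NotFound|sandbox(?:\s+\S+)?\s+not\s+(?:found|present)|sandbox does not exist|no such sandbox/i;

// Hypothetical inputs chosen to exercise each alternative.
const samples = [
  "Error: sandbox alpha not found",
  "sandbox not found",
  "status: NotFound",
  "transport error: connection refused",
];

const results = samples.map((text) => ({
  text,
  before: before.test(text),
  after: after.test(text),
}));
console.table(results);
```

The bare `sandbox not found` form fails the original pattern because `sandbox .* not found` needs a space on both sides of the wildcard, so at least two spaces must separate `sandbox` from `not`. The suggested pattern makes the name token optional instead.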

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@bin/nemoclaw.js` around lines 131 - 134, The isMissingSandboxDeleteResult
function's regex misses common not-found variants (e.g., "NotFound", "Not
Found", and bare "sandbox not found"), causing alreadyGone to be false-negative;
update the regex in isMissingSandboxDeleteResult (the pattern passed to
test(stripAnsi(output))) to include alternatives for \bNotFound\b, \bNot Found\b
and an explicit "sandbox not found" token (keep the existing alternatives and
the i flag), so destroy/alreadyGone logic correctly treats these outputs as
missing.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c32ce301-5315-434c-a371-4a24b33e6c96

📥 Commits

Reviewing files that changed from the base of the PR and between 0efbd66 and 7255354.

📒 Files selected for processing (4)

bin/nemoclaw.js
scripts/nemoclaw-start.sh
test/cli.test.js
test/nemoclaw-start.test.js

✅ Files skipped from review due to trivial changes (1)

test/cli.test.js

🚧 Files skipped from review as they are similar to previous changes (2)

scripts/nemoclaw-start.sh
test/nemoclaw-start.test.js

coderabbitai

🧹 Nitpick comments (1)

bin/nemoclaw.js (1)
172-175: Consider extracting stripAnsi to a shared utility.

This function is duplicated in bin/lib/runtime-recovery.js:6-9. For maintainability, consider exporting it from a common module (e.g., runtime-recovery.js already has it, or a new bin/lib/string-utils.js).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/nemoclaw.js` around lines 172 - 175, The stripAnsi function is
duplicated; extract it into a shared utility module (e.g., create/export
stripAnsi from a new or existing module used by both scripts), preserve its
signature (function stripAnsi(value = "") { return
String(value).replace(/\x1b\[[0-9;]*m/g, ""); }), then replace the local
definitions in both files with an import of stripAnsi and remove the duplicated
implementations; update any references to ensure they call the imported
stripAnsi and run tests/lint to confirm no regressions.
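
A minimal sketch of the extraction suggested above, assuming the repository uses CommonJS modules. The file name `bin/lib/string-utils.js` is the one floated in the review, not a confirmed file, and the `require` paths in the comments are likewise assumptions.

```javascript
// Hypothetical bin/lib/string-utils.js; the path is a suggestion from the review.
function stripAnsi(value = "") {
  // Removes SGR (color/style) escape sequences such as "\x1b[31m" and "\x1b[0m".
  return String(value).replace(/\x1b\[[0-9;]*m/g, "");
}

module.exports = { stripAnsi };

// The duplicated local definitions would then be replaced by imports, for example:
//   const { stripAnsi } = require("./lib/string-utils"); // from bin/nemoclaw.js
//   const { stripAnsi } = require("./string-utils");     // from bin/lib/runtime-recovery.js

console.log(stripAnsi("\x1b[1;31mError:\x1b[0m sandbox not found"));
```

The pattern only covers sequences ending in `m`, so cursor-movement or other escape codes would pass through unchanged.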

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 98a37753-c31a-4b69-82ff-b811a5516cc5

📥 Commits

Reviewing files that changed from the base of the PR and between 7255354 and 84f8dcf.

📒 Files selected for processing (2)

bin/nemoclaw.js
test/cli.test.js

ericksoa

LGTM. Good destroy error handling and Spark startup fix. Merged with main, tests pass.

Co-authored-by: Aaron Erickson <aerickson@nvidia.com>

fix: harden spark startup and destroy handling

0efbd66

kjw3 marked this pull request as ready for review April 1, 2026 20:30

coderabbitai Bot reviewed Apr 1, 2026

Comment thread bin/nemoclaw.js

wscurran added the "Platform: DGX Spark", "Getting Started", and "status: triage" labels and removed the "status: triage" label on Apr 1, 2026

Merge branch 'main' into fix/spark-good-blockers

7255354

coderabbitai Bot reviewed Apr 1, 2026

kjw3 added 2 commits April 1, 2026 19:48

fix: align destroy missing-sandbox matching

3f698e7

fix: match spaced not-found sandbox errors

84f8dcf

coderabbitai Bot reviewed Apr 1, 2026

kjw3 and others added 6 commits April 1, 2026 20:01

Merge branch 'main' into fix/spark-good-blockers

d382dea

Merge branch 'main' into fix/spark-good-blockers

abdc6a5

Merge branch 'main' into fix/spark-good-blockers

a1b5030

Merge branch 'main' into fix/spark-good-blockers

efa1911

Merge remote-tracking branch 'origin/main' into fix/spark-good-blockers

95c14bb

Merge branch 'fix/spark-good-blockers' of https://github.com/NVIDIA/N…

3baab8e

…emoClaw into fix/spark-good-blockers

ericksoa approved these changes Apr 2, 2026

ericksoa merged commit b8fab8c into main Apr 2, 2026
6 checks passed

kjw3 deleted the fix/spark-good-blockers branch April 2, 2026 00:53

ericksoa merged 10 commits into main from fix/spark-good-blockers
