Skip to content

fix: harden spark startup and destroy handling#1273

Merged
ericksoa merged 10 commits intomainfrom
fix/spark-good-blockers
Apr 2, 2026
Merged

fix: harden spark startup and destroy handling#1273
ericksoa merged 10 commits intomainfrom
fix/spark-good-blockers

Conversation

@kjw3
Copy link
Copy Markdown
Contributor

@kjw3 kjw3 commented Apr 1, 2026

Summary

  • harden sandbox startup on Spark/arm64 by unwrapping the env ... nemoclaw-start bootstrap self-invocation inside the entrypoint instead of trying to gosu into it
  • make nemoclaw destroy truthful: fail on real OpenShell delete errors, but treat already-missing live sandboxes as safe stale-registry cleanup
  • add regression coverage for both the Spark bootstrap wrapper path and the destroy error-handling paths

Issues

Investigated But Not Included

Validation

Local:

npm run build:cli
npx vitest run test/nemoclaw-start.test.js test/cli.test.js test/onboard.test.js
npx eslint bin/nemoclaw.js test/nemoclaw-start.test.js test/cli.test.js
npx tsc -p jsconfig.json --noEmit

Spark (spark-d8c8), commit 0efbd66 in a disposable worktree:

npm install --include=dev
cd nemoclaw && npm install --include=dev
cd ..
npm run build:cli
npx vitest run test/nemoclaw-start.test.js test/cli.test.js test/onboard.test.js
npx eslint bin/nemoclaw.js test/nemoclaw-start.test.js test/cli.test.js
npx tsc -p jsconfig.json --noEmit

Result:

  • passed on Spark
  • 3 files, 92 tests passed
  • eslint passed
  • typecheck passed

Spark caveat:

  • the existing Spark checkout had incomplete root JS dev dependencies (vitest/config missing), so the disposable worktree required a local npm install --include=dev repair before validation

Notes

Plan note:

  • /Users/kejones/nemoclaw-notes/spark-good-blockers-plan-2026-04-01.md

Signed-off-by: Kevin Jones kejones@nvidia.com

Summary by CodeRabbit

  • Bug Fixes

    • Destroy now treats an already-absent sandbox as successful, prints a clear "already absent" message, proceeds with cleanup, and only fails for genuine delete errors (printing delete output).
    • Start-wrapper handling improved so env ... nemoclaw-start invocations correctly export and pass through environment assignments.
  • Tests

    • Added tests for destroy behavior (failure vs already-missing) and for the start-script argument-unwrapping.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 1, 2026

📝 Walkthrough

Walkthrough

Updates sandbox-destroy flow to treat "already absent" delete results as non-fatal, capture and inspect OpenShell delete output, and proceed with gateway cleanup when appropriate; adds env-wrapper unwrapping to the start script and tests covering both behaviors.

Changes

Cohort / File(s) Summary
Sandbox Destruction Logic
bin/nemoclaw.js
Capture openshell sandbox delete stdout/stderr, strip ANSI and case-insensitively classify "missing/Not Found" as already-gone; treat non-zero exit as fatal only if not already-gone, print delete output on fatal failure, allow registry/gateway cleanup when delete succeeded or sandbox already absent, and log already-absent case.
Startup Script Argument Handling
scripts/nemoclaw-start.sh
Add handling for env ... nemoclaw-start wrappers: scan args until the self token, export preceding KEY=VALUE pairs into the environment, and set -- to strip the wrapper before building NEMOCLAW_CMD; keep prior direct-self invocation branch.
Tests: destroy & start script
test/cli.test.js, test/nemoclaw-start.test.js
Add CLI tests for (a) failing openshell sandbox delete with non-handled stderr (expect exit 1, registry unchanged, gateway teardown skipped) and (b) delete exiting with "sandbox not found" (expect exit 0, registry cleared, gateway teardown performed). Add Vitest assertions that scripts/nemoclaw-start.sh contains the env-wrapper unwrapping patterns.

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant CLI as nemoclaw CLI
  participant OpenShell
  participant Registry
  participant Gateway

  User->>CLI: request destroy 'alpha'
  CLI->>OpenShell: openshell sandbox delete alpha (capture stdout/stderr, exit)
  OpenShell-->>CLI: exit code + stdout/stderr
  CLI->>CLI: strip ANSI, case-insensitive regex -> alreadyGone?
  alt delete succeeded or alreadyGone
    CLI->>Registry: remove sandboxes.alpha
    CLI->>Gateway: forward stop / gateway destroy -g nemoclaw
    Gateway-->>CLI: teardown logs
    CLI-->>User: exit 0, "Sandbox 'alpha' destroyed"
  else genuine delete error
    CLI-->>User: print captured delete output
    CLI-->>User: exit 1, "Failed to destroy sandbox 'alpha'."
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 I nibbled through args and stripped the shell,

Matched missing sandboxes where shadows dwell.
I logged the vanish, then tidy and hop—
Cleanup continued, no hiccup, no stop.
🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix: harden spark startup and destroy handling' directly summarizes the main changes: hardening startup via env wrapper unwrapping and improving destroy error handling.
Linked Issues check ✅ Passed Code changes comprehensively address all three linked issues: #1054/#1028 solved by env wrapper unwrapping in nemoclaw-start.sh avoiding gosu, #1243 solved by improved destroy error handling distinguishing real failures from already-missing sandboxes.
Out of Scope Changes check ✅ Passed All changes are directly scoped to the stated objectives: env wrapper handling, destroy error handling, and regression tests. No unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/spark-good-blockers

Comment @coderabbitai help to get the list of available commands and usage tips.

@kjw3 kjw3 marked this pull request as ready for review April 1, 2026 20:30
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/nemoclaw.js`:
- Around line 131-143: The delete-result matcher in isMissingSandboxDeleteResult
is missing the bare "sandbox not found" and "NotFound" forms; update the regex
used by isMissingSandboxDeleteResult to align with the pattern used by
getSandboxGatewayState (include /NotFound|sandbox not found/i alongside the
existing variants) so getSandboxDeleteOutcome correctly flags alreadyGone in the
same cases the state checker does (refer to functions
isMissingSandboxDeleteResult and getSandboxDeleteOutcome).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e04a3ec2-d546-4502-804e-6508eea7eab1

📥 Commits

Reviewing files that changed from the base of the PR and between 3f923a4 and 0efbd66.

📒 Files selected for processing (4)
  • bin/nemoclaw.js
  • scripts/nemoclaw-start.sh
  • test/cli.test.js
  • test/nemoclaw-start.test.js

Comment thread bin/nemoclaw.js
@wscurran wscurran added Platform: DGX Spark Support for DGX Spark Getting Started Use this label to identify setup, installation, or onboarding issues. status: triage For new items that haven't been reviewed yet. and removed status: triage For new items that haven't been reviewed yet. labels Apr 1, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
bin/nemoclaw.js (1)

131-134: ⚠️ Potential issue | 🟠 Major

Expand missing-sandbox matching to include NotFound and bare sandbox not found.

Line 132 misses common OpenShell not-found forms, so alreadyGone can be false-negative and make destroy fail incorrectly for an already-absent sandbox.

Suggested fix
 function isMissingSandboxDeleteResult(output = "") {
-  return /sandbox .* not found|sandbox .* not present|sandbox does not exist|no such sandbox/i.test(
+  return /NotFound|sandbox(?:\s+\S+)?\s+not\s+(?:found|present)|sandbox does not exist|no such sandbox/i.test(
     stripAnsi(output),
   );
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/nemoclaw.js` around lines 131 - 134, The isMissingSandboxDeleteResult
function's regex misses common not-found variants (e.g., "NotFound", "Not
Found", and bare "sandbox not found"), causing alreadyGone to be false-negative;
update the regex in isMissingSandboxDeleteResult (the pattern passed to
test(stripAnsi(output))) to include alternatives for \bNotFound\b, \bNot Found\b
and an explicit "sandbox not found" token (keep the existing alternatives and
the i flag), so destroy/alreadyGone logic correctly treats these outputs as
missing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@bin/nemoclaw.js`:
- Around line 131-134: The isMissingSandboxDeleteResult function's regex misses
common not-found variants (e.g., "NotFound", "Not Found", and bare "sandbox not
found"), causing alreadyGone to be false-negative; update the regex in
isMissingSandboxDeleteResult (the pattern passed to test(stripAnsi(output))) to
include alternatives for \bNotFound\b, \bNot Found\b and an explicit "sandbox
not found" token (keep the existing alternatives and the i flag), so
destroy/alreadyGone logic correctly treats these outputs as missing.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c32ce301-5315-434c-a371-4a24b33e6c96

📥 Commits

Reviewing files that changed from the base of the PR and between 0efbd66 and 7255354.

📒 Files selected for processing (4)
  • bin/nemoclaw.js
  • scripts/nemoclaw-start.sh
  • test/cli.test.js
  • test/nemoclaw-start.test.js
✅ Files skipped from review due to trivial changes (1)
  • test/cli.test.js
🚧 Files skipped from review as they are similar to previous changes (2)
  • scripts/nemoclaw-start.sh
  • test/nemoclaw-start.test.js

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
bin/nemoclaw.js (1)

172-175: Consider extracting stripAnsi to a shared utility.

This function is duplicated in bin/lib/runtime-recovery.js:6-9. For maintainability, consider exporting it from a common module (e.g., runtime-recovery.js already has it, or a new bin/lib/string-utils.js).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/nemoclaw.js` around lines 172 - 175, The stripAnsi function is
duplicated; extract it into a shared utility module (e.g., create/export
stripAnsi from a new or existing module used by both scripts), preserve its
signature (function stripAnsi(value = "") { return
String(value).replace(/\x1b\[[0-9;]*m/g, ""); }), then replace the local
definitions in both files with an import of stripAnsi and remove the duplicated
implementations; update any references to ensure they call the imported
stripAnsi and run tests/lint to confirm no regressions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@bin/nemoclaw.js`:
- Around line 172-175: The stripAnsi function is duplicated; extract it into a
shared utility module (e.g., create/export stripAnsi from a new or existing
module used by both scripts), preserve its signature (function stripAnsi(value =
"") { return String(value).replace(/\x1b\[[0-9;]*m/g, ""); }), then replace the
local definitions in both files with an import of stripAnsi and remove the
duplicated implementations; update any references to ensure they call the
imported stripAnsi and run tests/lint to confirm no regressions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 98a37753-c31a-4b69-82ff-b811a5516cc5

📥 Commits

Reviewing files that changed from the base of the PR and between 7255354 and 84f8dcf.

📒 Files selected for processing (2)
  • bin/nemoclaw.js
  • test/cli.test.js

Copy link
Copy Markdown
Contributor

@ericksoa ericksoa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Good destroy error handling and Spark startup fix. Merged with main, tests pass.

@ericksoa ericksoa merged commit b8fab8c into main Apr 2, 2026
6 checks passed
@kjw3 kjw3 deleted the fix/spark-good-blockers branch April 2, 2026 00:53
laitingsheng pushed a commit that referenced this pull request Apr 2, 2026
## Summary

- harden sandbox startup on Spark/arm64 by unwrapping the `env ...
nemoclaw-start` bootstrap self-invocation inside the entrypoint instead
of trying to `gosu` into it
- make `nemoclaw destroy` truthful: fail on real OpenShell delete
errors, but treat already-missing live sandboxes as safe stale-registry
cleanup
- add regression coverage for both the Spark bootstrap wrapper path and
the destroy error-handling paths

## Issues

- Fixes #1054
- Fixes #1028
- Fixes #1243

## Investigated But Not Included

- #1063
  - investigated during this branch
- still appears to be the same upstream trusted-proxy/local-DNS behavior
tracked by #396
- direct `curl` inside the sandbox works, which points away from a
bounded NemoClaw-only fix in this PR

## Validation

Local:

```bash
npm run build:cli
npx vitest run test/nemoclaw-start.test.js test/cli.test.js test/onboard.test.js
npx eslint bin/nemoclaw.js test/nemoclaw-start.test.js test/cli.test.js
npx tsc -p jsconfig.json --noEmit
```

Spark (`spark-d8c8`), commit `0efbd66` in a disposable worktree:

```bash
npm install --include=dev
cd nemoclaw && npm install --include=dev
cd ..
npm run build:cli
npx vitest run test/nemoclaw-start.test.js test/cli.test.js test/onboard.test.js
npx eslint bin/nemoclaw.js test/nemoclaw-start.test.js test/cli.test.js
npx tsc -p jsconfig.json --noEmit
```

Result:
- passed on Spark
- `3` files, `92` tests passed
- eslint passed
- typecheck passed

Spark caveat:
- the existing Spark checkout had incomplete root JS dev dependencies
(`vitest/config` missing), so the disposable worktree required a local
`npm install --include=dev` repair before validation

## Notes

Plan note:
- `/Users/kejones/nemoclaw-notes/spark-good-blockers-plan-2026-04-01.md`

Signed-off-by: Kevin Jones <kejones@nvidia.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Destroy now treats an already-absent sandbox as successful, prints a
clear "already absent" message, proceeds with cleanup, and only fails
for genuine delete errors (printing delete output).
* Start-wrapper handling improved so env ... nemoclaw-start invocations
correctly export and pass through environment assignments.

* **Tests**
* Added tests for destroy behavior (failure vs already-missing) and for
the start-script argument-unwrapping.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Aaron Erickson <aerickson@nvidia.com>
lakamsani pushed a commit to lakamsani/NemoClaw that referenced this pull request Apr 4, 2026
## Summary

- harden sandbox startup on Spark/arm64 by unwrapping the `env ...
nemoclaw-start` bootstrap self-invocation inside the entrypoint instead
of trying to `gosu` into it
- make `nemoclaw destroy` truthful: fail on real OpenShell delete
errors, but treat already-missing live sandboxes as safe stale-registry
cleanup
- add regression coverage for both the Spark bootstrap wrapper path and
the destroy error-handling paths

## Issues

- Fixes NVIDIA#1054
- Fixes NVIDIA#1028
- Fixes NVIDIA#1243

## Investigated But Not Included

- NVIDIA#1063
  - investigated during this branch
- still appears to be the same upstream trusted-proxy/local-DNS behavior
tracked by NVIDIA#396
- direct `curl` inside the sandbox works, which points away from a
bounded NemoClaw-only fix in this PR

## Validation

Local:

```bash
npm run build:cli
npx vitest run test/nemoclaw-start.test.js test/cli.test.js test/onboard.test.js
npx eslint bin/nemoclaw.js test/nemoclaw-start.test.js test/cli.test.js
npx tsc -p jsconfig.json --noEmit
```

Spark (`spark-d8c8`), commit `0efbd66` in a disposable worktree:

```bash
npm install --include=dev
cd nemoclaw && npm install --include=dev
cd ..
npm run build:cli
npx vitest run test/nemoclaw-start.test.js test/cli.test.js test/onboard.test.js
npx eslint bin/nemoclaw.js test/nemoclaw-start.test.js test/cli.test.js
npx tsc -p jsconfig.json --noEmit
```

Result:
- passed on Spark
- `3` files, `92` tests passed
- eslint passed
- typecheck passed

Spark caveat:
- the existing Spark checkout had incomplete root JS dev dependencies
(`vitest/config` missing), so the disposable worktree required a local
`npm install --include=dev` repair before validation

## Notes

Plan note:
- `/Users/kejones/nemoclaw-notes/spark-good-blockers-plan-2026-04-01.md`

Signed-off-by: Kevin Jones <kejones@nvidia.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Destroy now treats an already-absent sandbox as successful, prints a
clear "already absent" message, proceeds with cleanup, and only fails
for genuine delete errors (printing delete output).
* Start-wrapper handling improved so env ... nemoclaw-start invocations
correctly export and pass through environment assignments.

* **Tests**
* Added tests for destroy behavior (failure vs already-missing) and for
the start-script argument-unwrapping.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Aaron Erickson <aerickson@nvidia.com>
gemini2026 pushed a commit to gemini2026/NemoClaw that referenced this pull request Apr 14, 2026
## Summary

- harden sandbox startup on Spark/arm64 by unwrapping the `env ...
nemoclaw-start` bootstrap self-invocation inside the entrypoint instead
of trying to `gosu` into it
- make `nemoclaw destroy` truthful: fail on real OpenShell delete
errors, but treat already-missing live sandboxes as safe stale-registry
cleanup
- add regression coverage for both the Spark bootstrap wrapper path and
the destroy error-handling paths

## Issues

- Fixes NVIDIA#1054
- Fixes NVIDIA#1028
- Fixes NVIDIA#1243

## Investigated But Not Included

- NVIDIA#1063
  - investigated during this branch
- still appears to be the same upstream trusted-proxy/local-DNS behavior
tracked by NVIDIA#396
- direct `curl` inside the sandbox works, which points away from a
bounded NemoClaw-only fix in this PR

## Validation

Local:

```bash
npm run build:cli
npx vitest run test/nemoclaw-start.test.js test/cli.test.js test/onboard.test.js
npx eslint bin/nemoclaw.js test/nemoclaw-start.test.js test/cli.test.js
npx tsc -p jsconfig.json --noEmit
```

Spark (`spark-d8c8`), commit `0efbd66` in a disposable worktree:

```bash
npm install --include=dev
cd nemoclaw && npm install --include=dev
cd ..
npm run build:cli
npx vitest run test/nemoclaw-start.test.js test/cli.test.js test/onboard.test.js
npx eslint bin/nemoclaw.js test/nemoclaw-start.test.js test/cli.test.js
npx tsc -p jsconfig.json --noEmit
```

Result:
- passed on Spark
- `3` files, `92` tests passed
- eslint passed
- typecheck passed

Spark caveat:
- the existing Spark checkout had incomplete root JS dev dependencies
(`vitest/config` missing), so the disposable worktree required a local
`npm install --include=dev` repair before validation

## Notes

Plan note:
- `/Users/kejones/nemoclaw-notes/spark-good-blockers-plan-2026-04-01.md`

Signed-off-by: Kevin Jones <kejones@nvidia.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Destroy now treats an already-absent sandbox as successful, prints a
clear "already absent" message, proceeds with cleanup, and only fails
for genuine delete errors (printing delete output).
* Start-wrapper handling improved so env ... nemoclaw-start invocations
correctly export and pass through environment assignments.

* **Tests**
* Added tests for destroy behavior (failure vs already-missing) and for
the start-script argument-unwrapping.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Aaron Erickson <aerickson@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Getting Started Use this label to identify setup, installation, or onboarding issues. Platform: DGX Spark Support for DGX Spark

Projects

None yet

3 participants