fix: harden spark startup and destroy handling#1273
Conversation
📝 WalkthroughWalkthroughUpdates sandbox-destroy flow to treat "already absent" delete results as non-fatal, capture and inspect OpenShell delete output, and proceed with gateway cleanup when appropriate; adds env-wrapper unwrapping to the start script and tests covering both behaviors. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant CLI as nemoclaw CLI
participant OpenShell
participant Registry
participant Gateway
User->>CLI: request destroy 'alpha'
CLI->>OpenShell: openshell sandbox delete alpha (capture stdout/stderr, exit)
OpenShell-->>CLI: exit code + stdout/stderr
CLI->>CLI: strip ANSI, case-insensitive regex -> alreadyGone?
alt delete succeeded or alreadyGone
CLI->>Registry: remove sandboxes.alpha
CLI->>Gateway: forward stop / gateway destroy -g nemoclaw
Gateway-->>CLI: teardown logs
CLI-->>User: exit 0, "Sandbox 'alpha' destroyed"
else genuine delete error
CLI-->>User: print captured delete output
CLI-->>User: exit 1, "Failed to destroy sandbox 'alpha'."
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/nemoclaw.js`:
- Around line 131-143: The delete-result matcher in isMissingSandboxDeleteResult
is missing the bare "sandbox not found" and "NotFound" forms; update the regex
used by isMissingSandboxDeleteResult to align with the pattern used by
getSandboxGatewayState (include /NotFound|sandbox not found/i alongside the
existing variants) so getSandboxDeleteOutcome correctly flags alreadyGone in the
same cases the state checker does (refer to functions
isMissingSandboxDeleteResult and getSandboxDeleteOutcome).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e04a3ec2-d546-4502-804e-6508eea7eab1
📒 Files selected for processing (4)
bin/nemoclaw.jsscripts/nemoclaw-start.shtest/cli.test.jstest/nemoclaw-start.test.js
There was a problem hiding this comment.
♻️ Duplicate comments (1)
bin/nemoclaw.js (1)
131-134:⚠️ Potential issue | 🟠 MajorExpand missing-sandbox matching to include
NotFoundand baresandbox not found.Line 132 misses common OpenShell not-found forms, so
alreadyGonecan be false-negative and makedestroyfail incorrectly for an already-absent sandbox.Suggested fix
function isMissingSandboxDeleteResult(output = "") { - return /sandbox .* not found|sandbox .* not present|sandbox does not exist|no such sandbox/i.test( + return /NotFound|sandbox(?:\s+\S+)?\s+not\s+(?:found|present)|sandbox does not exist|no such sandbox/i.test( stripAnsi(output), ); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/nemoclaw.js` around lines 131 - 134, The isMissingSandboxDeleteResult function's regex misses common not-found variants (e.g., "NotFound", "Not Found", and bare "sandbox not found"), causing alreadyGone to be false-negative; update the regex in isMissingSandboxDeleteResult (the pattern passed to test(stripAnsi(output))) to include alternatives for \bNotFound\b, \bNot Found\b and an explicit "sandbox not found" token (keep the existing alternatives and the i flag), so destroy/alreadyGone logic correctly treats these outputs as missing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@bin/nemoclaw.js`:
- Around line 131-134: The isMissingSandboxDeleteResult function's regex misses
common not-found variants (e.g., "NotFound", "Not Found", and bare "sandbox not
found"), causing alreadyGone to be false-negative; update the regex in
isMissingSandboxDeleteResult (the pattern passed to test(stripAnsi(output))) to
include alternatives for \bNotFound\b, \bNot Found\b and an explicit "sandbox
not found" token (keep the existing alternatives and the i flag), so
destroy/alreadyGone logic correctly treats these outputs as missing.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c32ce301-5315-434c-a371-4a24b33e6c96
📒 Files selected for processing (4)
bin/nemoclaw.jsscripts/nemoclaw-start.shtest/cli.test.jstest/nemoclaw-start.test.js
✅ Files skipped from review due to trivial changes (1)
- test/cli.test.js
🚧 Files skipped from review as they are similar to previous changes (2)
- scripts/nemoclaw-start.sh
- test/nemoclaw-start.test.js
There was a problem hiding this comment.
🧹 Nitpick comments (1)
bin/nemoclaw.js (1)
172-175: Consider extractingstripAnsito a shared utility.This function is duplicated in
bin/lib/runtime-recovery.js:6-9. For maintainability, consider exporting it from a common module (e.g.,runtime-recovery.jsalready has it, or a newbin/lib/string-utils.js).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/nemoclaw.js` around lines 172 - 175, The stripAnsi function is duplicated; extract it into a shared utility module (e.g., create/export stripAnsi from a new or existing module used by both scripts), preserve its signature (function stripAnsi(value = "") { return String(value).replace(/\x1b\[[0-9;]*m/g, ""); }), then replace the local definitions in both files with an import of stripAnsi and remove the duplicated implementations; update any references to ensure they call the imported stripAnsi and run tests/lint to confirm no regressions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@bin/nemoclaw.js`:
- Around line 172-175: The stripAnsi function is duplicated; extract it into a
shared utility module (e.g., create/export stripAnsi from a new or existing
module used by both scripts), preserve its signature (function stripAnsi(value =
"") { return String(value).replace(/\x1b\[[0-9;]*m/g, ""); }), then replace the
local definitions in both files with an import of stripAnsi and remove the
duplicated implementations; update any references to ensure they call the
imported stripAnsi and run tests/lint to confirm no regressions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 98a37753-c31a-4b69-82ff-b811a5516cc5
📒 Files selected for processing (2)
bin/nemoclaw.jstest/cli.test.js
…emoClaw into fix/spark-good-blockers
ericksoa
left a comment
There was a problem hiding this comment.
LGTM. Good destroy error handling and Spark startup fix. Merged with main, tests pass.
## Summary - harden sandbox startup on Spark/arm64 by unwrapping the `env ... nemoclaw-start` bootstrap self-invocation inside the entrypoint instead of trying to `gosu` into it - make `nemoclaw destroy` truthful: fail on real OpenShell delete errors, but treat already-missing live sandboxes as safe stale-registry cleanup - add regression coverage for both the Spark bootstrap wrapper path and the destroy error-handling paths ## Issues - Fixes #1054 - Fixes #1028 - Fixes #1243 ## Investigated But Not Included - #1063 - investigated during this branch - still appears to be the same upstream trusted-proxy/local-DNS behavior tracked by #396 - direct `curl` inside the sandbox works, which points away from a bounded NemoClaw-only fix in this PR ## Validation Local: ```bash npm run build:cli npx vitest run test/nemoclaw-start.test.js test/cli.test.js test/onboard.test.js npx eslint bin/nemoclaw.js test/nemoclaw-start.test.js test/cli.test.js npx tsc -p jsconfig.json --noEmit ``` Spark (`spark-d8c8`), commit `0efbd66` in a disposable worktree: ```bash npm install --include=dev cd nemoclaw && npm install --include=dev cd .. npm run build:cli npx vitest run test/nemoclaw-start.test.js test/cli.test.js test/onboard.test.js npx eslint bin/nemoclaw.js test/nemoclaw-start.test.js test/cli.test.js npx tsc -p jsconfig.json --noEmit ``` Result: - passed on Spark - `3` files, `92` tests passed - eslint passed - typecheck passed Spark caveat: - the existing Spark checkout had incomplete root JS dev dependencies (`vitest/config` missing), so the disposable worktree required a local `npm install --include=dev` repair before validation ## Notes Plan note: - `/Users/kejones/nemoclaw-notes/spark-good-blockers-plan-2026-04-01.md` Signed-off-by: Kevin Jones <kejones@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Destroy now treats an already-absent sandbox as successful, prints a clear "already absent" message, proceeds with cleanup, and only fails for genuine delete errors (printing delete output). * Start-wrapper handling improved so env ... nemoclaw-start invocations correctly export and pass through environment assignments. * **Tests** * Added tests for destroy behavior (failure vs already-missing) and for the start-script argument-unwrapping. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Aaron Erickson <aerickson@nvidia.com>
## Summary - harden sandbox startup on Spark/arm64 by unwrapping the `env ... nemoclaw-start` bootstrap self-invocation inside the entrypoint instead of trying to `gosu` into it - make `nemoclaw destroy` truthful: fail on real OpenShell delete errors, but treat already-missing live sandboxes as safe stale-registry cleanup - add regression coverage for both the Spark bootstrap wrapper path and the destroy error-handling paths ## Issues - Fixes NVIDIA#1054 - Fixes NVIDIA#1028 - Fixes NVIDIA#1243 ## Investigated But Not Included - NVIDIA#1063 - investigated during this branch - still appears to be the same upstream trusted-proxy/local-DNS behavior tracked by NVIDIA#396 - direct `curl` inside the sandbox works, which points away from a bounded NemoClaw-only fix in this PR ## Validation Local: ```bash npm run build:cli npx vitest run test/nemoclaw-start.test.js test/cli.test.js test/onboard.test.js npx eslint bin/nemoclaw.js test/nemoclaw-start.test.js test/cli.test.js npx tsc -p jsconfig.json --noEmit ``` Spark (`spark-d8c8`), commit `0efbd66` in a disposable worktree: ```bash npm install --include=dev cd nemoclaw && npm install --include=dev cd .. npm run build:cli npx vitest run test/nemoclaw-start.test.js test/cli.test.js test/onboard.test.js npx eslint bin/nemoclaw.js test/nemoclaw-start.test.js test/cli.test.js npx tsc -p jsconfig.json --noEmit ``` Result: - passed on Spark - `3` files, `92` tests passed - eslint passed - typecheck passed Spark caveat: - the existing Spark checkout had incomplete root JS dev dependencies (`vitest/config` missing), so the disposable worktree required a local `npm install --include=dev` repair before validation ## Notes Plan note: - `/Users/kejones/nemoclaw-notes/spark-good-blockers-plan-2026-04-01.md` Signed-off-by: Kevin Jones <kejones@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Destroy now treats an already-absent sandbox as successful, prints a clear "already absent" message, proceeds with cleanup, and only fails for genuine delete errors (printing delete output). * Start-wrapper handling improved so env ... nemoclaw-start invocations correctly export and pass through environment assignments. * **Tests** * Added tests for destroy behavior (failure vs already-missing) and for the start-script argument-unwrapping. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Aaron Erickson <aerickson@nvidia.com>
## Summary - harden sandbox startup on Spark/arm64 by unwrapping the `env ... nemoclaw-start` bootstrap self-invocation inside the entrypoint instead of trying to `gosu` into it - make `nemoclaw destroy` truthful: fail on real OpenShell delete errors, but treat already-missing live sandboxes as safe stale-registry cleanup - add regression coverage for both the Spark bootstrap wrapper path and the destroy error-handling paths ## Issues - Fixes NVIDIA#1054 - Fixes NVIDIA#1028 - Fixes NVIDIA#1243 ## Investigated But Not Included - NVIDIA#1063 - investigated during this branch - still appears to be the same upstream trusted-proxy/local-DNS behavior tracked by NVIDIA#396 - direct `curl` inside the sandbox works, which points away from a bounded NemoClaw-only fix in this PR ## Validation Local: ```bash npm run build:cli npx vitest run test/nemoclaw-start.test.js test/cli.test.js test/onboard.test.js npx eslint bin/nemoclaw.js test/nemoclaw-start.test.js test/cli.test.js npx tsc -p jsconfig.json --noEmit ``` Spark (`spark-d8c8`), commit `0efbd66` in a disposable worktree: ```bash npm install --include=dev cd nemoclaw && npm install --include=dev cd .. npm run build:cli npx vitest run test/nemoclaw-start.test.js test/cli.test.js test/onboard.test.js npx eslint bin/nemoclaw.js test/nemoclaw-start.test.js test/cli.test.js npx tsc -p jsconfig.json --noEmit ``` Result: - passed on Spark - `3` files, `92` tests passed - eslint passed - typecheck passed Spark caveat: - the existing Spark checkout had incomplete root JS dev dependencies (`vitest/config` missing), so the disposable worktree required a local `npm install --include=dev` repair before validation ## Notes Plan note: - `/Users/kejones/nemoclaw-notes/spark-good-blockers-plan-2026-04-01.md` Signed-off-by: Kevin Jones <kejones@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Destroy now treats an already-absent sandbox as successful, prints a clear "already absent" message, proceeds with cleanup, and only fails for genuine delete errors (printing delete output). * Start-wrapper handling improved so env ... nemoclaw-start invocations correctly export and pass through environment assignments. * **Tests** * Added tests for destroy behavior (failure vs already-missing) and for the start-script argument-unwrapping. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Aaron Erickson <aerickson@nvidia.com>
Summary
env ... nemoclaw-startbootstrap self-invocation inside the entrypoint instead of trying togosuinto itnemoclaw destroytruthful: fail on real OpenShell delete errors, but treat already-missing live sandboxes as safe stale-registry cleanupIssues
Investigated But Not Included
curlinside the sandbox works, which points away from a bounded NemoClaw-only fix in this PRValidation
Local:
Spark (
spark-d8c8), commit0efbd66in a disposable worktree:Result:
3files,92tests passedSpark caveat:
vitest/configmissing), so the disposable worktree required a localnpm install --include=devrepair before validationNotes
Plan note:
/Users/kejones/nemoclaw-notes/spark-good-blockers-plan-2026-04-01.mdSigned-off-by: Kevin Jones kejones@nvidia.com
Summary by CodeRabbit
Bug Fixes
Tests