fix: clean up Docker volumes after failed gateway start by WuKongAI-CMU · Pull Request #95 · NVIDIA/NemoClaw

WuKongAI-CMU · 2026-03-17T01:50:34Z

Summary

- Extract a `destroyGateway()` helper that runs `openshell gateway destroy` and removes orphaned `openshell-cluster-nemoclaw` Docker volumes
- Call it before gateway start (existing pre-cleanup), after a failed start (new), and after a failed health check (new)
- Error messages now tell users to simply rerun the installer instead of requiring a manual `docker volume rm`
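
A minimal sketch of the three call sites listed above. This is not the code from `bin/lib/onboard.js`: `startGatewayWithCleanup`, `startGateway`, and `checkHealth` are hypothetical names, the collaborators are injected so the flow can run with stubs, and the health-check message wording is assumed.

```javascript
// Sketch only: startGatewayWithCleanup, startGateway, and checkHealth are
// hypothetical names; destroyGateway stands in for the helper from this PR.
function startGatewayWithCleanup({ destroyGateway, startGateway, checkHealth, log = console.log }) {
  // Shared failure path: remove stale state, then tell the user to rerun.
  const failAndCleanUp = (message) => {
    log(`${message} Cleaning up stale state...`);
    destroyGateway();
    log("Stale state removed. Please rerun the installer.");
    log("If the error persists, run: openshell gateway info");
    return false;
  };

  destroyGateway(); // 1. pre-cleanup before every start (existing behavior)

  if (!startGateway()) {
    return failAndCleanUp("Gateway failed to start."); // 2. failed start (new)
  }
  if (!checkHealth()) {
    // 3. failed health check (new); this message wording is assumed
    return failAndCleanUp("Gateway health check failed.");
  }
  return true;
}

// Usage with stubs: a failing start triggers the pre-cleanup plus one more cleanup.
let destroyCalls = 0;
const ok = startGatewayWithCleanup({
  destroyGateway: () => { destroyCalls += 1; },
  startGateway: () => false,
  checkHealth: () => true,
});
console.log(ok, destroyCalls); // false 2
```

Routing both failure cases through one closure keeps the cleanup and the user-facing guidance identical, which matches the "same cleanup path" intent described in this PR.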

Fixes #17

Root cause

openshell gateway destroy does not always remove the Docker volumes it created. When a subsequent openshell gateway start finds these volumes, it fails with "Corrupted cluster state". Users had to discover docker volume rm on their own.

Before

Gateway failed to start. Run: openshell gateway info
# Next run:
Error: Corrupted cluster state. Please manually remove Docker volumes.

After

Gateway failed to start. Cleaning up stale state...
Stale state removed. Please rerun the installer.
If the error persists, run: openshell gateway info

Test plan

- All 39 tests pass
- Verify cleanup removes volumes after an intentionally failed gateway start
- Verify rerun succeeds after cleanup

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

New Features
- Added automatic gateway cleanup on startup, failed startups, and health check failures
- New volume management utilities for handling stale gateway resources
- Manual cleanup command generation for recovery scenarios
Tests
- Added comprehensive test coverage for gateway cleanup functionality

After a failed gateway start, leftover Docker volumes from openshell-cluster-nemoclaw cause "Corrupted cluster state" errors on subsequent runs, requiring manual `docker volume rm` to recover.

Extract destroyGateway() that runs both `openshell gateway destroy` and removes orphaned Docker volumes. Call it:

1. Before starting the gateway (existing pre-cleanup)
2. After a failed gateway start (new: ensures clean retry)
3. After a failed health check (new: same cleanup path)

The error messages now tell the user to simply rerun the installer instead of requiring manual Docker volume management.

Fixes NVIDIA#17

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: peteryuqin <peter.yuqin@gmail.com>

kjw3 · 2026-03-18T14:48:04Z

Scope check after review: PR #95 should stay focused on fixing #17. Related issues #120 and #311 are separate and already have their own open PRs (#121 and #310). Issue #23 is broader; #95 addresses one concrete slice of it but should not absorb the full rerun/idempotency scope.

I rebased the #17 fix onto current origin/main, resolved the bin/lib/onboard.js overlap, added cleanup fallback tests, and pushed the result to origin branch fix/gateway-cleanup-mainline at commit 4b123ef for validation across macOS, Brev, and Spark.

coderabbitai · 2026-03-18T14:52:31Z

Walkthrough

This change introduces a gateway lifecycle cleanup system that automatically removes stale Docker volumes when gateway startup fails or the gateway is destroyed. It adds five helper functions to handle volume enumeration, inspection, removal, and error reporting, along with enhanced startup logic that triggers cleanup on failure scenarios.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Gateway Cleanup Lifecycle `bin/lib/onboard.js` | Added `destroyGateway()` wrapper that orchestrates gateway destruction and volume cleanup. Introduced `gatewayVolumeCandidates()` to identify Docker volumes, `cleanupGatewayVolumes()` to remove volumes and track failures, `manualGatewayVolumeCleanupCommand()` to format recovery commands, and `reportGatewayCleanupResult()` to emit cleanup guidance. Enhanced startup logic to call `destroyGateway()` before fresh start and on failure/health-check paths. Replaced hard-coded gateway name with `DEFAULT_GATEWAY_NAME` constant. Exported all new helpers. |
| Cleanup Function Tests `test/onboard.test.js` | Added comprehensive test suite covering volume candidate enumeration, successful and failed volume removal scenarios, and manual recovery command generation for leftover volumes. |
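
To make the reporting helpers in the table concrete, here is a hedged sketch of how a cleanup result could be turned into user guidance. The function names and the `{ removedVolumes, failedVolumes }` shape come from this PR, but the bodies, the log wording, the `log` parameter, and the exact command format are assumptions; the real implementations in `bin/lib/onboard.js` may differ.

```javascript
// Hypothetical bodies; only the names and the result shape come from the PR.
function manualGatewayVolumeCleanupCommand(failedVolumes) {
  // Assumes names are plain identifiers such as "openshell-cluster-nemoclaw",
  // as produced by gatewayVolumeCandidates(), so no extra quoting is done here.
  return `docker volume rm ${failedVolumes.join(" ")}`;
}

function reportGatewayCleanupResult({ removedVolumes, failedVolumes }, log = console.log) {
  if (removedVolumes.length > 0) {
    log(`Removed stale Docker volumes: ${removedVolumes.join(", ")}`);
  }
  if (failedVolumes.length > 0) {
    // Fall back to a copy-pasteable command when automatic removal failed.
    log(`Could not remove: ${failedVolumes.join(", ")}`);
    log(`To clean up manually, run: ${manualGatewayVolumeCleanupCommand(failedVolumes)}`);
    return false;
  }
  return true;
}

// Usage: a result where removal of the known volume failed.
const clean = reportGatewayCleanupResult({
  removedVolumes: [],
  failedVolumes: ["openshell-cluster-nemoclaw"],
});
console.log(clean); // false
```

Returning a boolean lets the caller choose between the "please rerun the installer" message and the manual recovery path.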

Sequence Diagram

sequenceDiagram
    actor User
    participant Gateway Startup
    participant destroyGateway
    participant Docker Volumes
    participant Cleanup Reporter
    
    User->>Gateway Startup: Start gateway
    Gateway Startup->>destroyGateway: Teardown old state
    destroyGateway->>Docker Volumes: List and remove stale volumes
    Docker Volumes-->>destroyGateway: Removal results (success/failed)
    destroyGateway->>Cleanup Reporter: Report cleanup outcome
    Cleanup Reporter-->>Gateway Startup: Cleanup status
    
    alt Startup succeeds
        Gateway Startup->>User: Gateway ready
    else Startup fails
        Gateway Startup->>destroyGateway: Trigger cleanup
        destroyGateway->>Docker Volumes: Remove volumes
        Docker Volumes-->>destroyGateway: Results with failures
        destroyGateway->>Cleanup Reporter: Report + show recovery command
        Cleanup Reporter-->>Gateway Startup: Cleanup summary
        Gateway Startup->>User: Error + recovery instructions
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A rabbit hops through Docker's land,
Sweeping volumes with careful hand,
When gateways stumble and fail to start,
I clean the mess—a tidy art!
No manual toil, no mess so grand, 🧹✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 45.45%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately describes the main change: implementing cleanup of Docker volumes after failed gateway starts. |
| Linked Issues check | ✅ Passed | The PR implements all key requirements from issue `#17`: new cleanup functions prevent lingering volumes, destroyGateway() is called on failures, error messages guide users to rerun, and tests validate the cleanup behavior. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to fixing issue `#17`: adding gateway cleanup functions, integrating them into startup/failure paths, and testing the new cleanup logic. |

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

test/onboard.test.js (1)

12-53: Add a regression test for Docker-unavailable cleanup behavior.

Current tests validate remove success/failure, but not the case where Docker itself is unavailable. That path is important for avoiding false “cleanup succeeded” messaging.

🧪 Suggested test case

 describe("gateway cleanup helpers", () => {
+  it("marks cleanup as failed when Docker is unavailable", () => {
+    const runFn = (cmd) => {
+      if (cmd.startsWith("docker info")) return { status: 1 };
+      return { status: 1 };
+    };
+
+    const result = cleanupGatewayVolumes(runFn);
+    assert.deepEqual(result, {
+      removedVolumes: [],
+      failedVolumes: ["openshell-cluster-nemoclaw"],
+    });
+  });
+
   it("uses the known OpenShell volume name for the default gateway", () => {
     assert.deepEqual(gatewayVolumeCandidates(), ["openshell-cluster-nemoclaw"]);
   });

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/onboard.test.js` around lines 12 - 53, Add a new test case in
onboard.test.js that simulates Docker being entirely unavailable by having the
runFn return a non-zero status (e.g., {status: 1}) for the "docker volume
inspect" command; call cleanupGatewayVolumes(runFn) and assert it reports
removedVolumes: [] and failedVolumes includes the known gateway name (from
gatewayVolumeCandidates()/openshell-cluster-nemoclaw), then verify
manualGatewayVolumeCleanupCommand(result.failedVolumes) returns the exact manual
removal string; locate and update the existing describe("gateway cleanup
helpers") block and reuse the same helper functions (cleanupGatewayVolumes and
manualGatewayVolumeCleanupCommand) to add this regression test.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 98-107: cleanupGatewayVolumes currently treats any non-zero result
from runFn(`docker volume inspect ...`) as "volume absent", which hides
Docker-daemon/CLI failures; change the logic in cleanupGatewayVolumes (and its
use of runFn and gatewayVolumeCandidates/DEFAULT_GATEWAY_NAME) so that when
inspectResult.status !== 0 you first verify Docker is reachable (e.g., call
runFn('docker info' or 'docker version') and check its status/output) and if
Docker is unavailable return/throw an error (or propagate failure) instead of
continuing; only treat non-zero inspect as "absent" when Docker is confirmed
working. Ensure callers handle the propagated error or failure status.
- Around line 134-136: The command string in destroyGateway interpolates
gatewayName directly; change it to shell-quote gatewayName before interpolation
to avoid injection — e.g., escape any single quotes in gatewayName and wrap it
in single quotes when building the command passed to runFn (reference function
destroyGateway and runFn); leave the call to cleanupGatewayVolumes(gatewayName)
unchanged.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6e05a240-9cab-4301-ad5a-86e1106e305c

📥 Commits

Reviewing files that changed from the base of the PR and between 1e23347 and 4b123ef.

📒 Files selected for processing (2)

bin/lib/onboard.js
test/onboard.test.js

coderabbitai · 2026-03-18T14:55:35Z

+function cleanupGatewayVolumes(runFn = run, gatewayName = DEFAULT_GATEWAY_NAME) {
+  const removedVolumes = [];
+  const failedVolumes = [];
+
+  for (const volumeName of gatewayVolumeCandidates(gatewayName)) {
+    const inspectResult = runFn(`docker volume inspect ${shellQuote(volumeName)} >/dev/null 2>&1`, {
+      ignoreError: true,
+      stdio: "ignore",
+    });
+    if (inspectResult.status !== 0) continue;


⚠️ Potential issue | 🟠 Major

Handle Docker-unavailable cases explicitly before reporting cleanup success.

Right now, a non-zero docker volume inspect is treated as "volume absent" (Line 107), so Docker daemon/CLI failures can end up reported as successful stale-state cleanup.

🔧 Proposed fix

 function cleanupGatewayVolumes(runFn = run, gatewayName = DEFAULT_GATEWAY_NAME) {
   const removedVolumes = [];
   const failedVolumes = [];
+  const dockerReady = runFn("docker info >/dev/null 2>&1", {
+    ignoreError: true,
+    stdio: "ignore",
+  });
+  if (dockerReady.status !== 0) {
+    return { removedVolumes, failedVolumes: gatewayVolumeCandidates(gatewayName) };
+  }

   for (const volumeName of gatewayVolumeCandidates(gatewayName)) {
     const inspectResult = runFn(`docker volume inspect ${shellQuote(volumeName)} >/dev/null 2>&1`, {
       ignoreError: true,
       stdio: "ignore",
     });

coderabbitai · 2026-03-18T14:55:35Z

+function destroyGateway(runFn = run, gatewayName = DEFAULT_GATEWAY_NAME) {
+  runFn(`openshell gateway destroy -g ${gatewayName} 2>/dev/null || true`, { ignoreError: true });
+  return cleanupGatewayVolumes(runFn, gatewayName);


⚠️ Potential issue | 🟠 Major

Quote gatewayName in destroyGateway command construction.

gatewayName is interpolated directly into a shell command on Line 135. Even if current callers use a constant, this exported helper should be safe by default.

🔧 Proposed fix

 function destroyGateway(runFn = run, gatewayName = DEFAULT_GATEWAY_NAME) {
-  runFn(`openshell gateway destroy -g ${gatewayName} 2>/dev/null || true`, { ignoreError: true });
+  runFn(`openshell gateway destroy -g ${shellQuote(gatewayName)} 2>/dev/null || true`, { ignoreError: true });
   return cleanupGatewayVolumes(runFn, gatewayName);
 }
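
The review refers to a `shellQuote` helper without showing its body. A minimal sketch of the usual POSIX single-quote approach (wrap the value in single quotes and escape any embedded single quote) might look like the following; this is an assumption about how such a helper could work, not the project's actual implementation, and `buildDestroyCommand` is a hypothetical name used only for illustration.

```javascript
// Sketch of POSIX-style single-quote escaping; not the project's real helper.
function shellQuote(value) {
  // Inside single quotes the shell treats everything literally, so the only
  // character needing care is ' itself: close the quote, emit \', reopen.
  return `'${String(value).replace(/'/g, `'\\''`)}'`;
}

// Hypothetical helper showing the quoted value inside the destroy command.
function buildDestroyCommand(gatewayName) {
  return `openshell gateway destroy -g ${shellQuote(gatewayName)} 2>/dev/null || true`;
}

console.log(buildDestroyCommand("nemoclaw"));
console.log(shellQuote("x; rm -rf /")); // the ; stays inside the quotes
console.log(shellQuote("it's"));
```

With this quoting, a gateway name containing shell metacharacters reaches `openshell` as a single literal argument instead of being interpreted by the shell.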

## Summary

- stop requiring `NVIDIA_API_KEY` for local-only `nemoclaw start` and only gate the Telegram bridge when that bridge actually needs the key
- clean up the dashboard forward, `nemoclaw` gateway, and `openshell-cluster-nemoclaw` Docker volumes when the last sandbox is destroyed
- broaden unified-memory NVIDIA GPU detection beyond `GB10` while keeping `spark: true` specific to GB10
- harden policy merge/retry behavior so truncated or error-like current-policy reads rebuild from a clean `version: 1` scaffold instead of producing malformed YAML

## Issue Mapping

Fixes #1191
Fixes #1160
Fixes #1182
Fixes #1162
Related #991

## Notes

- `#1188` was investigated but is not included in this PR.
- The current evidence still points to a deeper runtime / proxy reachability problem on macOS + Colima rather than a bounded NemoClaw-only fix.
- Keeping it out of this branch avoids speculative networking changes without strong reproduction and cross-platform coverage.

## Validation

```bash
npx vitest run
npx eslint bin/nemoclaw.js bin/lib/nim.js bin/lib/policies.js test/cli.test.js test/nim.test.js test/policies.test.js test/service-env.test.js
npx tsc -p jsconfig.json --noEmit
```

## References Reviewed

- #1106
- #308
- #95
- #770

Signed-off-by: Kevin Jones <kejones@nvidia.com>

## Summary by CodeRabbit

* **New Features**
  - Core services can start without an NVIDIA API key.
  - Enhanced unified‑memory GPU detection with more accurate capability reporting.
* **Bug Fixes**
  - Gateway and forwarded‑port cleanup only runs when the last sandbox is removed and no live sandboxes remain.
  - Telegram bridge now starts only when both required tokens are present; clearer startup warnings.
  - Policy parsing/merge more robust for metadata‑only or malformed inputs; consistent version header formatting.
* **Tests**
  - Added tests covering GPU detection, policy parsing/merge, CLI sandbox/gateway flows, and service startup.

WuKongAI-CMU and others added 2 commits March 18, 2026 10:36

fix gateway cleanup fallback and tests

4b123ef

kjw3 self-assigned this Mar 18, 2026

kjw3 mentioned this pull request Mar 18, 2026

Make nemoclaw onboard robust when rerun after a failed install instead of requiring uninstall.sh as a workaround #23

Closed

kjw3 added the bug Something isn't working label Mar 18, 2026

kjw3 force-pushed the fix/gateway-cleanup branch from 45a21ef to 4b123ef Compare March 18, 2026 14:51

coderabbitai bot reviewed Mar 18, 2026

ericksoa closed this Mar 18, 2026

WuKongAI-CMU mentioned this pull request Mar 19, 2026

fix: quote shell interpolations and add timeouts in nim.js #97

Closed

1 task

mafueee pushed a commit to mafueee/NemoClaw that referenced this pull request Mar 28, 2026

chore(ci): remove Gitlab CI config (NVIDIA#95)

03939e0

kjw3 mentioned this pull request Mar 31, 2026

fix: address core blocker lifecycle regressions #1208

Merged

WuKongAI-CMU wants to merge 2 commits into NVIDIA:main from WuKongAI-CMU:fix/gateway-cleanup

3 participants
