fix: clean up Docker volumes after failed gateway start#95
fix: clean up Docker volumes after failed gateway start#95WuKongAI-CMU wants to merge 2 commits intoNVIDIA:mainfrom
Conversation
After a failed gateway start, leftover Docker volumes from openshell-cluster-nemoclaw cause "Corrupted cluster state" errors on subsequent runs, requiring manual `docker volume rm` to recover. Extract destroyGateway() that runs both `openshell gateway destroy` and removes orphaned Docker volumes. Call it: 1. Before starting the gateway (existing pre-cleanup) 2. After a failed gateway start (new — ensures clean retry) 3. After a failed health check (new — same cleanup path) The error messages now tell the user to simply rerun the installer instead of requiring manual Docker volume management. Fixes NVIDIA#17 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: peteryuqin <peter.yuqin@gmail.com>
|
Scope check after review: PR #95 should stay focused on fixing #17. Related issues #120 and #311 are separate and already have their own open PRs (#121 and #310). Issue #23 is broader; #95 addresses one concrete slice of it but should not absorb the full rerun/idempotency scope. I rebased the #17 fix onto current origin/main, resolved the bin/lib/onboard.js overlap, added cleanup fallback tests, and pushed the result to origin branch fix/gateway-cleanup-mainline at commit 4b123ef for validation across macOS, Brev, and Spark. |
45a21ef to
4b123ef
Compare
📝 WalkthroughWalkthroughThis change introduces a gateway lifecycle cleanup system that automatically removes stale Docker volumes when gateway startup fails or the gateway is destroyed. It adds five helper functions to handle volume enumeration, inspection, removal, and error reporting, along with enhanced startup logic that triggers cleanup on failure scenarios. Changes
Sequence DiagramsequenceDiagram
actor User
participant Gateway Startup
participant destroyGateway
participant Docker Volumes
participant Cleanup Reporter
User->>Gateway Startup: Start gateway
Gateway Startup->>destroyGateway: Teardown old state
destroyGateway->>Docker Volumes: List and remove stale volumes
Docker Volumes-->>destroyGateway: Removal results (success/failed)
destroyGateway->>Cleanup Reporter: Report cleanup outcome
Cleanup Reporter-->>Gateway Startup: Cleanup status
alt Startup succeeds
Gateway Startup->>User: Gateway ready
else Startup fails
Gateway Startup->>destroyGateway: Trigger cleanup
destroyGateway->>Docker Volumes: Remove volumes
Docker Volumes-->>destroyGateway: Results with failures
destroyGateway->>Cleanup Reporter: Report + show recovery command
Cleanup Reporter-->>Gateway Startup: Cleanup summary
Gateway Startup->>User: Error + recovery instructions
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
test/onboard.test.js (1)
12-53: Add a regression test for Docker-unavailable cleanup behavior.Current tests validate remove success/failure, but not the case where Docker itself is unavailable. That path is important for avoiding false “cleanup succeeded” messaging.
🧪 Suggested test case
describe("gateway cleanup helpers", () => { + it("marks cleanup as failed when Docker is unavailable", () => { + const runFn = (cmd) => { + if (cmd.startsWith("docker info")) return { status: 1 }; + return { status: 1 }; + }; + + const result = cleanupGatewayVolumes(runFn); + assert.deepEqual(result, { + removedVolumes: [], + failedVolumes: ["openshell-cluster-nemoclaw"], + }); + }); + it("uses the known OpenShell volume name for the default gateway", () => { assert.deepEqual(gatewayVolumeCandidates(), ["openshell-cluster-nemoclaw"]); });🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/onboard.test.js` around lines 12 - 53, Add a new test case in onboard.test.js that simulates Docker being entirely unavailable by having the runFn return a non-zero status (e.g., {status: 1}) for the "docker volume inspect" command; call cleanupGatewayVolumes(runFn) and assert it reports removedVolumes: [] and failedVolumes includes the known gateway name (from gatewayVolumeCandidates()/openshell-cluster-nemoclaw), then verify manualGatewayVolumeCleanupCommand(result.failedVolumes) returns the exact manual removal string; locate and update the existing describe("gateway cleanup helpers") block and reuse the same helper functions (cleanupGatewayVolumes and manualGatewayVolumeCleanupCommand) to add this regression test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/onboard.js`:
- Around line 98-107: cleanupGatewayVolumes currently treats any non-zero result
from runFn(`docker volume inspect ...`) as "volume absent", which hides
Docker-daemon/CLI failures; change the logic in cleanupGatewayVolumes (and its
use of runFn and gatewayVolumeCandidates/DEFAULT_GATEWAY_NAME) so that when
inspectResult.status !== 0 you first verify Docker is reachable (e.g., call
runFn('docker info' or 'docker version') and check its status/output) and if
Docker is unavailable return/throw an error (or propagate failure) instead of
continuing; only treat non-zero inspect as "absent" when Docker is confirmed
working. Ensure callers handle the propagated error or failure status.
- Around line 134-136: The command string in destroyGateway interpolates
gatewayName directly; change it to shell-quote gatewayName before interpolation
to avoid injection — e.g., escape any single quotes in gatewayName and wrap it
in single quotes when building the command passed to runFn (reference function
destroyGateway and runFn); leave the call to cleanupGatewayVolumes(gatewayName)
unchanged.
---
Nitpick comments:
In `@test/onboard.test.js`:
- Around line 12-53: Add a new test case in onboard.test.js that simulates
Docker being entirely unavailable by having the runFn return a non-zero status
(e.g., {status: 1}) for the "docker volume inspect" command; call
cleanupGatewayVolumes(runFn) and assert it reports removedVolumes: [] and
failedVolumes includes the known gateway name (from
gatewayVolumeCandidates()/openshell-cluster-nemoclaw), then verify
manualGatewayVolumeCleanupCommand(result.failedVolumes) returns the exact manual
removal string; locate and update the existing describe("gateway cleanup
helpers") block and reuse the same helper functions (cleanupGatewayVolumes and
manualGatewayVolumeCleanupCommand) to add this regression test.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6e05a240-9cab-4301-ad5a-86e1106e305c
📒 Files selected for processing (2)
bin/lib/onboard.jstest/onboard.test.js
| function cleanupGatewayVolumes(runFn = run, gatewayName = DEFAULT_GATEWAY_NAME) { | ||
| const removedVolumes = []; | ||
| const failedVolumes = []; | ||
|
|
||
| for (const volumeName of gatewayVolumeCandidates(gatewayName)) { | ||
| const inspectResult = runFn(`docker volume inspect ${shellQuote(volumeName)} >/dev/null 2>&1`, { | ||
| ignoreError: true, | ||
| stdio: "ignore", | ||
| }); | ||
| if (inspectResult.status !== 0) continue; |
There was a problem hiding this comment.
Handle Docker-unavailable cases explicitly before reporting cleanup success.
Right now, a non-zero docker volume inspect is treated as "volume absent" (Line 107), so Docker daemon/CLI failures can end up reported as successful stale-state cleanup.
🔧 Proposed fix
function cleanupGatewayVolumes(runFn = run, gatewayName = DEFAULT_GATEWAY_NAME) {
const removedVolumes = [];
const failedVolumes = [];
+ const dockerReady = runFn("docker info >/dev/null 2>&1", {
+ ignoreError: true,
+ stdio: "ignore",
+ });
+ if (dockerReady.status !== 0) {
+ return { removedVolumes, failedVolumes: gatewayVolumeCandidates(gatewayName) };
+ }
for (const volumeName of gatewayVolumeCandidates(gatewayName)) {
const inspectResult = runFn(`docker volume inspect ${shellQuote(volumeName)} >/dev/null 2>&1`, {
ignoreError: true,
stdio: "ignore",
});🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bin/lib/onboard.js` around lines 98 - 107, cleanupGatewayVolumes currently
treats any non-zero result from runFn(`docker volume inspect ...`) as "volume
absent", which hides Docker-daemon/CLI failures; change the logic in
cleanupGatewayVolumes (and its use of runFn and
gatewayVolumeCandidates/DEFAULT_GATEWAY_NAME) so that when inspectResult.status
!== 0 you first verify Docker is reachable (e.g., call runFn('docker info' or
'docker version') and check its status/output) and if Docker is unavailable
return/throw an error (or propagate failure) instead of continuing; only treat
non-zero inspect as "absent" when Docker is confirmed working. Ensure callers
handle the propagated error or failure status.
| function destroyGateway(runFn = run, gatewayName = DEFAULT_GATEWAY_NAME) { | ||
| runFn(`openshell gateway destroy -g ${gatewayName} 2>/dev/null || true`, { ignoreError: true }); | ||
| return cleanupGatewayVolumes(runFn, gatewayName); |
There was a problem hiding this comment.
Quote gatewayName in destroyGateway command construction.
gatewayName is interpolated directly into a shell command on Line 135. Even if current callers use a constant, this exported helper should be safe by default.
🔧 Proposed fix
function destroyGateway(runFn = run, gatewayName = DEFAULT_GATEWAY_NAME) {
- runFn(`openshell gateway destroy -g ${gatewayName} 2>/dev/null || true`, { ignoreError: true });
+ runFn(`openshell gateway destroy -g ${shellQuote(gatewayName)} 2>/dev/null || true`, { ignoreError: true });
return cleanupGatewayVolumes(runFn, gatewayName);
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| function destroyGateway(runFn = run, gatewayName = DEFAULT_GATEWAY_NAME) { | |
| runFn(`openshell gateway destroy -g ${gatewayName} 2>/dev/null || true`, { ignoreError: true }); | |
| return cleanupGatewayVolumes(runFn, gatewayName); | |
| function destroyGateway(runFn = run, gatewayName = DEFAULT_GATEWAY_NAME) { | |
| runFn(`openshell gateway destroy -g ${shellQuote(gatewayName)} 2>/dev/null || true`, { ignoreError: true }); | |
| return cleanupGatewayVolumes(runFn, gatewayName); | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bin/lib/onboard.js` around lines 134 - 136, The command string in
destroyGateway interpolates gatewayName directly; change it to shell-quote
gatewayName before interpolation to avoid injection — e.g., escape any single
quotes in gatewayName and wrap it in single quotes when building the command
passed to runFn (reference function destroyGateway and runFn); leave the call to
cleanupGatewayVolumes(gatewayName) unchanged.
## Summary - stop requiring `NVIDIA_API_KEY` for local-only `nemoclaw start` and only gate the Telegram bridge when that bridge actually needs the key - clean up the dashboard forward, `nemoclaw` gateway, and `openshell-cluster-nemoclaw` Docker volumes when the last sandbox is destroyed - broaden unified-memory NVIDIA GPU detection beyond `GB10` while keeping `spark: true` specific to GB10 - harden policy merge/retry behavior so truncated or error-like current-policy reads rebuild from a clean `version: 1` scaffold instead of producing malformed YAML ## Issue Mapping Fixes #1191 Fixes #1160 Fixes #1182 Fixes #1162 Related #991 ## Notes - `#1188` was investigated but is not included in this PR. - The current evidence still points to a deeper runtime / proxy reachability problem on macOS + Colima rather than a bounded NemoClaw-only fix. - Keeping it out of this branch avoids speculative networking changes without strong reproduction and cross-platform coverage. ## Validation ```bash npx vitest run npx eslint bin/nemoclaw.js bin/lib/nim.js bin/lib/policies.js test/cli.test.js test/nim.test.js test/policies.test.js test/service-env.test.js npx tsc -p jsconfig.json --noEmit ``` ## References Reviewed - #1106 - #308 - #95 - #770 Signed-off-by: Kevin Jones <kejones@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** - Core services can start without an NVIDIA API key. - Enhanced unified‑memory GPU detection with more accurate capability reporting. * **Bug Fixes** - Gateway and forwarded‑port cleanup only runs when the last sandbox is removed and no live sandboxes remain. - Telegram bridge now starts only when both required tokens are present; clearer startup warnings. - Policy parsing/merge more robust for metadata‑only or malformed inputs; consistent version header formatting. * **Tests** - Added tests covering GPU detection, policy parsing/merge, CLI sandbox/gateway flows, and service startup. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary - stop requiring `NVIDIA_API_KEY` for local-only `nemoclaw start` and only gate the Telegram bridge when that bridge actually needs the key - clean up the dashboard forward, `nemoclaw` gateway, and `openshell-cluster-nemoclaw` Docker volumes when the last sandbox is destroyed - broaden unified-memory NVIDIA GPU detection beyond `GB10` while keeping `spark: true` specific to GB10 - harden policy merge/retry behavior so truncated or error-like current-policy reads rebuild from a clean `version: 1` scaffold instead of producing malformed YAML ## Issue Mapping Fixes #1191 Fixes #1160 Fixes #1182 Fixes #1162 Related #991 ## Notes - `#1188` was investigated but is not included in this PR. - The current evidence still points to a deeper runtime / proxy reachability problem on macOS + Colima rather than a bounded NemoClaw-only fix. - Keeping it out of this branch avoids speculative networking changes without strong reproduction and cross-platform coverage. ## Validation ```bash npx vitest run npx eslint bin/nemoclaw.js bin/lib/nim.js bin/lib/policies.js test/cli.test.js test/nim.test.js test/policies.test.js test/service-env.test.js npx tsc -p jsconfig.json --noEmit ``` ## References Reviewed - #1106 - #308 - #95 - #770 Signed-off-by: Kevin Jones <kejones@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** - Core services can start without an NVIDIA API key. - Enhanced unified‑memory GPU detection with more accurate capability reporting. * **Bug Fixes** - Gateway and forwarded‑port cleanup only runs when the last sandbox is removed and no live sandboxes remain. - Telegram bridge now starts only when both required tokens are present; clearer startup warnings. - Policy parsing/merge more robust for metadata‑only or malformed inputs; consistent version header formatting. * **Tests** - Added tests covering GPU detection, policy parsing/merge, CLI sandbox/gateway flows, and service startup. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary - stop requiring `NVIDIA_API_KEY` for local-only `nemoclaw start` and only gate the Telegram bridge when that bridge actually needs the key - clean up the dashboard forward, `nemoclaw` gateway, and `openshell-cluster-nemoclaw` Docker volumes when the last sandbox is destroyed - broaden unified-memory NVIDIA GPU detection beyond `GB10` while keeping `spark: true` specific to GB10 - harden policy merge/retry behavior so truncated or error-like current-policy reads rebuild from a clean `version: 1` scaffold instead of producing malformed YAML ## Issue Mapping Fixes NVIDIA#1191 Fixes NVIDIA#1160 Fixes NVIDIA#1182 Fixes NVIDIA#1162 Related NVIDIA#991 ## Notes - `NVIDIA#1188` was investigated but is not included in this PR. - The current evidence still points to a deeper runtime / proxy reachability problem on macOS + Colima rather than a bounded NemoClaw-only fix. - Keeping it out of this branch avoids speculative networking changes without strong reproduction and cross-platform coverage. ## Validation ```bash npx vitest run npx eslint bin/nemoclaw.js bin/lib/nim.js bin/lib/policies.js test/cli.test.js test/nim.test.js test/policies.test.js test/service-env.test.js npx tsc -p jsconfig.json --noEmit ``` ## References Reviewed - NVIDIA#1106 - NVIDIA#308 - NVIDIA#95 - NVIDIA#770 Signed-off-by: Kevin Jones <kejones@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** - Core services can start without an NVIDIA API key. - Enhanced unified‑memory GPU detection with more accurate capability reporting. * **Bug Fixes** - Gateway and forwarded‑port cleanup only runs when the last sandbox is removed and no live sandboxes remain. - Telegram bridge now starts only when both required tokens are present; clearer startup warnings. - Policy parsing/merge more robust for metadata‑only or malformed inputs; consistent version header formatting. * **Tests** - Added tests covering GPU detection, policy parsing/merge, CLI sandbox/gateway flows, and service startup. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary - stop requiring `NVIDIA_API_KEY` for local-only `nemoclaw start` and only gate the Telegram bridge when that bridge actually needs the key - clean up the dashboard forward, `nemoclaw` gateway, and `openshell-cluster-nemoclaw` Docker volumes when the last sandbox is destroyed - broaden unified-memory NVIDIA GPU detection beyond `GB10` while keeping `spark: true` specific to GB10 - harden policy merge/retry behavior so truncated or error-like current-policy reads rebuild from a clean `version: 1` scaffold instead of producing malformed YAML ## Issue Mapping Fixes NVIDIA#1191 Fixes NVIDIA#1160 Fixes NVIDIA#1182 Fixes NVIDIA#1162 Related NVIDIA#991 ## Notes - `NVIDIA#1188` was investigated but is not included in this PR. - The current evidence still points to a deeper runtime / proxy reachability problem on macOS + Colima rather than a bounded NemoClaw-only fix. - Keeping it out of this branch avoids speculative networking changes without strong reproduction and cross-platform coverage. ## Validation ```bash npx vitest run npx eslint bin/nemoclaw.js bin/lib/nim.js bin/lib/policies.js test/cli.test.js test/nim.test.js test/policies.test.js test/service-env.test.js npx tsc -p jsconfig.json --noEmit ``` ## References Reviewed - NVIDIA#1106 - NVIDIA#308 - NVIDIA#95 - NVIDIA#770 Signed-off-by: Kevin Jones <kejones@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** - Core services can start without an NVIDIA API key. - Enhanced unified‑memory GPU detection with more accurate capability reporting. * **Bug Fixes** - Gateway and forwarded‑port cleanup only runs when the last sandbox is removed and no live sandboxes remain. - Telegram bridge now starts only when both required tokens are present; clearer startup warnings. - Policy parsing/merge more robust for metadata‑only or malformed inputs; consistent version header formatting. * **Tests** - Added tests covering GPU detection, policy parsing/merge, CLI sandbox/gateway flows, and service startup. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
destroyGateway()helper that runsopenshell gateway destroyAND removes orphanedopenshell-cluster-nemoclawDocker volumesdocker volume rmFixes #17
Root cause
openshell gateway destroydoes not always remove the Docker volumes it created. When a subsequentopenshell gateway startfinds these volumes, it fails with "Corrupted cluster state". Users had to discoverdocker volume rmon their own.Before
After
Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
New Features
Tests