fix: skip --gpu on WSL2 where GPU passthrough to k3s is unsupported #209
mattezell wants to merge 1 commit into NVIDIA:main from
Conversation
On WSL2, `nvidia-smi` works at the host layer but the GPU cannot be passed through to the k3s cluster inside the OpenShell gateway container (a Docker Desktop limitation). This causes `nemoclaw onboard` to create dead sandboxes that immediately report "not found".

- Add `isWSL2()` detection via `/proc/version` in `bin/lib/nim.js`
- Set `nimCapable: false` when WSL2 is detected (GPU visible but unusable)
- Add a WSL2 info message during onboard preflight
- Fix the `scripts/setup.sh` legacy path with the same WSL2 check
- Add `test/wsl2.test.js`

Tested on WSL2 Ubuntu 24.04 + Docker Desktop + RTX 5090 Laptop. No behavior change on native Linux, macOS, or DGX.

Fixes NVIDIA#208

Signed-off-by: Matt Ezell <ezell.matt@gmail.com>
Noticed #140 describes the same WSL2 symptoms. PR #229 fixes the error-masking side (an awk pipe swallowing exit codes), which is a solid improvement for all platforms. This PR fixes the upstream root cause on WSL2 specifically: it prevents `--gpu` from being passed when the GPU can't actually reach the container runtime. The two fixes are complementary: #229 ensures failures surface clearly, while this PR prevents the failure from occurring on WSL2 in the first place.
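The masking that #229 addresses can be reproduced with a generic pipeline (an illustrative demo, not the project's actual script):

```shell
# A pipeline's exit status is the LAST command's status, so awk at the
# end of the pipe hides the failure of the command feeding it:
false | awk '{ print }'
echo "without pipefail: $?"

# With pipefail (bash/zsh/ksh), the failing producer surfaces instead:
( set -o pipefail; false | awk '{ print }'; echo "with pipefail: $?" )
```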
Nice catch on the WSL2 GPU passthrough issue, @mattezell! That's a real pain point for folks on that platform. Just wanted to flag that the codebase has changed a fair amount since this was opened: we've introduced CI checks and landed several new features. When you get a chance, would you be able to rebase against the latest main? That'll make it much easier for us to review and merge. Thanks!
@cv thanks. It looks like the issue I filed has been closed, with a later-submitted PR selected as the fix, so I will go ahead and close this out.
…l inference (NVIDIA#209)

* feat(inference): add sandbox-system inference route for platform-level inference

  Add a separate system-level inference route that the sandbox supervisor can use in-process for platform functions (e.g., an embedded agent harness for policy analysis), distinct from the user-facing `inference.local` endpoint. The system route is accessed via an in-process API on the supervisor, ensuring userland code in the sandbox netns cannot reach it.

  - Extend proto with `route_name` fields on Set/Get inference messages
  - Add `ResolvedRoute.name` field to the router for route segregation
  - Server resolves both user and sandbox-system routes in bundles
  - Sandbox partitions routes into user/system caches on refresh
  - Expose `InferenceContext::system_inference()` in-process API
  - CLI `--sandbox` flag targets the system route on set/get/update
  - Integration tests using `mock://` routes for the full in-process path

  Closes NVIDIA#207

* refactor(cli): rename `--sandbox` flag to `--system` for inference commands

  The `--sandbox` flag could be misread as targeting a user-level sandbox operation. Rename to `--system` to clearly indicate it configures the platform-level system inference route.

* style(cli): collapse short if-else per rustfmt 1.94

Co-authored-by: John Myers <johntmyers@users.noreply.github.com>
Problem
`nemoclaw onboard` forces `--gpu` on both `openshell gateway start` and `openshell sandbox create` whenever `nvidia-smi` is detected. On WSL2 with Docker Desktop, `nvidia-smi` works at the host layer but the GPU cannot be passed through to the k3s cluster inside the OpenShell gateway container. The result is a dead sandbox that immediately reports "not found" on every subsequent command.

There is no `--no-gpu` flag, environment variable, or config option to override this.

Affected users: everyone on WSL2 with any NVIDIA GPU.
Confirmed on: WSL2 Ubuntu 24.04 + Docker Desktop + RTX 5090 Laptop
Fixes #208
Solution
Detect WSL2 via `/proc/version` (which contains "microsoft" or "WSL" on WSL2 kernels) and set `nimCapable: false`. The existing `if (gpu && gpu.nimCapable)` guards in `onboard.js` (lines 116 and 187) then automatically skip `--gpu`. Cloud inference continues to work normally through the OpenShell proxy.

Changes
- `bin/lib/nim.js`: add `isWSL2()` helper; set `nimCapable: false` and `wsl2: true` on WSL2
- `bin/lib/onboard.js`: WSL2 info message during onboard preflight
- `scripts/setup.sh`: same WSL2 check for the legacy path
- `test/wsl2.test.js`: cover `isWSL2()` and `detectGpu()` WSL2 awareness

What does NOT change
- Native Linux, macOS, and DGX: `nimCapable` stays `true`, no behavior change
- Hosts already detected as `nimCapable: false`: no behavior change

Testing
```
node --test test/preflight.test.js test/wsl2.test.js   # 18 pass, 0 fail
```

Manual verification on WSL2 Ubuntu 24.04 + RTX 5090:
`nemoclaw onboard` completes, the sandbox stays alive, and `openclaw tui` connects via cloud inference.