fix: skip --gpu on WSL2 where GPU passthrough to k3s is unsupported#209

Closed
mattezell wants to merge 1 commit into NVIDIA:main from mattezell:fix/wsl2-gpu-detection

Conversation


@mattezell mattezell commented Mar 17, 2026

Problem

nemoclaw onboard forces --gpu on both openshell gateway start and openshell sandbox create whenever nvidia-smi is detected. On WSL2 with Docker Desktop, nvidia-smi works at the host layer but the GPU cannot be passed through to the k3s cluster inside the OpenShell gateway container. The result is a dead sandbox that immediately reports "not found" on every subsequent command.

There is no --no-gpu flag, environment variable, or config option to override this.

Affected users: Everyone on WSL2 with any NVIDIA GPU.

Confirmed on:

Fixes #208

Solution

Detect WSL2 via /proc/version (which contains "microsoft" or "WSL" on WSL2 kernels) and set nimCapable: false. The existing if (gpu && gpu.nimCapable) guards in onboard.js (lines 116, 187) automatically skip --gpu. Cloud inference works normally through the OpenShell proxy.

Changes

File                  Change
bin/lib/nim.js        Add isWSL2() helper; set nimCapable: false and wsl2: true on WSL2
bin/lib/onboard.js    Add WSL2 info message during preflight GPU detection
scripts/setup.sh      Add WSL2 check to legacy gateway start path
test/wsl2.test.js     Tests for isWSL2() and detectGpu() WSL2 awareness
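For the legacy `scripts/setup.sh` path, the equivalent guard might look like the sketch below. This is an assumption-laden illustration, not the actual diff: the variable name and the info message are invented for clarity.

```shell
#!/bin/sh
# Sketch: only pass --gpu when nvidia-smi exists AND we are not on WSL2,
# where the GPU cannot be passed through to the k3s cluster.
GPU_FLAG=""
if command -v nvidia-smi >/dev/null 2>&1; then
  if grep -qiE 'microsoft|wsl' /proc/version 2>/dev/null; then
    echo "INFO: WSL2 detected; GPU passthrough to k3s is unsupported, skipping --gpu"
  else
    GPU_FLAG="--gpu"
  fi
fi
# Hypothetical invocation site:
# openshell gateway start $GPU_FLAG
```

The same `grep` pattern mirrors the `/proc/version` check used in `isWSL2()`, keeping the Node and shell paths consistent.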

What does NOT change

  • Native Linux with NVIDIA GPU: nimCapable stays true, no behavior change
  • macOS: already nimCapable: false, no behavior change
  • DGX Spark / DGX Station: not WSL2, no behavior change
  • All 13 existing tests pass unchanged

Testing

node --test test/preflight.test.js test/wsl2.test.js
# 18 pass, 0 fail

Manual verification on WSL2 Ubuntu 24.04 + RTX 5090: nemoclaw onboard completes, sandbox stays alive, openclaw tui connects via cloud inference.

On WSL2, nvidia-smi works at the host layer but the GPU cannot be passed
through to the k3s cluster inside the OpenShell gateway container (Docker
Desktop limitation). This causes nemoclaw onboard to create dead sandboxes
that immediately report 'not found'.

- Add isWSL2() detection via /proc/version in bin/lib/nim.js
- Set nimCapable: false when WSL2 detected (GPU visible but unusable)
- Add WSL2 info message during onboard preflight
- Fix scripts/setup.sh legacy path with same WSL2 check
- Add test/wsl2.test.js

Tested on WSL2 Ubuntu 24.04 + Docker Desktop + RTX 5090 Laptop.
No behavior change on native Linux, macOS, or DGX.

Fixes NVIDIA#208

Signed-off-by: Matt Ezell <ezell.matt@gmail.com>
@mattezell
Author

Fixes: #208

@mattezell
Author

Noticed #140 describes the same WSL2 symptoms. PR #229 fixes the error-masking side (awk pipe swallowing exit codes), which is a solid improvement for all platforms. This PR fixes the upstream root cause on WSL2 specifically — preventing --gpu from being passed when the GPU can't actually reach the container runtime. The two fixes are complementary: #229 ensures failures surface clearly, this PR prevents the failure from occurring on WSL2 in the first place.

@wscurran wscurran added the Platform: Windows/WSL Support for Windows Subsystem for Linux label Mar 18, 2026
@cv
Contributor

cv commented Mar 21, 2026

Nice catch on the WSL2 GPU passthrough issue, @mattezell! That's a real pain point for folks on that platform. Just wanted to flag that the codebase has changed a fair amount since this was opened — we've introduced CI checks and landed several new features. When you get a chance, would you be able to rebase against the latest main? That'll make it much easier for us to review and merge. Thanks!

@mattezell
Author

@cv thanks. It looks like the issue I filed has since been closed, with a later-submitted PR selected as the fix, so I'll go ahead and close this out.

@mattezell mattezell closed this Mar 25, 2026
mafueee pushed a commit to mafueee/NemoClaw that referenced this pull request Mar 28, 2026
…l inference (NVIDIA#209)

* feat(inference): add sandbox-system inference route for platform-level inference

Add a separate system-level inference route that the sandbox supervisor
can use in-process for platform functions (e.g., embedded agent harness
for policy analysis), distinct from the user-facing inference.local
endpoint. The system route is accessed via an in-process API on the
supervisor, ensuring userland code in the sandbox netns cannot reach it.

- Extend proto with route_name fields on Set/Get inference messages
- Add ResolvedRoute.name field to the router for route segregation
- Server resolves both user and sandbox-system routes in bundles
- Sandbox partitions routes into user/system caches on refresh
- Expose InferenceContext::system_inference() in-process API
- CLI --sandbox flag targets the system route on set/get/update
- Integration tests using mock:// routes for the full in-process path

Closes NVIDIA#207

* refactor(cli): rename --sandbox flag to --system for inference commands

The --sandbox flag could be misread as targeting a user-level sandbox
operation. Rename to --system to clearly indicate it configures the
platform-level system inference route.

* style(cli): collapse short if-else per rustfmt 1.94

---------

Co-authored-by: John Myers <johntmyers@users.noreply.github.com>
Linked issue: [BUG] nemoclaw onboard forces --gpu on WSL2, sandbox DOA (workaround included)