fix: detect cgroup v2 misconfiguration during onboard preflight by brianwtaylor · Pull Request #62 · NVIDIA/NemoClaw

brianwtaylor · 2026-03-16T23:21:59Z

Summary

Adds a preflight check to nemoclaw onboard (step [1/7]) that detects the cgroup v2 / Docker cgroupns misconfiguration before gateway startup, instead of letting it fail late with a cryptic kubelet error.

- New module bin/lib/preflight.js with isCgroupV2() and checkCgroupConfig() detection
- Wired into the existing preflight step in bin/lib/onboard.js between Docker/OpenShell checks and GPU detection
- On failure: prints the exact error users will see, and points them to nemoclaw setup-spark (which does a safe JSON merge of daemon.json, preserving existing settings)
- 13 new unit tests in test/preflight.test.js covering all branches
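The detection described above can be sketched roughly as follows. This is a minimal illustration, not the code from bin/lib/preflight.js: the names isCgroupV2() and checkCgroupConfig() and the `{ok, reason}` result shape come from this PR, while the `deps` object, its `statFsType` and `readFile` members, and the exact reason wording are assumptions made for the example.

```javascript
// Hypothetical sketch; the real bin/lib/preflight.js may differ.
const DAEMON_JSON = "/etc/docker/daemon.json";

// deps.statFsType(path) is an injected function (assumed name) that returns
// the filesystem type of `path`, e.g. "cgroup2fs" on a cgroup v2 host.
function isCgroupV2(deps) {
  try {
    return deps.statFsType("/sys/fs/cgroup").trim() === "cgroup2fs";
  } catch {
    // macOS, or any host where the probe fails: treat as "not cgroup v2".
    return false;
  }
}

// deps.readFile(path) is an injected function (assumed name) that returns
// the file contents as a string and throws if the file is missing.
function checkCgroupConfig(deps) {
  if (!isCgroupV2(deps)) return { ok: true }; // check is silently skipped

  let raw;
  try {
    raw = deps.readFile(DAEMON_JSON);
  } catch {
    return { ok: false, reason: `${DAEMON_JSON} does not exist` };
  }

  let config;
  try {
    config = JSON.parse(raw);
  } catch {
    return { ok: false, reason: `${DAEMON_JSON} is not valid JSON` };
  }

  const isObject = config !== null && typeof config === "object";
  if (!isObject || config["default-cgroupns-mode"] !== "host") {
    return {
      ok: false,
      reason: `${DAEMON_JSON} does not set "default-cgroupns-mode": "host"`,
    };
  }
  return { ok: true };
}
```

Passing the probes in as arguments is what lets unit tests cover every branch without touching the real filesystem or Docker configuration.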

Closes #16

Hardware verification

Tested on two NVIDIA DGX Spark units (aarch64, Ubuntu 24.04.4 LTS, Docker 28.x, cgroup v2):

| Machine | cgroup | daemon.json | Detection result | Expected |
| --- | --- | --- | --- | --- |
| spark0 | cgroup2fs | missing | `{ok: false, reason: "...does not exist..."}` | Correct |
| spark1 | cgroup2fs | missing | cgroup2fs confirmed, no daemon.json | Correct |
| spark0 (simulated fix) | cgroup2fs | `{"default-cgroupns-mode": "host"}` | `{ok: true}` | Correct |

Both machines are in the exact broken state described in #16. The check catches it and exits with actionable guidance before the gateway ever starts.

Test plan

Automated Tests

node --test test/preflight.test.js
node --test test/*.test.js

Manual verification

- On a cgroup v2 host without daemon.json: nemoclaw onboard should fail at step [1/7] with the cgroup error
- After running nemoclaw setup-spark: nemoclaw onboard should pass the cgroup check
- On a cgroup v1 host or macOS: check is silently skipped (returns {ok: true})
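The fix applied by nemoclaw setup-spark is described in this PR as a safe JSON merge of daemon.json that preserves existing settings. The setup-spark implementation is not shown here, so the following is only a sketch of that idea in JavaScript; the function name `mergeDaemonConfig` is hypothetical.

```javascript
// Hypothetical sketch of a "safe merge": keep every existing key in
// daemon.json and only set (or overwrite) "default-cgroupns-mode".
function mergeDaemonConfig(existingText) {
  // A missing or empty file is treated as an empty configuration.
  const hasContent = typeof existingText === "string" && existingText.trim() !== "";
  const existing = hasContent ? JSON.parse(existingText) : {};

  // Refuse to merge into anything that is not a plain JSON object.
  if (existing === null || typeof existing !== "object" || Array.isArray(existing)) {
    throw new Error("daemon.json must contain a JSON object");
  }
  return { ...existing, "default-cgroupns-mode": "host" };
}

// Example: an existing runtime setting survives the merge.
const merged = mergeDaemonConfig('{"default-runtime": "nvidia"}');
console.log(JSON.stringify(merged, null, 2));
```

Merging rather than overwriting matters on hosts where daemon.json already carries other settings, such as a container runtime configuration.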

Add a preflight check that catches the #1 onboarding blocker on Ubuntu 24.04, DGX Spark, and WSL2. When cgroup v2 is active but Docker's daemon.json lacks "default-cgroupns-mode": "host", onboarding now fails fast with a clear error and fix instructions instead of failing late at gateway startup with a cryptic kubelet error. Closes #16

ericksoa

LGTM. Clean implementation — detection logic is well-separated in preflight.js with dependency injection for testability, error message is actionable (points to nemoclaw setup-spark), and 13 unit tests cover all branches.

Verified: code reads one file (/etc/docker/daemon.json) and runs one safe command (stat -fc %T /sys/fs/cgroup). No shell injection surface, no network calls. macOS/cgroup v1 silently passes.
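For illustration, the two probes mentioned in the review could look roughly like this. It is a sketch under assumptions, not the merged code: the helper names `probeCgroupFs` and `readDaemonJson` are hypothetical, and only the `stat -fc %T /sys/fs/cgroup` command and the /etc/docker/daemon.json path come from the PR.

```javascript
const { execFileSync } = require("node:child_process");
const fs = require("node:fs");

// Runs `stat -fc %T <path>` without a shell. The arguments are passed as an
// array, so nothing is interpolated into a command string.
// Returns the filesystem type (e.g. "cgroup2fs"), or null if the probe fails
// (for example on macOS, where `stat` does not accept these flags).
function probeCgroupFs(path = "/sys/fs/cgroup") {
  try {
    const out = execFileSync("stat", ["-fc", "%T", path], {
      encoding: "utf8",
      stdio: ["ignore", "pipe", "ignore"], // discard stderr noise
    });
    return out.trim();
  } catch {
    return null;
  }
}

// Reads the Docker daemon config as text, or returns null if it is absent
// or unreadable.
function readDaemonJson(path = "/etc/docker/daemon.json") {
  try {
    return fs.readFileSync(path, "utf8");
  } catch {
    return null;
  }
}

console.log("cgroup fs type:", probeCgroupFs());
console.log("daemon.json present:", readDaemonJson() !== null);
```

Using execFileSync with a fixed argument array, rather than building a shell command string, is one way to keep the check free of a shell injection surface.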

Hardware verification by the author on two DGX Spark units covers the real detection path that can't be exercised in Docker CI.

…tibility

On cgroup v2 hosts (Ubuntu 24.04, DGX Spark), k3s-in-Docker requires cgroupns=host in the Docker daemon config. Without this, kubelet fails to initialize, the 'openshell' namespace is never created, and gateway startup times out with:

'timed out waiting for namespace openshell to exist'

This was previously caught by a preflight check (NVIDIA#62) that was removed in NVIDIA#248. This commit restores the check in the onboard preflight:

- Detects cgroup v2 by reading /proc/mounts for 'cgroup2'
- Checks /etc/docker/daemon.json for 'default-cgroupns-mode: host'
- On mismatch, exits with a clear error and remediation steps (run 'sudo nemoclaw setup-spark' or manually configure daemon.json)
- Shows a success message when cgroup v2 is properly configured

Also improves GPU display for DGX Spark: shows 'unified memory' instead of 'VRAM' when the Spark GPU is detected (since GB10 uses unified memory architecture where nvidia-smi cannot report dedicated VRAM).

Fixes NVIDIA#280
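The /proc/mounts approach mentioned in this commit message could be sketched as below. The commit only says it reads /proc/mounts for 'cgroup2'; matching on the filesystem-type field (the third whitespace-separated column) is an assumption of this example, and `hasCgroup2Mount` is a hypothetical name.

```javascript
// Returns true when any mount entry has filesystem type "cgroup2".
// Each /proc/mounts line has the form:
//   <device> <mount point> <fs type> <options> <dump> <pass>
function hasCgroup2Mount(mountsText) {
  return mountsText.split("\n").some((line) => {
    const fields = line.trim().split(/\s+/);
    return fields[2] === "cgroup2";
  });
}

// Example with inline sample data; on a real Linux host the text would come
// from fs.readFileSync("/proc/mounts", "utf8").
const sample = [
  "proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0",
  "cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime 0 0",
].join("\n");
console.log(hasCgroup2Mount(sample));
```

Checking the type column instead of searching for the substring avoids a false positive when "cgroup2" only appears in a device name or mount path.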

… (!37)

> **🔧 security-fix-agent**

Closes NVIDIA#62

## Security Fix

### Summary

The CONNECT proxy accepted hostnames from clients and connected to whatever IP they resolved to, with no validation against internal address ranges. While the OPA policy is default-deny, a misconfigured or overly permissive policy could allow SSRF to cloud metadata (169.254.169.254), localhost, or RFC1918 services. This fix adds DNS resolution before connecting and rejects any host that resolves to an internal IP.

### Severity Assessment

- **Impact:** Medium. If exploited, could reach cloud metadata (IAM creds), cluster-internal services, or host-local services
- **Exploitability:** Very low. Requires OPA policy misconfiguration or DNS rebinding attack
- **Affected components:** `crates/navigator-sandbox/src/proxy.rs`, `handle_tcp_connection`

### Changes Made

- `crates/navigator-sandbox/src/proxy.rs`: Added `is_internal_ip()` helper that checks IPv4 loopback/private/link-local, IPv6 loopback/link-local, and IPv4-mapped IPv6. Added `resolve_and_reject_internal()` that resolves DNS and rejects internal IPs. Inserted check between OPA allow and `TcpStream::connect`, with control plane endpoints exempt.
- `architecture/security-policy.md`: Added SSRF Protection section with blocked ranges table and flow diagram
- `architecture/sandbox.md`: Updated proxy connection flow diagram and added SSRF protection subsection
- `architecture/README.md`: Added internal IP rejection step to proxy description

### Tests Added

- **Unit:** 17 tests in `proxy::tests` covering IPv4 loopback/private/link-local, IPv6 loopback/link-local, IPv4-mapped IPv6, public IPs, DNS resolution of localhost/127.0.0.1/169.254.169.254, and DNS failure handling
- **Integration/E2E:** N/A. The proxy runs inside a Linux network namespace; unit tests for IP checking and DNS resolution cover the security boundary

### Documentation Updated

- `architecture/security-policy.md`: New SSRF Protection section with blocked IP ranges and Mermaid flowchart
- `architecture/sandbox.md`: Updated proxy flow diagram and added SSRF protection subsection
- `architecture/README.md`: Added step 4 to proxy description

### Verification

All 85 sandbox tests pass including 17 new proxy SSRF tests. Pre-commit (fmt, clippy, full test suite) passes clean with zero warnings.

…VIDIA#62) TLS certificates are always resolved automatically from cluster metadata, making the explicit CLI flags unnecessary. The TlsOptions struct and auto-resolution logic remain intact for programmatic and test use.

ericksoa approved these changes Mar 17, 2026

ericksoa merged commit 65dad39 into NVIDIA:main Mar 17, 2026

brianwtaylor mentioned this pull request Mar 17, 2026

test: add GPU detection tests with dependency injection #142

Closed

This was referenced Mar 17, 2026

setup-spark.sh crashes trying to install vLLM (descoped feature) #164

Closed

setup-spark.sh chains into deprecated setup.sh instead of directing to nemoclaw onboard #165

Closed

This was referenced Mar 17, 2026

Gateway failure on cgroup v2: openat2 /sys/fs/cgroup/kubepods/pids.max: no such file or directory #136

Closed

install failed #280

Closed

ericksoa merged 1 commit into NVIDIA:main from brianwtaylor:fix/cgroup-v2-preflight-check
