fix: detect cgroup v2 misconfiguration during onboard preflight #62
Merged
ericksoa merged 1 commit into NVIDIA:main on Mar 17, 2026
Add a preflight check that catches the #1 onboarding blocker on Ubuntu 24.04, DGX Spark, and WSL2. When cgroup v2 is active but Docker's daemon.json lacks "default-cgroupns-mode": "host", onboarding now fails fast with a clear error and fix instructions instead of failing late at gateway startup with a cryptic kubelet error. Closes #16
ericksoa (Contributor) approved these changes on Mar 17, 2026 and left a comment:
LGTM. Clean implementation — detection logic is well-separated in preflight.js with dependency injection for testability, error message is actionable (points to nemoclaw setup-spark), and 13 unit tests cover all branches.
Verified: code reads one file (/etc/docker/daemon.json) and runs one safe command (stat -fc %T /sys/fs/cgroup). No shell injection surface, no network calls. macOS/cgroup v1 silently passes.
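To make the reviewed behaviour concrete, here is a minimal sketch of what a detection module with injected dependencies could look like. The function names `isCgroupV2()` and `checkCgroupConfig()` and the `{ok, reason}` result shape come from the PR summary; the `deps` parameter, the default helpers, and the exact reason strings are assumptions for illustration, not the merged code.

```javascript
const fs = require("node:fs");
const { execFileSync } = require("node:child_process");

// Hypothetical default dependencies; the real preflight.js may wire these differently.
const defaultDeps = {
  // `stat -fc %T /sys/fs/cgroup` prints "cgroup2fs" on a cgroup v2 host.
  cgroupFsType: () =>
    execFileSync("stat", ["-fc", "%T", "/sys/fs/cgroup"], { encoding: "utf8" }).trim(),
  readDaemonJson: () => fs.readFileSync("/etc/docker/daemon.json", "utf8"),
};

function isCgroupV2(deps = defaultDeps) {
  try {
    return deps.cgroupFsType() === "cgroup2fs";
  } catch {
    // The command fails where /sys/fs/cgroup is absent (e.g. macOS): treat as "not v2".
    return false;
  }
}

function checkCgroupConfig(deps = defaultDeps) {
  // cgroup v1 and non-Linux hosts pass silently.
  if (!isCgroupV2(deps)) return { ok: true };

  let raw;
  try {
    raw = deps.readDaemonJson();
  } catch {
    // Assumed wording; the real message is only partially visible in the PR.
    return { ok: false, reason: "/etc/docker/daemon.json does not exist or is unreadable" };
  }

  let config;
  try {
    config = JSON.parse(raw);
  } catch {
    return { ok: false, reason: "/etc/docker/daemon.json is not valid JSON" };
  }

  if (config && config["default-cgroupns-mode"] === "host") return { ok: true };
  return { ok: false, reason: 'daemon.json lacks "default-cgroupns-mode": "host"' };
}
```

Passing fakes for `cgroupFsType` and `readDaemonJson` lets unit tests cover every branch without touching the host, which is the testability benefit the review points to.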
Hardware verification by the author on two DGX Spark units covers the real detection path that can't be exercised in Docker CI.
This was referenced Mar 17, 2026
kagura-agent pushed a commit to kagura-agent/NemoClaw that referenced this pull request on Mar 18, 2026

…tibility

On cgroup v2 hosts (Ubuntu 24.04, DGX Spark), k3s-in-Docker requires cgroupns=host in the Docker daemon config. Without this, kubelet fails to initialize, the 'openshell' namespace is never created, and gateway startup times out with:

'timed out waiting for namespace openshell to exist'

This was previously caught by a preflight check (NVIDIA#62) that was removed in NVIDIA#248. This commit restores the check in the onboard preflight:

- Detects cgroup v2 by reading /proc/mounts for 'cgroup2'
- Checks /etc/docker/daemon.json for 'default-cgroupns-mode: host'
- On mismatch, exits with a clear error and remediation steps (run 'sudo nemoclaw setup-spark' or manually configure daemon.json)
- Shows a success message when cgroup v2 is properly configured

Also improves GPU display for DGX Spark: shows 'unified memory' instead of 'VRAM' when the Spark GPU is detected (since GB10 uses unified memory architecture where nvidia-smi cannot report dedicated VRAM).

Fixes NVIDIA#280
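The `/proc/mounts` detection mentioned in this commit message can be sketched as a small pure function. The name `hasCgroup2Mount` is hypothetical, and matching on the filesystem-type column (rather than a plain substring search) is an assumption about how one might implement it, not a description of the actual commit.

```javascript
// Each /proc/mounts line has the form:
//   <device> <mount point> <fs type> <options> <dump> <pass>
// Returns true when any entry has filesystem type "cgroup2".
function hasCgroup2Mount(mountsText) {
  return mountsText
    .split("\n")
    .some((line) => line.trim().split(/\s+/)[2] === "cgroup2");
}

// Possible usage on a Linux host:
//   const fs = require("node:fs");
//   hasCgroup2Mount(fs.readFileSync("/proc/mounts", "utf8"));
```

Keeping the parser separate from the file read means it can be tested with fixture strings on any platform.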
jessesanford pushed a commit to jessesanford/NemoClaw that referenced this pull request on Mar 24, 2026
…IA#62) Add a preflight check that catches the NVIDIA#1 onboarding blocker on Ubuntu 24.04, DGX Spark, and WSL2. When cgroup v2 is active but Docker's daemon.json lacks "default-cgroupns-mode": "host", onboarding now fails fast with a clear error and fix instructions instead of failing late at gateway startup with a cryptic kubelet error. Closes NVIDIA#16
mafueee pushed a commit to mafueee/NemoClaw that referenced this pull request on Mar 28, 2026

… (!37)

> **🔧 security-fix-agent**

Closes NVIDIA#62

## Security Fix

### Summary

The CONNECT proxy accepted hostnames from clients and connected to whatever IP they resolved to, with no validation against internal address ranges. While the OPA policy is default-deny, a misconfigured or overly permissive policy could allow SSRF to cloud metadata (169.254.169.254), localhost, or RFC1918 services. This fix adds DNS resolution before connecting and rejects any host that resolves to an internal IP.

### Severity Assessment

- **Impact:** Medium. If exploited, could reach cloud metadata (IAM creds), cluster-internal services, or host-local services
- **Exploitability:** Very low. Requires OPA policy misconfiguration or DNS rebinding attack
- **Affected components:** `crates/navigator-sandbox/src/proxy.rs` (`handle_tcp_connection`)

### Changes Made

- `crates/navigator-sandbox/src/proxy.rs`: Added `is_internal_ip()` helper that checks IPv4 loopback/private/link-local, IPv6 loopback/link-local, and IPv4-mapped IPv6. Added `resolve_and_reject_internal()` that resolves DNS and rejects internal IPs. Inserted check between OPA allow and `TcpStream::connect`, with control plane endpoints exempt.
- `architecture/security-policy.md`: Added SSRF Protection section with blocked ranges table and flow diagram
- `architecture/sandbox.md`: Updated proxy connection flow diagram and added SSRF protection subsection
- `architecture/README.md`: Added internal IP rejection step to proxy description

### Tests Added

- **Unit:** 17 tests in `proxy::tests` covering IPv4 loopback/private/link-local, IPv6 loopback/link-local, IPv4-mapped IPv6, public IPs, DNS resolution of localhost/127.0.0.1/169.254.169.254, and DNS failure handling
- **Integration/E2E:** N/A. The proxy runs inside a Linux network namespace; unit tests for IP checking and DNS resolution cover the security boundary

### Documentation Updated

- `architecture/security-policy.md`: New SSRF Protection section with blocked IP ranges and Mermaid flowchart
- `architecture/sandbox.md`: Updated proxy flow diagram and added SSRF protection subsection
- `architecture/README.md`: Added step 4 to proxy description

### Verification

All 85 sandbox tests pass including 17 new proxy SSRF tests. Pre-commit (fmt, clippy, full test suite) passes clean with zero warnings.
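The referenced fix itself is written in Rust; purely to illustrate the kind of range check the commit message describes, here is a rough JavaScript sketch. The names `isInternalIPv4` and `isInternalIp` are hypothetical, the ranges are the standard loopback, RFC 1918, and link-local blocks, and the IPv6 handling assumes compressed notation, so this is not a port of `is_internal_ip()`.

```javascript
const net = require("node:net");

// Checks a dotted-quad IPv4 string against loopback, private, and link-local ranges.
function isInternalIPv4(ip) {
  const [a, b] = ip.split(".").map(Number);
  return (
    a === 127 ||                          // loopback 127.0.0.0/8
    a === 10 ||                           // private 10.0.0.0/8
    (a === 172 && b >= 16 && b <= 31) ||  // private 172.16.0.0/12
    (a === 192 && b === 168) ||           // private 192.168.0.0/16
    (a === 169 && b === 254)              // link-local 169.254.0.0/16 (cloud metadata)
  );
}

function isInternalIp(ip) {
  if (net.isIPv4(ip)) return isInternalIPv4(ip);
  if (!net.isIPv6(ip)) return false;

  const lower = ip.toLowerCase();
  if (lower === "::1") return true;                      // IPv6 loopback (compressed form only)
  if (/^fe[89ab][0-9a-f]:/.test(lower)) return true;     // IPv6 link-local fe80::/10

  // IPv4-mapped IPv6 written as ::ffff:a.b.c.d
  const mapped = lower.match(/^::ffff:(\d+\.\d+\.\d+\.\d+)$/);
  return mapped ? isInternalIPv4(mapped[1]) : false;
}
```

A production check would normalise addresses first (for example, expanded forms of `::1`) and would run on the resolved address rather than the hostname, as the commit message describes.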
mafueee pushed a commit to mafueee/NemoClaw that referenced this pull request on Mar 28, 2026
…VIDIA#62) TLS certificates are always resolved automatically from cluster metadata, making the explicit CLI flags unnecessary. The TlsOptions struct and auto-resolution logic remain intact for programmatic and test use.
Summary
Adds a preflight check to `nemoclaw onboard` (step [1/7]) that detects the cgroup v2 / Docker `cgroupns` misconfiguration before gateway startup, instead of letting it fail late with a cryptic kubelet error.

- New `bin/lib/preflight.js` with `isCgroupV2()` and `checkCgroupConfig()` detection
- Check runs in `bin/lib/onboard.js` between Docker/OpenShell checks and GPU detection
- Error message points to `nemoclaw setup-spark` (which does a safe JSON merge of `daemon.json`, preserving existing settings)
- 13 unit tests in `test/preflight.test.js` covering all branches

Closes #16
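As an illustration of what a "safe JSON merge" of `daemon.json` could involve, the sketch below sets the one required key while leaving every other setting as it was. The function name `mergeDaemonConfig` is hypothetical, and this is an assumption about the approach, not the actual `nemoclaw setup-spark` code.

```javascript
// Takes the current daemon.json text (or an empty string if the file is absent)
// and returns new text with "default-cgroupns-mode" set to "host".
function mergeDaemonConfig(existingText) {
  const existing =
    existingText && existingText.trim() ? JSON.parse(existingText) : {};

  // Refuse to overwrite content that is not a plain JSON object.
  if (existing === null || typeof existing !== "object" || Array.isArray(existing)) {
    throw new Error("daemon.json must contain a JSON object");
  }

  // Spread first so existing keys are preserved; only the one key is overridden.
  const merged = { ...existing, "default-cgroupns-mode": "host" };
  return JSON.stringify(merged, null, 2) + "\n";
}
```

For example, `mergeDaemonConfig('{"log-driver":"json-file"}')` yields a document containing both `log-driver` and `default-cgroupns-mode`. Writing the result back would need root privileges.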
Hardware verification
Tested on two NVIDIA DGX Spark units (aarch64, Ubuntu 24.04.4 LTS, Docker 28.x, cgroup v2):
- `{ok: false, reason: "...does not exist..."}`
- `{"default-cgroupns-mode": "host"}`: `{ok: true}`

Both machines are in the exact broken state described in #16. The check catches it and exits with actionable guidance before the gateway ever starts.
Test plan
Automated Tests
Manual verification
- `daemon.json`: `nemoclaw onboard` should fail at step [1/7] with the cgroup error
- `nemoclaw setup-spark`: `nemoclaw onboard` should pass the cgroup check
- `{ok: true}`)