Fix/exclude venv from build context #1110
Conversation
- New `nemoclaw sandbox-init <name>` command that sets up workspace identity files (IDENTITY.md, SOUL.md, AGENTS.md, USER.md), applies network policies, configures GitHub credentials, and registers an agent entry in openclaw.json
- All steps are safe to re-run; skips duplicate entries
- Supports --agent-name, --agent-id, --parent-agent, --soul, --identity, --agents, --user, --policy, --no-github, --non-interactive flags
- 15 unit tests covering sandbox validation, agent registration, idempotency, subagent wiring, policy deduplication, and file resolution
- Dockerfile.sandbox-ai: CUDA 12.6 + PyTorch + Node 22 sandbox image
- scripts/post-onboard-gpu.sh: swap standard onboard sandbox for GPU image
- scripts/add-gpu-agent.sh: create additional GPU agents with time-slicing
- bin/lib/sandbox-resume.js: starts gateway, port forward, returns auth token
- bin/nemoclaw.js: wire resume into CLI dispatch
- test/sandbox-resume.test.js: 8 tests covering resume flow
- Dockerfiles: GPU-capable base and sandbox-ai images
- onboard.js: GPU onboarding flow
- sandbox-add-gpu-agent.js: add GPU agent to sandbox
- sandbox-init.js: sandbox initialization with GPU support
- sandbox-resume.js: resume flow updates
- nemoclaw.js: GPU agent CLI commands
- start-services.sh: docker-proxy service management
- mount-sandbox.sh, nemoclaw-start.sh, post-onboard-gpu.sh: GPU scripts
- extensions/docker-proxy: OpenClaw docker-proxy plugin
- scripts/docker-proxy.js: docker proxy server
- nemoclaw-blueprint: docker-proxy policy preset
- .agents/skills/nemoclaw-docker-proxy: agent skill
- add opts.model to addGpuAgent JSDoc typedef
- annotate ALLOWED_ROUTES as Array<[string, RegExp]>
- cast socket to net.Socket for setNoDelay call
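The Array<[string, RegExp]> shape for ALLOWED_ROUTES can be illustrated with a small sketch. The concrete routes and the version-prefix handling here are assumptions for illustration, not the actual docker-proxy.js list.

```javascript
/**
 * [HTTP method, path pattern] pairs; paths are matched after stripping
 * a /v1.47-style Docker API version prefix (hypothetical route list).
 * @type {Array<[string, RegExp]>}
 */
const ALLOWED_ROUTES = [
  ["GET", /^\/containers\/json$/],
  ["POST", /^\/containers\/create$/],
  ["POST", /^\/containers\/[^/]+\/(start|stop|wait)$/],
];

// Return true when the method/path combination is on the allowlist.
function isRouteAllowed(method, path) {
  const bare = path.replace(/^\/v[\d.]+/, "");
  return ALLOWED_ROUTES.some(([m, re]) => m === method && re.test(bare));
}

module.exports = { ALLOWED_ROUTES, isRouteAllowed };
```

The tuple type makes the intent explicit to the type checker: each entry is a method string paired with a compiled pattern, not a free-form array.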
- extensions/claw3d: OpenClaw plugin with tools for office list, office map, send message, and studio settings via the Claw3D REST API - nemoclaw-blueprint/policies/presets/claw3d.yaml: network policy preset allowing agent access to host.openshell.internal:3000 - start-services.sh, sandbox-resume.js: fix gateway relay container lookup to search openshell-cluster-* containers for the sandbox pod rather than assuming openshell-cluster-<sandboxName>
detectGpu() queries nvidia-smi for VRAM, which returns [N/A] on unified-memory architectures. The existing fallback matched GPU names containing "GB10" (DGX Spark) but missed Jetson AGX Thor and Orin, leaving those devices undetected. Broaden the name check to ["GB10", "Thor", "Orin"]. Align nim.js with the runner.runCapture() indirection already used in sandbox-add-gpu-agent.js to enable mocked test coverage of the fallback path. Five new tests exercise each device tag, a desktop GPU negative case, and the standard VRAM-queryable early return. Fixes NVIDIA#300
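The broadened fallback, with the runner indirection that makes it mockable, can be sketched as below. The function names and return shape are assumptions for illustration rather than the exact nim.js code.

```javascript
// Unified-memory devices whose nvidia-smi VRAM query returns [N/A].
const UNIFIED_MEMORY_TAGS = ["GB10", "Thor", "Orin"];

// The runner is injected so tests can mock nvidia-smi output, mirroring
// the runner.runCapture() indirection used in sandbox-add-gpu-agent.js.
function detectGpu(runner) {
  const vram = runner
    .runCapture("nvidia-smi --query-gpu=memory.total --format=csv,noheader")
    .trim();
  // Standard discrete GPUs report a VRAM figure and return early.
  if (vram !== "[N/A]") return { gpu: true, vramMiB: parseInt(vram, 10) };
  // Unified-memory architectures report [N/A]; fall back to a name check.
  const name = runner
    .runCapture("nvidia-smi --query-gpu=name --format=csv,noheader")
    .trim();
  return { gpu: UNIFIED_MEMORY_TAGS.some((tag) => name.includes(tag)), vramMiB: null };
}

module.exports = { detectGpu };
```

With a mocked runner, each device tag, the desktop-GPU early return, and the negative case can be covered without real hardware.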
nemoclaw-blueprint/ is copied into the Docker build context via cp -r during onboarding. If a developer has run uv sync locally, the resulting .venv directory (often hundreds of MB) is included in the staged context and baked into the sandbox image. Add rm -rf of nemoclaw-blueprint/.venv after staging in both onboard.js and setup.sh, matching the existing node_modules cleanup pattern. Add --exclude .venv to the rsync in nemoclaw.js. Also add .venv to .dockerignore and .gitignore as defense-in-depth. Fixes NVIDIA#774
Caution: review failed. The pull request was closed or merged during review.

📝 Walkthrough

This PR introduces GPU-accelerated sandbox support, Docker Engine API access for sandboxes, Claw3D 3D office integration, and sandbox lifecycle management (mount/backup/resume). It adds comprehensive tooling for multi-agent GPU deployment, post-reboot recovery, and local filesystem mounting via SSHFS, and fixes build-context hygiene by excluding `.venv` from the staged build context.

Changes
Sequence Diagrams

sequenceDiagram
participant User as User (CLI)
participant Nemoclaw as nemoclaw add-gpu-agent
participant Registry as Sandbox Registry
participant Gateway as OpenShell Gateway<br/>(k3s Cluster)
participant Containerd as k3s Containerd<br/>(Image Store)
participant Sandbox as New GPU<br/>Sandbox Pod
participant Parent as Parent Sandbox<br/>OpenClaw Config
User->>Nemoclaw: add-gpu-agent agent-name<br/>--parent parent-name
Nemoclaw->>Registry: Resolve parent sandbox
Registry-->>Nemoclaw: Parent metadata
Nemoclaw->>Gateway: Check allocatable<br/>nvidia.com/gpu count
alt GPU count ≤ 1
Nemoclaw->>Gateway: Apply nvidia-device-plugin<br/>ConfigMap with timeSlicing
Nemoclaw->>Gateway: Patch DaemonSet to<br/>use config file
Nemoclaw->>Gateway: Poll until GPU count > 1
end
Nemoclaw->>Containerd: Check if<br/>nemoclaw-sandbox-ai:v3<br/>exists
alt Image not present
Nemoclaw->>Containerd: Import GPU image<br/>from local Docker
end
Nemoclaw->>Gateway: Create sandbox from<br/>GPU image + --gpu
Gateway->>Sandbox: Provision pod<br/>with GPU
Nemoclaw->>Sandbox: Poll for Ready state<br/>(up to 60s)
Sandbox-->>Nemoclaw: Ready
Nemoclaw->>Sandbox: Start openclaw gateway<br/>via SSH proxy
Sandbox->>Sandbox: Gateway listening<br/>on port 18789
Nemoclaw->>Sandbox: Extract auth token
Sandbox-->>Nemoclaw: token from openclaw.json
Nemoclaw->>Parent: Update openclaw.json<br/>to register agent<br/>as subagent
Parent-->>Nemoclaw: Config patched
Nemoclaw->>Registry: Mark gpuEnabled: true,<br/>parentAgent reference
Registry-->>Nemoclaw: Updated
Nemoclaw-->>User: ✓ GPU agent created<br/>Dashboard URL + token
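The timeSlicing ConfigMap applied in the flow above follows the nvidia-device-plugin sharing config format. A minimal sketch of the manifest builder is below; the ConfigMap name, namespace, and replica count are assumptions, not values taken from the scripts.

```javascript
// Build an nvidia-device-plugin ConfigMap that exposes one physical GPU
// as multiple schedulable nvidia.com/gpu resources via time-slicing.
function timeSlicingConfigMap(replicas) {
  const pluginConfig = [
    "version: v1",
    "sharing:",
    "  timeSlicing:",
    "    resources:",
    "      - name: nvidia.com/gpu",
    `        replicas: ${replicas}`,
  ].join("\n");
  return {
    apiVersion: "v1",
    kind: "ConfigMap",
    metadata: { name: "nvidia-device-plugin-config", namespace: "kube-system" },
    data: { "config.yaml": pluginConfig },
  };
}

module.exports = { timeSlicingConfigMap };
```

After applying a manifest like this and patching the DaemonSet to read the config file, the allocatable nvidia.com/gpu count rises above 1, which is what the polling step in the diagram waits for.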
sequenceDiagram
participant User as User (CLI)
participant Resume as sandboxResume
participant Openshell as openshell CLI<br/>(Cluster API)
participant Sandbox as Sandbox Pod<br/>(OpenClaw)
participant GatewayRelay as gateway-relay.py<br/>(Host)
participant Dashboard as Dashboard<br/>Port 18789
User->>Resume: nemoclaw sandbox resume
Resume->>Openshell: sandbox list<br/>find target sandbox
Openshell-->>Resume: Sandbox metadata
Resume->>Sandbox: Check if gateway<br/>listening on 18789<br/>(kubectl exec ss -tlnp)
alt Gateway not listening
Resume->>Sandbox: Start openclaw gateway<br/>nohup with HOME=/sandbox
Sandbox->>Sandbox: Gateway starts<br/>and listens
Resume->>Resume: Poll until listening<br/>(max 30s)
end
Resume->>Openshell: Get sandbox container<br/>and IP
Openshell-->>Resume: Container info
Resume->>Openshell: Kill stale<br/>gateway-relay.py
Resume->>Openshell: Start kubectl port-forward<br/>inside cluster container<br/>18789 → all interfaces
Resume->>GatewayRelay: Start gateway-relay.py<br/>with cluster container IP
GatewayRelay->>Dashboard: Forward port 18789
Resume->>Sandbox: Extract auth token<br/>from openclaw.json<br/>(kubectl exec + python)
Sandbox-->>Resume: token value
Resume-->>User: ✓ Gateway started<br/>token, port forwarded
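The "poll until listening (max 30s)" step in the resume flow above can be sketched as a generic bounded poll. The timeout, interval, and probe shape are assumptions; the real code probes the sandbox via kubectl exec.

```javascript
// Poll an async probe until it reports success or the deadline passes,
// as in the resume flow's wait for the gateway to listen on port 18789.
async function pollUntil(probe, { timeoutMs = 30_000, intervalMs = 1_000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await probe()) return true;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false; // caller decides whether a timeout is fatal
}

module.exports = { pollUntil };
```

Returning false instead of throwing lets the caller emit a targeted error ("gateway never started") rather than an opaque timeout stack trace.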
sequenceDiagram
participant Agent as Agent inside<br/>Sandbox
participant Proxy as docker-proxy.js<br/>(Host HTTP Server)<br/>Port 2376
participant Allowlist as Request Validation<br/>(Allowlist + Body Check)
participant Docker as Docker Engine<br/>Daemon (Socket/TCP)
Agent->>Proxy: POST /v1.47/containers/create<br/>DOCKER_HOST=tcp://host.openshell.internal:2376
Proxy->>Allowlist: Check method/path<br/>against allowlist
Allowlist-->>Proxy: ✓ POST /containers/**<br/>allowed
Proxy->>Allowlist: Buffer and parse<br/>JSON body
Allowlist->>Allowlist: Validate: no Privileged,<br/>no HostNetwork,<br/>no dangerous CapAdd,<br/>no blocked mounts
alt Validation fails
Allowlist-->>Proxy: ✗ Forbidden
Proxy-->>Agent: HTTP 403<br/>{ error: "..." }
else Validation succeeds
Allowlist-->>Proxy: ✓ Body valid
Proxy->>Docker: Forward validated<br/>request + body<br/>to upstream Docker
Docker-->>Proxy: Container created<br/>{ Id, Warnings }
Proxy-->>Agent: HTTP 201<br/>Response forwarded
end
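The body-validation step in the diagram above can be sketched as a pure function over the parsed create payload. The specific deny rules (capability list, blocked mount prefixes) are assumptions chosen for illustration, not the exact docker-proxy.js rule set.

```javascript
// Capabilities and host paths assumed dangerous for this sketch.
const DANGEROUS_CAPS = ["SYS_ADMIN", "NET_ADMIN", "SYS_PTRACE"];
const BLOCKED_MOUNT_PREFIXES = ["/var/run/docker.sock", "/etc", "/root"];

// Reject container-create bodies that would grant host-level access,
// per the proxy's allowlist-plus-body-check model.
function validateCreateBody(body) {
  const hc = body.HostConfig || {};
  if (hc.Privileged) return { ok: false, error: "Privileged containers are not allowed" };
  if (hc.NetworkMode === "host") return { ok: false, error: "Host networking is not allowed" };
  for (const cap of hc.CapAdd || []) {
    if (DANGEROUS_CAPS.includes(cap.replace(/^CAP_/, ""))) {
      return { ok: false, error: `CapAdd ${cap} is not allowed` };
    }
  }
  for (const bind of hc.Binds || []) {
    const src = bind.split(":")[0];
    if (BLOCKED_MOUNT_PREFIXES.some((p) => src === p || src.startsWith(p + "/"))) {
      return { ok: false, error: `Mounting ${src} is not allowed` };
    }
  }
  return { ok: true };
}

module.exports = { validateCreateBody };
```

Validating the buffered JSON body, rather than only the route, is what prevents an agent from using an allowed endpoint (POST /containers/create) to escape the sandbox.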
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~70 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Closing — discovered during rebase that #774 is already resolved on main via copyBuildContextDir() in onboard.js (which filters .venv) and clean-staged-tree.sh in setup.sh. No additional changes needed.
Summary
Prevent local Python virtual environments from being copied into the sandbox image build context. This aligns
`.venv` handling with the existing `node_modules` cleanup pattern across onboarding and setup flows, and adds exclusion rules so developer-local artifacts do not leak into staged builds.

Related Issue
Fixes #774
Changes
- Remove `nemoclaw-blueprint/.venv` after blueprint copy in interactive onboarding
- Remove `nemoclaw-blueprint/.venv` after blueprint copy in scripted setup
- Exclude `.venv` from the VM deploy `rsync` path
- Add `.venv` ignore coverage in `.dockerignore` and `.gitignore`

Type of Change
Testing
- `npx prek run --all-files` passes (or equivalently `make check`).
- `npm test` passes.
- `make docs` builds without warnings. (for doc-only changes)

Additional validation:
- `npx vitest run test/build-context-clean.test.js` passes
- `npx vitest run --project cli` shows no new failures relative to a clean `HEAD` (385 passed, 5 failed, matching the same 5 pre-existing failures on unmodified `main`)

Checklist
General
Code Changes
- `npx prek run --all-files` auto-fixes formatting (or `make format` for targeted runs).

Doc Changes
- Use the `update-docs` agent skill to draft changes while complying with the style guide. For example, prompt your agent with "/update-docs catch up the docs for the new changes I made in this PR."

Summary by CodeRabbit
Release Notes
New Features
Documentation
Chores