fix: sandbox DNS resolution via CoreDNS proxy in sandbox pod #732
jestyr27 wants to merge 1 commit into NVIDIA:main
Conversation
The sandbox runs in an isolated network namespace (10.200.0.0/24) where dns.lookup() fails because the CoreDNS service IP (10.43.0.10) is unreachable. This blocks web_search, web_fetch, and any tool that does pre-flight DNS resolution before using the HTTP proxy.

Fix: run a Python DNS forwarder in the sandbox pod's root namespace that intercepts queries to 10.43.0.10 and forwards them to the CoreDNS pod. Key details:

- setup-dns-proxy.sh: adds 10.43.0.10 as a local address in the pod, starts a Python UDP forwarder bound to 10.43.0.10:53, forwards to the CoreDNS pod IP (discovered via kubectl endpoints). Launched via docker exec -d + nsenter for persistence.
- fix-coredns.sh: extended to all platforms (was Colima-only). The comment already said "broken on all platforms" but the code only ran for Colima.
- runtime.sh: added 8.8.8.8 fallback to resolve_coredns_upstream() for hosts where all nameservers are loopback (systemd-resolved).
- platform.js: shouldPatchCoredns() returns true for all known runtimes.
- onboard.js / setup.sh: call setup-dns-proxy.sh after sandbox creation.

The forwarder MUST bind to 10.43.0.10 (not 0.0.0.0) because glibc uses connected UDP sockets that discard responses from unexpected source IPs. It MUST forward to the CoreDNS pod IP (not 8.8.8.8) because the openshell-sandbox binary also uses DNS to reach the gateway for inference routing, and public DNS cannot resolve *.svc.cluster.local.

Fixes NVIDIA#626
Related: OpenShell NVIDIA#437, NVIDIA#504
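For readers skimming the diff, here is a minimal sketch of the forwarder described above. The upstream pod IP is an illustrative placeholder, not the exact contents of setup-dns-proxy.sh; the real script discovers it via kubectl endpoints at setup time.

```python
#!/usr/bin/env python3
# Minimal sketch of the forwarder. Assumes 10.43.0.10 has already been
# added to the pod (ip addr add 10.43.0.10/32 dev lo).
import socket

LISTEN = ("10.43.0.10", 53)   # bind to the CoreDNS service IP, not 0.0.0.0:
                              # glibc connects its UDP socket and the kernel
                              # drops replies from any other source address
UPSTREAM = ("10.42.0.6", 53)  # CoreDNS *pod* IP, so *.svc.cluster.local resolves

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(LISTEN)

while True:
    data, addr = sock.recvfrom(4096)
    up = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    up.settimeout(5)
    try:
        up.sendto(data, UPSTREAM)
        reply, _ = up.recvfrom(4096)
        sock.sendto(reply, addr)  # reply leaves from 10.43.0.10, as glibc expects
    except OSError:
        pass  # see the review note below about logging these failures
    finally:
        up.close()
```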
📝 Walkthrough
The changes extend DNS patching from Colima-specific containers to all recognized container runtimes, introduce a new DNS proxy setup script for sandboxes, and update onboarding and testing accordingly. The DNS proxy forwards DNS queries from sandbox pods to CoreDNS or public DNS.
Sequence Diagram

```mermaid
sequenceDiagram
actor User
participant Onboarding as Onboarding Script
participant Docker as Docker/Container
participant Sandbox as Sandbox Pod
participant Proxy as DNS Proxy (Python)
participant CoreDNS as CoreDNS Service
User->>Onboarding: nemoclaw onboard
Onboarding->>Docker: Create sandbox pod
Docker->>Sandbox: Pod Ready
Onboarding->>Onboarding: Detect container runtime
Onboarding->>Onboarding: Patch CoreDNS if runtime ≠ unknown
Onboarding->>Docker: setup-dns-proxy.sh <sandbox-name>
Docker->>Sandbox: nsenter + docker exec
Sandbox->>Sandbox: Configure lo interface (10.43.0.10/32)
Sandbox->>Sandbox: Deploy /tmp/dns-proxy.py
Sandbox->>Proxy: Start Python UDP forwarder
Proxy->>Sandbox: Bind 10.43.0.10:53
Proxy->>CoreDNS: Forward DNS queries
CoreDNS->>Proxy: DNS responses
Proxy->>Sandbox: Return to applications
Sandbox->>Onboarding: Proxy ready
    Onboarding->>User: Onboarding complete
```
Estimated Code Review Effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 4 passed, ❌ 1 failed (1 warning)
Actionable comments posted: 6
🧹 Nitpick comments (3)
scripts/setup-dns-proxy.sh (2)
149-158: Consider logging DNS forwarding errors for debugging.

The `except Exception: pass` silently discards all errors, making DNS issues difficult to diagnose. At minimum, consider logging failures to the existing log file:

Proposed improvement

```diff
 def forward(data, addr):
     try:
         f = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
         f.settimeout(5)
         f.sendto(data, UPSTREAM)
         r, _ = f.recvfrom(4096)
         sock.sendto(r, addr)
         f.close()
-    except Exception:
-        pass
+    except Exception as e:
+        with open('/tmp/dns-proxy.log', 'a') as log:
+            log.write('dns-proxy error: {}\\n'.format(e))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/setup-dns-proxy.sh` around lines 149 - 158, The forward function swallows all exceptions (except Exception: pass), making DNS forwarding failures invisible; update forward to catch exceptions and write a descriptive error to the existing log (include exception details and context like addr and UPSTREAM) instead of silently passing; reference the forward function, UPSTREAM constant, and sock variable when adding the logging call so you log "DNS forward failed for addr=<addr> to UPSTREAM=<UPSTREAM>: <exception>" (or similar) and ensure the socket f is closed in a finally block or use a context manager to avoid resource leaks.
186-193: Consider verifying the proxy process is actually running.

The current verification only checks if the log file was written. A more robust check would verify the process is still running:

Proposed enhancement

```diff
 LOG="$(kctl exec -n openshell "$POD" -- cat /tmp/dns-proxy.log 2>/dev/null || true)"
-if echo "$LOG" | grep -q "dns-proxy:"; then
+NEW_PID="$(kctl exec -n openshell "$POD" -- cat /tmp/dns-proxy.pid 2>/dev/null || true)"
+RUNNING="$(kctl exec -n openshell "$POD" -- kill -0 "$NEW_PID" 2>/dev/null && echo yes || true)"
+if [ "$RUNNING" = "yes" ] && echo "$LOG" | grep -q "dns-proxy:"; then
   echo "DNS proxy started: $LOG"
 else
   echo "WARNING: DNS proxy may not have started. Log: $LOG"
 fi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/setup-dns-proxy.sh` around lines 186 - 193, The verification only checks for log output (LOG) but not that the dns-proxy process is running; update the check after reading /tmp/dns-proxy.log to also run a process check inside the pod (use kctl exec -n openshell "$POD" -- pgrep -f 'dns-proxy' or kctl exec ... -- ps aux | grep '[d]ns-proxy') and consider checking /tmp/dns-proxy.pid if the service writes a PID file; if the process check fails, emit the WARNING with the log and a message that the process is not running, otherwise print DNS proxy started with PID/info. Ensure you reference the existing LOG, POD, and /tmp/dns-proxy.log occurrences so the new check is inserted immediately after the current grep-based verification.

test/dns-proxy.test.js (1)
29-39: Test description could be more precise.

The test name says "no sandbox name provided" but it actually provides `nemoclaw` as the first argument. Per the script's usage (`[gateway-name] <sandbox-name>`), this is interpreted as `GATEWAY_NAME="nemoclaw"` with `SANDBOX_NAME=""`, which correctly triggers the usage error. Consider clarifying the test name:

```diff
-  it("exits with usage when no sandbox name provided", () => {
+  it("exits with usage when only gateway name provided (no sandbox name)", () => {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/dns-proxy.test.js` around lines 29 - 39, The test name is misleading: it supplies a first argument ("nemoclaw") which becomes the gateway name while the sandbox name is omitted, so update the spec in test/dns-proxy.test.js (the it(...) description) to clearly state that the sandbox name is omitted while a gateway name is provided (for example: "exits with usage when sandbox name omitted (gateway provided)"), and keep the existing invocation that calls bash with SETUP_DNS_PROXY and "nemoclaw" so the behavior and assertions (expecting non-zero status and Usage: message) remain unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f3de8a15-066b-4be1-a472-15719c66b53a
📒 Files selected for processing (10)
- .pre-commit-config.yaml
- bin/lib/onboard.js
- bin/lib/platform.js
- scripts/fix-coredns.sh
- scripts/lib/runtime.sh
- scripts/setup-dns-proxy.sh
- scripts/setup.sh
- test/dns-proxy.test.js
- test/platform.test.js
- test/runtime-shell.test.js
```js
// DNS proxy — run a forwarder in the sandbox pod so the isolated
// sandbox namespace can resolve DNS. Must run after sandbox is Ready.
console.log("  Setting up sandbox DNS proxy...");
run(`bash "${path.join(SCRIPTS, "setup-dns-proxy.sh")}" nemoclaw "${sandboxName}" 2>&1 || true`, { ignoreError: true });
```
Don't complete onboarding before sandbox DNS is verified.
This path keeps the new sandbox even when the DNS proxy install fails, so users can end up with a registered sandbox that still has broken hostname resolution. Please make this step fail the flow, or gate completion on a post-check from inside the sandbox.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bin/lib/onboard.js` around lines 617 - 620, The DNS proxy installation is
currently allowed to fail silently because run(...) is invoked with {
ignoreError: true }, so remove the ignoreError bypass (or detect the run exit
status) and make the onboarding flow exit/throw on failure of run("...
setup-dns-proxy.sh ...", ...); additionally, after the script completes, perform
a verification step (e.g., kubectl exec into the sandbox pod identified by
sandboxName and run a DNS lookup like dig/host against the expected host) and
only mark onboarding complete if that check succeeds—update the logic around
run, SCRIPTS, and sandboxName to fail the onboarding or rollback the sandbox
when either the install or the post-install DNS verification fails.
```diff
 if [ -z "${DOCKER_HOST:-}" ]; then
-  if [ -n "$COLIMA_SOCKET" ]; then
-    export DOCKER_HOST="unix://$COLIMA_SOCKET"
-  else
-    echo "Skipping CoreDNS patch: Colima socket not found."
-    exit 0
+  if docker_host="$(detect_docker_host)"; then
+    export DOCKER_HOST="$docker_host"
   fi
+  # If still unset, Docker CLI will use the default socket
 fi
```
Guard this path to local Docker daemons.
This generalized flow still derives the fallback resolver from the local machine, but now runs against any detected Docker engine. When DOCKER_HOST points at a remote daemon, the remote CoreDNS config gets patched with the workstation's DNS, which may be unreachable from that cluster. Please either reject non-local Docker hosts here or derive the upstream entirely from the daemon/gateway side.
Also applies to: 49-55
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/fix-coredns.sh` around lines 27 - 32, The current path blindly
accepts any value returned by detect_docker_host into DOCKER_HOST, which can
point at a remote daemon and thus pollute the remote CoreDNS config; change the
logic around the DOCKER_HOST assignment (the docker_host variable and
detect_docker_host call) to validate that the returned host is local before
exporting it — only accept unix socket paths or TCP hosts that resolve to
localhost/127.0.0.1 (or ::1); if docker_host is remote, do not set DOCKER_HOST
and instead fall back to deriving upstream from the daemon/gateway side or bail
with a clear message; apply the same guard/validation to the analogous block
referenced at the later section (lines 49-55).
```sh
# Last resort: public DNS. Needed on hosts where all nameservers are
# loopback (e.g. systemd-resolved uses 127.0.0.53).
printf '8.8.8.8\n'
```
Don't fall back to public DNS before exhausting the host resolver.
On the common 127.0.0.53 / ::1 case this branch now pins CoreDNS to 8.8.8.8 instead of the machine's real upstreams, which breaks split-DNS/VPN/internal zones and leaks queries to a public resolver. Please resolve systemd-resolved's actual upstreams first and treat IPv6 loopback as loopback too.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/lib/runtime.sh` around lines 166 - 168, The current last-resort
branch in runtime.sh that prints '8.8.8.8' must be changed to first attempt to
resolve systemd-resolved's real upstreams and to treat IPv6 loopback as loopback
too; update the logic around the printf '8.8.8.8' to (1) try to read upstream
servers from systemd-resolved (e.g. via resolvectl status or
/run/systemd/resolve/resolv.conf) and extract non-loopback IPs, (2) filter out
loopback addresses including 127.0.0.0/8 and ::1 (and equivalent IPv4-mapped
loopback), and (3) only fall back to a public resolver like 8.8.8.8 if no
non-loopback upstreams are found; keep the change localized to the branch that
currently emits '8.8.8.8' so CoreDNS uses the machine's upstreams when
available.
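The filtering the reviewer asks for is straightforward to express. Here is a sketch of the selection logic in Python; the shipped script is shell, so this only illustrates the algorithm. It assumes /run/systemd/resolve/resolv.conf, which is where systemd-resolved publishes its real upstreams (unlike the 127.0.0.53 stub file).

```python
# Sketch: prefer systemd-resolved's real upstreams, treating 127.0.0.0/8,
# ::1, and IPv4-mapped loopback as loopback; fall back to 8.8.8.8 only
# when no non-loopback upstream exists.
import ipaddress

def is_loopback(ns: str) -> bool:
    addr = ipaddress.ip_address(ns)
    mapped = getattr(addr, "ipv4_mapped", None)  # handles ::ffff:127.0.0.1
    return addr.is_loopback or (mapped is not None and mapped.is_loopback)

def pick_upstream(path: str = "/run/systemd/resolve/resolv.conf") -> str:
    nameservers = []
    try:
        with open(path) as f:
            for line in f:
                parts = line.split()
                if len(parts) >= 2 and parts[0] == "nameserver":
                    nameservers.append(parts[1])
    except OSError:
        pass  # file absent: not a systemd-resolved host
    for ns in nameservers:
        try:
            if not is_loopback(ns):
                return ns
        except ValueError:
            continue  # skip malformed entries
    return "8.8.8.8"  # genuine last resort
```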
```sh
POD="$(kctl get pods -n openshell -o name 2>/dev/null \
  | grep -- "$SANDBOX_NAME" | head -1 | sed 's|pod/||' || true)"
```
Use fixed-string matching for sandbox name.
The grep without -F interprets $SANDBOX_NAME as a regex pattern. If sandbox names contain characters like ., *, or [, this could match unintended pods.
Proposed fix

```diff
-POD="$(kctl get pods -n openshell -o name 2>/dev/null \
-  | grep -- "$SANDBOX_NAME" | head -1 | sed 's|pod/||' || true)"
+POD="$(kctl get pods -n openshell -o name 2>/dev/null \
+  | grep -F -- "$SANDBOX_NAME" | head -1 | sed 's|pod/||' || true)"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```sh
POD="$(kctl get pods -n openshell -o name 2>/dev/null \
  | grep -F -- "$SANDBOX_NAME" | head -1 | sed 's|pod/||' || true)"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/setup-dns-proxy.sh` around lines 90 - 92, The pipeline that sets POD
uses grep -- "$SANDBOX_NAME" which treats the sandbox name as a regex; change
grep to fixed-string mode (e.g., use grep -F -- "$SANDBOX_NAME") so special
characters in $SANDBOX_NAME are matched literally. Update the line that assigns
POD (the command pipeline with kctl get pods | grep -- "$SANDBOX_NAME" | head -1
| sed 's|pod/||') to use grep -F -- "$SANDBOX_NAME" to ensure correct, literal
matching.
```sh
if [ -z "$POD_PID" ]; then
  echo "WARNING: Could not find pod PID via hostname. Trying kubectl..."
  # Fallback: use kubectl exec to find a PID we can nsenter into
  POD_PID="$(kctl exec -n openshell "$POD" -- sh -c 'echo $$' 2>/dev/null || true)"
fi
```
Fallback PID may not be valid for nsenter.
The fallback uses kubectl exec ... echo $$ which returns the shell's PID as seen from within the pod's PID namespace. However, nsenter -t "$POD_PID" (line 181) requires the PID as seen from the gateway container's namespace. These are different values.
If the primary hostname-matching method fails and this fallback is used, the subsequent nsenter at line 181 will likely fail or enter the wrong namespace.
Consider either:
- Removing the fallback and failing explicitly
- Finding the PID through an alternative host-visible method
Option: Remove fallback and fail explicitly

```diff
 if [ -z "$POD_PID" ]; then
-  echo "WARNING: Could not find pod PID via hostname. Trying kubectl..."
-  # Fallback: use kubectl exec to find a PID we can nsenter into
-  POD_PID="$(kctl exec -n openshell "$POD" -- sh -c 'echo $$' 2>/dev/null || true)"
-fi
-
-if [ -z "$POD_PID" ]; then
   echo "ERROR: Could not determine pod PID for nsenter."
+  echo "  Hostname matching in /proc failed for pod '$POD'."
   exit 1
 fi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/setup-dns-proxy.sh` around lines 113 - 117, The fallback that sets
POD_PID via "kctl exec ... echo $$" is invalid for nsenter because that PID is
in the pod's PID namespace, not the gateway/container host namespace; remove the
fallback block that assigns POD_PID with kctl exec and instead fail explicitly
when POD_PID is empty (print a clear error and exit non-zero), so the later
nsenter invocation (which requires a host-visible PID) is not attempted with a
wrong value; update the script's POD_PID handling to use the original hostname
method only and error out if POD_PID remains unset (references: POD_PID, kctl
exec, nsenter).
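If the "alternative host-visible method" is preferred over hard-failing, PID namespaces can be translated from the host side via the NSpid field of /proc/<pid>/status. A rough sketch, not part of this PR, assuming the code runs somewhere the pod's processes are visible in /proc:

```python
# Sketch: translate a pod-namespace PID (e.g. the $$ from kubectl exec)
# into a host-visible PID by scanning NSpid, which lists a process's PID
# at every namespace level (outermost first, innermost last).
import os

def host_pid_for(ns_pid: int):
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/status") as f:
                for line in f:
                    if line.startswith("NSpid:"):
                        ids = line.split()[1:]
                        if len(ids) > 1 and ids[-1] == str(ns_pid):
                            return int(entry)  # usable with nsenter -t
                        break
        except OSError:
            continue  # process exited mid-scan
    return None
```

Note the innermost PID is only unique within one namespace, so a real implementation would also match on the pod's cgroup or hostname.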
```sh
# 5b. DNS proxy for sandbox — run after sandbox is Ready.
info "Setting up sandbox DNS proxy..."
bash "$SCRIPT_DIR/setup-dns-proxy.sh" nemoclaw "$SANDBOX_NAME" 2>&1 || warn "DNS proxy setup failed (may not be needed)"
```
Make DNS proxy setup part of the success criteria.
setup-dns-proxy.sh is the fix that restores sandbox-side dns.lookup() / getaddrinfo(). Swallowing failures here lets setup.sh finish with “Setup complete!” even though the sandbox still returns EAI_AGAIN. Fail the setup, or at least run an in-sandbox resolution check and only continue when it passes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/setup.sh` around lines 233 - 235, The DNS proxy step currently
swallows errors (bash "$SCRIPT_DIR/setup-dns-proxy.sh" ... || warn ...) so
setup.sh can report success despite unresolved EAI_AGAIN; change this to treat
failures as fatal or block until DNS works: invoke setup-dns-proxy.sh without
the short-circuit warn and if it exits non-zero call error/exit (or retry), then
perform an in-sandbox resolution check (e.g., run a simple node or getent/dig
inside the sandbox using SANDBOX_NAME) and loop/retry with backoff until the
lookup succeeds before printing "Setup complete!" — update the calls around info
"Setting up sandbox DNS proxy...", the bash invocation of setup-dns-proxy.sh,
and the subsequent success path to enforce failure or readiness check.
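A readiness gate along these lines would be enough. This is a hypothetical sketch: the real scripts are shell and wrap kubectl as kctl, and the probe hostname is arbitrary.

```python
# Sketch: retry an in-sandbox lookup with exponential backoff and only
# let setup report success once resolution works.
import subprocess, sys, time

def wait_for_sandbox_dns(pod: str, attempts: int = 6) -> bool:
    delay = 1
    for _ in range(attempts):
        probe = subprocess.run(
            ["kubectl", "exec", "-n", "openshell", pod, "--",
             "getent", "hosts", "github.com"],
            capture_output=True,
        )
        if probe.returncode == 0:
            return True
        time.sleep(delay)
        delay = min(delay * 2, 30)  # capped exponential backoff
    return False

if __name__ == "__main__":
    if not wait_for_sandbox_dns(sys.argv[1]):
        sys.exit("sandbox DNS never became ready; failing setup")
```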
Thanks for submitting this PR. It looks like it addresses an issue with DNS resolution in the sandbox environment, and the changes are well-documented and easy to follow.

Tested this on DGX Spark (ARM64) and Brev VM (x86_64) — DNS is broken from the sandbox shell on both platforms, so the fix is needed everywhere.

The approach works. After running setup-dns-proxy.sh, 10.43.0.10 gets added to the pod's lo, the Python forwarder is written correctly, and the CoreDNS pod IP is discovered (10.42.0.6 on Spark). Once the forwarder is running, getent hosts github.com and node dns.lookup() both resolve from inside the sandbox.

One issue: the docker exec -d + nsenter launch path (around line 181) didn't actually start the forwarder on either platform — the script completes without error but the proxy process never appears. CodeRabbit's note about the fallback PID namespace mismatch looks correct. When I launched it manually via kubectl exec from the pod, it worked immediately. One approach that worked for me: replacing the nsenter launch with kubectl exec ... -- sh -c 'nohup python3 -u /tmp/dns-proxy.py > /tmp/dns-proxy.log 2>&1 &'. Happy to retest once the launch path is updated.
… (#1062)

* fix(sandbox): add DNS forwarder so web_fetch resolves hostnames (fixes #626)

The sandbox runs in an isolated network namespace (10.200.0.0/24) where OpenShell's iptables rules reject all non-proxy traffic including UDP. This causes getaddrinfo EAI_AGAIN for every outbound DNS lookup, breaking web_fetch, web_search, and all Node.js HTTP tools.

Fix (three steps in new scripts/setup-dns-proxy.sh):
1. Run a Python UDP DNS forwarder on the pod-side veth gateway (10.200.0.1:53), forwarding to the real CoreDNS pod IP
2. Add an iptables rule in the sandbox namespace allowing UDP to the gateway on port 53 (the only non-proxy firewall exception)
3. Update the sandbox's /etc/resolv.conf to point to 10.200.0.1

Also broadens the CoreDNS patch (fix-coredns.sh / shouldPatchCoredns) to run on all Docker-based runtimes, not just Colima — k3s-inside-Docker has broken DNS forwarding on Linux hosts with systemd-resolved too.

Inspired by PR #732's DNS forwarder approach. Key differences:
- Uses kubectl exec instead of nsenter (fixes PR #732's launch bug)
- Handles the sandbox iptables constraint (all UDP blocked)
- Resolves systemd-resolved upstreams via resolvectl before falling back to 8.8.8.8
- Uses grep -F for fixed-string sandbox name matching

Tested on DGX Spark (ARM64): fresh destroy + onboard, getent hosts and node dns.lookup both resolve from inside the sandbox.

* fix(sandbox): add runtime DNS verification after setup

setup-dns-proxy.sh now verifies all three DNS bridge layers from inside the sandbox namespace post-deployment: resolv.conf target, iptables UDP rule, and actual getent hosts resolution. Reports [PASS]/[FAIL] per check so failures are immediately visible in onboard logs.
Thanks for the CoreDNS proxy approach to sandbox DNS. The codebase has changed significantly since March 23 — including a full TypeScript migration — so this will need a rebase on main.
Summary
Fixes DNS resolution inside the sandbox by running a lightweight Python DNS forwarder in the sandbox pod's namespace. This makes dns.lookup() work for tools like web_search and web_fetch that do pre-flight DNS checks before using the HTTP proxy.

Related Issue
Fixes #626
Related: OpenShell #437, #504
Changes
- scripts/setup-dns-proxy.sh — Runs after sandbox creation. Adds the CoreDNS service IP (10.43.0.10) as a local address in the pod, then starts a Python DNS forwarder bound to it. Forwards to the actual CoreDNS pod IP (discovered dynamically) so k8s-internal names (*.svc.cluster.local) still resolve for the openshell-sandbox proxy.
- scripts/fix-coredns.sh — Extended to all platforms (was Colima-only). The comment already said "broken on all platforms" but the code only gated for Colima.
- scripts/lib/runtime.sh — Added 8.8.8.8 fallback to resolve_coredns_upstream() for hosts where all nameservers are loopback (e.g. systemd-resolved).
- bin/lib/platform.js — shouldPatchCoredns() returns true for all known container runtimes (was colima only).
- bin/lib/onboard.js / scripts/setup.sh — Call setup-dns-proxy.sh after sandbox reaches Ready state.
- .pre-commit-config.yaml — Added --with pytest to Pyright pre-push hook (fixes type check failures on test_endpoint_validation.py).

Type of Change
Testing
- npx prek run --all-files passes.
- npm test passes (311/311).
- nemoclaw onboard --non-interactive creates sandbox with working DNS.
- socket.getaddrinfo() resolves external names from inside the sandbox.
- web_search returns live Google Search results via Gemini grounding. Inference routing to inference.local works (the openshell-sandbox proxy can resolve the gateway via k8s DNS).

Design Notes
Two critical requirements discovered during testing:
1. Bind to 10.43.0.10 (not 0.0.0.0): glibc's resolver uses connected UDP sockets that discard responses from unexpected source IPs. Binding to 0.0.0.0 causes responses to come from the veth address (10.200.0.1), which glibc ignores. (See the demonstration sketch after this list.)
2. Forward to the CoreDNS pod IP (not 8.8.8.8): adding 10.43.0.10 as a local address intercepts ALL DNS in the pod namespace — including from the openshell-sandbox binary, which needs to resolve openshell-0.openshell.svc.cluster.local to reach the gateway. Public DNS can't resolve *.svc.cluster.local.
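A tiny way to see requirement 1 in isolation (illustrative only; the payload bytes are fake, so against a real resolver this simply times out):

```python
# Demonstration of requirement 1: a connect()ed UDP socket is filtered by
# peer address in the kernel. glibc's resolver does this, which is why a
# reply sourced from 10.200.0.1 is invisible if the query went to 10.43.0.10.
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.settimeout(2)
try:
    s.connect(("10.43.0.10", 53))       # kernel now only delivers from this peer
    s.send(b"\x12\x34" + b"\x00" * 10)  # placeholder bytes, not a real DNS query
    s.recv(4096)
except socket.timeout:
    print("no reply: datagrams from any other source IP are dropped")
except OSError as e:
    print(f"no route to the demo address from this host: {e}")
finally:
    s.close()
```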
Checklist

General
Code Changes
npx prek run --all-files auto-fixes formatting.

Summary by CodeRabbit
New Features
Improvements
Tests