
fix: sandbox DNS resolution via CoreDNS proxy in sandbox pod#732

Open
jestyr27 wants to merge 1 commit into NVIDIA:main from uhstray-io:fix/sandbox-dns-resolution

Conversation


@jestyr27 jestyr27 commented Mar 23, 2026

Summary

Fixes DNS resolution inside the sandbox by running a lightweight Python DNS forwarder in the sandbox pod's namespace. This makes dns.lookup() work for tools like web_search and web_fetch that do pre-flight DNS checks before using the HTTP proxy.
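
For context, the pre-flight check that fails in the unfixed sandbox is equivalent to this sketch (`preflight` is an illustrative name, not a function from the codebase):

```python
import socket

def preflight(host: str) -> bool:
    """Illustrative stand-in for the dns.lookup()-style check that tools
    run before handing the request to the HTTP proxy."""
    try:
        socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
        return True
    except socket.gaierror:
        # In the broken sandbox this raises with EAI_AGAIN for every host.
        return False

print(preflight("localhost"))  # resolvable without external DNS → True
```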

Related Issue

Fixes #626
Related: OpenShell #437, #504

Changes

  • New: scripts/setup-dns-proxy.sh — Runs after sandbox creation. Adds the CoreDNS service IP (10.43.0.10) as a local address in the pod, then starts a Python DNS forwarder bound to it. Forwards to the actual CoreDNS pod IP (discovered dynamically) so k8s-internal names (*.svc.cluster.local) still resolve for the openshell-sandbox proxy.
  • scripts/fix-coredns.sh — Extended to all platforms (was Colima-only). The comment already said "broken on all platforms" but the code only gated for Colima.
  • scripts/lib/runtime.sh — Added 8.8.8.8 fallback to resolve_coredns_upstream() for hosts where all nameservers are loopback (e.g. systemd-resolved).
  • bin/lib/platform.js — shouldPatchCoredns() returns true for all known container runtimes (was Colima-only).
  • bin/lib/onboard.js / scripts/setup.sh — Call setup-dns-proxy.sh after sandbox reaches Ready state.
  • .pre-commit-config.yaml — Added --with pytest to Pyright pre-push hook (fixes type check failures on test_endpoint_validation.py).
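
The runtime.sh fallback behavior can be illustrated with a small Python sketch (`pick_upstream` is a hypothetical name; the real implementation is shell in `resolve_coredns_upstream()`):

```python
import ipaddress

def pick_upstream(resolv_conf: str, fallback: str = "8.8.8.8") -> str:
    """Return the first non-loopback nameserver from resolv.conf text,
    else fall back to public DNS (as the PR does when systemd-resolved
    leaves only 127.0.0.53 behind)."""
    for line in resolv_conf.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "nameserver":
            try:
                ip = ipaddress.ip_address(parts[1])
            except ValueError:
                continue
            if not ip.is_loopback:  # filters 127.0.0.0/8 and ::1 alike
                return parts[1]
    return fallback

print(pick_upstream("nameserver 127.0.0.53\n"))   # → 8.8.8.8
print(pick_upstream("nameserver 192.168.1.1\n"))  # → 192.168.1.1
```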

Type of Change

  • Code change for a new feature, bug fix, or refactor.

Testing

  • npx prek run --all-files passes.
  • npm test passes (311/311).
  • End-to-end: nemoclaw onboard --non-interactive creates sandbox with working DNS. socket.getaddrinfo() resolves external names from inside the sandbox. web_search returns live Google Search results via Gemini grounding. Inference routing to inference.local works (the openshell-sandbox proxy can resolve the gateway via k8s DNS).

Design Notes

Two critical requirements discovered during testing:

  1. Bind to 10.43.0.10 (not 0.0.0.0): glibc's resolver uses connected UDP sockets that discard responses from unexpected source IPs. Binding to 0.0.0.0 causes responses to come from the veth address (10.200.0.1), which glibc ignores.

  2. Forward to the CoreDNS pod IP (not 8.8.8.8): adding 10.43.0.10 as a local address intercepts ALL DNS in the pod namespace — including from the openshell-sandbox binary, which needs to resolve openshell-0.openshell.svc.cluster.local to reach the gateway. Public DNS can't resolve *.svc.cluster.local.
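
A minimal sketch of a forwarder that satisfies both requirements (illustrative only; the PR embeds its own Python inline in setup-dns-proxy.sh, so the names here are assumptions):

```python
import socket
import threading

def forward_query(data: bytes, upstream: tuple, timeout: float = 5.0) -> bytes:
    """Relay one raw DNS datagram to the upstream resolver, return the reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as up:
        up.settimeout(timeout)
        up.sendto(data, upstream)
        reply, _ = up.recvfrom(4096)
        return reply

def serve(listen_ip: str, upstream: tuple) -> None:
    """Bind to the service IP itself (requirement 1) and forward to the
    CoreDNS pod IP (requirement 2)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((listen_ip, 53))  # e.g. ("10.43.0.10", 53), never 0.0.0.0
    while True:
        data, addr = sock.recvfrom(4096)
        try:
            sock.sendto(forward_query(data, upstream), addr)
        except OSError:
            pass  # upstream timeout: drop the query, the client retries

# Self-check against a loopback echo socket standing in for CoreDNS.
echo = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
echo.bind(("127.0.0.1", 0))

def _echo_once():
    data, addr = echo.recvfrom(4096)
    echo.sendto(data, addr)

threading.Thread(target=_echo_once, daemon=True).start()
reply = forward_query(b"\xab\xcd", echo.getsockname())
print(reply)  # → b'\xab\xcd'
```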

Checklist

General

Code Changes

  • Formatters applied — npx prek run --all-files auto-fixes formatting.
  • Tests added or updated for new or changed behavior.
  • No secrets, API keys, or credentials committed.

Summary by CodeRabbit

  • New Features

    • DNS proxy setup now integrated into sandbox creation workflow.
    • DNS forwarding support expanded to Docker Desktop and other container runtimes (previously Colima-specific).
  • Improvements

    • CoreDNS patching available across all recognized container runtimes.
    • Enhanced DNS resolution fallback to public DNS when upstream lookups fail.
  • Tests

    • Added DNS proxy validation test suite.
    • Expanded runtime compatibility test coverage.

The sandbox runs in an isolated network namespace (10.200.0.0/24) where
dns.lookup() fails because the CoreDNS service IP (10.43.0.10) is
unreachable. This blocks web_search, web_fetch, and any tool that does
pre-flight DNS resolution before using the HTTP proxy.

Fix: run a Python DNS forwarder in the sandbox pod's root namespace that
intercepts queries to 10.43.0.10 and forwards them to the CoreDNS pod.

Key details:
- setup-dns-proxy.sh: adds 10.43.0.10 as a local address in the pod,
  starts a Python UDP forwarder bound to 10.43.0.10:53, forwards to
  the CoreDNS pod IP (discovered via kubectl endpoints). Launched via
  docker exec -d + nsenter for persistence.
- fix-coredns.sh: extended to all platforms (was Colima-only). The
  comment already said "broken on all platforms" but the code only ran
  for Colima.
- runtime.sh: added 8.8.8.8 fallback to resolve_coredns_upstream() for
  hosts where all nameservers are loopback (systemd-resolved).
- platform.js: shouldPatchCoredns() returns true for all known runtimes.
- onboard.js / setup.sh: call setup-dns-proxy.sh after sandbox creation.

The forwarder MUST bind to 10.43.0.10 (not 0.0.0.0) because glibc uses
connected UDP sockets that discard responses from unexpected source IPs.
It MUST forward to the CoreDNS pod IP (not 8.8.8.8) because the
openshell-sandbox binary also uses DNS to reach the gateway for inference
routing, and public DNS cannot resolve *.svc.cluster.local.
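
The connected-socket behavior behind the bind requirement can be reproduced off-cluster with plain loopback sockets (a standalone sketch, not code from this PR):

```python
import socket

# Two loopback sockets stand in for the expected source (the service IP)
# and an unexpected source (the veth address 10.200.0.1 in the real case).
expected = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
expected.bind(("127.0.0.1", 0))
wrong = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
wrong.bind(("127.0.0.1", 0))

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.settimeout(0.3)
client.connect(expected.getsockname())  # connected UDP, like glibc's resolver
client.send(b"query")
_, client_addr = expected.recvfrom(64)

# A reply from a different source address is filtered by the kernel.
wrong.sendto(b"reply", client_addr)
try:
    client.recv(64)
    dropped = False
except socket.timeout:
    dropped = True
print("wrong-source reply dropped:", dropped)  # → True

# A reply from the connected peer is delivered normally.
expected.sendto(b"reply", client_addr)
good = client.recv(64)
print("right-source reply:", good)  # → b'reply'
```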

Fixes NVIDIA#626
Related: OpenShell NVIDIA#437, NVIDIA#504

coderabbitai bot commented Mar 23, 2026

📝 Walkthrough


The changes extend DNS patching from Colima-specific containers to all recognized container runtimes, introduce a new DNS proxy setup script for sandboxes, and update onboarding and testing accordingly. The DNS proxy forwards DNS queries from sandbox pods to CoreDNS or public DNS.

Changes

Cohort / File(s) Summary
DNS Proxy Infrastructure
scripts/setup-dns-proxy.sh, test/dns-proxy.test.js
Added new DNS proxy setup script that deploys a Python UDP forwarder inside sandbox pods, discovering CoreDNS endpoint and configuring loopback routing. Includes tests validating script existence, sourcing, usage output, and hardcoded configuration values.
CoreDNS Patching Generalization
bin/lib/platform.js, scripts/fix-coredns.sh, scripts/lib/runtime.sh
Expanded CoreDNS patching from Colima-only (shouldPatchCoredns) to all recognized runtimes. Replaced Colima-specific Docker socket detection with generic DOCKER_HOST detection. Changed DNS fallback in resolve_coredns_upstream from failure to returning 8.8.8.8.
Onboarding Integration
bin/lib/onboard.js, scripts/setup.sh
Updated onboarding flow to call DNS proxy setup after sandbox creation. Generalized CoreDNS patch messaging from "Colima" to "DNS forwarding". DNS proxy failures logged as warnings but do not halt onboarding.
Test Updates
test/platform.test.js, test/runtime-shell.test.js
Updated assertions for shouldPatchCoredns to expect true for docker-desktop and docker runtimes; added false assertion for unknown runtime. Added test case verifying fallback to 8.8.8.8 when nameserver resolution fails.
Pre-commit Configuration
.pre-commit-config.yaml
Updated pyright-check hook to include both pytest and pyright in the uv environment via --with pytest pyright.

Sequence Diagram

sequenceDiagram
    actor User
    participant Onboarding as Onboarding Script
    participant Docker as Docker/Container
    participant Sandbox as Sandbox Pod
    participant Proxy as DNS Proxy (Python)
    participant CoreDNS as CoreDNS Service

    User->>Onboarding: nemoclaw onboard
    Onboarding->>Docker: Create sandbox pod
    Docker->>Sandbox: Pod Ready
    Onboarding->>Onboarding: Detect container runtime
    Onboarding->>Onboarding: Patch CoreDNS if runtime ≠ unknown
    Onboarding->>Docker: setup-dns-proxy.sh <sandbox-name>
    Docker->>Sandbox: nsenter + docker exec
    Sandbox->>Sandbox: Configure lo interface (10.43.0.10/32)
    Sandbox->>Sandbox: Deploy /tmp/dns-proxy.py
    Sandbox->>Proxy: Start Python UDP forwarder
    Proxy->>Sandbox: Bind 10.43.0.10:53
    Proxy->>CoreDNS: Forward DNS queries
    CoreDNS->>Proxy: DNS responses
    Proxy->>Sandbox: Return to applications
    Sandbox->>Onboarding: Proxy ready
    Onboarding->>User: Onboarding complete

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰✨ A DNS proxy hops through the sandbox gate,
Forwarding queries before they're too late,
CoreDNS whispers, eight-eight takes the call,
Docker and friends now work one and all! 🌐

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly and specifically describes the primary change: adding DNS resolution via a CoreDNS proxy in the sandbox pod.
Linked Issues check ✅ Passed All coding requirements from issue #626 are met: the PR implements a DNS proxy in the sandbox, enables DNS resolution for web search tools, and preserves Kubernetes internal name resolution.
Out of Scope Changes check ✅ Passed All changes are directly related to DNS resolution in the sandbox. The pre-commit hook update and test file additions are supporting changes that are within scope.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (3)
scripts/setup-dns-proxy.sh (2)

149-158: Consider logging DNS forwarding errors for debugging.

The except Exception: pass silently discards all errors, making DNS issues difficult to diagnose. At minimum, consider logging failures to the existing log file:

Proposed improvement
 def forward(data, addr):
     try:
         f = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
         f.settimeout(5)
         f.sendto(data, UPSTREAM)
         r, _ = f.recvfrom(4096)
         sock.sendto(r, addr)
         f.close()
-    except Exception:
-        pass
+    except Exception as e:
+        with open('/tmp/dns-proxy.log', 'a') as log:
+            log.write('dns-proxy error: {}\\n'.format(e))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/setup-dns-proxy.sh` around lines 149 - 158, The forward function
swallows all exceptions (except Exception: pass), making DNS forwarding failures
invisible; update forward to catch exceptions and write a descriptive error to
the existing log (include exception details and context like addr and UPSTREAM)
instead of silently passing; reference the forward function, UPSTREAM constant,
and sock variable when adding the logging call so you log "DNS forward failed
for addr=<addr> to UPSTREAM=<UPSTREAM>: <exception>" (or similar) and ensure the
socket f is closed in a finally block or use a context manager to avoid resource
leaks.

186-193: Consider verifying the proxy process is actually running.

The current verification only checks if the log file was written. A more robust check would verify the process is still running:

Proposed enhancement
 LOG="$(kctl exec -n openshell "$POD" -- cat /tmp/dns-proxy.log 2>/dev/null || true)"
-if echo "$LOG" | grep -q "dns-proxy:"; then
+NEW_PID="$(kctl exec -n openshell "$POD" -- cat /tmp/dns-proxy.pid 2>/dev/null || true)"
+RUNNING="$(kctl exec -n openshell "$POD" -- kill -0 "$NEW_PID" 2>/dev/null && echo yes || true)"
+if [ "$RUNNING" = "yes" ] && echo "$LOG" | grep -q "dns-proxy:"; then
   echo "DNS proxy started: $LOG"
 else
   echo "WARNING: DNS proxy may not have started. Log: $LOG"
 fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/setup-dns-proxy.sh` around lines 186 - 193, The verification only
checks for log output (LOG) but not that the dns-proxy process is running;
update the check after reading /tmp/dns-proxy.log to also run a process check
inside the pod (use kctl exec -n openshell "$POD" -- pgrep -f 'dns-proxy' or
kctl exec ... -- ps aux | grep '[d]ns-proxy') and consider checking
/tmp/dns-proxy.pid if the service writes a PID file; if the process check fails,
emit the WARNING with the log and a message that the process is not running,
otherwise print DNS proxy started with PID/info. Ensure you reference the
existing LOG, POD, and /tmp/dns-proxy.log occurrences so the new check is
inserted immediately after the current grep-based verification.
test/dns-proxy.test.js (1)

29-39: Test description could be more precise.

The test name says "no sandbox name provided" but it actually provides nemoclaw as the first argument. Per the script's usage ([gateway-name] <sandbox-name>), this is interpreted as GATEWAY_NAME="nemoclaw" with SANDBOX_NAME="", which correctly triggers the usage error. Consider clarifying the test name:

-  it("exits with usage when no sandbox name provided", () => {
+  it("exits with usage when only gateway name provided (no sandbox name)", () => {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/dns-proxy.test.js` around lines 29 - 39, The test name is misleading: it
supplies a first argument ("nemoclaw") which becomes the gateway name while the
sandbox name is omitted, so update the spec in test/dns-proxy.test.js (the
it(...) description) to clearly state that the sandbox name is omitted while a
gateway name is provided (for example: "exits with usage when sandbox name
omitted (gateway provided)"), and keep the existing invocation that calls bash
with SETUP_DNS_PROXY and "nemoclaw" so the behavior and assertions (expecting
non-zero status and Usage: message) remain unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f3de8a15-066b-4be1-a472-15719c66b53a

📥 Commits

Reviewing files that changed from the base of the PR and between 5ac3575 and 51db152.

📒 Files selected for processing (10)
  • .pre-commit-config.yaml
  • bin/lib/onboard.js
  • bin/lib/platform.js
  • scripts/fix-coredns.sh
  • scripts/lib/runtime.sh
  • scripts/setup-dns-proxy.sh
  • scripts/setup.sh
  • test/dns-proxy.test.js
  • test/platform.test.js
  • test/runtime-shell.test.js

Comment thread bin/lib/onboard.js
Comment on lines +617 to +620
// DNS proxy — run a forwarder in the sandbox pod so the isolated
// sandbox namespace can resolve DNS. Must run after sandbox is Ready.
console.log(" Setting up sandbox DNS proxy...");
run(`bash "${path.join(SCRIPTS, "setup-dns-proxy.sh")}" nemoclaw "${sandboxName}" 2>&1 || true`, { ignoreError: true });

⚠️ Potential issue | 🟠 Major

Don't complete onboarding before sandbox DNS is verified.

This path keeps the new sandbox even when the DNS proxy install fails, so users can end up with a registered sandbox that still has broken hostname resolution. Please make this step fail the flow, or gate completion on a post-check from inside the sandbox.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 617 - 620, The DNS proxy installation is
currently allowed to fail silently because run(...) is invoked with {
ignoreError: true }, so remove the ignoreError bypass (or detect the run exit
status) and make the onboarding flow exit/throw on failure of run("...
setup-dns-proxy.sh ...", ...); additionally, after the script completes, perform
a verification step (e.g., kubectl exec into the sandbox pod identified by
sandboxName and run a DNS lookup like dig/host against the expected host) and
only mark onboarding complete if that check succeeds—update the logic around
run, SCRIPTS, and sandboxName to fail the onboarding or rollback the sandbox
when either the install or the post-install DNS verification fails.

Comment thread scripts/fix-coredns.sh
Comment on lines 27 to 32
 if [ -z "${DOCKER_HOST:-}" ]; then
-  if [ -n "$COLIMA_SOCKET" ]; then
-    export DOCKER_HOST="unix://$COLIMA_SOCKET"
-  else
-    echo "Skipping CoreDNS patch: Colima socket not found."
-    exit 0
-  fi
+  if docker_host="$(detect_docker_host)"; then
+    export DOCKER_HOST="$docker_host"
+  fi
+  # If still unset, Docker CLI will use the default socket
 fi

⚠️ Potential issue | 🟠 Major

Guard this path to local Docker daemons.

This generalized flow still derives the fallback resolver from the local machine, but now runs against any detected Docker engine. When DOCKER_HOST points at a remote daemon, the remote CoreDNS config gets patched with the workstation's DNS, which may be unreachable from that cluster. Please either reject non-local Docker hosts here or derive the upstream entirely from the daemon/gateway side.

Also applies to: 49-55

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/fix-coredns.sh` around lines 27 - 32, The current path blindly
accepts any value returned by detect_docker_host into DOCKER_HOST, which can
point at a remote daemon and thus pollute the remote CoreDNS config; change the
logic around the DOCKER_HOST assignment (the docker_host variable and
detect_docker_host call) to validate that the returned host is local before
exporting it — only accept unix socket paths or TCP hosts that resolve to
localhost/127.0.0.1 (or ::1); if docker_host is remote, do not set DOCKER_HOST
and instead fall back to deriving upstream from the daemon/gateway side or bail
with a clear message; apply the same guard/validation to the analogous block
referenced at the later section (lines 49-55).

Comment thread scripts/lib/runtime.sh
Comment on lines +166 to +168
# Last resort: public DNS. Needed on hosts where all nameservers are
# loopback (e.g. systemd-resolved uses 127.0.0.53).
printf '8.8.8.8\n'

⚠️ Potential issue | 🟠 Major

Don't fall back to public DNS before exhausting the host resolver.

On the common 127.0.0.53 / ::1 case this branch now pins CoreDNS to 8.8.8.8 instead of the machine's real upstreams, which breaks split-DNS/VPN/internal zones and leaks queries to a public resolver. Please resolve systemd-resolved's actual upstreams first and treat IPv6 loopback as loopback too.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/lib/runtime.sh` around lines 166 - 168, The current last-resort
branch in runtime.sh that prints '8.8.8.8' must be changed to first attempt to
resolve systemd-resolved's real upstreams and to treat IPv6 loopback as loopback
too; update the logic around the printf '8.8.8.8' to (1) try to read upstream
servers from systemd-resolved (e.g. via resolvectl status or
/run/systemd/resolve/resolv.conf) and extract non-loopback IPs, (2) filter out
loopback addresses including 127.0.0.0/8 and ::1 (and equivalent IPv4-mapped
loopback), and (3) only fall back to a public resolver like 8.8.8.8 if no
non-loopback upstreams are found; keep the change localized to the branch that
currently emits '8.8.8.8' so CoreDNS uses the machine's upstreams when
available.

Comment on lines +90 to +92
POD="$(kctl get pods -n openshell -o name 2>/dev/null \
| grep -- "$SANDBOX_NAME" | head -1 | sed 's|pod/||' || true)"


⚠️ Potential issue | 🟡 Minor

Use fixed-string matching for sandbox name.

The grep without -F interprets $SANDBOX_NAME as a regex pattern. If sandbox names contain characters like ., *, or [, this could match unintended pods.

Proposed fix
-POD="$(kctl get pods -n openshell -o name 2>/dev/null \
-  | grep -- "$SANDBOX_NAME" | head -1 | sed 's|pod/||' || true)"
+POD="$(kctl get pods -n openshell -o name 2>/dev/null \
+  | grep -F -- "$SANDBOX_NAME" | head -1 | sed 's|pod/||' || true)"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/setup-dns-proxy.sh` around lines 90 - 92, The pipeline that sets POD
uses grep -- "$SANDBOX_NAME" which treats the sandbox name as a regex; change
grep to fixed-string mode (e.g., use grep -F -- "$SANDBOX_NAME") so special
characters in $SANDBOX_NAME are matched literally. Update the line that assigns
POD (the command pipeline with kctl get pods | grep -- "$SANDBOX_NAME" | head -1
| sed 's|pod/||') to use grep -F -- "$SANDBOX_NAME" to ensure correct, literal
matching.

Comment on lines +113 to +117
if [ -z "$POD_PID" ]; then
echo "WARNING: Could not find pod PID via hostname. Trying kubectl..."
# Fallback: use kubectl exec to find a PID we can nsenter into
POD_PID="$(kctl exec -n openshell "$POD" -- sh -c 'echo $$' 2>/dev/null || true)"
fi

⚠️ Potential issue | 🟠 Major

Fallback PID may not be valid for nsenter.

The fallback uses kubectl exec ... echo $$ which returns the shell's PID as seen from within the pod's PID namespace. However, nsenter -t "$POD_PID" (line 181) requires the PID as seen from the gateway container's namespace. These are different values.

If the primary hostname-matching method fails and this fallback is used, the subsequent nsenter at line 181 will likely fail or enter the wrong namespace.

Consider either:

  1. Removing the fallback and failing explicitly
  2. Finding the PID through an alternative host-visible method
Option: Remove fallback and fail explicitly
 if [ -z "$POD_PID" ]; then
-  echo "WARNING: Could not find pod PID via hostname. Trying kubectl..."
-  # Fallback: use kubectl exec to find a PID we can nsenter into
-  POD_PID="$(kctl exec -n openshell "$POD" -- sh -c 'echo $$' 2>/dev/null || true)"
-fi
-
-if [ -z "$POD_PID" ]; then
   echo "ERROR: Could not determine pod PID for nsenter."
+  echo "       Hostname matching in /proc failed for pod '$POD'."
   exit 1
 fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/setup-dns-proxy.sh` around lines 113 - 117, The fallback that sets
POD_PID via "kctl exec ... echo $$" is invalid for nsenter because that PID is
in the pod's PID namespace, not the gateway/container host namespace; remove the
fallback block that assigns POD_PID with kctl exec and instead fail explicitly
when POD_PID is empty (print a clear error and exit non-zero), so the later
nsenter invocation (which requires a host-visible PID) is not attempted with a
wrong value; update the script's POD_PID handling to use the original hostname
method only and error out if POD_PID remains unset (references: POD_PID, kctl
exec, nsenter).

Comment thread scripts/setup.sh
Comment on lines +233 to +235
# 5b. DNS proxy for sandbox — run after sandbox is Ready.
info "Setting up sandbox DNS proxy..."
bash "$SCRIPT_DIR/setup-dns-proxy.sh" nemoclaw "$SANDBOX_NAME" 2>&1 || warn "DNS proxy setup failed (may not be needed)"

⚠️ Potential issue | 🟠 Major

Make DNS proxy setup part of the success criteria.

setup-dns-proxy.sh is the fix that restores sandbox-side dns.lookup() / getaddrinfo(). Swallowing failures here lets setup.sh finish with “Setup complete!” even though the sandbox still returns EAI_AGAIN. Fail the setup, or at least run an in-sandbox resolution check and only continue when it passes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/setup.sh` around lines 233 - 235, The DNS proxy step currently
swallows errors (bash "$SCRIPT_DIR/setup-dns-proxy.sh" ... || warn ...) so
setup.sh can report success despite unresolved EAI_AGAIN; change this to treat
failures as fatal or block until DNS works: invoke setup-dns-proxy.sh without
the short-circuit warn and if it exits non-zero call error/exit (or retry), then
perform an in-sandbox resolution check (e.g., run a simple node or getent/dig
inside the sandbox using SANDBOX_NAME) and loop/retry with backoff until the
lookup succeeds before printing "Setup complete!" — update the calls around info
"Setting up sandbox DNS proxy...", the bash invocation of setup-dns-proxy.sh,
and the subsequent success path to enforce failure or readiness check.

@wscurran added the bug (Something isn't working) and OpenShell (Support for OpenShell, a safe, private runtime for autonomous AI agents) labels on Mar 23, 2026
@wscurran
Contributor

Thanks for submitting this PR. It addresses an issue with DNS resolution in the sandbox environment, and the changes are well documented and easy to follow.

@senthilr-nv
Contributor

Tested this on DGX Spark (ARM64) and Brev VM (x86_64) — DNS is broken from the sandbox shell on both platforms, so the fix is needed everywhere.

The approach works. After running setup-dns-proxy.sh, 10.43.0.10 is added to the pod's lo interface, the Python forwarder is written correctly, and the CoreDNS pod IP is discovered (10.42.0.6 on Spark). Once the forwarder is running, getent hosts github.com and node dns.lookup() both resolve from inside the sandbox.

One issue: the docker exec -d + nsenter launch path (around line 181) didn't actually start the forwarder on either platform. The script completes without error, but the proxy process never appears. CodeRabbit's note about the fallback PID-namespace mismatch looks correct. When I launched the forwarder manually via kubectl exec from the pod, it worked immediately.

One approach that worked for me: replacing the nsenter launch with kubectl exec ... -- sh -c 'nohup python3 -u /tmp/dns-proxy.py > /tmp/dns-proxy.log 2>&1 &'.

Happy to retest once the launch path is updated.
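The kubectl exec launch suggested above works because `sh -c 'nohup … &'` returns immediately while the child keeps running. The detach pattern can be demonstrated locally (the demo files and sleep command are stand-ins; in the PR the outer command would be the kubectl exec invocation quoted in the comment):

```shell
#!/bin/sh
# Demo of the detach pattern behind the suggested fix. In the real fix the
# outer command would be:
#   kubectl exec -n nemoclaw "$POD" -- \
#     sh -c 'nohup python3 -u /tmp/dns-proxy.py > /tmp/dns-proxy.log 2>&1 &'
demo_dir="$(mktemp -d)"
sh -c "nohup sleep 5 > '$demo_dir/out.log' 2>&1 & echo \$! > '$demo_dir/pid'"
# The launching shell has already exited, but the backgrounded child survives:
kill -0 "$(cat "$demo_dir/pid")" && echo "background child is still running"
```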

senthilr-nv added a commit to senthilr-nv/NemoClaw that referenced this pull request Mar 28, 2026
…NVIDIA#626)

The sandbox runs in an isolated network namespace (10.200.0.0/24) where
OpenShell's iptables rules reject all non-proxy traffic including UDP.
This causes getaddrinfo EAI_AGAIN for every outbound DNS lookup,
breaking web_fetch, web_search, and all Node.js HTTP tools.

Fix (three steps in new scripts/setup-dns-proxy.sh):
1. Run a Python UDP DNS forwarder on the pod-side veth gateway
   (10.200.0.1:53), forwarding to the real CoreDNS pod IP
2. Add an iptables rule in the sandbox namespace allowing UDP to
   the gateway on port 53 (the only non-proxy firewall exception)
3. Update the sandbox's /etc/resolv.conf to point to 10.200.0.1

Also broadens the CoreDNS patch (fix-coredns.sh / shouldPatchCoredns)
to run on all Docker-based runtimes, not just Colima — k3s-inside-Docker
has broken DNS forwarding on Linux hosts with systemd-resolved too.

Inspired by PR NVIDIA#732's DNS forwarder approach. Key differences:
- Uses kubectl exec instead of nsenter (fixes PR NVIDIA#732's launch bug)
- Handles the sandbox iptables constraint (all UDP blocked)
- Resolves systemd-resolved upstreams via resolvectl before falling
  back to 8.8.8.8
- Uses grep -F for fixed-string sandbox name matching

Tested on DGX Spark (ARM64): fresh destroy + onboard, getent hosts
and node dns.lookup both resolve from inside the sandbox.
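The forwarder in step 1 can be sketched as a plain UDP relay (a minimal illustration, not the actual /tmp/dns-proxy.py from the commit; the addresses are parameters, and in this setup the listen socket would be bound to 10.200.0.1:53 with upstream_addr set to the discovered CoreDNS pod IP):

```python
import socket
import threading


def serve_dns_forwarder(sock: socket.socket, upstream_addr, stop_event: threading.Event) -> None:
    """Relay each UDP datagram arriving on `sock` to upstream_addr and send
    the reply back to the original client. The caller binds `sock` (in the
    commit's setup, to the pod-side veth gateway) before starting the loop.
    """
    sock.settimeout(0.5)          # wake periodically to honor stop_event
    while not stop_event.is_set():
        try:
            query, client = sock.recvfrom(4096)
        except socket.timeout:
            continue
        # One throwaway upstream socket per query keeps replies matched
        # to the right client without tracking DNS transaction IDs.
        upstream = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        upstream.settimeout(3)
        try:
            upstream.sendto(query, upstream_addr)
            answer, _ = upstream.recvfrom(4096)
            sock.sendto(answer, client)
        except OSError:
            pass                  # drop on upstream failure; the client retries
        finally:
            upstream.close()
```

Because the relay never parses the packets, it forwards any DNS query type unchanged, which is all the sandbox's resolver needs.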
kjw3 pushed a commit that referenced this pull request Mar 29, 2026
… (#1062)

* fix(sandbox): add DNS forwarder so web_fetch resolves hostnames (fixes #626)

The sandbox runs in an isolated network namespace (10.200.0.0/24) where
OpenShell's iptables rules reject all non-proxy traffic including UDP.
This causes getaddrinfo EAI_AGAIN for every outbound DNS lookup,
breaking web_fetch, web_search, and all Node.js HTTP tools.

Fix (three steps in new scripts/setup-dns-proxy.sh):
1. Run a Python UDP DNS forwarder on the pod-side veth gateway
   (10.200.0.1:53), forwarding to the real CoreDNS pod IP
2. Add an iptables rule in the sandbox namespace allowing UDP to
   the gateway on port 53 (the only non-proxy firewall exception)
3. Update the sandbox's /etc/resolv.conf to point to 10.200.0.1

Also broadens the CoreDNS patch (fix-coredns.sh / shouldPatchCoredns)
to run on all Docker-based runtimes, not just Colima — k3s-inside-Docker
has broken DNS forwarding on Linux hosts with systemd-resolved too.

Inspired by PR #732's DNS forwarder approach. Key differences:
- Uses kubectl exec instead of nsenter (fixes PR #732's launch bug)
- Handles the sandbox iptables constraint (all UDP blocked)
- Resolves systemd-resolved upstreams via resolvectl before falling
  back to 8.8.8.8
- Uses grep -F for fixed-string sandbox name matching

Tested on DGX Spark (ARM64): fresh destroy + onboard, getent hosts
and node dns.lookup both resolve from inside the sandbox.

* fix(sandbox): add runtime DNS verification after setup

setup-dns-proxy.sh now verifies all three DNS bridge layers from inside
the sandbox namespace post-deployment: resolv.conf target, iptables UDP
rule, and actual getent hosts resolution. Reports [PASS]/[FAIL] per
check so failures are immediately visible in onboard logs.
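The per-check reporting this follow-up commit describes can be sketched as follows (the probes here are runnable stand-ins; the real checks would run inside the sandbox namespace, as in the commented example):

```shell
#!/bin/sh
# [PASS]/[FAIL] reporter in the style the commit describes. Each probe is a
# command; a real check would run inside the sandbox, e.g.:
#   check "resolv.conf -> 10.200.0.1" \
#     kubectl exec "$POD" -- grep -q 'nameserver 10.200.0.1' /etc/resolv.conf
FAILED=0
check() {
  desc="$1"; shift
  if "$@" >/dev/null 2>&1; then
    printf '[PASS] %s\n' "$desc"
  else
    printf '[FAIL] %s\n' "$desc"
    FAILED=1
  fi
}

# Stand-in probes so the sketch runs anywhere:
check "shell is available" command -v sh
check "deliberately failing probe" false
echo "failed=$FAILED"
```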
laitingsheng pushed a commit that referenced this pull request Apr 2, 2026
… (#1062)
lakamsani pushed a commit to lakamsani/NemoClaw that referenced this pull request Apr 4, 2026
…DIA#626) (NVIDIA#1062)
@wscurran
Contributor

Thanks for the CoreDNS proxy approach to sandbox DNS. The codebase has changed significantly since March 23 — including a full TypeScript migration — so this will need a rebase on origin/main before we can review it. Please rebase and resolve any conflicts, and we'll take a look.

@wscurran wscurran added the status: rebase PR needs to be rebased against main before review can continue label Apr 14, 2026
gemini2026 pushed a commit to gemini2026/NemoClaw that referenced this pull request Apr 14, 2026
…DIA#626) (NVIDIA#1062)

Development

Successfully merging this pull request may close these issues.

Web search tools fail in sandbox: DNS resolution blocked (getaddrinfo EAI_AGAIN)

3 participants