fix: sandbox DNS resolution via CoreDNS proxy in sandbox pod #732
jestyr27 wants to merge 1 commit into NVIDIA:main
Conversation
The sandbox runs in an isolated network namespace (10.200.0.0/24) where dns.lookup() fails because the CoreDNS service IP (10.43.0.10) is unreachable. This blocks web_search, web_fetch, and any tool that does pre-flight DNS resolution before using the HTTP proxy.

Fix: run a Python DNS forwarder in the sandbox pod's root namespace that intercepts queries to 10.43.0.10 and forwards them to the CoreDNS pod. Key details:

- setup-dns-proxy.sh: adds 10.43.0.10 as a local address in the pod, starts a Python UDP forwarder bound to 10.43.0.10:53, forwards to the CoreDNS pod IP (discovered via kubectl endpoints). Launched via docker exec -d + nsenter for persistence.
- fix-coredns.sh: extended to all platforms (was Colima-only). The comment already said "broken on all platforms" but the code only ran for Colima.
- runtime.sh: added 8.8.8.8 fallback to resolve_coredns_upstream() for hosts where all nameservers are loopback (systemd-resolved).
- platform.js: shouldPatchCoredns() returns true for all known runtimes.
- onboard.js / setup.sh: call setup-dns-proxy.sh after sandbox creation.

The forwarder MUST bind to 10.43.0.10 (not 0.0.0.0) because glibc uses connected UDP sockets that discard responses from unexpected source IPs. It MUST forward to the CoreDNS pod IP (not 8.8.8.8) because the openshell-sandbox binary also uses DNS to reach the gateway for inference routing, and public DNS cannot resolve *.svc.cluster.local.

Fixes NVIDIA#626
Related: OpenShell NVIDIA#437, NVIDIA#504
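For readers skimming the diff, here is a minimal sketch of the forwarder described above. The upstream pod IP is an illustrative placeholder, not the exact contents of setup-dns-proxy.sh; the real script discovers it via kubectl endpoints at setup time.

```python
#!/usr/bin/env python3
# Minimal sketch of the forwarder. Assumes 10.43.0.10 has already been
# added to the pod (ip addr add 10.43.0.10/32 dev lo).
import socket

LISTEN = ("10.43.0.10", 53)   # bind to the CoreDNS service IP, not 0.0.0.0:
                              # glibc connects its UDP socket and the kernel
                              # drops replies from any other source address
UPSTREAM = ("10.42.0.6", 53)  # CoreDNS *pod* IP, so *.svc.cluster.local resolves

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(LISTEN)

while True:
    data, addr = sock.recvfrom(4096)
    up = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    up.settimeout(5)
    try:
        up.sendto(data, UPSTREAM)
        reply, _ = up.recvfrom(4096)
        sock.sendto(reply, addr)  # reply leaves from 10.43.0.10, as glibc expects
    except OSError:
        pass  # see the review note below about logging these failures
    finally:
        up.close()
```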
📝 Walkthrough
The changes extend DNS patching from Colima-specific containers to all recognized container runtimes, introduce a new DNS proxy setup script for sandboxes, and update onboarding and testing accordingly. The DNS proxy forwards DNS queries from sandbox pods to CoreDNS or public DNS.
Sequence Diagram

```mermaid
sequenceDiagram
actor User
participant Onboarding as Onboarding Script
participant Docker as Docker/Container
participant Sandbox as Sandbox Pod
participant Proxy as DNS Proxy (Python)
participant CoreDNS as CoreDNS Service
User->>Onboarding: nemoclaw onboard
Onboarding->>Docker: Create sandbox pod
Docker->>Sandbox: Pod Ready
Onboarding->>Onboarding: Detect container runtime
Onboarding->>Onboarding: Patch CoreDNS if runtime ≠ unknown
Onboarding->>Docker: setup-dns-proxy.sh <sandbox-name>
Docker->>Sandbox: nsenter + docker exec
Sandbox->>Sandbox: Configure lo interface (10.43.0.10/32)
Sandbox->>Sandbox: Deploy /tmp/dns-proxy.py
Sandbox->>Proxy: Start Python UDP forwarder
Proxy->>Sandbox: Bind 10.43.0.10:53
Proxy->>CoreDNS: Forward DNS queries
CoreDNS->>Proxy: DNS responses
Proxy->>Sandbox: Return to applications
Sandbox->>Onboarding: Proxy ready
    Onboarding->>User: Onboarding complete
```
Estimated Code Review Effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 4 passed, ❌ 1 failed (1 warning)
Actionable comments posted: 6
🧹 Nitpick comments (3)
scripts/setup-dns-proxy.sh (2)
149-158: Consider logging DNS forwarding errors for debugging.

The `except Exception: pass` silently discards all errors, making DNS issues difficult to diagnose. At minimum, consider logging failures to the existing log file:

Proposed improvement

```diff
 def forward(data, addr):
     try:
         f = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
         f.settimeout(5)
         f.sendto(data, UPSTREAM)
         r, _ = f.recvfrom(4096)
         sock.sendto(r, addr)
         f.close()
-    except Exception:
-        pass
+    except Exception as e:
+        with open('/tmp/dns-proxy.log', 'a') as log:
+            log.write('dns-proxy error: {}\\n'.format(e))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/setup-dns-proxy.sh` around lines 149 - 158, The forward function swallows all exceptions (except Exception: pass), making DNS forwarding failures invisible; update forward to catch exceptions and write a descriptive error to the existing log (include exception details and context like addr and UPSTREAM) instead of silently passing; reference the forward function, UPSTREAM constant, and sock variable when adding the logging call so you log "DNS forward failed for addr=<addr> to UPSTREAM=<UPSTREAM>: <exception>" (or similar) and ensure the socket f is closed in a finally block or use a context manager to avoid resource leaks.
186-193: Consider verifying the proxy process is actually running.

The current verification only checks if the log file was written. A more robust check would verify the process is still running:

Proposed enhancement

```diff
 LOG="$(kctl exec -n openshell "$POD" -- cat /tmp/dns-proxy.log 2>/dev/null || true)"
-if echo "$LOG" | grep -q "dns-proxy:"; then
+NEW_PID="$(kctl exec -n openshell "$POD" -- cat /tmp/dns-proxy.pid 2>/dev/null || true)"
+RUNNING="$(kctl exec -n openshell "$POD" -- kill -0 "$NEW_PID" 2>/dev/null && echo yes || true)"
+if [ "$RUNNING" = "yes" ] && echo "$LOG" | grep -q "dns-proxy:"; then
   echo "DNS proxy started: $LOG"
 else
   echo "WARNING: DNS proxy may not have started. Log: $LOG"
 fi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/setup-dns-proxy.sh` around lines 186 - 193, The verification only checks for log output (LOG) but not that the dns-proxy process is running; update the check after reading /tmp/dns-proxy.log to also run a process check inside the pod (use kctl exec -n openshell "$POD" -- pgrep -f 'dns-proxy' or kctl exec ... -- ps aux | grep '[d]ns-proxy') and consider checking /tmp/dns-proxy.pid if the service writes a PID file; if the process check fails, emit the WARNING with the log and a message that the process is not running, otherwise print DNS proxy started with PID/info. Ensure you reference the existing LOG, POD, and /tmp/dns-proxy.log occurrences so the new check is inserted immediately after the current grep-based verification.

test/dns-proxy.test.js (1)
29-39: Test description could be more precise.

The test name says "no sandbox name provided" but it actually provides `nemoclaw` as the first argument. Per the script's usage (`[gateway-name] <sandbox-name>`), this is interpreted as `GATEWAY_NAME="nemoclaw"` with `SANDBOX_NAME=""`, which correctly triggers the usage error. Consider clarifying the test name:

```diff
-  it("exits with usage when no sandbox name provided", () => {
+  it("exits with usage when only gateway name provided (no sandbox name)", () => {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/dns-proxy.test.js` around lines 29 - 39, The test name is misleading: it supplies a first argument ("nemoclaw") which becomes the gateway name while the sandbox name is omitted, so update the spec in test/dns-proxy.test.js (the it(...) description) to clearly state that the sandbox name is omitted while a gateway name is provided (for example: "exits with usage when sandbox name omitted (gateway provided)"), and keep the existing invocation that calls bash with SETUP_DNS_PROXY and "nemoclaw" so the behavior and assertions (expecting non-zero status and Usage: message) remain unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f3de8a15-066b-4be1-a472-15719c66b53a
📒 Files selected for processing (10)
- .pre-commit-config.yaml
- bin/lib/onboard.js
- bin/lib/platform.js
- scripts/fix-coredns.sh
- scripts/lib/runtime.sh
- scripts/setup-dns-proxy.sh
- scripts/setup.sh
- test/dns-proxy.test.js
- test/platform.test.js
- test/runtime-shell.test.js
```js
// DNS proxy — run a forwarder in the sandbox pod so the isolated
// sandbox namespace can resolve DNS. Must run after sandbox is Ready.
console.log("  Setting up sandbox DNS proxy...");
run(`bash "${path.join(SCRIPTS, "setup-dns-proxy.sh")}" nemoclaw "${sandboxName}" 2>&1 || true`, { ignoreError: true });
```
Don't complete onboarding before sandbox DNS is verified.
This path keeps the new sandbox even when the DNS proxy install fails, so users can end up with a registered sandbox that still has broken hostname resolution. Please make this step fail the flow, or gate completion on a post-check from inside the sandbox.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bin/lib/onboard.js` around lines 617 - 620, The DNS proxy installation is
currently allowed to fail silently because run(...) is invoked with {
ignoreError: true }, so remove the ignoreError bypass (or detect the run exit
status) and make the onboarding flow exit/throw on failure of run("...
setup-dns-proxy.sh ...", ...); additionally, after the script completes, perform
a verification step (e.g., kubectl exec into the sandbox pod identified by
sandboxName and run a DNS lookup like dig/host against the expected host) and
only mark onboarding complete if that check succeeds—update the logic around
run, SCRIPTS, and sandboxName to fail the onboarding or rollback the sandbox
when either the install or the post-install DNS verification fails.
```diff
 if [ -z "${DOCKER_HOST:-}" ]; then
-  if [ -n "$COLIMA_SOCKET" ]; then
-    export DOCKER_HOST="unix://$COLIMA_SOCKET"
-  else
-    echo "Skipping CoreDNS patch: Colima socket not found."
-    exit 0
+  if docker_host="$(detect_docker_host)"; then
+    export DOCKER_HOST="$docker_host"
   fi
+  # If still unset, Docker CLI will use the default socket
 fi
```
Guard this path to local Docker daemons.
This generalized flow still derives the fallback resolver from the local machine, but now runs against any detected Docker engine. When DOCKER_HOST points at a remote daemon, the remote CoreDNS config gets patched with the workstation's DNS, which may be unreachable from that cluster. Please either reject non-local Docker hosts here or derive the upstream entirely from the daemon/gateway side.
Also applies to: 49-55
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/fix-coredns.sh` around lines 27 - 32, The current path blindly
accepts any value returned by detect_docker_host into DOCKER_HOST, which can
point at a remote daemon and thus pollute the remote CoreDNS config; change the
logic around the DOCKER_HOST assignment (the docker_host variable and
detect_docker_host call) to validate that the returned host is local before
exporting it — only accept unix socket paths or TCP hosts that resolve to
localhost/127.0.0.1 (or ::1); if docker_host is remote, do not set DOCKER_HOST
and instead fall back to deriving upstream from the daemon/gateway side or bail
with a clear message; apply the same guard/validation to the analogous block
referenced at the later section (lines 49-55).
```sh
# Last resort: public DNS. Needed on hosts where all nameservers are
# loopback (e.g. systemd-resolved uses 127.0.0.53).
printf '8.8.8.8\n'
```
Don't fall back to public DNS before exhausting the host resolver.
On the common 127.0.0.53 / ::1 case this branch now pins CoreDNS to 8.8.8.8 instead of the machine's real upstreams, which breaks split-DNS/VPN/internal zones and leaks queries to a public resolver. Please resolve systemd-resolved's actual upstreams first and treat IPv6 loopback as loopback too.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/lib/runtime.sh` around lines 166 - 168, The current last-resort
branch in runtime.sh that prints '8.8.8.8' must be changed to first attempt to
resolve systemd-resolved's real upstreams and to treat IPv6 loopback as loopback
too; update the logic around the printf '8.8.8.8' to (1) try to read upstream
servers from systemd-resolved (e.g. via resolvectl status or
/run/systemd/resolve/resolv.conf) and extract non-loopback IPs, (2) filter out
loopback addresses including 127.0.0.0/8 and ::1 (and equivalent IPv4-mapped
loopback), and (3) only fall back to a public resolver like 8.8.8.8 if no
non-loopback upstreams are found; keep the change localized to the branch that
currently emits '8.8.8.8' so CoreDNS uses the machine's upstreams when
available.
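The filtering the reviewer asks for is straightforward to express. Here is a sketch of the selection logic in Python; the shipped script is shell, so this only illustrates the algorithm. It assumes /run/systemd/resolve/resolv.conf, which is where systemd-resolved publishes its real upstreams (unlike the 127.0.0.53 stub file).

```python
# Sketch: prefer systemd-resolved's real upstreams, treating 127.0.0.0/8,
# ::1, and IPv4-mapped loopback as loopback; fall back to 8.8.8.8 only
# when no non-loopback upstream exists.
import ipaddress

def is_loopback(ns: str) -> bool:
    addr = ipaddress.ip_address(ns)
    mapped = getattr(addr, "ipv4_mapped", None)  # handles ::ffff:127.0.0.1
    return addr.is_loopback or (mapped is not None and mapped.is_loopback)

def pick_upstream(path: str = "/run/systemd/resolve/resolv.conf") -> str:
    nameservers = []
    try:
        with open(path) as f:
            for line in f:
                parts = line.split()
                if len(parts) >= 2 and parts[0] == "nameserver":
                    nameservers.append(parts[1])
    except OSError:
        pass  # file absent: not a systemd-resolved host
    for ns in nameservers:
        try:
            if not is_loopback(ns):
                return ns
        except ValueError:
            continue  # skip malformed entries
    return "8.8.8.8"  # genuine last resort
```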
```sh
POD="$(kctl get pods -n openshell -o name 2>/dev/null \
  | grep -- "$SANDBOX_NAME" | head -1 | sed 's|pod/||' || true)"
```
Use fixed-string matching for sandbox name.
The grep without -F interprets $SANDBOX_NAME as a regex pattern. If sandbox names contain characters like ., *, or [, this could match unintended pods.
Proposed fix

```diff
-POD="$(kctl get pods -n openshell -o name 2>/dev/null \
-  | grep -- "$SANDBOX_NAME" | head -1 | sed 's|pod/||' || true)"
+POD="$(kctl get pods -n openshell -o name 2>/dev/null \
+  | grep -F -- "$SANDBOX_NAME" | head -1 | sed 's|pod/||' || true)"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```sh
POD="$(kctl get pods -n openshell -o name 2>/dev/null \
  | grep -F -- "$SANDBOX_NAME" | head -1 | sed 's|pod/||' || true)"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/setup-dns-proxy.sh` around lines 90 - 92, The pipeline that sets POD
uses grep -- "$SANDBOX_NAME" which treats the sandbox name as a regex; change
grep to fixed-string mode (e.g., use grep -F -- "$SANDBOX_NAME") so special
characters in $SANDBOX_NAME are matched literally. Update the line that assigns
POD (the command pipeline with kctl get pods | grep -- "$SANDBOX_NAME" | head -1
| sed 's|pod/||') to use grep -F -- "$SANDBOX_NAME" to ensure correct, literal
matching.
```sh
if [ -z "$POD_PID" ]; then
  echo "WARNING: Could not find pod PID via hostname. Trying kubectl..."
  # Fallback: use kubectl exec to find a PID we can nsenter into
  POD_PID="$(kctl exec -n openshell "$POD" -- sh -c 'echo $$' 2>/dev/null || true)"
fi
```
Fallback PID may not be valid for nsenter.
The fallback uses kubectl exec ... echo $$ which returns the shell's PID as seen from within the pod's PID namespace. However, nsenter -t "$POD_PID" (line 181) requires the PID as seen from the gateway container's namespace. These are different values.
If the primary hostname-matching method fails and this fallback is used, the subsequent nsenter at line 181 will likely fail or enter the wrong namespace.
Consider either:
- Removing the fallback and failing explicitly
- Finding the PID through an alternative host-visible method
Option: Remove fallback and fail explicitly

```diff
 if [ -z "$POD_PID" ]; then
-  echo "WARNING: Could not find pod PID via hostname. Trying kubectl..."
-  # Fallback: use kubectl exec to find a PID we can nsenter into
-  POD_PID="$(kctl exec -n openshell "$POD" -- sh -c 'echo $$' 2>/dev/null || true)"
-fi
-
-if [ -z "$POD_PID" ]; then
   echo "ERROR: Could not determine pod PID for nsenter."
+  echo "  Hostname matching in /proc failed for pod '$POD'."
   exit 1
 fi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/setup-dns-proxy.sh` around lines 113 - 117, The fallback that sets
POD_PID via "kctl exec ... echo $$" is invalid for nsenter because that PID is
in the pod's PID namespace, not the gateway/container host namespace; remove the
fallback block that assigns POD_PID with kctl exec and instead fail explicitly
when POD_PID is empty (print a clear error and exit non-zero), so the later
nsenter invocation (which requires a host-visible PID) is not attempted with a
wrong value; update the script's POD_PID handling to use the original hostname
method only and error out if POD_PID remains unset (references: POD_PID, kctl
exec, nsenter).
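If the "alternative host-visible method" is preferred over hard-failing, PID namespaces can be translated from the host side via the NSpid field of /proc/<pid>/status. A rough sketch, not part of this PR, assuming the code runs somewhere the pod's processes are visible in /proc:

```python
# Sketch: translate a pod-namespace PID (e.g. the $$ from kubectl exec)
# into a host-visible PID by scanning NSpid, which lists a process's PID
# at every namespace level (outermost first, innermost last).
import os

def host_pid_for(ns_pid: int):
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/status") as f:
                for line in f:
                    if line.startswith("NSpid:"):
                        ids = line.split()[1:]
                        if len(ids) > 1 and ids[-1] == str(ns_pid):
                            return int(entry)  # usable with nsenter -t
                        break
        except OSError:
            continue  # process exited mid-scan
    return None
```

Note the innermost PID is only unique within one namespace, so a real implementation would also match on the pod's cgroup or hostname.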
```sh
# 5b. DNS proxy for sandbox — run after sandbox is Ready.
info "Setting up sandbox DNS proxy..."
bash "$SCRIPT_DIR/setup-dns-proxy.sh" nemoclaw "$SANDBOX_NAME" 2>&1 || warn "DNS proxy setup failed (may not be needed)"
```
Make DNS proxy setup part of the success criteria.
setup-dns-proxy.sh is the fix that restores sandbox-side dns.lookup() / getaddrinfo(). Swallowing failures here lets setup.sh finish with “Setup complete!” even though the sandbox still returns EAI_AGAIN. Fail the setup, or at least run an in-sandbox resolution check and only continue when it passes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/setup.sh` around lines 233 - 235, The DNS proxy step currently
swallows errors (bash "$SCRIPT_DIR/setup-dns-proxy.sh" ... || warn ...) so
setup.sh can report success despite unresolved EAI_AGAIN; change this to treat
failures as fatal or block until DNS works: invoke setup-dns-proxy.sh without
the short-circuit warn and if it exits non-zero call error/exit (or retry), then
perform an in-sandbox resolution check (e.g., run a simple node or getent/dig
inside the sandbox using SANDBOX_NAME) and loop/retry with backoff until the
lookup succeeds before printing "Setup complete!" — update the calls around info
"Setting up sandbox DNS proxy...", the bash invocation of setup-dns-proxy.sh,
and the subsequent success path to enforce failure or readiness check.
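A readiness gate along these lines would be enough. This is a hypothetical sketch: the real scripts are shell and wrap kubectl as kctl, and the probe hostname is arbitrary.

```python
# Sketch: retry an in-sandbox lookup with exponential backoff and only
# let setup report success once resolution works.
import subprocess, sys, time

def wait_for_sandbox_dns(pod: str, attempts: int = 6) -> bool:
    delay = 1
    for _ in range(attempts):
        probe = subprocess.run(
            ["kubectl", "exec", "-n", "openshell", pod, "--",
             "getent", "hosts", "github.com"],
            capture_output=True,
        )
        if probe.returncode == 0:
            return True
        time.sleep(delay)
        delay = min(delay * 2, 30)  # capped exponential backoff
    return False

if __name__ == "__main__":
    if not wait_for_sandbox_dns(sys.argv[1]):
        sys.exit("sandbox DNS never became ready; failing setup")
```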
Thanks for submitting this PR. It looks like it addresses an issue with DNS resolution in the sandbox environment, and the changes are well-documented and easy to follow.

Tested this on DGX Spark (ARM64) and Brev VM (x86_64) — DNS is broken from the sandbox shell on both platforms, so the fix is needed everywhere.

The approach works. After running setup-dns-proxy.sh, 10.43.0.10 gets added to the pod's lo, the Python forwarder is written correctly, and the CoreDNS pod IP is discovered (10.42.0.6 on Spark). Once the forwarder is running, getent hosts github.com and node dns.lookup() both resolve from inside the sandbox.

One issue: the docker exec -d + nsenter launch path (around line 181) didn't actually start the forwarder on either platform — the script completes without error but the proxy process never appears. CodeRabbit's note about the fallback PID namespace mismatch looks correct. When I launched it manually via kubectl exec from the pod, it worked immediately. One approach that worked for me: replacing the nsenter launch with kubectl exec ... -- sh -c 'nohup python3 -u /tmp/dns-proxy.py > /tmp/dns-proxy.log 2>&1 &'. Happy to retest once the launch path is updated.
… (#1062)

* fix(sandbox): add DNS forwarder so web_fetch resolves hostnames (fixes #626)

The sandbox runs in an isolated network namespace (10.200.0.0/24) where OpenShell's iptables rules reject all non-proxy traffic including UDP. This causes getaddrinfo EAI_AGAIN for every outbound DNS lookup, breaking web_fetch, web_search, and all Node.js HTTP tools.

Fix (three steps in new scripts/setup-dns-proxy.sh):
1. Run a Python UDP DNS forwarder on the pod-side veth gateway (10.200.0.1:53), forwarding to the real CoreDNS pod IP
2. Add an iptables rule in the sandbox namespace allowing UDP to the gateway on port 53 (the only non-proxy firewall exception)
3. Update the sandbox's /etc/resolv.conf to point to 10.200.0.1

Also broadens the CoreDNS patch (fix-coredns.sh / shouldPatchCoredns) to run on all Docker-based runtimes, not just Colima — k3s-inside-Docker has broken DNS forwarding on Linux hosts with systemd-resolved too.

Inspired by PR #732's DNS forwarder approach. Key differences:
- Uses kubectl exec instead of nsenter (fixes PR #732's launch bug)
- Handles the sandbox iptables constraint (all UDP blocked)
- Resolves systemd-resolved upstreams via resolvectl before falling back to 8.8.8.8
- Uses grep -F for fixed-string sandbox name matching

Tested on DGX Spark (ARM64): fresh destroy + onboard, getent hosts and node dns.lookup both resolve from inside the sandbox.

* fix(sandbox): add runtime DNS verification after setup

setup-dns-proxy.sh now verifies all three DNS bridge layers from inside the sandbox namespace post-deployment: resolv.conf target, iptables UDP rule, and actual getent hosts resolution. Reports [PASS]/[FAIL] per check so failures are immediately visible in onboard logs.
Thanks for the CoreDNS proxy approach to sandbox DNS. The codebase has changed significantly since March 23 — including a full TypeScript migration — so this will need a rebase on main.
Summary
Fixes DNS resolution inside the sandbox by running a lightweight Python DNS forwarder in the sandbox pod's namespace. This makes dns.lookup() work for tools like web_search and web_fetch that do pre-flight DNS checks before using the HTTP proxy.

Related Issue
Fixes #626
Related: OpenShell #437, #504
Changes
- scripts/setup-dns-proxy.sh — Runs after sandbox creation. Adds the CoreDNS service IP (10.43.0.10) as a local address in the pod, then starts a Python DNS forwarder bound to it. Forwards to the actual CoreDNS pod IP (discovered dynamically) so k8s-internal names (*.svc.cluster.local) still resolve for the openshell-sandbox proxy.
- scripts/fix-coredns.sh — Extended to all platforms (was Colima-only). The comment already said "broken on all platforms" but the code only gated for Colima.
- scripts/lib/runtime.sh — Added 8.8.8.8 fallback to resolve_coredns_upstream() for hosts where all nameservers are loopback (e.g. systemd-resolved).
- bin/lib/platform.js — shouldPatchCoredns() returns true for all known container runtimes (was colima only).
- bin/lib/onboard.js / scripts/setup.sh — Call setup-dns-proxy.sh after sandbox reaches Ready state.
- .pre-commit-config.yaml — Added --with pytest to Pyright pre-push hook (fixes type check failures on test_endpoint_validation.py).

Type of Change
Testing
- npx prek run --all-files passes.
- npm test passes (311/311).
- nemoclaw onboard --non-interactive creates sandbox with working DNS.
- socket.getaddrinfo() resolves external names from inside the sandbox.
- web_search returns live Google Search results via Gemini grounding. Inference routing to inference.local works (the openshell-sandbox proxy can resolve the gateway via k8s DNS).

Design Notes
Two critical requirements discovered during testing:
1. Bind to 10.43.0.10 (not 0.0.0.0): glibc's resolver uses connected UDP sockets that discard responses from unexpected source IPs. Binding to 0.0.0.0 causes responses to come from the veth address (10.200.0.1), which glibc ignores. (See the demonstration sketch after this list.)
2. Forward to the CoreDNS pod IP (not 8.8.8.8): adding 10.43.0.10 as a local address intercepts ALL DNS in the pod namespace — including from the openshell-sandbox binary, which needs to resolve openshell-0.openshell.svc.cluster.local to reach the gateway. Public DNS can't resolve *.svc.cluster.local.
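A tiny way to see requirement 1 in isolation (illustrative only; the payload bytes are fake, so against a real resolver this simply times out):

```python
# Demonstration of requirement 1: a connect()ed UDP socket is filtered by
# peer address in the kernel. glibc's resolver does this, which is why a
# reply sourced from 10.200.0.1 is invisible if the query went to 10.43.0.10.
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.settimeout(2)
try:
    s.connect(("10.43.0.10", 53))       # kernel now only delivers from this peer
    s.send(b"\x12\x34" + b"\x00" * 10)  # placeholder bytes, not a real DNS query
    s.recv(4096)
except socket.timeout:
    print("no reply: datagrams from any other source IP are dropped")
except OSError as e:
    print(f"no route to the demo address from this host: {e}")
finally:
    s.close()
```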
Checklist

General
Code Changes
npx prek run --all-files auto-fixes formatting.

Summary by CodeRabbit
New Features
Improvements
Tests