bug(cluster): host.openshell.internal resolves to unreachable IP on Docker Desktop + WSL2 #811

@gburachas

Agent Diagnostic

  • Loaded debug-inference and openshell-cli skills
  • Traced the inference routing path through the codebase: proxy.rs:351 intercepts inference.local → route_inference_request() → Router::proxy_with_candidates_streaming() → reqwest::Client → upstream at host.openshell.internal:11434
  • Confirmed the SSRF check at proxy.rs:480-601 does NOT apply to managed inference routes (the inference.local interception returns early at line 374)
  • Traced host gateway IP detection in cluster-entrypoint.sh:397-415: resolves host.docker.internal via getent ahostsv4, falls back to container default route
  • On Docker Desktop + WSL2, host.docker.internal resolves to either IPv6 ULA (fdc4:...) or 172.29.0.254 (Docker Desktop gateway) — both unreachable from inside k3s pods
  • On Docker Engine + WSL2, falls back to 172.17.0.1 (docker0 bridge) which has asymmetric routing and times out
  • The bad IP propagates: cluster-entrypoint.sh → __HOST_GATEWAY_IP__ in HelmChart → hostGatewayIP Helm value → StatefulSet hostAliases → gateway pod /etc/hosts → router reqwest::Client DNS resolution → connection failure
  • The router's reqwest::Client (openshell-router/src/lib.rs:39-41) has no IP filtering — once the resolved IP is reachable, inference works

Description

When running OpenShell on Docker Desktop + WSL2 (or Docker Engine on WSL2), host.openshell.internal resolves to an unreachable IP address. This breaks any feature that depends on reaching host services from inside the cluster, most notably local inference routing (e.g., Ollama at host.openshell.internal:11434).

The root cause is in deploy/docker/cluster-entrypoint.sh lines 397-415. The detection logic:

  1. Tries getent ahostsv4 host.docker.internal — on Docker Desktop/WSL2 this returns an IPv6 ULA or unreachable gateway IP
  2. Falls back to ip -4 route | awk '/default/ { print $3 }' — on Docker Engine/WSL2 this returns the docker0 bridge IP which has asymmetric routing

Neither produces a usable IPv4 address that pods can reach.
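
Reconstructed from the description above (not copied from cluster-entrypoint.sh), the two-step detection looks roughly like the sketch below. The function names are hypothetical and the real script may parse the command output differently; the point is only to show where an IPv6 or otherwise unusable address slips through unchecked.

```shell
# Step 1: first address that getent reports for host.docker.internal.
# Prints nothing when the name does not resolve.
detect_via_getent() {
    getent ahostsv4 host.docker.internal 2>/dev/null | awk '{ print $1; exit }'
}

# Step 2: gateway of the container's default IPv4 route.
detect_via_default_route() {
    ip -4 route 2>/dev/null | awk '/default/ { print $3; exit }'
}

# Combined logic: use step 1 if it printed anything, otherwise step 2.
# Note that nothing here checks the address family or reachability.
detect_host_gateway_ip() {
    ip_addr="$(detect_via_getent)"
    [ -n "$ip_addr" ] || ip_addr="$(detect_via_default_route)"
    printf '%s\n' "$ip_addr"
}
```

Running `detect_host_gateway_ip` inside the cluster container shows which value would end up in the gateway pod's /etc/hosts.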

Expected: host.openshell.internal should resolve to an IP where the host's services (e.g., Ollama on port 11434) are actually reachable.

Actual: The resolved IP is an IPv6 address, is unreachable, or has broken routing. The router's upstream connection fails with RouterError::UpstreamUnavailable.

Reproduction Steps

  1. Run on WSL2 with Docker Desktop (Windows 11)
  2. Start Ollama on the host: OLLAMA_HOST=0.0.0.0:11434 ollama serve
  3. Start the gateway: openshell gateway start
  4. Create a provider and set inference:
    openshell provider create --name ollama --type openai \
      --credential OPENAI_API_KEY=empty \
      --config OPENAI_BASE_URL=http://host.openshell.internal:11434/v1
    openshell inference set --no-verify --provider ollama --model qwen2.5-coder:3b
    
  5. Create a sandbox and test inference:
    openshell sandbox create -- bash
    # Inside sandbox:
    curl -sS https://inference.local/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model":"qwen2.5-coder:3b","messages":[{"role":"user","content":"say hello"}]}'
    
  6. Result: timeout or {"error":"upstream unavailable"} — the router cannot connect to host.openshell.internal:11434

Workaround: Use the WSL2 eth0 IP in OPENAI_BASE_URL instead of host.openshell.internal (the command below assumes the first address printed by hostname -I is the eth0 address):

openshell provider create --name ollama --type openai \
  --credential OPENAI_API_KEY=empty \
  --config OPENAI_BASE_URL=http://$(hostname -I | awk '{print $1}'):11434/v1
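
Before creating the provider, it may help to confirm that Ollama actually answers on the chosen address. The following is a minimal sketch, assuming curl is installed; `first_reachable` is a hypothetical helper, not part of OpenShell. It only shows that something answers HTTP on that address from wherever the command is run (for example the WSL2 shell); reachability from inside the gateway pod is a separate question.

```shell
# Print the first candidate IP that answers an HTTP request on the given
# port; return non-zero if none of them does.
first_reachable() {
    port="$1"; shift
    for ip_addr in "$@"; do
        # Any HTTP response counts as success; only connection errors
        # and timeouts make curl exit non-zero here.
        if curl -s -o /dev/null --connect-timeout 3 --max-time 5 \
            "http://${ip_addr}:${port}/"; then
            printf '%s\n' "$ip_addr"
            return 0
        fi
    done
    return 1
}

# Example (left unquoted on purpose so each address becomes one argument):
# first_reachable 11434 $(hostname -I)
```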

Environment

  • OS: Windows 11 + WSL2 (Ubuntu, kernel 6.6.87.2)
  • Docker: Docker Desktop 4.x with WSL2 backend
  • OpenShell: current main
  • Also affects: Docker Engine running inside WSL2

Logs

# Inside the cluster container:
$ getent ahostsv4 host.docker.internal
# Returns empty or IPv6 on Docker Desktop/WSL2

# Gateway pod logs show:
INFO  openshell_router: routing proxy inference request (streaming)
# Followed by upstream connection failure (no NET:OPEN, just timeout)

Proposed Fix

After the existing detection in cluster-entrypoint.sh:415, add:

  1. IPv6 rejection — if detected IP contains :, discard and try fallbacks
  2. WSL2 eth0 fallback — try ip -4 addr show eth0 for the WSL2 distro IP
  3. Environment variable override — accept OPENSHELL_HOST_GATEWAY_IP_OVERRIDE

This is additive and does not affect platforms where the existing detection works (macOS Docker Desktop, native Linux).
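
The three additions could be sketched as below. Only OPENSHELL_HOST_GATEWAY_IP_OVERRIDE comes from the proposal; the function names are hypothetical, and checking the override first is one possible ordering, not a settled design. The sketch also assumes, as the proposal does, that eth0 in the environment where the script runs carries an address that pods can reach.

```shell
# Succeed only for something that looks like a dotted-quad IPv4 address.
# Empty input and anything containing ':' (IPv6) is rejected.
# Octet ranges are not validated.
is_usable_ipv4() {
    case "$1" in
        ''|*:*) return 1 ;;
    esac
    printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

# Print the first IPv4 address assigned to eth0, or nothing.
eth0_ipv4() {
    ip -4 -o addr show eth0 2>/dev/null |
        awk '{ split($4, a, "/"); print a[1]; exit }'
}

# Choose a gateway IP. $1 is the value produced by the existing detection.
# Order: explicit override, then the detected value, then the eth0 fallback.
pick_host_gateway_ip() {
    detected="$1"
    for candidate in "${OPENSHELL_HOST_GATEWAY_IP_OVERRIDE:-}" \
                     "$detected" "$(eth0_ipv4)"; do
        if is_usable_ipv4 "$candidate"; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}
```

On platforms where the existing detection already yields a usable IPv4 address and no override is set, `pick_host_gateway_ip` returns that address unchanged, which matches the additive intent.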

Related: #681 (WSL2 proxy issues, different root cause), #642 (sandbox networking on WSL2)

Agent-First Checklist

  • I pointed my agent at the repo and had it investigate this issue
  • I loaded relevant skills (e.g., debug-openshell-cluster, debug-inference, openshell-cli)
  • My agent could not resolve this — the diagnostic above explains why
