
Traefik routes intermittently time out in Docker Swarm due to service DNS resolving to stale VIP instead of task IPs #3480

@LastSkywalkerER

Description


Traefik + Docker Swarm DNS Resolution Issue

To Reproduce

  1. Install Dokploy on a single-node Docker Swarm setup.

  2. Deploy an application as a Docker Swarm service (e.g. frontend app listening on port 3000).

  3. Expose the application through Traefik using the service name as backend (default Dokploy behavior):

    http://<service-name>:3000
    
  4. Access the application via a domain routed through Traefik.

  5. Observe intermittent timeouts / 404 errors.

  6. Inside the Traefik container, resolve the service name:

    getent ahostsv4 <service-name>
  7. Notice that Docker DNS returns multiple IPs, including a stale VIP.

  8. Traefik randomly selects one of the returned IPs and may hit the non-routable VIP, causing timeouts.


Current vs. Expected Behavior

Expected Behavior

Traefik should consistently route traffic to a healthy backend container when using a Docker Swarm service as an upstream.

Current Behavior

Traefik intermittently times out because Docker DNS resolves the service name to:

  • a valid task IP (working)
  • a stale service VIP (not routable in this setup)

Traefik may select the VIP, resulting in connection timeouts.


Environment Information

Operating System:
  Debian GNU/Linux 13 (trixie)

Kernel:
  6.17.2-2-pve

Architecture:
  x86_64 / amd64

Docker Engine:
  Docker Engine – Community 28.5.0
  API version: 1.51
  Containerd: v2.2.1
  runc: 1.3.4

Docker mode:
  Docker Swarm active
  Single-node cluster
  Managers: 1
  Nodes: 1
  Node address: 10.202.20.128

Dokploy:
  Image: dokploy/dokploy:latest
  Running as Docker Swarm service
  (single replica on the same node)

Traefik:
  Image: traefik:v3.6.1
  Version: 3.6.1
  Codename: ramequin
  OS/Arch: linux/amd64
  Deployed and managed by Dokploy

Deployment type:
  Applications are deployed on the same server where Dokploy is installed

Application type:
  Frontend application built with Nixpacks
  Served by Caddy web server
  Internal listening port: 3000
  Deployed as a Docker Swarm service

Affected Area(s)

  • Traefik
  • Docker

Deployment Location

Applications are deployed on the same server where Dokploy is installed.


Technical Investigation (Commands & Findings)

  1. Application is healthy inside the container:

    docker exec -it <task-container> curl http://127.0.0.1:3000
    # HTTP/1.1 200 OK
  2. Application is reachable via task IP:

    docker exec -it <task-container> curl http://<task-ip>:3000
    # HTTP/1.1 200 OK
  3. Traefik can reach the task IP directly:

    docker exec -it dokploy-traefik wget http://<task-ip>:3000
    # HTTP/1.1 200 OK
  4. Service name resolves to multiple IPs:

    docker exec -it dokploy-traefik getent ahostsv4 <service-name>

    Example output:

    10.0.1.43   # service VIP (stale / not routable)
    10.0.1.62   # active task IP
    
  5. The VIP does NOT belong to any container:

    docker network inspect dokploy-network | grep 10.0.1.43
    # no output
  6. Switching the service to DNSRR does not remove the VIP from DNS:

    docker service update --endpoint-mode dnsrr <service-name>

    DNS still returns both IPs:

    10.0.1.43
    10.0.1.62
    
  7. Using tasks.<service-name> resolves only real task IPs:

    docker exec -it dokploy-traefik nslookup tasks.<service-name> 127.0.0.11

    Output:

    Address: 10.0.1.62
    
  8. Traefik successfully connects using tasks DNS:

    docker exec -it dokploy-traefik wget http://tasks.<service-name>:3000
    # HTTP/1.1 200 OK
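The comparison in findings 4–5 can be folded into one check that flags any resolved IP with no backing container. A minimal sketch (the `find_stale_ips` helper and the inline IP lists are illustrative; in a live setup the two inputs would come from `getent ahostsv4 <service-name>` and `docker network inspect`):

```shell
#!/bin/sh
# Sketch: print every DNS-resolved IP that no container on the network owns.
# $1: newline-separated IPs returned by DNS for the service name
# $2: newline-separated IPs actually assigned to containers on the network
find_stale_ips() {
  resolved="$1"
  actual="$2"
  printf '%s\n' "$resolved" | while read -r ip; do
    # Report the IP unless it exactly matches a known container IP.
    printf '%s\n' "$actual" | grep -qx "$ip" || echo "$ip"
  done
}

# Example with the IPs from finding 4 above:
find_stale_ips "10.0.1.43
10.0.1.62" "10.0.1.62"
# prints 10.0.1.43 (the stale VIP)
```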

Root Cause

Docker Swarm DNS resolves a service name to both:

  • the service VIP
  • the task IPs

In single-node Swarm setups (especially with published ports), the VIP may be non-functional.
Traefik may randomly select this VIP, leading to intermittent routing failures.
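For context, a Swarm service's endpoint mode is normally chosen at creation time; as finding 6 above shows, switching an existing service to DNSRR left the stale record in place. A command sketch with hypothetical service, network, and image names (note that DNSRR services cannot publish ports through the routing mesh):

```shell
# Hypothetical names: create the service with DNSRR endpoint mode so its
# name resolves directly to task IPs and no VIP is allocated.
docker service create \
  --name frontend \
  --network dokploy-network \
  --endpoint-mode dnsrr \
  myapp:latest
```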


Workaround / Solution

Configure Traefik backends to use task-specific DNS instead of the service name:

http://tasks.<service-name>:<port>

This ensures that Traefik routes traffic only to active task containers and bypasses the stale VIP entirely.
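With Traefik's file provider, this amounts to pointing the load balancer at the `tasks.` hostname. A hedged sketch (the router name and host rule are placeholders; the port and `tasks.` prefix come from the findings above):

```yaml
# Hypothetical Traefik dynamic configuration fragment (file provider).
http:
  routers:
    frontend:
      rule: "Host(`app.example.com`)"   # placeholder domain
      service: frontend
  services:
    frontend:
      loadBalancer:
        servers:
          # tasks.<service-name> resolves only to live task IPs, never the VIP
          - url: "http://tasks.<service-name>:3000"
```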


Affected Components

  • Traefik
  • Docker Swarm service discovery
  • Dokploy auto-generated Traefik configuration

Additional Context

This issue is reproducible on a clean single-node Swarm installation and disappears immediately when switching Traefik upstreams from <service-name> to tasks.<service-name>.


Will You Send a PR to Fix It?

No
