Description of problem:
Very rarely (observed twice after starting thousands of containers), we start a new container
in an overlay network in a docker swarm and existing containers in that overlay network that
are on different nodes cannot connect to it. However, containers in the
overlay network on the same node as the new container are able to connect.
The new container receives an IP address in the overlay network subnet, but this address does not
seem to be resolved correctly from a different node.
The second time this happened we fixed the problem by stopping and starting the new
container.
We haven't found a way to reliably reproduce this problem. Is there any other debugging
information we can provide that would help diagnose this issue?
The error message is the same as the one reported on #617.
docker version:
Client:
Version: 1.10.0
API version: 1.22
Go version: go1.5.3
Git commit: 590d5108
Built: Thu Feb 4 19:04:33 2016
OS/Arch: linux/amd64
Server:
Version: swarm/1.1.0
API version: 1.22
Go version: go1.5.3
Git commit: a0fd82b
Built: Thu Feb 4 08:55:18 UTC 2016
OS/Arch: linux/amd64
docker info:
Containers: 102
Running: 53
Paused: 0
Stopped: 49
Images: 372
Role: primary
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Nodes: 3
glera.int.corefiling.com: 10.0.0.57:2375
└ Status: Healthy
└ Containers: 32
└ Reserved CPUs: 0 / 4
└ Reserved Memory: 0 B / 32.94 GiB
└ Labels: executiondriver=native-0.2, kernelversion=4.2.6-201.fc22.x86_64, operatingsystem=Fedora 22 (Twenty Two), storagedriver=devicemapper
└ Error: (none)
└ UpdatedAt: 2016-02-22T11:20:16Z
kafue.int.corefiling.com: 10.0.0.17:2375
└ Status: Healthy
└ Containers: 36
└ Reserved CPUs: 0 / 4
└ Reserved Memory: 0 B / 16.4 GiB
└ Labels: executiondriver=native-0.2, kernelversion=4.2.6-201.fc22.x86_64, operatingsystem=Fedora 22 (Twenty Two), storagedriver=devicemapper
└ Error: (none)
└ UpdatedAt: 2016-02-22T11:20:20Z
paar.int.corefiling.com: 10.0.1.1:2375
└ Status: Healthy
└ Containers: 34
└ Reserved CPUs: 0 / 4
└ Reserved Memory: 0 B / 16.44 GiB
└ Labels: executiondriver=native-0.2, kernelversion=4.2.6-201.fc22.x86_64, operatingsystem=Fedora 22 (Twenty Two), storagedriver=devicemapper
└ Error: (none)
└ UpdatedAt: 2016-02-22T11:20:31Z
Plugins:
Volume:
Network:
Kernel Version: 4.2.6-201.fc22.x86_64
Operating System: linux
Architecture: amd64
CPUs: 12
Total Memory: 65.77 GiB
Name: 9dd94ffb6aea
uname -a:
Linux glera.int.corefiling.com 4.2.6-201.fc22.x86_64 #1 SMP Tue Nov 24 18:42:39 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Environment details (AWS, VirtualBox, physical, etc.):
Physical - docker swarm cluster.
How reproducible:
Rare - happened 2 times after creating/starting 1000s of containers.
Steps to Reproduce:
1. Create/start a container in an overlay network.
2. In the same overlay network, create/start a container on a different host in the swarm.
   A process in the container is listening on port 80 and this port is exposed to the overlay network.
3. Try to connect to the container of step 2 from within the container of step 1 with an HTTP client.
Actual Results:
Get a connection timeout. For example with the golang http client:
http: proxy error: dial tcp 10.158.0.60:80: i/o timeout
10.158.0.60 is the address of the container in step 2 in the overlay network subnet.
The docker logs on the swarm node that launched the container in step 2 contain (from journalctl -u docker):
level=error msg="could not resolve peer \"<nil>\": timed out resolving peer by querying the cluster".
We see a line like this for each failed request between the containers.
When we make the same request from a container in the overlay network on the same swarm node as the
container running the http server, the connection is established as expected and a response is received.
Expected Results:
The http client receives a response from the container it is trying to connect to.
Additional info:
The second time this occurred we fixed the problem by stopping and starting the container running
the http server.
We are using Consul as the KV store of the overlay network and swarm.
When we remove the container that cannot be connected to, the docker logs (journalctl -u docker) contain the line:
error msg="Peer delete failed in the driver: could not delete fdb entry into the sandbox: could not delete neighbor entry: no such file or directory\n"
The docker log lines are emitted by https://github.com/docker/libnetwork/blob/master/drivers/overlay/ov_serf.go#L180. I can't find an existing issue tracking this.