Cannot connect to a container in an overlay network from a different swarm node: `could not resolve peer \"<nil>\": timed out resolving peer by querying the cluster` #962

## Description of problem:

Very rarely (observed twice across 1000s of containers), when we start a new container
in an overlay network in a docker swarm, existing containers in that overlay network that
are on different nodes cannot connect to the new container. However, containers in the
overlay network on the same node as the new container are able to connect.

The new container receives an IP address in the overlay network subnet, but this address
does not seem to resolve correctly from a different node.

The second time this happened we fixed the problem by stopping and starting the new
container.

We haven't found a way to reliably reproduce this problem. Is there any other debugging
information I can provide that would help diagnose this issue?

The error message is the same as the one reported on https://github.com/docker/libnetwork/issues/617.

`docker version`:

```
Client:
 Version:      1.10.0
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   590d5108
 Built:        Thu Feb  4 19:04:33 2016
 OS/Arch:      linux/amd64

Server:
 Version:      swarm/1.1.0
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   a0fd82b
 Built:        Thu Feb  4 08:55:18 UTC 2016
 OS/Arch:      linux/amd64
```

`docker info`:

```
Containers: 102
 Running: 53
 Paused: 0
 Stopped: 49
Images: 372
Role: primary
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Nodes: 3
 glera.int.corefiling.com: 10.0.0.57:2375
  └ Status: Healthy
  └ Containers: 32
  └ Reserved CPUs: 0 / 4
  └ Reserved Memory: 0 B / 32.94 GiB
  └ Labels: executiondriver=native-0.2, kernelversion=4.2.6-201.fc22.x86_64, operatingsystem=Fedora 22 (Twenty Two), storagedriver=devicemapper
  └ Error: (none)
  └ UpdatedAt: 2016-02-22T11:20:16Z
 kafue.int.corefiling.com: 10.0.0.17:2375
  └ Status: Healthy
  └ Containers: 36
  └ Reserved CPUs: 0 / 4
  └ Reserved Memory: 0 B / 16.4 GiB
  └ Labels: executiondriver=native-0.2, kernelversion=4.2.6-201.fc22.x86_64, operatingsystem=Fedora 22 (Twenty Two), storagedriver=devicemapper
  └ Error: (none)
  └ UpdatedAt: 2016-02-22T11:20:20Z
 paar.int.corefiling.com: 10.0.1.1:2375
  └ Status: Healthy
  └ Containers: 34
  └ Reserved CPUs: 0 / 4
  └ Reserved Memory: 0 B / 16.44 GiB
  └ Labels: executiondriver=native-0.2, kernelversion=4.2.6-201.fc22.x86_64, operatingsystem=Fedora 22 (Twenty Two), storagedriver=devicemapper
  └ Error: (none)
  └ UpdatedAt: 2016-02-22T11:20:31Z
Plugins:
 Volume:
 Network:
Kernel Version: 4.2.6-201.fc22.x86_64
Operating System: linux
Architecture: amd64
CPUs: 12
Total Memory: 65.77 GiB
Name: 9dd94ffb6aea
```

`uname -a`:

```
Linux glera.int.corefiling.com 4.2.6-201.fc22.x86_64 #1 SMP Tue Nov 24 18:42:39 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
```

## Environment details (AWS, VirtualBox, physical, etc.):

Physical - docker swarm cluster.

## How reproducible:

Rare - happened 2 times after creating/starting 1000s of containers.

## Steps to Reproduce:

1. Create/start a container in an overlay network.
2. In the same overlay network, create/start a container on a different host in the swarm.
   A process in the container is listening on port 80 and this port is exposed to the overlay network.
3. Try to connect to the container of step 2 from within the container of step 1 with an HTTP client.

## Actual Results:

Get a connection timeout. For example with the golang http client:

```
 http: proxy error: dial tcp 10.158.0.60:80: i/o timeout
```

10.158.0.60 is the address of the container in step 2 in the overlay network subnet.

The docker logs on the swarm node that launched the container in step 2 contain (from `journalctl -u docker`):

```
level=error msg="could not resolve peer \"<nil>\": timed out resolving peer by querying the cluster".
```

We see a line like this for each failed request between the containers.

When we make the same request from a container in the overlay network on the same swarm node as the
container running the HTTP server, the expected connection is established and a response is received.
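To compare the cross-node and same-node cases, a small TCP probe could be run from a container on each node. This is only a minimal sketch and not something we ran at the time: the function name `can_connect` and the 3-second default timeout are assumptions made for illustration. The address 10.158.0.60 and port 80 are the ones from the error above.

```python
import socket
from typing import Tuple


def can_connect(host: str, port: int, timeout: float = 3.0) -> Tuple[bool, str]:
    """Attempt a TCP connection and return (ok, detail).

    Hypothetical helper: a timeout here corresponds to the "i/o timeout"
    seen by the HTTP client when connecting from a different swarm node.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.timeout:
        return False, "timeout"
    except OSError as exc:
        # Covers refused connections, unreachable networks, DNS failures, etc.
        return False, "error: {}".format(exc)


# Example call, run inside a container attached to the overlay network:
#   ok, detail = can_connect("10.158.0.60", 80)
```

Running the same call from a container on the same node and from a container on a different node would show whether only the cross-node path is affected.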

## Expected Results:

The HTTP client receives a response from the container it's trying to connect to.

## Additional info:

The second time this occurred we fixed the problem by stopping and starting the container running
the http server.

We are using Consul as the KV store of the overlay network and swarm.

When removing the container that cannot be connected to, docker logs (`journalctl -u docker`) contain the line:

```
error msg="Peer delete failed in the driver: could not delete fdb entry into the sandbox: could not delete neighbor entry: no such file or directory\n"
```

The docker log lines are emitted by https://github.com/docker/libnetwork/blob/master/drivers/overlay/ov_serf.go#L180. I can't find an existing issue tracking this.
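Since we see one log line per failed request, counting the two messages quoted in this report on each node may help narrow down which daemon is affected. The following is a rough sketch under assumptions: the helper name `count_overlay_errors` and the category labels are made up for illustration, and the patterns match only the message text shown above.

```python
import re
from collections import Counter
from typing import Iterable

# Patterns for the two log messages quoted in this report.
PATTERNS = {
    "resolve_peer_timeout": re.compile(
        r"could not resolve peer .*timed out resolving peer by querying the cluster"
    ),
    "peer_delete_failed": re.compile(r"Peer delete failed in the driver"),
}


def count_overlay_errors(lines: Iterable[str]) -> Counter:
    """Count how many log lines match each known overlay error message."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Example: save the daemon log with `journalctl -u docker > docker.log`, then
#   with open("docker.log") as fh:
#       print(count_overlay_errors(fh))
```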

