Skip to content

Conversation

@fcrisciani
Copy link

The netlink socket that was used to monitor the L2
miss was never being closed. The watchMiss goroutine
spawned was never returning. This was causing goroutine
leak in case of createNetwork/destroyNetwork

Signed-off-by: Flavio Crisciani flavio.crisciani@docker.com

@fcrisciani
Copy link
Author

Stacktrace related in 17.03:

github.com/docker/docker/vendor/github.com/vishvananda/netlink/nl.(*NetlinkSocket).Receive(0xc42570c7b0, 0xc422e72d80, 0x1, 0x1, 0x0, 0x0)
	/root/rpmbuild/BUILD/docker-ee/.gopath/src/github.com/docker/docker/vendor/github.com/vishvananda/netlink/nl/nl_linux.go:596 +0x10b
github.com/docker/docker/vendor/github.com/docker/libnetwork/drivers/overlay.(*network).watchMiss(0xc421d2a500, 0xc42570c7b0)
	/root/rpmbuild/BUILD/docker-ee/.gopath/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/drivers/overlay/ov_network.go:631 +0x61
created by github.com/docker/docker/vendor/github.com/docker/libnetwork/drivers/overlay.(*network).initSandbox
	/root/rpmbuild/BUILD/docker-ee/.gopath/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/drivers/overlay/ov_network.go:620 +0x564

for {
msgs, err := nlSock.Receive()
if err != nil {
if nlSock.GetFd() == -1 {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs locking to avoid a data race involving nlSock.fd.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct thanks


// Close the netlink socket, this will also release the watchMiss goroutine that is using it
if n.nlSocket != nil {
n.nlSocket.Close()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs locking to avoid a data race involving nlSock.fd. Close could race with Receive or GetFd.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

network lock is already taken before calling the destroySandbox, this looks like is by design, see the comment on top of the function: https://github.com/docker/libnetwork/blob/master/drivers/overlay/ov_network.go#L267

@fcrisciani
Copy link
Author

@aaronlehmann can you give another round of review? as this LG for you I will update the backport

@aaronlehmann
Copy link

n.nlSocket.Close still races with n.nlSocket.Receive. The fd field inside n.nlSocket is not protected by any kind of lock, but is written to and read in different goroutines.

@fcrisciani
Copy link
Author

goroutine A is blocked on the Receive on the netlink socket
goroutine B will acquire the network lock, call the Close (setting the fd to -1) and wake up the Receive of the goroutine A
goroutine A will receive an error, and will enter in the if, will try to acquire the network lock and will wait on it
goroutine B will finish the operation and release the network lock
goroutine A will acquire the lock and fetch the fd now equal to -1

Do you have in mind some other scenario? In this case the lock is the network one that guarantees the safety of the netlink socket.
The atomicity of the GetFd and Close is ensured by the network lock

@aaronlehmann
Copy link

Do you have in mind some other scenario?

goroutine A calls Receive, which checks if s.fd < 0. At this time, s.fd >= 0.
goroutine B calls Close
goroutine A receives from s.fd, even though it is no longer valid. Or its read of s.fd is corrupted because the underlying value changes during the read (I don't think integer reads are guaranteed to be atomic).

This is really a bug in github.com/vishvananda/netlink/nl, though. I don't think we can work around it without patching that code.

@fcrisciani
Copy link
Author

In this case it is still fine, because you want to avoid the race between the Close and the GetFD (Receive a part from the initial check on the fd will return what the syscall returns so no issue there) and that is done using the network mutex. In general the linux socket are not thread safe, but sure a more clean way would be to protect the fd with the mutex that is inside the netlink library

@aaronlehmann
Copy link

I opened vishvananda/netlink#234 upstream

@fcrisciani
Copy link
Author

will wait for it to be merged and update this commit then

@aaronlehmann aaronlehmann mentioned this pull request Jun 6, 2017
23 tasks
Flavio Crisciani added 2 commits June 6, 2017 09:24
The netlink socket that was used to monitor the L2
miss was never being closed. The watchMiss goroutine
spawned was never returning. This was causing goroutine
leak in case of createNetwork/destroyNetwork

Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>
Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>
@mavenugo
Copy link
Contributor

mavenugo commented Jun 6, 2017

Thanks @fcrisciani and @aaronlehmann

LGTM

@ityangchen
Copy link
Contributor

when repeated operate docker swarm init and docker swarm leave -f , it seems leak of watchMiss goroutine remain the same.

@fcrisciani
Copy link
Author

@ityangchen this is the follow up PR
#1976
There is a kernel bug that does not release the receive on fd close

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants