Skip to content

Workaround kernel bugs s related to namespaces#223

Merged
mavenugo merged 1 commit intomoby:docker1.7.0_integfrom
mrjana:docker1.7.0_integ
May 27, 2015
Merged

Workaround kernel bugs s related to namespaces#223
mavenugo merged 1 commit intomoby:docker1.7.0_integfrom
mrjana:docker1.7.0_integ

Conversation

@mrjana
Copy link
Contributor

@mrjana mrjana commented May 27, 2015

This PR attempts to work around bugs present in kernel
version 3.18-4.0.1 relating to namespace creation
and destruction. This fix attempts to avoid certain
system calls to not get in the kernel bug path as well
as lazily garbage collecting the name paths when they are removed.

Signed-off-by: Jana Radhakrishnan mrjana@docker.com

This PR attempts to work around bugs present in kernel
version 3.18-4.0.1 relating to namespace creation
and destruction. This fix attempts to avoid certain
systemmcalls to not get in the kkernel bug path as well
as lazily garbage collecting the name paths when they are removed.

Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>
@mrjana mrjana force-pushed the docker1.7.0_integ branch from 781e45e to 05462c2 Compare May 27, 2015 19:08
@mavenugo
Copy link
Contributor

LGTM. Thanks for the quick fix.

@aboch
Copy link
Contributor

aboch commented May 27, 2015

LGTM

mavenugo added a commit that referenced this pull request May 27, 2015
Workaround kernel bugs s related to namespaces
@mavenugo mavenugo merged commit a4d595b into moby:docker1.7.0_integ May 27, 2015
@rade
Copy link

rade commented Sep 23, 2015

@mrjana would you mind explaining what specific kernel bugs this is addressing?

Leaving the removal of the ns path to the period gc thread, instead of doing so eagerly in networkNamespace.Destroy() has the undesirable effect of delaying removal of the ns (and thus any associated interfaces and their IPs and MACs) for up to 60 seconds.

@mrjana
Copy link
Contributor Author

mrjana commented Sep 23, 2015

@rade It was introduced to fix a kernel race bug in mount shared subtrees (/var/run is one typically mounted by many distros as one) which wasn't fixed until 3.19 IIRC. We have to workaround the issue because we can't control kernel versions.

But the namespace gets removed immediately. Only the namespace file gets garbage collected after 60 seconds. So there shouldn't not be any holding up network resources of any kind.

@mrjana
Copy link
Contributor Author

mrjana commented Sep 23, 2015

@rade The commit is torvalds/linux@8f502d5 so actually it wasn't fixed until 4.1. The issue in particular is the crash happening inside __detach_mounts while holding the mount lock while trying to propogate umounts. This fix removed umount propogation altogether. Note that __detach_mounts will also be called during unlink(i.e removing the namespace file) and that was race for us. We were racing to remove path files in /var/run/docker/netns while we try to create newer namespaces from another thread which made docker hit the kernel race all the time.

@dpw
Copy link

dpw commented Sep 24, 2015

Thanks @mrjana. But:

But the namespace gets removed immediately. Only the namespace file gets garbage collected after 60 seconds. So there shouldn't not be any holding up network resources of any kind.

Containers started with -v /var/run:/var/run can hold up network resources despite the namespace mount being removed immediately. See this example (I'm running this on docker 1.7.1, kernel 3.19.0, but I don't think later versions behave differently):

# docker run -itd --name=test1 ubuntu
016405d6636cdb2f3773079effbd0fdbcc94b7f61edc516a3becd4b3c009852c
# docker run -itd --name=test2 -v /var/run:/var/run ubuntu
f6deddb6b3f580de289125b81d427da9aa1cdae60f49ffac87d59ca482649f39
# docker rm -f test1
test1
# docker exec test2 grep docker/netns /proc/mounts
nsfs /run/docker/netns/016405d6636c nsfs rw 0 0
nsfs /run/docker/netns/f6deddb6b3f5 nsfs rw 0 0
<wait for 60 seconds>
# docker exec test2 grep docker/netns /proc/mounts
nsfs /run/docker/netns/f6deddb6b3f5 nsfs rw 0 0

As you can see, even though test1 was killed, test2 still has the namespace mount for it. Only when libnetwork later deletes the mount point file does that mount go away (due to detach_mounts being called from vfs_unlink). We'd like the namespace to be cleaned up without such a delay, which is why we have been looking into this.

Kernel versions prior to 3.18 behaved differently: The unlink of the namespace mount point file would fail with EBUSY, and the other container could hold onto the netns forever. So kernel 3.18 improves the situation. But it would be better without the delay.

Should I create an issue for this?

@rade
Copy link

rade commented Sep 29, 2015

@mrjana ping

@mrjana
Copy link
Contributor Author

mrjana commented Sep 30, 2015

@dpw @rade I don't think we can remove the delayed garbage collection of namespace paths because of bugs in kernel during __detach_mounts which causes kernel crashes in 3.18 kernel onwards. So controlled cleanup of namespace paths in necessary. Either you can bind mount /var/run as private and unmount all the namespaces inside the container or bind mount the exact files you need to bind mount one by one.

@rade
Copy link

rade commented Oct 1, 2015

@mrjana

Either you can bind mount /var/run as private and unmount all the namespaces inside the container or bind mount the exact files you need to bind mount one by one.

That is what we are going to do for our own containers. Unfortunately there are some quite popular containers out there which mount /var/run, for example cAdvisor and RancherOS

If this really isn't fixable, I suggest alerting developers to the problem with a note in the volume mount docs.

@mrjana
Copy link
Contributor Author

mrjana commented Oct 1, 2015

@rade

That is what we are going to do for our own containers. Unfortunately there are some quite popular containers out there which mount /var/run, for example cAdvisor and RancherOS

If this really isn't fixable, I suggest alerting developers to the problem with a note in the volume mount docs.

For the drivers who provide us with the interfaces to push into namespace using the network driver plugin protocol, we do move those interfaces out the namespace when the container exits and it is upto the driver to delete the interfaces (veths) when they get a Leave call. This is exactly what the docker native bridge driver does.

For weave case I see it is a problem because you are relying on automatic destruction of veths when the namespace is destroyed. But any other container which is using native docker networking this is not a problem because the consequence of this workaround and a container bind mounting /var/run is an empty namespace which lingers around for an extra minute or so. Since this is a weave specific problem and not a problem with docker native networking I don't think it is appropriate to document it in docker docs

And when in 1.9 release when the plugin framework becomes generally available weave can become a network plugin and libnetwork will automatically take care of moving the interfaces out of the namespace and will fix the problem for weave as well.

@rade
Copy link

rade commented Oct 4, 2015

@mrjana Would it be possible to notify the GC goroutine in addToGarbagePaths, in order to avoid the delay?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants