Workaround kernel bugs s related to namespaces by mrjana · Pull Request #223 · moby/libnetwork

mrjana · 2015-05-27T18:47:06Z

This PR attempts to work around bugs present in kernel
version 3.18-4.0.1 relating to namespace creation
and destruction. This fix attempts to avoid certain
system calls to not get in the kernel bug path as well
as lazily garbage collecting the name paths when they are removed.

Signed-off-by: Jana Radhakrishnan mrjana@docker.com

This PR attempts to work around bugs present in kernel version 3.18-4.0.1 relating to namespace creation and destruction. This fix attempts to avoid certain systemmcalls to not get in the kkernel bug path as well as lazily garbage collecting the name paths when they are removed. Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>

mavenugo · 2015-05-27T19:22:38Z

LGTM. Thanks for the quick fix.

aboch · 2015-05-27T19:23:10Z

LGTM

Workaround kernel bugs s related to namespaces

rade · 2015-09-23T14:57:50Z

@mrjana would you mind explaining what specific kernel bugs this is addressing?

Leaving the removal of the ns path to the period gc thread, instead of doing so eagerly in networkNamespace.Destroy() has the undesirable effect of delaying removal of the ns (and thus any associated interfaces and their IPs and MACs) for up to 60 seconds.

mrjana · 2015-09-23T17:11:05Z

@rade It was introduced to fix a kernel race bug in mount shared subtrees (/var/run is one typically mounted by many distros as one) which wasn't fixed until 3.19 IIRC. We have to workaround the issue because we can't control kernel versions.

But the namespace gets removed immediately. Only the namespace file gets garbage collected after 60 seconds. So there shouldn't not be any holding up network resources of any kind.

mrjana · 2015-09-23T17:50:21Z

@rade The commit is torvalds/linux@8f502d5 so actually it wasn't fixed until 4.1. The issue in particular is the crash happening inside __detach_mounts while holding the mount lock while trying to propogate umounts. This fix removed umount propogation altogether. Note that __detach_mounts will also be called during unlink(i.e removing the namespace file) and that was race for us. We were racing to remove path files in /var/run/docker/netns while we try to create newer namespaces from another thread which made docker hit the kernel race all the time.

dpw · 2015-09-24T11:50:51Z

Thanks @mrjana. But:

But the namespace gets removed immediately. Only the namespace file gets garbage collected after 60 seconds. So there shouldn't not be any holding up network resources of any kind.

Containers started with -v /var/run:/var/run can hold up network resources despite the namespace mount being removed immediately. See this example (I'm running this on docker 1.7.1, kernel 3.19.0, but I don't think later versions behave differently):

# docker run -itd --name=test1 ubuntu
016405d6636cdb2f3773079effbd0fdbcc94b7f61edc516a3becd4b3c009852c
# docker run -itd --name=test2 -v /var/run:/var/run ubuntu
f6deddb6b3f580de289125b81d427da9aa1cdae60f49ffac87d59ca482649f39
# docker rm -f test1
test1
# docker exec test2 grep docker/netns /proc/mounts
nsfs /run/docker/netns/016405d6636c nsfs rw 0 0
nsfs /run/docker/netns/f6deddb6b3f5 nsfs rw 0 0
<wait for 60 seconds>
# docker exec test2 grep docker/netns /proc/mounts
nsfs /run/docker/netns/f6deddb6b3f5 nsfs rw 0 0

As you can see, even though test1 was killed, test2 still has the namespace mount for it. Only when libnetwork later deletes the mount point file does that mount go away (due to detach_mounts being called from vfs_unlink). We'd like the namespace to be cleaned up without such a delay, which is why we have been looking into this.

Kernel versions prior to 3.18 behaved differently: The unlink of the namespace mount point file would fail with EBUSY, and the other container could hold onto the netns forever. So kernel 3.18 improves the situation. But it would be better without the delay.

Should I create an issue for this?

rade · 2015-09-29T17:30:07Z

@mrjana ping

mrjana · 2015-09-30T17:47:23Z

@dpw @rade I don't think we can remove the delayed garbage collection of namespace paths because of bugs in kernel during __detach_mounts which causes kernel crashes in 3.18 kernel onwards. So controlled cleanup of namespace paths in necessary. Either you can bind mount /var/run as private and unmount all the namespaces inside the container or bind mount the exact files you need to bind mount one by one.

rade · 2015-10-01T07:59:51Z

@mrjana

Either you can bind mount /var/run as private and unmount all the namespaces inside the container or bind mount the exact files you need to bind mount one by one.

That is what we are going to do for our own containers. Unfortunately there are some quite popular containers out there which mount /var/run, for example cAdvisor and RancherOS

If this really isn't fixable, I suggest alerting developers to the problem with a note in the volume mount docs.

mrjana · 2015-10-01T14:41:05Z

@rade

That is what we are going to do for our own containers. Unfortunately there are some quite popular containers out there which mount /var/run, for example cAdvisor and RancherOS

If this really isn't fixable, I suggest alerting developers to the problem with a note in the volume mount docs.

For the drivers who provide us with the interfaces to push into namespace using the network driver plugin protocol, we do move those interfaces out the namespace when the container exits and it is upto the driver to delete the interfaces (veths) when they get a Leave call. This is exactly what the docker native bridge driver does.

For weave case I see it is a problem because you are relying on automatic destruction of veths when the namespace is destroyed. But any other container which is using native docker networking this is not a problem because the consequence of this workaround and a container bind mounting /var/run is an empty namespace which lingers around for an extra minute or so. Since this is a weave specific problem and not a problem with docker native networking I don't think it is appropriate to document it in docker docs

And when in 1.9 release when the plugin framework becomes generally available weave can become a network plugin and libnetwork will automatically take care of moving the interfaces out of the namespace and will fix the problem for weave as well.

rade · 2015-10-04T16:13:38Z

@mrjana Would it be possible to notify the GC goroutine in addToGarbagePaths, in order to avoid the delay?

mrjana force-pushed the docker1.7.0_integ branch from 781e45e to 05462c2 Compare May 27, 2015 19:08

mavenugo added a commit that referenced this pull request May 27, 2015

Merge pull request #223 from mrjana/docker1.7.0_integ

a4d595b

Workaround kernel bugs s related to namespaces

mavenugo merged commit a4d595b into moby:docker1.7.0_integ May 27, 2015

rade mentioned this pull request Sep 23, 2015

Proxy volume mount of /var/run prevents netns cleanup weaveworks/weave#1455

Closed

bboreham mentioned this pull request Dec 16, 2016

weave attach --rewrite-hosts sporadically fails weaveworks/weave#2559

Closed

SpComb mentioned this pull request Mar 20, 2017

Docker netns mounts are leaked to kontena-cadvisor kontena/kontena#2004

Closed

bboreham mentioned this pull request Jun 5, 2017

Cannot start container: exit status 4 moby/moby#12547

Closed

akerouanton mentioned this pull request Sep 20, 2023

libnetwork/osl: Namespace: assorted cleanups moby/moby#46315

Merged

thaJeztah mentioned this pull request Dec 16, 2024

libnet/osl: drop netns path GC moby/moby#49099

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Workaround kernel bugs s related to namespaces#223

Workaround kernel bugs s related to namespaces#223
mavenugo merged 1 commit intomoby:docker1.7.0_integfrom
mrjana:docker1.7.0_integ

mrjana commented May 27, 2015

Uh oh!

mavenugo commented May 27, 2015

Uh oh!

aboch commented May 27, 2015

Uh oh!

rade commented Sep 23, 2015

Uh oh!

mrjana commented Sep 23, 2015

Uh oh!

mrjana commented Sep 23, 2015

Uh oh!

dpw commented Sep 24, 2015

Uh oh!

rade commented Sep 29, 2015

Uh oh!

mrjana commented Sep 30, 2015

Uh oh!

rade commented Oct 1, 2015

Uh oh!

mrjana commented Oct 1, 2015

Uh oh!

rade commented Oct 4, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

mrjana commented May 27, 2015

Uh oh!

mavenugo commented May 27, 2015

Uh oh!

aboch commented May 27, 2015

Uh oh!

rade commented Sep 23, 2015

Uh oh!

mrjana commented Sep 23, 2015

Uh oh!

mrjana commented Sep 23, 2015

Uh oh!

dpw commented Sep 24, 2015

Uh oh!

rade commented Sep 29, 2015

Uh oh!

mrjana commented Sep 30, 2015

Uh oh!

rade commented Oct 1, 2015

Uh oh!

mrjana commented Oct 1, 2015

Uh oh!

rade commented Oct 4, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants