
Cgroup2 [WIP] #1708

Closed
sargun wants to merge 1 commit into opencontainers:master from sargun:cgroup2

Conversation


@sargun sargun commented Jan 30, 2018

This is my first cut of cgroup2. It's very awkward to mix cgroup2 and cgroup1 with the spaghetti code that currently exists. I'd suggest that we have a mechanism to switch wholesale from cgroupv1 to v2, rather than trying to maintain a hybrid mode. If people are okay with that, I can begin work on a parallel cgroupv2 manager.

In addition to this, I'm unsure of what the point of the systemd integration is? Can someone clue me in on that?

In cgroupv2, it looks like it won't be needed, because you have proper namespacing and delegation, but again, I have no idea how this code is actually designed to fit together.

Can people please comment, so I can get a general direction to take this?
CC: @crosbymichael @hqhq

@cyphar
Member

cyphar commented Jan 31, 2018

Thanks for giving this a shot. However...

I'd suggest that we have a mechanism to switch wholesale from cgroupv1 to v2, rather than trying to maintain a hybrid mode.

cgroupv2 doesn't provide all the controllers we need (devices for security and freezer for container pausing), so in order for runc to work like it does today we would still need some sort of hybrid mode. However, maybe it would be nicer if we had separate packages for each mode? There are quite a few differences between them, so it might make less sense to use the same manager struct for both. I'm not sure though.

I believe that only LXC currently has any form of support for this, and from what Christian has told me, it's pretty awful to make all the edge cases work. In particular, the amount of work needed to create a new container from a leaf node that has other (non-container) processes is quite problematic -- not least of all because it will confuse systemd. (Also, to be honest, I haven't managed to boot a cgroupv2-only machine in the past year without things breaking.)

In addition to this, I'm unsure of what the point of the systemd integration is? Can someone clue me in on that? In cgroupv2, it looks like it won't be needed, because you have proper namespacing and delegation, but again, I have no idea how this code is actually designed to fit together.

The systemd code is a bit of a sore point. The core reason for it existing is that systemd has a history of messing with the cgroups of containers. For a period of time, telling systemd about your "container" through a TransientUnit would be enough to convince it to not touch your processes. Then they added Delegate which actually codified this. I believe that RedHat relies on this feature (at least, that's where most of the bug reports come from) and they also register containers with machinectl.

However, the systemd code is quite far removed from what it should be doing semantically (which is just alerting systemd to the existence of the container and then manually setting everything anyway). So yeah, it's pretty ugly.

Can people please comment, so I can get a general direction to take this?

I'd recommend writing down what the plan is for handling the edge-cases (especially wrt systemd meddling with cgroupv2 -- as now named hierarchies no longer exist so the policy hierarchy is identical to the service hierarchy). Have you looked at what LXC does?

@sargun
Author

sargun commented Jan 31, 2018

So, a couple things:

cgroupv2 doesn't provide all the controllers we need (devices for security and freezer for container pausing), so in order for runc to work like it does today we would still need some sort of hybrid mode. However, maybe it would be nicer if we had separate packages for each mode? There are quite a few differences between them, so it might make less sense to use the same manager struct for both. I'm not sure though.

They added device filtering support in 4.14. It's a bit different than the way the devices cgroup worked before. Instead, you install a BPF filter on the cgroup which checks the device, and rejects / accepts access to it.
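As a rough illustration of what such a filter decides (a hypothetical sketch in plain Go rather than BPF bytecode; the rule struct and helper below are made up, not runc code): the kernel invokes the attached program with the device's type, major/minor numbers and the requested access, and the program's return value accepts or rejects it.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical sketch of the allow/deny decision a cgroup2 device filter
// makes, written in plain Go instead of BPF for readability.
type devRule struct {
	devType      rune   // 'c' = char device, 'b' = block device
	major, minor int64  // -1 acts as a wildcard
	access       string // subset of "rwm": read, write, mknod
}

// allowed reports whether any rule permits the requested access.
func allowed(rules []devRule, devType rune, major, minor int64, access string) bool {
	for _, r := range rules {
		if r.devType != devType {
			continue
		}
		if r.major != -1 && r.major != major {
			continue
		}
		if r.minor != -1 && r.minor != minor {
			continue
		}
		ok := true
		for _, a := range access {
			if !strings.ContainsRune(r.access, a) {
				ok = false
				break
			}
		}
		if ok {
			return true
		}
	}
	return false
}

func main() {
	// Allow read/write on /dev/null (char 1:3) only.
	rules := []devRule{{devType: 'c', major: 1, minor: 3, access: "rw"}}
	fmt.Println(allowed(rules, 'c', 1, 3, "r")) // true
	fmt.Println(allowed(rules, 'c', 1, 5, "r")) // false
}
```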

Do we need container pausing, or just safe termination support? Safe termination can be done by setting pids.max to 0, then killing pid 1 of the pid namespace and walking the tree down. While we're migrating, could we require that people use a pid namespace with cgroup2?


It looks like LXC tries to mash together cgroupv1 and cgroupv2. It seems like a better idea to not mash them together, at least in the first iteration of cgroupv2.

@wking
Contributor

wking commented Jan 31, 2018 via email

@cyphar
Member

cyphar commented Jan 31, 2018

@sargun

They added device filtering support in 4.14. It's a bit different than the way the devices cgroup worked before. Instead, you install a BPF filter on the cgroup which checks the device, and rejects / accepts access to it.

Ah, sorry -- you're right.

Do we need container pausing, or just safe termination support?

Both.

It looks like LXC tries to mash together cgroupv1 and cgroupv2. It seems like a better idea to not mash them together, at least in the first iteration of cgroupv2.

I would agree with you if cgroupv2 wasn't missing controllers we need, and if projects like systemd didn't already have their "unified" mode (that is actually hybrid). I do agree though that the code would be nicer if it was separate; my worry is that making it separate will make it unusable for quite a long time.

Is the plan for this for it to just be so that people can use it "when all the features we need are done in the kernel"? Or do you envision people using it today? Because if you want people to use cgroupv2 today, removing their ability to use stuff that works with cgroupv1 is a bit of an issue.

return -1
}

func parseMountLine(line string) (MountLine, error) {
Contributor


Just a side thought: /proc/self/mountinfo is already parsed in several places:

  • FindCgroupMountpointDir()
  • FindCgroupMountpointAndRoot()
  • parseMountTable()

rkt also has yet another implementation. It would be good to factor this out, at least within runc.

If the mount point has spaces, newlines or other special characters, mountinfo escapes them, but the parser does not unescape them at the moment. If the parsing is factored out, the unescaping could be fixed in one place.
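A shared parser could handle that unescaping in one place. The kernel octal-escapes space, tab, newline and backslash in mountinfo fields (`\040`, `\011`, `\012`, `\134`); a minimal sketch of the decoding step (the function name is made up):

```go
package main

import (
	"fmt"
	"strings"
)

// unescapeMountField decodes the 3-digit octal escapes (\040 etc.) that
// the kernel uses for special characters in /proc/self/mountinfo fields.
func unescapeMountField(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); i++ {
		// A backslash followed by three octal digits encodes one byte.
		if s[i] == '\\' && i+3 < len(s) &&
			s[i+1] >= '0' && s[i+1] <= '3' &&
			s[i+2] >= '0' && s[i+2] <= '7' &&
			s[i+3] >= '0' && s[i+3] <= '7' {
			b.WriteByte((s[i+1]-'0')<<6 | (s[i+2]-'0')<<3 | (s[i+3] - '0'))
			i += 3
		} else {
			b.WriteByte(s[i])
		}
	}
	return b.String()
}

func main() {
	fmt.Println(unescapeMountField(`/mnt/with\040space`)) // prints "/mnt/with space"
}
```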

@cyphar cyphar self-assigned this Feb 4, 2018
@AkihiroSuda
Member

Any progress on v2 freezer?

@dongsupark

@AkihiroSuda Are you asking whether freezer will be added to cgroup v2 in the kernel or not?
AFAIK, no. There seems to be no plan for doing that.
See also this.

@AkihiroSuda
Member

There seems to be no plan for doing that.

😢

@cyphar @sargun can we move this forward without support for freezer?

@cyphar
Member

cyphar commented Sep 11, 2018

@dongsupark Do you have a source for that? Tejun has definitely mentioned in the past that he wanted to implement freezer in cgroupv2 -- but the main blocker was that using the refrigerator subsystem in Linux can result in userspace processes being frozen in some pretty hairy kernel code (potentially rendering them uninterruptible). He wanted freezer in cgroupv2 to leave processes in a SIGSTOP-like state so that you don't have those types of issues. If there is no plan for freezer in cgroupv2, this is the first time I've heard of it.

@dongsupark

@cyphar No, I don't.
I tried to find such discussions in the mailing lists, but couldn't find any.
If you have heard of it directly from the maintainer, then maybe you're right.
Anyway so far I have assumed that not every controller from cgroup v1 could be supported in v2, and that freezer would not be.

@cyphar
Member

cyphar commented Sep 11, 2018 via email

@crosbymichael
Member

@AkihiroSuda what is your reason for caring about v2? v1 works, it's fully implemented, etc.

@AkihiroSuda
Member

v1 lacks nsdelegate

@sargun
Author

sargun commented Sep 17, 2018

@crosbymichael My reason for wanting cgroup v2 is that there are new features (like BPF network filters) which are only available on v2.

@crosbymichael
Member

So are we doomed to have a split world where we need v2 and v1 together? I don't see how we can do consistent filesystem snapshots without freezer.

@sargun
Author

sargun commented Sep 17, 2018

@crosbymichael
Member

crosbymichael commented Sep 17, 2018

@sargun devmapper is on its way out as more systems support overlayfs. We have a lot of overlay users.

@sargun
Author

sargun commented Sep 17, 2018

@crosbymichael Talking to Tejun, it sounds like Freezing could potentially come back, but it's unlikely in the short term. Are there that many use cases for "live snapshots"?

@crosbymichael
Member

Docker copy, some builds, docker export, and checkpoint/restore all use pause. Killing containers that are in the host pid namespace also uses pause, so that we can deliver the signal to all processes before they fork off more things. It's how we do atomic operations on containers at the filesystem and process level.

@cyphar
Member

cyphar commented Sep 17, 2018

Atomic operations on a container are the biggest one -- killing is a bit odd because in theory killing pid1 in a pidns kills everything, but because you can share pid namespaces (and there are other operations that don't have such nice semantics) we need freezer. LXC has the same problem as us on this one.

So are we doomed to have a split world where we need v2 and v1 together?

Sort of -- LXC currently has "hybrid" support (which was partially necessary because systemd decided to break container runtimes with their "hybrid" setup) but after talking to @brauner I have a feeling that it is absolutely awful to deal with on every possible level. I think I've already linked to his talk earlier in the thread, but the tl;dr is that it's not fun.

As for nsdelegate, it should be noted that until all of the controllers we need are in cgroupv2, nsdelegate is not really very useful (because while you could delegate some controllers, the ones we actually need are not delegated, and thus rootless cgroup usage doesn't work).

I think there are also some general problems of how subtree_control works with delegation (since it has to be enabled from the top of the tree down, any one of your ancestors could stop you from being able to use freezer -- and this includes systemd which might decide to not enable freezer for the entire OS because they don't care about it). I think that's a pretty significant issue.

@cyphar
Member

cyphar commented Oct 2, 2018

Alright, so there has actually been progress on the "hybrid" mode in systemd (systemd/systemd#10107). It turns out that systemd does not intend on having hybrid as the long-term future and so we should be fine with implementing cgroup2-only.

Sorry for being a blocker on this one @sargun. I'm okay with this now that we know what systemd's plan for cgroupv2 is...

@rhatdan
Contributor

rhatdan commented Oct 2, 2018

I met with cgroup v2 developers from Facebook at the "All Systems Go!" conference this past weekend. I have asked them to participate in this conversation and help us find a way forward. They indicated that the freezer cgroup should land around kernel 4.20. They also said work is going on for hugetlb as well as a rework of the device cgroup to use BPF. Hopefully this will become easier by the end of the year.

Once we have these, we need to move forward on getting runc to support v2, and then we can allow the distributions to begin moving forward. Sadly, I don't believe this will all be fleshed out until the distributions default to v2.

@brauner

brauner commented Oct 2, 2018

So I implemented full cgroup v2 support in LXC a while ago. It's at the point where we're just fine-tuning. I've also talked to @poettering (and parts of the results can be seen in the thread that @cyphar has linked to). Hybrid is going to die. However, it is still a thing in a lot of distributions, and if you're not handling it you're likely going to have trouble.
LXC is not meshing v1, hybrid, and v2 together. The cgroup API is abstracted in the same way that it is in systemd, i.e. it handles these three modes in a similar way but in separate codepaths.

What needs to be clear to everyone is that cgroup v2 will require you to talk to systemd or any init system that makes use of cgroups on its own. There's no way around it. Period. cgroup v2 is designed around the single-writer rule, and the owner of the whole cgroup tree is - like it or not - systemd. Any processes associated with a logged-in user on the system will be located in a cgroup. That is, you are always on a leaf node, which means no new cgroups for you unless:

  1. you migrate all the processes into another cgroup
  2. you escape to the root cgroup
  3. you escape one level up from the cgroup the processes are located in
  4. you use one of the ways to ask systemd for delegation

Option 1 is racy and only works reliably if you are root. Option 2 is a big no-no, as the root cgroup is owned by systemd, which is free to do whatever it wants with your processes; you're also violating the single-writer rule. Option 3 is another big no-no for all the reasons option 2 is. Another reason is that you're now in a slice, and a slice is an inner node; these are freely moved around by systemd, so say bye-bye to your limits, or at least be prepared to. In fact, it is far more likely that systemd will move you around when you're messing with inner nodes. The last option is to talk to systemd, either using the dbus API or using the Delegate option in your unit file. The remaining task is being smart about how you create your cgroups in your leaf nodes.

@AkihiroSuda
Member

@cyphar @mrunalp is this closable?

@AkihiroSuda
Member

closable?

@AkihiroSuda
Member

Closing. Remaining issues are tracked in #2209.
