
Cgroup2 [WIP] #1708

Closed
sargun wants to merge 1 commit into opencontainers:master from sargun:cgroup2

Conversation


@sargun sargun commented Jan 30, 2018

This is my first cut of cgroup2. It's very awkward to mix cgroup2 and cgroup1 with the spaghetti code that currently exists. I'd suggest that we have a mechanism to switch wholesale from cgroupv1 to v2, rather than trying to maintain a hybrid mode. If people are okay with that, I can begin work on a parallel cgroupv2 manager.

In addition to this, I'm unsure of what the point of the systemd integration is? Can someone clue me in on that?

In cgroupv2, it looks like it won't be needed, because you have proper namespacing and delegation, but again, I have no idea how this code is actually designed to fit together.

Can people please comment, so I can get a general direction to take this?
CC: @crosbymichael @hqhq

@cyphar
Member

cyphar commented Jan 31, 2018

Thanks for giving this a shot. However...

I'd suggest that we have a mechanism to switch wholesale from cgroupv1 to v2, rather than trying to maintain a hybrid mode.

cgroupv2 doesn't provide all the controllers we need (devices for security and freezer for container pausing), so in order for runc to work like it does today we would still need some sort of hybrid mode. However, maybe it would be nicer if we had separate packages for each mode? There are quite a few differences between them, so it might make less sense to use the same manager struct for both. I'm not sure though.

I believe that only LXC currently has any form of support for this, and from what Christian has told me, it's pretty awful to make all the edge cases work. In particular, the amount of work needed to create a new container from a leaf node that has other (non-container) processes is quite problematic -- not least of all because it will confuse systemd. (Also, to be honest, I haven't managed to boot a cgroupv2-only machine in the past year without things breaking.)

In addition to this, I'm unsure of what the point of the systemd integration is? Can someone clue me in on that? In cgroupv2, it looks like it won't be needed, because you have proper namespacing and delegation, but again, I have no idea how this code is actually designed to fit together.

The systemd code is a bit of a sore point. The core reason for it existing is that systemd has a history of messing with the cgroups of containers. For a period of time, telling systemd about your "container" through a TransientUnit would be enough to convince it to not touch your processes. Then they added Delegate which actually codified this. I believe that RedHat relies on this feature (at least, that's where most of the bug reports come from) and they also register containers with machinectl.

However, the systemd code is quite far removed from what it should be doing semantically (which is just alerting systemd to the existence of the container and then manually setting everything anyway). So yeah, it's pretty ugly.

Can people please comment, so I can get a general direction to take this?

I'd recommend writing down what the plan is for handling the edge-cases (especially wrt systemd meddling with cgroupv2 -- as now named hierarchies no longer exist so the policy hierarchy is identical to the service hierarchy). Have you looked at what LXC does?

@sargun
Author

sargun commented Jan 31, 2018

So, a couple things:

cgroupv2 doesn't provide all the controllers we need (devices for security and freezer for container pausing), so in order for runc to work like it does today we would still need some sort of hybrid mode. However, maybe it would be nicer if we had separate packages for each mode? There are quite a few differences between them, so it might make less sense to use the same manager struct for both. I'm not sure though.

They added device filtering support in 4.14. It's a bit different than the way the devices cgroup worked before. Instead, you install a BPF filter on the cgroup which checks the device, and rejects / accepts access to it.
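As a rough illustration of what such a filter decides (a hypothetical sketch in plain Go rather than BPF bytecode; the rule struct and helper below are made up, not runc code): the kernel invokes the attached program with the device's type, major/minor numbers and the requested access, and the program's return value accepts or rejects it.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical sketch of the allow/deny decision a cgroup2 device filter
// makes, written in plain Go instead of BPF for readability.
type devRule struct {
	devType      rune   // 'c' = char device, 'b' = block device
	major, minor int64  // -1 acts as a wildcard
	access       string // subset of "rwm": read, write, mknod
}

// allowed reports whether any rule permits the requested access.
func allowed(rules []devRule, devType rune, major, minor int64, access string) bool {
	for _, r := range rules {
		if r.devType != devType {
			continue
		}
		if r.major != -1 && r.major != major {
			continue
		}
		if r.minor != -1 && r.minor != minor {
			continue
		}
		ok := true
		for _, a := range access {
			if !strings.ContainsRune(r.access, a) {
				ok = false
				break
			}
		}
		if ok {
			return true
		}
	}
	return false
}

func main() {
	// Allow read/write on /dev/null (char 1:3) only.
	rules := []devRule{{devType: 'c', major: 1, minor: 3, access: "rw"}}
	fmt.Println(allowed(rules, 'c', 1, 3, "r")) // true
	fmt.Println(allowed(rules, 'c', 1, 5, "r")) // false
}
```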

Do we need container pausing, or just safe termination support? Safe termination can be done by setting pids.max to 0, then killing pid 1 of the pid namespace and walking the tree down. While we're migrating, could we require that people use a pid namespace with cgroup2?


It looks like LXC tries to mash together cgroupv1 and cgroupv2. It seems like a better idea to not mash them together, at least in the first iteration of cgroupv2.

@wking
Contributor

wking commented Jan 31, 2018 via email

@cyphar
Member

cyphar commented Jan 31, 2018

@sargun

They added device filtering support in 4.14. It's a bit different than the way the devices cgroup worked before. Instead, you install a BPF filter on the cgroup which checks the device, and rejects / accepts access to it.

Ah, sorry -- you're right.

Do we need container pausing, or just safe termination support?

Both.

It looks like LXC tries to mash together cgroupv1 and cgroupv2. It seems like a better idea to not mash them together, at least in the first iteration of cgroupv2.

I would agree with you if cgroupv2 wasn't missing controllers we need, and if projects like systemd didn't already have their "unified" mode (that is actually hybrid). I do agree though that the code would be nicer if it was separate; my worry is that making it separate will make it unusable for quite a long time.

Is the plan for this for it to just be so that people can use it "when all the features we need are done in the kernel"? Or do you envision people using it today? Because if you want people to use cgroupv2 today, removing their ability to use stuff that works with cgroupv1 is a bit of an issue.

return -1
}

func parseMountLine(line string) (MountLine, error) {
Contributor


Just a side thought: /proc/self/mountinfo is already parsed in several places:

  • FindCgroupMountpointDir()
  • FindCgroupMountpointAndRoot()
  • parseMountTable()

rkt also has yet another implementation. It would be good to factor this out, at least within runc.

If the mount point has spaces, newlines or other special characters, mountinfo escapes them, but the parser does not unescape them at the moment. If the parsing is factored out, the unescaping could be fixed in one place.
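A shared parser could handle that unescaping in one place. The kernel octal-escapes space, tab, newline and backslash in mountinfo fields (`\040`, `\011`, `\012`, `\134`); a minimal sketch of the decoding step (the function name is made up):

```go
package main

import (
	"fmt"
	"strings"
)

// unescapeMountField decodes the 3-digit octal escapes (\040 etc.) that
// the kernel uses for special characters in /proc/self/mountinfo fields.
func unescapeMountField(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); i++ {
		// A backslash followed by three octal digits encodes one byte.
		if s[i] == '\\' && i+3 < len(s) &&
			s[i+1] >= '0' && s[i+1] <= '3' &&
			s[i+2] >= '0' && s[i+2] <= '7' &&
			s[i+3] >= '0' && s[i+3] <= '7' {
			b.WriteByte((s[i+1]-'0')<<6 | (s[i+2]-'0')<<3 | (s[i+3] - '0'))
			i += 3
		} else {
			b.WriteByte(s[i])
		}
	}
	return b.String()
}

func main() {
	fmt.Println(unescapeMountField(`/mnt/with\040space`)) // prints "/mnt/with space"
}
```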

@cyphar cyphar self-assigned this Feb 4, 2018
@AkihiroSuda
Member

Any progress on v2 freezer?

@dongsupark

@AkihiroSuda Are you asking whether freezer will be added to cgroup v2 in the kernel or not?
AFAIK, no. There seems to be no plan for doing that.
See also this.

@AkihiroSuda
Member

There seems to be no plan for doing that.

😢

@cyphar @sargun can we move this forward without support for freezer?

@cyphar
Member

cyphar commented Sep 11, 2018

@dongsupark Do you have a source for that? Tejun has definitely mentioned in the past that he wanted to implement freezer in cgroupv2 -- but the main blocker was that using the refrigerator subsystem in Linux can result in userspace processes being frozen in some pretty hairy kernel code (potentially rendering them uninterruptible). He wanted freezer in cgroupv2 to leave processes in a SIGSTOP-like state so that you don't have those types of issues. If there is no plan for freezer in cgroupv2, this is the first time I've heard of it.

@dongsupark

@cyphar No, I don't.
I tried to find such discussions in the mailing lists, but couldn't find any.
If you have heard of it directly from the maintainer, then maybe you're right.
Anyway so far I have assumed that not every controller from cgroup v1 could be supported in v2, and that freezer would not be.

@cyphar
Member

cyphar commented Sep 11, 2018 via email

@crosbymichael
Member

@AkihiroSuda what is your reason for caring about v2? v1 works, it's fully implemented, etc.

@AkihiroSuda
Member

v1 lacks nsdelegate

@sargun
Author

sargun commented Sep 17, 2018

@crosbymichael My reason for wanting cgroup v2 is that there are new features (like BPF network filters) which are only available on v2.

@crosbymichael
Member

So are we doomed to have a split world where we need v2 and v1 together? I don't see how we can do consistent filesystem snapshots without freezer.

@sargun
Author

sargun commented Sep 17, 2018

@crosbymichael
Member

crosbymichael commented Sep 17, 2018

@sargun devmapper is on its way out as more systems support overlayfs. We have a lot of overlay users.

@sargun
Author

sargun commented Sep 17, 2018

@crosbymichael Talking to Tejun, it sounds like Freezing could potentially come back, but it's unlikely in the short term. Are there that many use cases for "live snapshots"?

@crosbymichael
Member

Docker copy, some builds, docker export, and checkpoint/restore all use pause. Killing containers that are in the host pid namespace also uses pause, so that we can deliver the signal to all processes before they fork off more things. It's how we do atomic operations on containers at the filesystem and process level.

@cyphar
Member

cyphar commented Sep 17, 2018

Atomic operations on a container are the biggest one -- killing is a bit odd because in theory killing pid1 in a pidns kills everything, but because you can share pid namespaces (and there are other operations that don't have such nice semantics) we need freezer. LXC has the same problem as us on this one.

So are we doomed to have a split world where we need v2 and v1 together?

Sort of -- LXC currently has "hybrid" support (which was partially necessary because systemd decided to break container runtimes with their "hybrid" setup) but after talking to @brauner I have a feeling that it is absolutely awful to deal with on every possible level. I think I've already linked to his talk earlier in the thread, but the tl;dr is that it's not fun.

As for nsdelegate, it should be noted that until all of the controllers we need are in cgroupv2, nsdelegate is not really very useful (because while you could delegate some controllers, the ones we actually need are not delegated, and thus rootless cgroup usage doesn't work).

I think there are also some general problems of how subtree_control works with delegation (since it has to be enabled from the top of the tree down, any one of your ancestors could stop you from being able to use freezer -- and this includes systemd which might decide to not enable freezer for the entire OS because they don't care about it). I think that's a pretty significant issue.

@cyphar
Member

cyphar commented Oct 2, 2018

Alright, so there has actually been progress on the "hybrid" mode in systemd (systemd/systemd#10107). It turns out that systemd does not intend on having hybrid as the long-term future and so we should be fine with implementing cgroup2-only.

Sorry for being a blocker on this one @sargun. I'm okay with this now that we know what systemd's plan for cgroupv2 is...

@rhatdan
Contributor

rhatdan commented Oct 2, 2018

I met with cgroup v2 developers from Facebook at the "All Systems Go!" conference this past weekend. I have asked them to participate in this conversation and help us find a way forward. They indicated that the freezer cgroup should land around kernel 4.20. They also said work is going on for hugetlb as well as a rework of the device cgroup to use BPF. Hopefully this will become easier by the end of the year.

Once we have these, we need to move forward on getting runc to support v2, and then we can allow the distributions to begin moving forward. Sadly, I don't believe this will all be fleshed out until the distributions default to v2.

@brauner

brauner commented Oct 2, 2018

So I implemented full cgroup v2 support in LXC a while ago. It's at the point where we're just fine-tuning. I've also talked to @poettering (and parts of the results can be seen in the thread that @cyphar has linked to). Hybrid is going to die. However, it is still a thing in a lot of distributions, and if you're not handling it you're likely going to have trouble.
LXC is not meshing v1, hybrid, and v2 together. The cgroup API is abstracted in the same way that it is in systemd, i.e. it handles these three modes in a similar way but in separate codepaths.

What needs to be clear to everyone is that cgroup v2 will require you to talk to systemd or any init system that makes use of cgroups on its own. There's no way around it. Period. cgroup v2 is designed around the single-writer rule, and the owner of the whole cgroup tree is - like it or not - systemd. Any processes associated with a logged-in user on the system will be located in a cgroup. That is, you are always on a leaf node, which means no new cgroups for you unless:

  1. you migrate all the processes into another cgroup
  2. you escape to the root cgroup
  3. you escape one level up from the cgroup the processes are located in
  4. you use one of the ways to ask systemd for delegation

Option 1 is racy and only works reliably if you are root. Option 2 is a big no-no, as the root cgroup is owned by systemd, which is free to do whatever it wants with your processes; you're also violating the single-writer rule. Option 3 is another big no-no for all the reasons option 2 is. Another reason is that you're now in a slice, and a slice is an inner node; these are freely moved around by systemd, so say bye-bye to your limits, or at least be prepared to. In fact, it is far more likely that systemd will move you around when you're messing with inner nodes. The last option is to talk to systemd, either using the dbus API or using the Delegate option in your unit file. The remaining task is being smart about how you create your cgroups in your leaf nodes.

@AkihiroSuda
Member

@cyphar @mrunalp is this closable?

@AkihiroSuda
Member

closable?

@AkihiroSuda
Member

Closing. Remaining issues are tracked in #2209.
