Cgroup2 [WIP] #1708
|
Thanks for giving this a shot. However...
cgroupv2 doesn't provide all the controllers we need ( I believe that only LXC currently has any form of support for this, and from what Christian has told me, it's pretty awful to make all the edge cases work. In particular, the amount of work needed to create a new container from a leaf node that has other (non-container) processes is quite problematic -- not least of all because it will confuse
The However, the
I'd recommend writing down what the plan is for handling the edge-cases (especially wrt |
|
So, a couple things:
They added device filtering support in 4.14. It's a bit different than the way the devices cgroup worked before. Instead, you install a BPF filter on the cgroup which checks the device, and rejects / accepts access to it.
Do we need container pausing, or just safe termination support? Safe termination can be done by setting pids.max to 0, then killing pid 1 of the pid namespace, and walking down from there.
While we're in migration, can we require that people use a pid ns with cgroup2?
It looks like LXC tries to mash together cgroupv1 and cgroupv2. It seems like a better idea not to mash them together, at least in the first iteration of cgroupv2. |
|
On Wed, Jan 31, 2018 at 06:52:44PM +0000, Sargun Dhillon wrote:
> They added device filtering support in 4.14. It's a bit different
> than the way the devices cgroup worked before. Instead, you install
> a BPF filter on the cgroup which checks the device, and rejects /
> accepts access to it.
I think you mean 4.15. Looking up torvalds/linux@ebc614f68:
linux$ git merge-base --is-ancestor ebc614f6 v4.14 && echo 'in that release'
linux$ git merge-base --is-ancestor ebc614f6 v4.15 && echo 'in that release'
in that release
|
Ah, sorry -- you're right.
Both.
I would agree with you if Is the plan for this for it to just be so that people can use it "when all the features we need are done in the kernel"? Or do you envision people using it today? Because if you want people to use |
	return -1
}

func parseMountLine(line string) (MountLine, error) {
Just a side thought: /proc/self/mountinfo is already parsed in several places:
FindCgroupMountpointDir(), FindCgroupMountpointAndRoot(), parseMountTable()
rkt also has yet another implementation. It would be good to factor this out, at least within runc.
If MountPoint has spaces, newlines or other special characters, mountinfo escapes them, but the parser does not unescape them at the moment. If the parsing is factored out, the unescaping could be fixed in one place.
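A minimal sketch of what such a shared unescaping helper could look like. The kernel writes space, tab, newline and backslash in mountinfo fields as three-digit octal escapes (\040, \011, \012, \134). The function name unescapeMountField is hypothetical and is not an existing runc function.

```go
package main

import (
	"strconv"
	"strings"
)

// unescapeMountField decodes the \ooo octal escapes used in
// /proc/self/mountinfo fields. Sequences that are not a backslash
// followed by three octal digits are copied through unchanged.
func unescapeMountField(field string) string {
	if !strings.Contains(field, `\`) {
		return field // fast path: nothing to decode
	}
	var b strings.Builder
	b.Grow(len(field))
	for i := 0; i < len(field); i++ {
		if field[i] == '\\' && i+3 < len(field) {
			// Parse the next three characters as an 8-bit octal value.
			if v, err := strconv.ParseUint(field[i+1:i+4], 8, 8); err == nil {
				b.WriteByte(byte(v))
				i += 3
				continue
			}
		}
		b.WriteByte(field[i])
	}
	return b.String()
}
```

A parser would call this on the mount point (and root) fields only after splitting the line on spaces, since the escaping exists precisely so that the split is unambiguous.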
|
Any progress on v2 freezer? |
|
@AkihiroSuda Are you asking whether freezer will be added to cgroup v2 in the Kernel or not? |
|
@dongsupark Do you have a source for that? Tejun has definitely mentioned that he wanted to implement freezer in cgroupv2 in the past -- but the main blocker was that using the |
|
@cyphar No I don't. |
|
On 2018-09-11, Dongsu Park <***@***.***> wrote:
> If you have heard of it directly from the maintainer, then maybe
> you're right.
I'm trying to remember if I asked about this on a mailing-list or
in-person, but I do remember Tejun mentioning this in the past.
> Anyway so far I have assumed that not every controller from cgroup v1
> could be supported in v2, and that freezer would not be.
That is definitely true (for instance `net_cls` and `net_prio` are never
going to be supported in cgroupv2 because they cannot be implemented
hierarchically). But there are cgroupv1 controllers that can be done in
cgroupv2 that have not yet been implemented.
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
|
|
@AkihiroSuda what is your reason for caring about v2? v1 works, it's fully implemented, etc. |
|
v1 lacks |
|
@crosbymichael My reason for wanting Cgroup V2 is because there are new features (like BPF network filters), which are only on V2. |
|
So are we doomed to have a split world where we need v2 and v1 together? I don't see how we can do consistent filesystem snapshots without freezer |
Do those work for your needs? |
|
@sargun devmapper is on its way out as more systems support overlayfs. We have a lot of overlay users |
|
@crosbymichael Talking to Tejun, it sounds like Freezing could potentially come back, but it's unlikely in the short term. Are there that many use cases for "live snapshots"? |
|
Docker copy, some builds, docker export, and checkpoint/restore all use pause. Killing containers that are in the host pid namespace also uses pause, so that we can deliver the signal to all processes before they fork more things off. It's how we do atomic operations on containers at the filesystem and process level |
|
Atomic operations on a container is the biggest one -- killing is a bit odd because in theory killing pid1 in a pidns kills everything but because you can share pid namespaces (and there are other operations that don't have such nice semantics) we need
Sort of -- LXC currently has "hybrid" support (which was partially necessary because systemd decided to break container runtimes with their "hybrid" setup) but after talking to @brauner I have a feeling that it is absolutely awful to deal with on every possible level. I think I've already linked to his talk earlier in the thread, but the tl;dr is that it's not fun. As for I think there are also some general problems of how |
|
Alright, so there has actually been progress on the "hybrid" mode in systemd (systemd/systemd#10107). It turns out that systemd does not intend on having hybrid as the long-term future and so we should be fine with implementing cgroup2-only. Sorry for being a blocker on this one @sargun. I'm okay with this now that we know what systemd's plan for cgroupv2 is... |
|
I met with CGroup V2 developers from Facebook at "All Systems Go" conference this past weekend. I have asked them to participate in this conversation, and help us find a way forward. They indicated that the Freezer Cgroup should be in around Kernel 4.20. They also said work is going on for Hugetlb as well as a rework of the Device Cgroup to use BPF. Hopefully this will become easier by end of year. Once we have these we need to move forward on getting runc to support V2 and then we can allow the Distributions to begin moving forward. Sadly I don't believe this will all be fleshed out, until the distributions default to V2. |
|
So I have implemented full cgroup v2 support in LXC a while ago. It's at the What needs to be clear to everyone is that cgroup v2 will require you to talk
Option 1 is racy and only works reliably if you are root. Option 2 is a big |
|
closable? |
|
Closing. Remaining issues are tracked in #2209. |
This is my first cut of cgroup2. It's very awkward to mix cgroup2 and cgroup1 with the spaghetti code that currently exists. I'd suggest that we have a mechanism to switch wholesale between cgroupv1 and v2, versus trying to maintain a hybrid mode. If people are okay with that, I can begin work on a parallel cgv2 manager.
In addition to this, I'm unsure what the point of the systemd integration is. Can someone clue me in on that?
In cgroupv2, it looks like it won't be needed, because you have proper namespacing and delegation, but again, I have no idea how this code is actually designed to fit together.
Can people please comment, so I can get a general direction to take this?
CC: @crosbymichael @hqhq