resource management: document current status, suggested fixes by egernst · Pull Request #428 · kata-containers/documentation

egernst · 2019-04-12T23:12:13Z

Let's use this document to drive the conversation around design changes to better manage Kata Containers' resources.

Signed-off-by: Eric Ernst <eric.ernst@intel.com>

bergwolf

I agree with all of it, especially the point that we just need a sandbox cgroup instead of per-container cgroups. Thanks for putting it together!

bergwolf · 2019-04-14T15:09:20Z

+
+QEMU itself and its vhost threads are very problematic.  Depending on the workload, these can consume a
+non-negligible amount of resources. Note, in the Kata implementation, these components are purposefuly
+not added to the constrained cgroup, the pause container. As a result, they fall under the caller's


you mean "not added to the pod cgroup"?

bergwolf · 2019-04-14T15:20:11Z

+### Summary
+ * Pause cgroup cpu shaes should be setup correctly.
+ * Do not create container cgroups on the host. Instead, create a pod sandbox group that is entirely managed by Kata
+ * Move the QEMU threads into the sandbox cgroup


Plus shimv2, kata-shim/proxy, and vhost threads? I think they all belong to the sandbox cgroup.

bergwolf · 2019-04-14T15:21:27Z

+The overheads associated with running a sandbox should be accounted for explicitly, and at the pod level.
+Once the Pod Overhead KEP is available, this should become a part of RuntimeClass, applied to pods which
+utilize the applicable RuntimeClass.  See [Pod Overhead KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/20190226-pod-overhead.md)
+for more details.


Yes! We really need pod overhead to make Kubernetes aware of Kata.

bergwolf · 2019-04-14T15:26:14Z

+
+### Only constrain vCPUs, leaving remaining threads for system reserved
+
+This will, in some cases, provide improved performance.


The problem with system reserved is that it does not scale. So I don't think we should put scaling factors (QEMU main thread, vhost, etc.) there; doing so can result in unexpected system failures due to the constraints enforced on system reserved.

bergwolf · 2019-04-15T15:52:43Z

@egernst The other part of the cgroups handling is how we handle them in the guest. Do you think we should include it in this document?

mcastelino · 2019-04-15T17:50:30Z

+and CRI-O.
+
+#### In Docker
+


We need to fill this in, as we follow a different cgroup setup here. Also, Docker exposes a much richer resource API which some people may use. So we need to get Docker right too.

mcastelino · 2019-04-15T18:06:14Z

+For each container in the pod, a cgroup is created within the pod cgroup (ie, under `/sys/fs/cgroup/*/kubepod/pod.*/`
+for a guaranteed pod). This is not necessary; only a single cgroup which constrains the hypervisor
+appropriately is required.
+


There is a cgroup for the pod itself. Within the pod's cgroup, child cgroups are set up: one for the pause container and one for each container in the pod. Only the actual container processes are placed in these cgroups. The container framework processes, like the shim, are placed under the containerd systemd slice cgroup and are accounted for there. This can be a potential issue if the framework ends up consuming resources (for example, logs and stdio).

mcastelino · 2019-04-15T18:07:13Z

+
+##### Where are all the vCPUS
+
+The vCPUs are placed under the pause container:


Kata tries to set up the pause cgroup to represent the VM itself. It scales its resources to match the total resources assigned to the pod.

mcastelino · 2019-04-15T18:08:12Z

+One drawback of this is that it assumes the existence of a pause container. This is an assumption
+based on the implementation of containerd/cri-o.  On the plus side, the cgroup is placed directly under
+the pod cgroup, which is created and managed by Kubelet.  Overall, this isn't terrible.
+


However, CPU resources are proportional. Hence, if any of the sibling cgroups consume resources, the summation logic does not work.
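To make the proportionality point concrete, here is a minimal sketch of how CPU shares divide among busy sibling cgroups. The cgroup names and share values are hypothetical (two containers at 1024 shares each, with the pause cgroup given their sum), and the model is a simplification that ignores quotas and nested hierarchies.

```python
def cpu_fraction(shares, busy):
    """Return the CPU fraction each sibling cgroup gets under contention.

    Only cgroups with runnable work compete; each busy cgroup receives
    its shares divided by the total shares of all busy siblings.
    """
    total = sum(value for name, value in shares.items() if name in busy)
    return {
        name: (value / total if name in busy else 0.0)
        for name, value in shares.items()
    }


# Hypothetical pod: pause is sized as the sum of the two container cgroups.
shares = {"pause": 2048, "container-a": 1024, "container-b": 1024}

# Only the VM (accounted under pause) is running work.
only_vm = cpu_fraction(shares, busy={"pause"})

# Something also runs inside the sibling container cgroups.
contended = cpu_fraction(shares, busy={"pause", "container-a", "container-b"})

print(only_vm["pause"], contended["pause"])
```

In this model the pause cgroup can use the whole allocation while it is the only busy sibling, but its fraction drops to half as soon as the container cgroups also run work, so the summed value no longer describes what the VM actually receives.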

mcastelino · 2019-04-15T18:09:51Z

+
+CRI-O is very similar the containerd, except for where the non-constrained processes end up. Instead
+of being called by CRIO directly, kata-runtime is called from a process `conmon`, which is located
+in a cgroup under the pod-cgroup. As expected based on prior exapmles, cgroups are ceated for each


typos here.

mcastelino · 2019-04-15T18:10:58Z

+The QEMU, vhost, proxy and shim threads, however, are placed under the caller's cgroup, which in this case
+is `conmon`, which is a peer of the container cgroups we created. So, the good news is that QEMU, vhost,
+etc, are constrained within the pod's cgroup. The bad news is these will be constrained based on the values
+associated with conmon:


This is also wrong, as conmon should be constrained to the absolute minimum resources it can get away with. Typically this is set to 2 (i.e. so that it does not impact the container scheduling).

mcastelino · 2019-04-15T18:17:15Z

+If static CPU policies are introduced, the end user will assign CPUs to a specific container within the pod.  Running
+IO threads on this CPU may not necessarily be desirable, compared to the users expectations.
+
+Long term (ie, with RuntimeClass augmented to handle pod overheads), we should create a seperate `cpuset` cgroup,


The cpuset cgroup is separate from the CPU cgroup. Also, upstream Kubernetes is planning to eliminate the CFS quotas for containers with cpusets, so we really cannot place QEMU outside the cpuset.
kubernetes/kubernetes#70585

jcvenegas · 2019-04-15T16:29:15Z

+IO threads on this CPU may not necessarily be desirable, compared to the users expectations.
+
+Long term (ie, with RuntimeClass augmented to handle pod overheads), we should create a seperate `cpuset` cgroup,
+`kata-sandbox-vcpus`, alongside the standard sandbox cgroup, `kata-sandbox`. These would be siblings underneath the


kata-sandbox-<sandbox-id>: In the case of Docker containers there is no pod-level cgroup, so if we use the same ID they will all end up in the same cgroup.

Where should this new set of cgroups be placed?

kata-managed: /sys/fs/cgroup/{subsystem}/kata?
parent based: sandboxContainer.CgroupPath.GetParent()

@jcvenegas, these would need to be under the parent (the appropriate hierarchical location; where the caller expects it to be). In the case of Kubernetes, under the pod-*.

I'm not sure we need sandbox-id in the naming; which sandbox it is associated with is determined by its hierarchical location.

@jcvenegas do we need the sandbox ID? Can we not just call it sandbox in the case of CRI-O and containerd with Kubernetes?

@egernst In the case of Docker, who creates the cgroup /docker/3cca57349adcfb17f917d2312b568faa85515e6d5a8327a30ea8e5ecd47b0e1f?

Can we be under that?
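The two placement options discussed in this thread can be sketched as follows. This is only an illustration under assumed names: `kata_managed_path` and `parent_based_path` are hypothetical helpers, not Kata runtime functions, and the cgroup paths and IDs are made up.

```python
import posixpath


def kata_managed_path(subsystem, sandbox_id):
    """Option 1 (hypothetical): a Kata-owned hierarchy per subsystem."""
    return posixpath.join(
        "/sys/fs/cgroup", subsystem, "kata", f"kata-sandbox-{sandbox_id}"
    )


def parent_based_path(container_cgroup_path, sandbox_id=None):
    """Option 2 (hypothetical): a sibling under the sandbox container's parent."""
    parent = posixpath.dirname(container_cgroup_path.rstrip("/"))
    name = "kata-sandbox" if sandbox_id is None else f"kata-sandbox-{sandbox_id}"
    return posixpath.join(parent, name)


# Kubernetes-style layout: the parent is the pod cgroup, so the name is unique.
k8s = parent_based_path("/kubepods/pod1234/abcd")

# Docker-style layout: every container shares the /docker parent, so two
# sandboxes collide unless the sandbox ID is part of the name.
docker_a = parent_based_path("/docker/aaaa")
docker_b = parent_based_path("/docker/bbbb")
docker_a_id = parent_based_path("/docker/aaaa", sandbox_id="aaaa")
docker_b_id = parent_based_path("/docker/bbbb", sandbox_id="bbbb")

print(k8s, docker_a, docker_a_id)
```

Under these assumptions, the parent-based name without an ID works when each sandbox has its own parent (the pod cgroup), while the Docker case needs either the ID suffix or placement beneath the per-container cgroup.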

jcvenegas · 2019-04-15T20:41:40Z

+
+### Summary
+ * Pause cgroup cpu shaes should be setup correctly.
+ * Do not create container cgroups on the host. Instead, create a pod sandbox group that is entirely managed by Kata


It would be good to describe why. This is related to the cAdvisor issue, right?

No. This is because it is not for the pause process; it is actually for the VM. Hence we cannot just add up the values and apply them to pause. If we create the container cgroups and something runs in them, then, because CPU cgroups are proportional, we will get the wrong behavior. So it is equally important not to create the container cgroups.
The conmon cgroup will be really tiny and hence should not affect the sandbox (like how pause is set up by runc).

jcvenegas · 2019-04-16T13:43:20Z

PR to move Kata assets into a sandbox-level cgroup based on the sandbox container's cgroup. See: kata-containers/runtime#1522

Signed-off-by: Eric Ernst <eric.ernst@intel.com>

chavafg · 2019-05-07T20:59:07Z

+# cgroup updates in Kata
+
+  * [Background](#background)
+  * [Existing Behavior (Kata 1.6):](#existing-behavior--kata-16--)


link seems broken

Once the commit message is updated and re-pushed, the CI should detect all these broken links (and tell you what they should be ;)
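For reference, a rough sketch of how GitHub derives heading anchors, which is why the TOC entry above does not resolve. The function `github_anchor` is a hypothetical helper that approximates the rule (lowercase, drop punctuation, turn each space into a hyphen); it does not handle duplicate headings or other edge cases.

```python
import re


def github_anchor(heading):
    """Approximate the anchor GitHub generates for a Markdown heading."""
    slug = heading.strip().lower()
    # Drop everything except word characters, hyphens, and spaces.
    slug = re.sub(r"[^\w\- ]", "", slug)
    # Each remaining space becomes a hyphen (runs are not collapsed).
    return "#" + slug.replace(" ", "-")


print(github_anchor("Existing Behavior (Kata 1.6):"))
print(github_anchor("In Kubernetes + Containerd"))
```

Under this approximation the first heading maps to `#existing-behavior-kata-16`, not `#existing-behavior--kata-16--`, because removed punctuation leaves nothing behind while each space still yields one hyphen.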

chavafg · 2019-05-07T20:59:44Z

+  * [Existing Behavior (Kata 1.6):](#existing-behavior--kata-16--)
+    + [Behavior Observed using various upper layer tools](#behavior-observed-using-various-upper-layer-tools)
+      - [In Docker](#in-docker)
+      - [In Kubernetes + Containerd](#in-kubernetes---containerd)


this link also seems broken

chavafg · 2019-05-07T21:00:36Z

+      - [In Docker](#in-docker)
+      - [In Kubernetes + Containerd](#in-kubernetes---containerd)
+        * [Where are all the vCPUS](#where-are-all-the-vcpus)
+        * [Where are the v2-shim, QEMU, and Vhost processes](#where-are-the-v2-shim--qemu--and-vhost-processes)


broken link?

chavafg · 2019-05-07T21:00:59Z

+      - [In Kubernetes + Containerd](#in-kubernetes---containerd)
+        * [Where are all the vCPUS](#where-are-all-the-vcpus)
+        * [Where are the v2-shim, QEMU, and Vhost processes](#where-are-the-v2-shim--qemu--and-vhost-processes)
+      - [In Kubernetes  + CRI-O (v1 shim)](#in-kubernetes----cri-o--v1-shim-)


broken link?

chavafg · 2019-05-07T21:02:11Z

+      - [Accurate usage accounting](#accurate-usage-accounting)
+      - [Node stability](#node-stability)
+      - [Consistent guaranteed pod behavior](#consistent-guaranteed-pod-behavior)
+      - [OOM, unbound CPU utilization](#oom--unbound-cpu-utilization)


broken link?

chavafg · 2019-05-07T21:02:34Z

+    + [Details](#details)
+      - [Pod Sandbox Cgroup](#pod-sandbox-cgroup)
+  * [Alternatives Considered](#alternatives-considered)
+    + [Only constrain vCPUs, leaving remaining threads for system reserved](#only-constrain-vcpus--leaving-remaining-threads-for-system-reserved)


broken link?

chavafg · 2019-05-07T21:03:20Z

+  * [Alternatives Considered](#alternatives-considered)
+    + [Only constrain vCPUs, leaving remaining threads for system reserved](#only-constrain-vcpus--leaving-remaining-threads-for-system-reserved)
+  * [Opens](#opens)
+    + [static CPU configurations](#static-cpu-configurations)


Start with a capital letter? [Static...]

chavafg · 2019-05-07T21:04:16Z

+
+Before diving into the gaps and behavior exhibited in Kata Containers, it is important to have
+a thorough understanding of how cgroups are leveraged by Kubernetes.  It is both straight forward
+and confusing.  An in-depth guide is available for background in [mcastelino's gist](https://gist.github.com/mcastelino/b8ce9a70b00ee56036dadd70ded53e9f).


Remove the extra space?

jodh-intel · 2019-05-08T09:55:38Z

@egernst - you could try running kata-containers/tests#1542 over this PR as a useful test as it should detect the issues (and suggest fixes where possible).

raravena80 · 2019-05-24T16:28:48Z

@egernst any updates? Thx!

grahamwhaley · 2019-06-03T16:58:39Z

+
+Before diving into the gaps and behavior exhibited in Kata Containers, it is important to have
+a thorough understanding of how cgroups are leveraged by Kubernetes.  It is both straight forward
+and confusing.  An in-depth guide is available for background in [mcastelino's gist](https://gist.github.com/mcastelino/b8ce9a70b00ee56036dadd70ded53e9f).


I'd love for us not to have to link to a gist of @mcastelino's; gists just feel too ephemeral to me. Any chance we can get that info into the Kata repos/docs?

grahamwhaley · 2019-06-03T17:00:04Z

I skimmed over this. It looks like a good, very technical doc for us to have in Kata. @egernst, do you plan to clean it up and land it?

caoruidong · 2019-06-17T10:32:31Z

@egernst

jodh-intel · 2019-06-24T09:35:23Z

Commit needs a tweak:

ERROR: Commit 82e4800b2796bd3760851f0892c10ad60618ac7b: Failed to find subsystem in subject: "resource management: document current status, suggested fixes"
ERROR: checkcommits failed. See the document below for help on formatting
commits for the project.
https://github.com/kata-containers/community/blob/master/CONTRIBUTING.md#patch-format

I suggest changing it to

resource-management: document current status, suggested fixes.

jodh-intel · 2019-06-24T09:36:19Z

+# cgroup updates in Kata
+
+  * [Background](#background)
+  * [Existing Behavior (Kata 1.6):](#existing-behavior--kata-16--)


Once the commit message is updated and re-pushed, the CI should detect all these broken links (and tell you what they should be ;)

jodh-intel · 2019-06-24T09:37:07Z

+## Background
+
+With 1.6 release of Kata Containers there are some issues with resource management resulting
+in inconsistent behavior. This document descibes the state of 1.6, and a suggested implementation


We'll be releasing 1.8 soon so will this document still be relevant (since 1.6 won't be updated any further)?

jodh-intel · 2019-07-29T10:37:03Z

@egernst - Do you think you'll get cycles to update this doc this week? Would be great to see this land.

jodh-intel · 2019-09-02T10:43:10Z

Re-ping @egernst.

egernst requested a review from a team as a code owner April 12, 2019 23:12

egernst added this to the cgroup-sprint milestone Apr 12, 2019

egernst mentioned this pull request Apr 12, 2019

RFC: update resource management in Kata kata-containers/runtime#1535

Closed

4 tasks

resource management: document current status, suggested fixes

82e4800

Signed-off-by: Eric Ernst <eric.ernst@intel.com>

egernst force-pushed the cgroup-fixing branch from 47656ca to f7bff95 Compare April 12, 2019 23:26

cleanup initial drop, add TOC

f7bff95

Signed-off-by: Eric Ernst <eric.ernst@intel.com>

egernst force-pushed the cgroup-fixing branch 3 times, most recently from e2153ef to 367d5c6 Compare April 12, 2019 23:51

add containerd details, and outline of proposal

4e0f909

Signed-off-by: Eric Ernst <eric.ernst@intel.com>

egernst force-pushed the cgroup-fixing branch from 367d5c6 to 4e0f909 Compare April 13, 2019 22:37

bergwolf reviewed Apr 14, 2019

mcastelino reviewed Apr 15, 2019

jcvenegas reviewed Apr 15, 2019

add docker details

1c4290b

Signed-off-by: Eric Ernst <eric.ernst@intel.com>

egernst force-pushed the cgroup-fixing branch from 4ba46bf to 1c4290b Compare April 16, 2019 14:17

chavafg reviewed May 7, 2019

grahamwhaley reviewed Jun 3, 2019

egernst mentioned this pull request Jun 8, 2019

need to provide guidance on sandbox overhead #490

Closed

jodh-intel reviewed Jun 24, 2019

egernst closed this Sep 11, 2019

