resource management: document current status, suggested fixes #428
egernst wants to merge 4 commits into
Signed-off-by: Eric Ernst <eric.ernst@intel.com>
Signed-off-by: Eric Ernst <eric.ernst@intel.com>
e2153ef to 367d5c6
Signed-off-by: Eric Ernst <eric.ernst@intel.com>
bergwolf
left a comment
I agree with all the bits, especially the point that we just need a sandbox cgroup instead of per-container cgroups. Thanks for putting it together!
| QEMU itself and its vhost threads are very problematic. Depending on the workload, these can consume a | ||
| non-negligible amount of resources. Note, in the Kata implementation, these components are purposefully | ||
| not added to the constrained cgroup, the pause container. As a result, they fall under the caller's |
Do you mean "not added to the pod cgroup"?
| ### Summary | ||
| * Pause cgroup CPU shares should be set up correctly. | ||
| * Do not create container cgroups on the host. Instead, create a pod sandbox group that is entirely managed by Kata | ||
| * Move the QEMU threads into the sandbox cgroup |
Plus shimv2, kata-shim/proxy, and vhost threads? I think they all belong to the sandbox cgroup.
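A minimal sketch of what moving these processes into a single sandbox cgroup could look like, assuming cgroup v1 semantics where a process is attached by writing its PID to the target cgroup's `cgroup.procs` file. The function name, the `kata-sandbox` directory name, and the PIDs are hypothetical; a temporary directory stands in for the real cgroup filesystem so the example runs without privileges.

```python
import os
import tempfile
from typing import Iterable


def move_to_cgroup(cgroup_dir: str, pids: Iterable[int]) -> None:
    """Attach each PID to the cgroup rooted at cgroup_dir."""
    procs_file = os.path.join(cgroup_dir, "cgroup.procs")
    for pid in pids:
        # cgroupfs expects one PID per write, so reopen the file for each PID.
        with open(procs_file, "a") as f:
            f.write(f"{pid}\n")


# Stand-in for something like <pod cgroup>/kata-sandbox (hypothetical name).
with tempfile.TemporaryDirectory() as fake_pod_cgroup:
    sandbox = os.path.join(fake_pod_cgroup, "kata-sandbox")
    os.makedirs(sandbox)
    # Hypothetical PIDs for QEMU, vhost, the shim, and the proxy.
    move_to_cgroup(sandbox, [4101, 4102, 4103, 4104])
    with open(os.path.join(sandbox, "cgroup.procs")) as f:
        attached = f.read().split()

print(attached)
```

On a real host the same call would target the sandbox cgroup under each relevant subsystem, which requires the appropriate privileges.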
| The overheads associated with running a sandbox should be accounted for explicitly, and at the pod level. | ||
| Once the Pod Overhead KEP is available, this should become a part of RuntimeClass, applied to pods which | ||
| utilize the applicable RuntimeClass. See [Pod Overhead KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/20190226-pod-overhead.md) | ||
| for more details. |
Yes! We really need pod overhead to make Kubernetes aware of Kata.
| ### Only constrain vCPUs, leaving remaining threads for system reserved | ||
| This will, in some cases, provide improved performance. |
The problem with system reserved is that it does not scale. So I don't think we should put scaling factors (QEMU main thread, vhost, etc.) there; doing so can result in unexpected system failures due to the constraints enforced on system reserved.
@egernst The other part of the cgroups handling is how we handle them in the guest. Do you think we should include it in this document?
| and CRI-O. | ||
| #### In Docker | ||
We need to fill this in, as we follow a different cgroup setup here. Docker also exposes a much richer resource API, which some people may use, so we need to get Docker right too.
| For each container in the pod, a cgroup is created within the pod cgroup (ie, under `/sys/fs/cgroup/*/kubepod/pod.*/` | ||
| for a guaranteed pod). This is not necessary; only a single cgroup which constrains the hypervisor | ||
| appropriately is required. | ||
There is a cgroup for the pod itself. Within the pod's cgroup, child cgroups are set up: one for the pause container and one for each container in the pod. Only the actual container processes are placed in these cgroups. The container framework processes, such as the shim, are placed under the containerd systemd slice cgroup and as such are accounted for. This can be a potential issue if the framework ends up consuming resources (for example, for logs and stdio).
| ##### Where are all the vCPUS | ||
| The vCPUs are placed under the pause container: |
Kata tries to set up the pause cgroup to represent the VM itself. It scales its resources to match the total resources assigned to the pod.
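As a rough illustration of that scaling, the sketch below sums per-container CPU requests and limits (in millicores) and converts the totals to `cpu.shares` and `cpu.cfs_quota_us` values. It assumes the usual convention of 1024 shares per CPU, a minimum of 2 shares, and a 100 ms CFS period; the function names and sample values are hypothetical and this is not Kata's actual code.

```python
from typing import Iterable, Tuple

SHARES_PER_CPU = 1024    # cpu.shares granted per full CPU (assumed convention)
MIN_SHARES = 2           # smallest value accepted for cpu.shares
CFS_PERIOD_US = 100_000  # assumed default cpu.cfs_period_us


def milli_cpu_to_shares(milli_cpu: int) -> int:
    """Convert a CPU request in millicores to a cpu.shares value."""
    return max(MIN_SHARES, milli_cpu * SHARES_PER_CPU // 1000)


def milli_cpu_to_quota(milli_cpu: int, period_us: int = CFS_PERIOD_US) -> int:
    """Convert a CPU limit in millicores to a cpu.cfs_quota_us value."""
    return milli_cpu * period_us // 1000


def sandbox_cpu_settings(
    requests_m: Iterable[int], limits_m: Iterable[int]
) -> Tuple[int, int]:
    """Sum per-container requests and limits into one (shares, quota) pair."""
    shares = milli_cpu_to_shares(sum(requests_m))
    quota = milli_cpu_to_quota(sum(limits_m))
    return shares, quota


# Two hypothetical containers: requests of 500m and 1500m, limits of 1 and 2 CPUs.
shares, quota = sandbox_cpu_settings([500, 1500], [1000, 2000])
print(shares, quota)
```

The single pair of values is what would be applied to the cgroup that represents the VM, rather than to each container individually.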
| One drawback of this is that it assumes the existence of a pause container. This is an assumption | ||
| based on the implementation of containerd/cri-o. On the plus side, the cgroup is placed directly under | ||
| the pod cgroup, which is created and managed by Kubelet. Overall, this isn't terrible. | ||
However, CPU resources are proportional. Hence, if any of the sibling cgroups consume resources, the summation logic does not work.
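To make the proportionality concrete, here is a simplified model, assuming every sibling cgroup is runnable and competing for CPU at the same time. Each sibling then receives CPU time in proportion to its `cpu.shares` relative to the sum over all siblings. The cgroup names and share values are hypothetical.

```python
from typing import Dict


def cpu_fraction_under_contention(shares: Dict[str, int]) -> Dict[str, float]:
    """Fraction of CPU each sibling cgroup gets when all siblings are busy."""
    total = sum(shares.values())
    return {name: value / total for name, value in shares.items()}


# The pause cgroup carries the summed shares (512 + 1536 = 2048), but the
# per-container sibling cgroups still exist with their own shares.
siblings = {"pause": 2048, "container-a": 512, "container-b": 1536}
fractions = cpu_fraction_under_contention(siblings)
print(fractions["pause"])

# With only a minimally weighted sibling, the VM cgroup keeps almost everything.
tiny_sibling = cpu_fraction_under_contention({"pause": 2048, "conmon": 2})
print(tiny_sibling["pause"])
```

In the first case the cgroup representing the VM is entitled to only half of the pod's CPU time if something runs in the sibling cgroups, even though its shares equal the sum of the container requests.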
| CRI-O is very similar to containerd, except for where the non-constrained processes end up. Instead | ||
| of being called by CRI-O directly, kata-runtime is called from a process `conmon`, which is located | ||
| in a cgroup under the pod-cgroup. As expected based on prior examples, cgroups are created for each |
| The QEMU, vhost, proxy and shim threads, however, are placed under the caller's cgroup, which in this case | ||
| is `conmon`, which is a peer of the container cgroups we created. So, the good news is that QEMU, vhost, | ||
| etc, are constrained within the pod's cgroup. The bad news is these will be constrained based on the values | ||
| associated with conmon: |
This is also wrong, as conmon should be constrained to the absolute minimum resources it can get away with. Typically this is set to 2 (i.e. so that it does not impact the container scheduling).
| If static CPU policies are introduced, the end user will assign CPUs to a specific container within the pod. Running | ||
| IO threads on this CPU may not necessarily be desirable, compared to the user's expectations. | ||
| Long term (ie, with RuntimeClass augmented to handle pod overheads), we should create a separate `cpuset` cgroup, |
The cpuset cgroup is separate from CPU cgroup. Also upstream kubernetes is planning to eliminate the cfs quotas for containers with cpusets. So we really cannot place QEMU outside the cpuset.
kubernetes/kubernetes#70585
| IO threads on this CPU may not necessarily be desirable, compared to the user's expectations. | ||
| Long term (ie, with RuntimeClass augmented to handle pod overheads), we should create a separate `cpuset` cgroup, | ||
| `kata-sandbox-vcpus`, alongside the standard sandbox cgroup, `kata-sandbox`. These would be siblings underneath the |
kata-sandbox-<sandbox-id>: In the case of Docker containers there is no pod-level cgroup, so if we use the same id they will all end up in the same cgroup.
There was a problem hiding this comment.
Where should this new set of cgroups be placed?
kata-managed: /sys/fs/cgroup/{subsystem}/kata?
parent based: sandboxContainer.CgroupPath.GetParent()
There was a problem hiding this comment.
@jcvenegas, these would need to be under the parent (the appropriate hierarchical location, where the caller expects it to be). In the case of Kubernetes, under the pod-*.
I'm not sure we need sandbox-id in the naming; which sandbox it is associated with is determined by its hierarchical location.
There was a problem hiding this comment.
@jcvenegas do we need the sandbox id? Can we not just call it sandbox in the case of CRI-O and containerd with Kubernetes?
@egernst In the case of Docker, who creates the cgroup /docker/3cca57349adcfb17f917d2312b568faa85515e6d5a8327a30ea8e5ecd47b0e1f?
Can we be under that?
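The parent-based placement discussed in this thread can be sketched as follows. This is an illustration only: the function name, the `kata-sandbox` naming, and the example cgroup paths are hypothetical, and paths are handled as plain strings rather than through any real cgroup library.

```python
import posixpath
from typing import Optional


def sandbox_cgroup_path(container_cgroup: str, sandbox_id: Optional[str] = None) -> str:
    """Place the sandbox cgroup next to the container cgroup, under its parent."""
    parent = posixpath.dirname(container_cgroup.rstrip("/"))
    name = "kata-sandbox" if sandbox_id is None else f"kata-sandbox-{sandbox_id}"
    return posixpath.join(parent, name)


# Kubernetes-style layout: the parent is the per-pod cgroup, so a fixed
# name is already unique per sandbox.
k8s = sandbox_cgroup_path("/kubepods/pod1234/abcd")

# Docker-style layout: the parent is shared by all containers, so a fixed
# name collides unless the sandbox id is part of it.
docker_a = sandbox_cgroup_path("/docker/aaaa")
docker_b = sandbox_cgroup_path("/docker/bbbb")
docker_a_with_id = sandbox_cgroup_path("/docker/aaaa", sandbox_id="aaaa")

print(k8s, docker_a, docker_b, docker_a_with_id)
```

Under these assumptions the hierarchical location is enough to identify the sandbox in the Kubernetes case, while the Docker case needs either the id in the name or placement under the container's own cgroup.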
| ### Summary | ||
| * Pause cgroup CPU shares should be set up correctly. | ||
| * Do not create container cgroups on the host. Instead, create a pod sandbox group that is entirely managed by Kata |
It would be good to describe why. This is related to the cadvisor issue, right?
There was a problem hiding this comment.
No. This is because it is not for the pause process; it is actually for the VM. Hence we cannot just add up the values and apply them to pause. Because CPU cgroups are proportional, if we create the container cgroups and something runs in them, we will get the wrong behavior. So it is equally important not to create the container cgroups.
The conmon cgroup will be really tiny and hence should not affect the sandbox (like how pause is set up by runc).
PR to move Kata assets to a sandbox-level cgroup based on the sandbox container's cgroup. See: kata-containers/runtime#1522
Signed-off-by: Eric Ernst <eric.ernst@intel.com>
| # cgroup updates in Kata | ||
| * [Background](#background) | ||
| * [Existing Behavior (Kata 1.6):](#existing-behavior--kata-16--) |
Once the commit message is updated and re-pushed, the CI should detect all these broken links (and tell you what they should be ;)
| * [Existing Behavior (Kata 1.6):](#existing-behavior--kata-16--) | ||
| + [Behavior Observed using various upper layer tools](#behavior-observed-using-various-upper-layer-tools) | ||
| - [In Docker](#in-docker) | ||
| - [In Kubernetes + Containerd](#in-kubernetes---containerd) |
this link also seems broken
| - [In Docker](#in-docker) | ||
| - [In Kubernetes + Containerd](#in-kubernetes---containerd) | ||
| * [Where are all the vCPUS](#where-are-all-the-vcpus) | ||
| * [Where are the v2-shim, QEMU, and Vhost processes](#where-are-the-v2-shim--qemu--and-vhost-processes) |
| - [In Kubernetes + Containerd](#in-kubernetes---containerd) | ||
| * [Where are all the vCPUS](#where-are-all-the-vcpus) | ||
| * [Where are the v2-shim, QEMU, and Vhost processes](#where-are-the-v2-shim--qemu--and-vhost-processes) | ||
| - [In Kubernetes + CRI-O (v1 shim)](#in-kubernetes----cri-o--v1-shim-) |
| - [Accurate usage accounting](#accurate-usage-accounting) | ||
| - [Node stability](#node-stability) | ||
| - [Consistent guaranteed pod behavior](#consistent-guaranteed-pod-behavior) | ||
| - [OOM, unbound CPU utilization](#oom--unbound-cpu-utilization) |
| + [Details](#details) | ||
| - [Pod Sandbox Cgroup](#pod-sandbox-cgroup) | ||
| * [Alternatives Considered](#alternatives-considered) | ||
| + [Only constrain vCPUs, leaving remaining threads for system reserved](#only-constrain-vcpus--leaving-remaining-threads-for-system-reserved) |
| * [Alternatives Considered](#alternatives-considered) | ||
| + [Only constrain vCPUs, leaving remaining threads for system reserved](#only-constrain-vcpus--leaving-remaining-threads-for-system-reserved) | ||
| * [Opens](#opens) | ||
| + [static CPU configurations](#static-cpu-configurations) |
start with Capital letter? [Static....]
| Before diving into the gaps and behavior exhibited in Kata Containers, it is important to have | ||
| a thorough understanding of how cgroups are leveraged by Kubernetes. It is both straightforward | ||
| and confusing. An in-depth guide is available for background in [mcastelino's gist](https://gist.github.com/mcastelino/b8ce9a70b00ee56036dadd70ded53e9f). |
@egernst - you could try running kata-containers/tests#1542 over this PR as a useful test as it should detect the issues (and suggest fixes where possible).
@egernst any updates? Thx!
| Before diving into the gaps and behavior exhibited in Kata Containers, it is important to have | ||
| a thorough understanding of how cgroups are leveraged by Kubernetes. It is both straightforward | ||
| and confusing. An in-depth guide is available for background in [mcastelino's gist](https://gist.github.com/mcastelino/b8ce9a70b00ee56036dadd70ded53e9f). |
I'd love us to not have to link to a gist of @mcastelino - they just feel too ephemeral to me - any chance we can get that info into the kata repos/docs??
I skimmed over this. It looks like a good, very technical doc for us to have in Kata. @egernst - do you plan to clean it up and land it?
Commit needs a tweak: I suggest changing it to
| ## Background | ||
| With the 1.6 release of Kata Containers there are some issues with resource management resulting | ||
| in inconsistent behavior. This document describes the state of 1.6, and a suggested implementation |
We'll be releasing 1.8 soon so will this document still be relevant (since 1.6 won't be updated any further)?
@egernst - Do you think you'll get cycles to update this doc this week? Would be great to see this land.
Re-ping @egernst.
Let's use this document to drive the conversation around design changes to better manage Kata Containers' resources.