This repository was archived by the owner on May 12, 2021. It is now read-only.
Closed
393 changes: 393 additions & 0 deletions design/resource-management.md
@@ -0,0 +1,393 @@
# cgroup updates in Kata

* [Background](#background)
* [Existing Behavior (Kata 1.6):](#existing-behavior-kata-16)
Review comment (Contributor):
link seems broken

Review comment:
Once the commit message is updated and re-pushed, the CI should detect all these broken links (and tell you what they should be ;)

  + [Behavior Observed using various upper layer tools](#behavior-observed-using-various-upper-layer-tools)
    - [In Docker](#in-docker)
    - [In Kubernetes + Containerd](#in-kubernetes--containerd)
Review comment (Contributor):
this link also seems broken

      * [Where are all the vCPUs](#where-are-all-the-vcpus)
      * [Where are the v2-shim, QEMU, and Vhost processes](#where-are-the-v2-shim-qemu-and-vhost-processes)
Review comment (Contributor):
broken link?

    - [In Kubernetes + CRI-O (v1 shim)](#in-kubernetes--cri-o-v1-shim)
Review comment (Contributor):
broken link?

  + [Issues with current implementation](#issues-with-current-implementation)
    - [Accurate usage accounting](#accurate-usage-accounting)
    - [Node stability](#node-stability)
    - [Consistent guaranteed pod behavior](#consistent-guaranteed-pod-behavior)
    - [OOM, unbound CPU utilization](#oom-unbound-cpu-utilization)
Review comment (Contributor):
broken link?

* [Proposed Changes](#proposed-changes)
  + [Summary](#summary)
  + [Details](#details)
    - [Pod Sandbox Cgroup](#pod-sandbox-cgroup)
* [Alternatives Considered](#alternatives-considered)
  + [Only constrain vCPUs, leaving remaining threads for system reserved](#only-constrain-vcpus-leaving-remaining-threads-for-system-reserved)
Review comment (Contributor):
broken link?

* [Opens](#opens)
  + [Static CPU configurations](#static-cpu-configurations)
Review comment (Contributor):
start with Capital letter? [Static....]


## Background

With the 1.6 release of Kata Containers, there are some issues with resource management that result
in inconsistent behavior. This document describes the state of 1.6 and a suggested implementation
Review comment:
We'll be releasing 1.8 soon so will this document still be relevant (since 1.6 won't be updated any further)?

for the 1.7 version of Kata.

Before diving into the gaps and behavior exhibited in Kata Containers, it is important to have
a thorough understanding of how cgroups are leveraged by Kubernetes. It is both straightforward
and confusing. An in-depth guide is available for background in [mcastelino's gist](https://gist.github.com/mcastelino/b8ce9a70b00ee56036dadd70ded53e9f).
Review comment (Contributor):
remove extra-space?

Review comment (Contributor):
I'd love us to not have to link to a gist of @mcastelino - they just feel too ephemeral to me - any chance we can get that info into the kata repos/docs??

This may be good content to include in our repository eventually, or as part of the Kata blog series,
but let's leave it out of the scope of this initial documentation.

This document is part of the `cgroup-sprint` GitHub milestone, and can be observed in the milestone's
[GitHub project](https://github.com/orgs/kata-containers/projects/17).


## Existing Behavior (Kata 1.6):

### Behavior Observed using various upper layer tools

To exhibit current behavior, we utilize a simple guaranteed pod description:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: guar-runc
spec:
  containers:
  - name: cont-2cpu-400m
    image: busybox
    resources:
      limits:
        cpu: 2
        memory: "400Mi"
    command: ["md5sum"]
    args: ["/dev/urandom"]
  - name: cont-3cpu-200m
    image: busybox
    resources:
      limits:
        cpu: 3
        memory: "200Mi"
    command: ["md5sum"]
    args: ["/dev/urandom"]
```
We'll show the behavior starting with the simplest scenario, Docker, followed by containerd
and CRI-O.
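
As a rough sketch of what the CPU limits in this pod translate to on the host, the following converts each limit into a CFS quota. It assumes the default `cpu.cfs_period_us` of 100000 and whole-number CPU limits; the helper name `cpu_limit_to_quota` is hypothetical and is not part of Kata or Kubernetes.

```bash
# Hypothetical helper: convert a whole-number CPU limit into a CFS quota in
# microseconds. The period defaults to 100000 us (an assumption).
cpu_limit_to_quota() {
    echo $(( $1 * ${2:-100000} ))
}

# The two containers in the pod above are limited to 2 and 3 CPUs.
q1=$(cpu_limit_to_quota 2)
q2=$(cpu_limit_to_quota 3)
echo "container quotas: $q1 $q2"
echo "pod total quota:  $(( q1 + q2 ))"
```

These are the per-container and pod-level `cpu.cfs_quota_us` values to look for in the cgroup listings that follow.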

#### In Docker

Review comment:
We need to fill this as we follow a different cgroup setup here. Also Docker exposes a much richer resource API which some people may use. So we need to get docker right too.

We use a simple container: ```sudo docker run --cpus=2 --runtime=kata-qemu -it alpine sh```

For the example below, the containerID is `3cca57349adcfb17f917d2312b568faa85515e6d5a8327a30ea8e5ecd47b0e1f`.

The vCPU threads associated with the container are constrained in a cgroup created by Kata within the
Docker directory, with a name that matches the containerID. We see three vCPU threads running within this cgroup,
while the remaining QEMU threads sit within the `docker.service` cgroup under `system.slice`:

```
$ for c in `ps -aeT | grep qem | cut -c 9-14 `; do grep -ir $c . | grep task ; done
./system.slice/docker.service/tasks:82005
./system.slice/docker.service/tasks:82006
./docker/3cca57349adcfb17f917d2312b568faa85515e6d5a8327a30ea8e5ecd47b0e1f/tasks:82009
./docker/3cca57349adcfb17f917d2312b568faa85515e6d5a8327a30ea8e5ecd47b0e1f/tasks:82076
./docker/3cca57349adcfb17f917d2312b568faa85515e6d5a8327a30ea8e5ecd47b0e1f/tasks:82077
```


Similarly, the proxy and vhost threads sit within the `docker.service` cgroup under `system.slice`, while the kata-shim runs within the container's cgroup.

```bash
$ grep -r `ps -ae|grep kata-proxy | cut -c -6` . | grep task
./system.slice/docker.service/tasks:82011
$ grep -r `ps -ae|grep vhost | cut -c -6` . | grep task
./system.slice/docker.service/tasks:82008
```

```bash
$ grep -r `ps -ae|grep kata-shim | cut -c -6` . | grep task
./docker/3cca57349adcfb17f917d2312b568faa85515e6d5a8327a30ea8e5ecd47b0e1f/tasks:82085
```

The Docker `system.slice` is unconstrained. The created cgroup is constrained based on the workload
definition (2 CPUs).

```bash
/sys/fs/cgroup/cpu,cpuacct/docker$ cat cpu.shares cpu.cfs_quota_us
1024
-1
```

```bash
/sys/fs/cgroup/cpu,cpuacct/docker$ cat */cpu.shares */cpu.cfs_quota_us
1024
200000
```

This is interesting, given that there are three vCPUs (one default, plus two requested
CPUs). Since today *only* the vCPU threads and the kata-shim are within the container's cgroup,
this is adequate. The remaining threads (vhost, kata-proxy, QEMU itself) run unconstrained
on the host.
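
To make the arithmetic behind these values explicit, here is a minimal sketch that expresses a `cpu.cfs_quota_us` value as a CPU count. It assumes the default period of 100000 µs; the helper name `quota_to_cpus` is hypothetical.

```bash
# Hypothetical helper: express a CFS quota/period pair as a CPU count.
# A quota of -1 means no CFS limit is enforced on the cgroup.
quota_to_cpus() {
    quota_us=$1
    period_us=${2:-100000}
    if [ "$quota_us" -lt 0 ]; then
        echo "unlimited"
    else
        # Integer division only; fractional limits would need awk or bc.
        echo $(( quota_us / period_us ))
    fi
}

quota_to_cpus -1       # the parent docker cgroup shown above
quota_to_cpus 200000   # the container cgroup created for --cpus=2
```

In other words, the three vCPU threads together are capped at two CPUs' worth of run time per period.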

All processes are left unconstrained for memory.
```
/sys/fs/cgroup/memory/docker$ cat */memory.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/docker$ cat memory.limit_in_bytes
9223372036854771712
```
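
The value `9223372036854771712` is how an unset limit is reported here. A small sketch for recognizing it, assuming a 64-bit host with 4 KiB pages (where the reported value is the largest signed 64-bit integer rounded down to a page boundary); the helper name `is_memory_unlimited` is hypothetical:

```bash
# Hypothetical helper: succeed when a cgroup v1 memory.limit_in_bytes value
# means "no limit". Assumes a 64-bit host with 4 KiB pages.
is_memory_unlimited() {
    unlimited=$(( 9223372036854775807 / 4096 * 4096 ))
    [ "$1" -eq "$unlimited" ]
}

if is_memory_unlimited 9223372036854771712; then
    echo "no memory limit set"
fi
```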


#### In Kubernetes + Containerd

For each container in the pod, a cgroup is created within the pod cgroup (i.e., under `/sys/fs/cgroup/*/kubepods/pod.*/`
for a guaranteed pod). This is not necessary; only a single cgroup which constrains the hypervisor
appropriately is required.

Review comment:
There is a cgroup for the pod itself. Within the pod's cgroup, child cgroups are set up: one for the pause container and one for each container in the pod. Only the actual container processes are placed in these cgroups. The container framework processes, like the shim, are placed under the containerd systemd slice cgroup and as such are accounted for. This can be a potential issue if the framework ends up consuming resources (for example, logs and stdio).

##### Where are all the vCPUs

The vCPUs are placed under the pause container:
Review comment:
Kata tries to set up the pause cgroup to represent the VM itself. It scales its resources to match the total resources assigned to the pod.


```bash
# for i in `ls pod*/**/tasks`; do echo $i && for j in `cat $i`; do pstree -pt $j;done; done;
podf277e232-5ca6-11e9-b514-000d3a6d0876/2692eaedb55f8cfd1b9aadcbc5e3f0ac527cb39ff26d31877f1be5a495b966c1/tasks
podf277e232-5ca6-11e9-b514-000d3a6d0876/6689f72eef2161b85d5d57cb9f4670ae4e08f551d9aeb4b28efb67eb306034d8/tasks
podf277e232-5ca6-11e9-b514-000d3a6d0876/9d17d1c5d075ca42dfefb58ef7b82c8b1b234cc128ebf9332b902b866c0ebed7/tasks
{qemu-system-x86}(24011)
{qemu-system-x86}(24093)
{qemu-system-x86}(24097)
{qemu-system-x86}(24156)
{qemu-system-x86}(24157)
{qemu-system-x86}(24158)
```

In this case, `9d17d1` is the pause container, as its `cpu.cfs_quota_us` is the sum of the other two containers' quotas:
```bash
# cat */cpu.cfs_quota_us
300000
200000
500000
```
One drawback of this is that it assumes the existence of a pause container. This is an assumption
based on the implementation of containerd/CRI-O. On the plus side, the cgroup is placed directly under
the pod cgroup, which is created and managed by the kubelet. Overall, this isn't terrible.
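
The identification of the pause cgroup above relies on its quota being the sum of its siblings' quotas. A minimal sketch of that check, taking the `cpu.cfs_quota_us` values as arguments (the helper name `find_sandbox_quota` is hypothetical):

```bash
# Hypothetical helper: given the cpu.cfs_quota_us values of all sibling
# container cgroups, print the one that equals the sum of the others.
find_sandbox_quota() {
    total=0
    for q in "$@"; do
        total=$(( total + q ))
    done
    for q in "$@"; do
        # q equals the sum of the others exactly when 2 * q == total.
        if [ $(( q * 2 )) -eq "$total" ]; then
            echo "$q"
            return 0
        fi
    done
    return 1
}

find_sandbox_quota 300000 200000 500000
```

This only works while the summation convention holds; it is a way to read the listing, not something Kata itself does.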

Review comment:
However, CPU resources are proportional. Hence, if any of the sibling cgroups consume resources, the summation logic does not work.

##### Where are the v2-shim, QEMU, and Vhost processes

As shown below, all of these are placed under the `containerd.service` cgroup in `system.slice`. For `containerd-shim-kata-v2`
this isn't a major concern, as it is not expected to consume many resources, and it is closely
coupled to containerd.

QEMU itself and its vhost threads are very problematic. Depending on the workload, these can consume a
non-negligible amount of resources. Note, in the Kata implementation, these components are purposefully
not added to the constrained cgroup, the pause container. As a result, they fall under the caller's
Review comment (Member):
you mean "not added to the pod cgroup"?

cgroup, which in this case is the containerd service.

The process location is determined as follows:

v2-shim:
```bash
root@kata-k8s-containerd:/sys/fs/cgroup/cpu,cpuacct# ps -aef | grep containerd-shim-kata | grep -v grep

root 23992 1 0 22:13 ? 00:00:00 /opt/kata/bin/containerd-shim-kata-v2 -namespace k8s.io -address /run/containerd/containerd.sock -publish-binary /usr/local/bin/containerd -id 9d17d1c5d075ca42dfefb58ef7b82c8b1b234cc128ebf9332b902b866c0ebed7 -debug
root@kata-k8s-containerd:/sys/fs/cgroup/cpu,cpuacct# grep -ir 23992
system.slice/containerd.service/cgroup.procs:23992
system.slice/containerd.service/tasks:23992
```

qemu:
```bash
root@kata-k8s-containerd:/sys/fs/cgroup/cpu,cpuacct# ps -ae | grep qemu
24007 ? 00:37:26 qemu-system-x86

root@kata-k8s-containerd:/sys/fs/cgroup/cpu,cpuacct# grep -ir 24007
kubepods/podf277e232-5ca6-11e9-b514-000d3a6d0876/9d17d1c5d075ca42dfefb58ef7b82c8b1b234cc128ebf9332b902b866c0ebed7/cgroup.procs:24007
system.slice/containerd.service/cgroup.procs:24007
system.slice/containerd.service/tasks:24007
```

vhost:
```bash
root@kata-k8s-containerd:/sys/fs/cgroup/cpu,cpuacct# ps -aef | grep vhost | grep -v qemu | grep -v grep
root 24010 2 0 22:13 ? 00:00:00 [vhost-24007]

root@kata-k8s-containerd:/sys/fs/cgroup/cpu,cpuacct# grep -ir 24010
system.slice/containerd.service/cgroup.procs:24010
system.slice/containerd.service/tasks:24010
```
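
Instead of grepping the whole cgroup tree for a PID, the same information can be read from `/proc/<pid>/cgroup`, whose cgroup v1 lines have the form `hierarchy-ID:controller-list:path`. The sketch below parses a file in that format; the helper name `cgroup_path_for` is hypothetical, and it assumes cgroup paths that contain no `:` characters.

```bash
# Hypothetical helper: print the cgroup v1 path of one controller from a file
# in /proc/<pid>/cgroup format (hierarchy-ID:controller-list:path).
cgroup_path_for() {
    controller=$1
    cgroup_file=$2
    awk -F: -v want="$controller" '
        {
            # The second field may list several controllers, e.g. "cpu,cpuacct".
            n = split($2, names, ",")
            for (i = 1; i <= n; i++)
                if (names[i] == want) { print $3; exit }
        }' "$cgroup_file"
}

# Example usage against live processes (PIDs taken from the output above):
#   cgroup_path_for cpu /proc/24007/cgroup              # QEMU main process
#   cgroup_path_for cpu /proc/24007/task/24093/cgroup   # one QEMU thread
```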

#### In Kubernetes + CRI-O (v1 shim)

CRI-O is very similar to containerd, except for where the non-constrained processes end up. Instead
of being called by CRI-O directly, kata-runtime is called from a process, `conmon`, which is located
in a cgroup under the pod cgroup. As expected based on prior examples, cgroups are created for each
container, and the QEMU vCPU threads are placed within the pause container's cgroup.

```bash
pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-1b05886a39901ef3a7555796d38dcfaafd8fda929aef223ea576324a4949f9ef/tasks
{qemu-system-x86}(18061)
kata-shim(18207)─┬─{kata-shim}(18213)
                 ├─{kata-shim}(18215)
                 ├─{kata-shim}(18216)
                 ├─{kata-shim}(18217)
                 ├─{kata-shim}(18218)
                 ├─{kata-shim}(18219)
                 ├─{kata-shim}(18220)
                 ├─{kata-shim}(18221)
                 ├─{kata-shim}(18223)
                 ├─{kata-shim}(18224)
                 └─{kata-shim}(18226)
{kata-shim}(18213)
{kata-shim}(18215)
{kata-shim}(18216)
{kata-shim}(18217)
{kata-shim}(18218)
{kata-shim}(18219)
{kata-shim}(18220)
{kata-shim}(18221)
{kata-shim}(18223)
{kata-shim}(18224)
{kata-shim}(18226)
{qemu-system-x86}(18280)
{qemu-system-x86}(18281)
{qemu-system-x86}(18368)
{qemu-system-x86}(18369)
{qemu-system-x86}(18370)
```

The QEMU, vhost, proxy and shim threads, however, are placed under the caller's cgroup, which in this case
is `conmon`, which is a peer of the container cgroups we created. So, the good news is that QEMU, vhost,
etc, are constrained within the pod's cgroup. The bad news is these will be constrained based on the values
associated with conmon:
Review comment:
This is also wrong, as conmon should be constrained with the absolute minimum resources it can get away with. Typically this is set to 2 (i.e., so that it does not impact the container scheduling).


```bash
pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-conmon-1b05886a39901ef3a7555796d38dcfaafd8fda929aef223ea576324a4949f9ef/tasks
conmon(18040)─┬─kata-proxy(18063)─┬─{kata-proxy}(18070)
              │                   ├─{kata-proxy}(18071)
              │                   ├─{kata-proxy}(18072)
              │                   ├─{kata-proxy}(18073)
              │                   ├─{kata-proxy}(18074)
              │                   ├─{kata-proxy}(18075)
              │                   ├─{kata-proxy}(18076)
              │                   ├─{kata-proxy}(18077)
              │                   ├─{kata-proxy}(18286)
              │                   ├─{kata-proxy}(20199)
              │                   ├─{kata-proxy}(20200)
              │                   ├─{kata-proxy}(20201)
              │                   ├─{kata-proxy}(20202)
              │                   ├─{kata-proxy}(20203)
              │                   └─{kata-proxy}(28490)
              ├─kata-shim(18207)─┬─{kata-shim}(18213)
              │                  ├─{kata-shim}(18215)
              │                  ├─{kata-shim}(18216)
              │                  ├─{kata-shim}(18217)
              │                  ├─{kata-shim}(18218)
              │                  ├─{kata-shim}(18219)
              │                  ├─{kata-shim}(18220)
              │                  ├─{kata-shim}(18221)
              │                  ├─{kata-shim}(18223)
              │                  ├─{kata-shim}(18224)
              │                  └─{kata-shim}(18226)
              ├─qemu-system-x86(18058)─┬─{qemu-system-x86}(18059)
              │                        ├─{qemu-system-x86}(18061)
              │                        ├─{qemu-system-x86}(18280)
              │                        ├─{qemu-system-x86}(18281)
              │                        ├─{qemu-system-x86}(18368)
              │                        ├─{qemu-system-x86}(18369)
              │                        └─{qemu-system-x86}(18370)
              └─{gmain}(18042)
{gmain}(18042)
qemu-system-x86(18058)─┬─{qemu-system-x86}(18059)
                       ├─{qemu-system-x86}(18061)
                       ├─{qemu-system-x86}(18280)
                       ├─{qemu-system-x86}(18281)
                       ├─{qemu-system-x86}(18368)
                       ├─{qemu-system-x86}(18369)
                       └─{qemu-system-x86}(18370)
{qemu-system-x86}(18059)
vhost-18058(18060)
```

Two things should happen here:
* Work with CRI-O to determine a more appropriate CPU shares setting for conmon, to avoid impacting the container
  cgroups (in the case of runc) or the hypervisor's cgroup (in the case of Kata). See <ADD ISSUE LINK HERE>
* Do not place our I/O threads, shim, proxy, and QEMU process in the conmon cgroup.

### Issues with current implementation

There are a few major issues that result from the current implementation; these motivate the design
changes.

#### Accurate usage accounting
The I/O processing should be charged to the pod performing the I/O, not against the system. Unless these
threads are placed in the same cgroup hierarchy as the pod, this is not feasible.

#### Node stability

QEMU and its I/O threads consume a non-negligible amount of resources. If the memory and CPU they use are not
constrained, measured, and accounted for, the node will run into CPU and memory pressure unexpectedly.

#### Consistent guaranteed pod behavior

Predictable performance is important for end users. By pushing IO threads into a shared pool, the
achievable performance will be inconsistent. Even if a user utilizes a `guaranteed` QoS pod, the
performance profile will differ depending on the amount of contention on the system. Raw unconstrained
performance is important for Kata, but not as important as consistent and predictable behavior.

#### OOM, unbound CPU utilization

Memory limits are enforced, not requests. Until the [Pod Overhead KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/20190226-pod-overhead.md)
is implemented, users or admission controllers need to provide higher limits.

In the case of CRI-O, the memory is charged to conmon, which is bounded by pod limits. As a result,
the workload can still be OOM-killed. This is okay from a node stability point of view, but not from a pod stability
point of view. This behavior assumes we are called by conmon and that it is constrained appropriately. Luckily,
this is reasonably correct from a memory point of view. I/O-bound workloads will exhibit sub-optimal performance
due to the CPU constraints applied to conmon (where the I/O threads run).

For containerd, the memory is being charged to containerd, which is basically unbounded. This is bad for
node stability, as the `pod` is essentially unbounded.

## Proposed Changes

### Summary
* Pause cgroup CPU shares should be set up correctly.
* Do not create container cgroups on the host. Instead, create a pod sandbox cgroup that is entirely managed by Kata.
Review comment (Member):
Would be good to describe why: This is related with the cadvisor issue right ?

Review comment:
No. This is because it is not for the pause process; it is actually for the VM. Hence we cannot just add up the values and apply them to pause. If we create the container cgroups and something runs in them, then because CPU cgroups are proportional, we will get the wrong behavior. So it is equally important not to create the container cgroups.
The conmon cgroup will be really tiny and hence should not affect the sandbox (like how pause is set up by runc).

* Move all of the Kata threads (vCPU, shimv2, kata-shim, kata-proxy, vhost, etc), not just vCPU threads, into the sandbox cgroup

With these changes, performance and constraints for a pod are consistent. This change is
more restrictive than the existing design.
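
Moving every thread of a process, rather than only the vCPU threads, could be sketched as follows for cgroup v1, where writing a thread ID to a cgroup's `tasks` file moves that thread. The helper name `move_threads` and the example sandbox path are assumptions for illustration, not the Kata implementation.

```bash
# Hypothetical helper: write every thread ID found under a /proc/<pid>/task
# style directory to a cgroup v1 "tasks" file, one write per thread.
move_threads() {
    task_dir=$1
    tasks_file=$2
    for t in "$task_dir"/*; do
        [ -d "$t" ] || continue
        # The directory name is the thread ID.
        echo "${t##*/}" >> "$tasks_file"
    done
}

# Illustrative only; the sandbox cgroup path is an assumption:
#   move_threads /proc/24007/task /sys/fs/cgroup/cpu/kubepods/<pod>/kata-sandbox/tasks
```

Note that the vhost kernel thread in the earlier `ps` output has its own PID and is not listed under QEMU's `task` directory, so it would have to be moved separately.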

The overheads associated with running a sandbox should be accounted for explicitly, and at the pod level.
Once the Pod Overhead KEP is available, this should become a part of RuntimeClass, applied to pods which
utilize the applicable RuntimeClass. See [Pod Overhead KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/20190226-pod-overhead.md)
for more details.
Review comment (Member):
Yes! We really need pod overhead to let kubernetes be aware of kata.


### Details

#### Pod Sandbox Cgroup
* The pod sandbox cgroup should always be the summation of all container cgroup resources.
* In the case of CRI-O, where other cgroups are created for conmon, they will be siblings.
* The conmon cgroups today are rather large, so if conmon goes wild there is a possibility
  the workload will get fewer resources. But it will not introduce any other side effects.
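
For CPU shares, the summation could be sketched as below. This assumes the kubelet convention of 1024 shares per CPU with a floor of 2 shares, and takes container CPU values in millicores; both helper names are hypothetical.

```bash
# Hypothetical helper: convert millicores to cpu.shares, assuming 1024 shares
# per CPU and a minimum of 2 shares.
milli_cpu_to_shares() {
    shares=$(( $1 * 1024 / 1000 ))
    if [ "$shares" -lt 2 ]; then
        shares=2
    fi
    echo "$shares"
}

# Hypothetical helper: sum the shares of all containers in the pod to get the
# value for the sandbox cgroup.
sandbox_shares() {
    total=0
    for m in "$@"; do
        total=$(( total + $(milli_cpu_to_shares "$m") ))
    done
    echo "$total"
}

sandbox_shares 2000 3000   # the two containers from the example pod
```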

## Alternatives Considered

### Only constrain vCPUs, leaving remaining threads for system reserved

This will, in some cases, provide improved performance. Utilizing system reserved does not scale, though.
If the QEMU main thread and I/O threads are placed here, unexpected failures could occur on a loaded system
with enforced constraints.

## Opens

### Static CPU configurations

If static CPU policies are introduced, the end user will assign CPUs to a specific container within the pod. Running
I/O threads on these CPUs may not be desirable, given the user's expectations.

Long term (i.e., with RuntimeClass augmented to handle pod overheads), we should create a separate `cpuset` cgroup,
Review comment:
The cpuset cgroup is separate from CPU cgroup. Also upstream kubernetes is planning to eliminate the cfs quotas for containers with cpusets. So we really cannot place QEMU outside the cpuset.
kubernetes/kubernetes#70585

`kata-sandbox-vcpus`, alongside the standard sandbox cgroup, `kata-sandbox`. These would be siblings underneath the
Review comment (Member):
kata-sandbox-<sandbox-id>: In the case of Docker containers there is no pod-level cgroup, so if we use the same ID they will all end up in the same cgroup.

Review comment (Member):
Where should this new set of cgroups be placed?

kata-managed: /sys/fs/cgroup/{subsystem}/kata?
parent based: sandboxContainer.CgroupPath.GetParent()

Review comment (Member, Author):
@jcvenegas, these would need to be under the parent (the appropriate hierarchical location; where the caller expects it to be). In the case of Kubernetes, under the pod-*.

I'm not sure we need sandbox-id in the naming; which sandbox it is associated with is determined by its hierarchical location.

Review comment:
@jcvenegas do we need the sandbox id? Can we not call it just sandbox in the case of crio and containerd with kubernetes.

@egernst In case of docker who creates the cgroup /docker/3cca57349adcfb17f917d2312b568faa85515e6d5a8327a30ea8e5ecd47b0e1f

Can we be under that.

pod cgroup, in the Kubernetes case. vCPU threads will be placed under `kata-sandbox-vcpus`, which will be updated
to use the cpuset suggested for the workload. The remaining threads will be placed under `kata-sandbox`, utilizing
the remaining non-claimed CPUs (problem: is this even possible to determine?). The CPU cgroups will be managed
as normal. The non-vCPU threads will be limited to the CPU utilization provided by the pod overhead, in this case.

In the short term, non-vCPU threads will need to share the cpuset, and the end-user will need to add additional
CPUs for overhead, if desired.
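
If the set of claimed CPUs were known (which is the open question above), computing the remaining CPUs for `kata-sandbox` is a set difference over cpuset lists. The sketch below assumes both sets are available in cpuset list format such as `0-3,6`; the helper names are hypothetical.

```bash
# Hypothetical helper: expand a cpuset list such as "0-3,6" into one CPU
# number per line.
expand_cpuset() {
    echo "$1" | tr ',' '\n' | while IFS=- read -r lo hi; do
        [ -n "$lo" ] || continue
        # A single CPU such as "6" has no upper bound; treat it as "6-6".
        seq "$lo" "${hi:-$lo}"
    done
}

# Hypothetical helper: print the CPUs in $1 that are not claimed in $2, as a
# comma-separated list.
unclaimed_cpus() {
    claimed=$(expand_cpuset "$2")
    expand_cpuset "$1" | while read -r cpu; do
        echo "$claimed" | grep -qx "$cpu" || echo "$cpu"
    done | paste -sd, -
}

unclaimed_cpus 0-7 2-3,6
```

How Kata would learn which CPUs other pods have claimed is not addressed by this sketch.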