Skip to content
This repository was archived by the owner on May 12, 2021. It is now read-only.

resource management: document current status, suggested fixes#428

Closed
egernst wants to merge 4 commits into
kata-containers:masterfrom
egernst:cgroup-fixing
Closed

resource management: document current status, suggested fixes#428
egernst wants to merge 4 commits into
kata-containers:masterfrom
egernst:cgroup-fixing

Conversation

@egernst
Copy link
Copy Markdown
Member

@egernst egernst commented Apr 12, 2019

Let's use this document to drive the conversation around design changes to better manage Kata Containers' resources.

@egernst egernst requested a review from a team as a code owner April 12, 2019 23:12
@egernst egernst added this to the cgroup-sprint milestone Apr 12, 2019
Signed-off-by: Eric Ernst <eric.ernst@intel.com>
Signed-off-by: Eric Ernst <eric.ernst@intel.com>
@egernst egernst force-pushed the cgroup-fixing branch 3 times, most recently from e2153ef to 367d5c6 Compare April 12, 2019 23:51
Signed-off-by: Eric Ernst <eric.ernst@intel.com>
Copy link
Copy Markdown
Member

@bergwolf bergwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with all the bits especially the point that we just need a sandbox cgroup instead of per-container cgroups. Thanks for putting it together!


QEMU itself and its vhost threads are very problematic. Depending on the workload, these can consume a
non-negligible amount of resources. Note, in the Kata implementation, these components are purposefuly
not added to the constrained cgroup, the pause container. As a result, they fall under the caller's
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean "not added to the pod cgroup"?

Comment thread design/resource-management.md Outdated
### Summary
* Pause cgroup cpu shaes should be setup correctly.
* Do not create container cgroups on the host. Instead, create a pod sandbox group that is entirely managed by Kata
* Move the QEMU threads into the sandbox cgroup
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plus shimv2, kata-shim/proxy, and vhost threads? I think they all belong to the sandbox cgroup.

The overheads associated with running a sandbox should be accounted for explicitly, and at the pod level.
Once the Pod Overhead KEP is available, this should become a part of RuntimeClass, applied to pods which
utilize the applicable RuntimeClass. See [Pod Overhead KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/20190226-pod-overhead.md)
for more details.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! We really need pod overhead to let kubernetes be aware of kata.

Comment thread design/resource-management.md Outdated

### Only constrain vCPUs, leaving remaining threads for system reserved

This will, in some cases, provide improved performance.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with system reserved is that it does not scale. So I don't think we should put scaling factors (qemu main thread, vhost etc.) there and it can result in unexpected system failures due to enforced constraints in system reserved.

@bergwolf
Copy link
Copy Markdown
Member

@egernst The other part of the cgroups handling is how we handle them in the guest. Do you think we should include it in this document?

and CRI-O.

#### In Docker

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to fill this as we follow a different cgroup setup here. Also Docker exposes a much richer resource API which some people may use. So we need to get docker right too.

For each container in the pod, a cgroup is created within the pod cgroup (ie, under `/sys/fs/cgroup/*/kubepod/pod.*/`
for a guaranteed pod). This is not necessary; only a single cgroup which constrains the hypervisor
appropriately is required.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a cgroup for the pod itself. Within the cgroup of the pod child cgroups are setup. There is one for the pause container and one each for each container in the pod. Only the actual container processes are placed in these cgroups. The container framework processes like shim are placed under containerd systemd slice cgroup and as such are accounted for. This can be a potential issue is the framework ends up consuming resources. example logs, stdio.


##### Where are all the vCPUS

The vCPUs are placed under the pause container:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kata tries to setup the pause cgroup to represent the VM itself. It scales its resources to match the total resources assigned to the pod.

One drawback of this is that it assumes the existence of a pause container. This is an assumption
based on the implementation of containerd/cri-o. On the plus side, the cgroup is placed directly under
the pod cgroup, which is created and managed by Kubelet. Overall, this isn't terrible.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However CPU resources are proportional. Hence if any of the sibling cgroups consume resources the summation logic does not work.

Comment thread design/resource-management.md Outdated

CRI-O is very similar the containerd, except for where the non-constrained processes end up. Instead
of being called by CRIO directly, kata-runtime is called from a process `conmon`, which is located
in a cgroup under the pod-cgroup. As expected based on prior exapmles, cgroups are ceated for each
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typos here.

The QEMU, vhost, proxy and shim threads, however, are placed under the caller's cgroup, which in this case
is `conmon`, which is a peer of the container cgroups we created. So, the good news is that QEMU, vhost,
etc, are constrained within the pod's cgroup. The bad news is these will be constrained based on the values
associated with conmon:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also wrong as conmon should be constrained with the absolute minmum resources it can get away with. Typically this is set to 2 (i.e. so that it does not impact the container scheduling).

If static CPU policies are introduced, the end user will assign CPUs to a specific container within the pod. Running
IO threads on this CPU may not necessarily be desirable, compared to the users expectations.

Long term (ie, with RuntimeClass augmented to handle pod overheads), we should create a seperate `cpuset` cgroup,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cpuset cgroup is separate from CPU cgroup. Also upstream kubernetes is planning to eliminate the cfs quotas for containers with cpusets. So we really cannot place QEMU outside the cpuset.
kubernetes/kubernetes#70585

IO threads on this CPU may not necessarily be desirable, compared to the users expectations.

Long term (ie, with RuntimeClass augmented to handle pod overheads), we should create a seperate `cpuset` cgroup,
`kata-sandbox-vcpus`, alongside the standard sandbox cgroup, `kata-sandbox`. These would be siblings underneath the
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kata-sandbox-<sandbox-id>: In the case of docker containers there is not a pod level cgroup so if we use the same id they will endup all in the same cgroup.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where should be placed this new set of cgroups?

kata-maneged: /sys/fs/cgroup/{subsystem}/kata?
parent based: sandboxContainer.CgroupPath.GetParent()

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jcvenegas, these would need to be under parent (appropraite hierarchical location; where the caller expects it to be). In case of kubernetes, under the pod-*.

I'm not sure we need sandbox-id in the naming; determining which sandbox it is associate with is determined by its hierarchical location.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jcvenegas do we need the sandbox id? Can we not call it just sandbox in the case of crio and containerd with kubernetes.

@egernst In case of docker who creates the cgroup /docker/3cca57349adcfb17f917d2312b568faa85515e6d5a8327a30ea8e5ecd47b0e1f

Can we be under that.


### Summary
* Pause cgroup cpu shaes should be setup correctly.
* Do not create container cgroups on the host. Instead, create a pod sandbox group that is entirely managed by Kata
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to describe why: This is related with the cadvisor issue right ?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. This is because it is not for the pause process. It is actually for the VM. Hence we cannot just add up and apply to pause. If we create the container cgroup and something runs in them as cgroups in the case of cpu is proportional we will get wrong behavior. So it is equally important not to create the container cgroups.
The conmon cgroup will be really tiny and hence should not effect the sandbox. (like how pause is setup by runc)

@jcvenegas
Copy link
Copy Markdown
Member

jcvenegas commented Apr 16, 2019

PR to move kata assets to its sandbox level cgroup based in sandbox cgroup container. See: kata-containers/runtime#1522

Signed-off-by: Eric Ernst <eric.ernst@intel.com>
# cgroup updates in Kata

* [Background](#background)
* [Existing Behavior (Kata 1.6):](#existing-behavior--kata-16--)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

link seems broken

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once the commit message is updated and re-pushed, the CI should detect all these broken links (and tell you what they should be ;)

* [Existing Behavior (Kata 1.6):](#existing-behavior--kata-16--)
+ [Behavior Observed using various upper layer tools](#behavior-observed-using-various-upper-layer-tools)
- [In Docker](#in-docker)
- [In Kubernetes + Containerd](#in-kubernetes---containerd)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this link also seems broken

- [In Docker](#in-docker)
- [In Kubernetes + Containerd](#in-kubernetes---containerd)
* [Where are all the vCPUS](#where-are-all-the-vcpus)
* [Where are the v2-shim, QEMU, and Vhost processes](#where-are-the-v2-shim--qemu--and-vhost-processes)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

broken link?

- [In Kubernetes + Containerd](#in-kubernetes---containerd)
* [Where are all the vCPUS](#where-are-all-the-vcpus)
* [Where are the v2-shim, QEMU, and Vhost processes](#where-are-the-v2-shim--qemu--and-vhost-processes)
- [In Kubernetes + CRI-O (v1 shim)](#in-kubernetes----cri-o--v1-shim-)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

broken link?

- [Accurate usage accounting](#accurate-usage-accounting)
- [Node stability](#node-stability)
- [Consistent guaranteed pod behavior](#consistent-guaranteed-pod-behavior)
- [OOM, unbound CPU utilization](#oom--unbound-cpu-utilization)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

broken link?

+ [Details](#details)
- [Pod Sandbox Cgroup](#pod-sandbox-cgroup)
* [Alternatives Considered](#alternatives-considered)
+ [Only constrain vCPUs, leaving remaining threads for system reserved](#only-constrain-vcpus--leaving-remaining-threads-for-system-reserved)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

broken link?

* [Alternatives Considered](#alternatives-considered)
+ [Only constrain vCPUs, leaving remaining threads for system reserved](#only-constrain-vcpus--leaving-remaining-threads-for-system-reserved)
* [Opens](#opens)
+ [static CPU configurations](#static-cpu-configurations)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

start with Capital letter? [Static....]


Before diving into the gaps and behavior exhibited in Kata Containers, it is important to have
a thorough understanding of how cgroups are leveraged by Kubernetes. It is both straight forward
and confusing. An in-depth guide is available for background in [mcastelino's gist](https://gist.github.com/mcastelino/b8ce9a70b00ee56036dadd70ded53e9f).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove extra-space?

@jodh-intel
Copy link
Copy Markdown

@egernst - you could try running kata-containers/tests#1542 over this PR as a useful test as it should detect the issues (and suggest fixes where possible).

@raravena80
Copy link
Copy Markdown
Member

@egernst any updates? Thx!


Before diving into the gaps and behavior exhibited in Kata Containers, it is important to have
a thorough understanding of how cgroups are leveraged by Kubernetes. It is both straight forward
and confusing. An in-depth guide is available for background in [mcastelino's gist](https://gist.github.com/mcastelino/b8ce9a70b00ee56036dadd70ded53e9f).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd love us to not have to link to a gist of @mcastelino - they just feel too ephemeral to me - any chance we can get that info into the kata repos/docs??

@grahamwhaley
Copy link
Copy Markdown
Contributor

I skimmed over this - this looks like a good very technical doc. for us to have in Kata @egernst - do you plan to clean/land it?

@caoruidong
Copy link
Copy Markdown
Member

@egernst

@jodh-intel
Copy link
Copy Markdown

Commit needs a tweak:

ERROR: Commit 82e4800b2796bd3760851f0892c10ad60618ac7b: Failed to find subsystem in subject: "resource management: document current status, suggested fixes"
ERROR: checkcommits failed. See the document below for help on formatting
commits for the project.
https://github.com/kata-containers/community/blob/master/CONTRIBUTING.md#patch-format

I suggest changing it to

resource-management: document current status, suggested fixes.

# cgroup updates in Kata

* [Background](#background)
* [Existing Behavior (Kata 1.6):](#existing-behavior--kata-16--)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once the commit message is updated and re-pushed, the CI should detect all these broken links (and tell you what they should be ;)

## Background

With 1.6 release of Kata Containers there are some issues with resource management resulting
in inconsistent behavior. This document descibes the state of 1.6, and a suggested implementation
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll be releasing 1.8 soon so will this document still be relevant (since 1.6 won't be updated any further)?

@jodh-intel
Copy link
Copy Markdown

@egernst - Do you think you'll get cycles to update this doc this week? Would be great to see this land.

@jodh-intel
Copy link
Copy Markdown

Re-ping @egernst.

@egernst egernst closed this Sep 11, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants