From 82e4800b2796bd3760851f0892c10ad60618ac7b Mon Sep 17 00:00:00 2001 From: Eric Ernst Date: Fri, 12 Apr 2019 16:17:54 -0700 Subject: [PATCH 1/4] resource management: document current status, suggested fixes Signed-off-by: Eric Ernst --- design/resource-management.md | 663 ++++++++++++++++++++++++++++++++++ 1 file changed, 663 insertions(+) create mode 100644 design/resource-management.md diff --git a/design/resource-management.md b/design/resource-management.md new file mode 100644 index 00000000..d5669967 --- /dev/null +++ b/design/resource-management.md @@ -0,0 +1,663 @@ +# cgroup updates in Kata + +## Background + +With 1.6 release of Kata Containers there are some issues with resource management resulting in inconsistent +behavior. This document descibes the state of 1.6, and a suggested implementation for 1.7 version of Kata. + +## Existing Behavior (Kata 1.6): + +### In Docker +### In CRI-O +### In Containerd + + +## Proposed Changes + +### Alternatives Considered + + +## Opens + + +Cgroup Updates in Kata + + + +# Backup / to be handled: + + +## Containerd Handling Today + +The hierarchy and cgroup handling seems pragmatic in the case of containerd. The +container cgroups are currently placed under the podcgroup. 
+ +Output from containerd guaranteed pod, with two containers: + +``` +kata@kata-k8s-containerd:/sys/fs/cgroup/cpu,cpuacct/kubepods$ tree pode3bb46f0-5c99-11e9-903b-000d3a6d0876/ +pode3bb46f0-5c99-11e9-903b-000d3a6d0876/ +├── 47c6122142c76db9f2929c23e8693c641559cb6588fbf5e2f5c21623f5af2fd1 +│   ├── cgroup.clone_children +│   ├── cgroup.procs +│   ├── cpu.cfs_period_us +│   ├── cpu.cfs_quota_us +│   ├── cpu.shares +│   ├── cpu.stat +│   ├── cpuacct.stat +│   ├── cpuacct.usage +│   ├── cpuacct.usage_all +│   ├── cpuacct.usage_percpu +│   ├── cpuacct.usage_percpu_sys +│   ├── cpuacct.usage_percpu_user +│   ├── cpuacct.usage_sys +│   ├── cpuacct.usage_user +│   ├── notify_on_release +│   └── tasks +├── a18373fc91cf91d9b362a536afaecd0712a858bb992777d956948911a6d3b248 +│   ├── cgroup.clone_children +│   ├── cgroup.procs +│   ├── cpu.cfs_period_us +│   ├── cpu.cfs_quota_us +│   ├── cpu.shares +│   ├── cpu.stat +│   ├── cpuacct.stat +│   ├── cpuacct.usage +│   ├── cpuacct.usage_all +│   ├── cpuacct.usage_percpu +│   ├── cpuacct.usage_percpu_sys +│   ├── cpuacct.usage_percpu_user +│   ├── cpuacct.usage_sys +│   ├── cpuacct.usage_user +│   ├── notify_on_release +│   └── tasks +├── cgroup.clone_children +├── cgroup.procs +├── cpu.cfs_period_us +├── cpu.cfs_quota_us +├── cpu.shares +├── cpu.stat +├── cpuacct.stat +├── cpuacct.usage +├── cpuacct.usage_all +├── cpuacct.usage_percpu +├── cpuacct.usage_percpu_sys +├── cpuacct.usage_percpu_user +├── cpuacct.usage_sys +├── cpuacct.usage_user +├── f90eb972d44634f85726552a818578c4f28085f9a2f755fb8b18c402fe9cef6d +│   ├── cgroup.clone_children +│   ├── cgroup.procs +│   ├── cpu.cfs_period_us +│   ├── cpu.cfs_quota_us +│   ├── cpu.shares +│   ├── cpu.stat +│   ├── cpuacct.stat +│   ├── cpuacct.usage +│   ├── cpuacct.usage_all +│   ├── cpuacct.usage_percpu +│   ├── cpuacct.usage_percpu_sys +│   ├── cpuacct.usage_percpu_user +│   ├── cpuacct.usage_sys +│   ├── cpuacct.usage_user +│   ├── notify_on_release +│   └── tasks +├── 
notify_on_release +└── tasks + +``` + +I suggest that we *do not* create cgroups with names identical to the containers, and instead create a single cgroup, sized appropriately, to constrain our hypervisor. + + +## CRI-O + +`insert example here? +` +## Docker + +`$ docker run --cpus=3 -it busybox sh ` + +containerID: `9621fa5988bdd7ba5128f9530a618aa270f9425137e6ffb4d207e5329be9b3f4` + +``` +eernst@eernstworkstation:/sys/fs/cgroup/cpu,cpuacct/docker$ tree +. +├── 9621fa5988bdd7ba5128f9530a618aa270f9425137e6ffb4d207e5329be9b3f4 +│   ├── cgroup.clone_children +│   ├── cgroup.procs +│   ├── cpuacct.stat +│   ├── cpuacct.usage +│   ├── cpuacct.usage_all +│   ├── cpuacct.usage_percpu +│   ├── cpuacct.usage_percpu_sys +│   ├── cpuacct.usage_percpu_user +│   ├── cpuacct.usage_sys +│   ├── cpuacct.usage_user +│   ├── cpu.cfs_period_us +│   ├── cpu.cfs_quota_us +│   ├── cpu.shares +│   ├── cpu.stat +│   ├── notify_on_release +│   └── tasks +├── cgroup.clone_children +├── cgroup.procs +├── cpuacct.stat +├── cpuacct.usage +├── cpuacct.usage_all +├── cpuacct.usage_percpu +├── cpuacct.usage_percpu_sys +├── cpuacct.usage_percpu_user +├── cpuacct.usage_sys +├── cpuacct.usage_user +├── cpu.cfs_period_us +├── cpu.cfs_quota_us +├── cpu.shares +├── cpu.stat +├── notify_on_release +└── tasks + +``` + + + +## Kata implementation +1. Move qemu and proxy from conmon cgroup to a new cgroup (kata-sandbox) + 1.1 - The kata-sandbox cgroup must be a child of the parent cgroup, that is specified in the OCI spec + 1.2 - The constraint for the kata-sandbox cgroup should be equal to -1 (no constraints), that way it inherits the constraint of its parent +2. Only next cgroups will be honored: + * cpu + * cpuset: cpuset initially is a join of all cpusets? +3. Qemu vcpu threads shouldn't be moved into the sandbox cgroup, since the whole Qemu pid is moved into the kata-sandbox. +4. Kata-shim ??? 
+ +## Appendix + +### Guaranteed YAML: +``` +apiVersion: v1 +kind: Pod +metadata: + name: guar-runc +spec: + containers: + - name: cont-2cpu-400m + image: busybox + resources: + limits: + cpu: 2 + memory: "400Mi" + command: ["md5sum"] + args: ["/dev/urandom"] + - name: cont-3cpu-200m + image: busybox + resources: + limits: + cpu: 3 + memory: "200Mi" + command: ["md5sum"] + args: ["/dev/urandom"] +``` + +``` +``` + +## Desired End State + + +``` +s# for i in `ls pod*/**/tasks`; do echo $i && for j in `cat $i`; do pstree -pt $j;done; done; +podf277e232-5ca6-11e9-b514-000d3a6d0876/kata-sandbox +{qemu-system-x86}(24011) +{qemu-system-x86}(24093) +{qemu-system-x86}(24097) +{qemu-system-x86}(24156) +{qemu-system-x86}(24157) +{qemu-system-x86}(24158) +``` + +``` +root@kata-k8s-containerd:/sys/fs/cgroup/cpu,cpuacct/kubepods/podf277e232-5ca6-11e9-b514-000d3a6d0876# cat */cpu.shares +3072 +2048 +3072 +``` + +``` +# cat */cpu.cfs_quota_us +300000 +200000 +500000 + +``` + + +## RESULTS + +### containerd+kata + +**Where is the shimv2 process. It is somewhere else?** + +It is under system.slice/containerd.service/ + + +**Where is the QEMU parent process and the iothreads?** + +Again under system.slice/containerd.service/ + +This is not desired behavior as the system.slice is bounded to be 1024 CPU shares typically which is typically set to match kube+system reserved. So we will get bad CPU performance for iothreads under load. + +The memory for this slice is unbounded. Which makes it worst of both worlds. So unbounded memory and highly bounded CPU. + +**Why is this bad?** + +This means we are using up kube and system reserved resources for kubepods. + +Hence we have to stay within kubepods. + +** What should we do?** + +It is ok for the shim which is launched directly by containerd to stay where it is. + +1. The pause cgroup cpu shares should be setup correctly. This needs fixing. + +- Julio can you fix this. + +2. 
Change the naming so that cri-o stats work at k8s level + +- https://github.com/kata-containers/runtime/pull/1518 + + +3. *We should move the qemu threads into the sandbox.* + +- https://github.com/kata-containers/runtime/pull/1431 + +Why? +a. Node stability +b. The io's should be charged to the pod performing the io, specially in multi tenant enviornments. +c. OOM is for memory. Memory limits are enforced not requests. So users need to have higher limits till POD overhead support is added. + In the case of CRIO the memory is being charged to conmon which is bounded by pod limits. So can still OOMed. But is still ok from a node stability POV. But not ok from POD stability POV. So resonably correct from a memory point of view. And bad pod performance from a CPU point of view if the workload is iobound. + In case of containerd the memory is being charged to the caller (containerd) which is basically unbounded. So bad for node stability, also the POD is essentially unbounded. Also CPU here is bounded to ~kube+system (~1024), which is also bad. + +4. *We should not create any containers cgroups on the host. We should create a pod sandbox cgroup that is entirely managed by us* + +- The pod sandbox cgroup should always be the summation of all container group resources. + +- In the case of cri-o where it creates other cgroups for conmon, they will be siblings. + +- The conmon cgroups today are rather large, so if conmon goes wild there is a possibility we will get fewer resources. But it will not introduce any other side effects. 
+ + +``` +s# for i in `ls pod*/**/tasks`; do echo $i && for j in `cat $i`; do pstree -pt $j;done; done; +podf277e232-5ca6-11e9-b514-000d3a6d0876/2692eaedb55f8cfd1b9aadcbc5e3f0ac527cb39ff26d31877f1be5a495b966c1/tasks +podf277e232-5ca6-11e9-b514-000d3a6d0876/6689f72eef2161b85d5d57cb9f4670ae4e08f551d9aeb4b28efb67eb306034d8/tasks +podf277e232-5ca6-11e9-b514-000d3a6d0876/9d17d1c5d075ca42dfefb58ef7b82c8b1b234cc128ebf9332b902b866c0ebed7/tasks +{qemu-system-x86}(24011) +{qemu-system-x86}(24093) +{qemu-system-x86}(24097) +{qemu-system-x86}(24156) +{qemu-system-x86}(24157) +{qemu-system-x86}(24158) +``` + +``` +root@kata-k8s-containerd:/sys/fs/cgroup/cpu,cpuacct/kubepods/podf277e232-5ca6-11e9-b514-000d3a6d0876# cat */cpu.shares +3072 +2048 +3072 +``` + +``` +# cat */cpu.cfs_quota_us +300000 +200000 +500000 + +``` +#### shim +``` +root@kata-k8s-containerd:/sys/fs/cgroup/cpu,cpuacct# ps -aef | grep containerd-shim-kata | grep -v grep + +root 23992 1 0 22:13 ? 00:00:00 /opt/kata/bin/containerd-shim-kata-v2 -namespace k8s.io -address /run/containerd/containerd.sock -publish-binary /usr/local/bin/containerd -id 9d17d1c5d075ca42dfefb58ef7b82c8b1b234cc128ebf9332b902b866c0ebed7 -debug +root@kata-k8s-containerd:/sys/fs/cgroup/cpu,cpuacct# grep -ir 23992 +system.slice/containerd.service/cgroup.procs:23992 +system.slice/containerd.service/tasks:23992 +``` + +#### qemu: +``` +root@kata-k8s-containerd:/sys/fs/cgroup/cpu,cpuacct# ps -ae | grep qemu +24007 ? 00:37:26 qemu-system-x86 + +root@kata-k8s-containerd:/sys/fs/cgroup/cpu,cpuacct# grep -ir 24007 +kubepods/podf277e232-5ca6-11e9-b514-000d3a6d0876/9d17d1c5d075ca42dfefb58ef7b82c8b1b234cc128ebf9332b902b866c0ebed7/cgroup.procs:24007 +system.slice/containerd.service/cgroup.procs:24007 +system.slice/containerd.service/tasks:24007 +``` + +#### Vhost: + +``` +root@kata-k8s-containerd:/sys/fs/cgroup/cpu,cpuacct# ps -aef | grep vhost | grep -v qemu | grep -v grep +root 24010 2 0 22:13 ? 
00:00:00 [vhost-24007] + +root@kata-k8s-containerd:/sys/fs/cgroup/cpu,cpuacct# grep -ir 24010 +system.slice/containerd.service/cgroup.procs:24010 +system.slice/containerd.service/tasks:24010 +``` + +### cri-o + + +``` +kata-runtime : 1.6.1 + commit : 8efc5718813224722f87ad119edcf9753fd6147d + OCI specs: 1.0.1-dev +``` + + +#### cgroup hierarchy +``` +root@kata /sys/fs/cgroup/cpu/kubepods # tree /sys/fs/cgroup/cpu/kubepods/po* +/sys/fs/cgroup/cpu/kubepods/pod1cc61d33-5ca1-11e9-90bc-525400cfa589 +├── crio-1b05886a39901ef3a7555796d38dcfaafd8fda929aef223ea576324a4949f9ef +│   ├── cpu.shares +├── crio-a8f744f610e72a4a20a3eb75582a6355ebae812e9d0ddcb3eb4d3c4ceb214485 +│   ├── cpu.shares +├── crio-conmon-1b05886a39901ef3a7555796d38dcfaafd8fda929aef223ea576324a4949f9ef +│   ├── cpu.shares +├── crio-conmon-a8f744f610e72a4a20a3eb75582a6355ebae812e9d0ddcb3eb4d3c4ceb214485 +│   ├── cpu.shares +├── crio-conmon-d3b54328d888c763c20a36958f3a25cece4f71f8a2c5900f0e54a80aef68fb1b +│   ├── cpu.shares +├── crio-d3b54328d888c763c20a36958f3a25cece4f71f8a2c5900f0e54a80aef68fb1b +│   ├── cpu.shares +└── tasks + +``` + +##### cpusets + +The first container we see is the pause container + +``` +root@kata /sys/fs/cgroup/cpuset/kubepods # for i in `ls pod*/**/cpuset.cpus`; do echo $i && cat $i;done; +pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-1b05886a39901ef3a7555796d38dcfaafd8fda929aef223ea576324a4949f9ef/cpuset.cpus +1-5 +pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-a8f744f610e72a4a20a3eb75582a6355ebae812e9d0ddcb3eb4d3c4ceb214485/cpuset.cpus +1-2 +pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-conmon-1b05886a39901ef3a7555796d38dcfaafd8fda929aef223ea576324a4949f9ef/cpuset.cpus +0-7 +pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-conmon-a8f744f610e72a4a20a3eb75582a6355ebae812e9d0ddcb3eb4d3c4ceb214485/cpuset.cpus +0-7 +pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-conmon-d3b54328d888c763c20a36958f3a25cece4f71f8a2c5900f0e54a80aef68fb1b/cpuset.cpus +0-7 
+pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-d3b54328d888c763c20a36958f3a25cece4f71f8a2c5900f0e54a80aef68fb1b/cpuset.cpus +3-5 +``` + +##### shares + +``` +root@kata /sys/fs/cgroup/cpu/kubepods # for i in `ls pod*/**/cpu.shares`; do echo $i && cat $i;done; +pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-1b05886a39901ef3a7555796d38dcfaafd8fda929aef223ea576324a4949f9ef/cpu.shares +3072 +pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-a8f744f610e72a4a20a3eb75582a6355ebae812e9d0ddcb3eb4d3c4ceb214485/cpu.shares +2048 +pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-conmon-1b05886a39901ef3a7555796d38dcfaafd8fda929aef223ea576324a4949f9ef/cpu.shares +1024 +pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-conmon-a8f744f610e72a4a20a3eb75582a6355ebae812e9d0ddcb3eb4d3c4ceb214485/cpu.shares +1024 +pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-conmon-d3b54328d888c763c20a36958f3a25cece4f71f8a2c5900f0e54a80aef68fb1b/cpu.shares +1024 +pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-d3b54328d888c763c20a36958f3a25cece4f71f8a2c5900f0e54a80aef68fb1b/cpu.shares +3072 +``` + +##### tasks (runc) + +``` +root@runc /sys/fs/cgroup/cpu/kubepods # for i in `ls pod*/**/tasks`; do echo $i && for j in `cat $i`; do pstree -pt $j;done; done; +pod53ff10c2-5ca0-11e9-8a48-525400eac274/crio-340f2c953412c3c0c5d5a7ee68a850563a93f2ec3e4b292776e5ecee0279506d/tasks +md5sum(19646) +pod53ff10c2-5ca0-11e9-8a48-525400eac274/crio-96f7e692f20ae48664bef2a6cd5bee7782f4945593c6dda884c5a1454ea9121b/tasks +pause(19312) +pod53ff10c2-5ca0-11e9-8a48-525400eac274/crio-conmon-340f2c953412c3c0c5d5a7ee68a850563a93f2ec3e4b292776e5ecee0279506d/tasks +conmon(19634)─┬─md5sum(19646) + └─{gmain}(19636) +{gmain}(19636) +pod53ff10c2-5ca0-11e9-8a48-525400eac274/crio-conmon-96f7e692f20ae48664bef2a6cd5bee7782f4945593c6dda884c5a1454ea9121b/tasks +conmon(19300)─┬─pause(19312) + └─{gmain}(19302) +{gmain}(19302) +pod53ff10c2-5ca0-11e9-8a48-525400eac274/crio-conmon-fb40fb917f431ee82ffb73d8794c1634968557a6989d7481057973bbdfaa8fab/tasks 
+conmon(19422)─┬─md5sum(19434) + └─{gmain}(19424) +{gmain}(19424) +pod53ff10c2-5ca0-11e9-8a48-525400eac274/crio-fb40fb917f431ee82ffb73d8794c1634968557a6989d7481057973bbdfaa8fab/tasks +md5sum(19434) +``` +##### tasks (kata) + +There are a whole bunch of threads under the pause cgroup. + +What are they? + +Are they the vCPU threads or iothreads? + +**They must be vCPU threads as we have 5 vCPUs and we see 5 qemu threads.** + +The QEMU and vhost threads are under the conmon cgroup. + +pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-conmon-1b05886a39901ef3a7555796d38dcfaafd8fda929aef223ea576324a4949f9ef/tasks + +This incorrect? + +So where should they go? + +**Ideally they should also go into the pause cgroup as there is no other cgroup that we can sit under.** + + + + +``` +root@kata /sys/fs/cgroup/cpu/kubepods # for i in `ls pod*/**/tasks`; do echo $i && for j in `cat $i`; do pstree -pt $j;done; done; +pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-1b05886a39901ef3a7555796d38dcfaafd8fda929aef223ea576324a4949f9ef/tasks +{qemu-system-x86}(18061) +kata-shim(18207)─┬─{kata-shim}(18213) + ├─{kata-shim}(18215) + ├─{kata-shim}(18216) + ├─{kata-shim}(18217) + ├─{kata-shim}(18218) + ├─{kata-shim}(18219) + ├─{kata-shim}(18220) + ├─{kata-shim}(18221) + ├─{kata-shim}(18223) + ├─{kata-shim}(18224) + └─{kata-shim}(18226) +{kata-shim}(18213) +{kata-shim}(18215) +{kata-shim}(18216) +{kata-shim}(18217) +{kata-shim}(18218) +{kata-shim}(18219) +{kata-shim}(18220) +{kata-shim}(18221) +{kata-shim}(18223) +{kata-shim}(18224) +{kata-shim}(18226) +{qemu-system-x86}(18280) +{qemu-system-x86}(18281) +{qemu-system-x86}(18368) +{qemu-system-x86}(18369) +{qemu-system-x86}(18370) +pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-a8f744f610e72a4a20a3eb75582a6355ebae812e9d0ddcb3eb4d3c4ceb214485/tasks +kata-shim(18288)─┬─{kata-shim}(18289) + ├─{kata-shim}(18290) + ├─{kata-shim}(18291) + ├─{kata-shim}(18292) + ├─{kata-shim}(18293) + ├─{kata-shim}(18296) + ├─{kata-shim}(18297) + ├─{kata-shim}(18298) + 
└─{kata-shim}(18299) +{kata-shim}(18289) +{kata-shim}(18290) +{kata-shim}(18291) +{kata-shim}(18292) +{kata-shim}(18293) +{kata-shim}(18296) +{kata-shim}(18297) +{kata-shim}(18298) +{kata-shim}(18299) +pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-conmon-1b05886a39901ef3a7555796d38dcfaafd8fda929aef223ea576324a4949f9ef/tasks +conmon(18040)─┬─kata-proxy(18063)─┬─{kata-proxy}(18070) + │ ├─{kata-proxy}(18071) + │ ├─{kata-proxy}(18072) + │ ├─{kata-proxy}(18073) + │ ├─{kata-proxy}(18074) + │ ├─{kata-proxy}(18075) + │ ├─{kata-proxy}(18076) + │ ├─{kata-proxy}(18077) + │ ├─{kata-proxy}(18286) + │ ├─{kata-proxy}(20199) + │ ├─{kata-proxy}(20200) + │ ├─{kata-proxy}(20201) + │ ├─{kata-proxy}(20202) + │ ├─{kata-proxy}(20203) + │ └─{kata-proxy}(28490) + ├─kata-shim(18207)─┬─{kata-shim}(18213) + │ ├─{kata-shim}(18215) + │ ├─{kata-shim}(18216) + │ ├─{kata-shim}(18217) + │ ├─{kata-shim}(18218) + │ ├─{kata-shim}(18219) + │ ├─{kata-shim}(18220) + │ ├─{kata-shim}(18221) + │ ├─{kata-shim}(18223) + │ ├─{kata-shim}(18224) + │ └─{kata-shim}(18226) + ├─qemu-system-x86(18058)─┬─{qemu-system-x86}(18059) + │ ├─{qemu-system-x86}(18061) + │ ├─{qemu-system-x86}(18280) + │ ├─{qemu-system-x86}(18281) + │ ├─{qemu-system-x86}(18368) + │ ├─{qemu-system-x86}(18369) + │ └─{qemu-system-x86}(18370) + └─{gmain}(18042) +{gmain}(18042) +qemu-system-x86(18058)─┬─{qemu-system-x86}(18059) + ├─{qemu-system-x86}(18061) + ├─{qemu-system-x86}(18280) + ├─{qemu-system-x86}(18281) + ├─{qemu-system-x86}(18368) + ├─{qemu-system-x86}(18369) + └─{qemu-system-x86}(18370) +{qemu-system-x86}(18059) +vhost-18058(18060) +kata-proxy(18063)─┬─{kata-proxy}(18070) + ├─{kata-proxy}(18071) + ├─{kata-proxy}(18072) + ├─{kata-proxy}(18073) + ├─{kata-proxy}(18074) + ├─{kata-proxy}(18075) + ├─{kata-proxy}(18076) + ├─{kata-proxy}(18077) + ├─{kata-proxy}(18286) + ├─{kata-proxy}(20199) + ├─{kata-proxy}(20200) + ├─{kata-proxy}(20201) + ├─{kata-proxy}(20202) + ├─{kata-proxy}(20203) + └─{kata-proxy}(28490) +{kata-proxy}(18070) 
+{kata-proxy}(18071) +{kata-proxy}(18072) +{kata-proxy}(18073) +{kata-proxy}(18074) +{kata-proxy}(18075) +{kata-proxy}(18076) +{kata-proxy}(18077) +{kata-proxy}(18286) +{kata-proxy}(20199) +{kata-proxy}(20200) +{kata-proxy}(20201) +{kata-proxy}(20202) +{kata-proxy}(20203) +{kata-proxy}(28490) +pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-conmon-a8f744f610e72a4a20a3eb75582a6355ebae812e9d0ddcb3eb4d3c4ceb214485/tasks +conmon(18265)─┬─kata-shim(18288)─┬─{kata-shim}(18289) + │ ├─{kata-shim}(18290) + │ ├─{kata-shim}(18291) + │ ├─{kata-shim}(18292) + │ ├─{kata-shim}(18293) + │ ├─{kata-shim}(18296) + │ ├─{kata-shim}(18297) + │ ├─{kata-shim}(18298) + │ └─{kata-shim}(18299) + └─{gmain}(18267) +{gmain}(18267) +pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-conmon-d3b54328d888c763c20a36958f3a25cece4f71f8a2c5900f0e54a80aef68fb1b/tasks +conmon(18353)─┬─kata-shim(18377)─┬─{kata-shim}(18378) + │ ├─{kata-shim}(18379) + │ ├─{kata-shim}(18380) + │ ├─{kata-shim}(18381) + │ ├─{kata-shim}(18382) + │ ├─{kata-shim}(18383) + │ ├─{kata-shim}(18384) + │ ├─{kata-shim}(18385) + │ ├─{kata-shim}(18386) + │ └─{kata-shim}(18387) + └─{gmain}(18355) +{gmain}(18355) +pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-d3b54328d888c763c20a36958f3a25cece4f71f8a2c5900f0e54a80aef68fb1b/tasks +kata-shim(18377)─┬─{kata-shim}(18378) + ├─{kata-shim}(18379) + ├─{kata-shim}(18380) + ├─{kata-shim}(18381) + ├─{kata-shim}(18382) + ├─{kata-shim}(18383) + ├─{kata-shim}(18384) + ├─{kata-shim}(18385) + ├─{kata-shim}(18386) + └─{kata-shim}(18387) +{kata-shim}(18378) +{kata-shim}(18379) +{kata-shim}(18380) +{kata-shim}(18381) +{kata-shim}(18382) +{kata-shim}(18383) +{kata-shim}(18384) +{kata-shim}(18385) +{kata-shim}(18386) +{kata-shim}(18387) +``` + + +### Kata in Docker: + +``` +$ for j in `cat docker/2026c3747499019a8b33589e6fdc89194117879c0d47b4796c32c587b47bdf92/tasks`; do pstree -pt $j; done +{qemu-system-x86}(9690) +{qemu-system-x86}(9744) +{qemu-system-x86}(9745) +{qemu-system-x86}(9746) 
+kata-shim(9750)─┬─{kata-shim}(9751) + ├─{kata-shim}(9752) + ├─{kata-shim}(9754) + ├─{kata-shim}(9755) + ├─{kata-shim}(9762) + ├─{kata-shim}(9763) + ├─{kata-shim}(9764) + └─{kata-shim}(9800) +{kata-shim}(9751) +{kata-shim}(9752) +{kata-shim}(9754) +{kata-shim}(9755) +{kata-shim}(9762) +{kata-shim}(9763) +{kata-shim}(9764) +{kata-shim}(9800) + +``` From f7bff95acedac83b12f9b4fa682a186fc846d93f Mon Sep 17 00:00:00 2001 From: Eric Ernst Date: Fri, 12 Apr 2019 16:32:35 -0700 Subject: [PATCH 2/4] cleanup initial drop, add TOC Signed-off-by: Eric Ernst --- design/resource-management.md | 665 ++-------------------------------- 1 file changed, 21 insertions(+), 644 deletions(-) diff --git a/design/resource-management.md b/design/resource-management.md index d5669967..e1bc68b1 100644 --- a/design/resource-management.md +++ b/design/resource-management.md @@ -1,10 +1,29 @@ # cgroup updates in Kata +* [Background](#background) +* [Existing Behavior (Kata 1.6):](#existing-behavior-kata-16) + + [In Docker](#in-docker) + + [In CRI-O](#in-cri-o) + + [In Containerd](#in-containerd) +* [Proposed Changes](#proposed-changes) + + [Alternatives Considered](#alternatives-considered) +* [Opens](#opens) + + ## Background With 1.6 release of Kata Containers there are some issues with resource management resulting in inconsistent behavior. This document descibes the state of 1.6, and a suggested implementation for 1.7 version of Kata. +Before diving into the gaps and behavior exhibited in Kata Containers, it is important to have a thorough understanding +of how cgroups are leveraged by Kubernetes. It is both straightforward and confusing. An in-depth guide is available +for background [in mcastelino's gist](https://gist.github.com/mcastelino/b8ce9a70b00ee56036dadd70ded53e9f). This may be +good content to include in our repository eventually, or as part of the Kata blog series, but let's leave it out of the +scope of this initial documentation. 
+ +This document is part of the `cgroup-sprint` GitHub milestone, and can be observed in the [Milestone's GitHub project](https://github.com/orgs/kata-containers/projects/17). + + ## Existing Behavior (Kata 1.6): ### In Docker @@ -17,647 +36,5 @@ behavior. This document descibes the state of 1.6, and a suggested implementati ### Alternatives Considered -## Opens - - -Cgroup Updates in Kata - - - -# Backup / to be handled: - - -## Containerd Handling Today - -The hierarchy and cgroup handling seems pragmatic in the case of containerd. The -container cgroups are currently placed under the podcgroup. - -Output from containerd guaranteed pod, with two containers: - -``` -kata@kata-k8s-containerd:/sys/fs/cgroup/cpu,cpuacct/kubepods$ tree pode3bb46f0-5c99-11e9-903b-000d3a6d0876/ -pode3bb46f0-5c99-11e9-903b-000d3a6d0876/ -├── 47c6122142c76db9f2929c23e8693c641559cb6588fbf5e2f5c21623f5af2fd1 -│   ├── cgroup.clone_children -│   ├── cgroup.procs -│   ├── cpu.cfs_period_us -│   ├── cpu.cfs_quota_us -│   ├── cpu.shares -│   ├── cpu.stat -│   ├── cpuacct.stat -│   ├── cpuacct.usage -│   ├── cpuacct.usage_all -│   ├── cpuacct.usage_percpu -│   ├── cpuacct.usage_percpu_sys -│   ├── cpuacct.usage_percpu_user -│   ├── cpuacct.usage_sys -│   ├── cpuacct.usage_user -│   ├── notify_on_release -│   └── tasks -├── a18373fc91cf91d9b362a536afaecd0712a858bb992777d956948911a6d3b248 -│   ├── cgroup.clone_children -│   ├── cgroup.procs -│   ├── cpu.cfs_period_us -│   ├── cpu.cfs_quota_us -│   ├── cpu.shares -│   ├── cpu.stat -│   ├── cpuacct.stat -│   ├── cpuacct.usage -│   ├── cpuacct.usage_all -│   ├── cpuacct.usage_percpu -│   ├── cpuacct.usage_percpu_sys -│   ├── cpuacct.usage_percpu_user -│   ├── cpuacct.usage_sys -│   ├── cpuacct.usage_user -│   ├── notify_on_release -│   └── tasks -├── cgroup.clone_children -├── cgroup.procs -├── cpu.cfs_period_us -├── cpu.cfs_quota_us -├── cpu.shares -├── cpu.stat -├── cpuacct.stat -├── cpuacct.usage -├── cpuacct.usage_all -├── cpuacct.usage_percpu 
-├── cpuacct.usage_percpu_sys -├── cpuacct.usage_percpu_user -├── cpuacct.usage_sys -├── cpuacct.usage_user -├── f90eb972d44634f85726552a818578c4f28085f9a2f755fb8b18c402fe9cef6d -│   ├── cgroup.clone_children -│   ├── cgroup.procs -│   ├── cpu.cfs_period_us -│   ├── cpu.cfs_quota_us -│   ├── cpu.shares -│   ├── cpu.stat -│   ├── cpuacct.stat -│   ├── cpuacct.usage -│   ├── cpuacct.usage_all -│   ├── cpuacct.usage_percpu -│   ├── cpuacct.usage_percpu_sys -│   ├── cpuacct.usage_percpu_user -│   ├── cpuacct.usage_sys -│   ├── cpuacct.usage_user -│   ├── notify_on_release -│   └── tasks -├── notify_on_release -└── tasks - -``` - -I suggest that we *do not* create cgroups with names identical to the containers, and instead create a single cgroup, sized appropriately, to constrain our hypervisor. - - -## CRI-O - -`insert example here? -` -## Docker - -`$ docker run --cpus=3 -it busybox sh ` - -containerID: `9621fa5988bdd7ba5128f9530a618aa270f9425137e6ffb4d207e5329be9b3f4` - -``` -eernst@eernstworkstation:/sys/fs/cgroup/cpu,cpuacct/docker$ tree -. -├── 9621fa5988bdd7ba5128f9530a618aa270f9425137e6ffb4d207e5329be9b3f4 -│   ├── cgroup.clone_children -│   ├── cgroup.procs -│   ├── cpuacct.stat -│   ├── cpuacct.usage -│   ├── cpuacct.usage_all -│   ├── cpuacct.usage_percpu -│   ├── cpuacct.usage_percpu_sys -│   ├── cpuacct.usage_percpu_user -│   ├── cpuacct.usage_sys -│   ├── cpuacct.usage_user -│   ├── cpu.cfs_period_us -│   ├── cpu.cfs_quota_us -│   ├── cpu.shares -│   ├── cpu.stat -│   ├── notify_on_release -│   └── tasks -├── cgroup.clone_children -├── cgroup.procs -├── cpuacct.stat -├── cpuacct.usage -├── cpuacct.usage_all -├── cpuacct.usage_percpu -├── cpuacct.usage_percpu_sys -├── cpuacct.usage_percpu_user -├── cpuacct.usage_sys -├── cpuacct.usage_user -├── cpu.cfs_period_us -├── cpu.cfs_quota_us -├── cpu.shares -├── cpu.stat -├── notify_on_release -└── tasks - -``` - - - -## Kata implementation -1. 
Move qemu and proxy from conmon cgroup to a new cgroup (kata-sandbox) - 1.1 - The kata-sandbox cgroup must be a child of the parent cgroup, that is specified in the OCI spec - 1.2 - The constraint for the kata-sandbox cgroup should be equal to -1 (no constraints), that way it inherits the constraint of its parent -2. Only next cgroups will be honored: - * cpu - * cpuset: cpuset initially is a join of all cpusets? -3. Qemu vcpu threads shouldn't be moved into the sandbox cgroup, since the whole Qemu pid is moved into the kata-sandbox. -4. Kata-shim ??? - -## Appendix - -### Guaranteed YAML: -``` -apiVersion: v1 -kind: Pod -metadata: - name: guar-runc -spec: - containers: - - name: cont-2cpu-400m - image: busybox - resources: - limits: - cpu: 2 - memory: "400Mi" - command: ["md5sum"] - args: ["/dev/urandom"] - - name: cont-3cpu-200m - image: busybox - resources: - limits: - cpu: 3 - memory: "200Mi" - command: ["md5sum"] - args: ["/dev/urandom"] -``` - -``` -``` - -## Desired End State - - -``` -s# for i in `ls pod*/**/tasks`; do echo $i && for j in `cat $i`; do pstree -pt $j;done; done; -podf277e232-5ca6-11e9-b514-000d3a6d0876/kata-sandbox -{qemu-system-x86}(24011) -{qemu-system-x86}(24093) -{qemu-system-x86}(24097) -{qemu-system-x86}(24156) -{qemu-system-x86}(24157) -{qemu-system-x86}(24158) -``` - -``` -root@kata-k8s-containerd:/sys/fs/cgroup/cpu,cpuacct/kubepods/podf277e232-5ca6-11e9-b514-000d3a6d0876# cat */cpu.shares -3072 -2048 -3072 -``` - -``` -# cat */cpu.cfs_quota_us -300000 -200000 -500000 - -``` - - -## RESULTS - -### containerd+kata - -**Where is the shimv2 process. It is somewhere else?** - -It is under system.slice/containerd.service/ - - -**Where is the QEMU parent process and the iothreads?** - -Again under system.slice/containerd.service/ - -This is not desired behavior as the system.slice is bounded to be 1024 CPU shares typically which is typically set to match kube+system reserved. So we will get bad CPU performance for iothreads under load. 
- -The memory for this slice is unbounded. Which makes it worst of both worlds. So unbounded memory and highly bounded CPU. - -**Why is this bad?** - -This means we are using up kube and system reserved resources for kubepods. - -Hence we have to stay within kubepods. - -** What should we do?** - -It is ok for the shim which is launched directly by containerd to stay where it is. - -1. The pause cgroup cpu shares should be setup correctly. This needs fixing. - -- Julio can you fix this. - -2. Change the naming so that cri-o stats work at k8s level - -- https://github.com/kata-containers/runtime/pull/1518 - - -3. *We should move the qemu threads into the sandbox.* - -- https://github.com/kata-containers/runtime/pull/1431 - -Why? -a. Node stability -b. The io's should be charged to the pod performing the io, specially in multi tenant enviornments. -c. OOM is for memory. Memory limits are enforced not requests. So users need to have higher limits till POD overhead support is added. - In the case of CRIO the memory is being charged to conmon which is bounded by pod limits. So can still OOMed. But is still ok from a node stability POV. But not ok from POD stability POV. So resonably correct from a memory point of view. And bad pod performance from a CPU point of view if the workload is iobound. - In case of containerd the memory is being charged to the caller (containerd) which is basically unbounded. So bad for node stability, also the POD is essentially unbounded. Also CPU here is bounded to ~kube+system (~1024), which is also bad. - -4. *We should not create any containers cgroups on the host. We should create a pod sandbox cgroup that is entirely managed by us* - -- The pod sandbox cgroup should always be the summation of all container group resources. - -- In the case of cri-o where it creates other cgroups for conmon, they will be siblings. - -- The conmon cgroups today are rather large, so if conmon goes wild there is a possibility we will get fewer resources. 
But it will not introduce any other side effects. - - -``` -s# for i in `ls pod*/**/tasks`; do echo $i && for j in `cat $i`; do pstree -pt $j;done; done; -podf277e232-5ca6-11e9-b514-000d3a6d0876/2692eaedb55f8cfd1b9aadcbc5e3f0ac527cb39ff26d31877f1be5a495b966c1/tasks -podf277e232-5ca6-11e9-b514-000d3a6d0876/6689f72eef2161b85d5d57cb9f4670ae4e08f551d9aeb4b28efb67eb306034d8/tasks -podf277e232-5ca6-11e9-b514-000d3a6d0876/9d17d1c5d075ca42dfefb58ef7b82c8b1b234cc128ebf9332b902b866c0ebed7/tasks -{qemu-system-x86}(24011) -{qemu-system-x86}(24093) -{qemu-system-x86}(24097) -{qemu-system-x86}(24156) -{qemu-system-x86}(24157) -{qemu-system-x86}(24158) -``` - -``` -root@kata-k8s-containerd:/sys/fs/cgroup/cpu,cpuacct/kubepods/podf277e232-5ca6-11e9-b514-000d3a6d0876# cat */cpu.shares -3072 -2048 -3072 -``` - -``` -# cat */cpu.cfs_quota_us -300000 -200000 -500000 - -``` -#### shim -``` -root@kata-k8s-containerd:/sys/fs/cgroup/cpu,cpuacct# ps -aef | grep containerd-shim-kata | grep -v grep - -root 23992 1 0 22:13 ? 00:00:00 /opt/kata/bin/containerd-shim-kata-v2 -namespace k8s.io -address /run/containerd/containerd.sock -publish-binary /usr/local/bin/containerd -id 9d17d1c5d075ca42dfefb58ef7b82c8b1b234cc128ebf9332b902b866c0ebed7 -debug -root@kata-k8s-containerd:/sys/fs/cgroup/cpu,cpuacct# grep -ir 23992 -system.slice/containerd.service/cgroup.procs:23992 -system.slice/containerd.service/tasks:23992 -``` - -#### qemu: -``` -root@kata-k8s-containerd:/sys/fs/cgroup/cpu,cpuacct# ps -ae | grep qemu -24007 ? 00:37:26 qemu-system-x86 - -root@kata-k8s-containerd:/sys/fs/cgroup/cpu,cpuacct# grep -ir 24007 -kubepods/podf277e232-5ca6-11e9-b514-000d3a6d0876/9d17d1c5d075ca42dfefb58ef7b82c8b1b234cc128ebf9332b902b866c0ebed7/cgroup.procs:24007 -system.slice/containerd.service/cgroup.procs:24007 -system.slice/containerd.service/tasks:24007 -``` - -#### Vhost: - -``` -root@kata-k8s-containerd:/sys/fs/cgroup/cpu,cpuacct# ps -aef | grep vhost | grep -v qemu | grep -v grep -root 24010 2 0 22:13 ? 
00:00:00 [vhost-24007] - -root@kata-k8s-containerd:/sys/fs/cgroup/cpu,cpuacct# grep -ir 24010 -system.slice/containerd.service/cgroup.procs:24010 -system.slice/containerd.service/tasks:24010 -``` - -### cri-o - - -``` -kata-runtime : 1.6.1 - commit : 8efc5718813224722f87ad119edcf9753fd6147d - OCI specs: 1.0.1-dev -``` - - -#### cgroup hierarchy -``` -root@kata /sys/fs/cgroup/cpu/kubepods # tree /sys/fs/cgroup/cpu/kubepods/po* -/sys/fs/cgroup/cpu/kubepods/pod1cc61d33-5ca1-11e9-90bc-525400cfa589 -├── crio-1b05886a39901ef3a7555796d38dcfaafd8fda929aef223ea576324a4949f9ef -│   ├── cpu.shares -├── crio-a8f744f610e72a4a20a3eb75582a6355ebae812e9d0ddcb3eb4d3c4ceb214485 -│   ├── cpu.shares -├── crio-conmon-1b05886a39901ef3a7555796d38dcfaafd8fda929aef223ea576324a4949f9ef -│   ├── cpu.shares -├── crio-conmon-a8f744f610e72a4a20a3eb75582a6355ebae812e9d0ddcb3eb4d3c4ceb214485 -│   ├── cpu.shares -├── crio-conmon-d3b54328d888c763c20a36958f3a25cece4f71f8a2c5900f0e54a80aef68fb1b -│   ├── cpu.shares -├── crio-d3b54328d888c763c20a36958f3a25cece4f71f8a2c5900f0e54a80aef68fb1b -│   ├── cpu.shares -└── tasks - -``` - -##### cpusets - -The first container we see is the pause container - -``` -root@kata /sys/fs/cgroup/cpuset/kubepods # for i in `ls pod*/**/cpuset.cpus`; do echo $i && cat $i;done; -pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-1b05886a39901ef3a7555796d38dcfaafd8fda929aef223ea576324a4949f9ef/cpuset.cpus -1-5 -pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-a8f744f610e72a4a20a3eb75582a6355ebae812e9d0ddcb3eb4d3c4ceb214485/cpuset.cpus -1-2 -pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-conmon-1b05886a39901ef3a7555796d38dcfaafd8fda929aef223ea576324a4949f9ef/cpuset.cpus -0-7 -pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-conmon-a8f744f610e72a4a20a3eb75582a6355ebae812e9d0ddcb3eb4d3c4ceb214485/cpuset.cpus -0-7 -pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-conmon-d3b54328d888c763c20a36958f3a25cece4f71f8a2c5900f0e54a80aef68fb1b/cpuset.cpus -0-7 
-pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-d3b54328d888c763c20a36958f3a25cece4f71f8a2c5900f0e54a80aef68fb1b/cpuset.cpus -3-5 -``` - -##### shares - -``` -root@kata /sys/fs/cgroup/cpu/kubepods # for i in `ls pod*/**/cpu.shares`; do echo $i && cat $i;done; -pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-1b05886a39901ef3a7555796d38dcfaafd8fda929aef223ea576324a4949f9ef/cpu.shares -3072 -pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-a8f744f610e72a4a20a3eb75582a6355ebae812e9d0ddcb3eb4d3c4ceb214485/cpu.shares -2048 -pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-conmon-1b05886a39901ef3a7555796d38dcfaafd8fda929aef223ea576324a4949f9ef/cpu.shares -1024 -pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-conmon-a8f744f610e72a4a20a3eb75582a6355ebae812e9d0ddcb3eb4d3c4ceb214485/cpu.shares -1024 -pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-conmon-d3b54328d888c763c20a36958f3a25cece4f71f8a2c5900f0e54a80aef68fb1b/cpu.shares -1024 -pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-d3b54328d888c763c20a36958f3a25cece4f71f8a2c5900f0e54a80aef68fb1b/cpu.shares -3072 -``` - -##### tasks (runc) - -``` -root@runc /sys/fs/cgroup/cpu/kubepods # for i in `ls pod*/**/tasks`; do echo $i && for j in `cat $i`; do pstree -pt $j;done; done; -pod53ff10c2-5ca0-11e9-8a48-525400eac274/crio-340f2c953412c3c0c5d5a7ee68a850563a93f2ec3e4b292776e5ecee0279506d/tasks -md5sum(19646) -pod53ff10c2-5ca0-11e9-8a48-525400eac274/crio-96f7e692f20ae48664bef2a6cd5bee7782f4945593c6dda884c5a1454ea9121b/tasks -pause(19312) -pod53ff10c2-5ca0-11e9-8a48-525400eac274/crio-conmon-340f2c953412c3c0c5d5a7ee68a850563a93f2ec3e4b292776e5ecee0279506d/tasks -conmon(19634)─┬─md5sum(19646) - └─{gmain}(19636) -{gmain}(19636) -pod53ff10c2-5ca0-11e9-8a48-525400eac274/crio-conmon-96f7e692f20ae48664bef2a6cd5bee7782f4945593c6dda884c5a1454ea9121b/tasks -conmon(19300)─┬─pause(19312) - └─{gmain}(19302) -{gmain}(19302) -pod53ff10c2-5ca0-11e9-8a48-525400eac274/crio-conmon-fb40fb917f431ee82ffb73d8794c1634968557a6989d7481057973bbdfaa8fab/tasks 
-conmon(19422)─┬─md5sum(19434) - └─{gmain}(19424) -{gmain}(19424) -pod53ff10c2-5ca0-11e9-8a48-525400eac274/crio-fb40fb917f431ee82ffb73d8794c1634968557a6989d7481057973bbdfaa8fab/tasks -md5sum(19434) -``` -##### tasks (kata) - -There are a whole bunch of threads under the pause cgroup. - -What are they? - -Are they the vCPU threads or iothreads? - -**They must be vCPU threads as we have 5 vCPUs and we see 5 qemu threads.** - -The QEMU and vhost threads are under the conmon cgroup. - -pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-conmon-1b05886a39901ef3a7555796d38dcfaafd8fda929aef223ea576324a4949f9ef/tasks - -This incorrect? - -So where should they go? - -**Ideally they should also go into the pause cgroup as there is no other cgroup that we can sit under.** - - - - -``` -root@kata /sys/fs/cgroup/cpu/kubepods # for i in `ls pod*/**/tasks`; do echo $i && for j in `cat $i`; do pstree -pt $j;done; done; -pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-1b05886a39901ef3a7555796d38dcfaafd8fda929aef223ea576324a4949f9ef/tasks -{qemu-system-x86}(18061) -kata-shim(18207)─┬─{kata-shim}(18213) - ├─{kata-shim}(18215) - ├─{kata-shim}(18216) - ├─{kata-shim}(18217) - ├─{kata-shim}(18218) - ├─{kata-shim}(18219) - ├─{kata-shim}(18220) - ├─{kata-shim}(18221) - ├─{kata-shim}(18223) - ├─{kata-shim}(18224) - └─{kata-shim}(18226) -{kata-shim}(18213) -{kata-shim}(18215) -{kata-shim}(18216) -{kata-shim}(18217) -{kata-shim}(18218) -{kata-shim}(18219) -{kata-shim}(18220) -{kata-shim}(18221) -{kata-shim}(18223) -{kata-shim}(18224) -{kata-shim}(18226) -{qemu-system-x86}(18280) -{qemu-system-x86}(18281) -{qemu-system-x86}(18368) -{qemu-system-x86}(18369) -{qemu-system-x86}(18370) -pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-a8f744f610e72a4a20a3eb75582a6355ebae812e9d0ddcb3eb4d3c4ceb214485/tasks -kata-shim(18288)─┬─{kata-shim}(18289) - ├─{kata-shim}(18290) - ├─{kata-shim}(18291) - ├─{kata-shim}(18292) - ├─{kata-shim}(18293) - ├─{kata-shim}(18296) - ├─{kata-shim}(18297) - ├─{kata-shim}(18298) - 
└─{kata-shim}(18299) -{kata-shim}(18289) -{kata-shim}(18290) -{kata-shim}(18291) -{kata-shim}(18292) -{kata-shim}(18293) -{kata-shim}(18296) -{kata-shim}(18297) -{kata-shim}(18298) -{kata-shim}(18299) -pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-conmon-1b05886a39901ef3a7555796d38dcfaafd8fda929aef223ea576324a4949f9ef/tasks -conmon(18040)─┬─kata-proxy(18063)─┬─{kata-proxy}(18070) - │ ├─{kata-proxy}(18071) - │ ├─{kata-proxy}(18072) - │ ├─{kata-proxy}(18073) - │ ├─{kata-proxy}(18074) - │ ├─{kata-proxy}(18075) - │ ├─{kata-proxy}(18076) - │ ├─{kata-proxy}(18077) - │ ├─{kata-proxy}(18286) - │ ├─{kata-proxy}(20199) - │ ├─{kata-proxy}(20200) - │ ├─{kata-proxy}(20201) - │ ├─{kata-proxy}(20202) - │ ├─{kata-proxy}(20203) - │ └─{kata-proxy}(28490) - ├─kata-shim(18207)─┬─{kata-shim}(18213) - │ ├─{kata-shim}(18215) - │ ├─{kata-shim}(18216) - │ ├─{kata-shim}(18217) - │ ├─{kata-shim}(18218) - │ ├─{kata-shim}(18219) - │ ├─{kata-shim}(18220) - │ ├─{kata-shim}(18221) - │ ├─{kata-shim}(18223) - │ ├─{kata-shim}(18224) - │ └─{kata-shim}(18226) - ├─qemu-system-x86(18058)─┬─{qemu-system-x86}(18059) - │ ├─{qemu-system-x86}(18061) - │ ├─{qemu-system-x86}(18280) - │ ├─{qemu-system-x86}(18281) - │ ├─{qemu-system-x86}(18368) - │ ├─{qemu-system-x86}(18369) - │ └─{qemu-system-x86}(18370) - └─{gmain}(18042) -{gmain}(18042) -qemu-system-x86(18058)─┬─{qemu-system-x86}(18059) - ├─{qemu-system-x86}(18061) - ├─{qemu-system-x86}(18280) - ├─{qemu-system-x86}(18281) - ├─{qemu-system-x86}(18368) - ├─{qemu-system-x86}(18369) - └─{qemu-system-x86}(18370) -{qemu-system-x86}(18059) -vhost-18058(18060) -kata-proxy(18063)─┬─{kata-proxy}(18070) - ├─{kata-proxy}(18071) - ├─{kata-proxy}(18072) - ├─{kata-proxy}(18073) - ├─{kata-proxy}(18074) - ├─{kata-proxy}(18075) - ├─{kata-proxy}(18076) - ├─{kata-proxy}(18077) - ├─{kata-proxy}(18286) - ├─{kata-proxy}(20199) - ├─{kata-proxy}(20200) - ├─{kata-proxy}(20201) - ├─{kata-proxy}(20202) - ├─{kata-proxy}(20203) - └─{kata-proxy}(28490) -{kata-proxy}(18070) 
-{kata-proxy}(18071) -{kata-proxy}(18072) -{kata-proxy}(18073) -{kata-proxy}(18074) -{kata-proxy}(18075) -{kata-proxy}(18076) -{kata-proxy}(18077) -{kata-proxy}(18286) -{kata-proxy}(20199) -{kata-proxy}(20200) -{kata-proxy}(20201) -{kata-proxy}(20202) -{kata-proxy}(20203) -{kata-proxy}(28490) -pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-conmon-a8f744f610e72a4a20a3eb75582a6355ebae812e9d0ddcb3eb4d3c4ceb214485/tasks -conmon(18265)─┬─kata-shim(18288)─┬─{kata-shim}(18289) - │ ├─{kata-shim}(18290) - │ ├─{kata-shim}(18291) - │ ├─{kata-shim}(18292) - │ ├─{kata-shim}(18293) - │ ├─{kata-shim}(18296) - │ ├─{kata-shim}(18297) - │ ├─{kata-shim}(18298) - │ └─{kata-shim}(18299) - └─{gmain}(18267) -{gmain}(18267) -pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-conmon-d3b54328d888c763c20a36958f3a25cece4f71f8a2c5900f0e54a80aef68fb1b/tasks -conmon(18353)─┬─kata-shim(18377)─┬─{kata-shim}(18378) - │ ├─{kata-shim}(18379) - │ ├─{kata-shim}(18380) - │ ├─{kata-shim}(18381) - │ ├─{kata-shim}(18382) - │ ├─{kata-shim}(18383) - │ ├─{kata-shim}(18384) - │ ├─{kata-shim}(18385) - │ ├─{kata-shim}(18386) - │ └─{kata-shim}(18387) - └─{gmain}(18355) -{gmain}(18355) -pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-d3b54328d888c763c20a36958f3a25cece4f71f8a2c5900f0e54a80aef68fb1b/tasks -kata-shim(18377)─┬─{kata-shim}(18378) - ├─{kata-shim}(18379) - ├─{kata-shim}(18380) - ├─{kata-shim}(18381) - ├─{kata-shim}(18382) - ├─{kata-shim}(18383) - ├─{kata-shim}(18384) - ├─{kata-shim}(18385) - ├─{kata-shim}(18386) - └─{kata-shim}(18387) -{kata-shim}(18378) -{kata-shim}(18379) -{kata-shim}(18380) -{kata-shim}(18381) -{kata-shim}(18382) -{kata-shim}(18383) -{kata-shim}(18384) -{kata-shim}(18385) -{kata-shim}(18386) -{kata-shim}(18387) -``` - - -### Kata in Docker: - -``` -$ for j in `cat docker/2026c3747499019a8b33589e6fdc89194117879c0d47b4796c32c587b47bdf92/tasks`; do pstree -pt $j; done -{qemu-system-x86}(9690) -{qemu-system-x86}(9744) -{qemu-system-x86}(9745) -{qemu-system-x86}(9746) 
-kata-shim(9750)─┬─{kata-shim}(9751) - ├─{kata-shim}(9752) - ├─{kata-shim}(9754) - ├─{kata-shim}(9755) - ├─{kata-shim}(9762) - ├─{kata-shim}(9763) - ├─{kata-shim}(9764) - └─{kata-shim}(9800) -{kata-shim}(9751) -{kata-shim}(9752) -{kata-shim}(9754) -{kata-shim}(9755) -{kata-shim}(9762) -{kata-shim}(9763) -{kata-shim}(9764) -{kata-shim}(9800) - -``` +## Opens +t From 4e0f9096ceebc86bdac6a53ec23550b512d3b571 Mon Sep 17 00:00:00 2001 From: Eric Ernst Date: Fri, 12 Apr 2019 16:50:03 -0700 Subject: [PATCH 3/4] add containerd details, and outline of proposal Signed-off-by: Eric Ernst --- design/resource-management.md | 333 +++++++++++++++++++++++++++++++--- 1 file changed, 311 insertions(+), 22 deletions(-) diff --git a/design/resource-management.md b/design/resource-management.md index e1bc68b1..1d966db4 100644 --- a/design/resource-management.md +++ b/design/resource-management.md @@ -1,40 +1,329 @@ # cgroup updates in Kata -* [Background](#background) -* [Existing Behavior (Kata 1.6):](#existing-behavior--kata-16--) - + [In Docker](#in-docker) - + [In CRI-O](#in-cri-o) - + [In Containerd](#in-containerd) -* [Proposed Changes](#proposed-changes) - + [Alternatives Considered](#alternatives-considered) -* [Opens](#opens) - + * [Background](#background) + * [Existing Behavior (Kata 1.6):](#existing-behavior--kata-16--) + + [Behavior Observed using various upper layer tools](#behavior-observed-using-various-upper-layer-tools) + - [In Docker](#in-docker) + - [In Kubernetes + Containerd](#in-kubernetes---containerd) + * [Where are all the vCPUS](#where-are-all-the-vcpus) + * [Where are the v2-shim, QEMU, and Vhost processes](#where-are-the-v2-shim--qemu--and-vhost-processes) + - [In Kubernetes + CRI-O (v1 shim)](#in-kubernetes----cri-o--v1-shim-) + + [Issues with current implementation](#issues-with-current-implementation) + - [Accurate usage accounting](#accurate-usage-accounting) + - [Node stability](#node-stability) + - [Consistent guaranteed pod 
behavior](#consistent-guaranteed-pod-behavior) + - [OOM, unbound CPU utilization](#oom--unbound-cpu-utilization) + * [Proposed Changes](#proposed-changes) + + [Summary](#summary) + + [Details](#details) + - [Pod Sandbox Cgroup](#pod-sandbox-cgroup) + * [Alternatives Considered](#alternatives-considered) + + [Only constrain vCPUs, leaving remaining threads for system reserved](#only-constrain-vcpus--leaving-remaining-threads-for-system-reserved) + * [Opens](#opens) + + [static CPU configurations](#static-cpu-configurations) ## Background -With 1.6 release of Kata Containers there are some issues with resource management resulting -in inconsistent behavior. This document descibes the state of 1.6, and a suggested implementation -for 1.7 version of Kata. +With the 1.6 release of Kata Containers there are some issues with resource management resulting +in inconsistent behavior. This document describes the state of 1.6, and a suggested implementation +for the 1.7 version of Kata. -Before diving into the gaps and behavior exhibited in Kata Containes, it is important to have a thorough understanding -of how cgroups are leveraged by Kubernetes. It is both straight forward and confusing. An in-depth guide is available -for background [in mcastelino's gist](https://gist.github.com/mcastelino/b8ce9a70b00ee56036dadd70ded53e9f). This may be -good content to include in our repository eventually, or as part of the Kata blog series, but let's leave it out of the -scope of this initial documentation. +Before diving into the gaps and behavior exhibited in Kata Containers, it is important to have +a thorough understanding of how cgroups are leveraged by Kubernetes. It is both straightforward +and confusing. An in-depth guide is available for background in [mcastelino's gist](https://gist.github.com/mcastelino/b8ce9a70b00ee56036dadd70ded53e9f).
+This may be good content to include in our repository eventually, or as part of the Kata blog series, +but let's leave it out of the scope of this initial documentation. -This document is part of the `cgroup-sprint` GitHub milestone, and can be observed in the [Milestone's GitHub project](https://github.com/orgs/kata-containers/projects/17). +This document is part of the `cgroup-sprint` GitHub milestone, and can be observed in the milestone's +[GitHub project](https://github.com/orgs/kata-containers/projects/17). ## Existing Behavior (Kata 1.6): -### In Docker -### In CRI-O -### In Containerd +### Behavior Observed using various upper layer tools + +To exhibit current behavior, we utilize a simple guaranteed pod description: +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: guar-runc +spec: + containers: + - name: cont-2cpu-400m + image: busybox + resources: + limits: + cpu: 2 + memory: "400Mi" + command: ["md5sum"] + args: ["/dev/urandom"] + - name: cont-3cpu-200m + image: busybox + resources: + limits: + cpu: 3 + memory: "200Mi" + command: ["md5sum"] + args: ["/dev/urandom"] +``` +We'll show the behavior starting with the simplest scenario, Docker, followed by containerd +and CRI-O. + +#### In Docker + +#### In Kubernetes + Containerd + +For each container in the pod, a cgroup is created within the pod cgroup (ie, under `/sys/fs/cgroup/*/kubepod/pod.*/` +for a guaranteed pod). This is not necessary; only a single cgroup which constrains the hypervisor +appropriately is required.
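As a rough illustration of what a single pod-wide cgroup would need to carry for the example pod above, the CPU quota is the sum of the per-container limits. The sketch below assumes the default CFS period of 100000us and whole-CPU limits; `sandbox_quota_us` is a hypothetical helper name used only for illustration, not part of Kata.

```bash
#!/usr/bin/env bash
# Sketch only: sum per-container CPU limits (whole CPUs) into one CFS quota
# for a single sandbox cgroup. PERIOD_US is an assumed default CFS period.
PERIOD_US=100000

sandbox_quota_us() {
    local total=0 cpus
    for cpus in "$@"; do
        total=$(( total + cpus * PERIOD_US ))
    done
    echo "$total"
}

# Example pod: one container limited to 2 CPUs, one limited to 3 CPUs.
sandbox_quota_us 2 3   # prints 500000
```

The result matches a quota of 200000 plus 300000 for the two containers, which is the value a pod-level cgroup would need in `cpu.cfs_quota_us`.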
+ +##### Where are all the vCPUS + +The vCPUs are placed under the pause container: + +```bash +# for i in `ls pod*/**/tasks`; do echo $i && for j in `cat $i`; do pstree -pt $j;done; done; +podf277e232-5ca6-11e9-b514-000d3a6d0876/2692eaedb55f8cfd1b9aadcbc5e3f0ac527cb39ff26d31877f1be5a495b966c1/tasks +podf277e232-5ca6-11e9-b514-000d3a6d0876/6689f72eef2161b85d5d57cb9f4670ae4e08f551d9aeb4b28efb67eb306034d8/tasks +podf277e232-5ca6-11e9-b514-000d3a6d0876/9d17d1c5d075ca42dfefb58ef7b82c8b1b234cc128ebf9332b902b866c0ebed7/tasks +{qemu-system-x86}(24011) +{qemu-system-x86}(24093) +{qemu-system-x86}(24097) +{qemu-system-x86}(24156) +{qemu-system-x86}(24157) +{qemu-system-x86}(24158) +``` + +In this case, `9d17d1` is the pause container, as you can see based on the summation for `cpu.cfs_quota_us`: +```bash +# cat */cpu.cfs_quota_us +300000 +200000 +500000 +``` +One drawback of this is that it assumes the existence of a pause container. This is an assumption +based on the implementation of containerd/cri-o. On the plus side, the cgroup is placed directly under +the pod cgroup, which is created and managed by Kubelet. Overall, this isn't terrible. + +##### Where are the v2-shim, QEMU, and Vhost processes + +As shown below, all of these are placed under the containerd.service system.slice. For `containerd-shim-kata-v2` +this isn't a major concern, as it is not expected to consume many resources, and it is pretty closely +coupled to containerd. + +QEMU itself and its vhost threads are very problematic. Depending on the workload, these can consume a +non-negligible amount of resources. Note, in the Kata implementation, these components are purposefully +not added to the constrained cgroup, the pause container. As a result, they fall under the caller's +cgroup, which in this case is the containerd service.
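Thread placement can also be read from procfs rather than by grepping the cgroup filesystem for PIDs. This is a minimal sketch, assuming a cgroup v1 host like the one used in this document; `cgroup_path_for` and `thread_cgroups` are hypothetical helper names, not existing tools.

```bash
#!/usr/bin/env bash
# Sketch only. Reads lines in the /proc/<pid>/cgroup format
# ("hierarchy-ID:controller-list:cgroup-path") on stdin and prints the
# cgroup path for the controller named in $1.
cgroup_path_for() {
    awk -F: -v c="$1" '{
        n = split($2, ctl, ",")
        for (i = 1; i <= n; i++)
            if (ctl[i] == c) {
                p = $0
                sub(/^[^:]*:[^:]*:/, "", p)   # drop the ID and controller fields
                print p
            }
    }'
}

# Print "<tid> <cpu cgroup path>" for every thread of the process in $1.
thread_cgroups() {
    local pid=$1 t
    for t in /proc/"$pid"/task/*; do
        [ -r "$t/cgroup" ] || continue
        printf '%s %s\n' "${t##*/}" "$(cgroup_path_for cpu < "$t/cgroup")"
    done
}
```

Running `thread_cgroups` with the QEMU PID would list each QEMU thread next to its `cpu` cgroup. vhost workers are kernel threads with their own PID, so they need a separate lookup, for example `cgroup_path_for cpu < /proc/<vhost-pid>/cgroup`.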
+ +The process location is determined as follows: + +v2-shim: +```bash +root@kata-k8s-containerd:/sys/fs/cgroup/cpu,cpuacct# ps -aef | grep containerd-shim-kata | grep -v grep + +root 23992 1 0 22:13 ? 00:00:00 /opt/kata/bin/containerd-shim-kata-v2 -namespace k8s.io -address /run/containerd/containerd.sock -publish-binary /usr/local/bin/containerd -id 9d17d1c5d075ca42dfefb58ef7b82c8b1b234cc128ebf9332b902b866c0ebed7 -debug +root@kata-k8s-containerd:/sys/fs/cgroup/cpu,cpuacct# grep -ir 23992 +system.slice/containerd.service/cgroup.procs:23992 +system.slice/containerd.service/tasks:23992 +``` + +qemu: +```bash +root@kata-k8s-containerd:/sys/fs/cgroup/cpu,cpuacct# ps -ae | grep qemu +24007 ? 00:37:26 qemu-system-x86 + +root@kata-k8s-containerd:/sys/fs/cgroup/cpu,cpuacct# grep -ir 24007 +kubepods/podf277e232-5ca6-11e9-b514-000d3a6d0876/9d17d1c5d075ca42dfefb58ef7b82c8b1b234cc128ebf9332b902b866c0ebed7/cgroup.procs:24007 +system.slice/containerd.service/cgroup.procs:24007 +system.slice/containerd.service/tasks:24007 +``` + +vhost: +```bash +root@kata-k8s-containerd:/sys/fs/cgroup/cpu,cpuacct# ps -aef | grep vhost | grep -v qemu | grep -v grep +root 24010 2 0 22:13 ? 00:00:00 [vhost-24007] + +root@kata-k8s-containerd:/sys/fs/cgroup/cpu,cpuacct# grep -ir 24010 +system.slice/containerd.service/cgroup.procs:24010 +system.slice/containerd.service/tasks:24010 +``` + +#### In Kubernetes + CRI-O (v1 shim) + +CRI-O is very similar to containerd, except for where the non-constrained processes end up. Instead +of being called by CRIO directly, kata-runtime is called from a process `conmon`, which is located +in a cgroup under the pod-cgroup. As expected based on prior exapmles, cgroups are ceated for each +container, and the QEMU vCPU threads are placed within the pause containers cgroup.
+ +```bash +pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-1b05886a39901ef3a7555796d38dcfaafd8fda929aef223ea576324a4949f9ef/tasks +{qemu-system-x86}(18061) +kata-shim(18207)─┬─{kata-shim}(18213) + ├─{kata-shim}(18215) + ├─{kata-shim}(18216) + ├─{kata-shim}(18217) + ├─{kata-shim}(18218) + ├─{kata-shim}(18219) + ├─{kata-shim}(18220) + ├─{kata-shim}(18221) + ├─{kata-shim}(18223) + ├─{kata-shim}(18224) + └─{kata-shim}(18226) +{kata-shim}(18213) +{kata-shim}(18215) +{kata-shim}(18216) +{kata-shim}(18217) +{kata-shim}(18218) +{kata-shim}(18219) +{kata-shim}(18220) +{kata-shim}(18221) +{kata-shim}(18223) +{kata-shim}(18224) +{kata-shim}(18226) +{qemu-system-x86}(18280) +{qemu-system-x86}(18281) +{qemu-system-x86}(18368) +{qemu-system-x86}(18369) +{qemu-system-x86}(18370) +``` + +The QEMU, vhost, proxy and shim threads, however, are placed under the caller's cgroup, which in this case +is `conmon`, which is a peer of the container cgroups we created. So, the good news is that QEMU, vhost, +etc, are constrained within the pod's cgroup. 
The bad news is these will be constrained based on the values +associated with conmon: +```bash +pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-conmon-1b05886a39901ef3a7555796d38dcfaafd8fda929aef223ea576324a4949f9ef/tasks +conmon(18040)─┬─kata-proxy(18063)─┬─{kata-proxy}(18070) + │ ├─{kata-proxy}(18071) + │ ├─{kata-proxy}(18072) + │ ├─{kata-proxy}(18073) + │ ├─{kata-proxy}(18074) + │ ├─{kata-proxy}(18075) + │ ├─{kata-proxy}(18076) + │ ├─{kata-proxy}(18077) + │ ├─{kata-proxy}(18286) + │ ├─{kata-proxy}(20199) + │ ├─{kata-proxy}(20200) + │ ├─{kata-proxy}(20201) + │ ├─{kata-proxy}(20202) + │ ├─{kata-proxy}(20203) + │ └─{kata-proxy}(28490) + ├─kata-shim(18207)─┬─{kata-shim}(18213) + │ ├─{kata-shim}(18215) + │ ├─{kata-shim}(18216) + │ ├─{kata-shim}(18217) + │ ├─{kata-shim}(18218) + │ ├─{kata-shim}(18219) + │ ├─{kata-shim}(18220) + │ ├─{kata-shim}(18221) + │ ├─{kata-shim}(18223) + │ ├─{kata-shim}(18224) + │ └─{kata-shim}(18226) + ├─qemu-system-x86(18058)─┬─{qemu-system-x86}(18059) + │ ├─{qemu-system-x86}(18061) + │ ├─{qemu-system-x86}(18280) + │ ├─{qemu-system-x86}(18281) + │ ├─{qemu-system-x86}(18368) + │ ├─{qemu-system-x86}(18369) + │ └─{qemu-system-x86}(18370) + └─{gmain}(18042) +{gmain}(18042) +qemu-system-x86(18058)─┬─{qemu-system-x86}(18059) + ├─{qemu-system-x86}(18061) + ├─{qemu-system-x86}(18280) + ├─{qemu-system-x86}(18281) + ├─{qemu-system-x86}(18368) + ├─{qemu-system-x86}(18369) + └─{qemu-system-x86}(18370) +{qemu-system-x86}(18059) +vhost-18058(18060) +``` + +Two things should happen here: + * work with CRI-O to determine a more appropriate CPU shares setting for conmon, to avoid impacting the container + cgroups (in case of runc) or the hypervisor's cgroup (in case of Kata). See + * do not place our IO threads, shim, proxy and QEMU process in conmon + +### Issues with current implementation + +There are a few major issues that result from the current implementation, and are motivation for design +changes. 
+ +#### Accurate usage accounting +The IO processing should be charged to the pod performing the IO, not against the system. Without +placing the IO threads in the same cgroup hierarchy as the pod, this will not be feasible. + +#### Node stability + +QEMU and its IO threads consume a non-negligible amount of resources. If the memory and CPU they use are not +constrained, measured, and accounted for, the node will run into CPU and memory pressure unexpectedly. + +#### Consistent guaranteed pod behavior + +Predictable performance is important for end users. By pushing IO threads into a shared pool, the +achievable performance will be inconsistent. Even if a user utilizes a `guaranteed` QoS pod, the +performance profile will differ depending on the amount of contention on the system. Raw unconstrained +performance is important for Kata, but not as important as consistent and predictable behavior. + +#### OOM, unbound CPU utilization + +Memory limits are enforced, not requests. Until [Pod Overhead KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/20190226-pod-overhead.md) +is added, users or admission controllers need to provide higher limits. + +In the case of CRI-O, the memory is being charged to conmon which is bounded by pod limits. As a result, +the workload can still be OOMed. This is okay from a node stability point of view, but not from a pod stability +point of view. This behavior assumes we are called by conmon and that it is constrained appropriately. Luckily, +this is reasonably correct from a memory point of view. I/O bound workloads will exhibit sub-optimal performance +due to the CPU constraints applied to conmon (where the IO threads run). + +For containerd, the memory is being charged to containerd, which is basically unbounded. This is bad for +node stability, as the `pod` is essentially unbounded. ## Proposed Changes -### Alternatives Considered +### Summary + * Pause cgroup cpu shares should be set up correctly.
+ * Do not create container cgroups on the host. Instead, create a pod sandbox group that is entirely managed by Kata + * Move the QEMU threads into the sandbox cgroup + +With these changes, performance and constraints for a pod are consistent. This constraining change will be +more restrictive relative to existing design. + +The overheads associated with running a sandbox should be accounted for explicitly, and at the pod level. +Once the Pod Overhead KEP is available, this should become a part of RuntimeClass, applied to pods which +utilize the applicable RuntimeClass. See [Pod Overhead KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/20190226-pod-overhead.md) +for more details. + +### Details + +#### Pod Sandbox Cgroup + * The pod sandbox cgroup should always be the summation of all container group resources. + * In the case of cri-o where it creates other cgroups for conmon, they will be siblings. + * The conmon cgroups today are rather large, so if conmon goes wild there is a possibility + the workload will get fewer resources. But it will not introduce any other side effects. + +## Alternatives Considered +### Only constrain vCPUs, leaving remaining threads for system reserved + +This will, in some cases, provide improved performance. ## Opens -t + +### static CPU configurations + +If static CPU policies are introduced, the end user will assign CPUs to a specific container within the pod. Running +IO threads on these CPUs may not be desirable, and may not match the user's expectations. + +Long term (ie, with RuntimeClass augmented to handle pod overheads), we should create a separate `cpuset` cgroup, +`kata-sandbox-vcpus`, alongside the standard sandbox cgroup, `kata-sandbox`. These would be siblings underneath the +pod cgroup, in the Kubernetes case. vCPU threads will be placed under `kata-sandbox-vcpus`, which will be updated +to use the CPUset suggested for the workload.
The remaining threads will be placed under `kata-sandbox`, utilizing +the remaining non-claimed CPUs (problem: is this even possible to determine?). The CPU cgroups will be managed +as normal. The non-vCPU threads will be limited to the CPU utilization provided by the pod overhead, in this case. + +In the short term, non-vCPU threads will need to share the cpuset, and the end-user will need to add additional +CPUs for overhead, if desired. From 1c4290bfaec63b9c9415dc21a9ff510f132216bd Mon Sep 17 00:00:00 2001 From: Eric Ernst Date: Mon, 15 Apr 2019 15:57:02 -0700 Subject: [PATCH 4/4] add docker details Signed-off-by: Eric Ernst --- design/resource-management.md | 72 +++++++++++++++++++++++++++++++++-- 1 file changed, 68 insertions(+), 4 deletions(-) diff --git a/design/resource-management.md b/design/resource-management.md index 1d966db4..83e84e07 100644 --- a/design/resource-management.md +++ b/design/resource-management.md @@ -72,6 +72,68 @@ and CRI-O. #### In Docker +We use a simple container: ```sudo docker run --cpus=2 --runtime=kata-qemu -it alpine sh``` + +For the example below, the containerID is `3cca57349adcfb17f917d2312b568faa85515e6d5a8327a30ea8e5ecd47b0e1f`. + +The vCPU threads associated with the container are constrained in a cgroup created by Kata within the +Docker directory, with a name that matches the containerID. We see three vCPU threads running within this cgroup, +while the remaining threads sit within the `docker.service` `system.slice`: + +``` +$ for c in `ps -aeT | grep qem | cut -c 9-14 `; do grep -ir $c .
| grep task ; done +./system.slice/docker.service/tasks:82005 +./system.slice/docker.service/tasks:82006 +./docker/3cca57349adcfb17f917d2312b568faa85515e6d5a8327a30ea8e5ecd47b0e1f/tasks:82009 +./docker/3cca57349adcfb17f917d2312b568faa85515e6d5a8327a30ea8e5ecd47b0e1f/tasks:82076 +./docker/3cca57349adcfb17f917d2312b568faa85515e6d5a8327a30ea8e5ecd47b0e1f/tasks:82077 +``` + + +Similarly, the kata-proxy and vhost threads sit within `docker.service` `system.slice`, while the kata-shim sits within the container's cgroup. + +```bash +$ grep -r `ps -ae|grep kata-proxy | cut -c -6` . | grep task +./system.slice/docker.service/tasks:82011 +$ grep -r `ps -ae|grep vhost | cut -c -6` . | grep task +./system.slice/docker.service/tasks:82008 +``` + +```bash +$ grep -r `ps -ae|grep kata-shim | cut -c -6` . | grep task +./docker/3cca57349adcfb17f917d2312b568faa85515e6d5a8327a30ea8e5ecd47b0e1f/tasks:82085 +``` + +The docker system.slice is unconstrained. The created cgroup is constrained based on the workload +definition (2 CPUs). + +```bash +/sys/fs/cgroup/cpu,cpuacct/docker$ cat cpu.shares cpu.cfs_quota_us +1024 +-1 +``` + +```bash +/sys/fs/cgroup/cpu,cpuacct/docker$ cat */cpu.shares */cpu.cfs_quota_us +1024 +200000 +``` + +This is interesting, given that there are three vCPUs (one default, plus two requested +CPUs). Since today *only* the vCPU threads and the kata-shim are within the container's cgroup, +this is adequate. The remaining threads (vhost, kata-proxy, QEMU itself) run unconstrained +on the host. + +All processes are left unconstrained for memory. +``` +/sys/fs/cgroup/memory/docker$ cat */memory.limit_in_bytes +9223372036854771712 +/sys/fs/cgroup/memory/docker$ cat memory.limit_in_bytes +9223372036854771712 +``` + +##### + #### In Kubernetes + Containerd For each container in the pod, a cgroup is created within the pod cgroup (ie, under `/sys/fs/cgroup/*/kubepod/pod.*/` @@ -154,8 +216,8 @@ system.slice/containerd.service/tasks:24010 CRI-O is very similar to containerd, except for where the non-constrained processes end up.
Instead of being called by CRIO directly, kata-runtime is called from a process `conmon`, which is located -in a cgroup under the pod-cgroup. As expected based on prior exapmles, cgroups are ceated for each -container, and the QEMU vCPU threads are placed within the pause containers cgroup. +in a cgroup under the pod-cgroup. As expected based on prior examples, cgroups are created for each +container, and the QEMU vCPU threads are placed within the pause container's cgroup. ```bash pod1cc61d33-5ca1-11e9-90bc-525400cfa589/crio-1b05886a39901ef3a7555796d38dcfaafd8fda929aef223ea576324a4949f9ef/tasks @@ -287,7 +349,7 @@ node stability, as the `pod` is essentially unbounded. ### Summary * Pause cgroup cpu shares should be set up correctly. * Do not create container cgroups on the host. Instead, create a pod sandbox group that is entirely managed by Kata - * Move the QEMU threads into the sandbox cgroup + * Move all of the Kata threads (vCPU, shimv2, kata-shim, kata-proxy, vhost, etc), not just vCPU threads, into the sandbox cgroup With these changes, performance and constraints for a pod are consistent. This constraining change will be more restrictive relative to existing design. @@ -309,7 +371,9 @@ for more details. ### Only constrain vCPUs, leaving remaining threads for system reserved -This will, in some cases, provide improved performance. +This will, in some cases, provide improved performance. Utilizing system reserved does not scale, though. +If the QEMU main thread and IO threads are placed here, unexpected failures could occur on a loaded system +with enforced constraints. ## Opens