[RFC] cgroup: Move all qemu to sandbox cgroup #1431
Conversation
Instead of moving only the vCPUs, move the whole qemu process. This will reduce network and IO performance for CPU-limited pods, but will provide better performance isolation so the sandbox is not a noisy neighbor. Fixes: kata-containers#1430 Signed-off-by: Jose Carlos Venegas Munoz <jose.carlos.venegas.munoz@intel.com>
Force-pushed 0c4caed to e631f2b
@Weichen81 you added this optimization in your first implementation; I'd like to hear your comments about this. @kata-containers/architecture-committee please take a look. |
|
/test |
mcastelino
left a comment
This makes sense.
To clarify: as long as all other containers are running without requiring the Guaranteed QoS class and the static CPU manager policy, this will have no impact?
|
Adding only the vCPUs was first introduced by me, I guess: #1189 (comment). There's a gap between the container world and the virtualization world: namespace-based containers normally have only a tiny overhead which can be ignored, while virtualization-based containers (AKA Kata) carry considerable overhead from the hypervisor. Before this commit, resource limits were applied only to the vCPU threads. After this commit, the qemu IO threads will be limited too. This raises two questions:
@egernst mentioned he is working on a new K8s proposal that can add pod overhead to the Pod spec; I hope it can help close the overhead gap. This commit helps solve the noisy neighbor problem, but I doubt whether it's worth the performance downgrade. Let's see what others think about it. @kata-containers/runtime @kata-containers/architecture-committee |
bergwolf
left a comment
IMO, this change will actually break user expectations when they require dedicated CPUs for their application through the k8s static CPU manager. I think we need to look at the bigger picture and see how to integrate with the k8s CPU manager policies being discussed in #878.
```go
	}); err != nil {
		return err
	}
	if err := cgroup.Add(cgroups.Process{Pid: pid}); err != nil {
```
The pid is also added above to a no-constraints cgroup. There is no need to do it anymore.
The no-constraints cgroup only adds the memory subsystem; to actually limit qemu in the rest of the subsystems we need to add it to this cgroup.
But it makes me wonder: would it be better to add qemu only to the cpuset subsystem rather than all the subsystems this cgroup has? That way, a container that is limited by CPU quota but not by cpuset will not have performance limitations.
Yes, we should not share the container CPU quota with the qemu process. Please take a look at #878 (comment).
That is an interesting idea. It requires making k8s a bit more aware of Kata Containers, so it sounds like a longer-term final solution. But anything that makes the ecosystem aware of the limitations we have in Kata is better. I can bring this topic to the next architecture meeting.
Hi @jcvenegas - did this happen? If so, can you summarise the discussion here?
|
Errors:
- Nemu CI error, failed to stop container
- Metrics memory degradation
- Failed factory cleanup
- Suse failed due to network issue, restarting
- fedora & vsock: docker state timeout, failed in 2 fedora jobs, race condition? http://jenkins.katacontainers.io/job/kata-containers-runtime-fedora-vsocks-PR/464/console
- ARM linter issues, not related
Agreed, this actually comes from that gap. If I could rework how Kubernetes schedules and requests pods through CRI, I would add sandbox overhead limits, and check whether a container creation request carries the pod sandbox annotation; the container from that request would then be limited together with the sandbox's additional bits: qemu, proxy, and other components.
This is a tricky question: if a user pays for 2 CPUs, would I give him an additional one for free because I know there is overhead? xD
This kind of overhead accounting or scheduling is a good item to add to the proposal, +1. If Kubernetes could assign a best-effort cgroup for the iothreads that would be awesome :).
I understand the worry about requested sandbox resources; the same happens with memory, where we do not account for the extra kernel or agent memory usage when providing memory to the containers. Again, @egernst's proposal sounds like the key for it. This topic is like a double-edged sword: on one side, the container has to pay for the sandbox overhead; on the other, we limit the sandbox so that overhead is not noisy and we keep the promise of using only the resources that were assigned. Something I need to confirm is whether k8s restricts cpusets at the pod level of its cgroup hierarchy; I don't think so, but that would simplify the issue a bit.
|
|
This will lead to a problem which we have met in production. |
|
@jcvenegas nudge. |
|
@jcvenegas any updates? Thx! |
|
ping @jcvenegas |
|
Ping @jcvenegas |
|
Closing the PR -- this is handled in #1880 |
Instead of moving only the vcpus, move the qemu process.
This will reduce network and IO performance for
cpu-limited pods but will provide better performance
isolation so the sandbox is not a noisy neighbor.
Fixes: #1430
Signed-off-by: Jose Carlos Venegas Munoz jose.carlos.venegas.munoz@intel.com