This repository was archived by the owner on May 12, 2021. It is now read-only.

[RFC]cgroup: Move all qemu to sandbox cgroup #1431

Closed

jcvenegas wants to merge 1 commit into kata-containers:master from jcvenegas:apply-cgroups-to-all-qemu

Conversation

@jcvenegas
Member

Instead of moving only the vCPU threads, move the whole qemu process.

This will reduce network and IO performance for CPU-limited pods,
but will provide better performance isolation so the pod is not a
noisy neighbor.

Fixes: #1430

Signed-off-by: Jose Carlos Venegas Munoz <jose.carlos.venegas.munoz@intel.com>

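In cgroup terms, the change boils down to writing the qemu PID into the sandbox cgroup's cgroup.procs (which moves the whole thread group) instead of writing individual thread IDs into tasks. A minimal sketch, assuming the containerd/cgroups v1 API already used in virtcontainers/cgroups.go; the function name and arguments are illustrative:

```go
package main

import (
	"fmt"

	"github.com/containerd/cgroups"
	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// moveQemuToSandboxCgroup (illustrative) places the entire qemu process
// into the sandbox cgroup, so every thread it owns or spawns later
// (vCPUs, IO threads, the main loop) inherits the sandbox constraints.
func moveQemuToSandboxCgroup(sandboxPath string, qemuPid int, res *specs.LinuxResources) error {
	control, err := cgroups.New(cgroups.V1, cgroups.StaticPath(sandboxPath), res)
	if err != nil {
		return fmt.Errorf("create sandbox cgroup: %v", err)
	}
	// Add writes the PID to cgroup.procs, moving the whole thread
	// group rather than a single task.
	return control.Add(cgroups.Process{Pid: qemuPid})
}
```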
@jcvenegas jcvenegas force-pushed the apply-cgroups-to-all-qemu branch from 0c4caed to e631f2b on March 27, 2019 22:09
@jcvenegas jcvenegas changed the title from "[RFC]cgrups: Move all qemu to sandbox cgroup" to "[RFC]cgroup: Move all qemu to sandbox cgroup" on Mar 27, 2019
@jcvenegas
Member Author

@Weichen81 in your first implementation you added this optimization; I'd like to hear your comments on it. @kata-containers/architecture-committee please take a look.

@jcvenegas
Member Author

/test

Contributor

@mcastelino mcastelino left a comment


This makes sense.

To clarify: as long as all other containers run without requiring the Guaranteed QoS class and the static CPU manager policy, this will have no impact?

@WeiZhang555
Member

WeiZhang555 commented Mar 28, 2019

Adding only vCPUs was first introduced by me, I guess: #1189 (comment)

There's a gap between the container world and the virtualization world: namespace-based containers normally have only a tiny overhead which can be ignored, while a virtualization-based container (AKA Kata) has considerable overhead from the hypervisor.

Before this commit, the resource limits were applied only to the vCPU threads. After this commit, they will be applied to the qemu IO threads too (see the sketch below).
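For contrast, a rough sketch of the vCPU-only behaviour under the same containerd/cgroups v1 API; the helper name is illustrative, and the vCPU thread IDs are assumed to come from QEMU (e.g. via QMP query-cpus):

```go
import "github.com/containerd/cgroups"

// constrainVCPUThreadsOnly (illustrative) moves just the vCPU threads
// into the constrained cgroup; qemu's IO threads and main loop stay
// outside it and run unconstrained.
func constrainVCPUThreadsOnly(c cgroups.Cgroup, vcpuTids []int) error {
	for _, tid := range vcpuTids {
		// AddTask writes to the "tasks" file, moving one thread at a
		// time instead of the whole thread group.
		if err := c.AddTask(cgroups.Process{Pid: tid}); err != nil {
			return err
		}
	}
	return nil
}
```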

This brings up two questions:

  1. if a user pays money for 2 CPUs, should we give the two CPUs to the GuestOS or to GuestOS+qemu, considering qemu has considerable performance overhead?
  2. if the IO threads and the vCPUs share 2 CPUs, the contention will obviously decrease IO performance, especially for small-chunk IO reads/writes

@egernst mentioned he is working on a new proposal for K8s that can add pod overhead to the Pod spec; I hope it can help close the overhead gap.

This commit can help solve the noisy neighbor problem, but I doubt whether it's worth the performance degradation.

Let's see what others think about it.

@kata-containers/runtime @kata-containers/architecture-committee

Member

@bergwolf bergwolf left a comment


IMO, the change will actually break user expectations when they require dedicated CPUs for their application through the k8s static CPU manager. I think we need to look at the bigger picture and see how to integrate with the k8s CPU manager policies, as discussed in #878.

Comment thread on virtcontainers/cgroups.go:
}); err != nil {
return err
}
if err := cgroup.Add(cgroups.Process{Pid: pid}); err != nil {
Member


The pid is also added above to a no-constraints cgroup. There is no need to do it here anymore.

Member Author


The no-constraints cgroup only adds the memory subsystem; to actually limit qemu in the rest of the subsystems we need to add it to this cgroup.

But it makes me wonder: would it be better to add qemu only to the cpuset subsystem, rather than to all the subsystems the cgroup has (see the sketch below)? That way, a container that is limited by CPU quota but not by cpuset would not have performance limitations.
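If we went that way, here is a sketch of a cpuset-only hierarchy with containerd/cgroups v1; the filtering helper is illustrative:

```go
import "github.com/containerd/cgroups"

// cpusetOnly (illustrative) is a Hierarchy that keeps only the cpuset
// subsystem from the default v1 hierarchy. Creating the sandbox cgroup
// with it would pin qemu to the sandbox's CPU set without also
// subjecting it to CPU quota, blkio, and the other controllers.
func cpusetOnly() ([]cgroups.Subsystem, error) {
	all, err := cgroups.V1()
	if err != nil {
		return nil, err
	}
	var subsystems []cgroups.Subsystem
	for _, s := range all {
		if s.Name() == cgroups.Cpuset {
			subsystems = append(subsystems, s)
		}
	}
	return subsystems, nil
}

// Usage: cgroups.New(cpusetOnly, cgroups.StaticPath(sandboxPath), resources)
```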

Member


Yes, we should not share the container CPU quota with the qemu process. Please take a look at #878 (comment).

Member Author


That is an interesting idea. It requires a push to make k8s a bit more aware of Kata Containers, so it sounds like a longer-term final solution. But it is always better when the ecosystem takes into account the limitations we have in Kata. I can bring this topic to the next architecture meeting.



Hi @jcvenegas - did this happen? If so, can you summarise the discussion here?



Ping @jcvenegas.

@jcvenegas
Member Author

jcvenegas commented Mar 28, 2019

Errors:

Nemu CI error, failed to stop container
18:18:36     assertions.go:237: 
                        
	Error Trace:	container_stats_test.go:274
18:18:36         
			container_stats_test.go:323
18:18:36         
	Error:		Received unexpected error rpc error: code = Unknown desc = failed to stop container "b093d1c00d45e69d49b4772c563198e535faedf0f1ad813a93c30b513dab9e1f": container not running: unknown
18:18:36         
18:18:36     assertions.go:237: 
                        
	Error Trace:	container_stats_test.go:270
18:18:36         
			container_stats_test.go:323
18:18:36         
	Error:		Received unexpected error rpc error: code = Unknown desc = failed to set removing state for container "b093d1c00d45e69d49b4772c563198e535faedf0f1ad813a93c30b513dab9e1f": container is still running, to stop first
Metrics memory degradation
16:54:29 Report Summary:
16:54:29 +-----+----------------------+-------+--------+--------+-------+--------+--------+-------+------+-----+
16:54:29 | P/F |         NAME         |  FLR  |  MEAN  |  CEIL  |  GAP  |  MIN   |  MAX   |  RNG  | COV  | ITS |
16:54:29 +-----+----------------------+-------+--------+--------+-------+--------+--------+-------+------+-----+
16:54:29 | P   | boot-times           | 95.0% | 101.1% | 105.0% | 10.0% | 93.9%  | 110.9% | 18.2% | 3.9% |  20 |
16:54:29 | P   | memory-footprint     | 95.0% | 102.1% | 105.0% | 10.0% | 102.1% | 102.1% | 0.0%  | 0.0% |   1 |
16:54:29 | *F* | memory-footprint-ksm | 95.0% | 121.4% | 105.0% | 10.0% | 121.4% | 121.4% | 0.0%  | 0.0% |   1 |
16:54:29 +-----+----------------------+-------+--------+--------+-------+--------+--------+-------+------+-----+
16:54:29 Fails: 1, Passes 2
Failed factory cleanup
+ [[ 1553731386 -le 1553731386 ]]
+ sudo docker rm -f kata-l2bphF
kata-l2bphF
+ destroy_vm_template
+ echo 'destroy vm template'
destroy vm template
+ sudo -E PATH=/tmp/jenkins/workspace/kata-containers-runtime-ubuntu-18-04-PR-initrd/go/bin:/usr/local/go/bin:/usr/sbin:/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games /usr/local/bin/kata-runtime factory destroy
vm factory destroyed
++ mount
++ grep template
++ wc -l
+ res=1
+ '[' 1 -eq 0 ']'
+ die 'template factory is not cleaned up'
+ msg='template factory is not cleaned up'
+ echo 'ERROR: template factory is not cleaned up'
ERROR: template factory is not cleaned up
+ exit 1
Makefile:123: recipe for target 'vm-factory' failed

SUSE failed due to a network issue; restarting.

Fedora & vsocks docker state timeout, failed in 2 Fedora jobs; race condition?

http://jenkins.katacontainers.io/job/kata-containers-runtime-fedora-vsocks-PR/464/console
http://jenkins.katacontainers.io/job/kata-containers-runtime-fedora-PR/1754/console

• Failure [16.040 seconds]
state
/tmp/jenkins/workspace/kata-containers-runtime-fedora-vsocks-PR/go/src/github.com/kata-containers/tests/functional/state_test.go:26
  container
  /tmp/jenkins/workspace/kata-containers-runtime-fedora-vsocks-PR/go/src/github.com/kata-containers/tests/vendor/github.com/onsi/ginkgo/extensions/table/table.go:92
    with workload [true], timeWait 5 [It]
    /tmp/jenkins/workspace/kata-containers-runtime-fedora-vsocks-PR/go/src/github.com/kata-containers/tests/vendor/github.com/onsi/ginkgo/extensions/table/table_entry.go:46

    Expected
        <int>: 1
    to equal
        <int>: 0

    /tmp/jenkins/workspace/kata-containers-runtime-fedora-vsocks-PR/go/src/github.com/kata-containers/tests/functional/state_test.go:45
------------------------------
SSSSSSS

Summarizing 1 Failure:

[Fail] state container [It] with workload [true], timeWait 5 
ARM linter issues, not related
18:00:56 cli/kata-check.go:57:2: `genericCPUModelField` is unused (deadcode)
18:00:56 	genericCPUModelField  = "model name" // nolint: varcheck, unused
18:00:56 	^
18:01:08 Build step 'Execute shell' marked build as failure
18:01:08 Performing Post build task...

@jcvenegas
Member Author

> There's a gap between the container world and the virtualization world: namespace-based containers normally have only a tiny overhead which can be ignored, while a virtualization-based container (AKA Kata) has considerable overhead from the hypervisor.

Agreed, it actually comes from this gap. If I could rework how Kubernetes schedules and requests pods through CRI, I would add sandbox overhead limits: if a container creation request has the pod-sandbox annotation, the limits from that container would also cover the sandbox's additional bits (qemu, the proxy, and other extras).

> 1. if a user pays money for 2 CPUs, should we give the two CPUs to the GuestOS or to GuestOS+qemu, considering qemu has considerable performance overhead?

This is a tricky question: if a user pays for 2 CPUs, I'd give them one additional CPU for free, because I know there is overhead xD.

> 2. if the IO threads and the vCPUs share 2 CPUs, the contention will obviously decrease IO performance, especially for small-chunk IO reads/writes

> @egernst mentioned he is working on a new proposal for K8s that can add pod overhead to the Pod spec; I hope it can help close the overhead gap.

This kind of overhead/scheduling is a good item to add to the proposal, +1. If Kubernetes could assign a best-effort cgroup for iothreads, that would be awesome :).

> This commit can help solve the noisy neighbor problem, but I doubt whether it's worth the performance degradation.

I understand the worry about the requested sandbox resources; the same happens with memory, where we do not account for the extra kernel or agent memory used to provide memory to the containers. Again, @egernst's proposal sounds like the key for it. This topic is like a double-edged sword: on one side the container has to pay for the sandbox overhead; on the other, we limit that overhead so it is not noisy and we keep the promise of using only the resources that were assigned.

Something I need to confirm is whether k8s restricts the cpusets at the pod level of its cgroup hierarchy. I don't think so, but that would simplify the issue a bit (see the sketch below).
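One quick way to check it on a node, sketched with a hypothetical pod cgroup path (the real path depends on the cgroup driver and the pod UID):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Hypothetical pod-level path under the cgroupfs driver; replace
	// <uid> with a real pod UID. If kubelet restricted the cpuset at
	// the pod level, this file lists fewer CPUs than the host has.
	data, err := os.ReadFile("/sys/fs/cgroup/cpuset/kubepods/pod<uid>/cpuset.cpus")
	if err != nil {
		fmt.Println("could not read pod-level cpuset:", err)
		return
	}
	fmt.Printf("pod-level cpuset.cpus: %s", data)
}
```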

> Let's see what others think about it.
>
> @kata-containers/runtime @kata-containers/architecture-committee

@jshachm
Member

jshachm commented Apr 3, 2019

This will lead to a problem we have met in production: if we support cgroup-parent in the future, the kubelet resource manager will set a limit and request for the pod (called the sandbox here), which will lead to OOM very easily.

@raravena80
Member

@jcvenegas nudge.

@raravena80
Member

@jcvenegas any updates? Thx!

@caoruidong
Member

ping @jcvenegas

@raravena80
Member

Ping @jcvenegas

@egernst
Member

egernst commented Aug 28, 2019

Closing the PR -- this is handled in #1880

@egernst egernst closed this Aug 28, 2019
@jcvenegas jcvenegas deleted the apply-cgroups-to-all-qemu branch January 23, 2020 19:46

Development

Successfully merging this pull request may close these issues.

Not restricted iothreads and other qemu threads
