virtcontainers: process the case that kata-agent doesn't start in VM#351
Conversation
Force-pushed f4152d6 to f4b92c2
Codecov Report
@@ Coverage Diff @@
## master #351 +/- ##
=========================================
- Coverage 66.88% 66.8% -0.08%
=========================================
Files 93 93
Lines 9475 9495 +20
=========================================
+ Hits 6337 6343 +6
- Misses 2466 2474 +8
- Partials 672 678 +6
Continue to review full report at Codecov.
Force-pushed f4b92c2 to f3028fe
jodh-intel left a comment
Thanks for raising. I think we need @sboeuf's input on this one. If I'm understanding this, the scenario this PR covers is highly unlikely to occur in normal operation, unless maybe the agent crashed on startup?
There is also the question of performance - what sort of hit do we take by connecting and then disconnecting every time, given the very small likelihood that the agent isn't running, I wonder?
virtcontainers/kata_agent.go
Outdated
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
if _, err := k.client.Check(ctx, &grpc.CheckRequest{}); err != nil {
	k.Logger().Error("grpc server check failed")
How about:
k.Logger().WithError(err).Error("grpc server check failed")
This log output is used to locate the problem from the kata-runtime logfile when kata-agent has crashed in the VM.
Sure, but your code is not logging the actual error, so I'm suggesting you keep your error message and also add the original error (which might include more detail).
Thanks! This is a great suggestion about the error log. Back to your first comment, I agree that the kata-agent not running is very unlikely. But I have another question: do we have any measures to make sure that kata-agent will not fail in the VM?
virtcontainers/sandbox.go
Outdated
// to start the sandbox inside the VM.
return s.agent.startSandbox(s)
err := s.agent.startSandbox(s)
defer func() {
is defer needed? why not just
if err != nil {
	s.hypervisor.stopSandbox()
}
🤔
or
if err != nil {
	return s.hypervisor.stopSandbox()
}
in case stopSandbox fails
Thanks! I agree with your last proposed code, which covers the case where stopSandbox fails.
This is not semantically correct. We should not call stopSandbox() after startSandbox() returned some error. We would be running stopSandbox(), which assumes the sandbox has been properly started, when this is not true.
The valid case for such a thing could be to run stopSandbox() if further in the code, an error would occur, and we would want to rollback from there.
And if you want to do such a thing, I think it should be part of its own commit for more clarity ;)
Maybe we have some misunderstanding about s.agent.stopSandbox() and s.hypervisor.stopSandbox().
s.agent.stopSandbox() is used to stop the sandbox and the containers running in the VM; in other words, both the sandbox and the containers were started successfully in the VM.
s.hypervisor.stopSandbox() is used to stop the qemu process and release related resources; in other words, the sandbox was not created successfully in the VM.
Here, if s.agent.startSandbox() returns an error, it indicates that the sandbox was not created successfully in the VM, so I want to call s.hypervisor.stopSandbox() to stop the VM process.
virtcontainers/kata_agent.go
Outdated
return err
}

ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
why 5? should that number be a constant?
Good question. Maybe I need to consider which value is best for this scenario; here I think 5 seconds is an appropriate choice.
yep, just define a constant and add a comment, for example
// timeout to check grpc server is serving
const grpcCheckTimeout = 5 * time.Second
....
func (k *kataAgent) startSandbox(sandbox *Sandbox) error {
    ....
    ctx, cancel := context.WithTimeout(context.Background(), grpcCheckTimeout)
sboeuf left a comment
There are good ideas in this PR, but this needs to be refactored a bit.
virtcontainers/kata_agent.go
Outdated
k.Logger().Error("grpc server check failed")

//kill kata-proxy to release resources
k.proxy.stop(sandbox, k.state.ProxyPid)
The rollback of the proxy being started sounds like a good idea. But if we want this to be properly implemented, we should create a defer right after the proxy has been correctly started. You need to declare a function-scoped err error, and if, when the defer is executed, err != nil, then we should stop the proxy. This will apply to any error case that could happen in this function after the proxy has been started.
You also want to handle the case where the agent fails to start, which is great, but that is a separate thing IMO, which is why it should be split into 2 different commits. First you introduce proper rollback for the proxy, and then you introduce the check for the agent availability.
Force-pushed 9cb0a88 to f232158
@woshijpf do me a favour: run the next command with this patch; if you don't see a qemu process running, then this patch looks good to me

You have to look closely at the previous command to notice that it won't work as

@jodh-intel I think you mean

@devimc - I do! Thanks! 😄
@devimc I tested this patch on my Ubuntu 16.04 x86_64 laptop and it works with the following two cases:
# need to run as root
# docker run --runtime=kata-runtime -ti busybox sh
/ # ls
bin dev etc home proc root sys tmp usr var
$ cd <kata-containers>/osbuilder/rootfs-builder
# use euleros as example
$ ./rootfs.sh euleros
# remove kata-agent.service, so systemd can't launch the kata-agent process
$ cd ./rootfs-EulerOS/usr/lib/systemd/system/
$ rm kata-agent.service
$ rm kata-containers.target
# make rootfs-EulerOS into rootfs.img
$ cd <kata-containers>/osbuilder/image-builder
# we need to modify the rootfs image build shell script
$ cp image_builder.sh image_builder-no-agent.sh
$ vim image_builder-no-agent.sh
# comment out the following lines
129 # [ "${AGENT_INIT}" == "yes" ] || [ -x "${ROOTFS}/usr/bin/${AGENT_BIN}" ] || \
130 # die "/usr/bin/${AGENT_BIN} is not installed in ${ROOTFS}
131 # use AGENT_BIN env variable to change the expected agent binary name"
132 # OK "Agent installed"
# next, we can run image_builder-no-agent.sh to make rootfs image
$ ./image_builder-no-agent.sh ../rootfs-builder/rootfs-EulerOS/
# finally, we get the Euleros rootfs.img without kata-agent
$ ls
Dockerfile image_builder-no-agent.sh image_builder.sh kata-containers.img README.md
# we can copy kata-containers.img to anywhere, but you need to modify the "image = " field
# in the /usr/share/defaults/kata-containers/configuration.toml file
Now, we can use the same command as in the previous case to test this patch:
# need to run as root
# docker run --runtime=kata-runtime -ti busybox sh
docker: Error response from daemon: OCI runtime create failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded: unknown.
# we get the error output as expected; now let's see if any kata-proxy and qemu processes are still resident in the host system
$ ps -auxww --sort=start_time
......
root 10857 0.0 0.0 0 0 ? S 23:11 0:00 [kworker/0:1]
root 10858 0.0 0.0 0 0 ? S 23:11 0:00 [kworker/1:3]
root 11118 0.0 0.0 44744 3752 pts/17 R+ 23:14 0:00 ps -auxww --sort=start_time
We get the result as expected, so this patch works well.
@woshijpf nice and thanks, and what about

@devimc Yep, I just ran the command
if err != nil && pid > 0 {
	k.proxy.stop(sandbox, pid)
}
}()
Could you move this defer block after the check for err:
pid, uri, err := k.proxy.start(sandbox, proxyParams)
if err != nil {
	return err
}
// If error occurs after kata-proxy start, rollback to kill kata-proxy process
defer func() {
	if err != nil && pid > 0 {
		k.proxy.stop(sandbox, pid)
	}
}()
@sboeuf Will the defer function be called at all if it is moved after the check for err?
Yes, the defer is called all the time (when the function returns), and if you look further at the code in this function, there are numerous cases where we can return a non-nil error, meaning the proxy will get stopped.
@sboeuf Since I was not sure when the defer function would be called, I wrote a simple program to test it. I came to the conclusion that a deferred function is not always called when the function ends: for example, if a return is executed before the defer statement in the function body, the deferred function will not be called.
But here I will move the defer block after the check for err as you suggested, because it has no effect on behaviour.
If k.proxy.start() returns an error, that indicates the kata-proxy process didn't start, so we don't need to do the kata-proxy rollback operation.
Yes exactly, if the defer statement is reached, then it will be called when the function returns.
And yes, I have asked for the defer to be called after the error check of k.proxy.start() because we don't need any rollback if err != nil in this case.
// If error occurs after kata-proxy start, rollback to kill kata-proxy process
defer func() {
if err != nil && pid > 0{
if err != nil && pid > 0 {
This added space (go formatting) should be part of the first commit.
virtcontainers/kata_agent.go
Outdated
ctx, cancel := context.WithTimeout(context.Background(), grpcCheckTimeout)
defer cancel()
if _, err = k.client.Check(ctx, &grpc.CheckRequest{}); err != nil {
This looks good, but I was thinking that in order to avoid duplication of code, the check with timeout should be available through the agent interface itself. Thus, you could enhance the agent interface regarding check() function, and then rely on this new version to call it from here.
You can simply use k.check() here, and it handles grpc connect/disconnect properly.
@bergwolf k.check doesn‘t have timeout. Need to enhance it.
@jshachm Take a look at installReqFunc(). The Check request has a built-in 5 second timeout, the same as you defined in this PR.
@bergwolf sorry for missing it~~ Yep, with the built-in timeout, k.check will do the job.
@bergwolf Thanks, it's a great suggestion; it makes the code cleaner and simpler.
virtcontainers/kata_agent.go
Outdated
k.Logger().Error("grpc server check failed, get the error: ", err)
return err
}
k.disconnect()
virtcontainers/sandbox.go
Outdated
if err != nil {
	//VM is started but sandbox is not created successfully in VM
	//so we need to rollback to stop VM qemu process
	s.hypervisor.stopSandbox()
Sorry I didn't realize we were talking about hypervisor.stopSandbox()....
Could you isolate this into its own commit as this addresses the rollback of the VM.
Also, you should follow the same model you followed for the proxy rollback, meaning you defer the hypervisor.stopSandbox() after you have properly started the VM. This way you will cover both of the error cases that could happen, from waitSandbox() and from agent.startSandbox().
@woshijpf there is a lot of work related to rollback, and I'd be more than happy if you want to look into it. We need to be careful about not having rollbacks that would collide with each other.
@sboeuf Thanks for your careful review. I'm glad to dig into it and solve the rollback-related problems.
@sboeuf @jodh-intel @devimc Hi buddies~ I am fixing the problem of qemu and kata-proxy processes residing in the system while
@woshijpf you could move the
Force-pushed 031b9c5 to 004b83b
@sboeuf Could you help me review the code again? I modified it as you suggested.

@woshijpf I have just added one last comment. Thanks for your quick fixes!
Force-pushed 2df2345 to 6670b5d
Since the review has been done, I will rebuild the CI, wait for green, and then merge it.
@woshijpf we just fixed the CI; could you rebase this PR on the latest master? It's good to be merged once the CI passes.
@woshijpf any updates on this ? |
@sboeuf he is out of office for about a month and will be back next week, so updates will come next week~
@jshachm thx for the info!
@jshachm @woshijpf -- back yet? I want to get this contribution merged. |
@egernst I'm sorry for being away so long; I am going to rebase the code and update the pull request.
Force-pushed 6670b5d to bd80266
no worries @woshijpf - hope it was a good break. Hope the CI goes through smoothly. |
Build succeeded (third-party-check pipeline).
Force-pushed bd80266 to 2993cb3
Build succeeded (third-party-check pipeline).
If error occurs after sandbox network created successfully, we need to rollback to remove the created sandbox network
Fixes: kata-containers#297
Signed-off-by: flyflypeng <jiangpengfei9@huawei.com>

If some errors occur after kata-proxy start, we need to rollback to kill kata-proxy process
Fixes: kata-containers#297
Signed-off-by: flyflypeng <jiangpengfei9@huawei.com>

If some errors occur after qemu process start, then we need to rollback to kill qemu process
Fixes: kata-containers#297
Signed-off-by: flyflypeng <jiangpengfei9@huawei.com>

If kata-agent doesn't start in VM, we need to do some rollback operations to release related resources. Add grpc check() to check whether kata-agent is running or not
Fixes: kata-containers#297
Signed-off-by: flyflypeng <jiangpengfei9@huawei.com>
virtcontainers: process the case that kata-agent doesn't start in VM
If the kata-agent process is not launched in the VM, kata-runtime will exit because of a timeout;
however, the kata-proxy and qemu processes will still be resident on the host.
Call the Check grpc function to check whether the grpc server is running in the kata-agent process in the VM;
if not, do some rollback operations to stop the kata-proxy and qemu processes.
Fixes: #297
Signed-off-by: flyflypeng <jiangpengfei9@huawei.com>