virtcontainers: process the case that kata-agent doesn't start in VM#351
Conversation
Force-pushed f4152d6 to f4b92c2
Codecov Report
@@ Coverage Diff @@
## master #351 +/- ##
=========================================
- Coverage 66.88% 66.8% -0.08%
=========================================
Files 93 93
Lines 9475 9495 +20
=========================================
+ Hits 6337 6343 +6
- Misses 2466 2474 +8
- Partials 672 678 +6
Continue to review full report at Codecov.
Force-pushed f4b92c2 to f3028fe
jodh-intel left a comment
Thanks for raising. I think we need @sboeuf's input on this one. If I'm understanding this, the scenario this PR covers is highly unlikely to occur in normal operation, unless maybe the agent crashed on startup?
There is also the question of performance - what sort of hit do we take by connecting and then disconnecting every time, given the very small likelihood that the agent isn't running, I wonder?
virtcontainers/kata_agent.go
Outdated
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
if _, err := k.client.Check(ctx, &grpc.CheckRequest{}); err != nil {
	k.Logger().Error("grpc server check failed")
How about:
k.Logger().WithError(err).Error("grpc server check failed")
This log output is used to locate the problem from the kata-runtime logfile when kata-agent has crashed in the VM.
Sure, but your code is not logging the actual error, so I'm suggesting you keep your error message and also add the original error (which might include more detail).
Thanks! This is a great suggestion about the error log. Back to your first comment, I agree that the kata-agent not running is very unlikely. But I have another question: do we have any measures to make sure that kata-agent will not fail in the VM?
virtcontainers/sandbox.go
Outdated
// to start the sandbox inside the VM.
return s.agent.startSandbox(s)
err := s.agent.startSandbox(s)
defer func() {
is defer needed? why not just
if err != nil {
	s.hypervisor.stopSandbox()
}
🤔
or
if err != nil {
	return s.hypervisor.stopSandbox()
}
in case stopSandbox fails
Thanks! I agree with your last proposed code, which covers the case where stopSandbox fails.
This is not semantically correct. We should not call stopSandbox() after startSandbox() returned some error. We would be running stopSandbox(), which assumes the sandbox has been properly started, when this is not true.
The valid case for such a thing could be to run stopSandbox() if further in the code, an error would occur, and we would want to rollback from there.
And if you want to do such a thing, I think it should be part of its own commit for more clarity ;)
Maybe we have some misunderstanding about s.agent.stopSandbox() and s.hypervisor.stopSandbox().
s.agent.stopSandbox() is used to stop the sandbox and the containers running in the VM; in other words, both the sandbox and the containers were started successfully in the VM.
s.hypervisor.stopSandbox() is used to stop the qemu process and release related resources; in other words, the sandbox was not created successfully in the VM.
Here, if s.agent.startSandbox() returns an error, it indicates that the sandbox was not created successfully in the VM, so I want to call s.hypervisor.stopSandbox() to stop the VM process.
virtcontainers/kata_agent.go
Outdated
return err
}

ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
why 5? should that number be a constant?
Good question. Maybe I need to consider which value is best for this scenario; here I think 5 seconds is an appropriate choice.
yep, just define a constant and add a comment, for example
// timeout to check grpc server is serving
const grpcCheckTimeout = 5 * time.Second
....
func (k *kataAgent) startSandbox(sandbox *Sandbox) error {
    ....
    ctx, cancel := context.WithTimeout(context.Background(), grpcCheckTimeout)
sboeuf left a comment
There are good ideas in this PR, but this needs to be refactored a bit.
virtcontainers/kata_agent.go
Outdated
k.Logger().Error("grpc server check failed")

//kill kata-proxy to release resources
k.proxy.stop(sandbox, k.state.ProxyPid)
The rollback of the proxy being started sounds like a good idea. But if we want this to be properly implemented, we should create a defer right after the proxy has been correctly started. You need to declare a function-scoped err error, and if, when the defer is executed, err != nil, then we should stop the proxy. This will apply to any error case that could happen in this function after the proxy has been started.
You also want to handle the case where the agent fails to start, which is great, but that is a separate thing IMO, which is why it should be split into 2 different commits. First you introduce proper rollback for the proxy, and then you introduce the check for the agent availability.
Force-pushed 9cb0a88 to f232158
@woshijpf do me a favour: run the next command with this patch; if you don't see a qemu process running, then this patch looks good to me

You have to look closely at the previous command to notice that it won't work as

@jodh-intel I think you mean

@devimc - I do! Thanks! 😄
@devimc I tested this patch on my Ubuntu 16.04 x86_64 laptop and it works with the following two cases:
# need to run as root
# docker run --runtime=kata-runtime -ti busybox sh
/ # ls
bin dev etc home proc root sys tmp usr var
$ cd <kata-containers>/osbuilder/rootfs-builder
# use euleros as example
$ ./rootfs.sh euleros
# remove kata-agent.service, so systemd can't launch the kata-agent process
$ cd ./rootfs-EulerOS/usr/lib/systemd/system/
$ rm kata-agent.service
$ rm kata-containers.target
# make rootfs-EulerOS into rootfs.img
$ cd <kata-containers>/osbuilder/image-builder
# we need to modify the rootfs image build shell script
$ cp image_builder.sh image_builder-no-agent.sh
$ vim image_builder-no-agent.sh
# comment out the following lines
129 # [ "${AGENT_INIT}" == "yes" ] || [ -x "${ROOTFS}/usr/bin/${AGENT_BIN}" ] || \
130 # die "/usr/bin/${AGENT_BIN} is not installed in ${ROOTFS}
131 # use AGENT_BIN env variable to change the expected agent binary name"
132 # OK "Agent installed"
# next, we can run image_builder-no-agent.sh to make rootfs image
$ ./image_builder-no-agent.sh ../rootfs-builder/rootfs-EulerOS/
# finally, we get the Euleros rootfs.img without kata-agent
$ ls
Dockerfile image_builder-no-agent.sh image_builder.sh kata-containers.img README.md
# we can copy kata-containers.img to anywhere, but you need to modify the "image = " field
# in the /usr/share/defaults/kata-containers/configuration.toml file
Now, we can use the same command as in the previous case to test this patch:
# need to run as root
# docker run --runtime=kata-runtime -ti busybox sh
docker: Error response from daemon: OCI runtime create failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded: unknown.
# we get the error output as expected; now let's see if any kata-proxy and qemu processes are still resident in the host system
$ ps -auxww --sort=start_time
......
root 10857 0.0 0.0 0 0 ? S 23:11 0:00 [kworker/0:1]
root 10858 0.0 0.0 0 0 ? S 23:11 0:00 [kworker/1:3]
root 11118 0.0 0.0 44744 3752 pts/17 R+ 23:14 0:00 ps -auxww --sort=start_time
We get the result as expected, so this patch works well.
@woshijpf nice and thanks, and what about

@devimc Yep, I just ran the command
if err != nil && pid > 0 {
	k.proxy.stop(sandbox, pid)
}
}()
Could you move this defer block after the check for err:
pid, uri, err := k.proxy.start(sandbox, proxyParams)
if err != nil {
	return err
}
// If error occurs after kata-proxy start, rollback to kill kata-proxy process
defer func() {
	if err != nil && pid > 0 {
		k.proxy.stop(sandbox, pid)
	}
}()
@sboeuf Will the defer function be called at all if it is moved after the check for err?
Yes, the defer is called all the time (when the function returns), and if you look further at the code in this function, there are numerous cases where we can return a non-nil error, meaning the proxy will get stopped.
@sboeuf Since I was not sure when the defer function would be called, I wrote a simple program to test it. I came to the conclusion that a deferred function is not always called when the function ends: for example, if a return is executed before the defer statement in the function body, the deferred function will not be called.
But here I will move the defer block after the check for err as you suggested, because it has no effect on behaviour.
If k.proxy.start() returns an error, that indicates the kata-proxy process didn't start, so we don't need to do the kata-proxy rollback operation.
Yes exactly, if the defer statement is reached, then it will be called when the function returns.
And yes, I have asked for the defer to be called after the error check of k.proxy.start() because we don't need any rollback if err != nil in this case.
// If error occurs after kata-proxy start, rollback to kill kata-proxy process
defer func() {
if err != nil && pid > 0{
if err != nil && pid > 0 {
This added space (go formatting) should be part of the first commit.
virtcontainers/kata_agent.go
Outdated
ctx, cancel := context.WithTimeout(context.Background(), grpcCheckTimeout)
defer cancel()
if _, err = k.client.Check(ctx, &grpc.CheckRequest{}); err != nil {
This looks good, but I was thinking that in order to avoid duplication of code, the check with timeout should be available through the agent interface itself. Thus, you could enhance the agent interface regarding check() function, and then rely on this new version to call it from here.
You can simply use k.check() here, and it handles grpc connect/disconnect properly.
@bergwolf k.check doesn‘t have timeout. Need to enhance it.
@jshachm Take a look at installReqFunc(). The Check request has a built-in 5 second timeout, the same as you defined in this PR.
@bergwolf sorry for missing it~~ Yep, with the built-in timeout, k.check will do the job.
@bergwolf Thanks, it's a great suggestion; it makes the code cleaner and simpler.
virtcontainers/kata_agent.go
Outdated
k.Logger().Error("grpc server check failed, get the error: ", err)
return err
}
k.disconnect()
virtcontainers/sandbox.go
Outdated
if err != nil {
	//VM is started but sandbox is not created successfully in VM
	//so we need to rollback to stop VM qemu process
	s.hypervisor.stopSandbox()
Sorry I didn't realize we were talking about hypervisor.stopSandbox()....
Could you isolate this into its own commit as this addresses the rollback of the VM.
Also, you should follow the same model you followed for the proxy rollback, meaning you defer the hypervisor.stopSandbox() after you have properly started the VM. This way you will cover both of the error cases that could happen, from waitSandbox() and from agent.startSandbox().
@woshijpf there is a lot of work related to rollback, and I'd be more than happy if you want to look into it. We need to be careful about not having rollbacks that would collide with each other.
@sboeuf Thanks for your careful review. I'm glad to dig into it and solve the rollback-related problems.
@sboeuf @jodh-intel @devimc Hi buddies~ I am fixing the problem of qemu and kata-proxy processes residing in the system while
@woshijpf you could move the
Force-pushed 031b9c5 to 004b83b
@sboeuf Could you help me review the code again? I modified it as you suggested.

@woshijpf I have just added one last comment. Thanks for your quick fixes!
Force-pushed 2df2345 to 6670b5d
Since the review has been done, I will rebuild the CI, wait for green, and then merge it.
@woshijpf we just fixed the CI; could you rebase this PR on the latest master? It's good to be merged once the CI passes.
@woshijpf any updates on this ? |
@sboeuf he is out of office for about a month and will be back next week, so updates will come next week~
@jshachm thx for the info!
@jshachm @woshijpf -- back yet? I want to get this contribution merged. |
@egernst I'm sorry for being away so long; I am going to rebase the code and update the pull request.
Force-pushed 6670b5d to bd80266
no worries @woshijpf - hope it was a good break. Hope the CI goes through smoothly. |
Build succeeded (third-party-check pipeline).
Force-pushed bd80266 to 2993cb3
Build succeeded (third-party-check pipeline).
If error occurs after sandbox network created successfully, we need to rollback to remove the created sandbox network
Fixes: kata-containers#297
Signed-off-by: flyflypeng <jiangpengfei9@huawei.com>

If some errors occur after kata-proxy start, we need to rollback to kill kata-proxy process
Fixes: kata-containers#297
Signed-off-by: flyflypeng <jiangpengfei9@huawei.com>

If some errors occur after qemu process start, then we need to rollback to kill qemu process
Fixes: kata-containers#297
Signed-off-by: flyflypeng <jiangpengfei9@huawei.com>

If kata-agent doesn't start in VM, we need to do some rollback operations to release related resources. Add grpc check() to check whether kata-agent is running or not
Fixes: kata-containers#297
Signed-off-by: flyflypeng <jiangpengfei9@huawei.com>
virtcontainers: process the case that kata-agent doesn't start in VM
If the kata-agent process is not launched in the VM, kata-runtime will exit because of a timeout;
however, the kata-proxy and qemu processes will still be resident on the host.
Call the Check grpc function to check whether the grpc server is running in the kata-agent process in the VM;
if not, do some rollback operations to stop the kata-proxy and qemu processes.
Fixes: #297
Signed-off-by: flyflypeng <jiangpengfei9@huawei.com>