This repository was archived by the owner on May 12, 2021. It is now read-only.

sandbox: Delete store artifacts if stopSandbox fails #1267

Closed
nitkon wants to merge 1 commit into kata-containers:master from nitkon:leakyPods

Conversation

@nitkon
Contributor

@nitkon nitkon commented Feb 21, 2019

If stopSandbox fails due to qmp/qemu issues,
at least clean up store artifacts before returning
errors.

Fixes: #1266

Signed-off-by: Nitesh Konkar niteshkonkar@in.ibm.com

@nitkon
Contributor Author

nitkon commented Feb 21, 2019

/test

Comment thread on virtcontainers/sandbox.go (the captured hunk is truncated):

```diff
 s.Logger().Info("Stopping VM")
-return s.hypervisor.stopSandbox()
+if err := s.hypervisor.stopSandbox(); err != nil {
+	s.store.Delete()
```
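Since the hunk above is cut off, here is a compilable sketch of what the patched function plausibly looks like; the stand-in types, the function name stopVM, and the trailing return are assumptions, not part of the diff:

```go
package main

// Stand-ins so the sketch compiles; the real types live in
// virtcontainers and do much more.
type hypervisor struct{}

func (h *hypervisor) stopSandbox() error { return nil }

type store struct{}

func (st *store) Delete() error { return nil }

type entry struct{}

func (e entry) Info(msg string) {}

type Sandbox struct {
	hypervisor *hypervisor
	store      *store
}

func (s *Sandbox) Logger() entry { return entry{} }

// The change under review: if stopping the VM fails, remove the on-disk
// sandbox store before returning the error.
func (s *Sandbox) stopVM() error {
	s.Logger().Info("Stopping VM")
	if err := s.hypervisor.stopSandbox(); err != nil {
		s.store.Delete() // best-effort cleanup; its error is ignored
		return err
	}
	return nil
}

func main() {
	s := &Sandbox{hypervisor: &hypervisor{}, store: &store{}}
	_ = s.stopVM()
}
```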

It does not seem right to clean up the store when stopSandbox() fails; doing this from such a low-level function is not appropriate IMO.

@nitkon
Contributor Author

nitkon commented Feb 22, 2019

@sboeuf: Okay. Then in the case of stopSandbox failure:

  1. Should we clean up the artifacts so that we leave the host clean?
  2. If yes, when/where should we clean up the artifacts?
  3. Can it be a part of cleanupVM? We could clean up store.ConfigStoragePath as well.

@sboeuf

sboeuf commented Feb 22, 2019

@nitkon

  1. Should we clean up the artifacts so that we leave the host clean?
  2. If yes, when/where should we clean up the artifacts?

I'm not sure we want to clean up the artifacts if a stop fails, but if we do, it should happen from func (s *Sandbox) Stop() error instead, using a defer function (see the sketch after this comment).
@sameo do you think the cleanup of the store should happen if the stop fails? I would assume that even if it failed, a delete would be called later to make sure things get cleaned up. And if we do the store delete first, unfortunately, the delete from the caller would also fail, which sounds bad to me.

  1. Can it be a part of cleanupVM? We could clean up store.ConfigStoragePath as well.

No, we cannot do that, because the cleanupVM function is specific to the QEMU implementation, while the store is generic to the whole sandbox.
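A compilable sketch of the alternative described above: keep the low-level stop path untouched and do the conditional cleanup from Sandbox.Stop() behind a deferred function. The stand-in types and the stopVM call are assumptions, as in the earlier sketch; the real Stop() does much more.

```go
package main

type store struct{}

func (st *store) Delete() error { return nil }

type Sandbox struct{ store *store }

// stopVM stands in for the low-level stop path (assumed name).
func (s *Sandbox) stopVM() error { return nil }

// Stop sketches the suggestion: the deferred function inspects the named
// return value and wipes the store only when the stop failed.
func (s *Sandbox) Stop() (err error) {
	defer func() {
		if err != nil {
			// Hypothetical policy; whether this is desirable is exactly
			// what the thread is debating.
			_ = s.store.Delete()
		}
	}()
	return s.stopVM()
}

func main() {
	_ = (&Sandbox{store: &store{}}).Stop()
}
```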

@nitkon
Contributor Author

nitkon commented Feb 22, 2019

@sboeuf: Assuming I am looking at the right place, it would not error out if the artifacts directory store.ConfigStoragePath is already deleted.

@sboeuf

sboeuf commented Feb 22, 2019

@nitkon
You mean that it would not error out if the same function was called twice? Looking at your link, you're right, but you have to keep in mind this is dependent on the implementation. Here you're pointing to the filesystem implementation, which works that way, but can we assume this is the way the whole Store interface should behave?

If that's the case, it should be said explicitly that Store.Delete() is idempotent and will not fail if it's been called before.
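To make the idempotency question concrete: if the filesystem store's Delete() boils down to removing a directory tree, Go's os.RemoveAll already gives idempotent behavior, since it returns nil for a path that no longer exists. A standalone sketch (the path is hypothetical, not the real store layout):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	dir := "/tmp/vc-sbs-example" // hypothetical path, not the real store layout

	if err := os.MkdirAll(dir, 0o750); err != nil {
		panic(err)
	}

	fmt.Println(os.RemoveAll(dir)) // <nil>: removes the directory
	fmt.Println(os.RemoveAll(dir)) // <nil>: already gone, still no error
}
```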

@nitkon
Contributor Author

nitkon commented Feb 22, 2019

Two things:

  1. "I would assume that even if it failed, a delete would be called later to make sure things get cleaned up":
     I guess if stopSandbox fails, things are not getting cleaned up, hence the PR. If you are saying something else, could you please elaborate?

  2. Whether Store.Delete() should be idempotent or not, I am not sure. As you rightly said, that's how it behaves in the current implementation.

@sboeuf

sboeuf commented Feb 22, 2019

@nitkon

1. I guess if stopSandbox fails, things are not getting cleaned up, hence the PR. If you are saying something else, could you please elaborate?

Well, the wrong behavior (delete not being called, or not performed) might be the result of some other code path. I'm just saying that if you think from a caller's perspective, you will do a create, then a start, and if your stop fails, do you expect all information related to the container to be removed? I don't think so; instead, I still expect to be able to delete the thing that actually failed to be stopped (see the sketch after this comment). That's why I don't think we should delete the entire store related to a sandbox when the stop fails. But I could use more opinions here for sure: @sameo @egernst @amshinde @jodh-intel @grahamwhaley @bergwolf

2. Whether Store.Delete() should be idempotent or not, I am not sure. As you rightly said, that's how it behaves in the current implementation.

No, I didn't say that the store behaves this way; I said that the filesystem implementation of the store behaves this way. But I suggest that if this is the behavior we want for the whole store, then we should specify it at the interface level.
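A compilable sketch of the caller-side expectation described in point 1 above; the interface and types are hypothetical shorthand for the real sandbox lifecycle API:

```go
package main

import (
	"errors"
	"fmt"
)

// sandboxAPI is hypothetical shorthand for the real lifecycle API.
type sandboxAPI interface {
	Stop() error
	Delete() error
}

type fakeSandbox struct{}

func (f fakeSandbox) Stop() error   { return errors.New("stopSandbox failed") }
func (f fakeSandbox) Delete() error { return nil }

// shutdown shows the expectation: Delete() must still work on a sandbox
// whose Stop() failed. If Stop() had already wiped the store, the
// Delete() here would fail too -- the behavior being objected to.
func shutdown(sb sandboxAPI) error {
	stopErr := sb.Stop()
	if err := sb.Delete(); err != nil {
		return err
	}
	return stopErr
}

func main() {
	fmt.Println(shutdown(fakeSandbox{})) // stopSandbox failed
}
```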

@nitkon
Contributor Author

nitkon commented Feb 23, 2019

@sboeuf: not sure if this helps, but just FYI, the use case I am trying to solve is one where qmp executes qmp_capabilities or quit, but the guest panics (GUEST_PANICKED) for some reason and hence shuts down (event: SHUTDOWN), resulting in "Fail to execute qmp QUIT". This leaves sandbox artifacts behind on the host system, and the CI status is then reported as failed: ERROR: 2 pods left and found at /var/lib/vc/sbs

Feb 21 11:15:22 kata1 kata-runtime[80116]: time="2019-02-21T11:15:22.69012585+05:30" level=info msg="Stopping Sandbox" arch=ppc64le command=kill container=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 name=kata-runtime pid=80116 sandbox=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 source=virtcontainers subsystem=qemu
Feb 21 11:15:22 kata1 kata-proxy[80043]: time="2019-02-21T11:15:22.690145432+05:30" level=fatal msg="channel error" error="accept unix /run/vc/sbs/eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43/proxy.sock: use of closed network connection" name=kata-proxy pid=80043 sandbox=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 source=proxy
Feb 21 11:15:22 kata1 kata-runtime[80116]: time="2019-02-21T11:15:22.690742711+05:30" level=info msg="{"QMP": {"version": {"qemu": {"micro": 2, "minor": 11, "major": 2}, "package": " (v2.11.2-dirty)"}, "capabilities": []}}" arch=ppc64le command=kill container=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 name=kata-runtime pid=80116 sandbox=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 source=virtcontainers subsystem=qmp
Feb 21 11:15:22 kata1 kata-runtime[80116]: time="2019-02-21T11:15:22.690915681+05:30" level=info msg="{"execute":"qmp_capabilities"}" arch=ppc64le command=kill container=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 name=kata-runtime pid=80116 sandbox=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 source=virtcontainers subsystem=qmp
Feb 21 11:15:22 kata1 kata-runtime[80116]: time="2019-02-21T11:15:22.695333298+05:30" level=info msg="{"return": {}}" arch=ppc64le command=kill container=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 name=kata-runtime pid=80116 sandbox=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 source=virtcontainers subsystem=qmp
Feb 21 11:15:22 kata1 kata-runtime[80116]: time="2019-02-21T11:15:22.695421522+05:30" level=info msg="{"timestamp": {"seconds": 1550727922, "microseconds": 695225}, "event": "GUEST_PANICKED", "data": {"action": "pause"}}" arch=ppc64le command=kill container=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 name=kata-runtime pid=80116 sandbox=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 source=virtcontainers subsystem=qmp
Feb 21 11:15:22 kata1 kata-runtime[80116]: time="2019-02-21T11:15:22.695523356+05:30" level=info msg="{"execute":"quit"}" arch=ppc64le command=kill container=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 name=kata-runtime pid=80116 sandbox=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 source=virtcontainers subsystem=qmp
Feb 21 11:15:22 kata1 kata-runtime[80116]: time="2019-02-21T11:15:22.695603993+05:30" level=info msg="{"timestamp": {"seconds": 1550727922, "microseconds": 695281}, "event": "GUEST_PANICKED", "data": {"action": "poweroff"}}" arch=ppc64le command=kill container=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 name=kata-runtime pid=80116 sandbox=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 source=virtcontainers subsystem=qmp
Feb 21 11:15:22 kata1 kata-runtime[80116]: time="2019-02-21T11:15:22.695667025+05:30" level=info msg="{"timestamp": {"seconds": 1550727922, "microseconds": 695315}, "event": "SHUTDOWN", "data": {"guest": true}}" arch=ppc64le command=kill container=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 name=kata-runtime pid=80116 sandbox=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 source=virtcontainers subsystem=qmp
Feb 21 11:15:22 kata1 kata-runtime[80116]: time="2019-02-21T11:15:22.732232788+05:30" level=error msg="Fail to execute qmp QUIT" arch=ppc64le command=kill container=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 error="exitting QMP loop, command cancelled" name=kata-runtime pid=80116 sandbox=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 source=virtcontainers subsystem=qemu
Feb 21 11:15:22 kata1 kata-runtime[80116]: time="2019-02-21T11:15:22.732309691+05:30" level=info msg="cleanup vm path" arch=ppc64le command=kill

@sboeuf

sboeuf commented Feb 25, 2019

@nitkon thanks for the pointers.
First, the reason why we get GUEST_PANICKED needs to be investigated, as it is not clear to me why we end up with this error sporadically.

About the QMP_QUIT failure: I think in this case the agent errors out and exits, causing the VM to exit too, because of the systemd service. Take a look at this, and I think you should not run into this QMP_QUIT issue again.

@jodh-intel

jodh-intel commented Feb 26, 2019

@nitkon - Please can you run with full debug enabled and attach the proxy log (or agent log if you're using vsock).

@nitkon
Contributor Author

nitkon commented Feb 28, 2019

@jodh-intel @sboeuf: This does not occur when I run stand-alone; it happens only in the CI. I have attached the journalctl logs from when the VM panicked (GUEST_PANICKED).
Resulting in this at the end of the CI:

jenkins@kata1:~/workspace$ sudo ls -lrt  /var/lib/vc/sbs
total 4
drwxr-x--- 4 root root 4096 Feb 28 20:04 4d767603ba58677c992489543e9b9bd346b2a5487c4732a7280f6d7e8359b2ee

Journalctl logs: LeakyPods.log
Note: These were from the CI we ran on P8 for the agent repo.

@nitkon
Contributor Author

nitkon commented Mar 4, 2019

@jodh-intel: Are the attached logs good enough, or should I provide anything more?

@jodh-intel

@nitkon - thanks for attaching. I don't see any kernel message relating to the panic reported by qemu. As @sboeuf says, we need to understand why the guest kernel is failing, so if you can find a way to increase or capture guest kernel debug output for ppc64le, that would be helpful.

@jodh-intel

Ping @nitkon - any update?

@jodh-intel

/retest

@raravena80
Member

@nitkon any updates? Thx!

@raravena80
Member

@nitkon nudge

@jodh-intel

Ping @nitkon.

@nitkon
Contributor Author

nitkon commented May 3, 2019

I can confirm whether the issue still persists once I successfully get the CI running on one of the proxy test PRs.

@raravena80
Member

@nitkon any updates? Thx

@egernst
Member

egernst commented Jun 11, 2019

@nitkon - sorry if this PR is frustrating -- I know it's gone on for a while. I think we are mixing up:

  1. the root cause of a particular failure, versus
  2. how we clean up the host system when a failure occurs.

(1) is ongoing it seems per #1267 (comment), but (2) is the point of the PR. I agree that it may make sense to do this cleanup higher in the call stack. When stop fails, do the CRI-level implementations still call delete? What about in the Docker case?

@egernst added the needs-help label (Request for extra help (technical, resource, etc)) on Jun 11, 2019
@egernst
Member

egernst commented Jun 11, 2019

added "needs-help" -- let's gather some more info first -- @nitkon are you able to drive this?

@nitkon
Contributor Author

nitkon commented Jun 12, 2019

Hi @egernst, yes, I can drive this. However, I have been unable to reproduce this issue recently, and none of the Power CIs has failed due to this issue in recent times. cc @grahamwhaley

@grahamwhaley
Contributor

If we've not seen this recently on any of the CIs (/cc @chavafg @GabyCT for confirmation), then we can probably close this and re-open if it starts to re-occur.
I've not seen any tests fail due to 'leakiness' recently.

@raravena80
Member

@grahamwhaley close this?

@grahamwhaley
Contributor

@raravena80 - I think so, as long as @nitkon is OK with closing this. @nitkon, do you want to close this?

@jodh-intel

Ping @nitkon.

@nitkon
Contributor Author

nitkon commented Jul 29, 2019

Will re-open if it starts to re-occur. Closing it for now.

@nitkon nitkon closed this Jul 29, 2019

Labels

needs-help Request for extra help (technical, resource, etc)
