This repository was archived by the owner on May 12, 2021. It is now read-only.

sandbox: Delete store artifacts if stopSandbox fails #1267

Closed
nitkon wants to merge 1 commit into kata-containers:master from nitkon:leakyPods

Conversation

@nitkon
Contributor

@nitkon nitkon commented Feb 21, 2019

If stopSandbox fails due to qmp/qemu issues,
at least clean up store artifacts before returning
errors.

Fixes: #1266

Signed-off-by: Nitesh Konkar niteshkonkar@in.ibm.com

@nitkon
Contributor Author

nitkon commented Feb 21, 2019

/test

Comment thread on virtcontainers/sandbox.go (the captured hunk is truncated):

```diff
 s.Logger().Info("Stopping VM")
-return s.hypervisor.stopSandbox()
+if err := s.hypervisor.stopSandbox(); err != nil {
+	s.store.Delete()
```
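Since the hunk above is cut off, here is a compilable sketch of what the patched function plausibly looks like; the stand-in types, the function name stopVM, and the trailing return are assumptions, not part of the diff:

```go
package main

// Stand-ins so the sketch compiles; the real types live in
// virtcontainers and do much more.
type hypervisor struct{}

func (h *hypervisor) stopSandbox() error { return nil }

type store struct{}

func (st *store) Delete() error { return nil }

type entry struct{}

func (e entry) Info(msg string) {}

type Sandbox struct {
	hypervisor *hypervisor
	store      *store
}

func (s *Sandbox) Logger() entry { return entry{} }

// The change under review: if stopping the VM fails, remove the on-disk
// sandbox store before returning the error.
func (s *Sandbox) stopVM() error {
	s.Logger().Info("Stopping VM")
	if err := s.hypervisor.stopSandbox(); err != nil {
		s.store.Delete() // best-effort cleanup; its error is ignored
		return err
	}
	return nil
}

func main() {
	s := &Sandbox{hypervisor: &hypervisor{}, store: &store{}}
	_ = s.stopVM()
}
```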

It does not seem right to clean up the store when stopSandbox() fails; doing this from such a low-level function is not appropriate IMO.

@nitkon
Contributor Author

nitkon commented Feb 22, 2019

@sboeuf: Okay. Then in the case of stopSandbox failure:

  1. Should we clean up the artifacts so that we leave the host clean?
  2. If yes, when/where should we clean up the artifacts?
  3. Can it be a part of cleanupVM? We could clean up store.ConfigStoragePath as well.

@sboeuf

sboeuf commented Feb 22, 2019

@nitkon

  1. Should we clean up the artifacts so that we leave the host clean?
  2. If yes, when/where should we clean up the artifacts?

I'm not sure we want to clean up the artifacts if a stop fails, but if we do, it should happen from func (s *Sandbox) Stop() error instead, using a defer function (see the sketch after this comment).
@sameo do you think the cleanup of the store should happen if the stop fails? I would assume that even if it failed, a delete would be called later to make sure things get cleaned up. And if we do the store delete first, unfortunately, the delete from the caller would also fail, which sounds bad to me.

  1. Can it be a part of cleanupVM? We could clean up store.ConfigStoragePath as well.

No, we cannot do that, because the cleanupVM function is specific to the QEMU implementation, while the store is generic to the whole sandbox.
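A compilable sketch of the alternative described above: keep the low-level stop path untouched and do the conditional cleanup from Sandbox.Stop() behind a deferred function. The stand-in types and the stopVM call are assumptions, as in the earlier sketch; the real Stop() does much more.

```go
package main

type store struct{}

func (st *store) Delete() error { return nil }

type Sandbox struct{ store *store }

// stopVM stands in for the low-level stop path (assumed name).
func (s *Sandbox) stopVM() error { return nil }

// Stop sketches the suggestion: the deferred function inspects the named
// return value and wipes the store only when the stop failed.
func (s *Sandbox) Stop() (err error) {
	defer func() {
		if err != nil {
			// Hypothetical policy; whether this is desirable is exactly
			// what the thread is debating.
			_ = s.store.Delete()
		}
	}()
	return s.stopVM()
}

func main() {
	_ = (&Sandbox{store: &store{}}).Stop()
}
```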

@nitkon
Contributor Author

nitkon commented Feb 22, 2019

@sboeuf: Assuming I am looking at the right place, it would not error out if the artifacts directory store.ConfigStoragePath is already deleted.

@sboeuf

sboeuf commented Feb 22, 2019

@nitkon
You mean that it would not error out if the same function was called twice? Looking at your link, you're right, but you have to keep in mind this is dependent on the implementation. Here you're pointing to the filesystem implementation, which works that way, but can we assume this is the way the whole Store interface should behave?

If that's the case, it should be said explicitly that Store.Delete() is idempotent and will not fail if it's been called before.
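To make the idempotency question concrete: if the filesystem store's Delete() boils down to removing a directory tree, Go's os.RemoveAll already gives idempotent behavior, since it returns nil for a path that no longer exists. A standalone sketch (the path is hypothetical, not the real store layout):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	dir := "/tmp/vc-sbs-example" // hypothetical path, not the real store layout

	if err := os.MkdirAll(dir, 0o750); err != nil {
		panic(err)
	}

	fmt.Println(os.RemoveAll(dir)) // <nil>: removes the directory
	fmt.Println(os.RemoveAll(dir)) // <nil>: already gone, still no error
}
```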

@nitkon
Contributor Author

nitkon commented Feb 22, 2019

Two things:

  1. "I would assume that even if it failed, a delete would be called later to make sure things get cleaned up":
     I guess if stopSandbox fails, things are not getting cleaned up, hence the PR. If you are saying something else, could you please elaborate?

  2. Whether Store.Delete() should be idempotent or not, I am not sure. As you rightly said, that's how it behaves in the current implementation.

@sboeuf

sboeuf commented Feb 22, 2019

@nitkon

1. I guess if stopSandbox fails, things are not getting cleaned up, hence the PR. If you are saying something else, could you please elaborate?

Well, the wrong behavior (delete not being called, or not performed) might be the result of some other code path. I'm just saying that if you think from a caller's perspective, you will do a create, then a start, and if your stop fails, do you expect all information related to the container to be removed? I don't think so; instead, I still expect to be able to delete the thing that actually failed to be stopped (see the sketch after this comment). That's why I don't think we should delete the entire store related to a sandbox when the stop fails. But I could use more opinions here for sure: @sameo @egernst @amshinde @jodh-intel @grahamwhaley @bergwolf

2. Whether Store.Delete() should be idempotent or not, I am not sure. As you rightly said, that's how it behaves in the current implementation.

No, I didn't say that the store behaves this way; I said that the filesystem implementation of the store behaves this way. But I suggest that if this is the behavior we want for the whole store, then we should specify it at the interface level.
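A compilable sketch of the caller-side expectation described in point 1 above; the interface and types are hypothetical shorthand for the real sandbox lifecycle API:

```go
package main

import (
	"errors"
	"fmt"
)

// sandboxAPI is hypothetical shorthand for the real lifecycle API.
type sandboxAPI interface {
	Stop() error
	Delete() error
}

type fakeSandbox struct{}

func (f fakeSandbox) Stop() error   { return errors.New("stopSandbox failed") }
func (f fakeSandbox) Delete() error { return nil }

// shutdown shows the expectation: Delete() must still work on a sandbox
// whose Stop() failed. If Stop() had already wiped the store, the
// Delete() here would fail too -- the behavior being objected to.
func shutdown(sb sandboxAPI) error {
	stopErr := sb.Stop()
	if err := sb.Delete(); err != nil {
		return err
	}
	return stopErr
}

func main() {
	fmt.Println(shutdown(fakeSandbox{})) // stopSandbox failed
}
```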

@nitkon
Contributor Author

nitkon commented Feb 23, 2019

@sboeuf: not sure if this helps, but just FYI, the use case I am trying to solve is one where qmp executes qmp_capabilities or quit, but the guest panics (GUEST_PANICKED) for some reason and hence shuts down (event: SHUTDOWN), resulting in "Fail to execute qmp QUIT". This leaves sandbox artifacts behind on the host system, and the CI status is then reported as failed: ERROR: 2 pods left and found at /var/lib/vc/sbs

Feb 21 11:15:22 kata1 kata-runtime[80116]: time="2019-02-21T11:15:22.69012585+05:30" level=info msg="Stopping Sandbox" arch=ppc64le command=kill container=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 name=kata-runtime pid=80116 sandbox=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 source=virtcontainers subsystem=qemu
Feb 21 11:15:22 kata1 kata-proxy[80043]: time="2019-02-21T11:15:22.690145432+05:30" level=fatal msg="channel error" error="accept unix /run/vc/sbs/eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43/proxy.sock: use of closed network connection" name=kata-proxy pid=80043 sandbox=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 source=proxy
Feb 21 11:15:22 kata1 kata-runtime[80116]: time="2019-02-21T11:15:22.690742711+05:30" level=info msg="{"QMP": {"version": {"qemu": {"micro": 2, "minor": 11, "major": 2}, "package": " (v2.11.2-dirty)"}, "capabilities": []}}" arch=ppc64le command=kill container=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 name=kata-runtime pid=80116 sandbox=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 source=virtcontainers subsystem=qmp
Feb 21 11:15:22 kata1 kata-runtime[80116]: time="2019-02-21T11:15:22.690915681+05:30" level=info msg="{"execute":"qmp_capabilities"}" arch=ppc64le command=kill container=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 name=kata-runtime pid=80116 sandbox=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 source=virtcontainers subsystem=qmp
Feb 21 11:15:22 kata1 kata-runtime[80116]: time="2019-02-21T11:15:22.695333298+05:30" level=info msg="{"return": {}}" arch=ppc64le command=kill container=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 name=kata-runtime pid=80116 sandbox=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 source=virtcontainers subsystem=qmp
Feb 21 11:15:22 kata1 kata-runtime[80116]: time="2019-02-21T11:15:22.695421522+05:30" level=info msg="{"timestamp": {"seconds": 1550727922, "microseconds": 695225}, "event": "GUEST_PANICKED", "data": {"action": "pause"}}" arch=ppc64le command=kill container=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 name=kata-runtime pid=80116 sandbox=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 source=virtcontainers subsystem=qmp
Feb 21 11:15:22 kata1 kata-runtime[80116]: time="2019-02-21T11:15:22.695523356+05:30" level=info msg="{"execute":"quit"}" arch=ppc64le command=kill container=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 name=kata-runtime pid=80116 sandbox=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 source=virtcontainers subsystem=qmp
Feb 21 11:15:22 kata1 kata-runtime[80116]: time="2019-02-21T11:15:22.695603993+05:30" level=info msg="{"timestamp": {"seconds": 1550727922, "microseconds": 695281}, "event": "GUEST_PANICKED", "data": {"action": "poweroff"}}" arch=ppc64le command=kill container=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 name=kata-runtime pid=80116 sandbox=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 source=virtcontainers subsystem=qmp
Feb 21 11:15:22 kata1 kata-runtime[80116]: time="2019-02-21T11:15:22.695667025+05:30" level=info msg="{"timestamp": {"seconds": 1550727922, "microseconds": 695315}, "event": "SHUTDOWN", "data": {"guest": true}}" arch=ppc64le command=kill container=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 name=kata-runtime pid=80116 sandbox=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 source=virtcontainers subsystem=qmp
Feb 21 11:15:22 kata1 kata-runtime[80116]: time="2019-02-21T11:15:22.732232788+05:30" level=error msg="Fail to execute qmp QUIT" arch=ppc64le command=kill container=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 error="exitting QMP loop, command cancelled" name=kata-runtime pid=80116 sandbox=eabfef03b97c301db7cc83f59429998045674e324ccd9caf5b9ab2c67f2cee43 source=virtcontainers subsystem=qemu
Feb 21 11:15:22 kata1 kata-runtime[80116]: time="2019-02-21T11:15:22.732309691+05:30" level=info msg="cleanup vm path" arch=ppc64le command=kill

@sboeuf

sboeuf commented Feb 25, 2019

@nitkon thanks for the pointers.
First, the reason why we get GUEST_PANICKED needs to be investigated, as it is not clear to me why we end up with this error sporadically.

About the QMP_QUIT failure: I think in this case the agent errors out and exits, causing the VM to exit too, because of the systemd service. Take a look at this, and I think you should not run into this QMP_QUIT issue again.

@jodh-intel

jodh-intel commented Feb 26, 2019

@nitkon - Please can you run with full debug enabled and attach the proxy log (or agent log if you're using vsock).

@nitkon
Contributor Author

nitkon commented Feb 28, 2019

@jodh-intel @sboeuf: This does not occur when I run stand-alone; it happens only in the CI. I have attached the journalctl logs from when the VM panicked (GUEST_PANICKED).
Resulting in this at the end of the CI:

jenkins@kata1:~/workspace$ sudo ls -lrt  /var/lib/vc/sbs
total 4
drwxr-x--- 4 root root 4096 Feb 28 20:04 4d767603ba58677c992489543e9b9bd346b2a5487c4732a7280f6d7e8359b2ee

Journalctl logs: LeakyPods.log
Note: These were from the CI we ran on P8 for the agent repo.

@nitkon
Contributor Author

nitkon commented Mar 4, 2019

@jodh-intel: Are the attached logs good enough, or should I provide anything more?

@jodh-intel

@nitkon - thanks for attaching. I don't see any kernel message relating to the panic reported by qemu. As @sboeuf says, we need to understand why the guest kernel is failing, so if you can find a way to increase or capture guest kernel debug output for ppc64le, that would be helpful.

@jodh-intel

Ping @nitkon - any update?

@jodh-intel

/retest

@raravena80
Member

@nitkon any updates? Thx!

@raravena80
Member

@nitkon nudge

@jodh-intel

Ping @nitkon.

@nitkon
Contributor Author

nitkon commented May 3, 2019

I can confirm whether the issue still persists once I successfully get the CI running on one of the proxy test PRs.

@raravena80
Member

@nitkon any updates? Thx

@egernst
Member

egernst commented Jun 11, 2019

@nitkon - sorry if this PR is frustrating -- I know it's gone on for a while. I think we are mixing up:

  1. the root cause of a particular failure, versus
  2. how we clean up the host system when a failure occurs.

(1) is ongoing it seems per #1267 (comment), but (2) is the point of the PR. I agree that it may make sense to do this cleanup higher in the call stack. When stop fails, do the CRI-level implementations still call delete? What about in the Docker case?

@egernst added the needs-help label (Request for extra help (technical, resource, etc)) on Jun 11, 2019
@egernst
Member

egernst commented Jun 11, 2019

added "needs-help" -- let's gather some more info first -- @nitkon are you able to drive this?

@nitkon
Contributor Author

nitkon commented Jun 12, 2019

Hi @egernst, yes, I can drive this. However, I have been unable to reproduce this issue recently, and none of the Power CIs has failed due to this issue in recent times. cc @grahamwhaley

@grahamwhaley
Contributor

If we've not seen this recently on any of the CIs (/cc @chavafg @GabyCT for confirmation), then we can probably close this and re-open if it starts to re-occur.
I've not seen any tests fail due to 'leakiness' recently.

@raravena80
Member

@grahamwhaley close this?

@grahamwhaley
Contributor

@raravena80 - I think so, as long as @nitkon is OK with closing this. @nitkon, do you want to close this?

@jodh-intel

Ping @nitkon.

@nitkon
Contributor Author

nitkon commented Jul 29, 2019

Will re-open if it starts to re-occur. Closing it for now.

@nitkon nitkon closed this Jul 29, 2019

Labels

needs-help Request for extra help (technical, resource, etc)
