virtcontainers : fix shared dir resource remaining#292
sboeuf merged 1 commit into kata-containers:master from jshachm:remove-host-shared
Conversation
virtcontainers/container.go
Outdated
} else if err := os.RemoveAll(m.HostPath); err != nil {
	// since the mounts related to the shared dir is umounted
	// we need to remove the host path to avoid resource remaining
	c.Logger().WithFields(logrus.Fields{
This log call essentially simulates a call to WithError() and although it is simpler than having to call WithError(), I think it's safer to explicitly call WithError() to protect against it changing its behaviour at some future point.
Since this is based on the log call code above, I'd do something like:
logger := c.Logger().WithField("host-path", m.HostPath)
if err := syscall.Unmount(m.HostPath, 0); err != nil {
logger.WithError(err).Warn("could not unmount")
} else if err := os.RemoveAll(m.HostPath); err != nil {
logger.WithError(err).Warn("could not remove path")
}
	"error": err,
}).Warn("Could not umount")
return err
} else if err := os.RemoveAll(m.HostPath); err != nil {
I think you should be able to create a unit test for this in container_test.go where you set c.mounts[0] to contain a double bind-mounted path. That way, the Unmount() will succeed but the RemoveAll() will fail (as there is still one more Unmount() call to make).
Codecov Report
@@ Coverage Diff @@
## master #292 +/- ##
==========================================
- Coverage 63.79% 63.77% -0.02%
==========================================
Files 87 87
Lines 8811 8823 +12
==========================================
+ Hits 5621 5627 +6
- Misses 2587 2592 +5
- Partials 603 604 +1
Continue to review full report at Codecov.
virtcontainers/sandbox.go
Outdated
}

// vm is stopped remove the sandbox shared dir
sandboxSharedDir := filepath.Join(kataHostSharedDir, s.id)
This cannot be done here. kataHostSharedDir is specific to the kata_agent.go implementation, and should never be used by the generic sandbox implementation.
Sorry for missing this. This should be done in a more generic way.
@jshachm You can introduce an agent interface to cleanup agent specific artifacts after the VM has stopped.
But wait, any cleanup specific to an agent implementation should be part of either stopContainer() if it's related to a container, or stopSandbox() if it's related to the sandbox itself.
I don't see the value (and actually this would imply duplication) of introducing a new method to the agent interface.
@sboeuf The kata host directory can only be deleted after the VM has stopped since it is shared with the VM through 9p. So this cannot be part of stopContainer or stopSandbox.
Oh yes good point, then in this case, a agent.cleanup() function as you suggested sounds like a good idea.
Need to find a better way to remove sandbox work dirs and solve the case where shared dirs are mounted to different destinations. So do-not-merge for now.
PSS Measurement: Memory inside container:
@sboeuf @amshinde Maybe more unit tests are needed as @jodh-intel suggested. Will push those later.
thanks @jshachm!
sboeuf
left a comment
Looks good, just a few things to update.
virtcontainers/kata_agent.go
Outdated
func (k *kataAgent) cleanupSandbox(sandbox *Sandbox) error {
	sandboxSharedDir := filepath.Join(kataHostSharedDir, sandbox.id)
	if err := os.RemoveAll(sandboxSharedDir); err != nil {
Please simplify the function this way:
func (k *kataAgent) cleanupSandbox(sandbox *Sandbox) error {
return os.RemoveAll(filepath.Join(kataHostSharedDir, sandbox.id))
}

if err := s.agent.cleanupSandbox(s); err != nil {
// cleanup resource failed shouldn't block destroy sandbox
// just raise a warning
s.Logger().WithError(err).Warnf("cleanup sandbox failed")
}
Please add a blank line to let the code breathe :)
var testHyperstartTtySocket = ""

// cleanUp Removes any stale sandbox/container state that can affect
// cleanup Removes any stale sandbox/container state that can affect
@sboeuf all changes done~
Thanks, I'll take a look!
@jshachm LGTM
@sboeuf Can't find the button in jenkins ... I can only add a
CI failures. This is from the 16.04 log: I'm seeing the same in the f27 run as well, so I suspect this is not just spurious.
Otherwise, yes, forcing a change of a commit SHA is the other way to force a build - modding the commit message or doing a rebase if something has been merged in the master branch are the most popular ways ;-)
grahamwhaley
left a comment
I'd really like to see a description in the commit message that details:
- what was wrong, and why it failed
- how and why we fixed it
The commit message right now doesn't really tell me if the dir was always left behind, or just under certain circumstances etc.
Having said that, generally for cleaning up :-)
lgtm
	"error": err,
}).Warn("Could not umount")
logger.WithError(err).Warn("Could not umount")
return err
Not a change by your patch, but directly related - it seems if we fail one unmount then we quit the loop. Would it maybe be better for us to continue to try and unmount the remaining mounts in c.mounts, and clean up as much as we can (even though we know we will ultimately fail this function)?
Same for the following RemoveAll as well.
@grahamwhaley great catch and suggestions~
When I coded this part, it also took me a lot of time to think about whether continuing to the next mount is better than breaking here. After thinking it over, discussing with teammates, and some "sad" experience from our production environment, I think an umount failure is a very bad situation, maybe caused by a mount point leak or other bad things... So we need to break right here and raise a warning to an upper platform like k8s.
In other words, if a leak has already happened, it may be better to alert the SRE department as quickly as possible to get involved in the problem.
What do you think about this? Glad to hear your suggestions in this situation~~~ ^_^
Hi @jshachm Yeah, it is a tricky call if we quit or carry on. I can see your point that 'fail fast' and try not to make things worse is perfectly valid. Maybe that is the safer way which will:
- possibly cause 'less damage' if things are really broken
- give the admins more of a chance to diagnose the problem
So, I'm happy to leave as is. If we do, can we add a big comment saying why we quit on the first failure ;-)
Probably best if we get some other input, maybe from @egernst and @gnawux, and others if anybody has a strong opinion on this?
I think the test failures are kata-containers/tests#427 (see #385 (comment)).
@grahamwhaley thanks for the reminder from @jodh-intel. As you mentioned, CI will not be happy until we change the CNI ping scripts?
@jshachm if you rebase your PR on top of master (make sure you fetch), you should get the CI tests passing since the CNI issue got fixed.
@jshachm Yes, CI has been fixed now. Please rebase on top of master.
Before this patch shared dir will reamin when sandox has already removed, espacilly for kata-agent mod. Do clean up shared dirs after all mounts are umounted. Fixes: #291 Signed-off-by: Haomin <caihaomin@huawei.com>
logging: Add sandbox field
Before this patch the shared dir would remain when the sandbox
had already been removed.
Do clean up shared dirs after all mounts are umounted.
Fix: #291
Signed-off-by: Haomin <caihaomin@huawei.com>