Fix hostd resource exhaustion in high load CI #6998
Merged
hickeng merged 3 commits into vmware:master on Dec 21, 2017
The delete tests for vic-machine and vic-machine-service leak VCHs. For the service, this is because the tests deploy VCHs directly and never clean them up. For base vic-machine, it is because we render the VCH invalid by moving the endpointVM, so the deletion fails unless explicit cleanup follows.
This reverts commit 7796336 because it is apparent that increasing the page size to this extent can cause hostd both to hit its resource limits and to drastically fragment its heap.
lcastellano approved these changes on Dec 21, 2017
We were checking for the existence of containers before they were created, an artifact of moving the check block ahead of the create step for the volume existence check. We also had a test installing with a named volume store that was not configured during the re-install and was therefore not deleted at the end.
cgtexmex approved these changes on Dec 21, 2017
hickeng added a commit to hickeng/vic that referenced this pull request on Dec 21, 2017
* Fix leak of VCHs after test runs
  The delete tests for vic-machine and vic-machine-service leak VCHs. For the service, this is because the tests deploy VCHs directly and never clean them up. For base vic-machine, it is because we render the VCH invalid by moving the endpointVM, so the deletion fails unless explicit cleanup follows.
* Revert "Increase event page size to 1000 (vmware#6937)"
  This reverts commit 7796336 because it is apparent that increasing the page size to this extent can cause hostd both to hit its resource limits and to drastically fragment its heap.
hickeng added a commit that referenced this pull request on Dec 21, 2017
* Fix leak of VCHs after test runs
  The delete tests for vic-machine and vic-machine-service leak VCHs. For the service, this is because the tests deploy VCHs directly and never clean them up. For base vic-machine, it is because we render the VCH invalid by moving the endpointVM, so the deletion fails unless explicit cleanup follows.
* Revert "Increase event page size to 1000 (#6937)"
  This reverts commit 7796336 because it is apparent that increasing the page size to this extent can cause hostd both to hit its resource limits and to drastically fragment its heap.
hickeng pushed a commit to hickeng/vic that referenced this pull request on Dec 29, 2017
An alternative to increasing the collector page size. It will reduce the throughput to the event collector and hence reduce event misses. See issues vmware#6937 and vmware#6998.
It's still being validated by other builds, but this is the minimal set of changes that should fix the issue of CI crashing hostd due to memory exhaustion. If I'm correct, we have two ways of triggering this:
a. leaked VCHs from Group23 and Group6 delete tests
b. event page size update
This is essentially a revert of the page size change, plus test modifications so that the tests clean up correctly.
This DOES NOT address how we handle event storms, which the page size change was targeted at; that will need additional work, likely using a page cursor or similar for demand scrolling of the event history view.
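
To make the page cursor idea concrete, here is a minimal sketch in Go. It is not the VIC or govmomi implementation: `Event`, `EventSource`, `Cursor`, `Drain` and the in-memory `sliceSource` are all hypothetical names introduced for illustration, and the sketch assumes event keys increase monotonically from 1. The point is only that the reader keeps a small fixed page size and scrolls on demand, rather than asking hostd for one very large page.

```go
package main

import "fmt"

// Event is a hypothetical stand-in for an event record.
type Event struct {
	Key     int
	Message string
}

// EventSource is a hypothetical view over event history; it is NOT a govmomi API.
type EventSource interface {
	// ReadNext returns up to max events with Key > afterKey, oldest first.
	ReadNext(afterKey, max int) []Event
}

// Cursor remembers the last key delivered, so each call fetches one bounded page.
type Cursor struct {
	src      EventSource
	pageSize int
	lastKey  int
}

// NewCursor assumes pageSize > 0 and that event keys start above 0.
func NewCursor(src EventSource, pageSize int) *Cursor {
	return &Cursor{src: src, pageSize: pageSize}
}

// Next fetches the next page on demand; an empty page means the reader has caught up.
func (c *Cursor) Next() []Event {
	page := c.src.ReadNext(c.lastKey, c.pageSize)
	if len(page) > 0 {
		c.lastKey = page[len(page)-1].Key
	}
	return page
}

// Drain scrolls through everything currently available, one page at a time,
// and reports how many events and pages were read.
func Drain(c *Cursor, handle func(Event)) (events, pages int) {
	for {
		page := c.Next()
		if len(page) == 0 {
			return events, pages
		}
		pages++
		for _, e := range page {
			handle(e)
			events++
		}
	}
}

// sliceSource is an in-memory EventSource used only for this demonstration.
type sliceSource []Event

func (s sliceSource) ReadNext(afterKey, max int) []Event {
	var out []Event
	for _, e := range s {
		if e.Key > afterKey {
			out = append(out, e)
			if len(out) == max {
				break
			}
		}
	}
	return out
}

func main() {
	var src sliceSource
	for i := 1; i <= 250; i++ {
		src = append(src, Event{Key: i, Message: fmt.Sprintf("event %d", i)})
	}
	c := NewCursor(src, 100)
	events, pages := Drain(c, func(Event) {})
	fmt.Println(events, pages) // prints "250 3"
}
```

With a page size of 100, a burst of 250 events is read in three bounded requests instead of one large one. Whether this actually relieves pressure on hostd would still need to be validated against a real event storm.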
Fixes #6886