Fix hostd resource exhaustion in high load CI #6998

Merged
hickeng merged 3 commits into vmware:master from hickeng:6886
Dec 21, 2017
@hickeng hickeng commented Dec 21, 2017

It's still being validated by other builds, but this is the minimal set of changes that should fix the issue of CI crashing hostd due to memory exhaustion. If I'm correct, we have two ways of triggering this:
a. leaked VCHs from the Group23 and Group6 delete tests
b. the event page size update

This is essentially a revert of the page size change, plus changes to the tests so that they clean up correctly.
This DOES NOT address how we handle event storms, which the page size change was targeted at; that will need additional work, likely using a page cursor or similar for demand scrolling of the event history view.
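
To illustrate the page cursor idea mentioned above, here is a minimal sketch in Go. It is not code from this PR or from govmomi: the `Event` and `EventCursor` types and the in-memory `history` slice are hypothetical stand-ins for a server-side event history. The point is only that a consumer can pull small, fixed-size pages on demand instead of asking for one very large page.

```go
package main

import "fmt"

// Event is a hypothetical stand-in for an event record.
type Event struct {
	ID      int
	Message string
}

// EventCursor walks an event history in fixed-size pages.
// Illustrative only; this is not the govmomi or VIC API.
type EventCursor struct {
	history  []Event // assumption: stands in for the server-side history
	pageSize int     // upper bound on events returned per call
	next     int     // index of the first event not yet returned
}

// NewEventCursor creates a cursor; page sizes below 1 are clamped to 1.
func NewEventCursor(history []Event, pageSize int) *EventCursor {
	if pageSize < 1 {
		pageSize = 1
	}
	return &EventCursor{history: history, pageSize: pageSize}
}

// Next returns the next page (at most pageSize events) and reports
// whether more events remain after it.
func (c *EventCursor) Next() ([]Event, bool) {
	if c.next >= len(c.history) {
		return nil, false
	}
	end := c.next + c.pageSize
	if end > len(c.history) {
		end = len(c.history)
	}
	page := c.history[c.next:end]
	c.next = end
	return page, c.next < len(c.history)
}

func main() {
	history := make([]Event, 25)
	for i := range history {
		history[i] = Event{ID: i, Message: fmt.Sprintf("event-%d", i)}
	}

	// Pull pages of 10 until the history is exhausted.
	cur := NewEventCursor(history, 10)
	for {
		page, more := cur.Next()
		fmt.Printf("got %d events\n", len(page))
		if !more {
			break
		}
	}
}
```

With a cursor like this, the per-request cost stays bounded by the page size no matter how long the history grows, which is the property a large fixed page size gives up.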

Fixes #6886

The delete tests for vic-machine and vic-machine-service leak VCHs.
For the service it's because the tests deploy VCHs directly that are not
cleaned up.
For vic-machine base it's because we render the VCH invalid by moving the
endpointVM in such a manner that the deletion fails without explicit
cleanup after.

This reverts commit 7796336 because it's
apparent that increasing the page size to this extent can cause hostd to
both hit its resource limits and drastically fragment its heap.

We were checking for existence of containers before they were created as
an artifact of moving the check block prior to the create for the volume
existence check.

We had a test installing with a named volume store that was not configured
during the re-install and therefore not deleted at the end.
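
The VCH leak described above comes down to cleanup that only happens when the delete under test succeeds. As a rough sketch of the general pattern (not the actual test changes in this PR, which live in the CI test suites), the Go example below uses a deferred cleanup so a deployed VCH is removed even when the delete path fails. `Inventory`, `Deploy`, `Delete`, `ForceCleanup`, and `RunDeleteTest` are all hypothetical names invented for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Inventory is a hypothetical stand-in for the set of VCHs deployed
// in a test environment.
type Inventory struct {
	vchs map[string]bool
}

func NewInventory() *Inventory {
	return &Inventory{vchs: map[string]bool{}}
}

// Deploy records a VCH as present.
func (inv *Inventory) Deploy(name string) {
	inv.vchs[name] = true
}

// Delete models the normal delete path. Assumption for this sketch:
// it fails, leaving the VCH in place, when the VCH has been made invalid.
func (inv *Inventory) Delete(name string, valid bool) error {
	if !valid {
		return errors.New("delete failed: VCH is invalid")
	}
	delete(inv.vchs, name)
	return nil
}

// ForceCleanup models explicit cleanup that works regardless of VCH state.
func (inv *Inventory) ForceCleanup(name string) {
	delete(inv.vchs, name)
}

// RunDeleteTest deploys a VCH, exercises the delete path, and always
// cleans up afterwards.
func RunDeleteTest(inv *Inventory, name string, valid bool) error {
	inv.Deploy(name)
	// Deferred so the VCH is removed even when the delete under test fails.
	defer inv.ForceCleanup(name)
	return inv.Delete(name, valid)
}

func main() {
	inv := NewInventory()
	err := RunDeleteTest(inv, "vch-invalid", false)
	fmt.Println("delete error:", err)
	fmt.Println("leaked VCHs:", len(inv.vchs))
}
```

The test still observes and reports the delete failure; the deferred call only guarantees that nothing is left behind for later test runs to trip over.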
@hickeng hickeng changed the title Fix hostd resource exhaustion in high load CI [full ci] Fix hostd resource exhaustion in high load CI [specific ci=Group23-VIC-Machine-Service] Dec 21, 2017
@hickeng hickeng requested review from cgtexmex and zjs December 21, 2017 18:54
@mhagen-vmware mhagen-vmware left a comment

Lgtm

@hickeng hickeng changed the title Fix hostd resource exhaustion in high load CI [specific ci=Group23-VIC-Machine-Service] Fix hostd resource exhaustion in high load CI Dec 21, 2017
@hickeng hickeng merged commit 09bfe47 into vmware:master Dec 21, 2017
hickeng added a commit to hickeng/vic that referenced this pull request Dec 21, 2017
* Fix leak of VCHs after test runs

The delete tests for vic-machine and vic-machine-service leak VCHs.
For the service it's because the tests deploy VCHs directly that are not
cleaned up.
For vic-machine base it's because we render the VCH invalid by moving the
endpointVM in such a manner that the deletion fails without explicit
cleanup after.

* Revert "Increase event page size to 1000 (vmware#6937)"

This reverts commit 7796336 because it's
apparent that increasing the page size to this extent can cause hostd to
both hit its resource limits and drastically fragment its heap.
dougm added a commit that referenced this pull request Dec 21, 2017
An alternative to increasing the collector page size.  It will reduce the throughput to the event collector and hence reduce event misses.

See issues #6937 and #6998
hickeng added a commit that referenced this pull request Dec 21, 2017
* Fix leak of VCHs after test runs

The delete tests for vic-machine and vic-machine-service leak VCHs.
For the service it's because the tests deploy VCHs directly that are not
cleaned up.
For vic-machine base it's because we render the VCH invalid by moving the
endpointVM in such a manner that the deletion fails without explicit
cleanup after.

* Revert "Increase event page size to 1000 (#6937)"

This reverts commit 7796336 because it's
apparent that increasing the page size to this extent can cause hostd to
both hit its resource limits and drastically fragment its heap.
hickeng pushed a commit to hickeng/vic that referenced this pull request Dec 29, 2017
An alternative to increasing the collector page size.  It will reduce the throughput to the event collector and hence reduce event misses.

See issues vmware#6937 and vmware#6998
hickeng pushed a commit that referenced this pull request Dec 29, 2017
An alternative to increasing the collector page size.  It will reduce the throughput to the event collector and hence reduce event misses.

See issues #6937 and #6998
Development

Successfully merging this pull request may close these issues.

vSphere SDK endpoint becomes unresponsive/eventually login fails completely