-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Use UserVmDao for listVirtualMachines API to increase performance #8012
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use UserVmDao for listVirtualMachines API to increase performance #8012
Conversation
|
Great initiative @mlsorensen |
Codecov Report
@@ Coverage Diff @@
## main #8012 +/- ##
============================================
+ Coverage 28.15% 29.19% +1.03%
- Complexity 29181 30638 +1457
============================================
Files 5111 5111
Lines 360669 360740 +71
Branches 52700 52719 +19
============================================
+ Hits 101562 105322 +3760
+ Misses 245113 240950 -4163
- Partials 13994 14468 +474
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 275 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
|
@blueorangutan package |
|
@harikrishna-patnala a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7138 |
|
@blueorangutan test matrix |
|
@rohityadavcloud a [SF] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests |
DaanHoogland
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code looks good, only functional considderation is are there any joined fields that searches are on? I suppose you would have encountered those during your refactor, though.
| return new Pair<>(vms, count); | ||
| } | ||
|
|
||
| private Pair<List<Long>, Integer> searchForUserVMIdsAndCount(ListVMsCmd cmd) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a 400 line method and I would like to reduce its complexity a bit, but given the gain from the change, I'd say 'not now'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it is ugly, it was beforehand as well.
I struggle with how to restructure it, we could call out to separate methods to handle each param perhaps but most of these are just simple one liner IF statements. There are just a lot of params to set up.
I'd love to have the SearchBuilder and matching SearchCriteria adjacent but I think we have to set up the SB completely first and then add the criteria.
yadvr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - didn't test it though. Changes are quite tangled to be thoroughly code review them, instead I would look at smoketests and some manual QA if required. Great to see this PR !
|
[SF] Trillian test result (tid-7747)
|
harikrishna-patnala
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code LGTM and that is pretty detailed report. Thanks @mlsorensen
|
@blueorangutan package |
|
@rohityadavcloud a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7167 |
|
Good reporting @mlsorensen ; my un-nuanced conclusions from your results is that there is a small penalty when searching for some explicit fields (list by host/network/state), and a great gain listing all or when searching by keyword. Put like that it seems a no-brainer. |
|
Should I be concerned about the failures in autoscaling and kubernetes, or is this environment? Will try to find time to review these if nobody knows of a current issue. |
I agree. I am hoping that targeting main, if some edge case is found later there is time to address it in the improved version later via small changes. I tried to be thorough but there are just so many ways to use this call. |
The kubernetes cluster errors are known. The autoscaling are intermitted environmental. I like to see them pass... |
|
@blueorangutan test matrix |
|
@DaanHoogland a [SF] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests |
|
[SF] Trillian test result (tid-7814)
|
|
[SF] Trillian test result (tid-7813)
|
|
@blueorangutan LLtest matrix |
|
[SF] Trillian test result (tid-8027)
|
shwstppr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code LGTM
thanks @mlsorensen for catching the VM listing in the two clusters case.
|
This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch. |
|
@blueorangutan package |
|
@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✖️ el7 ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 7474 |
|
@blueorangutan package |
|
@mlsorensen a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
|
@blueorangutan package |
|
@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7477 |
|
@blueorangutan test |
|
@shwstppr a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests |
|
[SF] Trillian test result (tid-8060)
|
|
Rekicking smoketests by closing/reopening PR |
|
@blueorangutan test alma8 vmware-70u3 |
|
@rohityadavcloud a [SL] Trillian-Jenkins test job (alma8 mgmt + vmware-70u3) has been kicked to run smoke tests |
|
[SF] Trillian test result (tid-8077)
|
|
UI build failure is due to codecov upload. No further tests are needed IMNSHO. |
Description
This PR addresses an issue found while scale testing CloudStack's
listVirtualMachinesAPI. It was observed that as the VM record count grows, the list APIs take progressively more time, even using paging to a smaller result set. With as little as 50k records the API starts to become noticeably slow, and by 100k records it becomes almost impossible to page through the VMs in the system in any reasonable amount of time. 100k records is not much in terms of MySQL, so I began trying to find the cause.It seems to mostly affect large result sets, but in some cases also affects queries with small result sets that require searches (like keyword search).
The UserVm view is joined to many other tables, and left joined so we get multiplication of rows. In order to page, the full potential result set needs to be generated, and then the relevant page in the result set is selected. The current code fetches the relevant Vm IDs for the page from the view, then calls the view again to generate the API response for these VM IDs.
I was unable to find any meaningful way to increase performance of the UserVM view, maybe I missed something. However, I found that the UserVM table itself is much faster. As an alternative, I decided to modify the listVirtualMachines query to avoid the view and just join tables as necessary based on the input params. This works because the search to fetch the page only needs the IDs, so no need to search the view.
The result is anywhere from a 2x to 8x improvement in list call times.
One aspect of this PR is a change to how joins are processed. I added unit tests to ensure the existing behavior was unmodified, but I needed to add the ability to join the same table twice and generate an SQL table alias to do so. This was relevant for the performance of SSH key search, as with older VMs the only way to perform this search is via the value of user vm details, there is no direct map from VM to SSH key. To optimize this search a bit I added an additional join to narrow the scope by account, but to do so I needed to join the same table twice to other joined tables.
Ok, this code is ugly. I don't think it's much worse than the existing code was, but it's a big method with a lot of string binding per the way
SearchBuilder/SearchCriteriawork. If there's a much better way to refactor this, I'm all ears, but without cloudstack having some sort of benchmark suite it's a significant effort to test and re-test this. I'd be happy to have some help.Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Bug Severity
Screenshots (if appropriate):
How Has This Been Tested?
This is probably the most used API, so I'm very cautious about trying to make a change here where I'm not super familiar with the code. There are a number of tests that cover listing virtual machines, but I needed an explicit comparison to old vs new.
In testing I focused on regressions, assuming the existing state is correct. If there is a bug in what listVirtualMachines returns, it is replicated.
I created a scale environment with two management servers, and a separate database server. One mgmt server was patched with this PR, and the other is unpatched. I then ran through varieties of listVirtualMachines calls and compared the result output to see if they were a perfect match between the patched and unpatched, as well as collected timing.
I tested some combo queries (two or more parameters) to see if multiple joins were behaving properly.

I also tested with basic user API keys, mostly focusing on exercising that they don't see anything they shouldn't, or nothing has changed. The result sets are smaller, but there is still a noticeable performance boost.