Use UserVmDao for listVirtualMachines API to increase performance #8012

mlsorensen · 2023-09-27T19:12:42Z

Description

This PR addresses an issue found while scale testing CloudStack's listVirtualMachines API. As the VM record count grows, the list APIs take progressively more time, even when paging to a smaller result set. With as few as 50k records the API starts to become noticeably slow, and by 100k records it becomes almost impossible to page through the VMs in the system in any reasonable amount of time. 100k records is not much for MySQL, so I began trying to find the cause.

It seems to mostly affect large result sets, but in some cases also affects queries with small result sets that require searches (like keyword search).

The UserVm view is joined to many other tables, and because these are left joins we get multiplication of rows. In order to page, the full potential result set needs to be generated, and then the relevant page in the result set is selected. The current code fetches the relevant VM IDs for the page from the view, then calls the view again to generate the API response for these VM IDs.

I was unable to find any meaningful way to increase performance of the UserVm view; maybe I missed something. However, I found that the UserVm table itself is much faster. As an alternative, I decided to modify the listVirtualMachines query to avoid the view and just join tables as necessary based on the input params. This works because the search to fetch the page only needs the IDs, so there is no need to search the view.
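To illustrate the idea of joining only what the input params require, here is a minimal sketch in plain Java. It is not the code in this PR and does not use SearchBuilder; the class name, the table and column names (`vm_table`, `nic_table`, `vm.name`), and the two example filters are all placeholders chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class VmIdQuerySketch {
    /**
     * Builds an ID-only page query. A null filter means "not supplied",
     * so its join is skipped entirely. Table and column names are
     * placeholders, not the real CloudStack schema.
     */
    static String buildIdQuery(Long networkId, String keyword, int pageSize, int offset) {
        StringBuilder sql = new StringBuilder("SELECT DISTINCT vm.id FROM vm_table vm");
        List<String> where = new ArrayList<>();
        if (networkId != null) {
            // Join the (hypothetical) NIC table only when filtering by network.
            sql.append(" INNER JOIN nic_table nic ON nic.vm_id = vm.id");
            where.add("nic.network_id = ?");
        }
        if (keyword != null) {
            // In this sketch the keyword only touches the base table, so no join is added.
            where.add("vm.name LIKE ?");
        }
        if (!where.isEmpty()) {
            sql.append(" WHERE ").append(String.join(" AND ", where));
        }
        sql.append(" ORDER BY vm.id LIMIT ").append(pageSize).append(" OFFSET ").append(offset);
        return sql.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildIdQuery(null, null, 20, 0));
        System.out.println(buildIdQuery(7L, "web", 20, 40));
    }
}
```

With no filters the query stays on the base table alone; each supplied filter adds only the join it needs. The IDs returned for the page can then be used to build the full API response.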

The result is anywhere from a 2x to 8x improvement in list call times.

One aspect of this PR is a change to how joins are processed. I added unit tests to ensure the existing behavior was unmodified, but I needed to add the ability to join the same table twice and generate an SQL table alias to do so. This was relevant for the performance of SSH key search: with older VMs the only way to perform this search is via the value of user VM details, as there is no direct map from VM to SSH key. To optimize this search a bit I added an additional join to narrow the scope by account, but to do so I needed to join the same table twice to other joined tables.
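As a rough illustration of why an alias is needed when the same table is joined twice, the sketch below generates a distinct alias for each repeated join. The naming scheme (`account`, then `account1`) and the class and method names are assumptions made for this example; the alias format generated by the actual change may differ.

```java
import java.util.HashMap;
import java.util.Map;

class JoinAliasSketch {
    // How many times each table has been joined so far.
    private final Map<String, Integer> uses = new HashMap<>();

    /** First join of a table keeps its name; repeats get name1, name2, ... */
    public String aliasFor(String table) {
        int n = uses.merge(table, 1, Integer::sum);
        return n == 1 ? table : table + (n - 1);
    }

    /** Renders one join clause, adding an alias only when the table repeats. */
    public String join(String table, String column, String otherSide) {
        String alias = aliasFor(table);
        String target = alias.equals(table) ? table : table + " " + alias;
        return " INNER JOIN " + target + " ON " + alias + "." + column + " = " + otherSide;
    }

    public static void main(String[] args) {
        JoinAliasSketch joins = new JoinAliasSketch();
        // Hypothetical column names, for illustration only.
        System.out.println(joins.join("account", "id", "vm.account_id"));
        System.out.println(joins.join("account", "id", "ssh_keypairs.account_id"));
    }
}
```

Without the alias, the second join would reference an ambiguous table name and the database would reject the statement.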

OK, this code is ugly. I don't think it's much worse than the existing code was, but it's a big method with a lot of string binding per the way SearchBuilder/SearchCriteria work. If there's a much better way to refactor this, I'm all ears, but without CloudStack having some sort of benchmark suite it's a significant effort to test and re-test this. I'd be happy to have some help.

Types of changes

Breaking change (fix or feature that would cause existing functionality to change)
New feature (non-breaking change which adds functionality)
Bug fix (non-breaking change which fixes an issue)
Enhancement (improves an existing feature and functionality)
Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

Major
Minor

Bug Severity

Screenshots (if appropriate):

How Has This Been Tested?

This is probably the most used API, so I'm very cautious about making a change here where I'm not super familiar with the code. There are a number of tests that cover listing virtual machines, but I needed an explicit comparison of old vs new.

In testing I focused on regressions, assuming the existing state is correct. If there is a bug in what listVirtualMachines returns, it is replicated.

I created a scale environment with two management servers and a separate database server. One management server was patched with this PR, and the other was left unpatched. I then ran through varieties of listVirtualMachines calls, compared the output to see whether the patched and unpatched results were a perfect match, and collected timing.
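The comparison described above can be sketched as follows. This is not the harness used for the testing; it assumes the VM IDs from each server's response have already been extracted into lists, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class ListVmComparisonSketch {
    /**
     * Returns -1 when both ID lists match exactly (same IDs, same order),
     * otherwise the index of the first difference.
     */
    static int firstMismatch(List<String> unpatched, List<String> patched) {
        int common = Math.min(unpatched.size(), patched.size());
        for (int i = 0; i < common; i++) {
            if (!unpatched.get(i).equals(patched.get(i))) {
                return i;
            }
        }
        // Equal prefixes: lists differ only if one is longer than the other.
        return unpatched.size() == patched.size() ? -1 : common;
    }

    /** Wall-clock duration of one call, in milliseconds. */
    static long timeMillis(Runnable call) {
        long start = System.nanoTime();
        call.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Placeholder IDs standing in for the two servers' responses.
        List<String> unpatched = Arrays.asList("vm-1", "vm-2", "vm-3");
        List<String> patched = Arrays.asList("vm-1", "vm-2", "vm-3");
        System.out.println("first mismatch index: " + firstMismatch(unpatched, patched));
    }
}
```

Comparing in order matters here because paging relies on both servers returning the same IDs in the same positions.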

I tested some combo queries (two or more parameters) to see if multiple joins were behaving properly.

I also tested with basic user API keys, mostly focusing on exercising that they don't see anything they shouldn't, or nothing has changed. The result sets are smaller, but there is still a noticeable performance boost.

GutoVeronezi · 2023-09-27T19:22:23Z

Great initiative @mlsorensen

codecov · 2023-09-27T19:26:52Z

Codecov Report

Merging #8012 (78d18cf) into main (543c54c) will increase coverage by 1.03%.
Report is 3 commits behind head on main.
The diff coverage is 42.02%.

@@             Coverage Diff              @@
##               main    #8012      +/-   ##
============================================
+ Coverage     28.15%   29.19%   +1.03%     
- Complexity    29181    30638    +1457     
============================================
  Files          5111     5111              
  Lines        360669   360740      +71     
  Branches      52700    52719      +19     
============================================
+ Hits         101562   105322    +3760     
+ Misses       245113   240950    -4163     
- Partials      13994    14468     +474

Flag	Coverage Δ
simulator-marvin-tests	`25.16% <41.71%> (+1.30%)`	⬆️
uitests	`4.79% <ø> (ø)`
unit-tests	`14.51% <1.63%> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown.

Files	Coverage Δ
...c/main/java/com/cloud/utils/db/GenericDaoBase.java	`56.12% <57.14%> (-1.33%)`	⬇️
...ain/java/com/cloud/api/query/QueryManagerImpl.java	`44.58% <40.98%> (+0.32%)`	⬆️

... and 275 files with indirect coverage changes

harikrishna-patnala · 2023-09-28T04:51:42Z

@blueorangutan package

blueorangutan · 2023-09-28T04:52:03Z

@harikrishna-patnala a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan · 2023-09-28T05:55:02Z

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7138

yadvr · 2023-09-28T06:39:53Z

@blueorangutan test matrix

blueorangutan · 2023-09-28T06:42:06Z

@rohityadavcloud a [SF] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

DaanHoogland

Code looks good. My only functional consideration: are there any joined fields that searches run on? I suppose you would have encountered those during your refactor, though.

DaanHoogland · 2023-09-28T07:49:20Z

server/src/main/java/com/cloud/api/query/QueryManagerImpl.java

+        return new Pair<>(vms, count);
+    }
+
+    private Pair<List<Long>, Integer> searchForUserVMIdsAndCount(ListVMsCmd cmd) {


This is a 400-line method and I would like to reduce its complexity a bit, but given the gain from the change, I'd say 'not now'.

mlsorensen · 2023-10-03

Yes, it is ugly; it was beforehand as well.

I struggle with how to restructure it. We could call out to separate methods to handle each param, perhaps, but most of these are just simple one-line if statements. There are just a lot of params to set up.

I'd love to have the SearchBuilder and matching SearchCriteria adjacent, but I think we have to set up the SearchBuilder completely first and then add the criteria.

yadvr

LGTM - didn't test it though. The changes are too tangled to code review thoroughly; instead I would look at smoke tests and some manual QA if required. Great to see this PR!

blueorangutan · 2023-09-28T22:12:26Z

[SF] Trillian test result (tid-7747)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 54385 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8012-t7747-xenserver-71.zip
Smoke tests completed. 111 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_01_invalid_upgrade_kubernetes_cluster	`Failure`	3606.39	test_kubernetes_clusters.py
test_02_upgrade_kubernetes_cluster	`Failure`	3608.55	test_kubernetes_clusters.py
test_03_deploy_and_scale_kubernetes_cluster	`Failure`	0.04	test_kubernetes_clusters.py
test_04_autoscale_kubernetes_cluster	`Failure`	0.03	test_kubernetes_clusters.py
test_05_basic_lifecycle_kubernetes_cluster	`Failure`	0.04	test_kubernetes_clusters.py
test_06_delete_kubernetes_cluster	`Failure`	0.03	test_kubernetes_clusters.py
test_07_deploy_kubernetes_ha_cluster	`Failure`	0.03	test_kubernetes_clusters.py
test_08_upgrade_kubernetes_ha_cluster	`Failure`	0.04	test_kubernetes_clusters.py
test_09_delete_kubernetes_ha_cluster	`Failure`	0.04	test_kubernetes_clusters.py
test_10_vpc_tier_kubernetes_cluster	`Failure`	50.87	test_kubernetes_clusters.py
test_11_test_unmanaged_cluster_lifecycle	`Error`	1.23	test_kubernetes_clusters.py
ContextSuite context=TestKubernetesCluster>:teardown	`Error`	87.22	test_kubernetes_clusters.py
test_01_scale_up_verify	`Failure`	35.04	test_vm_autoscaling.py
test_02_update_vmprofile_and_vmgroup	`Error`	2.19	test_vm_autoscaling.py
test_03_scale_down_verify	`Error`	1.07	test_vm_autoscaling.py
test_04_stop_remove_vm_in_vmgroup	`Failure`	0.02	test_vm_autoscaling.py
test_06_autoscaling_vmgroup_on_project_network	`Failure`	45.12	test_vm_autoscaling.py
test_06_autoscaling_vmgroup_on_project_network	`Error`	45.12	test_vm_autoscaling.py
test_07_autoscaling_vmgroup_on_vpc_network	`Failure`	83.27	test_vm_autoscaling.py
test_07_autoscaling_vmgroup_on_vpc_network	`Error`	83.28	test_vm_autoscaling.py
ContextSuite context=TestVmAutoScaling>:teardown	`Error`	106.41	test_vm_autoscaling.py

harikrishna-patnala

Code LGTM, and that is a pretty detailed report. Thanks @mlsorensen

yadvr · 2023-09-29T06:48:33Z

@blueorangutan package

blueorangutan · 2023-09-29T06:50:03Z

@rohityadavcloud a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan · 2023-09-29T07:52:36Z

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7167

DaanHoogland · 2023-09-29T09:03:00Z

Good reporting @mlsorensen; my un-nuanced conclusion from your results is that there is a small penalty when searching for some explicit fields (list by host/network/state), and a great gain when listing all or when searching by keyword. Put like that it seems a no-brainer.

mlsorensen · 2023-10-03T00:05:46Z

Should I be concerned about the failures in autoscaling and kubernetes, or is this environment? Will try to find time to review these if nobody knows of a current issue.

mlsorensen · 2023-10-03T00:07:11Z

> LGTM - didn't test it though. The changes are too tangled to code review thoroughly; instead I would look at smoke tests and some manual QA if required. Great to see this PR!

I agree. My hope is that by targeting main, if some edge case is found later, there is time to address it in the improved version via small changes. I tried to be thorough, but there are just so many ways to use this call.

DaanHoogland · 2023-10-03T07:53:03Z

> Should I be concerned about the failures in autoscaling and kubernetes, or is this environment? Will try to find time to review these if nobody knows of a current issue.

The kubernetes cluster errors are known. The autoscaling failures are intermittent and environmental. I would like to see them pass...

DaanHoogland · 2023-10-03T07:53:12Z

@blueorangutan test matrix

blueorangutan · 2023-10-03T07:54:03Z

@DaanHoogland a [SF] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

blueorangutan · 2023-10-03T20:47:49Z

[SF] Trillian test result (tid-7814)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 44929 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8012-t7814-kvm-centos7.zip
Smoke tests completed. 111 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_02_upgrade_kubernetes_cluster	`Failure`	583.52	test_kubernetes_clusters.py
test_08_upgrade_kubernetes_ha_cluster	`Failure`	633.76	test_kubernetes_clusters.py
test_01_scale_up_verify	`Failure`	35.06	test_vm_autoscaling.py
test_02_update_vmprofile_and_vmgroup	`Error`	2.19	test_vm_autoscaling.py
test_03_scale_down_verify	`Error`	1.07	test_vm_autoscaling.py
test_04_stop_remove_vm_in_vmgroup	`Failure`	0.02	test_vm_autoscaling.py
test_06_autoscaling_vmgroup_on_project_network	`Failure`	46.10	test_vm_autoscaling.py
test_06_autoscaling_vmgroup_on_project_network	`Error`	46.11	test_vm_autoscaling.py
test_07_autoscaling_vmgroup_on_vpc_network	`Failure`	91.36	test_vm_autoscaling.py
test_07_autoscaling_vmgroup_on_vpc_network	`Error`	91.37	test_vm_autoscaling.py
ContextSuite context=TestVmAutoScaling>:teardown	`Error`	118.56	test_vm_autoscaling.py

blueorangutan · 2023-10-04T05:02:38Z

[SF] Trillian test result (tid-7813)
Environment: vmware-67u3 (x2), Advanced Networking with Mgmt server r8
Total time taken: 74618 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8012-t7813-vmware-67u3.zip
Smoke tests completed. 108 look OK, 5 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_list_vms_metrics_admin	`Error`	3622.45	test_metrics_api.py
test_list_vms_metrics_history	`Error`	4.69	test_metrics_api.py
test_list_volumes_metrics_history	`Error`	3620.86	test_metrics_api.py
test_01_scale_up_verify	`Failure`	35.04	test_vm_autoscaling.py
test_02_update_vmprofile_and_vmgroup	`Error`	2.29	test_vm_autoscaling.py
test_03_scale_down_verify	`Error`	1.10	test_vm_autoscaling.py
test_04_stop_remove_vm_in_vmgroup	`Failure`	0.03	test_vm_autoscaling.py
test_06_autoscaling_vmgroup_on_project_network	`Failure`	46.64	test_vm_autoscaling.py
test_06_autoscaling_vmgroup_on_project_network	`Error`	46.64	test_vm_autoscaling.py
test_07_autoscaling_vmgroup_on_vpc_network	`Failure`	100.51	test_vm_autoscaling.py
test_07_autoscaling_vmgroup_on_vpc_network	`Error`	100.52	test_vm_autoscaling.py
ContextSuite context=TestVmAutoScaling>:teardown	`Error`	148.60	test_vm_autoscaling.py
test_01_deploy_vm_on_specific_host	`Error`	3602.12	test_vm_deployment_planner.py
test_02_deploy_vm_on_specific_cluster	`Error`	4.41	test_vm_deployment_planner.py
test_03_deploy_vm_on_specific_pod	`Error`	4.45	test_vm_deployment_planner.py
test_04_deploy_vm_on_host_override_pod_and_cluster	`Error`	2.41	test_vm_deployment_planner.py
test_05_deploy_vm_on_cluster_override_pod	`Error`	2.38	test_vm_deployment_planner.py
test_09_expunge_vm	`Failure`	424.70	test_vm_life_cycle.py
test_01_vpc_site2site_vpn	`Error`	339.88	test_vpc_vpn.py

DaanHoogland · 2023-10-04T07:53:38Z

@blueorangutan LLtest matrix

blueorangutan · 2023-10-20T10:46:12Z

[SF] Trillian test result (tid-8027)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 41832 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8012-t8027-kvm-centos7.zip
Smoke tests completed. 111 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_03_deploy_vm_wrong_checksum	`Error`	40.66	test_templates.py
test_09_list_templates_download_details	`Failure`	0.05	test_templates.py
test_05_vmschedule_test_e2e	`Failure`	361.86	test_vm_schedule.py

shwstppr

Code LGTM

thanks @mlsorensen for catching the VM listing in the two clusters case.

github-actions · 2023-10-23T07:08:34Z

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

shwstppr · 2023-10-23T16:45:19Z

@blueorangutan package

blueorangutan · 2023-10-23T16:52:03Z

@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan · 2023-10-23T17:17:21Z

Packaging result [SF]: ✖️ el7 ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 7474

mlsorensen · 2023-10-23T17:50:19Z

@blueorangutan package

blueorangutan · 2023-10-23T17:52:03Z

@mlsorensen a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

vishesh92 · 2023-10-23T19:51:57Z

@blueorangutan package

blueorangutan · 2023-10-23T19:54:03Z

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan · 2023-10-23T20:52:53Z

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7477

shwstppr · 2023-10-24T04:14:27Z

@blueorangutan test

blueorangutan · 2023-10-24T04:16:05Z

@shwstppr a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

blueorangutan · 2023-10-24T16:33:58Z

[SF] Trillian test result (tid-8060)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 42837 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8012-t8060-kvm-centos7.zip
Smoke tests completed. 112 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_02_upgrade_kubernetes_cluster	`Failure`	541.20	test_kubernetes_clusters.py

yadvr · 2023-10-25T06:26:22Z

Rekicking smoketests by closing/reopening PR

yadvr · 2023-10-25T06:26:43Z

@blueorangutan test alma8 vmware-70u3

blueorangutan · 2023-10-25T06:28:03Z

@rohityadavcloud a [SL] Trillian-Jenkins test job (alma8 mgmt + vmware-70u3) has been kicked to run smoke tests

blueorangutan · 2023-10-25T20:48:54Z

[SF] Trillian test result (tid-8077)
Environment: vmware-70u3 (x2), Advanced Networking with Mgmt server a8
Total time taken: 49854 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8012-t8077-vmware-70u3.zip
Smoke tests completed. 113 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File

DaanHoogland · 2023-10-26T06:43:47Z

UI build failure is due to codecov upload. No further tests are needed IMNSHO.

boring-cyborg bot added the component:database label Sep 27, 2023

GutoVeronezi self-requested a review September 27, 2023 19:15

yadvr requested review from DaanHoogland and harikrishna-patnala September 28, 2023 06:39

yadvr added this to the 4.19.0.0 milestone Sep 28, 2023

yadvr requested a review from shwstppr September 28, 2023 06:39

DaanHoogland approved these changes Sep 28, 2023

yadvr approved these changes Sep 28, 2023

yadvr added the component:api label Sep 28, 2023

harikrishna-patnala approved these changes Sep 29, 2023

yadvr assigned kiranchavala and vishesh92 Oct 20, 2023

shwstppr approved these changes Oct 20, 2023

github-actions bot added the status:has-conflicts label Oct 23, 2023

Merge branch 'main' into main-listvms-performance

88f4a7f

github-actions bot removed the status:has-conflicts label Oct 23, 2023

Fix merge conflicts from new snapshot commit

78d18cf

yadvr closed this Oct 25, 2023

yadvr reopened this Oct 25, 2023

DaanHoogland merged commit 4ff592a into apache:main Oct 26, 2023

vishesh92 mentioned this pull request Dec 6, 2023

Use join instead of views #8321

Merged

13 tasks

weizhouapache mentioned this pull request Aug 5, 2024

New feature: Dynamic and Static Routing #9470

Merged

14 tasks
