@mlsorensen
Contributor

Description

This PR addresses an issue found while scale testing CloudStack's listVirtualMachines API. As the VM record count grows, the list APIs take progressively more time, even when paging to a smaller result set. With as few as 50k records the API becomes noticeably slow, and by 100k records it becomes almost impossible to page through the VMs in the system in any reasonable amount of time. 100k records is not much for MySQL, so I began trying to find the cause.

Screenshot 2023-09-27 at 12 45 55 PM

It seems to mostly affect large result sets, but in some cases also affects queries with small result sets that require searches (like keyword search).

The UserVm view joins many other tables using left joins, so rows get multiplied. In order to page, the full potential result set needs to be generated, and then the relevant page in the result set is selected. The current code fetches the relevant VM IDs for the page from the view, then calls the view again to generate the API response for these VM IDs.
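
To make the paging problem concrete, here is a minimal in-memory sketch of the two-phase flow described above. It is not CloudStack code: `ViewRow`, `pageOfIds`, and `rowsForIds` are hypothetical names, and a plain list stands in for the view. It only illustrates that the multiplied rows have to be collapsed to distinct IDs before a page can be cut out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class TwoPhasePagingSketch {

    /** One row of a hypothetical left-joined view: the VM id repeats once per joined child row. */
    static final class ViewRow {
        final long vmId;
        final String joinedValue;

        ViewRow(long vmId, String joinedValue) {
            this.vmId = vmId;
            this.joinedValue = joinedValue;
        }
    }

    /** Phase 1: collapse the multiplied rows to distinct VM ids, then slice out one page. */
    static List<Long> pageOfIds(List<ViewRow> rows, int offset, int pageSize) {
        Set<Long> distinct = new LinkedHashSet<>();
        for (ViewRow row : rows) {
            distinct.add(row.vmId);
        }
        List<Long> ids = new ArrayList<>(distinct);
        int from = Math.min(offset, ids.size());
        int to = Math.min(from + pageSize, ids.size());
        return new ArrayList<>(ids.subList(from, to));
    }

    /** Phase 2: load the full rows, but only for the ids on the requested page. */
    static List<ViewRow> rowsForIds(List<ViewRow> rows, List<Long> pageIds) {
        List<ViewRow> result = new ArrayList<>();
        for (ViewRow row : rows) {
            if (pageIds.contains(row.vmId)) {
                result.add(row);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // VMs 1 and 3 each appear twice because of a joined child row (illustrative data).
        List<ViewRow> rows = Arrays.asList(
                new ViewRow(1, "nic-a"), new ViewRow(1, "nic-b"),
                new ViewRow(2, "nic-c"),
                new ViewRow(3, "nic-d"), new ViewRow(3, "nic-e"));
        List<Long> page = pageOfIds(rows, 1, 2);
        System.out.println(page + " -> " + rowsForIds(rows, page).size() + " view rows");
    }
}
```

In this toy version phase 1 still walks every multiplied row, which is the cost that grows with the record count.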

I was unable to find any meaningful way to increase the performance of the UserVm view, though maybe I missed something. However, I found that the UserVm table itself is much faster. As an alternative, I modified the listVirtualMachines query to avoid the view and join tables only as necessary based on the input params. This works because the search that fetches the page only needs the IDs, so there is no need to search the view.
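
As a rough illustration of "join only as necessary", the sketch below builds an ID-only query and adds a join only when the matching filter is present. It is a simplified stand-in, not the SearchBuilder code in this PR: the class, the method, and the table and column names (`vm_table`, `nic_table`, and so on) are placeholders rather than CloudStack's actual schema.

```java
import java.util.ArrayList;
import java.util.List;

final class ConditionalJoinSketch {

    /**
     * Builds an ID-only paging query. Joins are appended only for filters that are set,
     * so the common "list everything" case touches a single table.
     * Values are left as '?' placeholders; binding them is out of scope for this sketch.
     */
    static String buildIdQuery(Long networkId, Long hostId, int offset, int pageSize) {
        StringBuilder sql = new StringBuilder("SELECT DISTINCT vm.id FROM vm_table vm");
        List<String> where = new ArrayList<>();

        if (networkId != null) {
            // This filter lives on another table, so it is the only case that needs a join.
            sql.append(" JOIN nic_table nic ON nic.vm_id = vm.id");
            where.add("nic.network_id = ?");
        }
        if (hostId != null) {
            // This filter is a column on the base table, so no join is required.
            where.add("vm.host_id = ?");
        }
        if (!where.isEmpty()) {
            sql.append(" WHERE ").append(String.join(" AND ", where));
        }
        sql.append(" ORDER BY vm.id LIMIT ").append(pageSize).append(" OFFSET ").append(offset);
        return sql.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildIdQuery(null, null, 0, 20));
        System.out.println(buildIdQuery(5L, null, 0, 20));
    }
}
```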

The result is anywhere from a 2x to 8x improvement in list call times.

One aspect of this PR is a change to how joins are processed. I added unit tests to ensure the existing behavior is unmodified, but I needed to add the ability to join the same table twice and generate an SQL table alias to do so. This was relevant for the performance of SSH key search: with older VMs the only way to perform this search is via the value of user VM details, as there is no direct map from VM to SSH key. To optimize this search a bit I added an additional join to narrow the scope by account, but doing so required joining the same table twice against other joined tables.
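
The aliasing idea can be sketched roughly as follows. This is not the GenericDaoBase change itself: `JoinAliasSketch`, `nextAlias`, and `join` are hypothetical, and the suffix scheme and table names are assumptions chosen for illustration. The first join of a table keeps its plain name, and any repeat gets a generated alias.

```java
import java.util.HashMap;
import java.util.Map;

final class JoinAliasSketch {

    // How many times each table has been joined so far.
    private final Map<String, Integer> useCount = new HashMap<>();

    /** The first join of a table keeps its plain name; later joins get a numeric suffix. */
    String nextAlias(String table) {
        int n = useCount.merge(table, 1, Integer::sum);
        return n == 1 ? table : table + "_" + n;
    }

    /** Renders one JOIN clause, giving the joined table a fresh alias. */
    String join(String table, String column, String otherSide) {
        String alias = nextAlias(table);
        return "JOIN " + table + " " + alias + " ON " + alias + "." + column + " = " + otherSide;
    }

    public static void main(String[] args) {
        JoinAliasSketch joins = new JoinAliasSketch();
        // The same (placeholder) table is joined twice, against two different tables.
        System.out.println(joins.join("details_table", "vm_id", "vm.id"));
        System.out.println(joins.join("details_table", "account_id", "acct.id"));
    }
}
```

Keeping the unsuffixed name for the first join means queries that never join a table twice render the same alias as before, which is one way to leave existing behavior untouched.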

OK, this code is ugly. I don't think it's much worse than the existing code was, but it's a big method with a lot of string binding per the way SearchBuilder/SearchCriteria work. If there's a much better way to refactor this, I'm all ears, but without CloudStack having some sort of benchmark suite it's a significant effort to test and re-test this. I'd be happy to have some help.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

This is probably the most used API, so I'm very cautious about making a change here when I'm not super familiar with the code. There are a number of tests that cover listing virtual machines, but I needed an explicit comparison of old vs. new.

In testing I focused on regressions, assuming the existing state is correct. If there is a bug in what listVirtualMachines returns, it is replicated.

I created a scale environment with two management servers and a separate database server. One mgmt server was patched with this PR, and the other was left unpatched. I then ran through a variety of listVirtualMachines calls, compared the result output to see whether it was a perfect match between patched and unpatched, and collected timings.
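
For anyone wanting to repeat this kind of comparison, a harness along these lines is one option. It is only a sketch and not the tooling used for the numbers in this PR: `ListCallComparisonSketch`, `Outcome`, and `compare` are hypothetical names, and the two suppliers stand in for whatever client code issues the same list call against the unpatched and patched servers.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

final class ListCallComparisonSketch {

    /** Result of running the same call against both servers. */
    static final class Outcome {
        final boolean identical;
        final long baselineNanos;
        final long candidateNanos;

        Outcome(boolean identical, long baselineNanos, long candidateNanos) {
            this.identical = identical;
            this.baselineNanos = baselineNanos;
            this.candidateNanos = candidateNanos;
        }
    }

    /** Runs both calls once, times each, and checks that the results match element for element. */
    static <T> Outcome compare(Supplier<List<T>> baseline, Supplier<List<T>> candidate) {
        long t0 = System.nanoTime();
        List<T> expected = baseline.get();
        long t1 = System.nanoTime();
        List<T> actual = candidate.get();
        long t2 = System.nanoTime();
        return new Outcome(expected.equals(actual), t1 - t0, t2 - t1);
    }

    public static void main(String[] args) {
        // Stand-ins for the two servers; a real run would call the API here.
        Outcome outcome = ListCallComparisonSketch.<String>compare(
                () -> Arrays.asList("vm-1", "vm-2"),
                () -> Arrays.asList("vm-1", "vm-2"));
        System.out.println("identical=" + outcome.identical);
    }
}
```

A single timed run per call is noisy, so repeating each call several times and comparing medians would give steadier numbers.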

I tested some combo queries (two or more parameters) to see if multiple joins were behaving properly.
Screenshot 2023-09-27 at 12 44 39 PM

Screenshot 2023-09-27 at 12 44 29 PM

I also tested with basic user API keys, mostly to verify that they don't see anything they shouldn't and that nothing has changed. The result sets are smaller, but there is still a noticeable performance boost.

Screenshot 2023-09-27 at 12 44 57 PM Screenshot 2023-09-27 at 12 44 48 PM

@GutoVeronezi
Contributor

Great initiative @mlsorensen

@codecov

codecov bot commented Sep 27, 2023

Codecov Report

Merging #8012 (78d18cf) into main (543c54c) will increase coverage by 1.03%.
Report is 3 commits behind head on main.
The diff coverage is 42.02%.

@@             Coverage Diff              @@
##               main    #8012      +/-   ##
============================================
+ Coverage     28.15%   29.19%   +1.03%     
- Complexity    29181    30638    +1457     
============================================
  Files          5111     5111              
  Lines        360669   360740      +71     
  Branches      52700    52719      +19     
============================================
+ Hits         101562   105322    +3760     
+ Misses       245113   240950    -4163     
- Partials      13994    14468     +474     
| Flag | Coverage Δ |
| --- | --- |
| simulator-marvin-tests | 25.16% <41.71%> (+1.30%) ⬆️ |
| uitests | 4.79% <ø> (ø) |
| unit-tests | 14.51% <1.63%> (-0.01%) ⬇️ |

| Files | Coverage Δ |
| --- | --- |
| ...c/main/java/com/cloud/utils/db/GenericDaoBase.java | 56.12% <57.14%> (-1.33%) ⬇️ |
| ...ain/java/com/cloud/api/query/QueryManagerImpl.java | 44.58% <40.98%> (+0.32%) ⬆️ |

... and 275 files with indirect coverage changes

@harikrishna-patnala
Contributor

@blueorangutan package

@blueorangutan

@harikrishna-patnala a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7138

@yadvr yadvr added this to the 4.19.0.0 milestone Sep 28, 2023
@yadvr yadvr requested a review from shwstppr September 28, 2023 06:39
@yadvr
Member

yadvr commented Sep 28, 2023

@blueorangutan test matrix

@blueorangutan

@rohityadavcloud a [SF] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

Contributor

@DaanHoogland DaanHoogland left a comment

Code looks good. The only functional consideration: are there any joined fields that searches are performed on? I suppose you would have encountered those during your refactor, though.

return new Pair<>(vms, count);
}

private Pair<List<Long>, Integer> searchForUserVMIdsAndCount(ListVMsCmd cmd) {
Contributor

this is a 400 line method and I would like to reduce its complexity a bit, but given the gain from the change, I'd say 'not now'.

Contributor Author

Yes it is ugly, it was beforehand as well.

I struggle with how to restructure it. We could perhaps call out to separate methods to handle each param, but most of these are just simple one-line IF statements. There are just a lot of params to set up.

I'd love to have the SearchBuilder and matching SearchCriteria adjacent, but I think we have to set up the SB completely first and then add the criteria.

Member

@yadvr yadvr left a comment

LGTM - didn't test it though. The changes are too tangled to code review thoroughly; instead I would rely on smoke tests and some manual QA if required. Great to see this PR!

@blueorangutan

[SF] Trillian test result (tid-7747)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 54385 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8012-t7747-xenserver-71.zip
Smoke tests completed. 111 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| test_01_invalid_upgrade_kubernetes_cluster | Failure | 3606.39 | test_kubernetes_clusters.py |
| test_02_upgrade_kubernetes_cluster | Failure | 3608.55 | test_kubernetes_clusters.py |
| test_03_deploy_and_scale_kubernetes_cluster | Failure | 0.04 | test_kubernetes_clusters.py |
| test_04_autoscale_kubernetes_cluster | Failure | 0.03 | test_kubernetes_clusters.py |
| test_05_basic_lifecycle_kubernetes_cluster | Failure | 0.04 | test_kubernetes_clusters.py |
| test_06_delete_kubernetes_cluster | Failure | 0.03 | test_kubernetes_clusters.py |
| test_07_deploy_kubernetes_ha_cluster | Failure | 0.03 | test_kubernetes_clusters.py |
| test_08_upgrade_kubernetes_ha_cluster | Failure | 0.04 | test_kubernetes_clusters.py |
| test_09_delete_kubernetes_ha_cluster | Failure | 0.04 | test_kubernetes_clusters.py |
| test_10_vpc_tier_kubernetes_cluster | Failure | 50.87 | test_kubernetes_clusters.py |
| test_11_test_unmanaged_cluster_lifecycle | Error | 1.23 | test_kubernetes_clusters.py |
| ContextSuite context=TestKubernetesCluster>:teardown | Error | 87.22 | test_kubernetes_clusters.py |
| test_01_scale_up_verify | Failure | 35.04 | test_vm_autoscaling.py |
| test_02_update_vmprofile_and_vmgroup | Error | 2.19 | test_vm_autoscaling.py |
| test_03_scale_down_verify | Error | 1.07 | test_vm_autoscaling.py |
| test_04_stop_remove_vm_in_vmgroup | Failure | 0.02 | test_vm_autoscaling.py |
| test_06_autoscaling_vmgroup_on_project_network | Failure | 45.12 | test_vm_autoscaling.py |
| test_06_autoscaling_vmgroup_on_project_network | Error | 45.12 | test_vm_autoscaling.py |
| test_07_autoscaling_vmgroup_on_vpc_network | Failure | 83.27 | test_vm_autoscaling.py |
| test_07_autoscaling_vmgroup_on_vpc_network | Error | 83.28 | test_vm_autoscaling.py |
| ContextSuite context=TestVmAutoScaling>:teardown | Error | 106.41 | test_vm_autoscaling.py |

Contributor

@harikrishna-patnala harikrishna-patnala left a comment

Code LGTM, and that is a pretty detailed report. Thanks @mlsorensen

@yadvr
Member

yadvr commented Sep 29, 2023

@blueorangutan package

@blueorangutan

@rohityadavcloud a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7167

@DaanHoogland
Contributor

Good reporting @mlsorensen; my un-nuanced conclusion from your results is that there is a small penalty when searching on some explicit fields (list by host/network/state), and a great gain when listing all or searching by keyword. Put like that, it seems a no-brainer.

@mlsorensen
Contributor Author

Should I be concerned about the failures in autoscaling and kubernetes, or is this environment? Will try to find time to review these if nobody knows of a current issue.

@mlsorensen
Contributor Author

LGTM - didn't test it though. The changes are too tangled to code review thoroughly; instead I would rely on smoke tests and some manual QA if required. Great to see this PR!

I agree. I am hoping that by targeting main, if some edge case is found later, there is time to address it in the improved version via small changes. I tried to be thorough, but there are just so many ways to use this call.

@DaanHoogland
Contributor

Should I be concerned about the failures in autoscaling and kubernetes, or is this environment? Will try to find time to review these if nobody knows of a current issue.

The Kubernetes cluster errors are known. The autoscaling ones are intermittent and environmental. I'd like to see them pass...

@DaanHoogland
Contributor

@blueorangutan test matrix

@blueorangutan

@DaanHoogland a [SF] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-7814)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 44929 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8012-t7814-kvm-centos7.zip
Smoke tests completed. 111 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| test_02_upgrade_kubernetes_cluster | Failure | 583.52 | test_kubernetes_clusters.py |
| test_08_upgrade_kubernetes_ha_cluster | Failure | 633.76 | test_kubernetes_clusters.py |
| test_01_scale_up_verify | Failure | 35.06 | test_vm_autoscaling.py |
| test_02_update_vmprofile_and_vmgroup | Error | 2.19 | test_vm_autoscaling.py |
| test_03_scale_down_verify | Error | 1.07 | test_vm_autoscaling.py |
| test_04_stop_remove_vm_in_vmgroup | Failure | 0.02 | test_vm_autoscaling.py |
| test_06_autoscaling_vmgroup_on_project_network | Failure | 46.10 | test_vm_autoscaling.py |
| test_06_autoscaling_vmgroup_on_project_network | Error | 46.11 | test_vm_autoscaling.py |
| test_07_autoscaling_vmgroup_on_vpc_network | Failure | 91.36 | test_vm_autoscaling.py |
| test_07_autoscaling_vmgroup_on_vpc_network | Error | 91.37 | test_vm_autoscaling.py |
| ContextSuite context=TestVmAutoScaling>:teardown | Error | 118.56 | test_vm_autoscaling.py |

@blueorangutan

[SF] Trillian test result (tid-7813)
Environment: vmware-67u3 (x2), Advanced Networking with Mgmt server r8
Total time taken: 74618 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8012-t7813-vmware-67u3.zip
Smoke tests completed. 108 look OK, 5 have errors, 0 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| test_list_vms_metrics_admin | Error | 3622.45 | test_metrics_api.py |
| test_list_vms_metrics_history | Error | 4.69 | test_metrics_api.py |
| test_list_volumes_metrics_history | Error | 3620.86 | test_metrics_api.py |
| test_01_scale_up_verify | Failure | 35.04 | test_vm_autoscaling.py |
| test_02_update_vmprofile_and_vmgroup | Error | 2.29 | test_vm_autoscaling.py |
| test_03_scale_down_verify | Error | 1.10 | test_vm_autoscaling.py |
| test_04_stop_remove_vm_in_vmgroup | Failure | 0.03 | test_vm_autoscaling.py |
| test_06_autoscaling_vmgroup_on_project_network | Failure | 46.64 | test_vm_autoscaling.py |
| test_06_autoscaling_vmgroup_on_project_network | Error | 46.64 | test_vm_autoscaling.py |
| test_07_autoscaling_vmgroup_on_vpc_network | Failure | 100.51 | test_vm_autoscaling.py |
| test_07_autoscaling_vmgroup_on_vpc_network | Error | 100.52 | test_vm_autoscaling.py |
| ContextSuite context=TestVmAutoScaling>:teardown | Error | 148.60 | test_vm_autoscaling.py |
| test_01_deploy_vm_on_specific_host | Error | 3602.12 | test_vm_deployment_planner.py |
| test_02_deploy_vm_on_specific_cluster | Error | 4.41 | test_vm_deployment_planner.py |
| test_03_deploy_vm_on_specific_pod | Error | 4.45 | test_vm_deployment_planner.py |
| test_04_deploy_vm_on_host_override_pod_and_cluster | Error | 2.41 | test_vm_deployment_planner.py |
| test_05_deploy_vm_on_cluster_override_pod | Error | 2.38 | test_vm_deployment_planner.py |
| test_09_expunge_vm | Failure | 424.70 | test_vm_life_cycle.py |
| test_01_vpc_site2site_vpn | Error | 339.88 | test_vpc_vpn.py |

@DaanHoogland
Contributor

@blueorangutan LLtest matrix

@blueorangutan

[SF] Trillian test result (tid-8027)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 41832 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8012-t8027-kvm-centos7.zip
Smoke tests completed. 111 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| test_03_deploy_vm_wrong_checksum | Error | 40.66 | test_templates.py |
| test_09_list_templates_download_details | Failure | 0.05 | test_templates.py |
| test_05_vmschedule_test_e2e | Failure | 361.86 | test_vm_schedule.py |

Contributor

@shwstppr shwstppr left a comment

Code LGTM

thanks @mlsorensen for catching the VM listing in the two clusters case.

@github-actions

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

@shwstppr
Contributor

@blueorangutan package

@blueorangutan

@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✖️ el7 ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 7474

@mlsorensen
Contributor Author

@blueorangutan package

@blueorangutan

@mlsorensen a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@vishesh92
Member

@blueorangutan package

@blueorangutan

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7477

@shwstppr
Contributor

@blueorangutan test

@blueorangutan

@shwstppr a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-8060)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 42837 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8012-t8060-kvm-centos7.zip
Smoke tests completed. 112 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| test_02_upgrade_kubernetes_cluster | Failure | 541.20 | test_kubernetes_clusters.py |

@yadvr
Member

yadvr commented Oct 25, 2023

Rekicking smoketests by closing/reopening PR

@yadvr yadvr closed this Oct 25, 2023
@yadvr yadvr reopened this Oct 25, 2023
@yadvr
Member

yadvr commented Oct 25, 2023

@blueorangutan test alma8 vmware-70u3

@blueorangutan

@rohityadavcloud a [SL] Trillian-Jenkins test job (alma8 mgmt + vmware-70u3) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-8077)
Environment: vmware-70u3 (x2), Advanced Networking with Mgmt server a8
Total time taken: 49854 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8012-t8077-vmware-70u3.zip
Smoke tests completed. 113 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@DaanHoogland
Contributor

UI build failure is due to codecov upload. No further tests are needed IMNSHO.

@DaanHoogland DaanHoogland merged commit 4ff592a into apache:main Oct 26, 2023
@vishesh92 vishesh92 mentioned this pull request Dec 6, 2023
13 tasks