@mlsorensen
Contributor

@mlsorensen mlsorensen commented Aug 29, 2023

Description

This PR addresses slowness in FirstFitPlanner when it looks for a cluster with capacity for a new VM.

Adding an index to cluster_details.name yields a ~10x speed improvement in this query, and a 2-3x improvement in the overall time to create 1000 VMs in parallel during scale testing. The performance of this query depends on the number of clusters, so one environment is not directly comparable to another.
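To illustrate why an index on the name column helps, here is a minimal sketch using Python's built-in sqlite3 module rather than MySQL. The table layout is a simplified assumption (only id, cluster_id, name, value), the data is synthetic, and SQLite's planner is not MySQL's, so this shows only the general effect: the lookup by name changes from a full table scan to an index search.

```python
import sqlite3

# Simplified, assumed stand-in for cloud.cluster_details (not the real schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cluster_details ("
    "id INTEGER PRIMARY KEY, cluster_id INTEGER, name TEXT, value TEXT)"
)

# Synthetic rows: two detail entries per cluster for 100 clusters.
rows = []
for cluster_id in range(1, 101):
    rows.append((cluster_id, "cpuOvercommitRatio", "2.0"))
    rows.append((cluster_id, "memoryOvercommitRatio", "1.0"))
conn.executemany(
    "INSERT INTO cluster_details (cluster_id, name, value) VALUES (?, ?, ?)", rows
)

QUERY = "SELECT cluster_id, value FROM cluster_details WHERE name = 'cpuOvercommitRatio'"


def plan(connection, sql):
    """Return SQLite's query plan for sql as a single string."""
    # The human-readable detail is the fourth column of each plan row.
    return " ".join(row[3] for row in connection.execute("EXPLAIN QUERY PLAN " + sql))


plan_before = plan(conn, QUERY)
conn.execute("CREATE INDEX i_cluster_details__name ON cluster_details (name)")
plan_after = plan(conn, QUERY)

print("before:", plan_before)
print("after: ", plan_after)
```

On MySQL the equivalent check would be running EXPLAIN on the planner query before and after the upgrade.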

Slow query log entries from creating 1000 VMs with 100 parallel workers in a 100-cluster zone:

# Query_time: 19.002664
# Query_time: 20.131617
# Query_time: 12.710961
(three log entries, identical query text)
SELECT DISTINCT capacity.cluster_id
FROM `cloud`.`op_host_capacity` capacity
INNER JOIN `cloud`.`cluster` cluster
  ON (cluster.id = capacity.cluster_id AND cluster.removed is NULL)
INNER JOIN `cloud`.`cluster_details` cluster_details
  ON (cluster.id = cluster_details.cluster_id)
WHERE capacity.data_center_id = 1
  AND capacity_type = 1
  AND cluster_details.name = 'cpuOvercommitRatio'
  AND ((total_capacity * cluster_details.value) - used_capacity + reserved_capacity) >= 100
  AND capacity.cluster_id IN (
    SELECT distinct capacity.cluster_id
    FROM `cloud`.`op_host_capacity` capacity
    INNER JOIN `cloud`.`cluster_details` cluster_details
      ON (capacity.cluster_id = cluster_details.cluster_id)
    WHERE capacity.data_center_id = 1
      AND capacity_type = 0
      AND cluster_details.name = 'memoryOvercommitRatio'
      AND ((total_capacity * cluster_details.value) - used_capacity + reserved_capacity) >= 1073741824
  );


Same test after adding the index:

# Query_time: 1.139716
# Query_time: 1.025688
# Query_time: 2.250430
(three log entries, identical query text)
SELECT DISTINCT capacity.cluster_id
FROM `cloud`.`op_host_capacity` capacity
INNER JOIN `cloud`.`cluster` cluster
  ON (cluster.id = capacity.cluster_id AND cluster.removed is NULL)
INNER JOIN `cloud`.`cluster_details` cluster_details
  ON (cluster.id = cluster_details.cluster_id)
WHERE capacity.data_center_id = 1
  AND capacity_type = 1
  AND cluster_details.name = 'cpuOvercommitRatio'
  AND ((total_capacity * cluster_details.value) - used_capacity + reserved_capacity) >= 100
  AND capacity.cluster_id IN (
    SELECT distinct capacity.cluster_id
    FROM `cloud`.`op_host_capacity` capacity
    INNER JOIN `cloud`.`cluster_details` cluster_details
      ON (capacity.cluster_id = cluster_details.cluster_id)
    WHERE capacity.data_center_id = 1
      AND capacity_type = 0
      AND cluster_details.name = 'memoryOvercommitRatio'
      AND ((total_capacity * cluster_details.value) - used_capacity + reserved_capacity) >= 1073741824
  );

Run directly with no load, the query goes from 0.85s to 0.12s once the index is added.
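As a quick sanity check on the quoted speedup, the sketch below averages the three slow-query timings before and after the index and compares them with the no-load figures. It uses only numbers quoted in this description; averaging three samples is an illustrative choice, not a rigorous benchmark.

```python
from statistics import mean

# Query_time values (seconds) copied from the slow query log entries above.
before = [19.002664, 20.131617, 12.710961]
after = [1.139716, 1.025688, 2.250430]

# Ratio of mean query time under load, before vs. after the index.
under_load_speedup = mean(before) / mean(after)

# No-load timings quoted above: 0.85s before, 0.12s after.
no_load_speedup = 0.85 / 0.12

print(f"under load: {under_load_speedup:.1f}x")
print(f"no load:    {no_load_speedup:.1f}x")
```

The under-load ratio comes out at roughly 12x and the no-load ratio at roughly 7x, which is in line with the ~10x figure given above.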

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

Tested upgrade path locally:

4.18.0.0 to 4.18.1.0:

DEBUG [c.c.u.d.DatabaseAccessObject] (main:null) (logid:) Created index i_cluster_details__name
...
DEBUG [c.c.u.DatabaseUpgradeChecker] (main:null) (logid:) Upgrade completed for version 4.18.1.0

mysql> show indexes from cluster_details where Key_name="i_cluster_details__name";
+-----------------+------------+-------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table           | Non_unique | Key_name                | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+-----------------+------------+-------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| cluster_details |          1 | i_cluster_details__name |            1 | name        | A         |           2 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
+-----------------+------------+-------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+

4.18.0.0 to 4.18.1.0 where the index already exists:

DEBUG [c.c.u.d.DatabaseAccessObject] (main:null) (logid:) Index i_cluster_details__name already exists
...
DEBUG [c.c.u.DatabaseUpgradeChecker] (main:null) (logid:) Upgrade completed for version 4.18.1.0
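The two upgrade runs above show idempotent behaviour: the index is created when missing and skipped when present. A minimal sketch of that pattern follows, again using sqlite3 as a stand-in for MySQL. The helper name create_index_if_missing and the table layout are hypothetical and are not the actual DatabaseAccessObject code; only the index name and the two message strings mirror the log lines above.

```python
import sqlite3


def create_index_if_missing(conn, table, index_name, column):
    """Create index_name on table(column) unless it already exists.

    Hypothetical helper for illustration; returns a message that
    mirrors the DEBUG log lines quoted above.
    """
    # PRAGMA index_list returns one row per index; the name is column 1.
    existing = {row[1] for row in conn.execute(f"PRAGMA index_list('{table}')")}
    if index_name in existing:
        return f"Index {index_name} already exists"
    conn.execute(f"CREATE INDEX {index_name} ON {table} ({column})")
    return f"Created index {index_name}"


# Simplified, assumed stand-in for cloud.cluster_details.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cluster_details ("
    "id INTEGER PRIMARY KEY, cluster_id INTEGER, name TEXT, value TEXT)"
)

first = create_index_if_missing(conn, "cluster_details", "i_cluster_details__name", "name")
second = create_index_if_missing(conn, "cluster_details", "i_cluster_details__name", "name")
print(first)
print(second)
```

Checking for the index first lets the upgrade step be re-run safely, for example on a database where an operator had already added the index by hand.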

Signed-off-by: Marcus Sorensen <mls@apple.com>
Member

@yadvr yadvr left a comment

LGTM, didn't test it though

@weizhouapache
Member

@blueorangutan package

@blueorangutan

@weizhouapache a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@codecov

codecov bot commented Aug 29, 2023

Codecov Report

Merging #7922 (4e5cd40) into 4.18 (439d70f) will increase coverage by 0.00%.
The diff coverage is 56.00%.

@@            Coverage Diff            @@
##               4.18    #7922   +/-   ##
=========================================
  Coverage     13.06%   13.06%           
- Complexity     9093     9096    +3     
=========================================
  Files          2720     2720           
  Lines        257431   257456   +25     
  Branches      40141    40144    +3     
=========================================
+ Hits          33622    33634   +12     
- Misses       219582   219595   +13     
  Partials       4227     4227           
Files Changed                                          Coverage Δ
...ain/java/com/cloud/upgrade/dao/DbUpgradeUtils.java  63.15% <0.00%> (-16.85%) ⬇️
...ava/com/cloud/upgrade/dao/Upgrade41800to41810.java  2.85% <0.00%> (-0.07%) ⬇️
...va/com/cloud/upgrade/dao/DatabaseAccessObject.java  80.30% <77.77%> (-0.95%) ⬇️

... and 1 file with indirect coverage changes

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 6935

@weizhouapache
Member

@blueorangutan test

@yadvr
Member

yadvr commented Aug 29, 2023

@blueorangutan test

@blueorangutan

@rohityadavcloud a [SF] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

Signed-off-by: Marcus Sorensen <mls@apple.com>
Member

@weizhouapache weizhouapache left a comment

code lgtm

@weizhouapache
Member

@blueorangutan package

@blueorangutan

@weizhouapache a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 6936

@weizhouapache
Member

@blueorangutan package

@blueorangutan

@weizhouapache a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✖️ el7 ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 6939

@blueorangutan

[SF] Trillian test result (tid-7603)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 41398 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7922-t7603-kvm-centos7.zip
Smoke tests completed. 108 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@weizhouapache weizhouapache added this to the 4.18.1.0 milestone Aug 30, 2023
@mlsorensen
Contributor Author

@blueorangutan package

@blueorangutan

@mlsorensen a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✖️ el8 ✖️ el9 ✔️ debian ✖️ suse15. SL-JID 6942

@weizhouapache
Member

@blueorangutan package

@blueorangutan

@weizhouapache a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 6950

@weizhouapache
Member

weizhouapache commented Aug 31, 2023

@blueorangutan test matrix

@blueorangutan

@weizhouapache a [SF] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

@weizhouapache
Member

code lgtm

This will be merged when the Trillian tests finish.

@blueorangutan

[SF] Trillian test result (tid-7613)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 38117 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7922-t7613-xenserver-71.zip
Smoke tests completed. 108 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@blueorangutan

[SF] Trillian test result (tid-7615)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 42941 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7922-t7615-kvm-centos7.zip
Smoke tests completed. 108 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@weizhouapache weizhouapache merged commit 2cccd8f into apache:4.18 Aug 31, 2023
@blueorangutan

[SF] Trillian test result (tid-7614)
Environment: vmware-67u3 (x2), Advanced Networking with Mgmt server r8
Total time taken: 57667 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7922-t7614-vmware-67u3.zip
Smoke tests completed. 105 look OK, 3 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test                                                Result   Time (s)  Test File
test_02_upgrade_kubernetes_cluster                  Failure  571.07    test_kubernetes_clusters.py
test_01_deploy_vm_on_specific_host                  Error    21.22     test_vm_deployment_planner.py
test_02_deploy_vm_on_specific_cluster               Error    3602.15   test_vm_deployment_planner.py
test_03_deploy_vm_on_specific_pod                   Error    3.42      test_vm_deployment_planner.py
test_04_deploy_vm_on_host_override_pod_and_cluster  Error    1.35      test_vm_deployment_planner.py
test_05_deploy_vm_on_cluster_override_pod           Error    2.30      test_vm_deployment_planner.py
test_09_expunge_vm                                  Failure  424.64    test_vm_life_cycle.py
