Conversation

@vishesh92 (Member) commented Jan 12, 2024

Description

Multiple PRs are seeing test failures in test_vm_life_cycle.py because no host is available for VM migration:
#8438 (comment)
#8433 (comment)
#7344 (comment)

While debugging, I noticed that the hosts get stuck in the Connecting state because the MS is waiting for a response to the ReadyCommand from the agent. Since we take a named lock on connection and disconnection, restarting the agent doesn't help; to recover, we have to restart the MS or wait ~1 hour (the default timeout).

On the agent side, the thread gets stuck waiting for the Script execution to return.

To reproduce, run smoke/test_vm_life_cycle.py (the TestSecuredVmMigration test class, to be specific). Once the tests complete, you will notice that some hosts are stuck in the Connecting state, and restarting the agent fails due to the named lock. Locks in the database can be checked with the query below.

SELECT *
FROM performance_schema.metadata_locks
INNER JOIN performance_schema.threads ON THREAD_ID = OWNER_THREAD_ID
WHERE PROCESSLIST_ID <> CONNECTION_ID()\G

This PR adds a wait to the ReadyCommand and a timeout to the Script execution to ensure that the thread doesn't get stuck and the named lock in the database is released.
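
To make the agent-side change concrete, here is a minimal sketch of running a host script under a hard timeout, using a plain ProcessBuilder in place of CloudStack's Script utility (class and method names are illustrative, not the PR's actual code):

import java.util.concurrent.TimeUnit;

// Sketch: run a host command with a hard timeout so a hung process cannot
// block the agent's ReadyCommand answer (and the MS-side named lock) forever.
public final class TimedScript {
    public static int run(long timeoutSeconds, String... command) throws Exception {
        Process p = new ProcessBuilder(command).redirectErrorStream(true).start();
        if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            p.destroyForcibly(); // kill the stuck script and fail fast
            throw new IllegalStateException("Script timed out after " + timeoutSeconds + "s");
        }
        return p.exitValue();
    }
}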

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

Ran smoke/test_vm_life_cycle.py multiple times. Hosts do still get stuck, but only for about 1 minute instead of ~1 hour.

@vishesh92 (Member Author)

@blueorangutan package

codecov bot commented Jan 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (3936f7c) 30.77% compared to head (a7d108c) 30.81%.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #8502      +/-   ##
============================================
+ Coverage     30.77%   30.81%   +0.03%     
- Complexity    33970    34003      +33     
============================================
  Files          5341     5341              
  Lines        374971   374973       +2     
  Branches      54546    54546              
============================================
+ Hits         115408   115541     +133     
+ Misses       244308   244155     -153     
- Partials      15255    15277      +22     
Flag                     Coverage Δ
simulator-marvin-tests   24.71% <66.66%> (+0.06%) ⬆️
uitests                  4.39% <ø> (ø)
unit-tests               16.48% <66.66%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.


@GutoVeronezi (Contributor)

@vishesh92, please describe the situation you are facing and the tests you have done.

@vishesh92 (Member Author)

@blueorangutan package

@blueorangutan

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.


final Long dcId = host.getDataCenterId();
final ReadyCommand ready = new ReadyCommand(dcId, host.getId(), NumbersUtil.enableHumanReadableSizes);
ready.setWait(60); // bound the MS-side wait for the agent's ReadyCommand answer (seconds)
Contributor:

These wait/timeouts should be externalized.

Member:

I remember somebody complained there are too many global settings. What's the benefit of adding another setting, @GutoVeronezi?

Normally, the ReadyCommand is executed in less than 1 second; 60 seconds is a very safe value.

Contributor:

In general, hard-coded "magic" numbers are undesirable; they bind the system to what we think is proper and do not allow operators to shape the system according to their use case.

For the timeout/wait settings, we could think of a flexible mechanism that allows operators to override the timeout/wait values in a granular way, e.g. defining a timeout/wait for each command and falling back to the global setting wait in the absence of the specific setting.
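
For example, the granular fallback could look roughly like the sketch below (the setting names and lookup mechanism are assumptions; CloudStack's actual config framework would differ):

import java.util.Map;

// Sketch: resolve a per-command wait, falling back to the global `wait`
// setting when no command-specific override exists.
final class CommandWaitResolver {
    private final Map<String, Integer> perCommandWaits; // hypothetical, e.g. "ReadyCommand" -> 60
    private final int globalWaitSeconds;                // the existing global `wait` setting

    CommandWaitResolver(Map<String, Integer> perCommandWaits, int globalWaitSeconds) {
        this.perCommandWaits = perCommandWaits;
        this.globalWaitSeconds = globalWaitSeconds;
    }

    int waitFor(String commandName) {
        return perCommandWaits.getOrDefault(commandName, globalWaitSeconds);
    }
}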

Member:

@GutoVeronezi
The problem is actually caused by falling back to the default wait, which is 30 minutes by default.

Contributor:

I agree with @GutoVeronezi, but in this case, if this fixes the issue, we can add a new ticket for externalizing it and merge this.

Member:

@DaanHoogland @GutoVeronezi
I understand there are some operations which can take a long time (migration, snapshots, etc.) and some intervals (for background tasks, etc.) that operators may want to change.
However, for these two commands, which should be processed in less than 1 second, what's the benefit of adding a new setting?

Member Author:

I am with @weizhouapache on this. We already have a lot of global settings, which is confusing for the end user as well.

Contributor:

@vishesh92 @DaanHoogland @GutoVeronezi @weizhouapache I'm inclined to include this fix in 4.19.0. I'm creating an improvement ticket to follow up on externalizing the wait. Maybe we can look into a generic solution while addressing it.

Contributor:

The number of global settings we already have should not be an argument; nobody knows them by heart, and anybody will have to use search facilities to manage them anyway. We now have the value 60 in three separate extra places for a value whose default is 1800 (?). Both tuning and code maintenance are served by externalising those.
@weizhouapache I understand your argument against the need for tuning; however, we are now setting 60 for a value that you say should be 1. That sounds like we will want to tune it to 10 or 5, or maybe 7, afterwards.

Anyway, this is merged. Let's discuss on #8506.

@weizhouapache (Member)

@vishesh92
I also suggest adding the following change:

diff --git a/engine/storage/volume/src/main/java/org/apache/cloudstack/storage/datastore/provider/DefaultHostListener.java b/engine/storage/volume/src/main/java/org/apache/cloudstack/storage/datastore/provider/DefaultHostListener.java
index e344a87831d..90e8742c84d 100644
--- a/engine/storage/volume/src/main/java/org/apache/cloudstack/storage/datastore/provider/DefaultHostListener.java
+++ b/engine/storage/volume/src/main/java/org/apache/cloudstack/storage/datastore/provider/DefaultHostListener.java
@@ -121,6 +121,7 @@ public class DefaultHostListener implements HypervisorHostListener {
     public boolean hostConnect(long hostId, long poolId) throws StorageConflictException {
         StoragePool pool = (StoragePool) this.dataStoreMgr.getDataStore(poolId, DataStoreRole.Primary);
         ModifyStoragePoolCommand cmd = new ModifyStoragePoolCommand(true, pool);
+        cmd.setWait(60);
         final Answer answer = agentMgr.easySend(hostId, cmd);
 
         if (answer == null) {

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✖️ el9 ✔️ debian ✖️ suse15. SL-JID 8304

@weizhouapache (Member)

@blueorangutan package

@blueorangutan

@weizhouapache a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✖️ el9 ✔️ debian ✖️ suse15. SL-JID 8306

@vishesh92 vishesh92 force-pushed the fix-host-connecting branch from de04062 to a7d108c Compare January 12, 2024 17:09
@vishesh92 (Member Author)

@blueorangutan package

@blueorangutan

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8307

@shwstppr (Contributor)

@blueorangutan test

@blueorangutan

@shwstppr a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-8823)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 50344 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8502-t8823-kvm-centos7.zip
Smoke tests completed. 120 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
ContextSuite context=TestSharedNetwork>:setup Error 57.66 test_network.py

@weizhouapache (Member)

@blueorangutan test rocky8 kvm-rocky8

@blueorangutan

@weizhouapache a [SL] Trillian-Jenkins test job (rocky8 mgmt + kvm-rocky8) has been kicked to run smoke tests

@weizhouapache weizhouapache mentioned this pull request Jan 13, 2024
@weizhouapache (Member)

@blueorangutan test ubuntu22 kvm-rocky8

@blueorangutan

[SF] Trillian test result (tid-8826)
Environment: kvm-rocky8 (x2), Advanced Networking with Mgmt server r8
Total time taken: 46721 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8502-t8826-kvm-rocky8.zip
Smoke tests completed. 120 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_redundant_vpc_site2site_vpn Failure 414.07 test_vpc_vpn.py

@shwstppr (Contributor)

@blueorangutan test ubuntu22 kvm-rocky8

@blueorangutan

@shwstppr a [SL] Trillian-Jenkins test job (ubuntu22 mgmt + kvm-rocky8) has been kicked to run smoke tests

@shwstppr (Contributor)

@blueorangutan test

@blueorangutan

@shwstppr a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-8829)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 46484 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8502-t8829-kvm-centos7.zip
Smoke tests completed. 121 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@blueorangutan

[SF] Trillian test result (tid-8828)
Environment: kvm-rocky8 (x2), Advanced Networking with Mgmt server u22
Total time taken: 55332 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8502-t8828-kvm-rocky8.zip
Smoke tests completed. 119 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_list_sec_storage_vm Failure 0.04 test_ssvm.py
test_03_ssvm_internals Failure 0.04 test_ssvm.py
test_05_stop_ssvm Failure 0.04 test_ssvm.py
test_08_upgrade_kubernetes_ha_cluster Failure 1097.85 test_kubernetes_clusters.py

@vishesh92 vishesh92 marked this pull request as ready for review January 15, 2024 07:42
@shwstppr (Contributor) left a comment

LGTM. No host stuck in the Connecting state now. Some failures with the U22 mgmt server look intermittent and unrelated.

@shwstppr shwstppr requested a review from DaanHoogland January 15, 2024 07:54
@DaanHoogland (Contributor) left a comment

clgtm

@weizhouapache (Member) left a comment

code lgtm

@shwstppr shwstppr merged commit c3b77cb into apache:main Jan 15, 2024
@shwstppr shwstppr deleted the fix-host-connecting branch January 15, 2024 08:26
@shwstppr shwstppr added this to the 4.19.0.0 milestone Jan 15, 2024
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Jan 26, 2024
shwstppr pushed a commit that referenced this pull request Jan 27, 2024
Externalise a few timeouts & fix timeout for hostSupportsUefi in libvirt ready command wrapper (#8547)

This PR fixes a bug introduced in #8502: the timeout for script execution was set to 60 ms instead of 60 s, which resulted in the host not getting UEFI enabled. This is a blocker for the 4.19 release.

We fix this by introducing a new agent parameter `agent.script.timeout` (default: 60 seconds) to use as a timeout for the script checking the host's UEFI status.

We also externalize the timeout for the ReadyCommand by introducing a new global setting `ready.command.wait` (default: 60 seconds).

For ModifyStoragePoolCommand, we don't externalize the timeout, to avoid confusing the user, since the required timeout can vary depending on the provider in use and we are only setting the wait for the default host listener for now. Instead, we reuse the global `wait` setting divided by 5, making the default value for ModifyStoragePoolCommand 6 minutes (1800/5 = 360 s).

Note: the actual time the MS waits is twice the wait set for a Command; see the reference code below.
https://github.com/apache/cloudstack/blob/19250403e645c76f60b17aa4aeb4dc915f5ca206/engine/orchestration/src/main/java/com/cloud/agent/manager/AgentAttache.java#L406-L442
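
To make the unit trap and the derived defaults concrete, here is a small worked sketch (values are the defaults quoted above; the variable names are illustrative, not the actual code):

public final class TimeoutDefaults {
    public static void main(String[] args) {
        // agent.script.timeout is configured in seconds; a millisecond-based
        // script API handed the raw value 60 waits only 60 ms, the bug #8547 fixes.
        long agentScriptTimeoutSeconds = 60;
        long scriptTimeoutMillis = agentScriptTimeoutSeconds * 1000L;

        // ModifyStoragePoolCommand reuses the global `wait` (default 1800 s)
        // divided by 5: 1800 / 5 = 360 s = 6 minutes.
        int globalWaitSeconds = 1800;
        int modifyPoolWaitSeconds = globalWaitSeconds / 5;

        System.out.println(scriptTimeoutMillis + " ms script timeout, "
                + modifyPoolWaitSeconds + " s pool wait");
    }
}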
Pearl1594 pushed a commit that referenced this pull request Feb 6, 2024
Externalise a few timeouts & fix timeout for hostSupportsUefi in libvirt ready command wrapper (#8547)
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Feb 7, 2024
Externalise a few timeouts & fix timeout for hostSupportsUefi in libvirt ready command wrapper (apache#8547)
shwstppr pushed a commit to shapeblue/cloudstack that referenced this pull request Apr 11, 2024
* Fix host stuck in connecting state (apache#8502)
* Externalise a few timeouts & fix timeout for hostSupportsUefi in libvirt ready command wrapper (apache#8547)
* fixup