Conversation

@vishesh92 (Member) commented Jan 12, 2024

Description

Multiple PRs are seeing test failures in test_vm_life_cycle.py because no host is available for VM migration:
#8438 (comment)
#8433 (comment)
#7344 (comment)

While debugging, I noticed that the hosts get stuck in the Connecting state because the MS is waiting for a response to the ReadyCommand from the agent. Since we take a named lock on connection and disconnection, restarting the agent doesn't help; to recover, we have to restart the MS or wait ~1 hour (the default timeout).

On the agent side, the thread gets stuck waiting for the Script execution to return.

To reproduce, run smoke/test_vm_life_cycle.py (the TestSecuredVmMigration test class, to be specific). Once the tests complete, you will notice that some hosts are stuck in the Connecting state, and restarting the agent fails due to the named lock. Locks in the database can be checked with the query below.

SELECT *
FROM performance_schema.metadata_locks
INNER JOIN performance_schema.threads ON THREAD_ID = OWNER_THREAD_ID
WHERE PROCESSLIST_ID <> CONNECTION_ID()\G

This PR adds a wait to the ReadyCommand and a timeout to the Script execution to ensure that the thread doesn't get stuck and the named lock in the database is released.
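
To make the agent-side change concrete, here is a minimal sketch of running a host script under a hard timeout, using a plain ProcessBuilder in place of CloudStack's Script utility (class and method names are illustrative, not the PR's actual code):

import java.util.concurrent.TimeUnit;

// Sketch: run a host command with a hard timeout so a hung process cannot
// block the agent's ReadyCommand answer (and the MS-side named lock) forever.
public final class TimedScript {
    public static int run(long timeoutSeconds, String... command) throws Exception {
        Process p = new ProcessBuilder(command).redirectErrorStream(true).start();
        if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            p.destroyForcibly(); // kill the stuck script and fail fast
            throw new IllegalStateException("Script timed out after " + timeoutSeconds + "s");
        }
        return p.exitValue();
    }
}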

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

Ran smoke/test_vm_life_cycle.py multiple times. Hosts do still get stuck, but only for about 1 minute instead of ~1 hour.

@vishesh92 (Member Author)

@blueorangutan package

codecov bot commented Jan 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (3936f7c) 30.77% compared to head (a7d108c) 30.81%.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #8502      +/-   ##
============================================
+ Coverage     30.77%   30.81%   +0.03%     
- Complexity    33970    34003      +33     
============================================
  Files          5341     5341              
  Lines        374971   374973       +2     
  Branches      54546    54546              
============================================
+ Hits         115408   115541     +133     
+ Misses       244308   244155     -153     
- Partials      15255    15277      +22     
Flag                     Coverage Δ
simulator-marvin-tests   24.71% <66.66%> (+0.06%) ⬆️
uitests                  4.39% <ø> (ø)
unit-tests               16.48% <66.66%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.


@GutoVeronezi (Contributor)

@vishesh92, please describe the situation you are facing and the tests you have done.

@vishesh92 (Member Author)

@blueorangutan package

@blueorangutan

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.


final Long dcId = host.getDataCenterId();
final ReadyCommand ready = new ReadyCommand(dcId, host.getId(), NumbersUtil.enableHumanReadableSizes);
ready.setWait(60); // bound the MS-side wait for the agent's ReadyCommand answer (seconds)
Contributor:

These wait/timeouts should be externalized.

Member:

I remember somebody complained there are too many global settings. What's the benefit of adding another setting, @GutoVeronezi?

Normally, the ReadyCommand is executed in less than 1 second; 60 seconds is a very safe value.

Contributor:

In general, hard-coded "magic" numbers are undesirable; they bind the system to what we think is proper and do not allow operators to shape the system according to their use case.

For the timeout/wait settings, we could think of a flexible mechanism that allows operators to override the timeout/wait values in a granular way, e.g. defining a timeout/wait for each command and falling back to the global setting wait in the absence of the specific setting.
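
For example, the granular fallback could look roughly like the sketch below (the setting names and lookup mechanism are assumptions; CloudStack's actual config framework would differ):

import java.util.Map;

// Sketch: resolve a per-command wait, falling back to the global `wait`
// setting when no command-specific override exists.
final class CommandWaitResolver {
    private final Map<String, Integer> perCommandWaits; // hypothetical, e.g. "ReadyCommand" -> 60
    private final int globalWaitSeconds;                // the existing global `wait` setting

    CommandWaitResolver(Map<String, Integer> perCommandWaits, int globalWaitSeconds) {
        this.perCommandWaits = perCommandWaits;
        this.globalWaitSeconds = globalWaitSeconds;
    }

    int waitFor(String commandName) {
        return perCommandWaits.getOrDefault(commandName, globalWaitSeconds);
    }
}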

Member:

@GutoVeronezi
The problem is actually caused by falling back to the default wait, which is 30 minutes by default.

Contributor:

I agree with @GutoVeronezi, but in this case, if this fixes the issue, we can add a new ticket for externalizing it and merge this.

Member:

@DaanHoogland @GutoVeronezi
I understand there are some operations which can take a long time (migration, snapshots, etc.) and some intervals (for background tasks, etc.) that operators may want to change.
However, for these two commands, which should be processed in less than 1 second, what's the benefit of adding a new setting?

Member Author:

I am with @weizhouapache on this. We already have a lot of global settings, which is confusing for the end user as well.

Contributor:

@vishesh92 @DaanHoogland @GutoVeronezi @weizhouapache I'm inclined to include this fix in 4.19.0. I'm creating an improvement ticket to follow up on externalizing the wait. Maybe we can look into a generic solution while addressing it.

Contributor:

The number of global settings we already have should not be an argument; nobody knows them by heart, and anybody will have to use search facilities to manage them anyway. We now have the value 60 in three separate extra places for a value whose default is 1800 (?). Both tuning and code maintenance are served by externalising those.
@weizhouapache I understand your argument against the need for tuning; however, we are now setting 60 for a value that you say should be 1. That sounds like we will want to tune it to 10 or 5, or maybe 7, afterwards.

Anyway, this is merged. Let's discuss on #8506.

@weizhouapache (Member)

@vishesh92
I also suggest adding the following change:

diff --git a/engine/storage/volume/src/main/java/org/apache/cloudstack/storage/datastore/provider/DefaultHostListener.java b/engine/storage/volume/src/main/java/org/apache/cloudstack/storage/datastore/provider/DefaultHostListener.java
index e344a87831d..90e8742c84d 100644
--- a/engine/storage/volume/src/main/java/org/apache/cloudstack/storage/datastore/provider/DefaultHostListener.java
+++ b/engine/storage/volume/src/main/java/org/apache/cloudstack/storage/datastore/provider/DefaultHostListener.java
@@ -121,6 +121,7 @@ public class DefaultHostListener implements HypervisorHostListener {
     public boolean hostConnect(long hostId, long poolId) throws StorageConflictException {
         StoragePool pool = (StoragePool) this.dataStoreMgr.getDataStore(poolId, DataStoreRole.Primary);
         ModifyStoragePoolCommand cmd = new ModifyStoragePoolCommand(true, pool);
+        cmd.setWait(60);
         final Answer answer = agentMgr.easySend(hostId, cmd);
 
         if (answer == null) {

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✖️ el9 ✔️ debian ✖️ suse15. SL-JID 8304

@weizhouapache (Member)

@blueorangutan package

@blueorangutan

@weizhouapache a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✖️ el9 ✔️ debian ✖️ suse15. SL-JID 8306

@vishesh92 vishesh92 force-pushed the fix-host-connecting branch from de04062 to a7d108c Compare January 12, 2024 17:09
@vishesh92 (Member Author)

@blueorangutan package

@blueorangutan

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8307

@shwstppr (Contributor)

@blueorangutan test

@blueorangutan

@shwstppr a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-8823)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 50344 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8502-t8823-kvm-centos7.zip
Smoke tests completed. 120 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
ContextSuite context=TestSharedNetwork>:setup Error 57.66 test_network.py

@weizhouapache (Member)

@blueorangutan test rocky8 kvm-rocky8

@blueorangutan

@weizhouapache a [SL] Trillian-Jenkins test job (rocky8 mgmt + kvm-rocky8) has been kicked to run smoke tests

@weizhouapache weizhouapache mentioned this pull request Jan 13, 2024
@weizhouapache (Member)

@blueorangutan test ubuntu22 kvm-rocky8

@blueorangutan

[SF] Trillian test result (tid-8826)
Environment: kvm-rocky8 (x2), Advanced Networking with Mgmt server r8
Total time taken: 46721 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8502-t8826-kvm-rocky8.zip
Smoke tests completed. 120 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_redundant_vpc_site2site_vpn Failure 414.07 test_vpc_vpn.py

@shwstppr (Contributor)

@blueorangutan test ubuntu22 kvm-rocky8

@blueorangutan

@shwstppr a [SL] Trillian-Jenkins test job (ubuntu22 mgmt + kvm-rocky8) has been kicked to run smoke tests

@shwstppr (Contributor)

@blueorangutan test

@blueorangutan

@shwstppr a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-8829)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 46484 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8502-t8829-kvm-centos7.zip
Smoke tests completed. 121 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@blueorangutan

[SF] Trillian test result (tid-8828)
Environment: kvm-rocky8 (x2), Advanced Networking with Mgmt server u22
Total time taken: 55332 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8502-t8828-kvm-rocky8.zip
Smoke tests completed. 119 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_list_sec_storage_vm Failure 0.04 test_ssvm.py
test_03_ssvm_internals Failure 0.04 test_ssvm.py
test_05_stop_ssvm Failure 0.04 test_ssvm.py
test_08_upgrade_kubernetes_ha_cluster Failure 1097.85 test_kubernetes_clusters.py

@vishesh92 vishesh92 marked this pull request as ready for review January 15, 2024 07:42
@shwstppr (Contributor) left a comment

LGTM. No host stuck in the Connecting state now. Some failures with the U22 mgmt server look intermittent and unrelated.

@shwstppr shwstppr requested a review from DaanHoogland January 15, 2024 07:54
@DaanHoogland (Contributor) left a comment

clgtm

@weizhouapache (Member) left a comment

code lgtm

@shwstppr shwstppr merged commit c3b77cb into apache:main Jan 15, 2024
@shwstppr shwstppr deleted the fix-host-connecting branch January 15, 2024 08:26
@shwstppr shwstppr added this to the 4.19.0.0 milestone Jan 15, 2024
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Jan 26, 2024
shwstppr pushed a commit that referenced this pull request Jan 27, 2024
Externalise a few timeouts & fix timeout for hostSupportsUefi in libvirt ready command wrapper (#8547)

This PR fixes a bug introduced in #8502: the timeout for script execution was set to 60 ms instead of 60 s, which resulted in the host not getting UEFI enabled. This is a blocker for the 4.19 release.

We fix this by introducing a new agent parameter `agent.script.timeout` (default: 60 seconds) to use as a timeout for the script checking the host's UEFI status.

We also externalize the timeout for the ReadyCommand by introducing a new global setting `ready.command.wait` (default: 60 seconds).

For ModifyStoragePoolCommand, we don't externalize the timeout, to avoid confusing the user, since the required timeout can vary depending on the provider in use and we are only setting the wait for the default host listener for now. Instead, we reuse the global `wait` setting divided by 5, making the default value for ModifyStoragePoolCommand 6 minutes (1800/5 = 360 s).

Note: the actual time the MS waits is twice the wait set for a Command; see the reference code below.
https://github.com/apache/cloudstack/blob/19250403e645c76f60b17aa4aeb4dc915f5ca206/engine/orchestration/src/main/java/com/cloud/agent/manager/AgentAttache.java#L406-L442
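
To make the unit trap and the derived defaults concrete, here is a small worked sketch (values are the defaults quoted above; the variable names are illustrative, not the actual code):

public final class TimeoutDefaults {
    public static void main(String[] args) {
        // agent.script.timeout is configured in seconds; a millisecond-based
        // script API handed the raw value 60 waits only 60 ms, the bug #8547 fixes.
        long agentScriptTimeoutSeconds = 60;
        long scriptTimeoutMillis = agentScriptTimeoutSeconds * 1000L;

        // ModifyStoragePoolCommand reuses the global `wait` (default 1800 s)
        // divided by 5: 1800 / 5 = 360 s = 6 minutes.
        int globalWaitSeconds = 1800;
        int modifyPoolWaitSeconds = globalWaitSeconds / 5;

        System.out.println(scriptTimeoutMillis + " ms script timeout, "
                + modifyPoolWaitSeconds + " s pool wait");
    }
}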
Pearl1594 pushed a commit that referenced this pull request Feb 6, 2024
Externalise a few timeouts & fix timeout for hostSupportsUefi in libvirt ready command wrapper (#8547)
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Feb 7, 2024
Externalise a few timeouts & fix timeout for hostSupportsUefi in libvirt ready command wrapper (apache#8547)
shwstppr pushed a commit to shapeblue/cloudstack that referenced this pull request Apr 11, 2024
* Fix host stuck in connecting state (apache#8502)
* Externalise a few timeouts & fix timeout for hostSupportsUefi in libvirt ready command wrapper (apache#8547)
* fixup