Fix host stuck in connecting state #8502
Conversation
|
@blueorangutan package |
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #8502 +/- ##
============================================
+ Coverage 30.77% 30.81% +0.03%
- Complexity 33970 34003 +33
============================================
Files 5341 5341
Lines 374971 374973 +2
Branches 54546 54546
============================================
+ Hits 115408 115541 +133
+ Misses 244308 244155 -153
- Partials 15255 15277 +22
|
@vishesh92, please, describe the situation you are facing and the tests you have done. |
|
@blueorangutan package |
|
@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
|
|
final Long dcId = host.getDataCenterId();
final ReadyCommand ready = new ReadyCommand(dcId, host.getId(), NumbersUtil.enableHumanReadableSizes);
ready.setWait(60);
These wait/timeouts should be externalized.
I remember somebody complained that there are too many global settings.
What's the benefit of adding another setting? @GutoVeronezi
Normally, ReadyCommand is executed in less than 1 second.
60 seconds is a very safe value.
In general, hard-coded "magic" numbers are not desirable; we bind the system to what we think is proper and do not allow operators to shape the system according to their use case.
For the timeout/wait settings, we could think of a flexible mechanism that allows operators to override the timeout/wait values in a granular way, e.g. defining a timeout/wait for each command and falling back to the global setting wait in the absence of the specific setting.
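To make the suggestion concrete, here is a minimal sketch of a per-command wait that falls back to the global value. `CommandWaitResolver`, `setOverride` and `waitFor` are hypothetical names used only for illustration; they are not CloudStack API, and the command names are plain strings chosen for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not CloudStack API): per-command wait with a global fallback. */
final class CommandWaitResolver {
    private final int globalWaitSeconds;
    private final Map<String, Integer> perCommandWaitSeconds = new HashMap<>();

    CommandWaitResolver(int globalWaitSeconds) {
        this.globalWaitSeconds = globalWaitSeconds;
    }

    /** Registers an operator-defined wait for a single command name. */
    void setOverride(String commandName, int waitSeconds) {
        if (waitSeconds <= 0) {
            throw new IllegalArgumentException("wait must be positive");
        }
        perCommandWaitSeconds.put(commandName, waitSeconds);
    }

    /** Returns the specific wait if one is set, otherwise the global wait. */
    int waitFor(String commandName) {
        return perCommandWaitSeconds.getOrDefault(commandName, globalWaitSeconds);
    }
}
```

With a global wait of 1800 and an override of 60 for "ReadyCommand", `waitFor("ReadyCommand")` yields 60 while any command without an override still gets 1800.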
@GutoVeronezi
The problem is actually caused by falling back to the default wait, which is 30 minutes.
I agree with @GutoVeronezi, but in this case, if it fixes the issue, we can open a new ticket for externalizing it and merge this.
@DaanHoogland @GutoVeronezi
I understand there are some operations that can take a long time (migration, snapshots, etc.), or some intervals (for background tasks, etc.) that operators want to change.
However, for these two commands, which should be processed in less than 1 second, what's the benefit of adding a new setting?
I am with @weizhouapache on this. We already have a lot of global settings, which is confusing for the end user as well.
@vishesh92 @DaanHoogland @GutoVeronezi @weizhouapache I'm inclined to include this fix in 4.19.0. I'm creating an improvement ticket for follow-up on externalizing the wait. Maybe we can look into some generic solution while addressing it.
The number of global settings we already have should not be an argument. Nobody knows them by heart, and anybody will have to use search facilities to manage them anyway. We now have the value 60 in three separate extra places for a value that has a default of 1800 (?). Both tuning and code maintenance are served by externalising those.
@weizhouapache I understand your argument against the need for tuning; however, we are now setting 60 for a value that you say should be one. That sounds like we will want to tune it to 10 or 5 or maybe 7 afterwards.
Anyway, this is merged. Let's discuss on #8506.
|
@vishesh92 |
|
Packaging result [SF]: ✔️ el7 ✔️ el8 ✖️ el9 ✔️ debian ✖️ suse15. SL-JID 8304 |
|
@blueorangutan package |
|
@weizhouapache a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el7 ✔️ el8 ✖️ el9 ✔️ debian ✖️ suse15. SL-JID 8306 |
de04062 to a7d108c
|
@blueorangutan package |
|
@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8307 |
|
@blueorangutan test |
|
@shwstppr a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests |
|
[SF] Trillian test result (tid-8823)
|
|
@blueorangutan test rocky8 kvm-rocky8 |
|
@weizhouapache a [SL] Trillian-Jenkins test job (rocky8 mgmt + kvm-rocky8) has been kicked to run smoke tests |
|
@blueorangutan test ubuntu22 kvm-rocky8 |
|
[SF] Trillian test result (tid-8826)
|
|
@blueorangutan test ubuntu22 kvm-rocky8 |
|
@shwstppr a [SL] Trillian-Jenkins test job (ubuntu22 mgmt + kvm-rocky8) has been kicked to run smoke tests |
|
@blueorangutan test |
|
@shwstppr a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests |
|
[SF] Trillian test result (tid-8829)
|
|
[SF] Trillian test result (tid-8828)
|
shwstppr
left a comment
LGTM. No host-in-connecting-state issue now. Some failures with the U22 mgmt server look intermittent and unrelated.
DaanHoogland
left a comment
clgtm
weizhouapache
left a comment
code lgtm
There are a lot of test failures due to test_vm_life_cycle.py in multiple PRs due to host not available for migration of VMs.
While debugging I noticed that the hosts get stuck in Connecting state because MS is waiting for a response of the ReadyCommand from the agent. Since we take a lock on connection and disconnection, restarting the agent doesn't work. To fix this, we have to restart the MS or wait for ~1 hour (default timeout).
On the agent side, it gets stuck waiting for a response from the Script execution.
To reproduce, run smoke/test_vm_life_cycle.py (TestSecuredVmMigration test class to be specific). Once the tests are complete, you will notice that some hosts are stuck in Connecting state. And restarting the agent fails due to the named lock. Locks on DB can be checked using the below query.
SELECT * FROM performance_schema.metadata_locks INNER JOIN performance_schema.threads ON THREAD_ID = OWNER_THREAD_ID WHERE PROCESSLIST_ID <> CONNECTION_ID() \G;
This PR adds a wait for the ready command and a timeout to the Script execution to ensure that the thread doesn't get stuck and the named lock from database is released.
…irt ready command wrapper (#8547)
This PR fixes a bug introduced in #8502: the timeout for script execution was set to 60 ms instead of 60 s, which resulted in hosts not getting UEFI enabled. This is a blocker for the 4.19 release.
We fix this by introducing a new agent parameter `agent.script.timeout` (default: 60 seconds) to use as the timeout for the script that checks the host's UEFI status. We also externalize the timeout for the ReadyCommand by introducing a new global setting `ready.command.wait` (default: 60 seconds).
For ModifyStoragePoolCommand, we don't externalize the timeout, to avoid confusing the user, since the required timeout can vary depending on the provider in use and we are only setting the wait for the default host listener for now. Instead, we reuse the global `wait` setting divided by `5`, which gives a default of 6 minutes (1800/5 = 360 s) for ModifyStoragePoolCommand.
Note: the actual time the MS waits is twice the wait set for a Command. See the reference code below.
https://github.com/apache/cloudstack/blob/19250403e645c76f60b17aa4aeb4dc915f5ca206/engine/orchestration/src/main/java/com/cloud/agent/manager/AgentAttache.java#L406-L442
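The numbers in this commit message can be checked with a small sketch. `WaitArithmetic` and its methods are hypothetical names for illustration only; the factor of two is taken from the note above rather than from a reading of `AgentAttache`, and the seconds-to-milliseconds helper only illustrates the kind of unit mix-up described (60 ms versus 60 s), not the actual patched code.

```java
import java.time.Duration;

/** Hypothetical helper (not CloudStack code): arithmetic behind the defaults quoted above. */
final class WaitArithmetic {
    /** Global `wait` divided by 5, as described for ModifyStoragePoolCommand. */
    static int modifyStoragePoolWaitSeconds(int globalWaitSeconds) {
        return globalWaitSeconds / 5;
    }

    /** Assumes, per the note above, that the MS waits twice the wait set on a command. */
    static Duration effectiveManagementServerWait(int commandWaitSeconds) {
        return Duration.ofSeconds(2L * commandWaitSeconds);
    }

    /** Explicit unit conversion: a 60 second timeout is 60000 ms, not 60 ms. */
    static long scriptTimeoutMillis(int timeoutSeconds) {
        return Duration.ofSeconds(timeoutSeconds).toMillis();
    }
}
```

Under these assumptions, the global default of 1800 s corresponds to an effective wait of one hour, which matches the "~1 hour" mentioned in #8502, and the 360 s used for ModifyStoragePoolCommand corresponds to 12 minutes.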
Description
There are a lot of test failures due to `test_vm_life_cycle.py` in multiple PRs because no host is available for VM migration:
#8438 (comment)
#8433 (comment)
#7344 (comment)
While debugging, I noticed that the hosts get stuck in the `Connecting` state because the MS is waiting for a response to the `ReadyCommand` from the agent. Since we take a lock on connection and disconnection, restarting the agent doesn't work. To fix this, we have to restart the MS or wait for ~1 hour (default timeout).
On the agent side, it gets stuck waiting for a response from the Script execution.
To reproduce, run `smoke/test_vm_life_cycle.py` (the `TestSecuredVmMigration` test class, to be specific). Once the tests are complete, you will notice that some hosts are stuck in the `Connecting` state, and restarting the agent fails due to the named lock. Locks on the DB can be checked using the query below.
SELECT * FROM performance_schema.metadata_locks INNER JOIN performance_schema.threads ON THREAD_ID = OWNER_THREAD_ID WHERE PROCESSLIST_ID <> CONNECTION_ID() \G;
This PR adds a wait for the ready command and a timeout to the Script execution to ensure that the thread doesn't get stuck and the named lock in the database is released.
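For illustration, the idea of bounding a script execution so the calling thread cannot block forever can be sketched with the standard `Process` API. This is not the CloudStack `Script` class; `BoundedScriptRunner` is a hypothetical name, and the sketch assumes Java 9 or later (for `ProcessBuilder.Redirect.DISCARD`).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.OptionalInt;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not the CloudStack Script class): runs a command with an upper time bound. */
final class BoundedScriptRunner {
    /** Returns the exit code, or an empty value if the command exceeded the timeout and was killed. */
    static OptionalInt run(long timeout, TimeUnit unit, String... command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    // Discard output so a full pipe buffer can never block the child.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            // waitFor(timeout, unit) returns false if the process is still running.
            if (!process.waitFor(timeout, unit)) {
                process.destroyForcibly();
                return OptionalInt.empty();
            }
            return OptionalInt.of(process.exitValue());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            // Preserve the interrupt flag for callers further up the stack.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for script", e);
        }
    }
}
```

Passing the unit explicitly alongside the value keeps the caller from silently mixing seconds and milliseconds. On a Unix-like host, `run(200, TimeUnit.MILLISECONDS, "sleep", "5")` returns an empty value instead of blocking for five seconds.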
Types of changes
Bug Severity
Screenshots (if appropriate):
How Has This Been Tested?
Ran `smoke/test_vm_life_cycle.py` multiple times. The host does get stuck again, but only for 1 minute instead of 1 hour.