Skip to content

Conversation

@slavkap
Copy link
Contributor

@slavkap slavkap commented Oct 5, 2023

Description

Extending the current functionality of KVM Host HA for the StorPool storage plugin and the option for easy integration for the rest of the storage plugins to support Host HA

This extension works like the current NFS storage implementation. It allows it to be used simultaneously with NFS and StorPool storage or only with StorPool primary storage.

If it is used with different primary storages like NFS and StorPool, and one of the health checks fails for storage, there is an option to report the failure to the management with the global config kvm.ha.fence.on.storage.heartbeat.failure. By default this option is disabled when enabled the Host HA service will continue with the checks on the host and eventually will fence the host

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

How Has This Been Tested?

Environment configuration:
1 Zone
2 Cluster with 2 hosts in each cluster
2 StorPool primary storage (zone-wide)
with and without NFS primary storage (zone-wide)

@codecov
Copy link

codecov bot commented Oct 5, 2023

Codecov Report

Merging #8045 (b99d1a1) into main (121531e) will decrease coverage by 0.01%.
The diff coverage is 4.71%.

@@             Coverage Diff              @@
##               main    #8045      +/-   ##
============================================
- Coverage     29.19%   29.19%   -0.01%     
- Complexity    31015    31024       +9     
============================================
  Files          5181     5181              
  Lines        365255   365434     +179     
  Branches      53427    53449      +22     
============================================
+ Hits         106646   106676      +30     
- Misses       243988   244141     +153     
+ Partials      14621    14617       -4     
Flag Coverage Δ
simulator-marvin-tests 25.16% <1.25%> (+<0.01%) ⬆️
uitests 4.50% <ø> (ø)
unit-tests 14.79% <4.68%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
.../subsystem/api/storage/PrimaryDataStoreDriver.java 0.00% <ø> (ø)
...ain/java/com/cloud/ha/HighAvailabilityManager.java 100.00% <100.00%> (ø)
...urce/wrapper/LibvirtCheckOnHostCommandWrapper.java 60.00% <100.00%> (-2.50%) ⬇️
...m/resource/wrapper/LibvirtFenceCommandWrapper.java 47.61% <100.00%> (+4.76%) ⬆️
.../hypervisor/kvm/storage/LibvirtStorageAdaptor.java 0.24% <ø> (ø)
server/src/main/java/com/cloud/ha/KVMFencer.java 86.11% <100.00%> (+0.39%) ⬆️
...va/org/apache/cloudstack/kvm/ha/KVMHAProvider.java 16.66% <0.00%> (ø)
...ache/cloudstack/kvm/ha/KVMHostActivityChecker.java 1.92% <0.00%> (ø)
...datastore/driver/DateraPrimaryDataStoreDriver.java 0.26% <0.00%> (-0.01%) ⬇️
...e/driver/CloudStackPrimaryDataStoreDriverImpl.java 26.98% <0.00%> (-0.22%) ⬇️
... and 25 more

... and 26 files with indirect coverage changes

📣 Codecov offers a browser extension for seamless coverage viewing on GitHub. Try it in Chrome or Firefox today!

@DaanHoogland
Copy link
Contributor

@blueorangutan package

@blueorangutan
Copy link

@DaanHoogland a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@weizhouapache
Copy link
Member

good job @slavkap

it looks #5862 needs to be updated if this is merged

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7244

@DaanHoogland
Copy link
Contributor

@blueorangutan test

@blueorangutan
Copy link

@DaanHoogland a [SF] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@DaanHoogland
Copy link
Contributor

good job @slavkap

it looks #5862 needs to be updated if this is merged

given #7889 and this potential conflict, I think the storage people need a hackathon ;)

@blueorangutan
Copy link

[SF] Trillian test result (tid-7864)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 40483 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8045-t7864-kvm-centos7.zip
Smoke tests completed. 113 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@slavkap
Copy link
Contributor Author

slavkap commented Oct 10, 2023

good job @slavkap

it looks #5862 needs to be updated if this is merged

Thanks, Wei! I think it won't need much effort to move the changes from #5862 on top of this one

@shwstppr
Copy link
Contributor

@slavkap Hi, can you please have a look into the failing Lint GH action? Should this be considered for 4.19?

@slavkap
Copy link
Contributor Author

slavkap commented Oct 10, 2023

Hi @shwstppr, I know that there is no time for 4.19, but if possible for 4.19.1, it would be great

@shwstppr
Copy link
Contributor

Thanks @slavkap. I'm tentatively adding it to 4.19.0 milestone. We can move it later if it doesn't works out.
Btw, Lint is still seems to be failing

@shwstppr
Copy link
Contributor

@blueorangutan package

@blueorangutan
Copy link

@shwstppr a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@shwstppr
Copy link
Contributor

@blueorangutan package

@blueorangutan
Copy link

@shwstppr a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7397

@shwstppr
Copy link
Contributor

@blueorangutan test

@shwstppr
Copy link
Contributor

@kiranchavala can you please test any regression with NFS and existing Host HA functionality?

@kiranchavala
Copy link
Contributor

@shwstppr @DaanHoogland to test the host ha with NFS, it is possible only on a physical machine which supports Redfish testing or IPMI tool

@shwstppr
Copy link
Contributor

@kiranchavala afaik IPMI tool works for testing it. Maybe others can comment better @DaanHoogland @weizhouapache @rohityadavcloud

@shwstppr shwstppr closed this Oct 23, 2023
@shwstppr shwstppr reopened this Oct 23, 2023
@DaanHoogland
Copy link
Contributor

@kiranchavala see test_hostha_kvm.py and test_outofbandmanagement.py. ipmitool is used in those.

@shwstppr
Copy link
Contributor

shwstppr commented Nov 2, 2023

@blueorangutan package

@blueorangutan
Copy link

@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7607

Copy link

@rajujith rajujith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Testing was performed only to verify that the NFS-based host HA (fencing) still works. Unable to verify with StorPool since the hardware is not available.

Nov 03 09:56:08 pr8045-t8226-kvm-rocky8-kvm1 heartbeat[17466]: kvmheartbeat.sh will reboot system because it was unable to write the heartbeat to the storage.

Copy link
Contributor

@shwstppr shwstppr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good

@slavkap can you please check/respond on Daan's comments

@yadvr
Copy link
Member

yadvr commented Nov 3, 2023

@blueorangutan test

@blueorangutan
Copy link

@rohityadavcloud a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

Extending the current functionality of KVM Host HA for StorPool storage
plugin and option for easy integration for the rest of the storage
plugins
@slavkap
Copy link
Contributor Author

slavkap commented Nov 3, 2023

Code looks good

@slavkap can you please check/respond on Daan's comments

Done.

Thank you @DaanHoogland, @rajujith and @shwstppr for the reviews and testing!

@shwstppr
Copy link
Contributor

shwstppr commented Nov 3, 2023

@blueorangutan package

Copy link
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clgtm

@blueorangutan
Copy link

[SF] Trillian test result (tid-8228)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 44383 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8045-t8228-kvm-centos7.zip
Smoke tests completed. 116 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@shwstppr
Copy link
Contributor

shwstppr commented Nov 4, 2023

Merging based on smoke test results and manual regression test

@shwstppr
Copy link
Contributor

shwstppr commented Nov 4, 2023

@slavkap will be great if you can add a note about the functionality in the documentation

@shwstppr shwstppr merged commit 2bb182c into apache:main Nov 4, 2023
jschoiRR pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Jan 9, 2024
Extending the current functionality of KVM Host HA for the StorPool storage plugin and the option for easy integration for the rest of the storage plugins to support Host HA

This extension works like the current NFS storage implementation. It allows it to be used simultaneously with NFS and StorPool storage or only with StorPool primary storage.

If it is used with different primary storages like NFS and StorPool, and one of the health checks fails for storage, there is an option to report the failure to the management with the global config kvm.ha.fence.on.storage.heartbeat.failure. By default this option is disabled when enabled the Host HA service will continue with the checks on the host and eventually will fence the host
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants