[baremetal] Move keepalived OCP_API check script to a separate file #1908
yboaron wants to merge 1 commit into openshift:master
/hold |
Force-pushed from 6b2047b to ed45710
|
/lgtm |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ashcrow, bcrochet, celebdor, yboaron.
|
/test e2e-metal-ipi |
|
Update: I forgot to add the template for the |
|
@yboaron: The following tests failed, say
It was not the same error; in fact, the bootstrap finished and Terraform tore down the bootstrap resources as expected. It failed later, while waiting for the cluster to come up. |
|
I think there is a possibility that this is a timing issue. I reproduced the deployment failure with this change locally (although it behaved slightly differently from the CI failure), and the issue might be messages like:

`Tue Jul 7 15:15:06 2020: Track script chk_ocp is being timed out, expect idle - skipping run`

I found acassen/keepalived#1364, which discusses some problems where track_scripts quit working after that happens, and that seems to be what is happening here. In my local deployment the VIP never moved to a master because all of the masters stopped checking before any of them had an active apiserver.

In an attempt to work around this issue, I bumped the interval to 10 seconds to avoid the script timing out while the system was under load. That seems to have worked, but obviously it increases our failover time when something goes wrong. It's also possible I just got lucky with the timing and the change had nothing to do with it. :-/ |
|
In my local environment, the bootstrap completes successfully and the api-vip moves to one of the masters, but the deployment fails with error [1]. [1] |
I see the same behavior on my OpenStack deployment. Possibly, we could wrap the scripts with |
|
@yboaron The ingress VIP being on different nodes might have the same root cause; I'm not sure yet. In my case, all masters had the ingress VIP, and their logs show that they took it because they didn't get any advertisement from other nodes: Although one of my masters shows that it received an advert at some point (it later entered MASTER state again because of an advert timeout): |
|
#1909 seems to work for OpenStack, although I'm seeing lots of priority changes when the system is loaded (I suppose): Status code 124 is the return code when the |
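The 124 status mentioned above can be reproduced in isolation. This is a minimal sketch, assuming the wrapper in question is the GNU coreutils `timeout` command, which exits with 124 when the wrapped command overruns its limit. The helper name `run_check` and the 1-second limit are hypothetical and not taken from #1909.

```shell
#!/bin/sh
# Hypothetical helper: run a check command under a time limit.
# Assumes GNU coreutils `timeout`, which exits with status 124 when it
# has to terminate the command for exceeding the limit.
run_check() {
    limit="$1"
    shift
    timeout "$limit" "$@"
}

# A command that finishes in time keeps its own exit status.
fast_status=0
run_check 1 true || fast_status=$?

# A command that overruns the limit is terminated and reported as 124.
slow_status=0
run_check 1 sleep 3 || slow_status=$?

echo "fast=$fast_status slow=$slow_status"
```

A check that is cut off this way reports failure to its caller even if the service behind it is healthy but slow, which would be consistent with priority changes appearing under load.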
|
/close |
|
@celebdor: Closed this PR.
With the latest Keepalived version (2.0.0), it seems that the current API track_script [1] doesn't work; as a result, the CI is broken.
This PR moves the track script to a file.
[1] https://github.com/openshift/machine-config-operator/blob/master/templates/master/00-master/baremetal/files/baremetal-keepalived-keepalived.yaml#L6
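
For illustration only, a standalone check script of the kind described might look like the sketch below. The URL, the curl flags, the 2-second limit, and the names `API_URL` and `check_api` are all assumptions made for this example; they are not copied from the PR or from the linked template.

```shell
#!/bin/sh
# Hypothetical standalone health check intended to be referenced from a
# keepalived vrrp_script block instead of an inline command.
# API_URL and check_api are illustrative names, not taken from the PR.
API_URL="${API_URL:-https://localhost:6443/readyz}"

check_api() {
    # -f: treat HTTP error responses as failures
    # -s: silent, -k: skip TLS certificate verification
    # --max-time: bound the whole request so the check cannot hang
    curl -o /dev/null -fsk --max-time 2 "$1"
}
```

A real script file would end by calling `check_api "$API_URL"` so that its exit status (0 for healthy, non-zero otherwise) is what keepalived sees. Keeping the command in a file avoids having to quote it inside the `script` setting.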