[baremetal] Move keepalived OCP_API check script to a separate file #1908

Closed
yboaron wants to merge 1 commit into openshift:master from yboaron:update_keep_script

Contributor

yboaron commented Jul 7, 2020

With the latest Keepalived version (2.0.0), it seems that the current API track_script [1] doesn't work, and as a result CI is broken.
This PR moves the track script to a separate file.

[1] https://github.com/openshift/machine-config-operator/blob/master/templates/master/00-master/baremetal/files/baremetal-keepalived-keepalived.yaml#L6

Contributor Author

yboaron commented Jul 7, 2020

/hold
I just want to check whether this fixes the api-vip problem.

openshift-ci-robot added the do-not-merge/hold label Jul 7, 2020
Member

Why not 755?

yboaron force-pushed the update_keep_script branch from 6b2047b to ed45710 July 7, 2020 14:30
Member

ashcrow left a comment

Seems reasonable.

openshift-ci-robot added the approved label Jul 7, 2020
Member

bcrochet commented Jul 7, 2020

/lgtm

openshift-ci-robot added the lgtm label Jul 7, 2020
Contributor

celebdor commented Jul 7, 2020

/lgtm

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ashcrow, bcrochet, celebdor, yboaron

Contributor Author

yboaron commented Jul 7, 2020

/test e2e-metal-ipi

Member

mandre commented Jul 7, 2020

I tried the same with openstack at #1909. Seems like it's working. In this PR, everything is red but the openstack job (for a change).

Update: I forgot to add the template for the /etc/keepalived/chk_ocp_script.sh file. After I added it, the e2e-openstack failed with the usual error. It's interesting to see that the job passed without the /etc/keepalived/chk_ocp_script.sh file. Maybe we don't need the check after all ;-)

@openshift-ci-robot
Contributor

@yboaron: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-gcp-upgrade | ed45710 | link | /test e2e-gcp-upgrade |
| ci/prow/e2e-aws | ed45710 | link | /test e2e-aws |
| ci/prow/e2e-ovn-step-registry | ed45710 | link | /test e2e-ovn-step-registry |
| ci/prow/e2e-aws-scaleup-rhel7 | ed45710 | link | /test e2e-aws-scaleup-rhel7 |
| ci/prow/e2e-metal-ipi | ed45710 | link | /test e2e-metal-ipi |

Member

mandre commented Jul 7, 2020

> Update: I forgot to add the template for the /etc/keepalived/chk_ocp_script.sh file. After I added it, the e2e-openstack failed with the usual error. It's interesting to see that the job passed without the /etc/keepalived/chk_ocp_script.sh file. Maybe we don't need the check after all ;-)

It was not the same error; in fact, the bootstrap finished and terraform tore down the bootstrap resources as expected. It failed later, while waiting for the cluster to come up.

kikisdeliveryservice changed the title from "Move keepalived OCP_API check script to a separate file" to "[baremetal] Move keepalived OCP_API check script to a separate file" Jul 7, 2020
@cybertron
Member

I think there is a possibility that this is a timing issue. I reproduced the deployment failure with this change locally (although it behaved slightly differently from the CI failure), and the issue might be related to messages like:

Tue Jul 7 15:15:06 2020: Track script chk_ocp is being timed out, expect idle - skipping run
Tue Jul 7 15:15:06 2020: Child (PID 1488) failed to terminate after kill

I found acassen/keepalived#1364 which discusses some problems where track_scripts quit working after that happens, which seems to be what is happening here. In my local deployment the VIP never moved to a master because all of the masters stopped checking before any of them had an active apiserver.

In an attempt to work around this issue, I bumped the interval to 10 seconds to avoid the script timing out while the system was under load. That seems to have worked, but obviously it increases our failover time when something goes wrong. It's also possible I just got lucky with the timing and it had nothing to do with it. :-/
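
The failover trade-off described above can be put into rough numbers. This is only a sketch: it assumes keepalived's `vrrp_script` options `interval` (seconds between runs) and `fall` (consecutive failures needed before the script counts as failed), and the helper name `worst_case_detection_seconds` is made up for illustration. It ignores script runtime and VRRP advertisement timing.

```python
def worst_case_detection_seconds(interval: float, fall: int = 1) -> float:
    """Rough upper bound on how long a failure can go unnoticed.

    Assumption: the failure happens right after a successful run, so
    `fall` further runs, each `interval` seconds apart, are needed
    before the priority is lowered.
    """
    if interval <= 0 or fall < 1:
        raise ValueError("interval must be > 0 and fall must be >= 1")
    return interval * fall


# Compare the 1-second interval with the 10-second workaround.
for interval in (1, 10):
    delay = worst_case_detection_seconds(interval)
    print(f"interval={interval}s -> failure noticed within about {delay}s")
```

Under these assumptions, moving from a 1-second to a 10-second interval stretches the detection window by the same factor, which is the cost mentioned in the comment.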

Contributor Author

yboaron commented Jul 8, 2020

In my local environment, the bootstrap completed successfully and the api-vip moved to one of the masters, but the deployment failed with the error in [1].
The root cause is that the ingress VIP was, for some reason, set on three different nodes (2 masters and a worker node).

[1]
level=error msg="Cluster operator console Degraded is True with RouteHealth_FailedGet: RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.ostest.test.metalkube.org/health): Get https://console-openshift-console.apps.ostest.test.metalkube.org/health: dial tcp [fd2e:6f44:5dd8:c956::4]:443: connect: connection refused"
level=info msg="Cluster operator console Progressing is True with SyncLoopRefresh_InProgress: SyncLoopRefreshProgressing: Working toward version 4.6.0-0.ci-2020-07-06-163355"
level=info msg="Cluster operator console Available is False with Deployment_InsufficientReplicas: DeploymentAvailable: 0 pods available for console deployment"
level=info msg="Cluster operator insights Disabled is True with Disabled: Health reporting is disabled"
level=info msg="Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available"
level=fatal msg="failed to initialize the cluster: Cluster operator console is reporting a failure: RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.ostest.test.metalkube.org/health): Get https://console-openshift-console.apps.ostest.test.metalkube.org/health: dial tcp [fd2e:6f44:5dd8:c956::4]:443: connect: connection refused"
+(utils.sh:1): create_cluster(): removetmp

Member

mandre commented Jul 8, 2020

> I think there is a possibility that this is a timing issue. I reproduced the deployment failure with this change locally (although it behaved slightly different from the ci failure), and the issue might be messages like:
>
> Tue Jul 7 15:15:06 2020: Track script chk_ocp is being timed out, expect idle - skipping run
> Tue Jul 7 15:15:06 2020: Child (PID 1488) failed to terminate after kill
>
> I found acassen/keepalived#1364 which discusses some problems where track_scripts quit working after that happens, which seems to be what is happening here. In my local deployment the VIP never moved to a master because all of the masters stopped checking before any of them had an active apiserver.
>
> In an attempt to work around this issue, I bumped the interval to 10 seconds to avoid the script timing out while the system was under load. That seems to have worked, but obviously it increases our failover time when something goes wrong. It's also possible I just got lucky with the timing and it had nothing to do with it. :-/

I see the same behavior on my openstack deployment. Possibly, we could wrap the scripts with timeout 0.9 to ensure they finish in the allocated second.
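
The idea of bounding each check run can be sketched as follows. This is not the actual check script from either PR; `run_check` and `TIMEOUT_EXIT_CODE` are hypothetical names, and the sketch only mimics the behaviour of GNU `timeout` (exit status 124 when the limit is hit) using Python's `subprocess` module.

```python
import subprocess
import sys
from typing import Sequence

# Same status GNU `timeout` reports when the time limit is reached.
TIMEOUT_EXIT_CODE = 124


def run_check(cmd: Sequence[str], limit: float = 0.9) -> int:
    """Run `cmd` and return its exit status, or 124 if it exceeds `limit` seconds."""
    try:
        completed = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=limit,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child and waits for it before re-raising.
        return TIMEOUT_EXIT_CODE
    return completed.returncode


# A stand-in for a check that hangs while the node is under load.
slow_check = [sys.executable, "-c", "import time; time.sleep(5)"]
print(run_check(slow_check, limit=0.5))
```

Keeping the limit just below the check interval means a hung check is reported as a failure for that run rather than being left for keepalived to kill.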

Member

mandre commented Jul 8, 2020

@yboaron the ingress VIP being on different nodes might have the same root cause; I'm not sure yet. In my case, all masters had the ingress VIP, and their logs show that they took it because they did not receive any advertisement from other nodes:

Wed Jul  8 07:06:12 2020: (mandre_INGRESS) Receive advertisement timeout
Wed Jul  8 07:06:12 2020: (mandre_INGRESS) Entering MASTER STATE
Wed Jul  8 07:06:12 2020: (mandre_INGRESS) setting VIPs.

Although one of my masters shows that it received an advert at some point (it later entered MASTER state again because of an advert timeout):

Wed Jul  8 07:06:12 2020: (mandre_INGRESS) Master received advert from 10.0.128.27 with same priority 40 but higher IP address than ours
Wed Jul  8 07:06:12 2020: (mandre_INGRESS) Entering BACKUP STATE
Wed Jul  8 07:06:12 2020: (mandre_INGRESS) removing VIPs.
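
The tie-break visible in the last log excerpt (same priority, higher IP address wins) can be sketched like this. It is a simplified illustration of the VRRP rule, not keepalived's code; `should_stay_master` is a made-up helper and the local address `10.0.128.11` is a hypothetical value, since the thread only shows the peer's address.

```python
import ipaddress


def should_stay_master(our_priority: int, our_ip: str,
                       their_priority: int, their_ip: str) -> bool:
    """Decide whether we keep MASTER state after seeing a peer's advert.

    Higher priority wins; on equal priority the higher IP address wins.
    Both addresses must be the same IP version to be comparable.
    """
    if our_priority != their_priority:
        return our_priority > their_priority
    return ipaddress.ip_address(our_ip) > ipaddress.ip_address(their_ip)


# Peer address and priority taken from the log; our address is hypothetical.
print(should_stay_master(40, "10.0.128.11", 40, "10.0.128.27"))
```

This only applies when adverts actually arrive; if none are received, each node times out and claims the VIP, as in the first excerpt.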

Member

mandre commented Jul 8, 2020

#1909 seems to work for OpenStack, although I'm seeing lots of priority changes when the system is loaded (I suppose):

Wed Jul  8 09:10:38 2020: VRRP_Script(chk_ocp) succeeded
Wed Jul  8 09:10:38 2020: (mandre_API) Changing effective priority from 40 to 90
Wed Jul  8 09:10:39 2020: pid 10654 exited due to signal 15
Wed Jul  8 09:10:39 2020: Script `chk_ocp` now returning 124
Wed Jul  8 09:10:39 2020: VRRP_Script(chk_ocp) failed (exited with status 124)
Wed Jul  8 09:10:39 2020: (mandre_API) Changing effective priority from 90 to 40
Wed Jul  8 09:10:39 2020: Script `chk_ocp` now returning 0
Wed Jul  8 09:10:39 2020: VRRP_Script(chk_ocp) succeeded
Wed Jul  8 09:10:39 2020: (mandre_API) Changing effective priority from 40 to 90
Wed Jul  8 09:10:42 2020: Script `chk_ocp` now returning 124
Wed Jul  8 09:10:42 2020: VRRP_Script(chk_ocp) failed (exited with status 124)
Wed Jul  8 09:10:42 2020: pid 10679 exited due to signal 15
Wed Jul  8 09:10:42 2020: (mandre_API) Changing effective priority from 90 to 40
Wed Jul  8 09:10:42 2020: Script `chk_ocp` now returning 0
Wed Jul  8 09:10:42 2020: VRRP_Script(chk_ocp) succeeded
Wed Jul  8 09:10:42 2020: (mandre_API) Changing effective priority from 40 to 90
Wed Jul  8 09:11:03 2020: Interface vethd58f787c added
Wed Jul  8 09:11:14 2020: Script `chk_ocp` now returning 124
Wed Jul  8 09:11:14 2020: pid 10992 exited due to signal 15
Wed Jul  8 09:11:14 2020: VRRP_Script(chk_ocp) failed (exited with status 124)
Wed Jul  8 09:11:14 2020: (mandre_API) Changing effective priority from 90 to 40

Status code 124 is the return code when the timeout command times out. Perhaps we should increase the check interval a little bit?
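
To judge how much the priority flaps before and after changing the interval, the keepalived log can be summarised with a small script. This is a sketch that assumes the log line format shown above; `count_priority_changes` is a hypothetical helper, not part of keepalived.

```python
import re

# Matches lines such as:
#   "... (mandre_API) Changing effective priority from 40 to 90"
PRIORITY_CHANGE = re.compile(
    r"\((?P<instance>\w+)\) Changing effective priority "
    r"from (?P<old>\d+) to (?P<new>\d+)"
)


def count_priority_changes(lines):
    """Return a dict mapping VRRP instance name to its number of priority changes."""
    counts = {}
    for line in lines:
        match = PRIORITY_CHANGE.search(line)
        if match:
            name = match["instance"]
            counts[name] = counts.get(name, 0) + 1
    return counts


sample = [
    "Wed Jul  8 09:10:38 2020: (mandre_API) Changing effective priority from 40 to 90",
    "Wed Jul  8 09:10:39 2020: VRRP_Script(chk_ocp) failed (exited with status 124)",
    "Wed Jul  8 09:10:39 2020: (mandre_API) Changing effective priority from 90 to 40",
]
print(count_priority_changes(sample))  # {'mandre_API': 2}
```

Comparing the counts over the same time window for two interval settings gives a rough measure of whether a longer interval reduces the flapping.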

@celebdor
Contributor

/close

@openshift-ci-robot
Contributor

@celebdor: Closed this PR.

Labels

approved, do-not-merge/hold, lgtm

9 participants