[baremetal] Move keepalived OCP_API check script to a separate file by yboaron · Pull Request #1908 · openshift/machine-config-operator

yboaron · 2020-07-07T14:08:53Z

With the latest Keepalived version (2.0.0), it seems that the current API track_script [1] doesn't work; as a result, CI is broken.
This PR moves the track script to a file.

[1] https://github.com/openshift/machine-config-operator/blob/master/templates/master/00-master/baremetal/files/baremetal-keepalived-keepalived.yaml#L6

yboaron · 2020-07-07T14:10:01Z

/hold
Just want to check if that fixes the api-vip problem

stbenjam · 2020-07-07T14:18:40Z

Why not 755?

ashcrow

Seems reasonable.

bcrochet · 2020-07-07T14:42:10Z

/lgtm

celebdor · 2020-07-07T15:10:43Z

/lgtm

openshift-ci-robot · 2020-07-07T15:11:01Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ashcrow, bcrochet, celebdor, yboaron

Needs approval from an approver in each of these files:

~~OWNERS~~ [ashcrow]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

yboaron · 2020-07-07T17:09:37Z

/test e2e-metal-ipi

mandre · 2020-07-07T18:10:43Z

~~I tried the same with openstack at #1909. Seems like it's working. In this PR, everything is red but the openstack job (for a change).~~

Update: I forgot to add the template for the /etc/keepalived/chk_ocp_script.sh file. After I added it, the e2e-openstack failed with the usual error. It's interesting to see that the job passed without the /etc/keepalived/chk_ocp_script.sh file. Maybe we don't need the check after all ;-)

openshift-ci-robot · 2020-07-07T19:08:58Z

@yboaron: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-gcp-upgrade | `ed45710` | link | `/test e2e-gcp-upgrade` |
| ci/prow/e2e-aws | `ed45710` | link | `/test e2e-aws` |
| ci/prow/e2e-ovn-step-registry | `ed45710` | link | `/test e2e-ovn-step-registry` |
| ci/prow/e2e-aws-scaleup-rhel7 | `ed45710` | link | `/test e2e-aws-scaleup-rhel7` |
| ci/prow/e2e-metal-ipi | `ed45710` | link | `/test e2e-metal-ipi` |

mandre · 2020-07-07T19:30:06Z

> Update: I forgot to add the template for the /etc/keepalived/chk_ocp_script.sh file. After I added it, the e2e-openstack failed with the usual error. It's interesting to see that the job passed without the /etc/keepalived/chk_ocp_script.sh file. Maybe we don't need the check after all ;-)

It was not the same error, in fact the bootstrap finished and terraform tore down the bootstrap resources as expected. It failed later waiting for the cluster to come up.

cybertron · 2020-07-07T22:12:25Z

I think there is a possibility that this is a timing issue. I reproduced the deployment failure with this change locally (although it behaved slightly differently from the CI failure), and the issue might be messages like:

Tue Jul 7 15:15:06 2020: Track script chk_ocp is being timed out, expect idle - skipping run
Tue Jul 7 15:15:06 2020: Child (PID 1488) failed to terminate after kill

I found acassen/keepalived#1364 which discusses some problems where track_scripts quit working after that happens, which seems to be what is happening here. In my local deployment the VIP never moved to a master because all of the masters stopped checking before any of them had an active apiserver.

In an attempt to work around this issue, I bumped the interval to 10 seconds to avoid the script timing out while the system was under load. That seems to have worked, but obviously it increases our failover time when something goes wrong. It's also possible I just got lucky with the timing and it had nothing to do with it. :-/
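
The trade-off between a longer check interval and failover time can be put into rough numbers. The following is a minimal sketch of a simplified model, not keepalived's actual scheduler: it assumes the failure happens right after a successful run, that `fall` consecutive failing runs are required, and that the timeout does not exceed the interval. The `ScriptCheck` class and the values used are hypothetical and only illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class ScriptCheck:
    """Simplified model of a periodic health check (hypothetical helper)."""
    interval: float  # seconds between the start of consecutive runs
    timeout: float   # seconds a hung run may take before it counts as failed
    fall: int = 1    # consecutive failing runs needed to mark the check down


def worst_case_detection_delay(check: ScriptCheck) -> float:
    """Upper bound, in seconds, from a failure to the check being marked down.

    The failure is assumed to occur just after a successful run, so the
    first failing run starts one interval later. `fall` failing runs are
    needed, and the last of them may hang for the full timeout.
    """
    return check.interval * check.fall + check.timeout


if __name__ == "__main__":
    # Assumed values: a 1-second timeout with a 1-second and a 10-second interval.
    for interval in (1.0, 10.0):
        check = ScriptCheck(interval=interval, timeout=1.0)
        print(interval, worst_case_detection_delay(check))
```

Under these assumptions, raising the interval from 1 to 10 seconds moves the bound from 2 to 11 seconds, which is the failover cost mentioned above.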

yboaron · 2020-07-08T07:17:17Z

In my local environment, the bootstrap completed successfully and the api-vip moved to one of the masters, but the deployment failed with the error in [1].
The root cause is that the ingress VIP is, for some reason, set on three different nodes (two masters and a worker node).

[1]
level=error msg="Cluster operator console Degraded is True with RouteHealth_FailedGet: RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.ostest.test.metalkube.org/health): Get https://console-openshift-console.apps.ostest.test.metalkube.org/health: dial tcp [fd2e:6f44:5dd8:c956::4]:443: connect: connection refused"
level=info msg="Cluster operator console Progressing is True with SyncLoopRefresh_InProgress: SyncLoopRefreshProgressing: Working toward version 4.6.0-0.ci-2020-07-06-163355"
level=info msg="Cluster operator console Available is False with Deployment_InsufficientReplicas: DeploymentAvailable: 0 pods available for console deployment"
level=info msg="Cluster operator insights Disabled is True with Disabled: Health reporting is disabled"
level=info msg="Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available"
level=fatal msg="failed to initialize the cluster: Cluster operator console is reporting a failure: RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.ostest.test.metalkube.org/health): Get https://console-openshift-console.apps.ostest.test.metalkube.org/health: dial tcp [fd2e:6f44:5dd8:c956::4]:443: connect: connection refused"
+(utils.sh:1): create_cluster(): removetmp

mandre · 2020-07-08T08:21:01Z

> I think there is a possibility that this is a timing issue. I reproduced the deployment failure with this change locally (although it behaved slightly different from the ci failure), and the issue might be messages like:
>
> Tue Jul 7 15:15:06 2020: Track script chk_ocp is being timed out, expect idle - skipping run
> Tue Jul 7 15:15:06 2020: Child (PID 1488) failed to terminate after kill
>
> I found acassen/keepalived#1364 which discusses some problems where track_scripts quit working after that happens, which seems to be what is happening here. In my local deployment the VIP never moved to a master because all of the masters stopped checking before any of them had an active apiserver.
>
> In an attempt to work around this issue, I bumped the interval to 10 seconds to avoid the script timing out while the system was under load. That seems to have worked, but obviously it increases our failover time when something goes wrong. It's also possible I just got lucky with the timing and it had nothing to do with it. :-/

I see the same behavior on my openstack deployment. Possibly, we could wrap the scripts with timeout 0.9 to ensure they finish in the allocated second.
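
To illustrate what wrapping a check in a hard time limit does, here is a small Python sketch that mimics the coreutils `timeout` behavior: the command is killed when it overruns and the wrapper reports exit status 124. This is only an illustration of the idea, not the change made in any PR; the `run_check` helper and the commands passed to it are hypothetical.

```python
import subprocess
import sys
from typing import Sequence

TIMEOUT_EXIT_CODE = 124  # status coreutils `timeout` returns when the limit is hit


def run_check(cmd: Sequence[str], limit_seconds: float) -> int:
    """Run a check command; return its exit status, or 124 if it overruns."""
    try:
        completed = subprocess.run(
            list(cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=limit_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before raising,
        # so no stray process is left behind.
        return TIMEOUT_EXIT_CODE
    return completed.returncode


if __name__ == "__main__":
    # A stand-in for a check that hangs while the system is under load.
    slow_check = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_check(slow_check, 0.9))
```

The point of the wrapper is that the check always finishes within its slot, so the caller sees a clean failure instead of a child that has to be killed later.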

mandre · 2020-07-08T08:59:31Z

@yboaron the ingress VIP being on different nodes might have the same root cause, not sure yet. In my case, all masters had the ingress VIP, and their logs show that they took it because they didn't get any advertisement from other nodes:

Wed Jul  8 07:06:12 2020: (mandre_INGRESS) Receive advertisement timeout
Wed Jul  8 07:06:12 2020: (mandre_INGRESS) Entering MASTER STATE
Wed Jul  8 07:06:12 2020: (mandre_INGRESS) setting VIPs.

However, one of my masters shows that it received an advert at some point (it later entered MASTER state again because of an advert timeout):

Wed Jul  8 07:06:12 2020: (mandre_INGRESS) Master received advert from 10.0.128.27 with same priority 40 but higher IP address than ours
Wed Jul  8 07:06:12 2020: (mandre_INGRESS) Entering BACKUP STATE
Wed Jul  8 07:06:12 2020: (mandre_INGRESS) removing VIPs.
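
The "same priority 40 but higher IP address than ours" message reflects the standard VRRP tie-break: the higher priority wins, and on equal priority the higher primary IP address wins. A minimal sketch of that comparison follows. The `Advert` type and the local address `10.0.128.5` are assumptions made for the example; only `10.0.128.27` and priority 40 come from the log above.

```python
import ipaddress
from typing import NamedTuple


class Advert(NamedTuple):
    """Priority and primary address of a VRRP router (hypothetical type)."""
    priority: int
    address: str


def stays_master(ours: Advert, theirs: Advert) -> bool:
    """Return True if our node keeps MASTER after receiving their advert.

    Higher priority wins; on a tie, the higher IP address wins. Both
    addresses must be of the same IP version to be comparable.
    """
    if ours.priority != theirs.priority:
        return ours.priority > theirs.priority
    return ipaddress.ip_address(ours.address) > ipaddress.ip_address(theirs.address)


if __name__ == "__main__":
    ours = Advert(priority=40, address="10.0.128.5")    # assumed local address
    theirs = Advert(priority=40, address="10.0.128.27")  # sender from the log
    print(stays_master(ours, theirs))
```

Addresses are compared numerically through `ipaddress` rather than as strings, because a plain string comparison would rank "10.0.128.5" above "10.0.128.27".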

mandre · 2020-07-08T09:30:37Z

#1909 seems to work for OpenStack, although I'm seeing lots of priority changes when the system is loaded (I suppose):

Wed Jul  8 09:10:38 2020: VRRP_Script(chk_ocp) succeeded
Wed Jul  8 09:10:38 2020: (mandre_API) Changing effective priority from 40 to 90
Wed Jul  8 09:10:39 2020: pid 10654 exited due to signal 15
Wed Jul  8 09:10:39 2020: Script `chk_ocp` now returning 124
Wed Jul  8 09:10:39 2020: VRRP_Script(chk_ocp) failed (exited with status 124)
Wed Jul  8 09:10:39 2020: (mandre_API) Changing effective priority from 90 to 40
Wed Jul  8 09:10:39 2020: Script `chk_ocp` now returning 0
Wed Jul  8 09:10:39 2020: VRRP_Script(chk_ocp) succeeded
Wed Jul  8 09:10:39 2020: (mandre_API) Changing effective priority from 40 to 90
Wed Jul  8 09:10:42 2020: Script `chk_ocp` now returning 124
Wed Jul  8 09:10:42 2020: VRRP_Script(chk_ocp) failed (exited with status 124)
Wed Jul  8 09:10:42 2020: pid 10679 exited due to signal 15
Wed Jul  8 09:10:42 2020: (mandre_API) Changing effective priority from 90 to 40
Wed Jul  8 09:10:42 2020: Script `chk_ocp` now returning 0
Wed Jul  8 09:10:42 2020: VRRP_Script(chk_ocp) succeeded
Wed Jul  8 09:10:42 2020: (mandre_API) Changing effective priority from 40 to 90
Wed Jul  8 09:11:03 2020: Interface vethd58f787c added
Wed Jul  8 09:11:14 2020: Script `chk_ocp` now returning 124
Wed Jul  8 09:11:14 2020: pid 10992 exited due to signal 15
Wed Jul  8 09:11:14 2020: VRRP_Script(chk_ocp) failed (exited with status 124)
Wed Jul  8 09:11:14 2020: (mandre_API) Changing effective priority from 90 to 40

Status code 124 is the return code when the timeout command times out. Perhaps we should increase the check interval a little bit?
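
To judge how much the priority is flapping before tuning the interval, the keepalived log can be summarized with a small script. The sketch below is a hypothetical helper, not part of either PR; its regular expressions are written against the message formats shown in the log excerpt above and may need adjusting for other keepalived versions.

```python
import re
from collections import Counter

PRIORITY_CHANGE = re.compile(
    r"\((?P<instance>[^)]+)\) Changing effective priority "
    r"from (?P<old>\d+) to (?P<new>\d+)"
)
SCRIPT_FAILED = re.compile(
    r"VRRP_Script\((?P<script>[^)]+)\) failed "
    r"\(exited with status (?P<status>\d+)\)"
)


def summarize(lines):
    """Count priority changes per instance and failures per (script, status)."""
    changes = Counter()
    failures = Counter()
    for line in lines:
        match = PRIORITY_CHANGE.search(line)
        if match:
            changes[match["instance"]] += 1
            continue
        match = SCRIPT_FAILED.search(line)
        if match:
            failures[(match["script"], int(match["status"]))] += 1
    return changes, failures


if __name__ == "__main__":
    sample = [
        "Wed Jul  8 09:10:38 2020: (mandre_API) Changing effective priority from 40 to 90",
        "Wed Jul  8 09:10:39 2020: VRRP_Script(chk_ocp) failed (exited with status 124)",
        "Wed Jul  8 09:10:39 2020: (mandre_API) Changing effective priority from 90 to 40",
    ]
    changes, failures = summarize(sample)
    print(dict(changes), dict(failures))
```

Counting failures by exit status separates timeouts (124) from checks that genuinely failed, which helps decide whether a longer interval or a different timeout is the right adjustment.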

celebdor · 2020-07-10T09:54:01Z

/close

openshift-ci-robot · 2020-07-10T09:54:16Z

@celebdor: Closed this PR.

openshift-ci-robot requested review from ashcrow and runcom July 7, 2020 14:09

openshift-ci-robot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) Jul 7, 2020

stbenjam reviewed Jul 7, 2020

Move keepalived OCP_API check script to a file

ed45710

yboaron force-pushed the update_keep_script branch from 6b2047b to ed45710 Compare July 7, 2020 14:30

ashcrow approved these changes Jul 7, 2020

openshift-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Jul 7, 2020

openshift-ci-robot assigned bcrochet Jul 7, 2020

openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) Jul 7, 2020

openshift-ci-robot assigned celebdor Jul 7, 2020

kikisdeliveryservice changed the title ~~Move keepalived OCP_API check script to a separate file~~ [baremetal] Move keepalived OCP_API check script to a separate file Jul 7, 2020

mandre mentioned this pull request Jul 8, 2020

Bug 1854249: [On-Prem] Workaround issues with keepalived v2.0.10 #1909

Merged

openshift-ci-robot closed this Jul 10, 2020
