
[baremetal] Update keepalived Liveness check #1604

Merged
openshift-merge-robot merged 1 commit into openshift:master from yboaron:keepalived_liveness
Apr 14, 2020

Conversation

@yboaron
Contributor

@yboaron yboaron commented Apr 1, 2020

Update the liveness probe for the keepalived container to also check that the keepalived process exists. With this fix, if the keepalived process exits for any reason, kubelet will automatically restart the container (a rough sketch follows below).

- What I did

- How to verify it

- Description for the changelog
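A minimal sketch of the idea (not the exact merged script): the probe's exec command fails when no keepalived process exists, which is what lets kubelet restart the container.

```bash
#!/bin/bash
# pgrep exits non-zero when no matching process is found, so the probe
# fails and kubelet restarts the keepalived container.
pgrep -o keepalived >/dev/null
```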

@yboaron
Contributor Author

yboaron commented Apr 1, 2020

/retitle [baremetal] Update keepalived Liveness check

@openshift-ci-robot openshift-ci-robot changed the title Update keepalived Liveness check [baremetal] Update keepalived Liveness check Apr 1, 2020
@yboaron
Contributor Author

yboaron commented Apr 1, 2020

/cc @celebdor @bcrochet @cybertron

@sinnykumari
Contributor

/retest

Member


Isn't the problem here that this should have been && instead of ||? With the || here, it will short-circuit the state check below and return success as long as keepalived.conf exists and keepalived is running. That doesn't seem like what we want.

Note that the kill command will already fail if the pgrep fails, so that covers the new case being added here.
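To make the short-circuit behavior concrete, a standalone illustration (not taken from the PR itself):

```bash
# `&&` and `||` have equal precedence in the shell and group left to right,
# so `A || B && C` parses as `(A || B) && C`, not `A || (B && C)`.
# If A succeeds, B is skipped (short-circuit), but C still runs:
true || echo "B: skipped" && echo "C: still runs"
# -> C: still runs
```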

Contributor Author


Yep, I think I missed something here.
Let me see: we should return true if either of the following holds:

  1. keepalived.conf doesn't exist (keepalived starts before the monitor container)
  2. keepalived.conf exists and `kill -s SIGUSR1 "$(pgrep -o keepalived)" && ! grep -q "State = FAULT" /tmp/keepalived.data` succeeds

So, I think I just need to change `[[ -s /etc/keepalived/keepalived.conf ]]` to
`[[ ! -s /etc/keepalived/keepalived.conf ]]`.
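In shell terms, the intended check would look something like this sketch (paths as in the existing script; note the braces, which are needed because `&&` and `||` chain left to right and the right-hand side has to be grouped explicitly):

```bash
[[ ! -s /etc/keepalived/keepalived.conf ]] || \
  { kill -s SIGUSR1 "$(pgrep -o keepalived)" && \
    ! grep -q "State = FAULT" /tmp/keepalived.data; }
```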

Member


If keepalived.conf doesn't exist, then the liveness probe should fail, right? Keepalived is not alive if it isn't configured yet. As I understand it, three things need to be true for keepalived to be "alive":

  1. keepalived.conf exists (the -s check)
  2. keepalived is running (the pgrep check)
  3. keepalived is not reporting an error (the FAULT check)

If any one of those things is not true then keepalived is not alive. That's why I'm thinking all of those checks should be &&.
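As a sketch, the all-&& version of the probe would be (using the paths from the existing script):

```bash
[[ -s /etc/keepalived/keepalived.conf ]] && \
  kill -s SIGUSR1 "$(pgrep -o keepalived)" && \
  ! grep -q "State = FAULT" /tmp/keepalived.data
```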

Contributor Author


Not exactly: the keepalived container can start before keepalived-monitor has rendered the config file, and I think we shouldn't fail liveness in that case (keepalived will wait for a 'reload' message from the monitor container).
In the other case, when the config file exists, we need to make sure that keepalived is running.
I wanted to change it (adding ! to the -s test) to:

```
[[ ! -s /etc/keepalived/keepalived.conf ]] || \
  kill -s SIGUSR1 "$(pgrep -o keepalived)" && ! grep -q "State = FAULT" /tmp/keepalived.data
```

But it seems that kubelet fails on `kill -s SIGUSR1 "$(pgrep -o keepalived)"`; I'm getting:

```
sh-4.2# kill -s SIGUSR1 "$(pgrep -o keepalived)"
sh: kill: SIGUSR1: invalid signal specification
sh-4.2# exit
```

Could you please verify whether `kill -s SIGUSR1 "$(pgrep -o keepalived)"` works for you?

Member


But is unconfigured keepalived "alive" for our purposes? The VIPs it is supposed to be providing won't exist so it's effectively non-functional. If the existence of the process is enough to call it alive then I would argue we don't even need the config file check. Also, the flipped logic will mean we report that keepalived is alive just because the config file doesn't exist, which seems wrong. That could mean neither keepalived is running nor the config file has been created, but we'll report the container alive.

I'm going to comment separately on the kill part because I think that's also an existing bug in this script. :-/

I'll leave details of what I found there.

Contributor Author


Thanks for the bash tip!

For the other case, let's see, we have two options:

  1. include "config file does not exist" in the liveness check, OR'd with the keepalived-is-working check (the SIGUSR1 thing)
  2. check only whether keepalived is working (the SIGUSR1 thing)

The only difference between the two is that with option 2, on startup, kubelet may (depending on timing) restart the keepalived container until the monitor container has rendered the config file.
I assume your preferred option is 2.
I think option 1 is more suitable because:
A. our keepalived container should be considered healthy while it's waiting for the monitor container, and
B. we avoid unnecessary restarts of the keepalived container.

But I guess option 2 plus an increased initialDelaySeconds value will be a good solution.
Thoughts?
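For reference, option 2 with a longer delay could look something like this in the pod template (illustrative values, not necessarily the merged ones; this assumes the probe script is reduced to the process check):

```yaml
livenessProbe:
  exec:
    command:
    - /bin/bash
    - -c
    - pgrep -o keepalived
  # Illustrative delay to give the monitor container time to render
  # keepalived.conf before the first probe.
  initialDelaySeconds: 20
  periodSeconds: 10
```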

@yboaron yboaron force-pushed the keepalived_liveness branch from 61bd039 to 453c83c on April 5, 2020
Member

@cybertron cybertron Apr 6, 2020


Oh, this is naughty when we're using `[[ ]]`, which is a bash-ism that's not present in a plain POSIX shell. This should be /bin/bash. It's also the reason the kill fails: if you connect to the container and use /bin/bash explicitly, it will work.

`kill -s SIGUSR1` must be a bash thing? Maybe POSIX doesn't allow the SIG-prefixed signal name? In any case, it works as expected in bash.
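A quick way to reproduce the difference (a sketch; `sh` here being bash in POSIX mode, which matches the sh-4.2 prompt in the transcript above, and bash in POSIX mode rejects signal names carrying the SIG prefix):

```bash
# Trap USR1 so the demo shell survives being signaled.
bash -c 'trap "echo got USR1" USR1; kill -s SIGUSR1 $$'   # -> got USR1

# The same command under sh rejects the SIG-prefixed name:
sh -c 'trap "echo got USR1" USR1; kill -s SIGUSR1 $$'
# -> sh: kill: SIGUSR1: invalid signal specification

# Dropping the prefix is the portable spelling:
sh -c 'trap "echo got USR1" USR1; kill -s USR1 $$'        # -> got USR1
```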


Member

@cybertron cybertron left a comment


Some thoughts inline about why the kill is failing in the existing version of the check.

@yboaron yboaron force-pushed the keepalived_liveness branch from 453c83c to d823fda on April 7, 2020
Member

@cybertron cybertron left a comment


/test e2e-metal-ipi

Okay, this seems fine to me now. I guess the liveness check isn't really a health check; it's just a verification that the process is still running, so kubelet knows whether it needs to restart the container. Just waiting on the metal CI before lgtm.

@sinnykumari
Contributor

/retest

1 similar comment
@yboaron
Contributor Author

yboaron commented Apr 12, 2020

/retest

@openshift-ci-robot
Contributor

openshift-ci-robot commented Apr 12, 2020

@yboaron: The following test failed; say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-aws-scaleup-rhel7 | d823fda | link | /test e2e-aws-scaleup-rhel7 |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@cybertron
Member

/test e2e-metal-ipi

Also testing locally. Hopefully one of these will pass.

@cybertron
Member

/lgtm

Both passed. I should go buy a lottery ticket. ;-)

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 13, 2020
@sinnykumari
Contributor

/skip

@sinnykumari
Contributor

/approve

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cybertron, sinnykumari, yboaron

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 14, 2020
@openshift-merge-robot openshift-merge-robot merged commit 1090f2a into openshift:master Apr 14, 2020
mandre added a commit to mandre/machine-config-operator that referenced this pull request Sep 11, 2020
When using CNV or other operators that modify how the node is connected
to the network, we may end up in a case where the configured VRRP
interface no longer has an address in the network it is configured
to hold virtual IPs in.

This patch takes a page from what we do for HAProxy and adds a monitor
sidecar container that checks keepalived and reloads it when necessary.

This ports openshift#1124 to the OpenStack platform, along with fixes from
openshift#1508 and openshift#1604.
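As a very rough sketch of what such a monitor loop can look like (hypothetical code; `render_keepalived_conf` is a placeholder, not a real helper from this repo):

```bash
#!/bin/bash
# Hypothetical monitor loop: re-render the config and ask keepalived to
# reload whenever the rendered config changes.
while true; do
  render_keepalived_conf /etc/keepalived/keepalived.conf.new  # placeholder
  if ! cmp -s /etc/keepalived/keepalived.conf /etc/keepalived/keepalived.conf.new; then
    mv /etc/keepalived/keepalived.conf.new /etc/keepalived/keepalived.conf
    # keepalived reloads its configuration on SIGHUP.
    kill -s HUP "$(pgrep -o keepalived)"
  fi
  sleep 10
done
```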
mandre added a commit to mandre/machine-config-operator that referenced this pull request Sep 21, 2020
When using CNV or other operators that modify how the node is connected
to the network, we may end up in a case where the configured VRRP
interface no longer has an address in the network it is configured
to hold virtual IPs in.

This patch takes a page from what we do for HAProxy and adds a monitor
sidecar container that checks keepalived and reloads it when necessary.

This ports openshift#1124 to the OpenStack platform, along with fixes from
openshift#1508 and openshift#1604.
vrutkovs pushed a commit to vrutkovs/machine-config-operator that referenced this pull request Oct 13, 2020
When using CNV or other operators that modify how the node is connected
to the network, we may end up in a case where the configured VRRP
interface no longer has an address in the network it is configured
to hold virtual IPs in.

This patch takes a page from what we do for HAProxy and adds a monitor
sidecar container that checks keepalived and reloads it when necessary.

This ports openshift#1124 to the OpenStack platform, along with fixes from
openshift#1508 and openshift#1604.