
[baremetal] Update keepalived Liveness check #1604

Merged
openshift-merge-robot merged 1 commit into openshift:master from yboaron:keepalived_liveness
Apr 14, 2020

Conversation

@yboaron
Contributor

@yboaron yboaron commented Apr 1, 2020

Update the liveness probe for the keepalived container to also check that the keepalived process exists. With this fix, if the keepalived process exits for any reason, kubelet will automatically restart the container (a rough sketch follows below).

- What I did

- How to verify it

- Description for the changelog
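A minimal sketch of the idea (not the exact merged script): the probe's exec command fails when no keepalived process exists, which is what lets kubelet restart the container.

```bash
#!/bin/bash
# pgrep exits non-zero when no matching process is found, so the probe
# fails and kubelet restarts the keepalived container.
pgrep -o keepalived >/dev/null
```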

@yboaron
Contributor Author

yboaron commented Apr 1, 2020

/retitle [baremetal] Update keepalived Liveness check

@openshift-ci-robot openshift-ci-robot changed the title Update keepalived Liveness check [baremetal] Update keepalived Liveness check Apr 1, 2020
@yboaron
Contributor Author

yboaron commented Apr 1, 2020

/cc @celebdor @bcrochet @cybertron

@sinnykumari
Contributor

/retest

Member


Isn't the problem here that this should have been && instead of ||? With the || here, it will short-circuit the state check below and return success as long as keepalived.conf exists and keepalived is running. That doesn't seem like what we want.

Note that the kill command will already fail if the pgrep fails, so that covers the new case being added here.
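To make the short-circuit behavior concrete, a standalone illustration (not taken from the PR itself):

```bash
# `&&` and `||` have equal precedence in the shell and group left to right,
# so `A || B && C` parses as `(A || B) && C`, not `A || (B && C)`.
# If A succeeds, B is skipped (short-circuit), but C still runs:
true || echo "B: skipped" && echo "C: still runs"
# -> C: still runs
```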

Contributor Author


Yep, I think I missed something here.
Let me see: we should return true if either of the following holds:

  1. keepalived.conf doesn't exist (keepalived starts before the monitor container)
  2. keepalived.conf exists and `kill -s SIGUSR1 "$(pgrep -o keepalived)" && ! grep -q "State = FAULT" /tmp/keepalived.data` succeeds

So, I think I just need to change `[[ -s /etc/keepalived/keepalived.conf ]]` to
`[[ ! -s /etc/keepalived/keepalived.conf ]]`.
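In shell terms, the intended check would look something like this sketch (paths as in the existing script; note the braces, which are needed because `&&` and `||` chain left to right and the right-hand side has to be grouped explicitly):

```bash
[[ ! -s /etc/keepalived/keepalived.conf ]] || \
  { kill -s SIGUSR1 "$(pgrep -o keepalived)" && \
    ! grep -q "State = FAULT" /tmp/keepalived.data; }
```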

Member


If keepalived.conf doesn't exist, then the liveness probe should fail, right? Keepalived is not alive if it isn't configured yet. As I understand it, three things need to be true for keepalived to be "alive":

  1. keepalived.conf exists (the -s check)
  2. keepalived is running (the pgrep check)
  3. keepalived is not reporting an error (the FAULT check)

If any one of those things is not true then keepalived is not alive. That's why I'm thinking all of those checks should be &&.
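As a sketch, the all-&& version of the probe would be (using the paths from the existing script):

```bash
[[ -s /etc/keepalived/keepalived.conf ]] && \
  kill -s SIGUSR1 "$(pgrep -o keepalived)" && \
  ! grep -q "State = FAULT" /tmp/keepalived.data
```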

Contributor Author


Not exactly: the keepalived container can start before keepalived-monitor has rendered the config file, and I think we shouldn't fail liveness in that case (keepalived will wait for a 'reload' message from the monitor container).
In the other case, when the config file exists, we need to make sure that keepalived is running.
I wanted to change it (adding ! to the -s test) to:

```
[[ ! -s /etc/keepalived/keepalived.conf ]] || \
  kill -s SIGUSR1 "$(pgrep -o keepalived)" && ! grep -q "State = FAULT" /tmp/keepalived.data
```

But it seems that kubelet fails on `kill -s SIGUSR1 "$(pgrep -o keepalived)"`; I'm getting:

```
sh-4.2# kill -s SIGUSR1 "$(pgrep -o keepalived)"
sh: kill: SIGUSR1: invalid signal specification
sh-4.2# exit
```

Could you please verify whether `kill -s SIGUSR1 "$(pgrep -o keepalived)"` works for you?

Member


But is unconfigured keepalived "alive" for our purposes? The VIPs it is supposed to be providing won't exist so it's effectively non-functional. If the existence of the process is enough to call it alive then I would argue we don't even need the config file check. Also, the flipped logic will mean we report that keepalived is alive just because the config file doesn't exist, which seems wrong. That could mean neither keepalived is running nor the config file has been created, but we'll report the container alive.

I'm going to comment separately on the kill part because I think that's also an existing bug in this script. :-/

I'll leave details of what I found there.

Contributor Author


Thanks for the bash tip!

For the other case, let's see, we have two options:

  1. include "config file does not exist" in the liveness check, OR'd with the keepalived-is-working check (the SIGUSR1 thing)
  2. check only whether keepalived is working (the SIGUSR1 thing)

The only difference between the two is that with option 2, on startup, kubelet may (depending on timing) restart the keepalived container until the monitor container has rendered the config file.
I assume your preferred option is 2.
I think option 1 is more suitable because:
A. our keepalived container should be considered healthy while it's waiting for the monitor container, and
B. we avoid unnecessary restarts of the keepalived container.

But I guess option 2 plus an increased initialDelaySeconds value will be a good solution.
Thoughts?
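For reference, option 2 with a longer delay could look something like this in the pod template (illustrative values, not necessarily the merged ones; this assumes the probe script is reduced to the process check):

```yaml
livenessProbe:
  exec:
    command:
    - /bin/bash
    - -c
    - pgrep -o keepalived
  # Illustrative delay to give the monitor container time to render
  # keepalived.conf before the first probe.
  initialDelaySeconds: 20
  periodSeconds: 10
```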

@yboaron yboaron force-pushed the keepalived_liveness branch from 61bd039 to 453c83c on April 5, 2020
Member

@cybertron cybertron Apr 6, 2020


Oh, this is naughty when we're using `[[ ]]`, which is a bash-ism that's not present in a plain POSIX shell. This should be /bin/bash. It's also the reason the kill fails: if you connect to the container and use /bin/bash explicitly, it will work.

`kill -s SIGUSR1` must be a bash thing? Maybe POSIX doesn't allow the SIG-prefixed signal name? In any case, it works as expected in bash.
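A quick way to reproduce the difference (a sketch; `sh` here being bash in POSIX mode, which matches the sh-4.2 prompt in the transcript above, and bash in POSIX mode rejects signal names carrying the SIG prefix):

```bash
# Trap USR1 so the demo shell survives being signaled.
bash -c 'trap "echo got USR1" USR1; kill -s SIGUSR1 $$'   # -> got USR1

# The same command under sh rejects the SIG-prefixed name:
sh -c 'trap "echo got USR1" USR1; kill -s SIGUSR1 $$'
# -> sh: kill: SIGUSR1: invalid signal specification

# Dropping the prefix is the portable spelling:
sh -c 'trap "echo got USR1" USR1; kill -s USR1 $$'        # -> got USR1
```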


Member

@cybertron cybertron left a comment


Some thoughts inline about why the kill is failing in the existing version of the check.

@yboaron yboaron force-pushed the keepalived_liveness branch from 453c83c to d823fda on April 7, 2020
Member

@cybertron cybertron left a comment


/test e2e-metal-ipi

Okay, this seems fine to me now. I guess the liveness check isn't really a health check; it's just a verification that the process is still running, so kubelet knows whether it needs to restart the container. Just waiting on the metal CI before lgtm.

@sinnykumari
Contributor

/retest

1 similar comment
@yboaron
Contributor Author

yboaron commented Apr 12, 2020

/retest

@openshift-ci-robot
Contributor

openshift-ci-robot commented Apr 12, 2020

@yboaron: The following test failed; say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-aws-scaleup-rhel7 | d823fda | link | /test e2e-aws-scaleup-rhel7 |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@cybertron
Member

/test e2e-metal-ipi

Also testing locally. Hopefully one of these will pass.

@cybertron
Member

/lgtm

Both passed. I should go buy a lottery ticket. ;-)

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 13, 2020
@sinnykumari
Contributor

/skip

@sinnykumari
Contributor

/approve

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cybertron, sinnykumari, yboaron

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 14, 2020
@openshift-merge-robot openshift-merge-robot merged commit 1090f2a into openshift:master Apr 14, 2020
mandre added a commit to mandre/machine-config-operator that referenced this pull request Sep 11, 2020
When using CNV or other operators that modify how the node is connected
to the network, we may end up in a case where the configured VRRP
interface no longer has an address in the network it is configured
to hold virtual IPs in.

This patch takes a page from what we do for HAProxy and adds a monitor
sidecar container that checks keepalived and reloads it when necessary.

This ports openshift#1124 to the OpenStack platform, along with fixes from
openshift#1508 and openshift#1604.
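As a very rough sketch of what such a monitor loop can look like (hypothetical code; `render_keepalived_conf` is a placeholder, not a real helper from this repo):

```bash
#!/bin/bash
# Hypothetical monitor loop: re-render the config and ask keepalived to
# reload whenever the rendered config changes.
while true; do
  render_keepalived_conf /etc/keepalived/keepalived.conf.new  # placeholder
  if ! cmp -s /etc/keepalived/keepalived.conf /etc/keepalived/keepalived.conf.new; then
    mv /etc/keepalived/keepalived.conf.new /etc/keepalived/keepalived.conf
    # keepalived reloads its configuration on SIGHUP.
    kill -s HUP "$(pgrep -o keepalived)"
  fi
  sleep 10
done
```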
mandre added a commit to mandre/machine-config-operator that referenced this pull request Sep 21, 2020
When using CNV or other operators that modify how the node is connected
to the network, we may end up in a case where the configured VRRP
interface no longer has an address in the network it is configured
to hold virtual IPs in.

This patch takes a page from what we do for HAProxy and adds a monitor
sidecar container that checks keepalived and reloads it when necessary.

This ports openshift#1124 to the OpenStack platform, along with fixes from
openshift#1508 and openshift#1604.
vrutkovs pushed a commit to vrutkovs/machine-config-operator that referenced this pull request Oct 13, 2020
When using CNV or other operators that modify how the node is connected
to the network, we may end up in a case where the configured VRRP
interface no longer has an address in the network it is configured
to hold virtual IPs in.

This patch takes a page from what we do for HAProxy and adds a monitor
sidecar container that checks keepalived and reloads it when necessary.

This ports openshift#1124 to the OpenStack platform, along with fixes from
openshift#1508 and openshift#1604.