[baremetal] Update keepalived Liveness check #1604
openshift-merge-robot merged 1 commit into openshift:master
Conversation
/retitle [baremetal] Update keepalived Liveness check

/retest
Isn't the problem here that this should have been `&&` instead of `||`? With the `||` here, it will short-circuit the state check below and return success if keepalived.conf exists and keepalived is running. That doesn't seem like what we want.
Note that the `kill` command will already fail if the `pgrep` fails, so that covers the new case being added here.
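The precedence trap can be seen with plain `true`/`false` stand-ins (a sketch, not the probe script itself; `check_or` mirrors the shape of the current check, `check_and` the proposed one):

```shell
# In shell, `a || b && c` parses as `(a || b) && c`, and `||` stops
# evaluating as soon as its left side succeeds. Stand-ins: `true` is a
# passing check, `false` a failing one.
check_or()  { true || false && true; }   # left side passes, `false` is never run
check_and() { false && true; }           # fails as soon as any check fails
check_or  && echo "or-version: alive"
check_and || echo "and-version: not alive"
```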
Yep, I think I missed something here.
Let me see: we should return true if one of the following holds:
- keepalived.conf doesn't exist (keepalived starts before the monitor container)
- keepalived.conf exists and `kill -s SIGUSR1 "$(pgrep -o keepalived)" && ! grep -q "State = FAULT" /tmp/keepalived.data` succeeds

So I think I just need to change `[[ -s /etc/keepalived/keepalived.conf ]]` to `[[ ! -s /etc/keepalived/keepalived.conf ]]`.
If keepalived.conf doesn't exist, then the liveness probe should fail, right? Keepalived is not alive if it isn't configured yet. As I understand it, three things need to be true for keepalived to be "alive":
- keepalived.conf exists (the `-s` check)
- keepalived is running (the `pgrep` check)
- keepalived is not reporting an error (the FAULT check)
If any one of those is not true, then keepalived is not alive. That's why I'm thinking all of those checks should be joined with `&&`.
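A sketch of that all-`&&` version (the config and data paths are made overridable here purely so the logic can be exercised outside the container; the real probe hard-codes them):

```shell
#!/bin/bash
# Illustrative variables, not in the real script: they default to the
# paths used in the discussion.
conf_file="${KEEPALIVED_CONF:-/etc/keepalived/keepalived.conf}"
data_file="${KEEPALIVED_DATA:-/tmp/keepalived.data}"

alive() {
  [[ -s "$conf_file" ]] && \
    kill -s SIGUSR1 "$(pgrep -o keepalived)" 2>/dev/null && \
    ! grep -q "State = FAULT" "$data_file"
}
```

With `&&` throughout, the probe fails as soon as any one of the three conditions is false.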
Not exactly: the keepalived container can start before keepalived-monitor has rendered the config file, and I don't think we should fail liveness in that case (it will wait for a 'reload' message from the monitor container).
In the other case, when the config file exists, we need to make sure that keepalived is running.
I wanted to change it (adding `!` to the `-s` test) to:
`[[ ! -s /etc/keepalived/keepalived.conf ]] || kill -s SIGUSR1 "$(pgrep -o keepalived)" && ! grep -q "State = FAULT" /tmp/keepalived.data`
But it seems that kubelet fails on `kill -s SIGUSR1 "$(pgrep -o keepalived)"`; I'm getting:

    sh-4.2# kill -s SIGUSR1 "$(pgrep -o keepalived)"
    sh: kill: SIGUSR1: invalid signal specification
    sh-4.2# exit

Could you please verify whether `kill -s SIGUSR1 "$(pgrep -o keepalived)"` works for you?
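For what it's worth, under bash the `-s SIGUSR1` spelling is accepted; a quick self-test (signaling the current shell, nothing keepalived-specific):

```shell
#!/bin/bash
# bash's `kill` builtin accepts `-s SIGUSR1`; the failing session above
# was a shell where that spelling isn't supported.
got=""
trap 'got=usr1' USR1
kill -s SIGUSR1 $$          # signal ourselves; the trap records delivery
[ "$got" = "usr1" ] && echo "SIGUSR1 delivered"
```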
But is unconfigured keepalived "alive" for our purposes? The VIPs it is supposed to be providing won't exist so it's effectively non-functional. If the existence of the process is enough to call it alive then I would argue we don't even need the config file check. Also, the flipped logic will mean we report that keepalived is alive just because the config file doesn't exist, which seems wrong. That could mean neither keepalived is running nor the config file has been created, but we'll report the container alive.
I'm going to comment separately on the kill part because I think that's also an existing bug in this script. :-/
I'll leave details of what I found there.
Thanks for the bash tip!
For the other case, let's see, we have two options:
1. Include "config file doesn't exist" in the liveness check, OR'd with "keepalived is working" (the SIGUSR1 thing)
2. Check only whether keepalived is working (the SIGUSR1 thing)
The only difference between the two is that with option 2, on startup, kubelet may restart the keepalived container (depending on timing) until the monitor container has rendered the config file.
I assume that your preferred option is 2.
I think that '1' is more suitable because:
A. Our keepalived container should be considered healthy while it's waiting for the monitor container.
B. We won't have 'unnecessary' restarts of the keepalived container.
But I guess that option '2' plus increasing the initialDelaySeconds value will be a good solution.
Thoughts?
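For option '2', the probe definition would look roughly like this (a sketch: the `initialDelaySeconds`/`periodSeconds` values are illustrative, not what was merged):

```yaml
livenessProbe:
  exec:
    command:
    - /bin/bash
    - -c
    - |
      kill -s SIGUSR1 "$(pgrep -o keepalived)" && \
        ! grep -q "State = FAULT" /tmp/keepalived.data
  initialDelaySeconds: 20   # illustrative: long enough for the monitor to render the config
  periodSeconds: 10
```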
61bd039 to 453c83c (force-push)
Oh, this is naughty when we're using `[[ ]]`, which is a bashism that's not present in a plain POSIX shell. This should be /bin/bash. It's also the reason the `kill` fails: if you connect to the container and use /bin/bash explicitly, it will work.
`kill -s SIGUSR1` must be a bash thing? Maybe POSIX doesn't allow that signal name. In any case, it works as expected in bash.
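A quick illustration of the bashism and the portable spelling (a sketch; `-n x` stands in for any real test):

```shell
# `[[ ]]` is a bash conditional construct; POSIX sh only guarantees `[ ]`.
bash -c '[[ -n x ]] && echo "bash: double brackets ok"'
sh   -c '[ -n x ] && echo "posix: single brackets ok"'   # portable form
```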
cybertron left a comment
Some thoughts inline about why the kill is failing in the existing version of the check.
453c83c to d823fda (force-push)
cybertron left a comment
/test e2e-metal-ipi
Okay, this seems fine to me now. I guess the liveness check isn't really a health check, it's just a verification that the process is still running so kubelet knows whether it needs to restart the container. Just waiting on the metal ci before lgtm.
/retest

1 similar comment

/retest
@yboaron: The following test failed.
/test e2e-metal-ipi Also testing locally. Hopefully one of these will pass

/lgtm Both passed. I should go buy a lottery ticket. ;-)

/skip

/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: cybertron, sinnykumari, yboaron.
When using CNV or other operators that modify how the node is connected to the network, we may end up in the case where the configured VRRP interface no longer has an address in the network that it is configured to hold virtual IPs in. This patch takes a page from what we do for HAProxy and adds a monitor sidecar container that checks keepalived and reloads it when necessary. This ports openshift#1124 to the OpenStack platform, along with fixes from openshift#1508 and openshift#1604.
Update the liveness probe for the keepalived container to also check that the keepalived process exists;
with this fix, if the keepalived process exits for any reason, kubelet will automatically restart the container.
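A sketch of the resulting shape of the check (per the discussion above; the merged script may differ in detail):

```shell
#!/bin/bash
# Fail the probe whenever the keepalived process is gone or keepalived
# reports a FAULT state; kubelet then restarts the container.
alive() {
  local pid
  pid=$(pgrep -o keepalived) || return 1           # process gone -> not alive
  kill -s SIGUSR1 "$pid" || return 1               # ask keepalived to dump state
  ! grep -q "State = FAULT" /tmp/keepalived.data   # FAULT -> not alive
}
```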