Change keepalived healthcheck to loadbalanced API endpoint #1744
cybertron wants to merge 3 commits into openshift:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cybertron
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from 8980160 to ffe175a
/hold
If this proves to solve the baremetal api flake issues, I need to apply it to the other platforms.
/test e2e-metal-ipi
Force-pushed from ffe175a to f1f5590
It seems that /readyz is intercepting our healthchecks to the API, even though monitor-uri is only defined in a different listen block. This is probably good anyway, as it clarifies that this is not a healthcheck against the Kubernetes API.
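For reference, a rough sketch of how the relevant haproxy.cfg sections could be laid out (section names, bind ports, and server entries here are assumptions for illustration, not copied from the repo): monitor-uri answers the monitor probe in its own listen block, while the load-balanced API endpoint that keepalived now checks is a separate listen block on 9443 forwarding to the API servers.

    # health-check listener used by the haproxy monitor (names/ports assumed)
    listen health_check_http_url
      bind :50936
      mode http
      monitor-uri /haproxyready

    # load-balanced API endpoint that the keepalived check now targets
    listen api
      bind :9443
      mode tcp
      balance roundrobin
      option httpchk GET /readyz HTTP/1.0
      server master-0 192.0.2.10:6443 check check-ssl verify none
      server master-1 192.0.2.11:6443 check check-ssl verify none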
We don't necessarily want to move the API VIP just because the API server on the local node goes down. HAProxy will happily continue to distribute traffic to other nodes (as long as one is still up) and moving the VIP may cause connection disruptions like the ones we are seeing in CI. This switches the keepalived healthcheck to look at port 9443, which is where HAProxy listens, instead of 6443, which is just the local API server.

In order for this to work sanely, it was also necessary to change the timings on the healthchecks in keepalived and HAProxy. Previously, HAProxy had a check interval of 3 seconds and required two failing checks to mark a backend down. This meant up to 6 seconds to notice a dead backend. Keepalived had a check interval of only 1 second and no grace period for failed tests, which means it would trigger a failover much sooner than 6 seconds. Depending on how HAProxy chooses to distribute requests, it's possible the keepalived healthcheck would get sent to the down backend and trigger a VIP failover.

Since we'd prefer to let HAProxy handle outages (as moving the VIP may break connections), let's swap those timings so that HAProxy is checking every second with a fall value of 2 (so two failed tests to mark backend down), while keepalived checks every 2 seconds, also with a fall value of 2. This should give HAProxy time to notice an outage before the second keepalived healthcheck (HAProxy will respond within 2 seconds, where keepalived will take between 2 and 4 seconds). This should still result in acceptably quick failover but hopefully avoid doing so unnecessarily.
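As a concrete sketch of those timings (the surrounding option names and server entries are assumptions, not copied from this PR): HAProxy probes each backend every second and marks it down after two failures, so it reacts within roughly 2 seconds, while keepalived probes the load-balanced endpoint every 2 seconds and only fails over after two misses, giving HAProxy time to route around a dead API server first.

    # haproxy backend check: 1s interval, down after 2 failures (~2s to react)
      server master-0 192.0.2.10:6443 check check-ssl verify none inter 1s fall 2 rise 3

    # keepalived check of the load-balanced endpoint: 2s interval, 2 misses to fail (2-4s)
    vrrp_script chk_ocp {
        script "/usr/bin/curl -o /dev/null -kLfs https://localhost:9443/readyz"
        interval 2
        fall 2
        rise 2
        weight 6
    }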
Currently it's difficult/impossible to tell which check caused a failover of the VIP. Split the check so there's a separate one for each service and we can see which failed.

The weight for each check is also changed to 6. The reason for this number is that the bootstrap's priority is 50, while the masters start at 40. Each passing vrrp_script adds its weight to the priority, and since we don't want the masters to take the VIP until both haproxy and the api are up, we need to make sure that the master priority is only >50 when both checks pass.
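To make that arithmetic explicit, a minimal sketch of how the priorities and track_script weights interact (instance and script names assumed): the bootstrap holds the VIP at priority 50, and a master only overtakes it once both checks pass.

    vrrp_instance API_VIP {
        priority 40          # masters; the bootstrap node uses 50
        track_script {
            chk_ocp          # weight 6: load-balanced API answers on 9443
            chk_haproxy      # weight 6: local haproxy monitor answers on 50936
        }
        # both checks passing: 40 + 6 + 6 = 52 > 50, so the VIP moves off the bootstrap
        # only one passing:    40 + 6     = 46 < 50, so the bootstrap keeps the VIP
    }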
Force-pushed from f1f5590 to df74269
  inline: |
    vrrp_script chk_ocp {
-     script "/usr/bin/curl -o /dev/null -kLfs https://localhost:6443/readyz && /usr/bin/curl -o /dev/null -kLfs http://localhost:50936/readyz"
+     script "/usr/bin/curl -o /dev/null -kLfs https://localhost:9443/readyz"
So we change chk_ocp to only check the API status through the load balancer.
      script "/usr/bin/curl -o /dev/null -kLfs https://localhost:9443/readyz"
      interval 1
-     weight 50
+     weight 6
Why are we reducing weight? Are we still getting heavier weight than the bootstrap?
I should have mentioned that this was explained more extensively in the commit message that made this change: df74269
In short, we don't want the VIP to migrate to the masters until both checks are passing, so we need to make sure one check isn't enough to trigger a move from the bootstrap. The bootstrap priority is 50, the masters start at 40, and each passing check adds its weight, so when both pass the masters reach 52 (40 + 6 + 6) and trigger the move.
    }

+   vrrp_script chk_haproxy {
+     script "/usr/bin/curl -o /dev/null -kLfs http://localhost:50936/haproxyready"
When does /haproxyready report failure? When all the servers in the master backend are down?
I haven't made that change yet. For the purposes of this PR I wanted to limit the changes to what was necessary to change the healthcheck endpoint. I'm planning a followup to have the haproxy healthcheck fail when no backends are available.
I think right now this check only fails when haproxy isn't up at all. Come to think of it, I wonder if we even need a separate haproxy check now that we're checking the loadbalanced endpoint. If haproxy is down that check is going to fail anyway. I don't think it hurts anything to have it, but it's probably worth looking into.
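One way the follow-up could make the monitor endpoint report failure when no backends are left is HAProxy's monitor fail directive; a sketch, assuming the API servers live in a backend/listen named masters and the monitor listens on 50936:

    listen health_check_http_url
      bind :50936
      mode http
      monitor-uri /haproxyready
      # hypothetical addition: fail the monitor URI when no API backend is up
      acl no_backends nbsrv(masters) lt 1
      monitor fail if no_backends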
To complete this change I think we should:
A. Handle the firewall rule add/delete by Keepalived instead of haproxy-monitor
B. Haproxy-monitor should be responsible only for generating/updating haproxy configuration.
I think that the reason we see [1] in keepalived logs is the late start of the HAProxy LB; it takes 3*6 seconds for haproxy-monitor to render the configuration.
Maybe we should also change the monitor polling timing.
[1] /usr/bin/curl -o /dev/null -kLfs https://localhost:6443/readyz && /usr/bin/curl -o /dev/null -kLfs http://localhost:50936/readyz exited with status 7
I discussed this with Toni a bit yesterday and we realized that we actually need to keep the monitor check for the case where all loadbalancers go down, but at least one of the api servers is still up. If we just rely on keepalived to move the VIP then that wouldn't work because there's no functional haproxy for it to move to.
  inline: |
    vrrp_script chk_ocp {
-     script "/usr/bin/curl -o /dev/null -kLfs https://localhost:6443/readyz && /usr/bin/curl -o /dev/null -kLfs http://localhost:50936/readyz"
+     script "/usr/bin/curl -o /dev/null -kLfs https://localhost:9443/readyz"
The LB port is passed as a parameter in baremetal-runtimecfg and defaults to 9443 [1]; I assume it would be better to render it instead of using a hardcoded value.
[1] https://github.com/openshift/baremetal-runtimecfg/blob/master/cmd/monitor/monitor.go#L55
Good point. Will update.
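That could look something like the snippet below in the keepalived template; {{ .LBPort }} is a placeholder name here, not necessarily the variable baremetal-runtimecfg actually exposes:

    vrrp_script chk_ocp {
        script "/usr/bin/curl -o /dev/null -kLfs https://localhost:{{ .LBPort }}/readyz"
        interval 2
        fall 2
        rise 2
        weight 6
    }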
Needs rebase
@cybertron: The following tests failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
This is now being handled by #1893
We don't necessarily want to move the API VIP just because the API
server on the local node goes down. HAProxy will happily continue to
distribute traffic to other nodes (as long as one is still up) and
moving the VIP may cause connection disruptions like the ones we are
seeing in CI.
This switches the keepalived healthcheck to look at port 9443, which
is where HAProxy listens, instead of 6443, which is just the local API
server.
- Description for the changelog
Change the keepalived healthcheck to avoid unnecessary failover of the API VIP.