Merged
2 changes: 2 additions & 0 deletions data/data/aws/vpc/master-elb.tf
@@ -57,6 +57,7 @@ resource "aws_lb_target_group" "api_internal" {
var.tags,
)

// Refer to docs/dev/kube-apiserver-health-check.md on how to correctly set up the health check probe for kube-apiserver
health_check {
healthy_threshold = 2
unhealthy_threshold = 2
@@ -84,6 +85,7 @@ resource "aws_lb_target_group" "api_external" {
var.tags,
)

// Refer to docs/dev/kube-apiserver-health-check.md on how to correctly set up the health check probe for kube-apiserver
health_check {
healthy_threshold = 2
unhealthy_threshold = 2
1 change: 1 addition & 0 deletions data/data/gcp/network/lb-private.tf
@@ -4,6 +4,7 @@ resource "google_compute_address" "cluster_ip" {
subnetwork = local.master_subnet
}

// Refer to docs/dev/kube-apiserver-health-check.md on how to correctly set up the health check probe for kube-apiserver
resource "google_compute_health_check" "api_internal" {
name = "${var.cluster_id}-api-internal"

1 change: 1 addition & 0 deletions data/data/gcp/network/lb-public.tf
@@ -4,6 +4,7 @@ resource "google_compute_address" "cluster_public_ip" {
name = "${var.cluster_id}-cluster-public-ip"
}

// Refer to docs/dev/kube-apiserver-health-check.md on how to correctly set up the health check probe for kube-apiserver
resource "google_compute_http_health_check" "api" {
count = var.public_endpoints ? 1 : 0

83 changes: 83 additions & 0 deletions docs/dev/kube-apiserver-health-check.md
@@ -0,0 +1,83 @@
## Graceful Termination
`kube-apiserver` in OpenShift is fronted by an external and an internal load balancer. This document serves as a
guideline on how to properly configure the load balancers' health check probes, so that when a `kube-apiserver`
instance restarts we can ensure:
- The load balancers detect it and take it out of service in time; no new request is forwarded to the
`kube-apiserver` instance after it has stopped listening.
- Existing connections are not cut off abruptly; they are allowed to complete gracefully.
## Load Balancer Health Check Probe
`kube-apiserver` provides graceful termination support via the `/readyz` health check endpoint. When `/readyz` reports
`HTTP 200 OK`, the apiserver is ready to serve requests.

Now let's walk through the events (in chronological order) that unfold when a `kube-apiserver` instance restarts:
* E1: `T+0s`: `kube-apiserver` receives a TERM signal.
* E2: `T+0s`: `/readyz` starts reporting failure to signal to the load balancers that a shutdown is in progress.
  * The apiserver continues to accept new requests.
  * The apiserver waits a certain amount of time (configurable via `shutdown-delay-duration`) before it stops accepting new requests.
* E3: `T+30s`: `kube-apiserver` (the HTTP server) stops listening:
  * `/healthz` turns red.
  * The default TCP health check probe on port `6443` fails.
  * Any new request forwarded to it fails, most likely with a `connection refused` error, or a `GOAWAY` frame for HTTP/2.
  * In-flight requests are not cut off; they are given up to `60s` to complete gracefully.
> **Reviewer (Member):** Does this 60s also have a config variable name that we can use to guard against future default changes?
>
> **Author (Contributor):** This is hardcoded in kube-apiserver.
* E4: `T+30s+60s`: Any requests still in flight are terminated with the error `reason: Timeout message: request did not complete within 60s`.
* E5: `T+30s+60s`: The apiserver process exits.

Please note that after `E3` takes place, all in-flight requests may complete gracefully before the `60s` timeout.
In that case no request is forcefully terminated (`E4` does not occur) and `E5` can occur well before `T+30s+60s`.
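The chronology above can be sketched as a small timeline calculation. This is an illustration only: `shutdown_delay` stands in for the `shutdown-delay-duration` flag, and the `60s` in-flight timeout is the hardcoded value discussed above.

```python
# Sketch of the kube-apiserver termination timeline described above.
# Assumption: shutdown_delay models the `shutdown-delay-duration` flag;
# the 60s in-flight request timeout is hardcoded in kube-apiserver.

HARDCODED_INFLIGHT_TIMEOUT = 60  # seconds, not configurable


def termination_timeline(shutdown_delay):
    """Return the offsets (seconds after the TERM signal) of events E1..E5."""
    return {
        "E1_term_received": 0,
        "E2_readyz_fails": 0,
        "E3_stops_listening": shutdown_delay,
        "E4_inflight_cut_off": shutdown_delay + HARDCODED_INFLIGHT_TIMEOUT,
        "E5_process_exits": shutdown_delay + HARDCODED_INFLIGHT_TIMEOUT,
    }


# With today's 70s shutdown-delay-duration in OpenShift:
print(termination_timeline(70))
```

With the future `30s` value, `E3` lands at `T+30s` and `E4`/`E5` at `T+90s`, matching the walkthrough above.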

An important note is that today in OpenShift the time difference between `E2` and `E3` is `70s`. This is the
`shutdown-delay-duration`, which is configurable by the developers only; it is not a knob we allow the end user to tweak.
```
$ kubectl -n openshift-kube-apiserver get cm config -o json | \
    jq -r '.data."config.yaml"' | \
    jq '.apiServerArguments."shutdown-delay-duration"'
[
  "70s"
]
```
In the future we will reduce `shutdown-delay-duration` to `30s`, so the rest of this document assumes `E3 - E2` is `30s`.

Given the above, we can infer the following:
* The load balancers should use the `/readyz` endpoint for the `kube-apiserver` health check probe. They must NOT use
`/healthz` or the default TCP port probe.
* The time (say `t` seconds) a load balancer takes to deem a `kube-apiserver` instance unhealthy and take it
out of service must not bleed past `E3`. That is, `E2 + t < E3` must hold so that no new request is forwarded to the
instance at `E3` or later.
* In the worst case, a load balancer should take at most `30s` (after `E2` triggers) to take the `kube-apiserver`
instance out of service.
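The worst-case detection time has a simple closed form. The sketch below is illustrative only; it assumes probes fire on a fixed interval and that every failing probe consumes its full timeout:

```python
def worst_case_detection_seconds(interval, timeout, unhealthy_threshold):
    """Worst-case time from /readyz turning red until the load balancer
    removes the instance, assuming probes fire every `interval` seconds
    and each failing probe burns its full `timeout`."""
    # Worst case: /readyz turns red just after a probe succeeds, so the
    # first failing probe starts a full `interval` later; the k-th failing
    # probe finishes at k * interval + timeout.
    return unhealthy_threshold * interval + timeout


# Settings used for AWS in this repo:
t = worst_case_detection_seconds(interval=10, timeout=10, unhealthy_threshold=2)
assert t <= 30, "detection must finish before the apiserver stops listening"
print(t)  # 30
```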

Below is the health check configuration currently used on AWS.

```
protocol: HTTPS
path: /readyz
port: 6443
unhealthy threshold: 2
timeout: 10s
interval: 10s
```

Based on the AWS documentation, the following is true of the EC2 load balancer health check probes:
* Each health check request is independent and lasts the entire interval.
* The time it takes for the instance to respond does not affect the interval for the next health check.
> **Reviewer (Contributor):** Do we know from the AWS/GCP docs that this is really the case?
>
> **Author:** This is true for AWS; I copied them verbatim from the AWS docs.
>
> **Reviewer (Member):** Link the docs, to make this easy to check? Seems like it's the classic-LB docs, but the installer uses network load balancers (classic LBs are `aws_elb`).
>
> **Author:** I could not find a doc that describes this for network load balancers exclusively. I think the health check mechanics should be the same for classic, application, and network LBs. Maybe we can ask this question to our AWS account rep. On the other hand, what we stipulate above must hold true for all health checks universally; otherwise, if we allow one interval to bleed into another, we don't have a deterministic "at most".

Now let's verify that with the above configuration in effect, a load balancer takes at most `30s` (in the worst case) to
deem a particular `kube-apiserver` instance unhealthy and take it out of service. With that in mind, we plot the
timeline of the health check probes accordingly. Three probes, `P1`, `P2` and `P3`, are involved in this worst-case
scenario:
* E1: `T+0s`: `P1` kicks off and immediately gets a `200` response from `/readyz`.
* E2: `T+0s`: `/readyz` starts reporting red, immediately after `E1`.
* E3: `T+10s`: `P2` kicks off.
* E4: `T+20s`: `P2` times out (we assume the worst case here).
* E5: `T+20s`: `P3` kicks off (each health check is independent and kicks off at every interval).
* E6: `T+30s`: `P3` times out (we assume the worst case here).
* E7: `T+30s`: The `unhealthy threshold` is satisfied and the load balancer takes the unhealthy `kube-apiserver` instance out
of service.
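The walkthrough above can be replayed as a short simulation. This is an illustrative sketch only; it assumes the worst case in which every probe after `P1` burns its full timeout:

```python
def probe_events(interval, timeout, unhealthy_threshold):
    """Enumerate the worst-case probe timeline: /readyz turns red at T+0,
    immediately after a successful probe, and every later probe times out."""
    events = [(0, "P1 passes; /readyz turns red immediately after")]
    failures = 0
    k = 1
    while failures < unhealthy_threshold:
        start = k * interval  # probes are independent and fire every interval
        events.append((start, f"P{k + 1} kicks off"))
        events.append((start + timeout, f"P{k + 1} times out"))
        failures += 1
        k += 1
    events.append((start + timeout, "unhealthy threshold met; instance removed"))
    return events


# Replay the AWS configuration from this document:
for t, what in probe_events(interval=10, timeout=10, unhealthy_threshold=2):
    print(f"T+{t}s: {what}")
```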

The worst-case scenario above verifies that, with this configuration, the AWS load balancer takes at
most `30s` to detect an unhealthy `kube-apiserver` instance and take it out of service.

If you are working with a different platform, please take any platform-specific health check probe behavior into
consideration and ensure that the worst-case time to detect an unhealthy `kube-apiserver` instance is at most `30s`,
as explained in this document.
1 change: 1 addition & 0 deletions upi/gcp/02_lb_ext.py
@@ -7,6 +7,7 @@ def GenerateConfig(context):
'region': context.properties['region']
}
}, {
# Refer to docs/dev/kube-apiserver-health-check.md on how to correctly set up the health check probe for kube-apiserver
'name': context.properties['infra_id'] + '-api-http-health-check',
'type': 'compute.v1.httpHealthCheck',
'properties': {
1 change: 1 addition & 0 deletions upi/gcp/02_lb_int.py
@@ -15,6 +15,7 @@ def GenerateConfig(context):
'subnetwork': context.properties['control_subnet']
}
}, {
# Refer to docs/dev/kube-apiserver-health-check.md on how to correctly set up the health check probe for kube-apiserver
'name': context.properties['infra_id'] + '-api-internal-health-check',
'type': 'compute.v1.healthCheck',
'properties': {
Expand Down