Merged
2 changes: 2 additions & 0 deletions data/data/aws/vpc/master-elb.tf
@@ -57,6 +57,7 @@ resource "aws_lb_target_group" "api_internal" {
var.tags,
)

// Refer to docs/dev/kube-apiserver-health-check.md on how to correctly set up the health check probe for kube-apiserver
health_check {
healthy_threshold = 2
unhealthy_threshold = 2
@@ -84,6 +85,7 @@ resource "aws_lb_target_group" "api_external" {
var.tags,
)

// Refer to docs/dev/kube-apiserver-health-check.md on how to correctly set up the health check probe for kube-apiserver
health_check {
healthy_threshold = 2
unhealthy_threshold = 2
1 change: 1 addition & 0 deletions data/data/gcp/network/lb-private.tf
@@ -4,6 +4,7 @@ resource "google_compute_address" "cluster_ip" {
subnetwork = local.master_subnet
}

// Refer to docs/dev/kube-apiserver-health-check.md on how to correctly set up the health check probe for kube-apiserver
resource "google_compute_health_check" "api_internal" {
name = "${var.cluster_id}-api-internal"

1 change: 1 addition & 0 deletions data/data/gcp/network/lb-public.tf
@@ -4,6 +4,7 @@ resource "google_compute_address" "cluster_public_ip" {
name = "${var.cluster_id}-cluster-public-ip"
}

// Refer to docs/dev/kube-apiserver-health-check.md on how to correctly set up the health check probe for kube-apiserver
resource "google_compute_http_health_check" "api" {
count = var.public_endpoints ? 1 : 0

83 changes: 83 additions & 0 deletions docs/dev/kube-apiserver-health-check.md
@@ -0,0 +1,83 @@
## Graceful Termination
`kube-apiserver` in OpenShift is fronted by an external and an internal load balancer. This document serves as a
guideline on how to properly configure the load balancers' health check probes, so that when a `kube-apiserver`
instance restarts we can ensure:
- The load balancers detect it and take it out of service in time; no new request is forwarded to the
`kube-apiserver` instance after it has stopped listening.
- Existing connections are not cut off abruptly; they are allowed to complete gracefully.
## Load Balancer Health Check Probe
`kube-apiserver` provides graceful termination support via the `/readyz` health check endpoint. When `/readyz` reports
`HTTP 200 OK`, the apiserver is ready to serve requests.

Now let's walk through the events (in chronological order) that unfold when a `kube-apiserver` instance restarts:
* E1: `T+0s`: `kube-apiserver` receives a TERM signal.
* E2: `T+0s`: `/readyz` starts reporting failure to signal to the load balancers that a shutdown is in progress.
  * The apiserver continues to accept new requests.
  * The apiserver waits a certain amount of time (configurable via `shutdown-delay-duration`) before it stops accepting new requests.
* E3: `T+30s`: `kube-apiserver` (the HTTP server) stops listening:
  * `/healthz` turns red.
  * The default TCP health check probe on port `6443` fails.
  * Any new request forwarded to it fails, most likely with a `connection refused` error, or a `GOAWAY` frame for HTTP/2.
  * In-flight requests are not cut off; they are given up to `60s` to complete gracefully.
> **Reviewer (Member):** Does this 60s also have a config variable name that we can use to guard against future default changes?
>
> **Author (Contributor):** This is hardcoded in kube-apiserver.
* E4: `T+30s+60s`: Any requests still in flight are terminated with the error `reason: Timeout message: request did not complete within 60s`.
* E5: `T+30s+60s`: The apiserver process exits.

Please note that after `E3` takes place, all in-flight requests may complete gracefully before the `60s` timeout.
In that case no request is forcefully terminated (`E4` does not occur) and `E5` can occur well before `T+30s+60s`.
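The chronology above can be sketched as a small timeline calculation. This is an illustration only: `shutdown_delay` stands in for the `shutdown-delay-duration` flag, and the `60s` in-flight timeout is the hardcoded value discussed above.

```python
# Sketch of the kube-apiserver termination timeline described above.
# Assumption: shutdown_delay models the `shutdown-delay-duration` flag;
# the 60s in-flight request timeout is hardcoded in kube-apiserver.

HARDCODED_INFLIGHT_TIMEOUT = 60  # seconds, not configurable


def termination_timeline(shutdown_delay):
    """Return the offsets (seconds after the TERM signal) of events E1..E5."""
    return {
        "E1_term_received": 0,
        "E2_readyz_fails": 0,
        "E3_stops_listening": shutdown_delay,
        "E4_inflight_cut_off": shutdown_delay + HARDCODED_INFLIGHT_TIMEOUT,
        "E5_process_exits": shutdown_delay + HARDCODED_INFLIGHT_TIMEOUT,
    }


# With today's 70s shutdown-delay-duration in OpenShift:
print(termination_timeline(70))
```

With the future `30s` value, `E3` lands at `T+30s` and `E4`/`E5` at `T+90s`, matching the walkthrough above.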

An important note is that today in OpenShift the time difference between `E2` and `E3` is `70s`. This is the
`shutdown-delay-duration`, which is configurable by the developers only; it is not a knob we allow the end user to tweak.
```
$ kubectl -n openshift-kube-apiserver get cm config -o json | \
    jq -r '.data."config.yaml"' | \
    jq '.apiServerArguments."shutdown-delay-duration"'
[
  "70s"
]
```
In the future we will reduce `shutdown-delay-duration` to `30s`, so the rest of this document assumes `E3 - E2` is `30s`.

Given the above, we can infer the following:
* The load balancers should use the `/readyz` endpoint for the `kube-apiserver` health check probe. They must NOT use
`/healthz` or the default TCP port probe.
* The time (say `t` seconds) a load balancer takes to deem a `kube-apiserver` instance unhealthy and take it
out of service must not bleed past `E3`. That is, `E2 + t < E3` must hold so that no new request is forwarded to the
instance at `E3` or later.
* In the worst case, a load balancer should take at most `30s` (after `E2` triggers) to take the `kube-apiserver`
instance out of service.
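The worst-case detection time has a simple closed form. The sketch below is illustrative only; it assumes probes fire on a fixed interval and that every failing probe consumes its full timeout:

```python
def worst_case_detection_seconds(interval, timeout, unhealthy_threshold):
    """Worst-case time from /readyz turning red until the load balancer
    removes the instance, assuming probes fire every `interval` seconds
    and each failing probe burns its full `timeout`."""
    # Worst case: /readyz turns red just after a probe succeeds, so the
    # first failing probe starts a full `interval` later; the k-th failing
    # probe finishes at k * interval + timeout.
    return unhealthy_threshold * interval + timeout


# Settings used for AWS in this repo:
t = worst_case_detection_seconds(interval=10, timeout=10, unhealthy_threshold=2)
assert t <= 30, "detection must finish before the apiserver stops listening"
print(t)  # 30
```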

Below is the health check configuration currently used on AWS.

```
protocol: HTTPS
path: /readyz
port: 6443
unhealthy threshold: 2
timeout: 10s
interval: 10s
```

Based on the AWS documentation, the following is true of the EC2 load balancer health check probes:
* Each health check request is independent and lasts the entire interval.
* The time it takes for the instance to respond does not affect the interval for the next health check.
> **Reviewer (Contributor):** Do we know from the AWS/GCP docs that this is really the case?
>
> **Author:** This is true for AWS; I copied them verbatim from the AWS docs.
>
> **Reviewer (Member):** Link the docs, to make this easy to check? Seems like it's the classic-LB docs, but the installer uses network load balancers (classic LBs are `aws_elb`).
>
> **Author:** I could not find a doc that describes this for network load balancers exclusively. I think the health check mechanics should be the same for classic, application, and network LBs. Maybe we can ask this question to our AWS account rep. On the other hand, what we stipulate above must hold true for all health checks universally; otherwise, if we allow one interval to bleed into another, we don't have a deterministic "at most".

Now let's verify that with the above configuration in effect, a load balancer takes at most `30s` (in the worst case) to
deem a particular `kube-apiserver` instance unhealthy and take it out of service. With that in mind, we plot the
timeline of the health check probes accordingly. Three probes, `P1`, `P2` and `P3`, are involved in this worst-case
scenario:
* E1: `T+0s`: `P1` kicks off and immediately gets a `200` response from `/readyz`.
* E2: `T+0s`: `/readyz` starts reporting red, immediately after `E1`.
* E3: `T+10s`: `P2` kicks off.
* E4: `T+20s`: `P2` times out (we assume the worst case here).
* E5: `T+20s`: `P3` kicks off (each health check is independent and kicks off at every interval).
* E6: `T+30s`: `P3` times out (we assume the worst case here).
* E7: `T+30s`: The `unhealthy threshold` is satisfied and the load balancer takes the unhealthy `kube-apiserver` instance out
of service.
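The walkthrough above can be replayed as a short simulation. This is an illustrative sketch only; it assumes the worst case in which every probe after `P1` burns its full timeout:

```python
def probe_events(interval, timeout, unhealthy_threshold):
    """Enumerate the worst-case probe timeline: /readyz turns red at T+0,
    immediately after a successful probe, and every later probe times out."""
    events = [(0, "P1 passes; /readyz turns red immediately after")]
    failures = 0
    k = 1
    while failures < unhealthy_threshold:
        start = k * interval  # probes are independent and fire every interval
        events.append((start, f"P{k + 1} kicks off"))
        events.append((start + timeout, f"P{k + 1} times out"))
        failures += 1
        k += 1
    events.append((start + timeout, "unhealthy threshold met; instance removed"))
    return events


# Replay the AWS configuration from this document:
for t, what in probe_events(interval=10, timeout=10, unhealthy_threshold=2):
    print(f"T+{t}s: {what}")
```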

The worst-case scenario above verifies that, with this configuration, the AWS load balancer takes at
most `30s` to detect an unhealthy `kube-apiserver` instance and take it out of service.

If you are working with a different platform, please take any platform-specific health check probe behavior into
consideration and ensure that the worst-case time to detect an unhealthy `kube-apiserver` instance is at most `30s`,
as explained in this document.
1 change: 1 addition & 0 deletions upi/gcp/02_lb_ext.py
@@ -7,6 +7,7 @@ def GenerateConfig(context):
'region': context.properties['region']
}
}, {
# Refer to docs/dev/kube-apiserver-health-check.md on how to correctly set up the health check probe for kube-apiserver
'name': context.properties['infra_id'] + '-api-http-health-check',
'type': 'compute.v1.httpHealthCheck',
'properties': {
1 change: 1 addition & 0 deletions upi/gcp/02_lb_int.py
@@ -15,6 +15,7 @@ def GenerateConfig(context):
'subnetwork': context.properties['control_subnet']
}
}, {
# Refer to docs/dev/kube-apiserver-health-check.md on how to correctly set up the health check probe for kube-apiserver
'name': context.properties['infra_id'] + '-api-internal-health-check',
'type': 'compute.v1.healthCheck',
'properties': {
Expand Down