
load balancer health check for kube-apiserver #3537

Merged
openshift-merge-robot merged 2 commits into openshift:master from tkashem:kube-apiserver-health-check
May 12, 2020

Conversation

@tkashem
Contributor

@tkashem tkashem commented May 3, 2020

No description provided.

@tkashem
Contributor Author

tkashem commented May 4, 2020

@abhinavdahiya let me know if this is the right place for this doc, otherwise I will move it.

/assign @abhinavdahiya

@abhinavdahiya
Contributor

Thanks for the detailed doc @tkashem !

I think the next step will be to make this doc discoverable by linking this doc from the code that defines these healthchecks in data/data/{aws,gcp} ..
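For reference, the health checks under `data/data/aws` are Terraform. A doc link there might look like the following sketch; the resource name and the interval/threshold values here are illustrative, not copied from the installer's actual configuration:

```hcl
# docs/dev/kube-apiserver-health-check.md explains why /readyz is probed
# and how the interval and threshold values were chosen.
resource "aws_lb_target_group" "api_internal" { # illustrative name
  port     = 6443
  protocol = "TCP"
  vpc_id   = var.vpc_id

  health_check {
    protocol            = "HTTPS"
    path                = "/readyz"
    port                = 6443
    interval            = 10
    unhealthy_threshold = 2
    healthy_threshold   = 2
  }
}
```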

@abhinavdahiya abhinavdahiya requested review from abhinavdahiya and removed request for jhixson74 and mtnbikenc May 4, 2020 22:19
@abhinavdahiya abhinavdahiya reopened this May 4, 2020
Comment thread docs/dev/kube-apiserver-health-check.md Outdated
Contributor

nit: space in front of "reports"

Contributor Author

fixed

Comment thread docs/dev/kube-apiserver-health-check.md Outdated
Contributor

everywhere in the doc: load balancers

Contributor Author

fixed

Comment thread docs/dev/kube-apiserver-health-check.md Outdated
Contributor

or GOAWAY for http/2

Contributor Author

added

Comment thread docs/dev/kube-apiserver-health-check.md Outdated
Contributor

be clear: not configurable by the user, but by the devs

Contributor Author

added

Comment thread docs/dev/kube-apiserver-health-check.md Outdated
Contributor

be clear that this is an example. P2 could be right at T+0s, depending on the alignment of the probe request interval.

Contributor Author

made it clear that this is a worst case scenario to calculate at most 30s
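The worst-case arithmetic behind that "at most 30s" can be sketched as follows. The interval and threshold values are example numbers chosen to reproduce the 30s figure from the discussion, not authoritative probe settings:

```python
# Worst-case time for a load balancer to mark a backend unhealthy.
# Assumed model: probes are evenly spaced `interval` seconds apart, and the
# backend is removed after `unhealthy_threshold` consecutive failed probes.
def worst_case_detection_seconds(interval: int, unhealthy_threshold: int) -> int:
    # Worst case: the backend goes unhealthy just after a successful probe,
    # so almost one full interval passes before the first failing probe,
    # then (unhealthy_threshold - 1) more intervals until the final one.
    return interval * unhealthy_threshold

# With a 10s interval and 3 consecutive failures required, detection takes
# at most 30s.
print(worst_case_detection_seconds(10, 3))  # → 30
```

Note this "at most" only holds if one probe interval never bleeds into the next, which is the point made later in the thread.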

Comment thread docs/dev/kube-apiserver-health-check.md Outdated
Contributor

do we know from aws/gcp docs that this is really the case?

Contributor Author

this is true for aws, i copied them verbatim from aws doc.

Member

link the docs, to make this easy to check? Seems like it's the classic-LB docs, but the installer uses network load balancers (classic LBs are aws_elb).

Contributor Author

I could not find a doc that describes this for network load balancer exclusively. I think the health check mechanics should be the same for classic, application and network LB. Maybe we can ask this question to our AWS account rep.

On the other hand, what we stipulate above must hold true for all health checks universally. Otherwise, if we allow one interval to bleed into another, we don't have a deterministic "at most".

@tkashem tkashem force-pushed the kube-apiserver-health-check branch 4 times, most recently from 28371c9 to b8d4bb5 Compare May 6, 2020 18:50
Comment thread docs/dev/kube-apiserver-health-check.md Outdated
Member

@wking wking May 7, 2020

nit: ok -> 200 OK? I expect LBs to care about HTTP status codes and not about the response body. And your ok is likely shorthand for the 200 status, but I think explicitly saying "200" (and possibly even "HTTP status 200 OK") would make it harder to misunderstand.
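The distinction (status code vs. response body) can be sketched in Python; the probe function and URL below are hypothetical illustrations, not anything from the installer:

```python
import urllib.request

# Hypothetical LB-style probe: what matters is the HTTP status code
# (200 OK), not the literal "ok" body that kube-apiserver happens to send.
def probe(url: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200  # 200 OK, body ignored
    except Exception:
        # Connection refused, timeout, or a non-2xx status (raised as
        # HTTPError) all count as unhealthy.
        return False
```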

Contributor Author

Thanks for the detailed doc @tkashem !
I think the next step will be to make this doc discoverable by linking this doc from the code that defines these healthchecks in data/data/{aws,gcp} ..

@abhinavdahiya I linked the doc.

Comment thread docs/dev/kube-apiserver-health-check.md Outdated
Member

Elsewhere in the doc you have:

In future we will reduce shutdown-delay-duration to 30s.

I'd rather make this portion of the doc robust to that sort of pivot by using T+shutdown-delay-duration here.

Contributor Author

As far as the user/dev is concerned, they should treat shutdown-delay-duration as 30s for the purpose of designing health check probes. So I changed it to T+30s.

Comment thread docs/dev/kube-apiserver-health-check.md Outdated
Member

Does this 60s also have a config variable name that we can use to guard against future default changes?

Contributor

This is hardcoded in kube-apiserver.

@sttts
Contributor

sttts commented May 7, 2020

lgtm

@abhinavdahiya
Contributor

Thanks for the detailed doc @tkashem !

I think the next step will be to make this doc discoverable by linking this doc from the code that defines these healthchecks in data/data/{aws,gcp} ..

@tkashem hopefully you saw this.

cybertron added a commit to cybertron/baremetal-runtimecfg that referenced this pull request May 11, 2020
Per [0], the /readyz endpoint is how the api communicates that it
is gracefully shutting down. Once /readyz starts to report failure,
we want to stop sending traffic to that backend. If we wait for
/healthz, it may be too late because once /healthz starts failing
the api is already not accepting connections.

0: openshift/installer#3537
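The haproxy change this commit message describes would look something like the following fragment. This is a sketch only, not the actual baremetal-runtimecfg template; the backend and server names are made up:

```
backend masters
    # Probe /readyz so a backend is drained as soon as graceful shutdown
    # begins, instead of waiting for /healthz to start failing after
    # connections are already being refused.
    option httpchk GET /readyz HTTP/1.0
    server master-0 192.0.2.10:6443 check check-ssl verify none
```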
cybertron added a commit to cybertron/machine-config-operator that referenced this pull request May 11, 2020
Per [0], the /readyz endpoint is how the api communicates that it
is gracefully shutting down. Once /readyz starts to report failure,
we want to stop sending traffic to that backend. If we wait for
/healthz, it may be too late because once /healthz starts failing
the api is already not accepting connections.

I also moved the liveness probe for haproxy itself to use a /readyz
endpoint for consistency. This isn't strictly necessary, but I think
it will be less confusing if there aren't multiple health check
endpoints in the config.

0: openshift/installer#3537
@tkashem tkashem force-pushed the kube-apiserver-health-check branch from b8d4bb5 to dcd415c Compare May 11, 2020 19:36
Comment thread upi/gcp/02_lb_ext.py Outdated
Contributor

this is probably not correct comment syntax in python

Contributor Author

oops, my bad. fixed.
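For context, the GCP UPI files like `upi/gcp/02_lb_ext.py` are Deployment Manager templates, which are plain Python, so a doc link belongs in a `#` comment. A sketch of what that looks like; the resource name and property values here are illustrative, not copied from the actual template:

```python
# Deployment Manager templates are Python modules, so '#' starts a comment
# ('//' or '/* */' would be a syntax error here).
def GenerateConfig(context):
    # See docs/dev/kube-apiserver-health-check.md for why /readyz is probed.
    return {
        'resources': [{
            'name': 'api-http-health-check',  # illustrative name
            'type': 'compute.v1.httpHealthCheck',
            'properties': {
                'port': 6080,
                'requestPath': '/readyz',
            },
        }]
    }
```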

Comment thread upi/gcp/02_lb_int.py Outdated
Contributor

Contributor Author

fixed

@abhinavdahiya
Contributor

/test e2e-gcp-upi

@tkashem tkashem force-pushed the kube-apiserver-health-check branch from 49cb2af to 3bc71bb Compare May 11, 2020 20:29
@openshift-ci-robot
Contributor

openshift-ci-robot commented May 11, 2020

@tkashem: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-gcp-upi 49cb2afe0354e060d278142d969c27779b037b9f link /test e2e-gcp-upi

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@abhinavdahiya
Contributor

/approve
/lgtm

@abhinavdahiya
Contributor

Adding valid bug since this is a docs update

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 12, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 12, 2020
@abhinavdahiya abhinavdahiya added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. retest-not-required-docs-only labels May 12, 2020
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit a1a6300 into openshift:master May 12, 2020
EgorLu pushed a commit to EgorLu/machine-config-operator that referenced this pull request Aug 10, 2020
Per [0], the /readyz endpoint is how the api communicates that it
is gracefully shutting down. Once /readyz starts to report failure,
we want to stop sending traffic to that backend. If we wait for
/healthz, it may be too late because once /healthz starts failing
the api is already not accepting connections.

I also moved the liveness probe for haproxy itself to use a /readyz
endpoint for consistency. This isn't strictly necessary, but I think
it will be less confusing if there aren't multiple health check
endpoints in the config.

0: openshift/installer#3537
(cherry picked from commit 022933c)

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged. retest-not-required-docs-only

7 participants