Skip to content

Add an initial delay to the router liveness probe#1658

Merged
openshift-bot merged 1 commit intoopenshift:masterfrom
pweil-:router-probe-delay
Apr 9, 2015
Merged

Add an initial delay to the router liveness probe#1658
openshift-bot merged 1 commit intoopenshift:masterfrom
pweil-:router-probe-delay

Conversation

@pweil-
Copy link

@pweil- pweil- commented Apr 8, 2015

The router is experiences issues with the liveness probe where it is probing the socket before the plugin has a chance to write out the initial configs and start the router.

There were two issues to address:

  1. The router never started until a watch event was received, fixed by The router should update config immediately on startup #1603
  2. The probe occurred before haproxy was started

The combination of writing the initial config plus the initial delay should help reduce the chance the probe fires before HAProxy is started. However we still have a case where, if the router is writing a large file, we can probe before the router is ready.

We discussed in IRC the possibility of having the kubelet backoff on probe failure based on configuration which would ultimately be the right solution. As @jimmidyson pointed out in #1603 (comment) it's very likely that a lot of containers may need to perform X amount of variable size work before they are considered ready and a static delay would not cover it.

@sdodson could you give this a test manually (the combination of #1603 plus the delay) in your environment? I was unable to reproduce the loop after at least 10 e2e runs with this combination.

Reference issues:
#1504
#1575

@sdodson @jimmidyson @smarterclayton

@smarterclayton
Copy link
Contributor

Can you spawn an upstream issue and lay out the use cases you and Jimmy can think of where the static delay is insufficient?

Jitter + backoff are probably indicated for problems like this.

On Apr 8, 2015, at 6:49 PM, Paul notifications@github.com wrote:

The router is experiences issues with the liveness probe where it is probing the socket before the plugin has a chance to write out the initial configs and start the router.

There were two issues to address:

The router never started until a watch event was received, fixed by #1603
The probe occurred before haproxy was started
The combination of writing the initial config plus the initial delay should help reduce the chance the probe fires before HAProxy is started. However we still have a case where, if the router is writing a large file, we can probe before the router is ready.

We discussed in IRC the possibility of having the kubelet backoff on probe failure based on configuration which would ultimately be the right solution. As @jimmidyson pointed out in #1603 (comment) it's very likely that a lot of containers may need to perform X amount of variable size work before they are considered ready and a static delay would not cover it.

@sdodson could you give this a test manually (the combination of #1603 plus the delay)? I was unable to reproduce the loop after at least 10 e2e runs with this combination.

Reference issues:
#1504
#1575

@sdodson @jimmidyson @smarterclayton

You can view, comment on, or merge this pull request online at:

#1658

Commit Summary

probe delay
File Changes

M pkg/cmd/experimental/router/router.go (1)
Patch Links:

https://github.com/openshift/origin/pull/1658.patch
https://github.com/openshift/origin/pull/1658.diff

Reply to this email directly or view it on GitHub.

@sdodson
Copy link
Member

sdodson commented Apr 9, 2015

@pweil- I've not seen my router killed due to liveness probes since I've applied this patch. LGTM

@pweil-
Copy link
Author

pweil- commented Apr 9, 2015

👍

@smarterclayton
Copy link
Contributor

[merge], link the upstream issue here when it exists.

@pweil-
Copy link
Author

pweil- commented Apr 9, 2015

10-4

@openshift-bot
Copy link
Contributor

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_openshift3/1516/) (Image: devenv-fedora_1242)

@openshift-bot
Copy link
Contributor

Evaluated for origin up to 42f5a0d

openshift-bot pushed a commit that referenced this pull request Apr 9, 2015
@openshift-bot openshift-bot merged commit 88f13de into openshift:master Apr 9, 2015
@pweil-
Copy link
Author

pweil- commented Apr 13, 2015

Linking upstream issue kubernetes/kubernetes#6758

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants