The router should update config immediately on startup by pweil- · Pull Request #1603 · openshift/origin

pweil- · 2015-04-06T16:24:55Z

These changes address the following:

The router does not start automatically. A commit was only happen when a watch event was processed. If the watch configuration was wrong then a config is never written and haproxy is never started. A new commit is added to the plugin to write an empty state config file after creating a new router
Delay probes for 60 seconds to allow the router time to write the config file (with the new initial commit) and be started before failing containers.

@pmorie PTAL at the plugin changes
@deads2k - this is the PR that I was mentioning in IRC that should help with e2e transient failures
@jimmidyson @sdodson - as discussed on #1575 and #1504 (comment)

pweil- · 2015-04-06T16:25:28Z

[test]

openshift-bot · 2015-04-06T16:30:41Z

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_openshift3/1738/)

smarterclayton · 2015-04-08T03:39:12Z

60 seconds is really long? This points to a fundamental flaw in liveness probes if we have to jack it up that high. Shouldn't we be writing the config before we answer the first liveness probe?

jimmidyson · 2015-04-08T03:44:07Z

The 60 seconds came from when we weren't writing the config file out until first watch event received - it was a safe number to prevent pod being killed as liveness probe failed. Now this pr writes config before first watch event we should probably be safe to reduce or even remove this delay.

smarterclayton · 2015-04-08T03:55:30Z

Which reminds me, we probably want two health checks (one a readiness probe on the Openshift router process, one the liveness check on the router ports themselves). The Openshift-router should know when it's ready and report that up to other listeners.

On Apr 7, 2015, at 11:44 PM, Jimmi Dyson notifications@github.com wrote:

The 60 seconds came from when we weren't writing the config file out until first watch event received - it was a safe number to prevent pod being killed as liveness probe failed. Now this pr writes config before first watch event we should probably be safe to reduce or even remove this delay.

—
Reply to this email directly or view it on GitHub.

pweil- · 2015-04-08T12:39:09Z

I will remove this delay and we can see if we still are seeing transient errors on the e2e since it works fine for me locally. Stand by for update.

I'll create and test a new PR for liveness probe

pweil- · 2015-04-08T12:48:42Z

@jimmidyson @smarterclayton updated

smarterclayton · 2015-04-08T13:29:37Z

LGTM [merge]

openshift-bot · 2015-04-08T13:30:41Z

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_openshift3/1503/) (Image: devenv-fedora_1229)

openshift-bot · 2015-04-08T13:30:41Z

Evaluated for origin up to 1c6002b

jimmidyson · 2015-04-08T14:13:50Z

@smarterclayton - how can we get an updated released image for the router once this is merged?

Merged by openshift-bot

smarterclayton · 2015-04-08T14:27:54Z

You can use the --latest flag on the master to force newer images for now. I was going to cut a release soon.

----- Original Message -----

@smarterclayton - how can we get an updated released image for the router
once this is merged?

Reply to this email directly or view it on GitHub:
#1603 (comment)

pweil- · 2015-04-08T18:01:44Z

I think we're still going to need an initial delay. I am seeing the killed container behavior in an e2e test I was running to fix a bug I found. I'll give it a test to make sure and open a PR

smarterclayton · 2015-04-08T18:08:47Z

And we know why we're getting killed, for sure?

On Apr 8, 2015, at 2:01 PM, Paul notifications@github.com wrote:

I think we're still going to need an initial delay. I am seeing the killed container behavior in an e2e test I was running to fix a bug I found. I'll give it a test to make sure and open a PR

—
Reply to this email directly or view it on GitHub.

jimmidyson · 2015-04-08T18:09:27Z

Damn. I did say we should probably be safe ;) That would really smack of a problem upstream though right with liveness probes being fired too eagerly?

jimmidyson · 2015-04-08T18:10:57Z

@smarterclayton - previously it was definitely liveness probe from the logs. Not sure if @pweil- is having the same problem now though.

pweil- · 2015-04-08T18:36:48Z

I need to apply head to desk repeatedly. I need to rebase my branch, I started working on it before this was merged. 😵

Will report back if I can still reproduce it after that

pweil- · 2015-04-08T20:17:03Z

Still seeing this after a rebase and rebuild of the haproxy image. It is failing in the logs over an over again with a probe failure.

I don't think this is necessarily a probe issue though, I think it is a plugin issue. The template router plugin reads state and and has to process 6 template files before it calls the reload script so any of those items could cause a delay. Adding the .Commit ensured that it had a file pre-watch events but if the probe occurs during or before it is finished processing the templates we're still broken.

@smarterclayton what do you think about this:

Provide a default empty config file in the image
In the method I changed for adding the commit make a direct call to reload
Follow the reload with a Commit call to save state, process templates, reload

Use case 1: brand new router - the default config file should start up instantly
Use case 2: restart of failed router - it should have already overwritten the default config so we're at the mercy of how long haproxy takes to start unless we want to always dump a default over top of the current config in the case of a restart and let item 3 above repopulate it

/cc @pmorie

pweil- · 2015-04-08T20:57:49Z

Brainstorming a bit more....even if we updated the plugin to be able to write a default config over the existing config in use case 2 (above) we're still at the mercy of how long it takes to read file.default and write file.actual. My suggestion is that we find a reasonable delay (say 10 seconds) that should cover most use cases and use the code with .Commit rather than complicate the plugin for future authors.

jimmidyson · 2015-04-08T21:24:27Z

Is there a default initial delay for liveness probes? It seems pretty short for the general case if so. Pretty much every java process ever for example is going to have to set initial delay. Is that expected?

pweil- · 2015-04-08T21:31:09Z

no, not that I could find. The probe returns success until Now - Created > Delay. Once it has passed that mark it probes the container for real.

pweil- · 2015-04-08T21:48:36Z

Well, between this PR and an InitialDelay of 10 seconds I have been unable to reproduce the probe failure after running e2e 10 times. I'm going to open a PR for that change unless there are any objections or alternate solutions.

smarterclayton · 2015-04-08T21:55:14Z

It seems weird that probes have to have a delay. The delay might depend on how loaded the host is. I thought there was a better mechanism planned to start this check, but maybe I'm wrong.

----- Original Message -----

no, not that I could find. The probe returns success until Now - Created >
Delay. Once it has passed that mark it probes the container for real.

Reply to this email directly or view it on GitHub:
#1603 (comment)

smarterclayton · 2015-04-08T21:59:14Z

----- Original Message -----

Still seeing this after a rebase and rebuild of the haproxy image. It is
failing in the logs over an over again with a probe failure.

I don't think this is necessarily a probe issue though, I think it is a
plugin issue. The template router plugin reads state and and has to process
6 template files before it calls the reload script so any of those items
could cause a delay. Adding the .Commit ensured that it had a file
pre-watch events but if the probe occurs during or before it is finished
processing the templates we're still broken.

@smarterclayton what do you think about this:

Provide a default empty config file in the image

If we did this, that fits the definition of "alive" (it's doing something). However.... if you have a huge set of routes, do you want to report "ready" before those are processed and you are brought into rotation? Probably not.

In the method I changed for adding the commit make a direct call to
reload

Follow the reload with a Commit call to save state, process templates,
reload

Use case 1: brand new router - the default config file should start up
instantly
Use case 2: restart of failed router - it should have already overwritten
the default config so we're at the mercy of how long haproxy takes to start
unless we want to always dump a default over top of the current config in
the case of a restart and let item 3 above repopulate it

It's likely in a restart you want to not signal ready until you can really take traffic (otherwise you'll start rejecting requests randomly). I think in this case haproxy shouldn't even come up until the config is in place (although if you have the last config, it's probably ok to load it).

/cc @pmorie

Reply to this email directly or view it on GitHub:
#1603 (comment)

…service-catalog/' changes from d969acde90..b69b4a6c80 b69b4a6c80 origin build: modify hard coded path 527fac4d02 origin build: add origin tooling 545ffdb chart changes for v0.1.5 release (openshift#1709) 4d9be8f Use userInfo for Originating-Identity so extras is correct. (openshift#1702) f358b99 Call destroy function on each storage interface (openshift#1705) 36b5de9 refactor binding reconciliation functions (openshift#1687) 5699360 Change binding_retrievable to bindingRetrievable 0d8bcfe thread through stopCh to DestroyFunc (openshift#1671) 1c45aef Migrate from glide to dep for dependency management (openshift#1670) 1cf0dd9 Add svcat to Makefile (openshift#1683) 45b1013 make verify validates that versioned APIs contain json tags for fields, addresses openshift#1303 (openshift#1480) 0ee8398 Build the integration test binary before running any tests (openshift#1666) 0fe0aa7 Update design.md (openshift#1674) 1280d24 controller requires permission to update secrets (openshift#1663) 129d98e Contribute svcat (openshift#1664) ff9739b Update dependencies to kubernetes-1.9.1 (openshift#1633) 9c36019 chart changes for v0.1.4 release (openshift#1669) 93319f6 move apiserver generation to script and verify (openshift#1662) 385f0da refactor service instance provision/update/poll reconciliation (openshift#1648) e015212 run each integration test individually (openshift#1661) 412e242 Tell people whether we're checking external hrefs (openshift#1659) ae05361 retry failed unbind requests (openshift#1653) 7eae845 doc for setting up Service Catalog with Prometheus metrics (openshift#1654) 0720cf9 minor README copy edit (openshift#1656) 8bd347d run some integration subtests in parallel (openshift#1637) b83800c Use $ and console to indicate multi-command blocks 789c4b2 Use dynamic reaction to fix data race (openshift#1650) f1be763 only check external hrefs on master (openshift#1652) 65c6d20 Controller-manager crash loops if API server is not available on startup (openshift#1376) (openshift#1591) 9225c92 embedded etcd is the way of the future for our tests (openshift#1651) 605c952 Fix required fields in OpenAPI schema (openshift#1602) 899ca21 Revert "Switch to wget for integration apiserver checks (openshift#1384)" (openshift#1585) 2f496ee Update code-of-conduct.md (openshift#1635) c1c69cf Build the e2e binary in CI (openshift#1647) 4e2dcef Wait for successful API discovery (openshift#1646) 5ae6d99 Bump copyright date in generated code (openshift#1645) 8be5b05 Serve OpenAPI spec only when --serve-openapi-spec switch is enabled (openshift#1612) 19fb30e silence go-restful logging output (openshift#1622) fdbabf0 Add walkthrough link back (openshift#1620) 7c73e9a Add link to main k8s docs on service-catalog (openshift#1627) f59adc9 Overhauling the design document (openshift#1619) cd7b633 Updating the install documentation (openshift#1616) f6e5441 fix compilation error from updated util.WaitForBindingCondition() (openshift#1629) (openshift#1631) 54e57af Provide OSB Client proxy to instrument with metrics (openshift#1615) 026b86f Disable test added in 1611 that contains data race (openshift#1626) cb735a6 Add integration tests for ServiceInstances (openshift#1611) 67dbabb Cleaning up the docs README (openshift#1618) 6bddc07 remove email from docker login during Travis deploy (openshift#1614) a604bc3 Use ConfigMaps for leader election (openshift#1599) c6f193a Add controller integration tests for ServiceInstance create and update (openshift#1578) 26cf23b Rename OWNERS assignees: to approvers: (openshift#1508) 1163edc expose Prometheus metrics from Controller (openshift#677) (openshift#1608) 2cd6554 Clean up docs/ folder (openshift#1609) 1d7e96d Adding Service Binding Create Integration Tests (openshift#1580) 6a4c469 Make the maximum polling backoff configurable (openshift#1607) 31bbf55 Rename the imported package to avoid name conflict (openshift#1603) 3cdd556 Add validation for broker spec to SAR admission controller (openshift#1605) a3408ce fix docker volume mount when building with docker under SELinux (openshift#1500) (openshift#1534) 307e747 Remove unneeded vendors of vendors (openshift#1596) 770fc74 Make ups-instance.yaml in walkthrough to demonstrate good practices (openshift#1592) 9112ba1 Add links to docs/README (openshift#1589) 8902648 Add additional service to ups-broker to fix e2e (openshift#1583) 0bcbc7d move instance update logic out of reconcileServiceInstanceDelete (openshift#1584) 7ef5a3e Do not send Parameters field when there are no parameters from sourced secret (openshift#1559) 4c51b25 Remove unneeded code that sets reason for provision/update call failure (openshift#1561) b122cb9 fix bind injection failure not being persisted in API server (openshift#1546) 66421d5 Clear out current operation when starting reconciliation of a delete (openshift#1564) 8cca70a Send an empty object for Parameters when deleting all parameters of a ServiceInstance (openshift#1555) 270426c Add controllerTest type as a helper for running controller integration tests. (openshift#1577) e26c2d7 Ignore .vscode folder in project root (openshift#1579) REVERT: d969acde90 Add additional service to ups-broker to fix e2e (openshift#1583) REVERT: 1bcd53b684 origin build: add origin tooling git-subtree-dir: cmd/service-catalog/go/src/github.com/kubernetes-incubator/service-catalog git-subtree-split: b69b4a6c8003f25d040e3087c7b1b16d1854a9e9

pweil- mentioned this pull request Apr 6, 2015

Controllers should rate limit themselves and space retries out across a longer period of time #1561

Merged

perform initial commit to write config

1c6002b

pweil- force-pushed the router-start-issues branch from 0aab4e9 to 1c6002b Compare April 8, 2015 12:44

openshift-bot pushed a commit that referenced this pull request Apr 8, 2015

Merge pull request #1603 from pweil-/router-start-issues

266ce4c

Merged by openshift-bot

openshift-bot merged commit 266ce4c into openshift:master Apr 8, 2015

smarterclayton changed the title ~~perform an initial commit and add a probe delay~~ The router should update config immediately on startup Apr 8, 2015

smarterclayton added the kind/bug Categorizes issue or PR as related to a bug. label Apr 8, 2015

pweil- mentioned this pull request Apr 8, 2015

Cannot create https-only route #1567

Closed

pweil- mentioned this pull request Apr 8, 2015

Add an initial delay to the router liveness probe #1658

Merged

smarterclayton modified the milestone: 0.5.0 (beta3) Apr 23, 2015

Conversation

pweil- commented Apr 6, 2015

Uh oh!

pweil- commented Apr 6, 2015

Uh oh!

openshift-bot commented Apr 6, 2015

Uh oh!

smarterclayton commented Apr 8, 2015

Uh oh!

jimmidyson commented Apr 8, 2015

Uh oh!

smarterclayton commented Apr 8, 2015

Uh oh!

pweil- commented Apr 8, 2015

Uh oh!

pweil- commented Apr 8, 2015

Uh oh!

smarterclayton commented Apr 8, 2015

Uh oh!

openshift-bot commented Apr 8, 2015

Uh oh!

openshift-bot commented Apr 8, 2015

Uh oh!

jimmidyson commented Apr 8, 2015

Uh oh!

smarterclayton commented Apr 8, 2015

Uh oh!

pweil- commented Apr 8, 2015

Uh oh!

smarterclayton commented Apr 8, 2015

Uh oh!

jimmidyson commented Apr 8, 2015

Uh oh!

jimmidyson commented Apr 8, 2015

Uh oh!

pweil- commented Apr 8, 2015

Uh oh!

pweil- commented Apr 8, 2015

Uh oh!

pweil- commented Apr 8, 2015

Uh oh!

jimmidyson commented Apr 8, 2015

Uh oh!

pweil- commented Apr 8, 2015

Uh oh!

pweil- commented Apr 8, 2015

Uh oh!

smarterclayton commented Apr 8, 2015

Uh oh!

smarterclayton commented Apr 8, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants