The router should update config immediately on startup#1603
The router should update config immediately on startup#1603openshift-bot merged 1 commit intoopenshift:masterfrom
Conversation
|
[test] |
|
continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_openshift3/1738/) |
|
60 seconds is really long? This points to a fundamental flaw in liveness probes if we have to jack it up that high. Shouldn't we be writing the config before we answer the first liveness probe? |
|
The 60 seconds came from when we weren't writing the config file out until first watch event received - it was a safe number to prevent pod being killed as liveness probe failed. Now this pr writes config before first watch event we should probably be safe to reduce or even remove this delay. |
|
Which reminds me, we probably want two health checks (one a readiness probe on the Openshift router process, one the liveness check on the router ports themselves). The Openshift-router should know when it's ready and report that up to other listeners.
|
|
I will remove this delay and we can see if we still are seeing transient errors on the e2e since it works fine for me locally. Stand by for update. I'll create and test a new PR for liveness probe |
0aab4e9 to
1c6002b
Compare
|
@jimmidyson @smarterclayton updated |
|
LGTM [merge] |
|
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_openshift3/1503/) (Image: devenv-fedora_1229) |
|
Evaluated for origin up to 1c6002b |
|
@smarterclayton - how can we get an updated released image for the router once this is merged? |
|
You can use the --latest flag on the master to force newer images for now. I was going to cut a release soon. ----- Original Message -----
|
|
I think we're still going to need an initial delay. I am seeing the killed container behavior in an e2e test I was running to fix a bug I found. I'll give it a test to make sure and open a PR |
|
And we know why we're getting killed, for sure?
|
|
Damn. I did say |
|
@smarterclayton - previously it was definitely liveness probe from the logs. Not sure if @pweil- is having the same problem now though. |
|
I need to apply head to desk repeatedly. I need to rebase my branch, I started working on it before this was merged. 😵 Will report back if I can still reproduce it after that |
|
Still seeing this after a rebase and rebuild of the haproxy image. It is failing in the logs over an over again with a probe failure. I don't think this is necessarily a probe issue though, I think it is a plugin issue. The template router plugin reads state and and has to process 6 template files before it calls the reload script so any of those items could cause a delay. Adding the @smarterclayton what do you think about this:
Use case 1: brand new router - the default config file should start up instantly /cc @pmorie |
|
Brainstorming a bit more....even if we updated the plugin to be able to write a default config over the existing config in use case 2 (above) we're still at the mercy of how long it takes to read file.default and write file.actual. My suggestion is that we find a reasonable delay (say 10 seconds) that should cover most use cases and use the code with |
|
Is there a default initial delay for liveness probes? It seems pretty short for the general case if so. Pretty much every java process ever for example is going to have to set initial delay. Is that expected? |
|
no, not that I could find. The probe returns success until Now - Created > Delay. Once it has passed that mark it probes the container for real. |
|
Well, between this PR and an InitialDelay of 10 seconds I have been unable to reproduce the probe failure after running e2e 10 times. I'm going to open a PR for that change unless there are any objections or alternate solutions. |
|
It seems weird that probes have to have a delay. The delay might depend on how loaded the host is. I thought there was a better mechanism planned to start this check, but maybe I'm wrong. ----- Original Message -----
|
|
----- Original Message -----
If we did this, that fits the definition of "alive" (it's doing something). However.... if you have a huge set of routes, do you want to report "ready" before those are processed and you are brought into rotation? Probably not.
It's likely in a restart you want to not signal ready until you can really take traffic (otherwise you'll start rejecting requests randomly). I think in this case haproxy shouldn't even come up until the config is in place (although if you have the last config, it's probably ok to load it).
|
…service-catalog/' changes from d969acde90..b69b4a6c80 b69b4a6c80 origin build: modify hard coded path 527fac4d02 origin build: add origin tooling 545ffdb chart changes for v0.1.5 release (openshift#1709) 4d9be8f Use userInfo for Originating-Identity so extras is correct. (openshift#1702) f358b99 Call destroy function on each storage interface (openshift#1705) 36b5de9 refactor binding reconciliation functions (openshift#1687) 5699360 Change binding_retrievable to bindingRetrievable 0d8bcfe thread through stopCh to DestroyFunc (openshift#1671) 1c45aef Migrate from glide to dep for dependency management (openshift#1670) 1cf0dd9 Add svcat to Makefile (openshift#1683) 45b1013 make verify validates that versioned APIs contain json tags for fields, addresses openshift#1303 (openshift#1480) 0ee8398 Build the integration test binary before running any tests (openshift#1666) 0fe0aa7 Update design.md (openshift#1674) 1280d24 controller requires permission to update secrets (openshift#1663) 129d98e Contribute svcat (openshift#1664) ff9739b Update dependencies to kubernetes-1.9.1 (openshift#1633) 9c36019 chart changes for v0.1.4 release (openshift#1669) 93319f6 move apiserver generation to script and verify (openshift#1662) 385f0da refactor service instance provision/update/poll reconciliation (openshift#1648) e015212 run each integration test individually (openshift#1661) 412e242 Tell people whether we're checking external hrefs (openshift#1659) ae05361 retry failed unbind requests (openshift#1653) 7eae845 doc for setting up Service Catalog with Prometheus metrics (openshift#1654) 0720cf9 minor README copy edit (openshift#1656) 8bd347d run some integration subtests in parallel (openshift#1637) b83800c Use $ and console to indicate multi-command blocks 789c4b2 Use dynamic reaction to fix data race (openshift#1650) f1be763 only check external hrefs on master (openshift#1652) 65c6d20 Controller-manager crash loops if API server is not available on startup (openshift#1376) (openshift#1591) 9225c92 embedded etcd is the way of the future for our tests (openshift#1651) 605c952 Fix required fields in OpenAPI schema (openshift#1602) 899ca21 Revert "Switch to wget for integration apiserver checks (openshift#1384)" (openshift#1585) 2f496ee Update code-of-conduct.md (openshift#1635) c1c69cf Build the e2e binary in CI (openshift#1647) 4e2dcef Wait for successful API discovery (openshift#1646) 5ae6d99 Bump copyright date in generated code (openshift#1645) 8be5b05 Serve OpenAPI spec only when --serve-openapi-spec switch is enabled (openshift#1612) 19fb30e silence go-restful logging output (openshift#1622) fdbabf0 Add walkthrough link back (openshift#1620) 7c73e9a Add link to main k8s docs on service-catalog (openshift#1627) f59adc9 Overhauling the design document (openshift#1619) cd7b633 Updating the install documentation (openshift#1616) f6e5441 fix compilation error from updated util.WaitForBindingCondition() (openshift#1629) (openshift#1631) 54e57af Provide OSB Client proxy to instrument with metrics (openshift#1615) 026b86f Disable test added in 1611 that contains data race (openshift#1626) cb735a6 Add integration tests for ServiceInstances (openshift#1611) 67dbabb Cleaning up the docs README (openshift#1618) 6bddc07 remove email from docker login during Travis deploy (openshift#1614) a604bc3 Use ConfigMaps for leader election (openshift#1599) c6f193a Add controller integration tests for ServiceInstance create and update (openshift#1578) 26cf23b Rename OWNERS assignees: to approvers: (openshift#1508) 1163edc expose Prometheus metrics from Controller (openshift#677) (openshift#1608) 2cd6554 Clean up docs/ folder (openshift#1609) 1d7e96d Adding Service Binding Create Integration Tests (openshift#1580) 6a4c469 Make the maximum polling backoff configurable (openshift#1607) 31bbf55 Rename the imported package to avoid name conflict (openshift#1603) 3cdd556 Add validation for broker spec to SAR admission controller (openshift#1605) a3408ce fix docker volume mount when building with docker under SELinux (openshift#1500) (openshift#1534) 307e747 Remove unneeded vendors of vendors (openshift#1596) 770fc74 Make ups-instance.yaml in walkthrough to demonstrate good practices (openshift#1592) 9112ba1 Add links to docs/README (openshift#1589) 8902648 Add additional service to ups-broker to fix e2e (openshift#1583) 0bcbc7d move instance update logic out of reconcileServiceInstanceDelete (openshift#1584) 7ef5a3e Do not send Parameters field when there are no parameters from sourced secret (openshift#1559) 4c51b25 Remove unneeded code that sets reason for provision/update call failure (openshift#1561) b122cb9 fix bind injection failure not being persisted in API server (openshift#1546) 66421d5 Clear out current operation when starting reconciliation of a delete (openshift#1564) 8cca70a Send an empty object for Parameters when deleting all parameters of a ServiceInstance (openshift#1555) 270426c Add controllerTest type as a helper for running controller integration tests. (openshift#1577) e26c2d7 Ignore .vscode folder in project root (openshift#1579) REVERT: d969acde90 Add additional service to ups-broker to fix e2e (openshift#1583) REVERT: 1bcd53b684 origin build: add origin tooling git-subtree-dir: cmd/service-catalog/go/src/github.com/kubernetes-incubator/service-catalog git-subtree-split: b69b4a6c8003f25d040e3087c7b1b16d1854a9e9
These changes address the following:
@pmorie PTAL at the plugin changes
@deads2k - this is the PR that I was mentioning in IRC that should help with e2e transient failures
@jimmidyson @sdodson - as discussed on #1575 and #1504 (comment)