
[DO NOT MERGE] Router HA and failover support#1564

Closed
ramr wants to merge 10 commits into openshift:master from ramr:experimental

Conversation

@ramr ramr (Contributor) commented Apr 2, 2015

HA and failover support for the router. Adding virtual-ips automatically generates a privileged container running keepalived. There's also a unicast option for environments without multicast support.

Example usage:
openshift ex router --credentials="${KUBECONFIG}" --virtual-ips="10.0.2.100-102,10.0.2.42" --latest-images=true --create
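
For context, the keepalived sidecar would end up managing VRRP configuration roughly of this shape. This is an illustrative sketch only, not the actual generated file - the interface, virtual_router_id, priority, and config path are all assumptions:

# Illustrative keepalived config for --virtual-ips="10.0.2.100-102,10.0.2.42".
# All values (interface, virtual_router_id, priority) are placeholders.
cat <<'EOF' > /etc/keepalived/keepalived.conf
vrrp_instance openshift_router_vip {
    state BACKUP
    interface eth0
    virtual_router_id 42
    priority 100
    advert_int 1
    virtual_ipaddress {
        10.0.2.100
        10.0.2.101
        10.0.2.102
        10.0.2.42
    }
    # For the unicast option (environments without multicast support),
    # keepalived's unicast_peer block lists the peer node IPs explicitly:
    # unicast_src_ip <this-node-ip>
    # unicast_peer {
    #     <peer-node-ip-1>
    #     <peer-node-ip-2>
    # }
}
EOF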

@pweil- @smarterclayton PTAL and comments would be appreciated. Thx.

Still has 1 pending issue - will log a separate comment on that.

ramr added 10 commits April 1, 2015 21:45
This example uses hello-openshift in lieu of the router, but that should
be easy enough to change soon.
   *  Change demo script to actually use the router and the
      hello-openshift app
   *  Update demo/readme/instructions to run multiple routers and a
      scaled service (backend/app: hello-openshift with 2 replicas).
…licy

(need to still delve into why the kubelet is always trying to pull the
image).
versions have issues and don't properly generate the podSpec with
host networking enabled.
@ramr ramr changed the title from "Router HA and failover support" to "[DO NOT MERGE] Router HA and failover support" on Apr 2, 2015
@ramr (Contributor Author)

I still have an issue here with this configuration. For some reason, the containers that get started are not started using the HostNetwork.

However, my test JSON config works fine - the docker containers use the host network.
See: https://gist.github.com/ramr/b6389a7bfbade8df8cdc for details.
This works if created with osc create -f foo.json

So something in this API interaction is causing issues. I spent a good day plus trying to figure this out - I originally thought it was an issue with the v1beta3 to v1beta1 API version conversions, but the above-mentioned JSON config works fine. So it's something in here that I'm not doing correctly. Any help/pointers would be greatly appreciated. thx
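
For reference, a host-network pod of roughly this shape does work when fed straight to osc create. This is a minimal sketch only - the name, image, and exact fields are illustrative, not the contents of the gist above:

# Minimal v1beta3-style pod with host networking enabled (illustrative;
# the pod name and image are placeholders, not the actual gist contents).
cat <<'EOF' > hostnet-test.json
{
  "kind": "Pod",
  "apiVersion": "v1beta3",
  "metadata": { "name": "hostnet-test" },
  "spec": {
    "hostNetwork": true,
    "containers": [
      { "name": "hello-openshift", "image": "openshift/hello-openshift" }
    ]
  }
}
EOF
osc create -f hostnet-test.json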

Contributor

Does the kubelet see the host network setting? You can query the kubelet directly to see if it's getting the right response.
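
One quick way to check, assuming the kubelet is listening on its default secure port 10250 and serves the /pods debug endpoint (both assumptions - adjust host/port for your configuration):

# Dump the pod specs the kubelet has received and look for the host-network flag.
curl -sk https://<node-host>:10250/pods | python -m json.tool | grep -i hostnetwork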

@ramr (Contributor Author)

No, it doesn't - I think it is probably something in the interaction via the DeployConfig (via the deployer pod).

@smarterclayton (Contributor)

To be honest, I'm a bit concerned with all of this complexity being directly coupled with a router (or appearing to be coupled). First, I'm not sure a sufficient number of organizations would choose this solution, or want to tweak it, for it to be as coupled as it is. Second, the abstraction should really be "add HA to a general deployment config", not just to routers. Third, this is pretty dependent on upfront configuration and doesn't appear to be flexible to the cluster changing, which is a serious issue.

The solution we want to get to is general virtualized IPs across the cluster (whether provided by the external infrastructure or something we emulate with VIPs internally). I don't want to bind it closely to routers and then have to rip it out again.

Before we dive too much into the details of this pull, I'd like to see the following:

  1. Move the keepalived code into a new command that adds a keepalived sidecar to an existing deployment config (similar to what you have, but something like openshift admin enable-keepalive <deployment config name> with the keepalive options there) - see the sketch after this list.
  2. A design proposal on how dynamic cluster membership is going to be handled (i.e., if I'm reading this right, unicast depends on reading the IPs of the cluster up front - what happens when new nodes come in?)
  3. A description of what an admin needs to do to configure an HA IP outside the cluster, and why they would or wouldn't choose that (I'm an admin, I know how to set up keepalived - why is this solution better?)
  4. A description of how keepalived could be moved to run on a subset of hosts (as containers) that would then provide HA IPs for pods that they host (and what topologies are reasonable there). How would keepalived react to dynamic membership changes?
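
To illustrate point 1, the decoupled command could look something like this on the command line. This is purely hypothetical - the command does not exist; the flags are borrowed from the example at the top of this PR, and "router" stands in for the deployment config name:

# Hypothetical CLI sketch for point 1 above -- not an existing command.
openshift admin enable-keepalive router \
    --virtual-ips="10.0.2.100-102,10.0.2.42" \
    --credentials="${KUBECONFIG}"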

I'm looking to dramatically reduce the operational costs associated with managing VIPs - while this is a good starting point, it doesn't really reduce costs enough for it to be something we want to bless ootb.

@ramr ramr (Contributor Author) commented Apr 2, 2015

Hmm, ok -- the trello card I had read said router failover, not really VIP management across the cluster. That would seem to be a bit of scope creep here!! :)
I do agree it's complex and there is benefit in making it generic ... imho, the wrappers around this are more complex than they need to be!! ;^) Also, to your other point, this is actually flexible to the cluster changing and not fixed to the cluster - the workflow is similar to a node going down and a new one taking over the VIP, or, in the case of a small cluster, implicitly having a few nodes short/down - so in a way, it's no different than a failure case.

As re: the points you raised:

  1. Will address that in a separate PR ... if that's the way you want this to proceed.
  2. Unicast is a bit of an issue in any case - as the number of nodes scales, unicast is not going to be a good or scalable solution - imagine sending the same advertisement packets to n hosts; that would increase the number of cross-host packets to n * (n-1).
    Now, as re: new nodes coming into a cluster, this really only has an effect if a router instance is migrated to that node - but then so would a keepalived instance - it's part of the same pod, and in that case the cluster membership is automatically managed (all peers + the node itself form that group). Otherwise, a new node joining the cluster but not running a router really has no effect. There's probably some more work still to do there - not complete by any means - but in all fairness, I threw in unicast support because I have seen environments a la Amazon EC2 where multicast is not supported. I didn't think that would really be a big driver in an enterprise case, as it does have limitations of scale.
  3. Not sure I understood this point - "A description of what an admin needs to do to configure HA IP outside the cluster, and why they would or wouldn't choose that" - if you are looking for an HA solution from outside the cluster, you still have to manage how traffic gets routed to one or more router instances and handle cases where the router pod migrates between nodes.
  4. In general, I can add more documentation around this and the other bits once this gets fleshed out a wee bit more. This point is however really easy to address, in that it would just be a selector on the ReplicationControllerSpec, which would define what set of nodes this gets run on (router + the failover component) - see the sketch below. Again, the dynamic membership bits are as mentioned at the top of this comment.
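
As a rough sketch of point 4, the node constraint would live as a nodeSelector on the replication controller's pod template. Field names are v1beta3-style and the labels/images are placeholders, not taken from this PR:

# Illustrative only: pin the router + failover pods to nodes labelled region=infra.
cat <<'EOF' > router-ha-rc.json
{
  "kind": "ReplicationController",
  "apiVersion": "v1beta3",
  "metadata": { "name": "router-ha" },
  "spec": {
    "replicas": 2,
    "selector": { "router": "ha" },
    "template": {
      "metadata": { "labels": { "router": "ha" } },
      "spec": {
        "nodeSelector": { "region": "infra" },
        "hostNetwork": true,
        "containers": [
          { "name": "router", "image": "openshift/origin-haproxy-router" },
          { "name": "keepalived", "image": "<keepalived-image>" }
        ]
      }
    }
  }
}
EOF
osc create -f router-ha-rc.json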

And finally, as re: the last part, reducing cost - I don't know about that. There's a fair deal of complexity, understanding, and layout - physical/network, migrations, etc. - which needs to be done by hand, so there is definitely automation and reduction of cost here. Now, that said, we have a lot of assumptions across the board on how the nodes are connected - the network topologies of how the whole shebang is deployed, whether it's across availability zones/networks/WANs, etc. - and I don't think that's something specific to this PR; it's something that needs to be addressed outside the scope of this crazy lil' world where all the keepaliveds know each other!! ;^)

@smarterclayton (Contributor)

On Apr 2, 2015, at 2:55 AM, ramr notifications@github.com wrote:

> Hmm, ok -- the card I had read said router failover, not really VIP management across the cluster. That would seem to be a bit of scope creep here!! :)

My concern is that all solutions like this become beautiful snowflakes unless we have a path to them either becoming fundamental or being removed. I think this is a fundamental cluster problem (I want an external IP allocated for my pod, and I want it HA) which has many implications wrt IaaS (GCE and AWS already do this for us - we need that to be automated eventually).

You don't have to do the actual work to solve the generic problem, but I want to see that path before we merge it.

> I do agree it's complex and there is benefit in making it generic ... imho, the wrappers around this are more complex than they need to be!! ;^) Also, to your other point, this is actually flexible to the cluster changing and not fixed to the cluster - the workflow is similar to a node going down and a new one taking over the VIP, or, in the case of a small cluster, implicitly having a few nodes short/down - so in a way, it's no different than a failure case.
>
> As re: the points you raised:
>
> 1. Will address that in a separate PR ... if that's the way you want this to proceed.
> 2. Unicast is a bit of an issue in any case - as the number of nodes scales, unicast is not going to be a good or scalable solution - imagine sending the same advertisement packets to n hosts; that would increase the number of cross-host packets to n * (n-1).
>    Now, as re: new nodes coming into a cluster, this really only has an effect if a router instance is migrated to that node - but then so would a keepalived instance - it's part of the same pod, and in that case the cluster membership is automatically managed (all peers + the node itself form that group). Otherwise, a new node joining the cluster but not running a router really has no effect. There's probably some more work still to do there - not complete by any means - but in all fairness, I threw in unicast support because I have seen environments a la Amazon EC2 where multicast is not supported. I didn't think that would really be a big driver in an enterprise case, as it does have limitations of scale.

I think it can definitely be valuable, but I don't want to put it in unless you're confident it's a long-term solution.

> 3. Not sure I understood this point - "A description of what an admin needs to do to configure HA IP outside the cluster, and why they would or wouldn't choose that" - if you are looking for an HA solution from outside the cluster, you still have to manage how traffic gets routed to one or more router instances and handle cases where the router pod migrates between nodes.

So this is a "we make pods HA" solution internal to the cluster. If an admin says "I know how to do that", what are the things they have to know to properly enable it from the outside in a normal, enterprisey ops shop. Ideally the outside directions are a clear subset of the work you're doing here, so that shared bit of explanation is important for us to support this.

> 4. In general, I can add more documentation around this and the other bits once this gets fleshed out a wee bit more. This point is however really easy to address, in that it would just be a selector on the ReplicationControllerSpec, which would define what set of nodes this gets run on (router + the failover component). Again, the dynamic membership bits are as mentioned at the top of this comment.

> And finally, as re: the last part, reducing cost - I don't know about that. There's a fair deal of complexity, understanding, and layout - physical/network, migrations, etc. - which needs to be done by hand, so there is definitely automation and reduction of cost here. Now, that said, we have a lot of assumptions across the board on how the nodes are connected - the network topologies of how the whole shebang is deployed, whether it's across availability zones/networks/WANs, etc. - and I don't think that's something specific to this PR; it's something that needs to be addressed outside the scope of this crazy lil' world where all the keepaliveds know each other!! ;^)



@ramr ramr (Contributor Author) commented Apr 2, 2015

Thanks for the clarifications/comments - I will take a look at this when I am back next week. I will probably disable/not expose the unicast functionality for now - there are the obvious scale issues mentioned above. Meanwhile, let me mull on this a bit more. Closing this PR out for now. Thanks.

@ramr ramr closed this Apr 2, 2015
@smarterclayton (Contributor)

Also, can you look over the external IP discussion upstream (and the service public IP) and get familiar with the implications?

External IP, at least, is a request by an app consumer to get an IP provisioned by the infra that is bound to a pod (or the pods in a service?). That overlaps with services a great deal. And headless services having a pair of virtual IPs also maps to this problem.

On Apr 2, 2015, at 1:12 PM, ramr notifications@github.com wrote:

> Thanks for the clarifications/comments - I will take a look at this when I am back next week. I will probably disable/not expose the unicast functionality for now - there are the obvious scale issues mentioned above. Meanwhile, let me mull on this a bit more. Closing this PR out for now. Thanks.



@ramr ramr (Contributor Author) commented Apr 2, 2015

Sure - will take a look at the upstream external IP discussion as well next week when I am back.

@ramr ramr deleted the experimental branch May 21, 2015 05:16
