
RFC: Per-reconciler leader election (and leader-aware reconcilers) #1181

@mattmoor

Description


/area API
/kind cleanup

I've been thinking a bit about leader election. The way we run things today, we basically allocate N backup pods that wait to become the leader and do nothing until then; they are essentially there as scheduling reservations.

We achieve this by wrapping the entire controller bootstrap process in a giant resource lock: the process kicks things off when it acquires the lock and terminates if it loses it. This means our Reconcilers get no benefit from horizontal scaling, because no load is spread; in fact they pay a penalty, since every pod must also run the leader-election logic.

Given 10 reconcilers living in a single controller, leader-electing at the pod level means a failover requires a standby pod to spin up 10 reconcilers' worth of state before it becomes functional again. Since we build none of that state in advance, it isn't a particularly hot failover.


The tl;dr of what I've been considering is this:

  1. Have per controller.Impl / Reconciler leaders
  2. Run all of the informers and controllers, eliding* reconciliation when not the leader.
  3. When we become the leader, global resync.

* - We may want to run leaderelection-aware reconcilers even when not the leader, e.g. if the resources program routing information for an admission control webhook.

The cons of this approach are:

  • Higher resource utilization for non-leader pods because they have the full informer cache.
  • A multiplier on the number of leader locks we would need, and a corresponding increase in API server load.

The pros of this approach are:

  • Leaders can be spread across replicas, so instead of all 10 reconcilers running on the single leader pod, with 3 replicas I might only run 3-4 reconcilers per pod. Our controllers start to see benefits from horizontal scaling.
  • Hot failovers, which in the best case cost O(no-op global resync). This is amplified by the point above, since generally fewer reconcilers should be failing over each time a pod fails.

I think it'd be interesting to try this with our webhooks, which don't currently perform leader election and in some cases (e.g. Bindings) need the informer cache to program the webhook. If we decide to move ahead with this approach, then I think we should integrate it with the // +genreconciler work, which I believe has all of the information needed to do all(?) of the heavy lifting for (now) the vast majority of our reconcilers.

I'd love to try this out, but my schedule is chaotic and I don't want to be a bottleneck if others have time and are interested.

cc @pmorie @markusthoemmes @dprotaso @vaikas @vagababov for thoughts (LMK if I am missing any pros/cons)

Metadata

Labels: area/API, kind/cleanup (cleaning up code, process, or technical debt), lifecycle/stale (open with no activity; has become stale)
