
RFC: Per-reconciler leader election (and leader-aware reconcilers) #1181

@mattmoor

Description


/area API
/kind cleanup

I've been thinking a bit about leader election. The way we run things today, we basically allocate N backup pods that wait to become the leader and do nothing until then; they are essentially there as scheduling reservations.

We achieve this by wrapping the entire controller bootstrap process in a giant resource lock: the process kicks things off when it acquires the lock and terminates if it loses it. This means our Reconcilers get no benefit from horizontal scaling, because no load is spread; in fact they pay a penalty, since every pod must also run the leader-election logic.

Given 10 reconcilers living in a single controller, leader-electing at the pod level means a failover requires a standby pod to spin up 10 reconcilers' worth of state before it becomes functional again. Since we build none of that state in advance, it isn't a particularly hot failover.


The tl;dr of what I've been considering is this:

  1. Have per controller.Impl / Reconciler leaders
  2. Run all of the informers and controllers, eliding* reconciliation when not the leader.
  3. When we become the leader, global resync.

* - We may want to run leaderelection-aware reconcilers even when not the leader, e.g. if the resources program routing information for an admission control webhook.

The cons of this approach are:

  • Higher resource utilization for non-leader pods because they have the full informer cache.
  • A multiplier on the number of leader locks we would need, and a corresponding increase in API server load.

The pros of this approach are:

  • Leaders can be spread across replicas, so instead of all 10 reconcilers running on the single leader pod, with 3 replicas I might only run 3-4 reconcilers per pod. Our controllers start to see benefits from horizontal scaling.
  • Hot failovers, which in the best case cost O(no-op global resync). This is amplified by the point above, since generally fewer reconcilers should be failing over each time a pod fails.

I think it'd be interesting to try this with our webhooks, which don't currently perform leader election and in some cases (e.g. Bindings) need the informer cache to program the webhook. If we decide to move ahead with this approach, then I think we should integrate it with the // +genreconciler work, which I believe has all of the information needed to do all(?) of the heavy lifting for (now) the vast majority of our reconcilers.

I'd love to try this out, but my schedule is chaotic and I don't want to be a bottleneck if others have time and are interested.

cc @pmorie @markusthoemmes @dprotaso @vaikas @vagababov for thoughts (LMK if I am missing any pros/cons)

Metadata

Labels: area/API, kind/cleanup (cleaning up code, process, or technical debt), lifecycle/stale (open with no activity; has become stale)
