We've been seeing some flakiness around node promotion and demotion, so here's a writeup of how it works, along with some possible issues.
Node promotion
When a node is promoted, the desired state of the node is set to manager in the control api.
The roleManager (github.com/docker/swarmkit/manager/role_manager.go) is a service running on the leader which watches for updates to nodes and reconciles the desired role with the observed role. When it gets an update about a promotion, it simply updates the node's observed role to manager, provided the node's desired role, observed role, and existence haven't changed in the meantime.
However, the raft membership isn't updated yet. The node is added to the raft cluster membership when it makes a call to another manager node's raft API (which gets forwarded to the leader) and requests to join. The leader will:
observe the new node's IP address
attempt to contact it back, and decline to add it to raft if it's unreachable
generate a raft ID for it, one that is guaranteed to be unique
send a conf change to all the other nodes adding the new raft member
send the confirmation back to the original joining node, with the raft ID
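The leader-side join flow above can be sketched roughly as follows. This is an illustrative Go sketch; the struct and method names are made up for this writeup, not swarmkit's actual API, and the conf change is modeled as a simple map insert:

```go
package main

import (
	"errors"
	"fmt"
)

// cluster is a stand-in for the leader's view of raft membership.
type cluster struct {
	usedIDs map[uint64]bool   // every raft ID ever assigned, including removed members
	nextID  uint64
	members map[uint64]string // raft ID -> address of current members
	reach   func(addr string) bool
}

// newUniqueRaftID returns an ID guaranteed never to have been used before,
// even by members that have since been removed.
func (c *cluster) newUniqueRaftID() uint64 {
	for {
		c.nextID++
		if !c.usedIDs[c.nextID] {
			c.usedIDs[c.nextID] = true
			return c.nextID
		}
	}
}

// handleJoin mirrors the steps above: verify the node is reachable, mint a
// fresh raft ID, propagate the conf change (modeled as a map insert), and
// return the ID so it can be sent back to the joining node.
func (c *cluster) handleJoin(addr string) (uint64, error) {
	if !c.reach(addr) {
		return 0, errors.New("cannot contact joining node; refusing to add it to raft")
	}
	id := c.newUniqueRaftID()
	c.members[id] = addr // stands in for sending a conf change to all members
	return id, nil
}

func main() {
	c := &cluster{
		usedIDs: map[uint64]bool{1: true}, // ID 1 belonged to a removed member
		members: map[uint64]string{},
		reach:   func(string) bool { return true },
	}
	id, err := c.handleJoin("10.0.0.5:2377")
	fmt.Println(id, err)
}
```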
We don't want to pre-add the raft node to the cluster, because:
we may not know the originating IP
there may be a configuration issue that will prevent it from being able to join, and automatically adding it may break quorum if so
we need to make sure that the raft ID has never been used before
Node demotion
Before we allow a demotion, we do a couple of sanity checks:
the last manager of the cluster cannot be demoted
we won't demote a node that's not in the raft memberlist
we won't demote if doing so would cause a loss of quorum
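The quorum check can be sketched as follows (an illustrative Go sketch, not swarmkit's actual code; it relies only on the fact that raft quorum is a majority of the member count):

```go
package main

import "fmt"

// demotionWouldLoseQuorum reports whether removing one manager would drop
// the number of reachable managers below the quorum of the shrunken cluster.
// Hypothetical sketch for this writeup, not swarmkit's real implementation.
func demotionWouldLoseQuorum(totalManagers, reachableManagers int, targetReachable bool) bool {
	newTotal := totalManagers - 1
	newQuorum := newTotal/2 + 1 // majority of the cluster after removal
	newReachable := reachableManagers
	if targetReachable {
		newReachable--
	}
	return newReachable < newQuorum
}

func main() {
	// 3 managers, all reachable: demoting one leaves 2 of quorum 2 -> safe.
	fmt.Println(demotionWouldLoseQuorum(3, 3, true)) // false
	// 3 managers, 1 down: demoting a reachable one leaves 1 of quorum 2 -> unsafe.
	fmt.Println(demotionWouldLoseQuorum(3, 2, true)) // true
}
```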
The roleManager, when it gets an update about a demotion, will attempt to reconcile the role in the following manner:
if it's not a member of the raft cluster, or was successfully removed, the node is updated to reflect "worker" as the observed role
if it's a member of the raft cluster, we do not remove immediately if doing so would break quorum
if the leader is being demoted, it attempts to transfer leadership so the new leader can apply the conf change
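Those reconciliation decisions can be summarized in a rough sketch. The names and the string "actions" are illustrative, not swarmkit's real API:

```go
package main

import "fmt"

// demotionState captures what the roleManager needs to decide on a demotion.
// Hypothetical types for this writeup, not swarmkit's actual code.
type demotionState struct {
	inRaftMemberlist    bool // is the node still a raft member?
	removalBreaksQuorum bool
	isLeader            bool // is the node being demoted the current leader?
}

func reconcileDemotion(s demotionState) string {
	switch {
	case !s.inRaftMemberlist:
		// Already out of raft (or successfully removed): just flip the
		// observed role to worker.
		return "set observed role to worker"
	case s.removalBreaksQuorum:
		// Don't remove now; wait until it can be done safely.
		return "defer removal"
	case s.isLeader:
		// The leader hands off leadership so the new leader can apply
		// the conf change removing this node.
		return "transfer leadership"
	default:
		return "remove from raft"
	}
}

func main() {
	fmt.Println(reconcileDemotion(demotionState{inRaftMemberlist: false}))
	fmt.Println(reconcileDemotion(demotionState{inRaftMemberlist: true, isLeader: true}))
}
```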
Node removal
Before we remove a node, we do some sanity checks:
if its desired role is a manager, and it's a member of the raft cluster, we require that it be demoted before removal. If it's not a member of the raft cluster, we allow removal
if the node is down, we refuse to remove it unless the force flag is supplied
Once a node is removed, its node ID is added to a list of blacklisted certs - no node with this ID will be allowed to connect to another manager again (we have a custom GRPC authorizer that checks whether the node ID is in the blacklist). If the cluster has a stored certificate for this node, an expiry date is added so that the list of blacklisted certs can be cleaned up after the last valid cert for the node expires. It's a blacklist rather than a whitelist because the failure modes differ: if a blacklist fails to propagate in time, a node can briefly connect to a cluster when it shouldn't; if a whitelist fails to propagate in time, a node can't connect when it should, and this could destabilize the cluster.
Renewing the TLS certs/manager startup/shutdown after a demotion or promotion
When a node's desired role changes (or the certificate status is IssuanceStateRotate), the dispatcher pushes node changes down to the agent. The agent, upon seeing a node change (including a desired role change), will renew the certificate, and keep trying to renew until it gets a certificate with the expected (desired) role.
The CA server running on the leader only issues certs for the observed node role (so if the node role hasn't been reconciled yet, the CA server will issue a cert for the node's previous role).
Only when a node gets a cert for its new role does it officially either start up the manager (if it's been promoted) or shut down the manager (if it's been demoted). This makes sense for promotions: there's no point in starting manager services before the node has a manager cert, since it can't act as a manager without one.
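The agent's renew-until-match loop might be sketched like this (hypothetical names; `issue` stands in for the real CA request, and the real agent would back off between attempts). Because the CA only issues certs for the observed role, this loop naturally waits out the roleManager's reconciliation:

```go
package main

import (
	"errors"
	"fmt"
)

// renewUntilRole keeps asking the CA for a new cert until the issued role
// matches the desired role, or it runs out of attempts. Illustrative sketch
// for this writeup, not swarmkit's actual renewal code.
func renewUntilRole(desired string, maxAttempts int, issue func() (string, error)) error {
	for i := 0; i < maxAttempts; i++ {
		role, err := issue()
		if err == nil && role == desired {
			return nil // got a cert for the desired role; safe to switch roles now
		}
		// Real code would back off before retrying.
	}
	return errors.New("gave up waiting for a cert with the desired role")
}

func main() {
	// Fake CA: issues a worker cert twice (role not yet reconciled), then a
	// manager cert once the roleManager has caught up.
	calls := 0
	issue := func() (string, error) {
		calls++
		if calls < 3 {
			return "worker", nil
		}
		return "manager", nil
	}
	fmt.Println(renewUntilRole("manager", 5, issue))
}
```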
Edge cases
@anshulpundir found this in Address node promotion inconsistencies #2558: If a node is promoted, but never joins the raft cluster (perhaps it goes down before being able to do so), then it can't ever be demoted. But it can be (force-)removed.
If a node was promoted, then immediately removed before the raft join conf change goes through (which we allow), it's possible that the raft node gets added to the cluster and never removed. The node itself will be unable to contact the cluster, since it will be blacklisted.
Manager quick demotion-promotion
I don't think this is actually an edge case, but it's involved and somewhat quirky; if I'm wrong, it's possible the manager will be removed from the cluster but try to rejoin with the same raft ID, which could cause issues where the cluster thinks there might be 2 leaders (see https://aphyr.com/posts/330-jepsen-rethinkdb-2-2-3-reconfiguration).
If a manager node is demoted and then promoted before the reconciler can successfully demote it (possibly due to quorum issues), the reconciler should see that the node's desired state matches its current state, and do nothing (not demote). This all works because there is a single event loop, so two reconciliations can't happen at the same time (assuming there's a single role manager running at any given time).
If the reconciler managed to remove the node from the raft consensus, but hadn't gotten around to updating the observed role yet, it will never update the observed role. The node will probably stop trying to get a new cert. The node's raft node will detect the conf change about its removal from raft, and the manager will shut down with an error that the node was removed from raft, and wipe out all its data.
The superviseManager function in github.com/docker/swarmkit/node/node.go will try to get a worker cert, because the manager was evicted, but will time out after a little while because it fails to get a worker cert, and will then restart the manager.
Worker quick promotion-demotion
A manager can't be demoted if it's not part of the raft cluster, so the demotion would fail until it successfully joined the raft cluster, at which point the demotion logic should kick in.