We've been seeing some flakiness around node promotion and demotion, so here's a writeup of how it works, along with some possible issues.
Node promotion
When a node is promoted, the desired state of the node is set to manager in the control api.
The roleManager (github.com/docker/swarmkit/manager/role_manager.go) is a service running on the leader which watches for updates to nodes and reconciles the desired role with the observed role. When it gets an update about a promotion, it simply updates the node's observed role to manager, provided the node's desired role, observed role, and existence haven't changed in the meantime.
However, the raft membership isn't updated yet. The node is added to the raft cluster membership when it makes a call to another manager node's raft API (which gets forwarded to the leader) and requests to join. The leader will:
observe the new node's IP address
attempt to contact it back, and decline to add it to raft if it's unreachable
generate a raft ID for it, one that is guaranteed to be unique
send a conf change to all the other nodes adding the new raft member
send the confirmation back to the original joining node, with the raft ID
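The leader-side join flow above can be sketched roughly as follows. This is an illustrative Go sketch; the struct and method names are made up for this writeup, not swarmkit's actual API, and the conf change is modeled as a simple map insert:

```go
package main

import (
	"errors"
	"fmt"
)

// cluster is a stand-in for the leader's view of raft membership.
type cluster struct {
	usedIDs map[uint64]bool   // every raft ID ever assigned, including removed members
	nextID  uint64
	members map[uint64]string // raft ID -> address of current members
	reach   func(addr string) bool
}

// newUniqueRaftID returns an ID guaranteed never to have been used before,
// even by members that have since been removed.
func (c *cluster) newUniqueRaftID() uint64 {
	for {
		c.nextID++
		if !c.usedIDs[c.nextID] {
			c.usedIDs[c.nextID] = true
			return c.nextID
		}
	}
}

// handleJoin mirrors the steps above: verify the node is reachable, mint a
// fresh raft ID, propagate the conf change (modeled as a map insert), and
// return the ID so it can be sent back to the joining node.
func (c *cluster) handleJoin(addr string) (uint64, error) {
	if !c.reach(addr) {
		return 0, errors.New("cannot contact joining node; refusing to add it to raft")
	}
	id := c.newUniqueRaftID()
	c.members[id] = addr // stands in for sending a conf change to all members
	return id, nil
}

func main() {
	c := &cluster{
		usedIDs: map[uint64]bool{1: true}, // ID 1 belonged to a removed member
		members: map[uint64]string{},
		reach:   func(string) bool { return true },
	}
	id, err := c.handleJoin("10.0.0.5:2377")
	fmt.Println(id, err)
}
```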
We don't want to pre-add the raft node to the cluster, because:
we may not know the originating IP
there may be a configuration issue that will prevent it from being able to join, and automatically adding it may break quorum if so
we need to make sure that the raft ID has never been used before
Node demotion
Before we allow a demotion, we do a couple of sanity checks:
the last manager of the cluster cannot be demoted
we won't demote a node that's not in the raft memberlist
we won't demote if doing so would cause a loss of quorum
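The quorum check can be sketched as follows (an illustrative Go sketch, not swarmkit's actual code; it relies only on the fact that raft quorum is a majority of the member count):

```go
package main

import "fmt"

// demotionWouldLoseQuorum reports whether removing one manager would drop
// the number of reachable managers below the quorum of the shrunken cluster.
// Hypothetical sketch for this writeup, not swarmkit's real implementation.
func demotionWouldLoseQuorum(totalManagers, reachableManagers int, targetReachable bool) bool {
	newTotal := totalManagers - 1
	newQuorum := newTotal/2 + 1 // majority of the cluster after removal
	newReachable := reachableManagers
	if targetReachable {
		newReachable--
	}
	return newReachable < newQuorum
}

func main() {
	// 3 managers, all reachable: demoting one leaves 2 of quorum 2 -> safe.
	fmt.Println(demotionWouldLoseQuorum(3, 3, true)) // false
	// 3 managers, 1 down: demoting a reachable one leaves 1 of quorum 2 -> unsafe.
	fmt.Println(demotionWouldLoseQuorum(3, 2, true)) // true
}
```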
The roleManager, when it gets an update about a demotion, will attempt to reconcile the role in the following manner:
if it's not a member of the raft cluster, or was successfully removed, the node is updated to reflect "worker" as the observed role
if it's a member of the raft cluster, we do not remove immediately if doing so would break quorum
if the leader is being demoted, it attempts to transfer leadership so the new leader can apply the conf change
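Those reconciliation decisions can be summarized in a rough sketch. The names and the string "actions" are illustrative, not swarmkit's real API:

```go
package main

import "fmt"

// demotionState captures what the roleManager needs to decide on a demotion.
// Hypothetical types for this writeup, not swarmkit's actual code.
type demotionState struct {
	inRaftMemberlist    bool // is the node still a raft member?
	removalBreaksQuorum bool
	isLeader            bool // is the node being demoted the current leader?
}

func reconcileDemotion(s demotionState) string {
	switch {
	case !s.inRaftMemberlist:
		// Already out of raft (or successfully removed): just flip the
		// observed role to worker.
		return "set observed role to worker"
	case s.removalBreaksQuorum:
		// Don't remove now; wait until it can be done safely.
		return "defer removal"
	case s.isLeader:
		// The leader hands off leadership so the new leader can apply
		// the conf change removing this node.
		return "transfer leadership"
	default:
		return "remove from raft"
	}
}

func main() {
	fmt.Println(reconcileDemotion(demotionState{inRaftMemberlist: false}))
	fmt.Println(reconcileDemotion(demotionState{inRaftMemberlist: true, isLeader: true}))
}
```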
Node removal
Before we remove a node, we do some sanity checks:
if its desired role is a manager, and it's a member of the raft cluster, we require that it be demoted before removal. If it's not a member of the raft cluster, we allow removal
if the node is down, we refuse to remove it unless the force flag is supplied
Once a node is removed, its node ID is added to a list of blacklisted certs - no node with this ID will be allowed to connect to another manager again (we have a custom GRPC authorizer that checks whether the node ID is in the blacklist). If the cluster has a stored certificate for this node, an expiry date is added so that the list of blacklisted certs can be cleaned up after the last valid cert for the node expires. It's a blacklist rather than a whitelist because the failure modes differ: if a blacklist fails to propagate in time, a node can briefly connect to a cluster when it shouldn't; if a whitelist fails to propagate in time, a node can't connect when it should, and this could destabilize the cluster.
Renewing the TLS certs/manager startup/shutdown after a demotion or promotion
When a node's desired role changes (or the certificate status is IssuanceStateRotate), the dispatcher pushes node changes down to the agent. The agent, upon seeing a node change (including a desired role change), will renew the certificate, and keep trying to renew until it gets a certificate with the expected (desired) role.
The CA server running on the leader only issues certs for the observed node role (so if the node role hasn't been reconciled yet, the CA server will issue a cert for the node's previous role).
Only when a node gets a cert for its new role does it officially either start up the manager (if it's been promoted) or shut down the manager (if it's been demoted). This makes sense for promotions: there's no point in starting manager services before the node has a manager cert, since it can't act as a manager without one.
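The agent's renew-until-match loop might be sketched like this (hypothetical names; `issue` stands in for the real CA request, and the real agent would back off between attempts). Because the CA only issues certs for the observed role, this loop naturally waits out the roleManager's reconciliation:

```go
package main

import (
	"errors"
	"fmt"
)

// renewUntilRole keeps asking the CA for a new cert until the issued role
// matches the desired role, or it runs out of attempts. Illustrative sketch
// for this writeup, not swarmkit's actual renewal code.
func renewUntilRole(desired string, maxAttempts int, issue func() (string, error)) error {
	for i := 0; i < maxAttempts; i++ {
		role, err := issue()
		if err == nil && role == desired {
			return nil // got a cert for the desired role; safe to switch roles now
		}
		// Real code would back off before retrying.
	}
	return errors.New("gave up waiting for a cert with the desired role")
}

func main() {
	// Fake CA: issues a worker cert twice (role not yet reconciled), then a
	// manager cert once the roleManager has caught up.
	calls := 0
	issue := func() (string, error) {
		calls++
		if calls < 3 {
			return "worker", nil
		}
		return "manager", nil
	}
	fmt.Println(renewUntilRole("manager", 5, issue))
}
```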
Edge cases
@anshulpundir found this in Address node promotion inconsistencies #2558: If a node is promoted, but never joins the raft cluster (perhaps it goes down before being able to do so), then it can't ever be demoted. But it can be (force-)removed.
If a node was promoted, then immediately removed before the raft join conf change goes through (which we allow), it's possible that the raft node gets added to the cluster and never removed. The node itself will be unable to contact the cluster, since it will be blacklisted.
Manager quick demotion-promotion
I don't think this is actually an edge case, but it's involved and somewhat quirky; if I'm wrong, it's possible the manager will be removed from the cluster but try to rejoin with the same raft ID, which could cause issues where the cluster thinks there might be 2 leaders (see https://aphyr.com/posts/330-jepsen-rethinkdb-2-2-3-reconfiguration).
If a manager node is demoted and then promoted before the reconciler can successfully demote it (possibly due to quorum issues), the reconciler should see that the node's desired state matches its current state, and do nothing (not demote). This all works because there is a single event loop, so two reconciliations can't happen at the same time (assuming there's a single role manager running at any given time).
If the reconciler managed to remove the node from the raft consensus, but hadn't gotten around to updating the observed role yet, it will never update the observed role. The node will probably stop trying to get a new cert. The node's raft node will detect the conf change about its removal from raft, and the manager will shut down with an error that the node was removed from raft, and wipe out all its data.
The superviseManager function in github.com/docker/swarmkit/node/node.go will try to get a worker cert, because the manager was evicted, but will time out after a little while because it fails to get a worker cert, and will then restart the manager.
Worker quick promotion-demotion
A manager can't be demoted if it's not part of the raft cluster, so the demotion would fail until it successfully joined the raft cluster, at which point the demotion logic should kick in.