
Example to show infinite election when performing membership change. #943

Closed
wojiaodoubao wants to merge 3 commits into apache:master from wojiaodoubao:membership-cluster-unavailable

Conversation

@wojiaodoubao
Contributor

What changes were proposed in this pull request?

This is a patch showing a bug case. Suppose we want to replace 2 peers of a 3-peer cluster, and we start the 2 new peers with an empty conf. This can cause an infinite election and make the cluster unavailable forever.

What is the link to the Apache JIRA

RATIS-1912

https://issues.apache.org/jira/browse/RATIS-1912

How was this patch tested?

It is not tested.

@SzyWilliam
Member

Thanks @wojiaodoubao for this detailed example! This case may indeed lead to indefinite unavailability. It seems like a nightmare to me :(

Maybe we can allow a whitelist for request vote, as proposed in the previous thread https://lists.apache.org/thread/tt1j3jkogh71k2hvq5gtltwmphxfy736.

@szetszwo
Contributor

@wojiaodoubao , thanks for testing Ratis!

... Suppose we want to replace 2 peers of a 3-peer cluster. ...

We probably should not allow changing a majority of the peers at the same time. It should replace the nodes one by one.

@wojiaodoubao
Contributor Author

Thanks @SzyWilliam @szetszwo for your great replies! IMHO, the root cause is 'we can't find any majority without new peers in C_new'.
Bad case examples:

  1. Change from {n0} to {n0, n1}.
  2. Change from {n0, n1, n2} to {n0, n1, n2, n3, n4, n5}.
  3. Change from {n0, n1, n2} to {n0, n3, n4}.
  4. Change from {n0, n1, n2} to {n0, n1, n3, n4}.

Replacing/adding/removing peers one by one fits most cases, but fails in case 1: changing from {n0} to {n0, n1}.

The solution might be:

  1. RaftServerImpl#setConfiguration checks C_new and throws an exception if it can't find any majority of C_new without the new peers. This might be the easiest way to fix the problem. The shortcoming is that it no longer allows every conf conversion described in the Raft paper. (A rough sketch follows this list.)
  2. Configure a vote whitelist for the new peers. This is a little tricky, because we must remove the whitelist as soon as the new peers' conf is updated to the transitional conf (C_old_and_C_new). Otherwise they may vote for a removed peer.
  3. Start each new peer with {C_old, peer_itself}, and when a server not in the conf asks for a vote, don't let it crash. Preventing nodes from crashing may cause very dangerous problems, which is probably why we let them crash; I haven't investigated it. Please help if you know the reason for the crash.
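
A rough sketch of the solution-1 check (hypothetical helper and names, plain string ids; not the actual RaftServerImpl code):

import java.util.Set;

public final class ConfChangeCheck {
  // Reject a setConfiguration whose C_new has no majority made up purely of peers
  // that already exist in C_old; such a change can never gather enough votes while
  // the brand-new peers still run with an empty conf.
  static void checkNewConf(Set<String> oldConf, Set<String> newConf) {
    final long survivors = newConf.stream().filter(oldConf::contains).count();
    if (survivors <= newConf.size() / 2) {  // not a strict majority of C_new
      throw new IllegalArgumentException(
          "C_new " + newConf + " has no majority inside C_old " + oldConf);
    }
  }
}

For example, changing from {n0, n1, n2} to {n0, n3, n4} would be rejected (only 1 of the 3 C_new peers is in C_old), while replacing a single peer would pass.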

@Brokenice0415
Contributor

Brokenice0415 commented Oct 20, 2023

Solution 3 seems nice.

The reason for shutting down on NOT_IN_CONF comes from #560 -- without it, a removed peer that is no longer in the conf may keep requesting votes and never be shut down.

According to the javadoc below, we only shut down a candidate whose id is not included in the conf when the conf is stable.

/**
* check if the remote peer is not included in the current conf
* and should shutdown. should shutdown if all the following stands:
* 1. this is a leader
* 2. current conf is stable and has been committed
* 3. candidate id is not included in conf
* 4. candidate's last entry's index < conf's index
*/
private boolean shouldSendShutdown(RaftPeerId candidateId,

So maybe we should change the NOT_IN_CONF handling of the request-vote reply regardless of whether we use solution 3 or not.

@Brokenice0415
Contributor

I found out that NOT_IN_CONF is only generated when the candidate finds that it itself is not in the conf when requesting a vote, which is not related to what we discussed.

A peer that receives a vote request replies with shutdown when shouldSendShutdown returns true, which requires C_old to be null. So I wonder whether the crash in solution 3 really happens.

@wojiaodoubao
Contributor Author

Hi @Brokenice0415 , thanks for your nice advice and explanation!

A peer that receives a vote request replies with shutdown when shouldSendShutdown returns true, which requires C_old to be null. So I wonder whether the crash in solution 3 really happens.

I made a test to simulate this case. The new peer did shut down. Please have a look at TestMembership#testShutdown.

@szetszwo
Contributor

  1. Change from {n0} to {n0, n1}.

@wojiaodoubao , This case should work well. How could it be bad?

@Brokenice0415
Contributor

@wojiaodoubao, thanks for your case explanation!

However, it seems that the new peer 3 is not added to the new configuration by the client, meaning the test case only shows that a new peer which is not in C_old will shut down after asking the old peers for votes.

I tried adding client.admin().setConfiguration(Arrays.copyOfRange(peers, 0, 4)) before peer 3 starts, and peer 3 no longer crashes. The same holds when the call is made after peer 3 starts, unless there is a long sleep (e.g. 1s) before the setConfiguration call.

So maybe the shutdown is reasonable: if a new peer has been running for a long time but the cluster has not received the request to change the conf, it should shut down. And in solution 3, the crash may not happen if the client sets up the new conf in time. (A sketch of this ordering follows below.)
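
For illustration, the ordering described above might look like the following in the test (reusing calls already quoted in this thread; peers, client, servers, GROUP_ID and startServer are the test's own helpers and are assumptions here, not part of this comment):

// Apply the conf change first, then start the new peer, so that the leader's
// LogAppender reaches peer 3 before peer 3 ever asks the old peers for votes.
final RaftClientReply reply = client.admin().setConfiguration(Arrays.copyOfRange(peers, 0, 4));
if (!reply.isSuccess()) {
  throw new IllegalStateException("setConfiguration failed: " + reply);
}
servers[3] = startServer(RaftGroup.valueOf(GROUP_ID, peers[0], peers[1], peers[2], peers[3]),
    peers[3], RaftStorage.StartupOption.FORMAT, false);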


// 3. Start {node-3} with {C_old, peer}.
servers[3] = startServer(RaftGroup.valueOf(GROUP_ID, peers[0], peers[1], peers[2], peers[3]), peers[3], RaftStorage.StartupOption.FORMAT, false);

client.admin().setConfiguration(Arrays.copyOfRange(peers, 0, 4)) is needed to make peer 3 part of the real C_new.

@wojiaodoubao
Contributor Author

wojiaodoubao commented Oct 23, 2023

Change from {n0} to {n0, n1}.
This case should work well. How could it be bad?

Hi @szetszwo , sorry for my late response. I made a test to simulate this case. Please have a look at TestMembership#testAddOneToOne. The procedure is as follows.

  1. Originally we have a cluster with 1 node n0.
  2. Start n1 with empty conf.
  3. Call setConfiguration to n0. After n1 catches up, n0 changes conf to C_old_and_C_new.
  4. A network partition happens. n1 doesn't get conf C_old_and_C_new.
  5. n0 finds no majority because n1 doesn't answer. n0 quits and restarts election.
  6. Network partition recovers.
  7. n1 never votes for n0, so we won't have a leader anymore.

I think the reason is: all new peers' confs are empty, while the peers in C_old have moved to the transitional conf (C_old_and_C_new). A peer with the transitional conf needs votes from both a C_old majority and a C_new majority. But 'we can't find any majority without new peers in C_new'. That's why we won't get a new leader.
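
To make the vote requirement concrete, here is an illustrative sketch of the joint-consensus rule (plain string ids; not the actual Ratis vote-counting code):

import java.util.Set;

final class JointConsensusSketch {
  // A candidate holding the transitional conf (C_old_and_C_new) wins the election
  // only if it has a majority of C_old AND a majority of C_new.
  static boolean wonElection(Set<String> votesGranted, Set<String> confOld, Set<String> confNew) {
    return hasMajority(votesGranted, confOld) && hasMajority(votesGranted, confNew);
  }

  static boolean hasMajority(Set<String> votesGranted, Set<String> conf) {
    final long granted = conf.stream().filter(votesGranted::contains).count();
    return granted > conf.size() / 2;  // strict majority
  }
}

In testAddOneToOne, C_old = {n0} and C_new = {n0, n1}: n0 needs n1's vote to reach a C_new majority, but n1 (still on an empty conf) never grants it, so no leader can emerge.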

@wojiaodoubao
Contributor Author

wojiaodoubao commented Oct 23, 2023

Hi @Brokenice0415, thanks for your detailed explanation! If I understand correctly, 'setConfiguration before starting new peers' is right for most cases. There might be a small corner case, as I'll describe below. Please help review whether it exists, thanks. I'll try to simulate it later.

First, we know 3 RaftServer behaviors:

  1. If a peer not in C_old asks the leader for votes, the leader will crash that peer.
  2. After we call 'setConfiguration', the leader will start a new LogAppender thread and keep calling appendEntries to the new peer.
  3. If the new peer receives appendEntries, it will accept the sender as its leader.

The corner case is below. There is a race condition between 'leader appendEntries to peer 3' and 'peer 3 asks for vote from leader'.

  1. Originally we have a cluster with peer 0, 1, 2. Peer 0 is leader.
  2. SetConfiguration to peer 0, update conf from {0, 1, 2} to {0, 1, 2, 3}.
  3. Peer 0 starts a new LogAppender thread responsible for peer 3.
  4. A network partition happens.
  5. Start peer 3.
  6. Network partition recovers.
  7. A race condition between 'leader appendEntries to peer 3' and 'peer 3 asks for vote from leader' happens. If the vote request arrives first, peer 3 will shut down.

@Brokenice0415
Contributor

Brokenice0415 commented Oct 23, 2023

Hi @wojiaodoubao , I think that race may happen.

I think it's difficult to distinguish the new peer from a removed old one when receiving a vote request from a peer not in the conf.

How about just restarting the crashed new peer, since the network partition has recovered and there is little likelihood of the partition happening again in such a short period of time? If the partition does happen again and the restarted new peer crashes again, maybe we should give up adding this unstable peer to the new conf?

@szetszwo
Contributor

@wojiaodoubao , when there is NO PROGRESS, setConf will fail. However, if there is no leader, the peers may not be able to elect one. Good catch on the bug!

@szetszwo
Contributor

  1. A race condition between 'leader appendEntries to peer 3' and 'peer 3 asks for vote from leader' happens. If the vote request arrives first, peer 3 will shut down.

If peer 3 starts with an empty conf, then it won't ask for vote.

@szetszwo
Contributor

... Please have a look at TestMembership#testAddOneToOne. ...

This is an important case since it is for changing from non-HA to HA. We may:

  • Support only changing from 1 peer to 2 peers in a setConf (adding more peers requires another setConf).
  • When changing from 1 peer to 2 peers, the old peer can become the leader even if the new peer has not voted for it. The new peer can become the leader only if the old peer voted for it (a rough sketch follows below).

For the other cases, we should somehow time out the in-progress setConf.
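
For the 1-peer-to-2-peer special case, the asymmetric rule in the second bullet could be sketched roughly as follows (hypothetical, plain string ids; not an actual Ratis change):

import java.util.Set;

final class OneToTwoElectionSketch {
  // Growing from {oldPeer} to {oldPeer, newPeer}: the old peer may win with only
  // its own vote; the new peer may win only if the old peer also voted for it.
  static boolean canWin(String candidate, Set<String> votesGranted, String oldPeer, String newPeer) {
    if (candidate.equals(oldPeer)) {
      return votesGranted.contains(oldPeer);
    }
    return votesGranted.contains(oldPeer) && votesGranted.contains(newPeer);
  }
}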

@wojiaodoubao
Contributor Author

How about just restarting the crashed new peer, since the network partition has recovered and there is little likelihood of the partition happening again in such a short period of time? If the partition does happen again and the restarted new peer crashes again, maybe we should give up adding this unstable peer to the new conf?

Hi @Brokenice0415 , thanks for your nice comments. I think this is a good solution. First, it's very simple: we don't need to change any code; providing an example and documentation should be enough. Second, it's flexible: it allows any conf update as described in the Raft paper (joint consensus).

I'm new to Ratis and I'm worried my view might be a bit narrow. Could @szetszwo @SzyWilliam kindly give more advice on this solution (starting each new peer with conf {C_old, peer_itself})?

I also considered @szetszwo 's suggestion.

We may:

  • Support only changing from 1 peer to 2 peers in a setConf (adding more peers requires another setConf).
  • When changing from 1 peer to 2 peers, the old peer can become the leader even if the new peer has not voted for it. The new peer can become the leader only if the old peer voted for it.

For the other cases, we should somehow time out the in-progress setConf.

Based on this idea, I think we can set up 2 rules:

  1. Constrain configuration change behaviors: only adding one peer or removing one peer is allowed (a rough pre-check sketch is at the end of this comment).
  2. When changing from 1 peer to 2 peers, the old peer can become the leader even if the new peer has not voted for it. The new peer can become the leader only if the old peer voted for it.

By rule 1, we reject all cases that may lead to infinite leader election, except 'adding 1 new peer to a 1 peer cluster'. By rule 2, we fix the 'adding 1 new peer to a 1 peer cluster' case.

We can show that the procedure causes neither split brain nor infinite leader election. The combinations of old-peer and new-peer configurations are shown below.

old peer        | new peer        | Split brain | Infinite leader election | Leader can win
C_old           | empty           | no          | no                       | yes
C_old_and_C_new | empty           | no          | no                       | yes, because of rule 2
C_old_and_C_new | C_old_and_C_new | no          | no                       | yes
C_new           | C_old_and_C_new | no          | no                       | yes
C_old_and_C_new | C_new           | no          | no                       | yes
C_new           | C_new           | no          | no                       | yes

The advantages are:

  1. It allows a new peer to start with an empty conf, which is simple.
  2. It simplifies configuration change behaviors to only 'add 1 peer' and 'remove 1 peer'.
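
A hypothetical pre-check for rule 1 (illustrative names, plain string ids; not actual Ratis code) could reject any setConfiguration that changes more than one peer:

import java.util.HashSet;
import java.util.Set;

final class SingleChangeCheck {
  // Rule 1: a single setConfiguration may add one peer or remove one peer, nothing more.
  static void checkSingleChange(Set<String> oldConf, Set<String> newConf) {
    final Set<String> added = new HashSet<>(newConf);
    added.removeAll(oldConf);
    final Set<String> removed = new HashSet<>(oldConf);
    removed.removeAll(newConf);
    if (added.size() + removed.size() > 1) {
      throw new IllegalArgumentException(
          "Only one peer may be added or removed per setConfiguration: added="
              + added + ", removed=" + removed);
    }
  }
}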

@szetszwo
Contributor

How about just restarting the crashed new peer, ...

Restarting is a good workaround. However, it needs someone (a human or a program) to monitor it. Otherwise, the cluster becomes unavailable due to failing to elect a leader.

If we time out the setConf, then the cluster can recover automatically.

  1. Simplifying configuration change behaviors to only 'add 1 peer' and 'remove 1 peer'.

More generally, we should allow changing a minority set of peers in one setConf command, except for the setConf from 1 peer to 2 peers.

@wojiaodoubao
Contributor Author

Thanks @szetszwo for your nice suggestion! Let me try to fix it.

@wojiaodoubao
Contributor Author

The fix patch is uploaded at #954. @szetszwo please have a look when you have time, thanks.

@szetszwo
Contributor

@wojiaodoubao , thanks for working on the fix! Will review it.

More generally, we should allow changing a minority set of peers in one setConf command, except for the setConf from 1 peer to 2 peers.

For removing peers, it seems okay to allow removing any number of peers. The problem is only in adding or replacing peers.

@szetszwo
Contributor

szetszwo commented Nov 5, 2023

Closing this since #954 is merged.

@szetszwo closed this Nov 5, 2023