
Example to show infinite election when performing membership change. #943

Closed
wojiaodoubao wants to merge 3 commits into apache:master from wojiaodoubao:membership-cluster-unavailable

Conversation

@wojiaodoubao
Contributor

What changes were proposed in this pull request?

This is a patch showing a bug case. Suppose we want to replace 2 peers of a 3-peer cluster, and we start the 2 new peers with an empty conf. This can cause an infinite election and make the cluster unavailable forever.

What is the link to the Apache JIRA

RATIS-1912

https://issues.apache.org/jira/browse/RATIS-1912

How was this patch tested?

It is not tested.

@SzyWilliam
Member

Thanks @wojiaodoubao for this detailed example! This case may indeed lead to indefinite unavailability. It seems like a nightmare to me :(

Maybe we can allow a whitelist for request vote, as proposed in the previous thread https://lists.apache.org/thread/tt1j3jkogh71k2hvq5gtltwmphxfy736.

@szetszwo
Contributor

@wojiaodoubao , thanks for testing Ratis!

... Suppose we want to replace 2 peers of a 3-peer cluster. ...

We probably should not allow changing a majority of the peers at the same time. It should replace the nodes one by one.

@wojiaodoubao
Contributor Author

Thanks @SzyWilliam @szetszwo for your great replies! IMHO, the root cause is 'we can't find any majority without new peers in C_new'.
Bad case examples:

  1. Change from {n0} to {n0, n1}.
  2. Change from {n0, n1, n2} to {n0, n1, n2, n3, n4, n5}.
  3. Change from {n0, n1, n2} to {n0, n3, n4}.
  4. Change from {n0, n1, n2} to {n0, n1, n3, n4}.

Replacing/adding/removing peers one by one fits most cases, but fails in case 1: changing from {n0} to {n0, n1}.

The solution might be:

  1. RaftServerImpl#setConfiguration checks C_new and throws an exception if it can't find any majority of C_new without the new peers. This might be the easiest way to fix the problem. The shortcoming is that it no longer allows every conf conversion described in the Raft paper. (A rough sketch follows this list.)
  2. Configure a vote whitelist for the new peers. This is a little tricky, because we must remove the whitelist as soon as the new peers' conf is updated to the transitional conf (C_old_and_C_new). Otherwise they may vote for a removed peer.
  3. Start each new peer with {C_old, peer_itself}, and when a server not in the conf asks for a vote, don't let it crash. Preventing nodes from crashing may cause very dangerous problems, which is probably why we let them crash; I haven't investigated it. Please help if you know the reason for the crash.
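
A rough sketch of the solution-1 check (hypothetical helper and names, plain string ids; not the actual RaftServerImpl code):

import java.util.Set;

public final class ConfChangeCheck {
  // Reject a setConfiguration whose C_new has no majority made up purely of peers
  // that already exist in C_old; such a change can never gather enough votes while
  // the brand-new peers still run with an empty conf.
  static void checkNewConf(Set<String> oldConf, Set<String> newConf) {
    final long survivors = newConf.stream().filter(oldConf::contains).count();
    if (survivors <= newConf.size() / 2) {  // not a strict majority of C_new
      throw new IllegalArgumentException(
          "C_new " + newConf + " has no majority inside C_old " + oldConf);
    }
  }
}

For example, changing from {n0, n1, n2} to {n0, n3, n4} would be rejected (only 1 of the 3 C_new peers is in C_old), while replacing a single peer would pass.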

@Brokenice0415
Contributor

Brokenice0415 commented Oct 20, 2023

Solution 3 seems nice.

The reason for shutting down on NOT_IN_CONF comes from #560 -- without it, a removed peer that is no longer in the conf may keep requesting votes and never be shut down.

According to the javadoc below, we only shut down a candidate whose id is not included in the conf when the conf is stable.

/**
* check if the remote peer is not included in the current conf
* and should shutdown. should shutdown if all the following stands:
* 1. this is a leader
* 2. current conf is stable and has been committed
* 3. candidate id is not included in conf
* 4. candidate's last entry's index < conf's index
*/
private boolean shouldSendShutdown(RaftPeerId candidateId,

So maybe we should change the NOT_IN_CONF handling of the request-vote reply regardless of whether we use solution 3 or not.

@Brokenice0415
Contributor

I found out that NOT_IN_CONF is only generated when the candidate finds that it itself is not in the conf when requesting a vote, which is not related to what we discussed.

A peer that receives a vote request replies with shutdown when shouldSendShutdown returns true, which requires C_old to be null. So I wonder whether the crash in solution 3 really happens.

@wojiaodoubao
Contributor Author

Hi @Brokenice0415 , thanks for your nice advice and explanation!

A peer that receives a vote request replies with shutdown when shouldSendShutdown returns true, which requires C_old to be null. So I wonder whether the crash in solution 3 really happens.

I made a test to simulate this case. The new peer did shut down. Please have a look at TestMembership#testShutdown.

@szetszwo
Contributor

  1. Change from {n0} to {n0, n1}.

@wojiaodoubao , This case should work well. How could it be bad?

@Brokenice0415
Contributor

@wojiaodoubao, thanks for your case explanation!

However, it seems that the new peer 3 is not added to the new configuration by the client, meaning the test case only shows that a new peer which is not in C_old will shut down after asking the old peers for votes.

I tried adding client.admin().setConfiguration(Arrays.copyOfRange(peers, 0, 4)) before peer 3 starts, and peer 3 no longer crashes. The same holds when the call is made after peer 3 starts, unless there is a long sleep (e.g. 1s) before the setConfiguration call.

So maybe the shutdown is reasonable: if a new peer has been running for a long time but the cluster has not received the request to change the conf, it should shut down. And in solution 3, the crash may not happen if the client sets up the new conf in time. (A sketch of this ordering follows below.)
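
For illustration, the ordering described above might look like the following in the test (reusing calls already quoted in this thread; peers, client, servers, GROUP_ID and startServer are the test's own helpers and are assumptions here, not part of this comment):

// Apply the conf change first, then start the new peer, so that the leader's
// LogAppender reaches peer 3 before peer 3 ever asks the old peers for votes.
final RaftClientReply reply = client.admin().setConfiguration(Arrays.copyOfRange(peers, 0, 4));
if (!reply.isSuccess()) {
  throw new IllegalStateException("setConfiguration failed: " + reply);
}
servers[3] = startServer(RaftGroup.valueOf(GROUP_ID, peers[0], peers[1], peers[2], peers[3]),
    peers[3], RaftStorage.StartupOption.FORMAT, false);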


// 3. Start {node-3} with {C_old, peer}.
servers[3] = startServer(RaftGroup.valueOf(GROUP_ID, peers[0], peers[1], peers[2], peers[3]), peers[3], RaftStorage.StartupOption.FORMAT, false);

client.admin().setConfiguration(Arrays.copyOfRange(peers, 0, 4)) is needed to make peer 3 part of the real C_new.

@wojiaodoubao
Contributor Author

wojiaodoubao commented Oct 23, 2023

Change from {n0} to {n0, n1}.
This case should work well. How could it be bad?

Hi @szetszwo , sorry for my late response. I made a test to simulate this case. Please have a look at TestMembership#testAddOneToOne. The procedure is as follows.

  1. Originally we have a cluster with 1 node n0.
  2. Start n1 with empty conf.
  3. Call setConfiguration to n0. After n1 catches up, n0 changes conf to C_old_and_C_new.
  4. A network partition happens. n1 doesn't get conf C_old_and_C_new.
  5. n0 finds no majority because n1 doesn't answer. n0 quits and restarts election.
  6. Network partition recovers.
  7. n1 never votes for n0, so we won't have a leader anymore.

I think the reason is: all new peers' confs are empty, while the peers in C_old have moved to the transitional conf (C_old_and_C_new). A peer with the transitional conf needs votes from both a C_old majority and a C_new majority. But 'we can't find any majority without new peers in C_new'. That's why we won't get a new leader.
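
To make the vote requirement concrete, here is an illustrative sketch of the joint-consensus rule (plain string ids; not the actual Ratis vote-counting code):

import java.util.Set;

final class JointConsensusSketch {
  // A candidate holding the transitional conf (C_old_and_C_new) wins the election
  // only if it has a majority of C_old AND a majority of C_new.
  static boolean wonElection(Set<String> votesGranted, Set<String> confOld, Set<String> confNew) {
    return hasMajority(votesGranted, confOld) && hasMajority(votesGranted, confNew);
  }

  static boolean hasMajority(Set<String> votesGranted, Set<String> conf) {
    final long granted = conf.stream().filter(votesGranted::contains).count();
    return granted > conf.size() / 2;  // strict majority
  }
}

In testAddOneToOne, C_old = {n0} and C_new = {n0, n1}: n0 needs n1's vote to reach a C_new majority, but n1 (still on an empty conf) never grants it, so no leader can emerge.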

@wojiaodoubao
Contributor Author

wojiaodoubao commented Oct 23, 2023

Hi @Brokenice0415, thanks for your detailed explanation! If I understand correctly, 'setConfiguration before starting new peers' is right for most cases. There might be a small corner case, as I'll describe below. Please help review whether it exists, thanks. I'll try to simulate it later.

First, we know 3 RaftServer behaviors:

  1. If a peer not in C_old asks the leader for votes, the leader will crash that peer.
  2. After we call 'setConfiguration', the leader will start a new LogAppender thread and keep calling appendEntries to the new peer.
  3. If the new peer receives appendEntries, it will accept the sender as its leader.

The corner case is below. There is a race condition between 'leader appendEntries to peer 3' and 'peer 3 asks for vote from leader'.

  1. Originally we have a cluster with peer 0, 1, 2. Peer 0 is leader.
  2. SetConfiguration to peer 0, update conf from {0, 1, 2} to {0, 1, 2, 3}.
  3. Peer 0 starts a new LogAppender thread responsible for peer 3.
  4. A network partition happens.
  5. Start peer 3.
  6. Network partition recovers.
  7. A race condition between 'leader appendEntries to peer 3' and 'peer 3 asks for vote from leader' happens. If the vote request arrives first, peer 3 will shut down.

@Brokenice0415
Contributor

Brokenice0415 commented Oct 23, 2023

Hi @wojiaodoubao , I think that race may happen.

I think it's difficult to distinguish the new peer from a removed old one when receiving a vote request from a peer not in the conf.

How about just restarting the crashed new peer, since the network partition has recovered and there is little likelihood of the partition happening again in such a short period of time? If the partition does happen again and the restarted new peer crashes again, maybe we should give up adding this unstable peer to the new conf?

@szetszwo
Contributor

@wojiaodoubao , when there is NO PROGRESS, setConf will fail. However, if there is no leader, the peers may not be able to elect one. Good catch on the bug!

@szetszwo
Contributor

  1. A race condition between 'leader appendEntries to peer 3' and 'peer 3 asks for vote from leader' happens. If the vote request arrives first, peer 3 will shut down.

If peer 3 starts with an empty conf, then it won't ask for vote.

@szetszwo
Contributor

... Please have a look at TestMembership#testAddOneToOne. ...

This is an important case since it is for changing from non-HA to HA. We may:

  • Support only changing from 1 peer to 2 peers in a setConf (adding more peers requires another setConf).
  • When changing from 1 peer to 2 peers, the old peer can become the leader even if the new peer has not voted for it. The new peer can become the leader only if the old peer voted for it (a rough sketch follows below).

For the other cases, we should somehow time out the in-progress setConf.
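
For the 1-peer-to-2-peer special case, the asymmetric rule in the second bullet could be sketched roughly as follows (hypothetical, plain string ids; not an actual Ratis change):

import java.util.Set;

final class OneToTwoElectionSketch {
  // Growing from {oldPeer} to {oldPeer, newPeer}: the old peer may win with only
  // its own vote; the new peer may win only if the old peer also voted for it.
  static boolean canWin(String candidate, Set<String> votesGranted, String oldPeer, String newPeer) {
    if (candidate.equals(oldPeer)) {
      return votesGranted.contains(oldPeer);
    }
    return votesGranted.contains(oldPeer) && votesGranted.contains(newPeer);
  }
}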

@wojiaodoubao
Contributor Author

How about just restarting the crashed new peer, since the network partition has recovered and there is little likelihood of the partition happening again in such a short period of time? If the partition does happen again and the restarted new peer crashes again, maybe we should give up adding this unstable peer to the new conf?

Hi @Brokenice0415 , thanks for your nice comments. I think this is a good solution. First, it's very simple: we don't need to change any code; providing an example and documentation should be enough. Second, it's flexible: it allows any conf update as described in the Raft paper (joint consensus).

I'm new to Ratis and I'm worried my view might be a bit narrow. Could @szetszwo @SzyWilliam kindly give more advice on this solution (starting each new peer with conf {C_old, peer_itself})?

I also considered @szetszwo 's suggestion.

We may:

  • Support only changing from 1 peer to 2 peers in a setConf (adding more peers requires another setConf).
  • When changing from 1 peer to 2 peers, the old peer can become the leader even if the new peer has not voted for it. The new peer can become the leader only if the old peer voted for it.

For the other cases, we should somehow time out the in-progress setConf.

Based on this idea, I think we can set up 2 rules:

  1. Constrain configuration change behaviors: only adding one peer or removing one peer is allowed (a rough pre-check sketch is at the end of this comment).
  2. When changing from 1 peer to 2 peers, the old peer can become the leader even if the new peer has not voted for it. The new peer can become the leader only if the old peer voted for it.

By rule 1, we reject all cases that may lead to infinite leader election, except 'adding 1 new peer to a 1 peer cluster'. By rule 2, we fix the 'adding 1 new peer to a 1 peer cluster' case.

We can show that the procedure causes neither split brain nor infinite leader election. The combinations of old-peer and new-peer configurations are shown below.

old peer        | new peer        | Split brain | Infinite leader election | Leader can win
C_old           | empty           | no          | no                       | yes
C_old_and_C_new | empty           | no          | no                       | yes, because of rule 2
C_old_and_C_new | C_old_and_C_new | no          | no                       | yes
C_new           | C_old_and_C_new | no          | no                       | yes
C_old_and_C_new | C_new           | no          | no                       | yes
C_new           | C_new           | no          | no                       | yes

The advantages are:

  1. It allows a new peer to start with an empty conf, which is simple.
  2. It simplifies configuration change behaviors to only 'add 1 peer' and 'remove 1 peer'.
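
A hypothetical pre-check for rule 1 (illustrative names, plain string ids; not actual Ratis code) could reject any setConfiguration that changes more than one peer:

import java.util.HashSet;
import java.util.Set;

final class SingleChangeCheck {
  // Rule 1: a single setConfiguration may add one peer or remove one peer, nothing more.
  static void checkSingleChange(Set<String> oldConf, Set<String> newConf) {
    final Set<String> added = new HashSet<>(newConf);
    added.removeAll(oldConf);
    final Set<String> removed = new HashSet<>(oldConf);
    removed.removeAll(newConf);
    if (added.size() + removed.size() > 1) {
      throw new IllegalArgumentException(
          "Only one peer may be added or removed per setConfiguration: added="
              + added + ", removed=" + removed);
    }
  }
}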

@szetszwo
Contributor

How about just restarting the crashed new peer, ...

Restarting is a good workaround. However, it needs someone (a human or a program) to monitor it. Otherwise, the cluster becomes unavailable due to failing to elect a leader.

If we time out the setConf, then the cluster can recover automatically.

  1. Simplifying configuration change behaviors to only 'add 1 peer' and 'remove 1 peer'.

More generally, we should allow changing a minority set of peers in one setConf command, except for the setConf from 1 peer to 2 peers.

@wojiaodoubao
Contributor Author

Thanks @szetszwo for your nice suggestion! Let me try to fix it.

@wojiaodoubao
Contributor Author

The fix patch is uploaded at #954. @szetszwo please have a look when you have time, thanks.

@szetszwo
Contributor

@wojiaodoubao , thanks for working on the fix! Will review it.

More generally, we should allow changing a minority set of peers in one setConf command, except for the setConf from 1 peer to 2 peers.

For removing peers, it seems okay to allow removing any number of peers. The problem is only in adding or replacing peers.

@szetszwo
Contributor

szetszwo commented Nov 5, 2023

Closing this since #954 is merged.

@szetszwo closed this Nov 5, 2023