
Conversation

@Ngone51
Member

@Ngone51 Ngone51 commented May 9, 2019

What changes were proposed in this pull request?

Standalone HA Background

In Spark Standalone HA mode, we have multiple masters running at the same time, but only one master leader, which actively serves scheduling requests. Once this master leader crashes, the other masters compete for leadership, and exactly one of them is guaranteed to be elected as the new master leader; it reconstructs the state from the original master leader and continues to serve scheduling requests.

Related Issues

#2828 first introduced the bug of duplicate Worker registration, and #3447 fixed it. But there are still corner cases (see SPARK-23191 for details) that #3447 does not cover:

  • CASE 1
    (1) Initially, Worker registers with Master A.
    (2) After a while, the connection channel between Master A and Worker becomes inactive (e.g. due to a network drop), and Worker is notified of that via onDisconnected from NettyRpcEnv.
    (3) When Worker invokes onDisconnected, it attempts to reconnect to all masters (including Master A).
    (4) Meanwhile, the network between Worker and Master A recovers, and Worker successfully registers with Master A again.
    (5) Master A responds with RegisterWorkerFailed("Duplicate worker ID").
    (6) Worker receives that message and exits.

  • CASE 2
    (1) Master A loses leadership (sends RevokedLeadership to itself). Master B takes over, recovers everything from Master A (which registers the workers in Master B for the first time), and sends MasterChanged to Worker.
    (2) Before Master A processes RevokedLeadership, it receives a late Heartbeat from Worker (which had previously been removed in Master A due to heartbeat timeout), so it sends ReconnectWorker to the Worker.
    (3) Worker receives MasterChanged before ReconnectWorker, changing its masterRef to Master B.
    (4) Subsequently, Worker receives ReconnectWorker from Master A, so it reconnects to all masters.
    (5) Master B receives a register request from the Worker again and responds with RegisterWorkerFailed("Duplicate worker ID").
    (6) Worker receives that message and exits.

In CASE 1, it is difficult for the Worker to know Master A's state. Normally, the Worker thinks Master A has already died and does not expect Master A to respond to its re-registration request.

In CASE 2, we can see a race condition between RevokedLeadership and Heartbeat. Actually, Master A's leadership has already been revoked while it is still processing the Heartbeat message. That means the state between the Master and ZooKeeper can be out of sync for a while.

Solutions

In this PR, instead of exiting the Worker process when a duplicate Worker registration happens, we suggest logging a warning about it. This is fine since the Master actually performs a no-op when it receives a duplicate registration from a Worker. In turn, the Worker can continue working with that Master normally, without any side effect.
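
For illustration, here is a minimal, self-contained sketch of the "warn instead of exit" behavior. The RegisteredWorker shape here (with a duplicate flag) is simplified for the sketch and is not the exact code in this patch:

// Sketch only: illustrates "warn instead of exit" on duplicate registration.
object DuplicateRegistrationSketch {
  final case class RegisteredWorker(masterUrl: String, duplicate: Boolean)

  def handleRegisteredWorker(msg: RegisteredWorker): Unit = {
    if (msg.duplicate) {
      // Previously the Worker would exit here ("Duplicate worker ID").
      // Since the Master treats a duplicate registration as a no-op,
      // a warning is enough and the Worker keeps running.
      println(s"WARN: Duplicate registration at master ${msg.masterUrl}")
    }
    println(s"INFO: Successfully registered with master ${msg.masterUrl}")
  }
}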

How was this patch tested?

Tested Manually.

I followed the steps Neeraj Gupta suggested in JIRA SPARK-23191 to reproduce CASE 1.

Before this PR, the Worker would show as DEAD in the UI.
After this PR, the Worker just warns about the duplicate registration (see the second-to-last row in the log snippet below) and still shows as ALIVE in the UI.

19/05/09 20:58:32 ERROR Worker: Connection to master failed! Waiting for master to reconnect...
19/05/09 20:58:32 INFO Worker: wuyi.local:7077 Disassociated !
19/05/09 20:58:32 INFO Worker: Connecting to master wuyi.local:7077...
19/05/09 20:58:32 ERROR Worker: Connection to master failed! Waiting for master to reconnect...
19/05/09 20:58:32 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.
19/05/09 20:58:37 WARN TransportClientFactory: DNS resolution for wuyi.local/127.0.0.1:7077 took 5005 ms
19/05/09 20:58:37 INFO TransportClientFactory: Found inactive connection to wuyi.local/127.0.0.1:7077, creating a new one.
19/05/09 20:58:37 INFO TransportClientFactory: Successfully created connection to wuyi.local/127.0.0.1:7077 after 3 ms (0 ms spent in bootstraps)
19/05/09 20:58:37 WARN Worker: Duplicate registration at master spark://wuyi.local:7077
19/05/09 20:58:37 INFO Worker: Successfully registered with master spark://wuyi.local:7077

@SparkQA

SparkQA commented May 9, 2019

Test build #105285 has finished for PR 24569 at commit 94a866d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51
Member Author

Ngone51 commented May 10, 2019

cc @cloud-fan

@cloud-fan
Contributor

can you briefly explain how master HA works? I'm afraid not many people are familiar with this part.

@Ngone51
Member Author

Ngone51 commented May 14, 2019

@cloud-fan updated in desc.

@cloud-fan
Contributor

Master A receives a late HeartBeat from Worker, asking it to reconnect

worker asks master to reconnect? why?

@Ngone51
Member Author

Ngone51 commented May 14, 2019

worker asks master to reconnect? why?

hmm... what I really meant is that the Master asks the Worker to reconnect to the Master. I've updated the description. Thanks.

}

if (duplicate) {
  logWarning(s"Duplicate registration at master $preferredMasterAddress")
Contributor

would be good to include the possible reasons for this case, so that end users know what's going on.

Member Author

Good idea, I'll update it.

@cloud-fan
Contributor

Master A receives a late HeartBeat from Worker(which had been removed in Master due to heartbeat timeout previously), asking the worker to reconnect

Is it possible that master A knows he is not the active master?

@Ngone51
Member Author

Ngone51 commented May 14, 2019

Is it possible that master A knows he is not the active master?

IIRC, currently we cannot know that directly. Maybe we can infer it from the Master's state, but that could also be unreliable.

@cloud-fan
Contributor

cloud-fan commented May 14, 2019

Then how do the workers know which master is active?

@Ngone51
Member Author

Ngone51 commented May 14, 2019

Then how do the workers know which master is active?

hmmm... the Worker itself does not know which master is active whenever it is not connected to a particular master. It just tries to connect to all masters and waits for the masters' replies.

And if a master fails in the leader election, its state remains STANDBY. So, when it receives the Worker's register request, the standby master ignores the request and just tells the Worker, "I'm in standby". In turn, the Worker also ignores this message (it continues to wait for the active one's response).

If a master is elected as leader successfully, its state changes to ALIVE. So, when it receives the Worker's register request, the alive (or active) master registers the Worker and responds with RegisteredWorker. When the Worker receives RegisteredWorker, it changes its masterRef (RpcEndpointRef) to the active one. Afterwards, the Worker can communicate with the active master via the masterRef directly.

And if the active master crashes, the Worker retries connecting to all masters (the behavior may differ slightly after #3447). Then the above process repeats.
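
To illustrate that handshake, here is a small self-contained sketch. The names mirror Spark's messages (MasterInStandby, RegisteredWorker) and states (STANDBY, ALIVE), but this is only an illustration, not the actual Master/Worker code:

object HandshakeSketch {
  sealed trait RegisterReply
  case object MasterInStandby extends RegisterReply
  final case class RegisteredWorker(masterUrl: String) extends RegisterReply

  sealed trait MasterState
  case object Standby extends MasterState
  case object Alive extends MasterState

  // Standby masters refuse the registration; only the alive master replies
  // with RegisteredWorker, and the worker then points its masterRef at it.
  def handleRegisterWorker(state: MasterState, masterUrl: String): RegisterReply =
    state match {
      case Standby => MasterInStandby
      case Alive   => RegisteredWorker(masterUrl)
    }

  // Worker side: send the request to every master and keep the alive reply.
  def findActiveMaster(masters: Map[String, MasterState]): Option[String] =
    masters.collectFirst {
      case (url, state) if handleRegisterWorker(state, url) == RegisteredWorker(url) => url
    }
}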

Now, I'm thinking that we actually can tell whether a master is active just by checking whether its state is STANDBY.

But in CASE 2, even if we could recognize whether a master is active, we still might not be able to avoid step (2).

See the log snippet (details in JIRA SPARK-23191) provided by @zuotingbing:

2019-03-15 20:22:09,441 INFO ZooKeeperLeaderElectionAgent: We have lost leadership
2019-03-15 20:22:14,544 WARN Master: Removing worker-20190218183101-vmax18-33129 because we got no heartbeat in 60 seconds
2019-03-15 20:22:14,544 INFO Master: Removing worker worker-20190218183101-vmax18-33129 on vmax18:33129
2019-03-15 20:22:14,864 WARN Master: Got heartbeat from unregistered worker worker-20190218183101-vmax18-33129. Asking it to re-register.
2019-03-15 20:22:14,975 ERROR Master: Leadership has been revoked -- master shutting down.

In the log, it looks like we have a race condition between ZooKeeperLeaderElectionAgent and Master. When the Master receives the late heartbeat, it is still active; but almost simultaneously, it becomes inactive.

@cloud-fan
Contributor

If we rely on ZooKeeper to do leader election, why don't the workers ask ZooKeeper for the leader info? Reconnecting to all masters looks wrong to me, as it's likely that more than one master thinks it is active (the communication between ZooKeeper and masters has delays).

@cloud-fan
Contributor

BTW in the case 2, I'm still a little confused.

Master B receives register request again from the Worker

When does the first registration happen? in which step?

@Ngone51
Member Author

Ngone51 commented May 14, 2019

When does the first registration happen? in which step?

When Master B takes over from Master A (i.e., it is elected as the new active master), it recovers all info (including registered workers) from Master A. The recovery process triggers the first registration silently. I'll update the description, thanks.
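
Roughly, the effect is like the following sketch (illustration only, not the actual Master recovery code): during recovery the new leader re-creates worker entries from the persisted state, which is the silent "first registration". A later explicit RegisterWorker from the same worker then looks like a duplicate ID.

object RecoverySketch {
  final case class WorkerInfo(id: String, host: String, port: Int)

  def beginRecovery(persistedWorkers: Seq[WorkerInfo],
                    idToWorker: scala.collection.mutable.Map[String, WorkerInfo]): Unit = {
    persistedWorkers.foreach { w =>
      // Registering here means a later RegisterWorker with the same id
      // will be seen as a duplicate by the new master.
      idToWorker(w.id) = w
    }
  }
}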

@Ngone51
Member Author

Ngone51 commented May 14, 2019

If we rely on ZooKeeper to do leader election, why don't the workers ask ZooKeeper for the leader info?

This may be another implementation strategy. But if we have workers ask ZooKeeper directly for the leader info, would there be too many concurrent ZooKeeper clients when we have plenty of workers? It may also put pressure on the network. (And I'm not sure ZooKeeper provides an API for leader info directly.)

Reconnecting to all masters looks wrong to me, as it's likely that more than one master thinks it is active (the communication between ZooKeeper and masters has delays).

hmmm... actually, it seems to me that it is the worker, rather than the masters, that thinks more than one master is active. The masters themselves know well whether they are active (though with a little delay).

case ReconnectWorker(masterUrl) =>
logInfo(s"Master with url $masterUrl requested this worker to reconnect.")
registerWithMaster()
if (masterUrl != activeMasterUrl) {
Contributor

@cloud-fan cloud-fan May 14, 2019

This is the part that worries me most. We need to be super sure that it's 100% safe.

It looks to me that

  1. activeMasterUrl is the master that this worker thinks is active. This may not always be the actual active master. For example, when ZooKeeper finishes leader election, there is a time window before this worker gets the MasterChanged message.
  2. ReconnectWorker may be sent by a standby master, as you explained in the PR description.

Can we list all the cases that a ReconnectWorker is sent, and make sure we don't miss valid cases here?

Contributor

The thing I want to avoid is, the worker ignores the ReconnectWorker message from the actual active master.

Member Author

  1. ReconnectWorker may be sent by a standby master, as you explained in the PR description.

I got the step order wrong in the CASE 2 description (now revised), sorry about that. Actually, while sending ReconnectWorker, Master A is still active but is about to die very soon (due to the race condition mentioned above).

Actually, there's no doubt that the ReconnectWorker(master) message must come from an active Master.
So, when the Worker receives that message from Master X, the cases are:

  1. Master X is active
    1.1) Master X is the initial active master (no MasterChanged message)
      1.1.1) master == activeMasterUrl
        just reconnect to (all) masters
      1.1.2) master != activeMasterUrl
        impossible case
    1.2) Master X has been elected as the new active master
      1.2.1) master == activeMasterUrl (MasterChanged comes before ReconnectWorker)
        just reconnect to (all) masters
      1.2.2) master != activeMasterUrl (MasterChanged comes after ReconnectWorker)
        seems very unlikely, but it can be a valid case as you mentioned above. In this case,
        we'll always ignore the reconnect message until we receive MasterChanged.
  2. Master X is inactive, and Master Y takes over after Master X sends ReconnectWorker
    2.1) master == activeMasterUrl (MasterChanged from Y comes after ReconnectWorker from X)
      the active master has changed, but the Worker hasn't realized it yet. It will still try to
      reconnect to (all) masters. In this case (contrary to CASE 2), we'll hit the duplicate registration issue.
    2.2) master != activeMasterUrl (MasterChanged from Y comes before ReconnectWorker from X)
      ignore it, since the Worker has already changed its active master to Master Y.

Since this PR changes the result of a duplicate worker registration from exit to warn, I think it's OK to remove this condition check here. The worst outcome of accepting ReconnectWorker is a duplicate registration with the active master, which is covered by this PR's fix.
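
For illustration, the handler shape without the extra check would look roughly like this sketch (registerWithMaster is passed in here just to keep the sketch self-contained; this is not the final code):

object ReconnectSketch {
  // With duplicate registration downgraded to a warning, the handler can
  // simply re-register unconditionally; a spurious ReconnectWorker from a
  // dying master at worst causes a harmless duplicate registration.
  def onReconnectWorker(masterUrl: String, registerWithMaster: () => Unit): Unit = {
    println(s"INFO: Master with url $masterUrl requested this worker to reconnect.")
    registerWithMaster()
  }
}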

@zuotingbing

BTW in the case 2, I'm still a little confused.

Master B receives register request again from the Worker

When does the first registration happen? in which step?

The master changed from vmax18 to vmax17.

In master vmax18, the worker was removed because it got no heartbeat, but a late heartbeat arrived soon after, so vmax18 asked the worker to re-register (which calls tryRegisterAllMasters(), including master vmax17).

At the same time, the worker had already been registered with master vmax17 when vmax17 got leadership.

So worker registration failed with "Duplicate worker ID" on master vmax17 (it registered with master vmax17 twice).

@cloud-fan
Contributor

cloud-fan commented May 15, 2019

I think we need to go back to the design and think about how to fix the root cause. The information I have is:

  1. ZooKeeper knows who the actual leader is and is the single source of truth
  2. masters have states (standby, active), which should eventually be consistent with ZooKeeper, but can be out of sync for a while
  3. each worker keeps the url of the active master, which should eventually be consistent with ZooKeeper, but can be out of sync for a while

The worker searches for the active master by sending messages to all masters when:

  1. it starts up
  2. the heartbeat times out (master disconnect)
  3. one master sends the ReconnectWorker message

There are 2 problems I see:

  1. sending messages to all masters may not find the active master, as it's possible that more than one master thinks it's the leader
  2. a non-active master may also send the ReconnectWorker message.

@SparkQA

SparkQA commented May 21, 2019

Test build #105629 has finished for PR 24569 at commit 90b2655.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class MasterInRevoking(masterUrl: String) extends DeployMessage

@SparkQA

SparkQA commented May 22, 2019

Test build #105658 has finished for PR 24569 at commit 2af34b7.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 22, 2019

Test build #105661 has finished for PR 24569 at commit a355b42.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51
Member Author

Ngone51 commented May 23, 2019

Hi @cloud-fan, I've updated the code, could you have another look please?

logWarning(s"Master with url $masterUrl is being revoked, current active" +
s" masterUrl $activeMasterUrl.")
if (masterUrl == activeMasterUrl) {
registerWithMaster()
Member Author

I'm hesitant about this part. Do we really need to registerWithMaster() when masterUrl == activeMasterUrl? MasterChanged hasn't arrived at this point, but it will come soon because a new Master has been elected. What's more, it is highly likely that we'll hit a duplicate worker registration since the new leader master is in recovery.

Maybe we should process this message just like we do for MasterInStandby?

We need to think more about it.

Ngone51 added 2 commits May 27, 2019 19:17
This reverts commit a355b42.
This reverts commit 2af34b7.
@SparkQA

SparkQA commented May 27, 2019

Test build #105834 has finished for PR 24569 at commit 9c5ba98.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  workerRef.send(MasterInStandby)
} else if (idToWorker.contains(id)) {
  workerRef.send(RegisterWorkerFailed("Duplicate worker ID"))
  workerRef.send(RegisteredWorker(self, masterWebUiUrl, masterAddress, true))
Contributor

Do we have to send a signal back to the worker? Can we just log a warning here and reply with nothing?

Member Author

Yes, we have to. Each time the worker tries to connect to the masters, it sets registered=false and only sets it back to true when it receives the master's response (RegisteredWorker).
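
Roughly, the flag lifecycle looks like this sketch (not the actual Worker fields, just an illustration of why the Master must always reply):

class RegistrationStateSketch {
  // Reset whenever the worker (re)connects to the masters, and only set
  // again once a RegisteredWorker reply arrives.
  @volatile private var registered = false

  def startConnectingToMasters(): Unit = { registered = false }
  def onRegisteredWorker(): Unit = { registered = true }
  def isRegistered: Boolean = registered
}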

} else {
logInfo("Successfully registered with master " + masterRef.address.toSparkURL)
case RegisteredWorker(masterRef, masterWebUiUrl, masterAddress, duplicate) =>
val preferredMasterAddress = {
Contributor

nit:

val preferredMasterAddress = if (..) {
  ..
} else {
  ..
}

Member Author

updated, thanks.

@cloud-fan
Contributor

LGTM

@SparkQA

SparkQA commented May 27, 2019

Test build #105844 has finished for PR 24569 at commit c77f140.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 826ee60 May 28, 2019
@Ngone51
Member Author

Ngone51 commented May 28, 2019

Thank you @cloud-fan
