
Conversation

@Ngone51
Member

@Ngone51 Ngone51 commented May 9, 2019

What changes were proposed in this pull request?

Standalone HA Background

In Spark Standalone HA mode, we have multiple masters running at the same time, but only one master leader, which actively serves scheduling requests. Once this master leader crashes, the other masters compete for leadership, and exactly one of them is guaranteed to be elected as the new master leader; it reconstructs the state from the original master leader and continues to serve scheduling requests.

Related Issues

#2828 first introduced the bug of duplicate Worker registration, and #3447 fixed it. But there are still corner cases (see SPARK-23191 for details) that #3447 does not cover:

  • CASE 1
    (1) Initially, Worker registers with Master A.
    (2) After a while, the connection channel between Master A and Worker becomes inactive (e.g. due to a network drop), and Worker is notified of that via onDisconnected from NettyRpcEnv.
    (3) When Worker invokes onDisconnected, it attempts to reconnect to all masters (including Master A).
    (4) Meanwhile, the network between Worker and Master A recovers, and Worker successfully registers with Master A again.
    (5) Master A responds with RegisterWorkerFailed("Duplicate worker ID").
    (6) Worker receives that message and exits.

  • CASE 2
    (1) Master A loses leadership (sends RevokedLeadership to itself). Master B takes over, recovers everything from Master A (which registers the workers in Master B for the first time), and sends MasterChanged to Worker.
    (2) Before Master A processes RevokedLeadership, it receives a late Heartbeat from Worker (which had previously been removed in Master A due to heartbeat timeout), so it sends ReconnectWorker to the Worker.
    (3) Worker receives MasterChanged before ReconnectWorker, changing its masterRef to Master B.
    (4) Subsequently, Worker receives ReconnectWorker from Master A, so it reconnects to all masters.
    (5) Master B receives a register request from the Worker again and responds with RegisterWorkerFailed("Duplicate worker ID").
    (6) Worker receives that message and exits.

In CASE 1, it is difficult for the Worker to know Master A's state. Normally, the Worker thinks Master A has already died and does not expect Master A to respond to its re-registration request.

In CASE 2, we can see a race condition between RevokedLeadership and Heartbeat. Actually, Master A's leadership has already been revoked while it is still processing the Heartbeat message. That means the state between the Master and ZooKeeper can be out of sync for a while.

Solutions

In this PR, instead of exiting the Worker process when a duplicate Worker registration happens, we suggest logging a warning about it. This is fine since the Master actually performs a no-op when it receives a duplicate registration from a Worker. In turn, the Worker can continue working with that Master normally, without any side effect.
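
For illustration, here is a minimal, self-contained sketch of the "warn instead of exit" behavior. The RegisteredWorker shape here (with a duplicate flag) is simplified for the sketch and is not the exact code in this patch:

// Sketch only: illustrates "warn instead of exit" on duplicate registration.
object DuplicateRegistrationSketch {
  final case class RegisteredWorker(masterUrl: String, duplicate: Boolean)

  def handleRegisteredWorker(msg: RegisteredWorker): Unit = {
    if (msg.duplicate) {
      // Previously the Worker would exit here ("Duplicate worker ID").
      // Since the Master treats a duplicate registration as a no-op,
      // a warning is enough and the Worker keeps running.
      println(s"WARN: Duplicate registration at master ${msg.masterUrl}")
    }
    println(s"INFO: Successfully registered with master ${msg.masterUrl}")
  }
}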

How was this patch tested?

Tested Manually.

I followed the steps Neeraj Gupta suggested in JIRA SPARK-23191 to reproduce CASE 1.

Before this PR, the Worker would show as DEAD in the UI.
After this PR, the Worker just warns about the duplicate registration (see the second-to-last row in the log snippet below) and still shows as ALIVE in the UI.

19/05/09 20:58:32 ERROR Worker: Connection to master failed! Waiting for master to reconnect...
19/05/09 20:58:32 INFO Worker: wuyi.local:7077 Disassociated !
19/05/09 20:58:32 INFO Worker: Connecting to master wuyi.local:7077...
19/05/09 20:58:32 ERROR Worker: Connection to master failed! Waiting for master to reconnect...
19/05/09 20:58:32 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.
19/05/09 20:58:37 WARN TransportClientFactory: DNS resolution for wuyi.local/127.0.0.1:7077 took 5005 ms
19/05/09 20:58:37 INFO TransportClientFactory: Found inactive connection to wuyi.local/127.0.0.1:7077, creating a new one.
19/05/09 20:58:37 INFO TransportClientFactory: Successfully created connection to wuyi.local/127.0.0.1:7077 after 3 ms (0 ms spent in bootstraps)
19/05/09 20:58:37 WARN Worker: Duplicate registration at master spark://wuyi.local:7077
19/05/09 20:58:37 INFO Worker: Successfully registered with master spark://wuyi.local:7077

@SparkQA

SparkQA commented May 9, 2019

Test build #105285 has finished for PR 24569 at commit 94a866d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51
Member Author

Ngone51 commented May 10, 2019

cc @cloud-fan

@cloud-fan
Contributor

can you briefly explain how master HA works? I'm afraid not many people are familiar with this part.

@Ngone51
Member Author

Ngone51 commented May 14, 2019

@cloud-fan updated in desc.

@cloud-fan
Contributor

Master A receives a late HeartBeat from Worker, asking it to reconnect

worker asks master to reconnect? why?

@Ngone51
Member Author

Ngone51 commented May 14, 2019

worker asks master to reconnect? why?

hmm... what I really meant is that the Master asks the Worker to reconnect to the Master. I've updated the description. Thanks.

}

if (duplicate) {
  logWarning(s"Duplicate registration at master $preferredMasterAddress")
Contributor

would be good to include the possible reasons for this case, so that end users know what's going on.

Member Author

Good idea, I'll update it.

@cloud-fan
Contributor

Master A receives a late HeartBeat from Worker(which had been removed in Master due to heartbeat timeout previously), asking the worker to reconnect

Is it possible that master A knows he is not the active master?

@Ngone51
Member Author

Ngone51 commented May 14, 2019

Is it possible that master A knows he is not the active master?

IIRC, currently we cannot know that directly. Maybe we can infer it from the Master's state, but that could also be unreliable.

@cloud-fan
Contributor

cloud-fan commented May 14, 2019

Then how do the workers know which master is active?

@Ngone51
Member Author

Ngone51 commented May 14, 2019

Then how do the workers know which master is active?

hmmm... the Worker itself does not know which master is active whenever it is not connected to a particular master. It just tries to connect to all masters and waits for the masters' replies.

And if a master fails in the leader election, its state remains STANDBY. So, when it receives the Worker's register request, the standby master ignores the request and just tells the Worker, "I'm in standby". In turn, the Worker also ignores this message (it continues to wait for the active one's response).

If a master is elected as leader successfully, its state changes to ALIVE. So, when it receives the Worker's register request, the alive (or active) master registers the Worker and responds with RegisteredWorker. When the Worker receives RegisteredWorker, it changes its masterRef (RpcEndpointRef) to the active one. Afterwards, the Worker can communicate with the active master via the masterRef directly.

And if the active master crashes, the Worker retries connecting to all masters (the behavior may differ slightly after #3447). Then the above process repeats.
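
To illustrate that handshake, here is a small self-contained sketch. The names mirror Spark's messages (MasterInStandby, RegisteredWorker) and states (STANDBY, ALIVE), but this is only an illustration, not the actual Master/Worker code:

object HandshakeSketch {
  sealed trait RegisterReply
  case object MasterInStandby extends RegisterReply
  final case class RegisteredWorker(masterUrl: String) extends RegisterReply

  sealed trait MasterState
  case object Standby extends MasterState
  case object Alive extends MasterState

  // Standby masters refuse the registration; only the alive master replies
  // with RegisteredWorker, and the worker then points its masterRef at it.
  def handleRegisterWorker(state: MasterState, masterUrl: String): RegisterReply =
    state match {
      case Standby => MasterInStandby
      case Alive   => RegisteredWorker(masterUrl)
    }

  // Worker side: send the request to every master and keep the alive reply.
  def findActiveMaster(masters: Map[String, MasterState]): Option[String] =
    masters.collectFirst {
      case (url, state) if handleRegisterWorker(state, url) == RegisteredWorker(url) => url
    }
}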

Now, I'm thinking that we actually can tell whether a master is active just by checking whether its state is STANDBY.

But in CASE 2, even if we could recognize whether a master is active, we still might not be able to avoid step (2).

See the log snippet (details in JIRA SPARK-23191) provided by @zuotingbing:

2019-03-15 20:22:09,441 INFO ZooKeeperLeaderElectionAgent: We have lost leadership
2019-03-15 20:22:14,544 WARN Master: Removing worker-20190218183101-vmax18-33129 because we got no heartbeat in 60 seconds
2019-03-15 20:22:14,544 INFO Master: Removing worker worker-20190218183101-vmax18-33129 on vmax18:33129
2019-03-15 20:22:14,864 WARN Master: Got heartbeat from unregistered worker worker-20190218183101-vmax18-33129. Asking it to re-register.
2019-03-15 20:22:14,975 ERROR Master: Leadership has been revoked -- master shutting down.

In the log, it looks like we have a race condition between ZooKeeperLeaderElectionAgent and Master. When the Master receives the late heartbeat, it is still active; but almost simultaneously, it becomes inactive.

@cloud-fan
Contributor

If we rely on ZooKeeper to do leader election, why don't the workers ask ZooKeeper for the leader info? Reconnecting to all masters looks wrong to me, as it's likely that more than one master thinks it is active (the communication between ZooKeeper and masters has delays).

@cloud-fan
Contributor

BTW in the case 2, I'm still a little confused.

Master B receives register request again from the Worker

When does the first registration happen? in which step?

@Ngone51
Member Author

Ngone51 commented May 14, 2019

When does the first registration happen? in which step?

When Master B takes over from Master A (i.e., it is elected as the new active master), it recovers all info (including registered workers) from Master A. The recovery process triggers the first registration silently. I'll update the description, thanks.
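
Roughly, the effect is like the following sketch (illustration only, not the actual Master recovery code): during recovery the new leader re-creates worker entries from the persisted state, which is the silent "first registration". A later explicit RegisterWorker from the same worker then looks like a duplicate ID.

object RecoverySketch {
  final case class WorkerInfo(id: String, host: String, port: Int)

  def beginRecovery(persistedWorkers: Seq[WorkerInfo],
                    idToWorker: scala.collection.mutable.Map[String, WorkerInfo]): Unit = {
    persistedWorkers.foreach { w =>
      // Registering here means a later RegisterWorker with the same id
      // will be seen as a duplicate by the new master.
      idToWorker(w.id) = w
    }
  }
}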

@Ngone51
Member Author

Ngone51 commented May 14, 2019

If we rely on ZooKeeper to do leader election, why don't the workers ask ZooKeeper for the leader info?

This may be another implementation strategy. But if we have workers ask ZooKeeper directly for the leader info, would there be too many concurrent ZooKeeper clients when we have plenty of workers? It may also put pressure on the network. (And I'm not sure ZooKeeper provides an API for leader info directly.)

Reconnecting to all masters looks wrong to me, as it's likely that more than one master thinks it is active (the communication between ZooKeeper and masters has delays).

hmmm... actually, it seems to me that it is the worker, rather than the masters, that thinks more than one master is active. The masters themselves know well whether they are active (though with a little delay).

case ReconnectWorker(masterUrl) =>
logInfo(s"Master with url $masterUrl requested this worker to reconnect.")
registerWithMaster()
if (masterUrl != activeMasterUrl) {
Contributor

@cloud-fan cloud-fan May 14, 2019

This is the part that worries me most. We need to be super sure that it's 100% safe.

It looks to me that

  1. activeMasterUrl is the master that this worker thinks is active. This may not always be the actual active master. For example, when ZooKeeper finishes leader election, there is a time window before this worker gets the MasterChanged message.
  2. ReconnectWorker may be sent by a standby master, as you explained in the PR description.

Can we list all the cases that a ReconnectWorker is sent, and make sure we don't miss valid cases here?

Contributor

The thing I want to avoid is, the worker ignores the ReconnectWorker message from the actual active master.

Member Author

  1. ReconnectWorker may be sent by a standby master, as you explained in the PR description.

I got the step order wrong in the CASE 2 description (now revised), sorry about that. Actually, while sending ReconnectWorker, Master A is still active but is about to die very soon (due to the race condition mentioned above).

Actually, there's no doubt that the ReconnectWorker(master) message must come from an active Master.
So, when the Worker receives that message from Master X, the cases are:

  1. Master X is active
    1.1) Master X is the initial active master (no MasterChanged message)
      1.1.1) master == activeMasterUrl
        just reconnect to (all) masters
      1.1.2) master != activeMasterUrl
        impossible case
    1.2) Master X has been elected as the new active master
      1.2.1) master == activeMasterUrl (MasterChanged comes before ReconnectWorker)
        just reconnect to (all) masters
      1.2.2) master != activeMasterUrl (MasterChanged comes after ReconnectWorker)
        seems very unlikely, but it can be a valid case as you mentioned above. In this case,
        we'll always ignore the reconnect message until we receive MasterChanged.
  2. Master X is inactive, and Master Y takes over after Master X sends ReconnectWorker
    2.1) master == activeMasterUrl (MasterChanged from Y comes after ReconnectWorker from X)
      the active master has changed, but the Worker hasn't realized it yet. It will still try to
      reconnect to (all) masters. In this case (contrary to CASE 2), we'll hit the duplicate registration issue.
    2.2) master != activeMasterUrl (MasterChanged from Y comes before ReconnectWorker from X)
      ignore it, since the Worker has already changed its active master to Master Y.

Since this PR changes the result of a duplicate worker registration from exit to warn, I think it's OK to remove this condition check here. The worst outcome of accepting ReconnectWorker is a duplicate registration with the active master, which is covered by this PR's fix.
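
For illustration, the handler shape without the extra check would look roughly like this sketch (registerWithMaster is passed in here just to keep the sketch self-contained; this is not the final code):

object ReconnectSketch {
  // With duplicate registration downgraded to a warning, the handler can
  // simply re-register unconditionally; a spurious ReconnectWorker from a
  // dying master at worst causes a harmless duplicate registration.
  def onReconnectWorker(masterUrl: String, registerWithMaster: () => Unit): Unit = {
    println(s"INFO: Master with url $masterUrl requested this worker to reconnect.")
    registerWithMaster()
  }
}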

@zuotingbing

BTW in the case 2, I'm still a little confused.

Master B receives register request again from the Worker

When does the first registration happen? in which step?

The master changed from vmax18 to vmax17.

In master vmax18, the worker was removed because it got no heartbeat, but a late heartbeat arrived soon after, so vmax18 asked the worker to re-register (which calls tryRegisterAllMasters(), including master vmax17).

At the same time, the worker had already been registered with master vmax17 when vmax17 got leadership.

So worker registration failed with "Duplicate worker ID" on master vmax17 (it registered with master vmax17 twice).

@cloud-fan
Contributor

cloud-fan commented May 15, 2019

I think we need to go back to the design and think about how to fix the root cause. The information I have is:

  1. ZooKeeper knows who the actual leader is and is the single source of truth
  2. masters have states (standby, active), which should eventually be consistent with ZooKeeper, but can be out of sync for a while
  3. each worker keeps the url of the active master, which should eventually be consistent with ZooKeeper, but can be out of sync for a while

The worker searches for the active master by sending messages to all masters when:

  1. it starts up
  2. the heartbeat times out (master disconnect)
  3. one master sends the ReconnectWorker message

There are 2 problems I see:

  1. sending messages to all masters may not find the active master, as it's possible that more than one master thinks it's the leader
  2. a non-active master may also send the ReconnectWorker message.

@SparkQA

SparkQA commented May 21, 2019

Test build #105629 has finished for PR 24569 at commit 90b2655.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class MasterInRevoking(masterUrl: String) extends DeployMessage

@SparkQA

SparkQA commented May 22, 2019

Test build #105658 has finished for PR 24569 at commit 2af34b7.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 22, 2019

Test build #105661 has finished for PR 24569 at commit a355b42.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51
Member Author

Ngone51 commented May 23, 2019

Hi @cloud-fan, I've updated the code, could you have another look please?

logWarning(s"Master with url $masterUrl is being revoked, current active" +
s" masterUrl $activeMasterUrl.")
if (masterUrl == activeMasterUrl) {
registerWithMaster()
Member Author

I'm hesitant about this part. Do we really need to registerWithMaster() when masterUrl == activeMasterUrl? MasterChanged hasn't arrived at this point, but it will come soon because a new Master has been elected. What's more, it is highly likely that we'll hit a duplicate worker registration since the new leader master is in recovery.

Maybe we should process this message just like we do for MasterInStandby?

We need to think more about it.

Ngone51 added 2 commits May 27, 2019 19:17
This reverts commit a355b42.
This reverts commit 2af34b7.
@SparkQA

SparkQA commented May 27, 2019

Test build #105834 has finished for PR 24569 at commit 9c5ba98.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  workerRef.send(MasterInStandby)
} else if (idToWorker.contains(id)) {
  workerRef.send(RegisterWorkerFailed("Duplicate worker ID"))
  workerRef.send(RegisteredWorker(self, masterWebUiUrl, masterAddress, true))
Contributor

Do we have to send a signal back to the worker? Can we just log a warning here and reply with nothing?

Member Author

Yes, we have to. Each time the worker tries to connect to the masters, it sets registered=false and only sets it back to true when it receives the master's response (RegisteredWorker).
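
Roughly, the flag lifecycle looks like this sketch (not the actual Worker fields, just an illustration of why the Master must always reply):

class RegistrationStateSketch {
  // Reset whenever the worker (re)connects to the masters, and only set
  // again once a RegisteredWorker reply arrives.
  @volatile private var registered = false

  def startConnectingToMasters(): Unit = { registered = false }
  def onRegisteredWorker(): Unit = { registered = true }
  def isRegistered: Boolean = registered
}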

} else {
logInfo("Successfully registered with master " + masterRef.address.toSparkURL)
case RegisteredWorker(masterRef, masterWebUiUrl, masterAddress, duplicate) =>
val preferredMasterAddress = {
Contributor

nit:

val preferredMasterAddress = if (..) {
  ..
} else {
  ..
}

Member Author

updated, thanks.

@cloud-fan
Contributor

LGTM

@SparkQA

SparkQA commented May 27, 2019

Test build #105844 has finished for PR 24569 at commit c77f140.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 826ee60 May 28, 2019
@Ngone51
Member Author

Ngone51 commented May 28, 2019

Thank you @cloud-fan
