[SPARK-23191][CORE] Warn rather than terminate when duplicate worker register happens #24569
Conversation
Test build #105285 has finished for PR 24569 at commit
cc @cloud-fan
can you briefly explain how master HA works? I'm afraid not many people are familiar with this part.
@cloud-fan updated in the description.
worker asks master to reconnect? why?
hmm... what I actually meant is that the master asks the worker to reconnect to it. I've updated the description. Thanks.
}
if (duplicate) {
  logWarning(s"Duplicate registration at master $preferredMasterAddress")
would be good to include the possible reasons for this case, so that end users know what's going on.
Good idea, I'll update it.
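For illustration only, a minimal sketch of what the expanded warning could look like on the worker side; the wording below is an assumption, not the merged text:

```scala
// Hypothetical wording: spell out the common causes so end users know the
// warning is expected and harmless.
logWarning(s"Duplicate registration at master $preferredMasterAddress. " +
  "This can happen when the worker re-registers after a transient network issue, " +
  "or right after a master failover while the new leader is recovering state " +
  "from the old one.")
```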
Is it possible that master A knows it is not the active master?
IIRC, currently we cannot know that directly. Maybe we can infer it from the Master's state,
Then how do the workers know which master is active?
hmmm... the Worker itself does not really know the active master whenever it is not connected to a certain master. It just tries to connect to all masters and waits for the masters' replies. If a master fails in leader election, its state remains STANDBY; if a master is elected leader successfully, its state changes out of STANDBY. And if the active master crashes, the Worker retries connecting to all masters (the behavior may be a little different after #3447). Then the above process repeats.
Now, I'm thinking that we actually can tell whether a master is active just by checking whether its state is STANDBY. But in case 2, even though we could recognize whether a master is active, we may still not be able to avoid step (2). See the log snippets (details in JIRA SPARK-23191) provided by @zuotingbing: in the log, it looks like we have a race condition between ZooKeeperLeaderElectionAgent and Master. When the master receives a late heartbeat, it's still active. But, almost simultaneously, it becomes inactive.
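To make the "is this master active?" check concrete, here is a tiny self-contained sketch; the real enum lives in `org.apache.spark.deploy.master.RecoveryState` and is only reproduced here for illustration:

```scala
// Mirrors org.apache.spark.deploy.master.RecoveryState (reproduced for illustration).
object RecoveryState extends Enumeration {
  val STANDBY, ALIVE, RECOVERING, COMPLETING_RECOVERY = Value
}

object MasterStateCheck {
  // A master is the leader exactly when its state is not STANDBY,
  // which is the check discussed above.
  def isActive(state: RecoveryState.Value): Boolean =
    state != RecoveryState.STANDBY
}
```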
If we rely on ZooKeeper to do leader election, why don't the workers ask ZooKeeper for the leader info? Reconnecting to all masters looks wrong to me, as it's likely that more than one master thinks itself active (the communication between ZooKeeper and masters has delays).
BTW, in CASE 2, I'm still a little confused.
When does the first registration happen? In which step?
When Master B takes over from Master A (or rather, is elected to be the new active master), it will recover all info (including registered workers) from Master A. The recovery process triggers the first registration silently. I'll update the description, thanks.
This may be another implementation strategy. But if we ask workers to query ZooKeeper directly for the leader info, won't there be too many ZooKeeper clients running concurrently if we have plenty of workers? And it may put pressure on the network. (Also, I'm not sure ZooKeeper provides an API for leader info directly.)
hmmm... actually, it seems to me that it is the Worker, rather than the masters, that thinks more than one master is active. However, each master knows well whether it itself is active (though with a little delay).
case ReconnectWorker(masterUrl) =>
  logInfo(s"Master with url $masterUrl requested this worker to reconnect.")
  registerWithMaster()
  if (masterUrl != activeMasterUrl) {
This is the part that worries me most. We need to be super sure that it's 100% safe.
It looks to me that `activeMasterUrl` is the master that this worker thinks is active. This may not always be the actual active master. For example, when ZooKeeper finishes leader election, there is a time window before this worker gets the `MasterChanged` message. `ReconnectWorker` may be sent by a standby master, as you explained in the PR description.
Can we list all the cases in which a `ReconnectWorker` is sent, and make sure we don't miss valid cases here?
The thing I want to avoid is the worker ignoring the `ReconnectWorker` message from the actual active master.
> `ReconnectWorker` may be sent by a standby master, as you explained in the PR description.
I made a mistake in the PR description about the step order in CASE 2 (now revised). Sorry for that. Actually, while sending `ReconnectWorker`, Master A is still active but will quickly die (due to the race condition mentioned above).
Actually, there's no doubt that the `ReconnectWorker(master)` msg must come from an active Master.
So, when the Worker receives that msg from Master X, the cases would be:

1. Master X is active
   - 1.1) Master X is the initial active master (no `MasterChanged` msg)
     - 1.1.1) master == activeMasterUrl: just reconnect to (all) masters
     - 1.1.2) master != activeMasterUrl: impossible case
   - 1.2) Master X is elected to be the new active master
     - 1.2.1) master == activeMasterUrl (`MasterChanged` comes before `ReconnectWorker`): just reconnect to (all) masters
     - 1.2.2) master != activeMasterUrl (`MasterChanged` comes after `ReconnectWorker`): seems very unlikely, but can be a valid case as you mentioned above. In this case, we'll always ignore the reconnect msg until we receive `MasterChanged`.
2. Master X is inactive; Master Y takes over after Master X sends `ReconnectWorker`
   - 2.1) master == activeMasterUrl (`MasterChanged` from Y comes after `ReconnectWorker` from X): the active master has changed, but the Worker hasn't realized it yet. It will still try to reconnect to (all) masters. In this case (contrary to CASE 2), we'll hit the duplicate register issue.
   - 2.2) master != activeMasterUrl (`MasterChanged` from Y comes before `ReconnectWorker` from X): ignore it, since the Worker has already changed its active master to Master Y.

Since this PR changes the result of a duplicate worker registration from exit to warn, I think it's OK if we remove this condition check here. The worst result of accepting `ReconnectWorker` is a duplicate registration with the active master, which is covered by this PR's fix (see the sketch below).
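In other words, with warn-instead-of-exit the handler could be as simple as the following sketch (not necessarily the final diff), since the worst case of always re-registering is a duplicate registration that the master now simply acknowledges:

```scala
case ReconnectWorker(masterUrl) =>
  // Re-register unconditionally; a duplicate registration is now harmless.
  logInfo(s"Master with url $masterUrl requested this worker to reconnect.")
  registerWithMaster()
```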
The master changed from vmax18 to vmax17. In master vmax18, the worker was removed because it got no heartbeat, but soon a heartbeat arrived and the worker was asked to re-register with master vmax18 (it will `tryRegisterAllMasters()`, which includes master vmax17). At the same time, the worker had already been registered with master vmax17 when master vmax17 got leadership. So `Worker registration failed: Duplicate worker ID` with master vmax17 (registered with master vmax17 twice).
I think we need to go back to the design and think about how to fix the root cause. The information I have is:
The worker will search for the active master by sending messages to all masters, when
There are 2 problems I see:
Test build #105629 has finished for PR 24569 at commit
Test build #105658 has finished for PR 24569 at commit
Test build #105661 has finished for PR 24569 at commit
Hi @cloud-fan, I've updated the code, could you have another look please?
logWarning(s"Master with url $masterUrl is being revoked, current active" +
  s" masterUrl $activeMasterUrl.")
if (masterUrl == activeMasterUrl) {
  registerWithMaster()
I'm hesitant about this part. Do we really need to `registerWithMaster()` when `masterUrl == activeMasterUrl`? `MasterChanged` hasn't come at this time, but it will come soon because a Master has been re-elected. What's more, it is highly likely that we'll hit a duplicate worker registration, since the new leader master is still in recovery.
Maybe we should process this message just like `MasterInStandby` is processed?
We need to think more about it.
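A rough sketch of that alternative, assuming a hypothetical `MasterBeingRevoked(masterUrl)` message (the real message name is not shown in the diff above):

```scala
// Hypothetical message name. Mirrors how MasterInStandby is treated: do nothing
// and wait for MasterChanged from the newly elected leader, instead of
// re-registering eagerly while the new leader is still recovering (which risks
// a duplicate registration).
case MasterBeingRevoked(masterUrl) =>
  logInfo(s"Master with url $masterUrl is being revoked; " +
    "waiting for MasterChanged from the new leader.")
```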
This reverts commit 90b2655.
Test build #105834 has finished for PR 24569 at commit
  workerRef.send(MasterInStandby)
} else if (idToWorker.contains(id)) {
  workerRef.send(RegisterWorkerFailed("Duplicate worker ID"))
  workerRef.send(RegisteredWorker(self, masterWebUiUrl, masterAddress, true))
Do we have to send the signal back to the worker? Can we just log a warning here and reply nothing?
Yes, we have to. Each time the worker tries to connect to the masters, it sets `registered = false` and only sets it back to `true` when it receives the master's response (`RegisteredWorker`).
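A simplified sketch of that worker-side flow (field and message names abbreviated from `Worker.scala`; not a verbatim copy of the merged code):

```scala
// Every (re-)registration attempt starts with registered = false, so the worker
// depends on some reply from the master to flip it back.
private var registered = false

private def handleRegisterResponse(msg: RegisterWorkerResponse): Unit = msg match {
  case RegisteredWorker(masterRef, masterWebUiUrl, masterAddress, duplicate) =>
    // Only a positive reply marks the worker as registered, which is why the
    // master must answer even for a duplicate registration.
    if (duplicate) {
      logWarning(s"Duplicate registration at master $masterAddress")
    }
    registered = true

  case RegisterWorkerFailed(message) =>
    if (!registered) {
      logError("Worker registration failed: " + message)
      System.exit(1)
    }

  case MasterInStandby =>
    // Ignore: this master is not (yet) the leader.
}
```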
} else {
  logInfo("Successfully registered with master " + masterRef.address.toSparkURL)
case RegisteredWorker(masterRef, masterWebUiUrl, masterAddress, duplicate) =>
  val preferredMasterAddress = {
nit:
val preferredMasterAddress = if (..) {
  ..
} else {
  ..
}
updated, thanks.
LGTM

Test build #105844 has finished for PR 24569 at commit

thanks, merging to master!
Thank you @cloud-fan |
What changes were proposed in this pull request?
Standalone HA Background
In Spark Standalone HA mode, we have multiple masters running at the same time, but there is only one master leader, which actively serves scheduling requests. Once this master leader crashes, the other masters compete for leadership and exactly one master is guaranteed to be elected as the new leader, which reconstructs the state from the original leader and continues to serve scheduling requests.
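For context, ZooKeeper-based leader election in standalone mode is typically enabled by pointing every master at the same ZooKeeper ensemble; a sketch of the usual configuration (host names and paths below are placeholders, not values from this PR):

```
# spark-env.sh on every master node (values are illustrative)
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"
```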
Related Issues
#2828 first introduced the bug of duplicate Worker registration, and #3447 fixed it. But there are still corner cases (see SPARK-23191 for details) that #3447 does not cover:
CASE 1
(1) Initially, Worker registers with Master A.
(2) After a while, the connection channel between Master A and Worker becomes inactive (e.g. due to a network drop), and Worker is notified about that via `onDisconnected` from `NettyRpcEnv`.
(3) When Worker invokes `onDisconnected`, it attempts to reconnect to all masters (including Master A).
(4) Meanwhile, the network between Worker and Master A recovers, and Worker successfully registers with Master A again.
(5) Master A responds with `RegisterWorkerFailed("Duplicate worker ID")`.
(6) Worker receives that msg and exits.
CASE 2
(1) Master A loses leadership (`RevokedLeadership` is sent to itself). Master B takes over, recovers everything from Master A (which registers the workers for the first time in Master B), and sends `MasterChanged` to Worker.
(2) Before Master A receives `RevokedLeadership`, it receives a late `Heartbeat` from Worker (which had previously been removed in Master A due to heartbeat timeout), so it sends `ReconnectWorker` to the worker.
(3) Worker receives `MasterChanged` before `ReconnectWorker`, changing `masterRef` to Master B.
(4) Subsequently, Worker receives `ReconnectWorker` from Master A, and then it reconnects to all masters.
(5) Master B receives the register request again from the Worker and responds with `RegisterWorkerFailed("Duplicate worker ID")`.
(6) Worker receives that msg and exits.
In CASE 1, it is difficult for the Worker to know Master A's state. Normally, the Worker thinks Master A has already died and that it is impossible for Master A to respond to the Worker's re-connect request.
In CASE 2, we can see a race condition between `RevokedLeadership` and `Heartbeat`. Actually, Master A has already had its leadership revoked while it is processing the `Heartbeat` msg. That means the state between Master and ZooKeeper can be out of sync for a while.
Solutions
In this PR, instead of exiting the Worker process when a duplicate Worker registration happens, we suggest logging a warning about it. This is fine since the Master actually performs a no-op when it receives a duplicate registration from a Worker. In turn, the Worker can continue working with that Master normally without any side effect.
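Putting the two sides together, the fix sketched in the diff snippets earlier in this thread roughly amounts to the following (simplified fragments, not the exact merged diff):

```scala
// Master.scala (RegisterWorker handling), simplified: a duplicate registration
// is acknowledged and flagged instead of being rejected.
else if (idToWorker.contains(id)) {
  workerRef.send(RegisteredWorker(self, masterWebUiUrl, masterAddress, duplicate = true))
}

// Worker.scala (handleRegisterResponse), simplified: a duplicate acknowledgement
// only produces a warning and the worker keeps running.
case RegisteredWorker(masterRef, masterWebUiUrl, masterAddress, duplicate) =>
  // preferredMasterAddress is derived from masterAddress / masterRef.address,
  // as shown in the diff snippet earlier in this thread.
  if (duplicate) {
    logWarning(s"Duplicate registration at master $preferredMasterAddress")
  }
```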
How was this patch tested?
Tested Manually.
I followed the steps Neeraj Gupta suggested in JIRA SPARK-23191 to reproduce case 1.
Before this PR, the Worker would show as DEAD in the UI.
After this PR, the Worker just warns about the duplicate register behavior (as you can see in the second-to-last row of the log snippet below), and stays ALIVE in the UI.