[SPARK-46346][CORE] Fix Master to update a worker from UNKNOWN to ALIVE on RegisterWorker msg
#44280
Conversation
Could you review this PR, @viirya?
viirya
left a comment
Hmm, the worker's state is set to UNKNOWN in beginRecovery, right?
After that, the master will send MasterChanged to the worker. Then the worker will send WorkerSchedulerStateResponse back to the master. When the master receives WorkerSchedulerStateResponse, it will set the worker's state to ALIVE.
I think this is the whole process during recovery on the worker side.
I am not sure when the worker sends RegisterWorker to the recovering master. Before the master sends MasterChanged to the worker, isn't the worker still using the old master address? Why does it send RegisterWorker to the new master before changing masters?
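The handshake described above can be modeled as a small state sketch. All names here (`RecoverySketch`, `WorkerInfo`, the simplified method signatures) are hypothetical stand-ins chosen for illustration, not Spark's actual classes in `Master.scala` or `Worker.scala`:

```scala
// Minimal model of the recovery handshake described above:
// beginRecovery marks known workers UNKNOWN; the worker's later
// WorkerSchedulerStateResponse moves it back to ALIVE.
// All names are illustrative, not the actual Spark implementation.
object RecoverySketch {
  sealed trait WorkerState
  case object Unknown extends WorkerState
  case object Alive extends WorkerState

  final case class WorkerInfo(id: String, var state: WorkerState)

  // On recovery, the master marks each persisted worker UNKNOWN
  // and (conceptually) sends MasterChanged to it.
  def beginRecovery(workers: Seq[WorkerInfo]): Unit =
    workers.foreach(w => w.state = Unknown)

  // Receiving WorkerSchedulerStateResponse marks the worker ALIVE.
  def onWorkerSchedulerStateResponse(w: WorkerInfo): Unit =
    w.state = Alive
}
```

The question raised in this thread is what happens when a message outside this happy path (a plain RegisterWorker) arrives while the worker is still in the UNKNOWN state.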
Yes, that's the correct normal case which we expect. The problem is that in the EKS environment, … For this question, Worker has a retry logic with …
For the following question, we use …
For example, in the PR description, the following shows that Worker is trying to connect …
BTW, this is a known issue from SPARK-23191, @viirya.
Hmm, so in the network failure case the master cannot send …
As the service mapping works, the worker will still send … But the worker doesn't send … I'm wondering, is it okay to skip …
For this part, you are right. The driver recovery and app recovery are two separate additional issues which we need to address later.
This PR focuses only on …
For the record, I'm also working on that part internally.
BTW:
- spark/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala, lines 477 to 498 in 82a0232
- spark/core/src/main/scala/org/apache/spark/deploy/DeployMessage.scala, lines 121 to 129 in 82a0232
Okay. This sounds correct. As this PR proposes to address the case …
Yea, ideally they should be addressed too, although currently it looks like they are not a regression.
Btw, I found a test … I.e., instead of …
Thank you, @viirya. I will start the test coverage improvement for this part.
Hi, @viirya. I made a PR to refactor for unit testing in this area.
Got it. Thank you @dongjoon-hyun for improving test coverage. |
…ALIVE` on `RegisterWorker` msg

### What changes were proposed in this pull request?

This PR aims to fix `Spark Master`'s recovery process to update a worker's status from `UNKNOWN` to `ALIVE` when it receives a `RegisterWorker` message from that worker.

### Why are the changes needed?

This only happens during recovery.

- `Master` already has the recovered worker information in memory with `UNKNOWN` status.
- `Worker` sends the `RegisterWorker` message correctly.
- `Master` keeps the worker's status as `UNKNOWN` and informs the worker with a `RegisteredWorker` message carrying the `duplicated` flag.
- `Worker` receives a response like the following and will not try to reconnect.

```
23/12/09 23:49:57 INFO Worker: Retrying connection to master (attempt # 3)
23/12/09 23:49:57 INFO Worker: Connecting to master ...:7077...
23/12/09 23:50:04 INFO TransportClientFactory: Successfully created connection to master...:7077 after 7089 ms (0 ms spent in bootstraps)
23/12/09 23:50:04 WARN Worker: Duplicate registration at master spark://...
23/12/09 23:50:04 INFO Worker: Successfully registered with master spark://...
```

The `UNKNOWN`-status workers block the recovery process and cause a long delay.

https://github.com/apache/spark/blob/bac3492980a3e793065a9e9d511ddf0fb66357b3/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L604-L606

After the delay, the master simply kills them all.

https://github.com/apache/spark/blob/bac3492980a3e793065a9e9d511ddf0fb66357b3/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L647-L649

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This case is a little hard to cover with a unit test. Tested manually.

- Master

```
23/12/10 04:58:30 WARN OneWayOutboxMessage: Failed to send one-way RPC.
java.io.IOException: Connecting to /***:1024 timed out (10000 ms)
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:291)
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:214)
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:226)
	at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:204)
	at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:202)
	at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:198)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
23/12/10 04:58:54 INFO Master: Registering worker ***:1024 with 2 cores, 23.0 GiB RAM
23/12/10 04:58:54 INFO Master: Worker has been re-registered: worker-20231210045613-***-1024
```

- Worker

```
23/12/10 04:58:45 INFO Worker: Retrying connection to master (attempt # 5)
23/12/10 04:58:45 INFO Worker: Connecting to master master:7077...
23/12/10 04:58:54 INFO TransportClientFactory: Successfully created connection to master/...:7077 after 63957 ms (0 ms spent in bootstraps)
23/12/10 04:58:54 WARN Worker: Duplicate registration at master spark://master:7077
23/12/10 04:58:54 INFO Worker: Successfully registered with master spark://master:7077
23/12/10 04:58:54 INFO Worker: WorkerWebUI is available at https://...-1***-1024
23/12/10 04:58:54 INFO Worker: Worker cleanup enabled; old application directories will be deleted in: /data/spark
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#44280 from dongjoon-hyun/SPARK-46346.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
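The proposed UNKNOWN-to-ALIVE transition can be sketched as a tiny model. All names below (`MasterSketch`, `handleRegisterWorker`, the simplified `WorkerInfo`) are hypothetical stand-ins for illustration only; the real logic lives in `Master.scala`'s `RegisterWorker` handler:

```scala
// Minimal sketch of the proposed behavior: when the recovering master
// receives RegisterWorker for a worker it already tracks as UNKNOWN,
// it moves that worker to ALIVE instead of leaving it UNKNOWN.
// All names here are illustrative stand-ins, not Spark's real classes.
object MasterSketch {
  sealed trait WorkerState
  case object Unknown extends WorkerState
  case object Alive extends WorkerState

  final case class WorkerInfo(id: String, var state: WorkerState)

  // Returns the worker's state after the RegisterWorker message is handled.
  def handleRegisterWorker(known: Map[String, WorkerInfo], id: String): WorkerState =
    known.get(id) match {
      case Some(w) if w.state == Unknown =>
        w.state = Alive // the fix: revive a recovered worker on re-registration
        w.state
      case Some(w) =>
        w.state // genuinely duplicate registration; state unchanged
      case None =>
        Alive // a brand-new worker would be registered as ALIVE
    }
}
```

Without this transition, a re-registering worker stays UNKNOWN until the recovery timeout, which is the long delay the PR description reports.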