Expose setting multiple protocols and ports via the dask-scheduler CLI #6898
pentschev
left a comment
Thanks @jacobtomlinson , this is a clean, simple solution, awesome! I did some brief testing with UCX as well and didn't immediately see any issues with that, so feels like it should work as expected.
For reference, this is what I ran:
# on scheduler
dask-scheduler --protocol ucx,ws --port 8786,8788
# on worker
dask-cuda-worker ucx://SCHEDULER_IP:8786
# on client -- from https://github.com/rapidsai/dask-cuda/blob/branch-22.10/dask_cuda/benchmarks/local_cudf_merge.py
python local_cudf_merge.py --runs 10 --scheduler-address ucx://SCHEDULER_IP:8786 -c 50_000_000
I didn't run any tests related to ws, mainly because I wouldn't know how to. 🙂
|
The interface change seems to work; it reports the listening interfaces correctly at least:
$ dask-scheduler --protocol ucx,tcp,ws --port 8786,8788,8789 --interface ib0,enp1s0f0,enp1s0f0
2022-08-17 07:49:40,367 - distributed.scheduler - INFO - -----------------------------------------------
2022-08-17 07:49:40,985 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2022-08-17 07:49:41,028 - distributed.scheduler - INFO - State start
2022-08-17 07:49:41,039 - distributed.scheduler - INFO - -----------------------------------------------
2022-08-17 07:49:41,039 - distributed.scheduler - INFO - Clear task state
2022-08-17 07:49:42,568 - distributed.scheduler - INFO - Scheduler at: ucx://10.33.225.163:8786
2022-08-17 07:49:42,568 - distributed.scheduler - INFO - Scheduler at: tcp://10.33.227.163:8788
2022-08-17 07:49:42,568 - distributed.scheduler - INFO - Scheduler at: ws://10.33.227.163:8789
2022-08-17 07:49:42,568 - distributed.scheduler - INFO - dashboard at: 10.33.225.163:8787
For the cases I'm testing, the IP addresses match those I specified to
$ dask-scheduler --protocol tcp &
[1] 5397
$ 2022-08-17 07:51:30,454 - distributed.scheduler - INFO - -----------------------------------------------
2022-08-17 07:51:31,064 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2022-08-17 07:51:31,104 - distributed.scheduler - INFO - State start
2022-08-17 07:51:31,115 - distributed.scheduler - INFO - -----------------------------------------------
2022-08-17 07:51:31,116 - distributed.scheduler - INFO - Clear task state
2022-08-17 07:51:31,117 - distributed.scheduler - INFO - Scheduler at: tcp://10.33.227.163:8786
2022-08-17 07:51:31,117 - distributed.scheduler - INFO - dashboard at: :8787
$ netstat -tupan | grep :878
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 0.0.0.0:8786 0.0.0.0:* LISTEN 5397/python3.8
tcp 0 0 0.0.0.0:8787 0.0.0.0:* LISTEN 5397/python3.8
tcp6 0 0 :::8786 :::* LISTEN 5397/python3.8
tcp6 0 0 :::8787 :::* LISTEN 5397/python3.8
In the example above we see Dask reporting it's binding to a specific IP address, but |
|
Nevermind, in the example above I didn't specify
$ dask-scheduler --protocol tcp --interface ib0 &
[1] 9736
$ 2022-08-17 07:59:58,304 - distributed.scheduler - INFO - -----------------------------------------------
2022-08-17 07:59:59,078 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2022-08-17 07:59:59,120 - distributed.scheduler - INFO - State start
2022-08-17 07:59:59,130 - distributed.scheduler - INFO - -----------------------------------------------
2022-08-17 07:59:59,130 - distributed.scheduler - INFO - Clear task state
2022-08-17 07:59:59,131 - distributed.scheduler - INFO - Scheduler at: tcp://10.33.225.163:8786
2022-08-17 07:59:59,131 - distributed.scheduler - INFO - dashboard at: 10.33.225.163:8787
$ netstat -tupan | grep :878
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 10.33.225.163:8786 0.0.0.0:* LISTEN 9736/python3.8
tcp 0 0 0.0.0.0:8787 0.0.0.0:* LISTEN 9736/python3.8
tcp6 0 0 :::8787 :::* LISTEN 9736/python3.8
That was an extrapolation of the incorrect behavior I see with individual protocols. The TCP interface is bound correctly, but the others are not:
$ UCX_TCP_CM_REUSEADDR=y dask-scheduler --protocol ucx,tcp,ws --port 8786,8788,8789 --interface enp1s0f0,enp1s0f0,enp1s0f0 &
[1] 11971
$ 2022-08-17 08:03:30,634 - distributed.scheduler - INFO - -----------------------------------------------
2022-08-17 08:03:31,227 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2022-08-17 08:03:31,266 - distributed.scheduler - INFO - State start
2022-08-17 08:03:31,275 - distributed.scheduler - INFO - -----------------------------------------------
2022-08-17 08:03:31,276 - distributed.scheduler - INFO - Clear task state
2022-08-17 08:03:32,929 - distributed.scheduler - INFO - Scheduler at: ucx://10.33.227.163:8786
2022-08-17 08:03:32,929 - distributed.scheduler - INFO - Scheduler at: tcp://10.33.227.163:8788
2022-08-17 08:03:32,929 - distributed.scheduler - INFO - Scheduler at: ws://10.33.227.163:8789
2022-08-17 08:03:32,929 - distributed.scheduler - INFO - dashboard at: 10.33.227.163:8787
$ netstat -tupan | grep :878
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 10.33.227.163:8788 0.0.0.0:* LISTEN 11971/python3.8
tcp 0 0 0.0.0.0:8789 0.0.0.0:* LISTEN 11971/python3.8
tcp 0 0 0.0.0.0:8786 0.0.0.0:* LISTEN 11971/python3.8
tcp 0 0 0.0.0.0:8787 0.0.0.0:* LISTEN 11971/python3.8
tcp6 0 0 :::8789 :::* LISTEN 11971/python3.8
tcp6 0 0 :::8787 :::* LISTEN 11971/python3.8
I can confirm the behavior above for individual protocols (without this PR). For sure there's a bug with UCX, but I don't know if this is a bug with websockets or a known limitation. |
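For anyone repeating this check, the comparison of wildcard versus interface-specific listeners can be scripted rather than read by eye. The following is only a minimal sketch, assuming `netstat -tupan`-style lines like the ones pasted in this thread; `wildcard_ports` is a hypothetical helper name and is not part of Dask or this PR.

```python
from __future__ import annotations

# Local addresses that mean "listening on every interface".
WILDCARD_HOSTS = {"0.0.0.0", "::"}


def wildcard_ports(netstat_output: str) -> set[int]:
    """Return the TCP ports that are listening on a wildcard address.

    Assumes whitespace-separated netstat columns:
    proto, recv-q, send-q, local address, foreign address, state, pid/program.
    """
    ports: set[int] = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, notices, and anything that is not a TCP listener.
        if len(fields) < 6 or not fields[0].startswith("tcp") or fields[5] != "LISTEN":
            continue
        # Split on the last colon so IPv6 forms such as ":::8787" also work.
        host, _, port = fields[3].rpartition(":")
        if host in WILDCARD_HOSTS:
            ports.add(int(port))
    return ports


# Sample lines copied from the multi-protocol run in this thread.
sample = """\
tcp 0 0 10.33.227.163:8788 0.0.0.0:* LISTEN 11971/python3.8
tcp 0 0 0.0.0.0:8789 0.0.0.0:* LISTEN 11971/python3.8
tcp 0 0 0.0.0.0:8786 0.0.0.0:* LISTEN 11971/python3.8
tcp 0 0 0.0.0.0:8787 0.0.0.0:* LISTEN 11971/python3.8
tcp6 0 0 :::8789 :::* LISTEN 11971/python3.8
tcp6 0 0 :::8787 :::* LISTEN 11971/python3.8
"""
print(sorted(wildcard_ports(sample)))  # [8786, 8787, 8789]
```

On that sample, 8788 (tcp) is absent from the result because it is bound to a specific address, while 8786 (ucx) and 8789 (ws) show up as wildcard listeners alongside the dashboard on 8787.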
|
Thanks for digging into this @pentschev. It sounds like this has identified some bugs but they are not related to this PR specifically. Should we open an issue to track that? |
|
For UCX I've filed rapidsai/ucx-py#871 and #6901 to correct this behavior. But it may be worth filing an issue for someone to investigate whether websockets should be fixed too. |
|
Test failures appear unrelated. Unless there are further review/comments I intend to merge on Monday. |
Closes #6891
pre-commit run --all-files
In #6891 @mrocklin mentioned that Scheduler can take a list for protocol and port. This PR updates the dask-scheduler CLI to also allow lists. Users can optionally specify a comma-separated list for each.
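To illustrate the comma-separated form, here is a rough sketch of how such values might be split and paired before being handed on as lists. It is not the code in this PR: `split_csv_option` and `pair_listeners` are hypothetical helper names, and requiring the two lists to have equal length is an assumption made for the example.

```python
from __future__ import annotations


def split_csv_option(value: str | None) -> list[str]:
    """Split a comma-separated CLI value into a list of stripped items."""
    if not value:
        return []
    return [item.strip() for item in value.split(",") if item.strip()]


def pair_listeners(protocols: str, ports: str) -> list[tuple[str, int]]:
    """Pair each protocol with the port at the same position.

    Assumption for this sketch: both options list the same number of entries.
    """
    protocol_list = split_csv_option(protocols)
    port_list = [int(port) for port in split_csv_option(ports)]
    if len(protocol_list) != len(port_list):
        raise ValueError(
            f"got {len(protocol_list)} protocols but {len(port_list)} ports"
        )
    return list(zip(protocol_list, port_list))


# Mirrors `dask-scheduler --protocol ucx,ws --port 8786,8788` from the review.
print(pair_listeners("ucx,ws", "8786,8788"))  # [('ucx', 8786), ('ws', 8788)]
```

A single value with no comma passes through as a one-element list, so the existing single-protocol usage would be unaffected under this scheme.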