Dont start http server in Scheduler.__init__ #4928
Conversation
logger.debug("Connection from %r to %s", address, type(self).__name__)
self._comms[comm] = op
await self
This is a very interesting change. I encountered a race condition where this await would actually trigger the warning after I moved the HTTP server startup code to Scheduler.start. Apparently start is not idempotent, and sometimes we'd restart / start a scheduler twice, but only under some strange timing-sensitive conditions (I believe the handler was already active while the scheduler was still starting...). I can dig deeper here, but I figured there shouldn't be a reason to await self here since this handler should only be registered after/during the start anyhow.
My hope would be that await self would be idempotent. If in the future we find that there is an easy way to make this true, I would be in support.
I'd assume an if status == running: return would already do the trick. If we prefer this route, I can revert the await self change here.
That, combined with using an async lock to avoid many await self calls at the same time. I think that we do this in other situations, like Client and Cluster. I'm surprised that we don't do it in Server.
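To make that concrete, here is a minimal sketch of the pattern being discussed, in plain asyncio; the IdempotentServer class, its attributes, and the Status enum are illustrative stand-ins rather than the actual distributed.core.Server code. A status check guarded by an asyncio.Lock collapses concurrent await calls into a single startup.

import asyncio
from enum import Enum


class Status(Enum):
    undefined = "undefined"
    running = "running"


class IdempotentServer:
    """Illustrative stand-in, not the real distributed Server."""

    def __init__(self):
        self.status = Status.undefined
        self._startup_lock = asyncio.Lock()

    async def start(self):
        # The lock serializes concurrent startup attempts; the status check
        # turns every call after the first into a no-op.
        async with self._startup_lock:
            if self.status == Status.running:
                return self
            await asyncio.sleep(0)  # placeholder for the real startup work
            self.status = Status.running
            return self

    def __await__(self):
        # ``await server`` simply delegates to the guarded start().
        return self.start().__await__()


async def main():
    server = IdempotentServer()
    # Both awaits resolve to the same, singly-started server instance.
    await asyncio.gather(server.start(), server.start())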
The idempotence you are referring to is implemented once for Server, see
distributed/distributed/core.py
Lines 264 to 269 in d9bc3c6
def __await__(self):
    async def _():
        timeout = getattr(self, "death_timeout", 0)
        async with self._startup_lock:
            if self.status == Status.running:
                return self
For a reason I haven't understood yet, this still caused issues for me. My suspicion is that the scheduler has already arrived in a non-running but not properly closed state when I see this message. The await would then try to revive an effectively dead scheduler and cause this warning. I'll let CI run on this a few times and see if I can reproduce it.
I remembered running into this await before. I had a similar problem over in #4734
Indeed, it turns out a properly timed connection request to a dead or dying scheduler can revive it. I guess this is merely an edge case relevant to our async test suite. Regardless, IMHO this is much more cleanly fixed in #4734, and I suggest merging that one first.
Note, after I became a bit smarter: this await is here to ensure that the server is actually up. The problem is that the start is not idempotent; I stumbled over this in #4734 as well.
  # XXX what about nested state such as ClientState.wants_what
  # (see also fire-and-forget...)
- logger.info("Clear task state")
+ logger.debug("Clear task state")
I changed this log level since it felt kind of wrong: it prints "Clear task state" to the console whenever I call dask-scheduler, and that's currently failing the tests. I assume this is OK?
Sure, I don't have a strong opinion on this
Force-pushed from f28e187 to 284e45c
error:Since distributed.*:PendingDeprecationWarning

# See https://github.com/dask/distributed/issues/4806
error:Port:UserWarning:distributed.node
Woo
This might be a bit too optimistic, but let's wait and see. It may well interfere if the code base is run using pytest-xdist; we will always get port collisions when running in parallel.
  # This will never work but is a reliable way to block without hard
  # coding any sleep values
- async with Client(cluster) as c:
+ async with Client(cluster, asynchronous=True) as c:
Nice catch
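For context, a minimal sketch of the asynchronous client usage, with illustrative LocalCluster parameters: inside a running event loop, a plain Client(cluster) would block on the very loop it is running on, so asynchronous=True is required.

import asyncio

from distributed import Client, LocalCluster


async def main():
    # asynchronous=True makes the cluster and client awaitable instead of
    # blocking the event loop they are running on.
    async with LocalCluster(
        asynchronous=True, processes=False, dashboard_address=":0"
    ) as cluster:
        async with Client(cluster, asynchronous=True) as client:
            future = client.submit(sum, [1, 2, 3])
            assert await future == 6


if __name__ == "__main__":
    asyncio.run(main())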
try:
    import bokeh  # noqa: F401

    HAS_BOKEH = True
except ImportError:
    HAS_BOKEH = False
Nitpick: This is subjective and not worth spending much time on, but more commonly throughout the codebase we will catch the ImportError and assign None to the module instead of defining a new HAS_* variable. For example:
distributed/distributed/utils.py
Lines 33 to 36 in 7d0f010
try:
    import resource
except ImportError:
    resource = None
Then later on we would do if bokeh is not None instead of if HAS_BOKEH
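As a rough sketch of how that pattern would look for bokeh (the helper function below is hypothetical, just to show the None check):

try:
    import bokeh
except ImportError:  # optional dependency
    bokeh = None


def maybe_start_dashboard(server):
    # Hypothetical helper: only set up the dashboard when bokeh is available.
    if bokeh is None:
        return None
    # ... build and attach the bokeh application here ...
    return server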
If we do this in other places already, I'll change it. I think keeping a consistent style in a codebase is worth the effort
assert s.address.startswith("tls")

from distributed.core import Status
Nitpick: Can you move this import to the top of the module, alongside the existing distributed.core import?
Sure. I'm somehow used to either linters or formatters making me aware of this; for some reason our config considers this perfectly fine 🤷‍♂️
kwargs = self.kwargs.copy()
kwargs.pop("dashboard_address")
This would silently ignore any dashboard_address which has been specified in a test. Since this is only ever used in one place, perhaps we should just specify dashboard_address=":0" there instead? This isn't worth spending too much time on, it was just slightly surprising to see us manually setting dashboard_address=":54321"
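For reference, a small sketch of that suggestion (the test itself is hypothetical): passing dashboard_address=":0" lets the operating system pick a free port, so nothing needs to be popped from kwargs and no hard-coded port can collide.

from distributed import Scheduler


async def test_dashboard_on_random_port():
    # ":0" asks the OS for any free dashboard port instead of a hard-coded
    # one like ":54321", so concurrent test runs cannot collide.
    async with Scheduler(dashboard_address=":0") as scheduler:
        assert scheduler.address.startswith("tcp")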
Yes, I think I changed this before I had the idea to simply use a random port, which sidesteps the warning. I'll fix that.
kwargs = self.kwargs.copy()
kwargs.pop("dashboard_address")
Similar comment here
I'm curious, what is the status here?
Status here is a bit awkward. I needed/wanted to remove the Options:
I think today I will try 2., and if that fails we'll go for 1., since having no warnings in most of the runs is still a win and we can follow up once #4734 is properly resolved.
Force-pushed from 284e45c to d6e9b99
This should deal with the CI "port already in use" warning situation reported in #4806.
Highlights
- The HTTP server is no longer started in Scheduler.__init__, which made some tests really awkward. This PR moves the start of the HTTP server to the Scheduler.start method.
- Supersedes #4896 and #4921
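For illustration, the shape of the change in a stripped-down form; this is plain asyncio with a hypothetical MiniScheduler, not the actual Scheduler code. Construction only records configuration, and sockets are bound when start is awaited.

import asyncio


class MiniScheduler:
    """Stripped-down stand-in for the real Scheduler."""

    def __init__(self, http_port=0):
        # No sockets are bound here; merely constructing the object has no
        # side effects, which keeps tests free of port warnings.
        self._http_port = http_port
        self.http_server = None

    async def start(self):
        # Binding only happens once start() is awaited.
        self.http_server = await asyncio.start_server(
            self._handle, host="127.0.0.1", port=self._http_port
        )
        return self

    async def _handle(self, reader, writer):
        writer.close()
        await writer.wait_closed()

    async def close(self):
        if self.http_server is not None:
            self.http_server.close()
            await self.http_server.wait_closed()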
black distributed / flake8 distributed / isort distributed