
Conversation

@hendrikmakait (Member) commented Nov 2, 2022

Supersedes #7184

  • Tests added / passed
  • Passes pre-commit run --all-files

logger.error("%s timed out trying to restart its worker.", nanny)
bad_nannies.append(nanny)
if bad_nannies:
    raise RuntimeError(

@hendrikmakait (Member Author):

We may want to differentiate between "pure" timeouts, i.e., cases where every error we see is a TimeoutError, and raise a TimeoutError in that case, but otherwise raise a generic RuntimeError aggregating all occurring errors to signify that we observed an actual problem rather than just a timeout. Then again, I'm not sure that's really a problem, or that it would warrant the additional logic.

@hendrikmakait (Member Author):

Another thought: The benefit of collecting errors and handling them in bulk is that we can pass more information to the user if we return before the timeout. For example, if 5 nannies fail to restart and the other 95 succeed, we will be able to tell that to the user. The downside is that we do not pass any information back to the user if a single nanny times out.
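
For illustration, a minimal sketch of what the error aggregation described above could look like (the helper name, the `results` mapping, and the message wording are assumptions, not this PR's implementation):

```python
import asyncio


def _raise_on_failed_restarts(results: dict) -> None:
    """Hypothetical helper: ``results`` maps nanny address -> exception raised while
    restarting (or None on success), e.g. collected via
    ``asyncio.gather(..., return_exceptions=True)``."""
    failures = {addr: exc for addr, exc in results.items() if exc is not None}
    if not failures:
        return
    n_ok = len(results) - len(failures)
    if all(isinstance(exc, (TimeoutError, asyncio.TimeoutError)) for exc in failures.values()):
        # "Pure" timeouts: surface them as a TimeoutError
        raise TimeoutError(
            f"{len(failures)} nanny(ies) timed out while restarting workers "
            f"({n_ok} restarted successfully): {sorted(failures)}"
        )
    # Mixed or non-timeout failures: aggregate everything into a generic RuntimeError
    raise RuntimeError(
        f"{len(failures)} nanny(ies) failed to restart workers ({n_ok} succeeded): "
        + "; ".join(f"{addr}: {exc!r}" for addr, exc in sorted(failures.items()))
    )
```

This way, a pure-timeout run would still surface as a TimeoutError, while partial failures would report how many nannies succeeded.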

logger = logging.getLogger(__name__)


class _Watch:

@hendrikmakait (Member Author):

This part conflicts with #7238

@hendrikmakait marked this pull request as ready for review November 2, 2022 15:29
@hendrikmakait requested a review from fjetter November 2, 2022 16:29

github-actions bot (Contributor) commented Nov 2, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       15 files   ±0       15 suites  ±0    6h 12m 56s ⏱️ −11m 49s
    3 226 tests   ±0     3 141 ✔️ −1       84 💤 +1      1 ±0
   24 539 runs  +687    23 551 ✔️ +608    987 💤 +79     1 ±0

For more details on these failures, see this check.

Results for commit 88e5d52. ± Comparison against base commit b6bff75.

♻️ This comment has been updated with latest results.

Comment on lines 5787 to 5791
  *(
      asyncio.wait_for(
-         # FIXME does not raise if the process fails to shut down,
-         # see https://github.com/dask/distributed/pull/6427/files#r894917424
-         # NOTE: Nanny will automatically restart worker process when it's killed
-         nanny.kill(
-             reason="scheduler-restart",
-             timeout=timeout,
-         ),
-         timeout,
+         nanny.restart(timeout=timeout, reason="scheduler-restart"),
+         timeout=timeout,
      )

Member:

FYI these tasks will raise a CommClosedError if a worker is disconnecting from the scheduler while the restart is running.

We're calling ConnectionPool.remove for the worker that is being removed, which closes all connections to that worker, including the one we're using here. This might be the exception we're seeing in coiled/benchmarks#468 (comment), i.e., this would likely happen with or without wait_for_workers. I'm wondering, though, why this isn't happening even more often.

Collaborator:

Here we're connecting to the nanny addresses, not the workers. The nanny connection shouldn't be closed by restart, so I wouldn't expect this to raise CommClosedError

    return {"status": "OK"}
except Exception as e:
    logger.exception(e)
    return error_message(e)

Collaborator:

Why manually return the serialized error? This seems like an unusual pattern. I'd rather just raise an error in the function and have the RPC handle serializing and deserializing it if possible?

If it's just about the error message being hard to understand for users, that's a different topic: #4880.

@hendrikmakait (Member Author):

Chalk that up to me not knowing any better. You're suggesting

Suggested change:
- return error_message(e)
+ raise

?
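
For context, a minimal sketch of the pattern being suggested, assuming the RPC layer serializes an exception raised inside a handler and re-raises it on the calling side (the handler name and body here are illustrative, not this PR's code):

```python
import logging

logger = logging.getLogger(__name__)


async def restart_worker_handler(nanny, timeout):  # illustrative name and signature
    try:
        await nanny.restart(timeout=timeout, reason="scheduler-restart")
        return {"status": "OK"}
    except Exception:
        logger.exception("Restarting the worker via %s failed", nanny)
        # Re-raise and let the RPC layer serialize the exception for the caller,
        # rather than returning error_message(e) by hand.
        raise
```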

Comment on lines +5766 to +5767
self._expect_nannies(workers)
logger.debug("Asking nannies to restart workers: %s.", workers)

Collaborator:

I'd expect this code to be in _restart_workers instead

+         nanny.restart(timeout=timeout, reason="scheduler-restart"),

@gjoseph92 (Collaborator) commented Nov 14, 2022:

Why are we calling restart here instead of kill? I don't trust restart.

  1. restart doesn't pass the timeout into kill, so even if you have a 60s timeout here, kill will still fail if the worker doesn't shut down in 2s
  2. restart manually calls instantiate, which is redundant because when the worker process exits, it'll call _on_worker_exit which also calls instantiate. instantiate should be idempotent so it should be okay, but it feels like weird design. There should be exactly one thing responsible for restarting the worker process.

I assume the reason for this is to make the RPC block until the worker has connected to the scheduler, instead of having to do the wait-for-workers loop scheduler-side. That's reasonable, but only if we fix Nanny.restart first I think. edit: see below, I don't like the Nanny being responsible for this at all.
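
As a rough illustration of point 1, forwarding the outer timeout budget into the kill step might look something like this (a sketch only; the real Nanny.restart and Nanny.kill differ in signature and internals, and this is not this PR's code):

```python
class Nanny:  # sketch of just the relevant method
    async def restart(self, timeout: float = 30, reason: str = "nanny-restart") -> str:
        # Hypothetical: pass the full timeout budget into kill instead of relying
        # on kill's small default, then bring up a fresh worker process.
        await self.kill(timeout=timeout, reason=reason)
        await self.instantiate()  # or leave this to _on_worker_exit, which also calls it
        return "OK"
```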

Collaborator:

Design-wise, I also don't love

  1. restart being the Nanny's responsibility
  2. the Nanny being responsible for deciding (or even knowing) when the worker has reconnected to the scheduler

The Nanny currently has a very complicated approach to subprocess management, and a lot of confusing IPC with queues, translating between blocking events and asyncio, etc.

I would love if the Nanny could (in theory) be rewritten in Go or even a bash script as a simple dumb process manager that just does its one job. If the process shuts down with a non-zero exit code, it restarts it. That's it. Maybe it sets some environment variables in the subprocess. Maybe it even still supports nanny plugins to run custom Python code. (Though I wish it couldn't even talk to the scheduler.) But I would love for the worker process to be a black box to the Nanny. They certainly shouldn't need to communicate or share any state; nothing beyond POSIX signals and an exit code.

If that were the case, then Scheduler.restart could be implemented as:

  1. tell workers to close with non-zero exit code
  2. wait for workers to reconnect because Nanny restarts them

Or, if the Nanny can communicate with the scheduler (maybe handy when a worker is busted and unresponsive), then Scheduler.restart could be:

  1. call Nanny.restart RPC. This sends SIGTERM to the subprocess, joins with timeout, if it's not dead yet sends SIGKILL, joins, etc. Once the subprocess is dead, starts a new subprocess and returns.
  2. wait for workers to reconnect because Nanny restarts them

The point is, I think we want Scheduler.restart to block until all workers have reconnected to the scheduler. And I don't think the Nanny should be able to know if a subprocess is connected to the scheduler or not. Therefore, that blocking logic should happen scheduler-side.
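
A rough sketch of that scheduler-side blocking, assuming a simple count-based poll (the helper name and polling interval are assumptions, not this PR's implementation):

```python
import asyncio
import time


async def wait_for_workers_to_return(scheduler, n_workers: int, timeout: float) -> None:
    """Hypothetical helper: block until at least ``n_workers`` workers are connected
    to the scheduler again, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while len(scheduler.workers) < n_workers:  # Scheduler.workers: address -> WorkerState
        if time.monotonic() > deadline:
            raise TimeoutError(
                f"Waited {timeout} s, but only {len(scheduler.workers)} out of "
                f"{n_workers} workers reconnected after the restart."
            )
        await asyncio.sleep(0.1)
```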

Member:

> Why are we calling restart here instead of kill? I don't trust restart.

Just want to point out that the entire argument in the first comment is about the current implementation of the two handlers. If the handlers don't work as expected, we should fix them and not avoid using them.

> restart being the Nanny's responsibility

IMO the nanny is the only sane place to implement this. The queue stuff is unnecessarily complex but that doesn't change anything about the way responsibilities should be assigned. From an ownership perspective, the nanny owns the worker resources and is responsible for closing and restarting it.

> I would love if the Nanny could (in theory) be rewritten in Go or even a bash script as a simple dumb process manager that just does its one job. If the process shuts down with a non-zero exit code, it restarts it. That's it. Maybe it sets some environment variables in the subprocess. Maybe it even still supports nanny plugins to run custom Python code. (Though I wish it couldn't even talk to the scheduler.) But I would love for the worker process to be a black box to the Nanny. They certainly shouldn't need to communicate or share any state; nothing beyond POSIX signals and an exit code.

What problem is this actually solving? We'd abandon queues in favor of bash?? How would Go help? These are all implementation details but if we don't fix the fundamental architecture, we'll still be in the same mess.

Member:

FWIW, Nanny.kill currently suffers from a race condition that Nanny.restart apparently doesn't: #7312 (comment)

In that issue, I intend to clean this logic up a bit anyway, which should ease the tension here somewhat, I believe.

@hendrikmakait (Member Author):

> Why are we calling restart here instead of kill? I don't trust restart.

restart is imperfect; we agree on that. Fixing it is something we should do, but I would consider it outside the scope of this PR. The scope of this PR is to simplify the scheduler side of restarting, i.e., Scheduler.restart, to remove a number of implicit assumptions and raise exceptions in their place, and to ensure that Client.restart and Client.restart_workers actually have similar semantics.

I agree with Florian that restarts should be the responsibility of the nanny. Arguably, that's all the nanny has to do: watch over the worker process and restart it when told to. We also need a way to kill/restart a struggling or unresponsive worker, which could not be implemented on the worker itself since it's, well, unresponsive.
