Simplify Scheduler.restart logic and raise deliberate exceptions
#7242
Conversation
logger.error("%s timed out trying to restart its worker.", nanny)
bad_nannies.append(nanny)
if bad_nannies:
    raise RuntimeError(
We may want to differentiate between "pure" timeouts and actual failures: if all the errors we see are TimeoutErrors, raise a TimeoutError; otherwise, raise a generic RuntimeError aggregating all occurring errors to signify that we observed an actual problem instead of just timing out. Then again, I'm not sure whether that's really a problem or whether it would warrant the additional logic.
Another thought: The benefit of collecting errors and handling them in bulk is that we can pass more information to the user if we return before the timeout. For example, if 5 nannies fail to restart and the other 95 succeed, we will be able to tell that to the user. The downside is that we do not pass any information back to the user if a single nanny times out.
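To make both points above concrete, here is a rough sketch (not the PR's actual code) of gathering the per-nanny restarts with return_exceptions=True, reporting partial failures, and raising TimeoutError only when timeouts were the sole problem; the helper name and the nannies argument are placeholders:

```python
import asyncio

async def _restart_nannies(nannies, timeout):
    # Placeholder sketch: restart every nanny, collect failures instead of
    # failing fast, and decide afterwards what to raise.
    results = await asyncio.gather(
        *(
            asyncio.wait_for(
                nanny.restart(timeout=timeout, reason="scheduler-restart"),
                timeout,
            )
            for nanny in nannies
        ),
        return_exceptions=True,
    )
    failures = {
        nanny: result
        for nanny, result in zip(nannies, results)
        if isinstance(result, BaseException)
    }
    if not failures:
        return
    msg = (
        f"{len(failures)} out of {len(nannies)} nannies failed to restart "
        f"their worker: {list(failures)}"
    )
    if all(isinstance(exc, (TimeoutError, asyncio.TimeoutError)) for exc in failures.values()):
        # "Pure" timeouts: surface them as a TimeoutError.
        raise TimeoutError(msg)
    # At least one real error: aggregate everything into a RuntimeError.
    raise RuntimeError(msg) from next(iter(failures.values()))
```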
distributed/semaphore.py (outdated)
logger = logging.getLogger(__name__)


class _Watch:
This part conflicts with #7238
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
15 files ±0   15 suites ±0   6h 12m 56s ⏱️ -11m 49s
For more details on these failures, see this check.
Results for commit 88e5d52. ± Comparison against base commit b6bff75.
♻️ This comment has been updated with latest results.
  *(
      asyncio.wait_for(
-         # FIXME does not raise if the process fails to shut down,
-         # see https://github.com/dask/distributed/pull/6427/files#r894917424
-         # NOTE: Nanny will automatically restart worker process when it's killed
-         nanny.kill(
-             reason="scheduler-restart",
-             timeout=timeout,
-         ),
-         timeout,
+         nanny.restart(timeout=timeout, reason="scheduler-restart"),
+         timeout=timeout,
      )
FYI these tasks will raise a CommClosedError if a worker is disconnecting from the scheduler while the restart is running.
We're calling ConnectionPool.remove for the worker that is being removed, which closes all connections to that worker, including the one we're using here. This might be the exception we're seeing in coiled/benchmarks#468 (comment)
i.e. this would likely happen with or without wait_for_workers. I'm wondering, though, why this isn't happening even more often.
Here we're connecting to the nanny addresses, not the workers. The nanny connection shouldn't be closed by restart, so I wouldn't expect this to raise CommClosedError
        return {"status": "OK"}
    except Exception as e:
        logger.exception(e)
        return error_message(e)
Why manually return the serialized error? This seems like an unusual pattern. I'd rather just raise an error in the function and have the RPC handle serializing and deserializing it if possible?
If it's just about the error message being hard to understand for users, that's a different topic: #4880.
Chalk that up to me not knowing any better. You're suggesting
-    return error_message(e)
+    raise
?
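For illustration, a minimal sketch of the suggested pattern, assuming the RPC layer serializes exceptions that escape a handler; the handler name and the _do_restart helper are hypothetical:

```python
import logging

logger = logging.getLogger(__name__)

async def restart_handler(self, comm=None, **kwargs):
    try:
        await self._do_restart(**kwargs)  # hypothetical helper doing the actual work
        return {"status": "OK"}
    except Exception as e:
        # Log locally, but let the exception propagate so the RPC machinery
        # is responsible for serializing it back to the caller.
        logger.exception(e)
        raise
```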
self._expect_nannies(workers)
logger.debug("Asking nannies to restart workers: %s.", workers)
I'd expect this code to be in _restart_workers instead
-             timeout=timeout,
-         ),
-         timeout,
+         nanny.restart(timeout=timeout, reason="scheduler-restart"),
Why are we calling restart here instead of kill? I don't trust restart.
- restart doesn't pass the timeout into kill, so even if you have a 60s timeout here, kill will still fail if the worker doesn't shut down in 2s (see the sketch below)
- restart manually calls instantiate, which is redundant because when the worker process exits, it'll call _on_worker_exit, which also calls instantiate. instantiate should be idempotent, so it should be okay, but it feels like weird design. There should be exactly one thing responsible for restarting the worker process.
I assume the reason for this is to make the RPC block until the worker has connected to the scheduler, instead of having to do the wait-for-workers loop scheduler-side. That's reasonable, but only if we fix Nanny.restart first, I think. (edit: see below, I don't like the Nanny being responsible for this at all.)
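For reference, a hypothetical sketch of the fix the first bullet asks for — restart forwarding its timeout into kill — with the other methods stubbed out; this is not the current Nanny implementation:

```python
class NannySketch:
    # Stubs standing in for the real Nanny methods.
    async def kill(self, timeout: float = 2, reason: str = "") -> None: ...
    async def instantiate(self) -> None: ...

    async def restart(self, timeout: float = 30, reason: str = "nanny-restart") -> None:
        # Pass the caller's timeout through to kill() instead of letting
        # kill() fall back to its own much shorter default.
        await self.kill(timeout=timeout, reason=reason)
        # Per the second bullet, bringing the process back up could instead be
        # left entirely to _on_worker_exit rather than calling instantiate() here.
        await self.instantiate()
```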
Design-wise, I also don't love:
- restart being the Nanny's responsibility
- the Nanny being responsible for deciding (or even knowing) when the worker has reconnected to the scheduler
The Nanny currently has a very complicated approach to subprocess management, and a lot of confusing IPC with queues, translating between blocking events and asyncio, etc.
I would love if the Nanny could (in theory) be rewritten in Go or even a bash script as a simple dumb process manager that just does its one job. If the process shuts down with a non-zero exit code, it restarts it. That's it. Maybe it sets some environment variables in the subprocess. Maybe it even still supports nanny plugins to run custom Python code. (Though I wish it couldn't even talk to the scheduler.) But I would love for the worker process to be a black box to the Nanny. They certainly shouldn't need to communicate or share any state; nothing beyond POSIX signals and an exit code.
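As a purely illustrative sketch of that "dumb process manager" idea — not how distributed.Nanny works today, and the worker command line is a placeholder:

```python
import subprocess
import sys

def supervise(cmd: list[str]) -> None:
    """Run ``cmd`` and restart it whenever it exits with a non-zero code."""
    while True:
        returncode = subprocess.run(cmd).returncode
        if returncode == 0:
            # Clean shutdown: stop supervising.
            break
        print(f"process exited with code {returncode}; restarting", file=sys.stderr)

# Usage (placeholder command, not a real worker invocation):
# supervise([sys.executable, "-m", "some_worker_module"])
```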
If that were the case, then Scheduler.restart could be implemented as:
- tell workers to close with non-zero exit code
- wait for workers to reconnect because Nanny restarts them
Or, if the Nanny can communicate with the scheduler (maybe handy when a worker is busted and unresponsive), then Scheduler.restart could be:
- call the Nanny.restart RPC. This sends SIGTERM to the subprocess, joins with a timeout, sends SIGKILL if it's not dead yet, joins again, etc. Once the subprocess is dead, it starts a new subprocess and returns.
- wait for workers to reconnect because the Nanny restarts them
The point is, I think we want Scheduler.restart to block until all workers have reconnected to the scheduler. And I don't think the Nanny should be able to know if a subprocess is connected to the scheduler or not. Therefore, that blocking logic should happen scheduler-side.
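To illustrate, a rough sketch of that scheduler-side blocking; names like self.workers mirror Scheduler attributes, but the helper itself is hypothetical and not this PR's implementation:

```python
import asyncio

async def _restart_and_wait(self, nannies, n_expected, timeout=30):
    deadline = asyncio.get_running_loop().time() + timeout
    # Ask every nanny to kill its worker; the nannies bring them back up.
    await asyncio.gather(
        *(nanny.kill(reason="scheduler-restart", timeout=timeout) for nanny in nannies),
        return_exceptions=True,
    )
    # Block scheduler-side until the expected number of workers has reconnected.
    while len(self.workers) < n_expected:
        if asyncio.get_running_loop().time() > deadline:
            raise TimeoutError(
                f"Only {len(self.workers)} out of {n_expected} workers "
                f"reconnected within {timeout}s of restart"
            )
        await asyncio.sleep(0.1)
```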
Why are we calling restart here instead of kill? I don't trust restart.
Just want to point out that the entire argument in the first comment is about the current implementation of the two handlers. If the handlers don't work as expected, we should fix them and not avoid using them.
restart being the Nanny's responsibility
IMO the nanny is the only sane place to implement this. The queue stuff is unnecessarily complex but that doesn't change anything about the way responsibilities should be assigned. From an ownership perspective, the nanny owns the worker resources and is responsible for closing and restarting it.
I would love if the Nanny could (in theory) be rewritten in Go or even a bash script as a simple dumb process manager that just does its one job. If the process shuts down with a non-zero exit code, it restarts it. That's it. Maybe it sets some environment variables in the subprocess. Maybe it even still supports nanny plugins to run custom Python code. (Though I wish it couldn't even talk to the scheduler.) But I would love for the worker process to be a black box to the Nanny. They certainly shouldn't need to communicate or share any state; nothing beyond POSIX signals and an exit code.
What problem is this actually solving? We'd abandon queues in favor of bash?? How would Go help? These are all implementation details but if we don't fix the fundamental architecture, we'll still be in the same mess.
FWIW, Nanny.kill currently suffers from a race condition that Nanny.restart apparently doesn't: #7312 (comment)
In that issue I intend to clean this logic up a bit anyhow which should resolve the tension here a bit, I believe.
Why are we calling restart here instead of kill? I don't trust restart.
restart is imperfect, we agree on that. Fixing it is something we should do, but I would consider this outside the scope of this PR. The scope of this PR is to simplify the scheduler-side of restarting, i.e., Scheduler.restart, remove a number of implicit assumptions and raise exceptions in their place, and ensure that Client.restart and Client.restart_workers actually have similar semantics.
I agree with Florian that restarts should be the responsibility of the nanny. Arguably, that's all the nanny has to do: Watch over the worker process and restart it when told to. We also need a way to be able to kill/restart a struggling/unresponsive worker, which would not be possible to implement on the worker since it's, well, unresponsive.
Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>
Supersedes #7184
pre-commit run --all-files