
Conversation

@hendrikmakait (Member) commented Nov 2, 2022

Supersedes #7184

  • Tests added / passed
  • Passes pre-commit run --all-files

logger.error("%s timed out trying to restart its worker.", nanny)
bad_nannies.append(nanny)
if bad_nannies:
    raise RuntimeError(

@hendrikmakait (Member Author):

We may want to differentiate between "pure" timeouts, i.e., cases where every error we see is a TimeoutError, and raise a TimeoutError in that case, but otherwise raise a generic RuntimeError aggregating all occurring errors to signify that we observed an actual problem rather than just a timeout. Then again, I'm not sure that's really a problem, or that it would warrant the additional logic.

@hendrikmakait (Member Author):

Another thought: The benefit of collecting errors and handling them in bulk is that we can pass more information to the user if we return before the timeout. For example, if 5 nannies fail to restart and the other 95 succeed, we will be able to tell that to the user. The downside is that we do not pass any information back to the user if a single nanny times out.
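
For illustration, a minimal sketch of what the error aggregation described above could look like (the helper name, the `results` mapping, and the message wording are assumptions, not this PR's implementation):

```python
import asyncio


def _raise_on_failed_restarts(results: dict) -> None:
    """Hypothetical helper: ``results`` maps nanny address -> exception raised while
    restarting (or None on success), e.g. collected via
    ``asyncio.gather(..., return_exceptions=True)``."""
    failures = {addr: exc for addr, exc in results.items() if exc is not None}
    if not failures:
        return
    n_ok = len(results) - len(failures)
    if all(isinstance(exc, (TimeoutError, asyncio.TimeoutError)) for exc in failures.values()):
        # "Pure" timeouts: surface them as a TimeoutError
        raise TimeoutError(
            f"{len(failures)} nanny(ies) timed out while restarting workers "
            f"({n_ok} restarted successfully): {sorted(failures)}"
        )
    # Mixed or non-timeout failures: aggregate everything into a generic RuntimeError
    raise RuntimeError(
        f"{len(failures)} nanny(ies) failed to restart workers ({n_ok} succeeded): "
        + "; ".join(f"{addr}: {exc!r}" for addr, exc in sorted(failures.items()))
    )
```

This way, a pure-timeout run would still surface as a TimeoutError, while partial failures would report how many nannies succeeded.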

logger = logging.getLogger(__name__)


class _Watch:

@hendrikmakait (Member Author):

This part conflicts with #7238

@hendrikmakait marked this pull request as ready for review November 2, 2022 15:29
@hendrikmakait requested a review from fjetter November 2, 2022 16:29

github-actions bot (Contributor) commented Nov 2, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       15 files   ±0       15 suites  ±0    6h 12m 56s ⏱️ −11m 49s
    3 226 tests   ±0     3 141 ✔️ −1       84 💤 +1      1 ±0
   24 539 runs  +687    23 551 ✔️ +608    987 💤 +79     1 ±0

For more details on these failures, see this check.

Results for commit 88e5d52. ± Comparison against base commit b6bff75.

♻️ This comment has been updated with latest results.

Comment on lines 5787 to 5791
  *(
      asyncio.wait_for(
-         # FIXME does not raise if the process fails to shut down,
-         # see https://github.com/dask/distributed/pull/6427/files#r894917424
-         # NOTE: Nanny will automatically restart worker process when it's killed
-         nanny.kill(
-             reason="scheduler-restart",
-             timeout=timeout,
-         ),
-         timeout,
+         nanny.restart(timeout=timeout, reason="scheduler-restart"),
+         timeout=timeout,
      )

Member:

FYI these tasks will raise a CommClosedError if a worker is disconnecting from the scheduler while the restart is running.

We're calling ConnectionPool.remove for the worker that is being removed, which closes all connections to that worker, including the one we're using here. This might be the exception we're seeing in coiled/benchmarks#468 (comment), i.e., this would likely happen with or without wait_for_workers. I'm wondering, though, why this isn't happening even more often.

Collaborator:

Here we're connecting to the nanny addresses, not the workers. The nanny connection shouldn't be closed by restart, so I wouldn't expect this to raise CommClosedError

    return {"status": "OK"}
except Exception as e:
    logger.exception(e)
    return error_message(e)

Collaborator:

Why manually return the serialized error? This seems like an unusual pattern. I'd rather just raise an error in the function and have the RPC handle serializing and deserializing it if possible?

If it's just about the error message being hard to understand for users, that's a different topic: #4880.

@hendrikmakait (Member Author):

Chalk that up to me not knowing any better. You're suggesting

Suggested change:
- return error_message(e)
+ raise

?
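
For context, a minimal sketch of the pattern being suggested, assuming the RPC layer serializes an exception raised inside a handler and re-raises it on the calling side (the handler name and body here are illustrative, not this PR's code):

```python
import logging

logger = logging.getLogger(__name__)


async def restart_worker_handler(nanny, timeout):  # illustrative name and signature
    try:
        await nanny.restart(timeout=timeout, reason="scheduler-restart")
        return {"status": "OK"}
    except Exception:
        logger.exception("Restarting the worker via %s failed", nanny)
        # Re-raise and let the RPC layer serialize the exception for the caller,
        # rather than returning error_message(e) by hand.
        raise
```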

Comment on lines +5766 to +5767
self._expect_nannies(workers)
logger.debug("Asking nannies to restart workers: %s.", workers)

Collaborator:

I'd expect this code to be in _restart_workers instead

+         nanny.restart(timeout=timeout, reason="scheduler-restart"),

@gjoseph92 (Collaborator) commented Nov 14, 2022:

Why are we calling restart here instead of kill? I don't trust restart.

  1. restart doesn't pass the timeout into kill, so even if you have a 60s timeout here, kill will still fail if the worker doesn't shut down in 2s
  2. restart manually calls instantiate, which is redundant because when the worker process exits, it'll call _on_worker_exit which also calls instantiate. instantiate should be idempotent so it should be okay, but it feels like weird design. There should be exactly one thing responsible for restarting the worker process.

I assume the reason for this is to make the RPC block until the worker has connected to the scheduler, instead of having to do the wait-for-workers loop scheduler-side. That's reasonable, but only if we fix Nanny.restart first I think. edit: see below, I don't like the Nanny being responsible for this at all.
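
As a rough illustration of point 1, forwarding the outer timeout budget into the kill step might look something like this (a sketch only; the real Nanny.restart and Nanny.kill differ in signature and internals, and this is not this PR's code):

```python
class Nanny:  # sketch of just the relevant method
    async def restart(self, timeout: float = 30, reason: str = "nanny-restart") -> str:
        # Hypothetical: pass the full timeout budget into kill instead of relying
        # on kill's small default, then bring up a fresh worker process.
        await self.kill(timeout=timeout, reason=reason)
        await self.instantiate()  # or leave this to _on_worker_exit, which also calls it
        return "OK"
```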

Collaborator:

Design-wise, I also don't love

  1. restart being the Nanny's responsibility
  2. the Nanny being responsible for deciding (or even knowing) when the worker has reconnected to the scheduler

The Nanny currently has a very complicated approach to subprocess management, and a lot of confusing IPC with queues, translating between blocking events and asyncio, etc.

I would love if the Nanny could (in theory) be rewritten in Go or even a bash script as a simple dumb process manager that just does its one job. If the process shuts down with a non-zero exit code, it restarts it. That's it. Maybe it sets some environment variables in the subprocess. Maybe it even still supports nanny plugins to run custom Python code. (Though I wish it couldn't even talk to the scheduler.) But I would love for the worker process to be a black box to the Nanny. They certainly shouldn't need to communicate or share any state; nothing beyond POSIX signals and an exit code.

If that were the case, then Scheduler.restart could be implemented as:

  1. tell workers to close with non-zero exit code
  2. wait for workers to reconnect because Nanny restarts them

Or, if the Nanny can communicate with the scheduler (maybe handy when a worker is busted and unresponsive), then Scheduler.restart could be:

  1. call Nanny.restart RPC. This sends SIGTERM to the subprocess, joins with timeout, if it's not dead yet sends SIGKILL, joins, etc. Once the subprocess is dead, starts a new subprocess and returns.
  2. wait for workers to reconnect because Nanny restarts them

The point is, I think we want Scheduler.restart to block until all workers have reconnected to the scheduler. And I don't think the Nanny should be able to know if a subprocess is connected to the scheduler or not. Therefore, that blocking logic should happen scheduler-side.
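
A rough sketch of that scheduler-side blocking, assuming a simple count-based poll (the helper name and polling interval are assumptions, not this PR's implementation):

```python
import asyncio
import time


async def wait_for_workers_to_return(scheduler, n_workers: int, timeout: float) -> None:
    """Hypothetical helper: block until at least ``n_workers`` workers are connected
    to the scheduler again, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while len(scheduler.workers) < n_workers:  # Scheduler.workers: address -> WorkerState
        if time.monotonic() > deadline:
            raise TimeoutError(
                f"Waited {timeout} s, but only {len(scheduler.workers)} out of "
                f"{n_workers} workers reconnected after the restart."
            )
        await asyncio.sleep(0.1)
```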

Member:

> Why are we calling restart here instead of kill? I don't trust restart.

Just want to point out that the entire argument in the first comment is about the current implementation of the two handlers. If the handlers don't work as expected, we should fix them and not avoid using them.

> restart being the Nanny's responsibility

IMO the nanny is the only sane place to implement this. The queue stuff is unnecessarily complex but that doesn't change anything about the way responsibilities should be assigned. From an ownership perspective, the nanny owns the worker resources and is responsible for closing and restarting it.

> I would love if the Nanny could (in theory) be rewritten in Go or even a bash script as a simple dumb process manager that just does its one job. If the process shuts down with a non-zero exit code, it restarts it. That's it. Maybe it sets some environment variables in the subprocess. Maybe it even still supports nanny plugins to run custom Python code. (Though I wish it couldn't even talk to the scheduler.) But I would love for the worker process to be a black box to the Nanny. They certainly shouldn't need to communicate or share any state; nothing beyond POSIX signals and an exit code.

What problem is this actually solving? We'd abandon queues in favor of bash?? How would Go help? These are all implementation details but if we don't fix the fundamental architecture, we'll still be in the same mess.

Member:

FWIW, Nanny.kill currently suffers from a race condition that Nanny.restart apparently doesn't: #7312 (comment)

In that issue, I intend to clean this logic up a bit anyway, which should ease the tension here somewhat, I believe.

@hendrikmakait (Member Author):

> Why are we calling restart here instead of kill? I don't trust restart.

restart is imperfect; we agree on that. Fixing it is something we should do, but I would consider it outside the scope of this PR. The scope of this PR is to simplify the scheduler side of restarting, i.e., Scheduler.restart, to remove a number of implicit assumptions and raise exceptions in their place, and to ensure that Client.restart and Client.restart_workers actually have similar semantics.

I agree with Florian that restarts should be the responsibility of the nanny. Arguably, that's all the nanny has to do: watch over the worker process and restart it when told to. We also need a way to kill/restart a struggling or unresponsive worker, which could not be implemented on the worker itself since it's, well, unresponsive.
