Description
The `__hash__` of a `WorkerState` object is just the hash of its address:
distributed/distributed/scheduler.py, line 480 in 33fc50c:

    self._hash = hash(address)

As is the equality check (#3321 #3483):
distributed/distributed/scheduler.py, lines 501 to 504 in 33fc50c:

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, WorkerState):
            return False
        return self.address == other.address

And in general, there are a number of places where we store things in dicts keyed by worker address, and assume that if `ws.address in self.workers`, then `ws is self.workers[ws.address]`. (`stealing.py` is especially guilty: most of its logic is basically built around this.)
However, it's completely valid for a worker to disconnect, then for a new worker to connect from the same address. (Even with reconnection removed #6361, a Nanny #6387 or a user script could do this.) These are logically different workers, though they happen to have the same address.
This can cause:
- bad decisions: a scheduling or work-stealing decision is made about the old worker at that address; when it's enacted, there's a different worker at that address and the decision may no longer be appropriate
- deadlocks: a `WorkerState` object is updated which is no longer in `self.workers` (though its address is), a `TaskState` is made to point at a `WorkerState` which has been removed, etc.
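
To make the failure mode concrete, here is a minimal sketch of the stale-reference problem. `AddressKeyedWorker` is a hypothetical stand-in that only copies the address-based `__hash__` and `__eq__` shown above; the `processing` attribute, the plain `workers` dict, and the address string are assumptions for illustration, not the scheduler's actual state.

```python
class AddressKeyedWorker:
    """Hypothetical stand-in for WorkerState; models only address-based identity."""

    def __init__(self, address: str) -> None:
        self.address = address
        self._hash = hash(address)
        self.processing = set()  # illustrative per-worker state

    def __hash__(self) -> int:
        return self._hash

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, AddressKeyedWorker):
            return False
        return self.address == other.address


workers = {}  # address -> worker, standing in for self.workers

old = AddressKeyedWorker("tcp://10.0.0.1:1234")
workers[old.address] = old

# The old worker disconnects, then a new worker connects from the same address.
del workers[old.address]
new = AddressKeyedWorker("tcp://10.0.0.1:1234")
workers[new.address] = new

# Code holding a reference to `old` cannot tell the two apart by address,
# equality, or hash, even though they are different objects.
address_still_present = old.address in workers
equal = old == new
same_hash = hash(old) == hash(new)
same_object = workers[old.address] is old

# An update applied to the stale object never reaches the live worker.
old.processing.add("task-1")
live_sees_task = "task-1" in workers[old.address].processing
```

In this sketch the address check, the equality check, and the hash comparison all succeed, while the identity check fails and the update to `old` is lost.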
Outcomes:
- `WorkerState` objects should be uniquely identifiable. `WorkerState` objects referring to logically different `dask-worker` invocations must not be equal or have the same hash, even if they happen to have the same address.
- Any logic which gives up control flow (via `await`, or storing some state in a dict to be used later, etc.) must verify, each time it regains control, that the worker it's dealing with still exists in the cluster (not just that its address exists).
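
One possible shape for these two outcomes, as a sketch only: the class `UniqueWorker`, the counter `_worker_ids`, and the helpers `still_in_cluster` and `act_on_worker` are all hypothetical names, and a per-instance counter is just one of several ways to make instances distinguishable.

```python
import asyncio
import itertools

_worker_ids = itertools.count()  # hypothetical source of per-instance IDs


class UniqueWorker:
    """Hypothetical stand-in for WorkerState with per-instance identity."""

    def __init__(self, address: str) -> None:
        self.address = address
        self.uid = next(_worker_ids)

    def __hash__(self) -> int:
        return hash((self.address, self.uid))

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, UniqueWorker):
            return False
        return self.uid == other.uid and self.address == other.address


def still_in_cluster(workers: dict, ws: UniqueWorker) -> bool:
    # Identity check: the address must map to this exact object,
    # not merely to some worker that happens to share the address.
    return workers.get(ws.address) is ws


async def act_on_worker(workers: dict, ws: UniqueWorker) -> str:
    await asyncio.sleep(0)  # control is given up; the cluster may change here
    if not still_in_cluster(workers, ws):
        return "stale"
    return "ok"


async def demo() -> tuple:
    workers = {}
    old = UniqueWorker("tcp://10.0.0.1:1234")
    workers[old.address] = old
    pending = asyncio.create_task(act_on_worker(workers, old))
    # A new worker takes over the same address before the task resumes.
    new = UniqueWorker(old.address)
    workers[new.address] = new
    stale_result = await pending
    fresh_result = await act_on_worker(workers, new)
    return old == new, stale_result, fresh_result


results = asyncio.run(demo())
```

Here the two workers at the same address no longer compare equal, and the coroutine that was started for the old worker notices after its `await` that the worker it was given is gone.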
Alternatives:
- If this is too much of a change to make, we could instead maintain a grow-only set of every worker address the scheduler has ever seen, and prohibit address reuse. The scheduler would just reject a worker trying to connect if it had an address we'd already seen before. Of course, this would eliminate the possibility of adding back worker reconnection (#6391), and maybe break nannies too.
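
A rough sketch of that alternative, assuming a hypothetical `AddressRegistry` helper rather than any existing scheduler API: addresses are recorded on first connection and never forgotten, so a later connection from the same address is refused.

```python
class AddressRegistry:
    """Hypothetical helper: remembers every address ever seen and rejects reuse."""

    def __init__(self) -> None:
        self.workers = {}            # currently connected workers, by address
        self.seen_addresses = set()  # only ever grows

    def add_worker(self, address: str) -> bool:
        # Reject any address that has been used before, connected or not.
        if address in self.seen_addresses:
            return False
        self.seen_addresses.add(address)
        self.workers[address] = object()  # placeholder for a worker record
        return True

    def remove_worker(self, address: str) -> None:
        # The address stays in seen_addresses so it can never be reused.
        self.workers.pop(address, None)


registry = AddressRegistry()
first = registry.add_worker("tcp://10.0.0.1:1234")
registry.remove_worker("tcp://10.0.0.1:1234")
reused = registry.add_worker("tcp://10.0.0.1:1234")
other = registry.add_worker("tcp://10.0.0.1:5678")
```

The second connection from the same address is refused, which is exactly why this approach rules out reconnection: a legitimately restarted worker on a fixed port would be rejected too.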