Worker addresses are treated as unique identifiers, but may not be (#6392)

@gjoseph92

Description

The __hash__ of a WorkerState object is just its address:

self._hash = hash(address)

As is the equality check (#3321, #3483):

def __eq__(self, other: object) -> bool:
    if not isinstance(other, WorkerState):
        return False
    return self.address == other.address

And in general, there are a number of places where we store things in dicts keyed by worker address, and assume that if ws.address in self.workers, then ws is self.workers[ws.address]. (stealing.py is especially guilty—most of its logic is basically built around this.)

However, it's completely valid for a worker to disconnect, then for a new worker to connect from the same address. (Even with reconnection removed #6361, a Nanny #6387 or a user script could do this.) These are logically different workers, though they happen to have the same address.
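To make the problem concrete, here is a minimal, self-contained sketch. The class below is not the real WorkerState; it's a toy that only mimics the address-based __hash__/__eq__ shown above:

```python
class ToyWorkerState:
    """Simplified stand-in for WorkerState: identity is the address only."""

    def __init__(self, address):
        self.address = address
        self._hash = hash(address)

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        if not isinstance(other, ToyWorkerState):
            return False
        return self.address == other.address


old = ToyWorkerState("tcp://127.0.0.1:40000")  # original dask-worker invocation
new = ToyWorkerState("tcp://127.0.0.1:40000")  # a later, logically different invocation

assert old == new and hash(old) == hash(new)   # treated as the same worker
assert old is not new                          # ...but they are different objects

# Any dict keyed by WorkerState (or by address) now conflates the two:
occupancy = {old: 12.5}
assert occupancy[new] == 12.5  # lookups for the new worker hit the old worker's entry
```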

This can cause:

  • bad decisions: a scheduling or work-stealing decision is made about the old worker at an address; by the time it's enacted, a different worker occupies that address and the decision may no longer be appropriate
  • deadlocks: a WorkerState object that is no longer in self.workers (though its address is) gets updated, a TaskState is made to point at a WorkerState which has been removed, etc.

Outcomes:

  • WorkerState objects should be uniquely identifiable. WorkerState objects referring to logically different dask-worker invocations must not be equal or have the same hash, even if they happen to have the same address (see the sketch after this list).
  • Any logic which gives up control flow (via await, by storing state in a dict to be used later, etc.) must verify, each time it regains control, that the worker it's dealing with still exists in the cluster (not just that its address exists).
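A minimal sketch of both outcomes, assuming a hypothetical per-invocation server_id token (the real fix would need to thread such an identifier through the scheduler; the class and helper names here are illustrative only, not the actual API):

```python
import asyncio
import uuid


class ToyUniqueWorkerState:
    """Toy sketch: identity is (address, server_id), not address alone."""

    def __init__(self, address, server_id=None):
        self.address = address
        # Hypothetical unique token minted once per dask-worker invocation.
        self.server_id = server_id or uuid.uuid4().hex
        self._hash = hash((address, self.server_id))

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        if not isinstance(other, ToyUniqueWorkerState):
            return False
        return (self.address, self.server_id) == (other.address, other.server_id)


async def do_something_slow():
    await asyncio.sleep(0)  # stand-in for any await that yields control


async def act_on_worker(scheduler, ws):
    """Revalidation pattern: after regaining control, check object identity,
    not just address membership, before acting on a captured WorkerState."""
    await do_something_slow()
    if scheduler.workers.get(ws.address) is not ws:
        # The worker we planned for is gone; a different worker may now own
        # this address. Abandon or recompute the decision.
        return
    # ... safe to act on ws here ...
```

With identity defined this way, two invocations at the same address no longer compare equal, and the `is not ws` check catches address reuse even where equality would not.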

Alternatives:

  • If this is too much of a change to make, we could instead maintain a grow-only set of every worker address ever seen and prohibit address reuse: the scheduler would simply reject a connecting worker whose address had already been used. Of course, this would rule out worker reconnection (Add back worker reconnection, #6391) and would probably break nannies too (see the sketch below).
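A rough sketch of that alternative, assuming a hypothetical check in the scheduler's add-worker path (the class, method, and attribute names are illustrative, not the real Scheduler API):

```python
class ToyScheduler:
    """Toy sketch of rejecting address reuse via a grow-only set of seen addresses."""

    def __init__(self):
        self.workers = {}            # address -> worker state
        self.seen_addresses = set()  # never shrinks, even when workers leave

    def add_worker(self, address):
        if address in self.seen_addresses:
            # Reject: this address was used by a previous worker, even one that
            # has since disconnected. Note this rules out reconnection (#6391)
            # and would likely break a Nanny restarting a worker on a fixed port.
            raise ValueError(f"address {address} was already used by a previous worker")
        self.seen_addresses.add(address)
        self.workers[address] = object()  # placeholder for real worker state
```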

Causes #6356, #3256, #6263, maybe #3892

cc @crusaderky @fjetter @bnaul
