
Respect priority in earliest start time heuristic #5253

@mrocklin

Description


Background

Today Dask decides where to place a task based on an "earliest start time" heuristic: for a given task it tries to find the worker on which the task will start soonest. It does this by computing two values for each candidate worker:

  • The amount of work currently on the worker (currently measured by occupancy)
  • The amount of time it would take to transfer all dependencies not on the worker to that worker

This is what the worker_objective function computes.
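For orientation, here is a minimal sketch of that objective. The WorkerInfo class and its field names are illustrative stand-ins, not the scheduler's actual data structures:

    from dataclasses import dataclass


    @dataclass
    class WorkerInfo:
        occupancy: float        # estimated seconds of work already queued
        has_what: set           # keys of data already on this worker
        bandwidth: float = 100e6  # bytes/sec, a stand-in for the measured value


    def estimated_start_time(dep_sizes: dict, w: WorkerInfo) -> float:
        """Earliest-start-time objective: work queued on the worker plus
        the time to transfer every dependency it does not already hold."""
        transfer_bytes = sum(
            nbytes for key, nbytes in dep_sizes.items() if key not in w.has_what
        )
        return w.occupancy + transfer_bytes / w.bandwidth


    # Choose the worker where the task would start soonest:
    # best = min(workers, key=lambda w: estimated_start_time(dep_sizes, w))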

Problem

However, this is incorrect if the task we're considering has higher priority than some of the tasks currently on that worker. In that case the worker's full occupancy overstates the wait, because this task gets to cut in line.

In general we could count up all of the work with higher priority than this task, but that might be somewhat expensive, especially in cases where there are lots of tasks on a worker (which is common). This might be the kind of thing where Cython could save us, but even then I'm not sure.
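Concretely, the brute-force version of that count might look like the following sketch, assuming each entry in a worker's processing dict maps a priority tuple to an estimated duration (Dask's actual structures differ):

    def occupancy_ahead_of(task_priority: tuple, processing: dict) -> float:
        """Sum the expected runtimes of queued tasks that would still run
        before a task with the given priority (lower tuple = higher
        priority).  This is O(len(processing)) per scheduling decision,
        which is exactly the expense worried about above."""
        return sum(
            duration
            for priority, duration in processing.items()
            if priority < task_priority
        )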

Proposed solutions

Let's write down a few possible solutions:

  1. Brute force: we can look at every task in ws.processing and add up the occupancy of those with higher priority (as in the sketch above).
  2. Ignore: we could ignore occupancy altogether and just let work stealing take charge.
  3. Middle ground: we could randomly take a few tasks in ws.processing (maybe four?) and see where we stand among them. If we're worse than all of them then we take the full brunt of occupancy; if we're better than all of them then we take 0%; if we're in the middle then we take 50%, and so on (see the sketch after this list).
  4. Fancy: we maintain some sort of t-digest per worker. This seems extreme, but we would only need to track around three quantile values for this to work well most of the time.
  5. Less fancy: maybe we track min/max/mean and blend between them?
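A sketch of the sampling idea in option 3, where the helper name, the default of four samples, and the dict layout are all illustrative assumptions:

    import random


    def sampled_occupancy_fraction(
        task_priority: tuple, processing: dict, k: int = 4
    ) -> float:
        """Rank the task against k randomly sampled queued tasks and return
        the fraction of them that would run first (lower tuple = higher
        priority).  The worker objective would then charge only that
        fraction of the worker's occupancy."""
        if not processing:
            return 0.0
        sample = random.sample(list(processing), min(k, len(processing)))
        ahead = sum(1 for priority in sample if priority < task_priority)
        return ahead / len(sample)


    # e.g. start_time = ws.occupancy * fraction + transfer_time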

Options 3 and 5 seem the most plausible. Each has some concerns:

  1. Sampling: I'm not sure how best to get these items. Plain iteration over the dict is insertion-ordered these days, and so not a great random sample. random.sample is somewhat expensive: %timeit random.sample(list(d), 2) takes 8us for me for a dict with 1000 items.
  2. min/max/mean: Our priorities are hierarchical (tuples), and so a mean (or any quantile) is a little wonky; min and max still compare cleanly, which the sketch below leans on.
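Tuples compare lexicographically, so min and max are well defined even when a mean is not. A crude variant of option 5 that sidesteps the mean entirely, purely as a sketch rather than a concrete proposal:

    def blended_occupancy_fraction(task_priority: tuple, processing: dict) -> float:
        """Blend using only the endpoints: charge 0% of occupancy if we
        outrank every queued task, 100% if every queued task outranks us,
        and 50% otherwise (lower tuple = higher priority)."""
        if not processing:
            return 0.0
        lo, hi = min(processing), max(processing)
        if task_priority <= lo:
            return 0.0
        if task_priority >= hi:
            return 1.0
        return 0.5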

Importance

This is, I suspect, especially important in workloads where we have lots of rootish tasks, which are common. The variance among all of those tasks can easily swamp the signal that tasks should stay where their data is.
