Better task duration estimates for outliers. #4213
Conversation
Rather than always using the average task duration as an estimate, we flag "outliers" (tasks that have already been running for more than 2x their expected duration) and set their expected duration to twice their current running time. NOTE: this also adds a check for a missing key in the worker metrics code.
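A minimal sketch of the heuristic described above, in plain Python; the function and argument names are illustrative rather than the actual scheduler attributes:

```python
def estimate_duration(avg_duration: float, elapsed: float) -> float:
    """Estimate the total duration of a currently running task.

    If the task has already been running for more than twice the
    historical average, treat it as an outlier and assume it will take
    twice its elapsed time so far; otherwise fall back to the average.
    """
    if elapsed > 2 * avg_duration:
        return 2 * elapsed
    return avg_duration


# The average is 1s, but the task has already run for 5s, so the
# estimate becomes 10s rather than the stale 1s average.
assert estimate_duration(1.0, 5.0) == 10.0
assert estimate_duration(1.0, 0.5) == 1.0
```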
We run some Distributed tests that require CUDA support in dask-cuda, and I just found out that the failures we see there are due to … Locally I can run the tests from dask-cuda by just reverting to …
Only saw that older version of …
Yeah, running tests again here should pick up the changes in #4212, which will fix the test failures (up to a few known flaky tests).
…ime_outlier_duration
…ime_outlier_duration
@selshowk I pushed a couple of commits to resolve some merge conflicts and update the test added here (hope that's okay). Let me know what you think.
  _actors: set
  _address: str
  _bandwidth: double
+ _executing: dict
We're adding a new attribute to the WorkerState class, which we recently added type annotations to for Cythonization. @jakirkham I think I took care of everything needed for this addition, but is there some kind of check I can run to make sure Cython is happy? For instance, is being able to successfully build the C extensions a sufficient check?
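For context, a simplified sketch of the pattern (the real WorkerState has many more attributes and annotates _bandwidth as a Cython double; this is only meant to show where a new attribute has to appear):

```python
class WorkerState:
    # Class-level annotations let Cython type the attributes when the
    # module is compiled; under plain Python they are just annotations.
    _actors: set
    _address: str
    _bandwidth: float  # `double` in the annotated scheduler code
    _executing: dict   # the new attribute added in this PR

    def __init__(self, address: str = "", bandwidth: float = 0.0):
        self._actors = set()
        self._address = address
        self._bandwidth = bandwidth
        # The new attribute also needs to be initialized here so the
        # annotation and the runtime value stay in sync.
        self._executing = {}
```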
Thanks James! Looks like everything is good so far.
Had a few comments on the new method. Not sure whether we plan to keep it or not (if not, they can just serve as an example for new functions).
Oh sorry, missed your second question. We build with Cython on the Python 3.7 GitHub Actions job and run the tests there. If they pass, we should be good 🙂
jakirkham left a comment
Mostly things look good on the Cython front. Just a couple suggestions here on the new method (assuming we keep that).
This brings up an interesting question about how we maintain Cython performance. The tests will make sure that things are valid, but I wouldn't be surprised if we slip a little with each PR, especially among those contributed by drive-by contributors.

I suspect that for the near term we'll want to do little Cython "tune-ups" before release, and that long term we may want to see if some sort of Cython coverage tool, perhaps one that uses the annotated HTML output, is possible.
This isn't something to think about too much right now (let's make sure that Cython is a good idea first), but it's a fun conversation to hold for the future.
Well, I know @quasiben did some work setting up benchmarks that run nightly. I don't think we can promise to run them on every PR, but maybe we can eyeball those and see how things are going? Perhaps Ben has thoughts on these sorts of things 😉

Generally I think writing code that Cython can do well with is not difficult. It's really a matter of annotating variables with types and sticking to them. This has much less to do with Cython and more to do with being thoughtful while programming generally. I would describe it similarly to how we advise people to write code that will run well with Dask: first think about how it can be written to parallelize well, and then running it with Dask isn't so hard. Cython isn't too different 🙂
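As a small, hypothetical illustration of "annotating variables with types and sticking to them" (assuming Cython is installed and using its pure-Python mode; the function itself is made up):

```python
import cython

def count_slow_tasks(durations: dict, threshold: float) -> int:
    # Declare the loop variables once and keep assigning values of the
    # same type so Cython can treat them as unboxed C variables.
    n: cython.Py_ssize_t = 0
    d: cython.double
    for d in durations.values():
        if d > threshold:
            n += 1
    return n
```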
Co-authored-by: jakirkham <jakirkham@gmail.com>
jakirkham left a comment
Guessing the numbers set_duration_estimate is working with are floating point. If so, we can do a little more annotation through here. This will get us some straight C code for computation when Cythonized.
Skipped transition_waiting_processing as I'm planning to go back over that one in detail later.
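A hypothetical illustration of the kind of annotation being suggested, assuming Cython is installed and the inputs really are floats; the body below is a guess at the method's shape, not the actual set_duration_estimate implementation:

```python
import cython

@cython.cfunc
def set_duration_estimate(
    avg_duration: cython.double, elapsed: cython.double
) -> cython.double:
    # With the arguments and the local declared as C doubles, the
    # comparison and multiplication compile to straight C when the
    # module is built with Cython; uncompiled, it runs as plain Python.
    estimate: cython.double
    if elapsed > 2.0 * avg_duration:
        estimate = 2.0 * elapsed
    else:
        estimate = avg_duration
    return estimate
```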
Co-authored-by: jakirkham <jakirkham@gmail.com>
Took the liberty of fixing conflicts with …
Co-authored-by: jakirkham <jakirkham@gmail.com>
…ime_outlier_duration
jrbourbeau left a comment
Thanks for your work on this @selshowk @jakirkham!
(FWIW the failures observed on Travis CI have been seen on other PRs and are unrelated to the changes here)
Thanks James! 😄 I think @selshowk did all the real work here. Just tried to keep merge conflicts to a minimum 😉
This PR implements more adaptive worker duration estimates for "outlier" tasks (defined as those whose ongoing runtime exceeds twice the current average task duration). Rather than use the average duration as the estimate, we estimate 2x the current runtime when the task is an outlier. See the discussion in #4192 as well.
@jrbourbeau I had to add an additional check to the edits from #4192 because otherwise my test below generated a missing key error. Can you comment on whether it makes sense, or whether it suggests a deeper problem?
CCing @jrbourbeau @mrocklin for thoughts.