Better task duration estimates for outliers. #4213
Conversation
Rather than always using the average task duration as an estimate, we flag "outliers" (tasks that have already been running for more than 2x their expected duration) and set their expected duration to twice their current running time. NOTE: this also adds a check for a missing key in the worker metrics code.
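A minimal sketch of the heuristic described above, in plain Python; the function and argument names are illustrative rather than the actual scheduler attributes:

```python
def estimate_duration(avg_duration: float, elapsed: float) -> float:
    """Estimate the total duration of a currently running task.

    If the task has already been running for more than twice the
    historical average, treat it as an outlier and assume it will take
    twice its elapsed time so far; otherwise fall back to the average.
    """
    if elapsed > 2 * avg_duration:
        return 2 * elapsed
    return avg_duration


# The average is 1s, but the task has already run for 5s, so the
# estimate becomes 10s rather than the stale 1s average.
assert estimate_duration(1.0, 5.0) == 10.0
assert estimate_duration(1.0, 0.5) == 1.0
```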
We run some Distributed tests that require CUDA support in dask-cuda, and I just found out that the failures we see there are due to … Locally I can run the tests from dask-cuda by just reverting to …
Only saw that older version of …
Yeah, running tests again here should pick up the changes in #4212, which will fix the test failures (up to a few known flaky tests).
…ime_outlier_duration
…ime_outlier_duration
@selshowk I pushed a couple of commits to resolve some merge conflicts and update the test added here (hope that's okay). Let me know what you think.
  _actors: set
  _address: str
  _bandwidth: double
+ _executing: dict
We're adding a new attribute to the WorkerState class, which we recently added type annotations to for Cythonization. @jakirkham I think I took care of everything needed for this addition, but is there some kind of check I can run to make sure Cython is happy? For instance, is being able to successfully build the C extensions a sufficient check?
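For context, a simplified sketch of the pattern (the real WorkerState has many more attributes and annotates _bandwidth as a Cython double; this is only meant to show where a new attribute has to appear):

```python
class WorkerState:
    # Class-level annotations let Cython type the attributes when the
    # module is compiled; under plain Python they are just annotations.
    _actors: set
    _address: str
    _bandwidth: float  # `double` in the annotated scheduler code
    _executing: dict   # the new attribute added in this PR

    def __init__(self, address: str = "", bandwidth: float = 0.0):
        self._actors = set()
        self._address = address
        self._bandwidth = bandwidth
        # The new attribute also needs to be initialized here so the
        # annotation and the runtime value stay in sync.
        self._executing = {}
```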
Thanks James! Looks like everything is good so far.
Had a few comments on the new method. Not sure whether we plan to keep it or not (if not, they can just serve as an example for new functions).
Oh sorry, missed your second question. We build with Cython on the Python 3.7 GitHub Actions job and run the tests there. If they pass, we should be good 🙂
jakirkham left a comment
Mostly things look good on the Cython front. Just a couple suggestions here on the new method (assuming we keep that).
This brings up an interesting question about how we maintain Cython performance. The tests will make sure that things are valid, but I wouldn't be surprised if we slip a little with each PR, especially among those contributed by drive-by contributors.

I suspect that for the near term we'll want to do little Cython "tune-ups" before release, and that long term we may want to see if some sort of Cython coverage tool, perhaps one that uses the annotated HTML output, is possible.
This isn't something to think about too much right now (let's make sure that Cython is a good idea first), but it's a fun conversation to hold for the future.
Well, I know @quasiben did some work setting up benchmarks that run nightly. I don't think we can promise to run them on every PR, but maybe we can eyeball those and see how things are going? Perhaps Ben has thoughts on these sorts of things 😉

Generally I think writing code that Cython can do well with is not difficult. It's really a matter of annotating variables with types and sticking to them. This has much less to do with Cython and more to do with being thoughtful while programming generally. I would describe it similarly to how we advise people to write code that will run well with Dask: first think about how it can be written to parallelize well, and then running it with Dask isn't so hard. Cython isn't too different 🙂
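As a small, hypothetical illustration of "annotating variables with types and sticking to them" (assuming Cython is installed and using its pure-Python mode; the function itself is made up):

```python
import cython

def count_slow_tasks(durations: dict, threshold: float) -> int:
    # Declare the loop variables once and keep assigning values of the
    # same type so Cython can treat them as unboxed C variables.
    n: cython.Py_ssize_t = 0
    d: cython.double
    for d in durations.values():
        if d > threshold:
            n += 1
    return n
```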
Co-authored-by: jakirkham <jakirkham@gmail.com>
jakirkham left a comment
Guessing the numbers set_duration_estimate is working with are floating point. If so, we can do a little more annotation through here. This will get us some straight C code for computation when Cythonized.
Skipped transition_waiting_processing as I'm planning to go back over that one in detail later.
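A hypothetical illustration of the kind of annotation being suggested, assuming Cython is installed and the inputs really are floats; the body below is a guess at the method's shape, not the actual set_duration_estimate implementation:

```python
import cython

@cython.cfunc
def set_duration_estimate(
    avg_duration: cython.double, elapsed: cython.double
) -> cython.double:
    # With the arguments and the local declared as C doubles, the
    # comparison and multiplication compile to straight C when the
    # module is built with Cython; uncompiled, it runs as plain Python.
    estimate: cython.double
    if elapsed > 2.0 * avg_duration:
        estimate = 2.0 * elapsed
    else:
        estimate = avg_duration
    return estimate
```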
Co-authored-by: jakirkham <jakirkham@gmail.com>
Took the liberty of fixing conflicts with …
Co-authored-by: jakirkham <jakirkham@gmail.com>
…ime_outlier_duration
jrbourbeau left a comment
Thanks for your work on this @selshowk @jakirkham!
(FWIW the failures observed on Travis CI have been seen on other PRs and are unrelated to the changes here)
Thanks James! 😄 I think @selshowk did all the real work here. Just tried to keep merge conflicts to a minimum 😉
This PR implements more adaptive worker duration estimates for "outlier" tasks (defined as those whose ongoing runtime exceeds twice the current average task duration). Rather than use the average duration as the estimate, we estimate 2x the current runtime when the task is an outlier. See the discussion in #4192 as well.
@jrbourbeau I had to add an additional check to the edits from #4192 because otherwise my test below generated a missing key error. Can you comment on whether it makes sense, or whether it suggests a deeper problem?
CCing @jrbourbeau @mrocklin for thoughts.