Conversation

@selshowk
Contributor

@selshowk selshowk commented Nov 2, 2020

This PR implements more adaptive worker duration estimates for "outlier" tasks (defined as those whose ongoing runtime exceeds twice the current average task duration). Rather than using the average duration as the estimate, we estimate 2x the current runtime when a task is an outlier. See the discussion in #4192 as well.
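For illustration only, here is a rough sketch of the estimate described above; it is not the scheduler's actual code, and the names estimate_duration, avg_duration, and start_time are invented for this example:

```python
import time


def estimate_duration(avg_duration: float, start_time: float, now: float = None) -> float:
    """Adaptive estimate sketch: if a running task has already exceeded twice
    the current average duration for its group, treat it as an outlier and
    estimate that it will take roughly twice the time it has been running so
    far; otherwise fall back to the average duration."""
    if now is None:
        now = time.time()
    elapsed = now - start_time
    if elapsed > 2 * avg_duration:  # task counts as an "outlier"
        return 2 * elapsed
    return avg_duration
```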

@jrbourbeau I had to add an additional check to the edits from #4192 because otherwise my test below generated a missing key error. Can you comment on whether that check makes sense, or whether it suggests a deeper problem?

CCing @jrbourbeau @mrocklin for thoughts.

Rather than always using the average task duration as an estimate we
flag "outliers" (tasks that are taking 2x longer than expected duration)
and we set expected duration to be twice their current running time.

NOTE: also added check for missing key in worker metrics code.
@pentschev
Member

In dask-cuda we run some Distributed tests that require CUDA support, and I just found out that the failures we see there are due to pytest-asyncio=0.14.0; the most recent version available on Anaconda (conda-forge) was pytest-asyncio=0.12.0 until 3 days ago: https://anaconda.org/conda-forge/pytest-asyncio/files .

Locally I can run the tests from dask-cuda by just reverting to pytest-asyncio=0.12.0, and I noticed the builds here are picking up pytest-asyncio=0.14.0, in contrast to pytest-asyncio=0.12.0 in #4211, where builds are passing. I'm not yet sure whether there's something that needs to be changed in Distributed or whether there's a bug in pytest-asyncio=0.14.0.

@pentschev
Member

I only just saw that an older version of pytest-asyncio was pinned in #4212; maybe running the tests here again will fix the failures?

@jrbourbeau
Member

Yeah, running tests again here should pick up the changes in #4212, which will fix test failures (up to a few known flaky tests)

@jrbourbeau
Member

@selshowk I pushed a couple of commits to resolve some merge conflicts and update the test added here (hope that's okay). Let me know what you think

_actors: set
_address: str
_bandwidth: double
_executing: dict
Member

We're adding a new attribute to the WorkerState class which we recently added type annotations to for Cythonization. @jakirkham I think I took care of everything needed for this addition, but is there some kind of check I can run to make sure Cython is happy? For instance, is being able to successfully build the C-extensions a sufficient check?

Member

Thanks James! Looks like everything is good so far.

Had a few comments on the new method. I don't know whether we plan to keep it or not (if not, the comments can just serve as an example for new functions).

Member

Oh sorry, I missed your second question. We build with Cython on the Python 3.7 GitHub Actions job and run the tests there. If they pass, we should be good 🙂

Member

@jakirkham jakirkham left a comment

Mostly things look good on the Cython front. Just a couple suggestions here on the new method (assuming we keep that).

@mrocklin
Member

mrocklin commented Dec 5, 2020 via email

@mrocklin
Member

mrocklin commented Dec 5, 2020 via email

@jakirkham
Member

Well I know @quasiben did some work setting up benchmarks that run nightly. I don't think we can promise to run them on every PR. Maybe we can eyeball those and see how things are going? Perhaps Ben has thoughts on these sorts of things 😉

Generally I think writing code that Cython can do well with is not difficult. It's really a matter of annotating variables with types and sticking to them. This has much less to do with Cython and more to do with generally being thoughtful while programming. I would describe it similarly to how we advise people to write code that will run well with Dask: first think about how it can be written to parallelize well, and then running it with Dask isn't so hard. Cython isn't too different 🙂
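As a made-up illustration of that point (not code from this PR), Cython's pure-Python mode lets you annotate attributes and arguments with concrete types and then stick to them, so the compiled build gets C-level attributes and arithmetic while the same file still runs as ordinary Python:

```python
import cython


@cython.cclass
class DurationAverage:
    """Hypothetical example of Cython-friendly Python: typed attributes that
    become C struct fields when compiled."""

    _count: cython.int
    _total: cython.double

    def __init__(self):
        self._count = 0
        self._total = 0.0

    def add(self, duration: cython.double) -> cython.double:
        # Sticking to the annotated types keeps this arithmetic in C when
        # compiled, while preserving plain Python semantics when interpreted.
        self._count += 1
        self._total += duration
        return self._total / self._count
```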

Co-authored-by: jakirkham <jakirkham@gmail.com>
Member

@jakirkham jakirkham left a comment

I'm guessing the numbers set_duration_estimate is working with are floating point. If so, we can do a little more annotation through here, which will get us straight C code for the computation when Cythonized.

Skipped transition_waiting_processing as I'm planning to go back over that one in detail later.
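A hedged sketch of the kind of annotation being suggested (set_duration_estimate is the name mentioned above, but this signature and body are invented for illustration, not the PR's actual function):

```python
import cython


def set_duration_estimate(avg_duration: cython.double, elapsed: cython.double) -> cython.double:
    # Annotating the arguments and the local as doubles lets Cython emit
    # straight C arithmetic for the comparison and multiplication below.
    threshold: cython.double = 2.0 * avg_duration
    if elapsed > threshold:
        return 2.0 * elapsed
    return avg_duration
```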

@jakirkham
Member

Took the liberty of fixing conflicts with master (pretty minor changes really), hope that is ok 🙂

Member

@jrbourbeau jrbourbeau left a comment

Thanks for your work on this @selshowk @jakirkham!

(FWIW the failures observed on Travis CI have been seen on other PRs and are unrelated to the changes here)

@jrbourbeau jrbourbeau merged commit 976e02e into dask:master Dec 21, 2020
@jakirkham
Member

Thanks James! 😄 I think @selshowk did all the real work here. Just tried to keep merge conflicts to a minimum 😉
