Avoid repeatedly using the same worker on first task with quiet cluster. #4638
Conversation
|
Also, thanks for taking this on. I think that this is a great educational
issue for you.
…On Fri, Mar 26, 2021 at 2:06 PM Doug Davis ***@***.***> wrote:
This adds an additional check to avoid quiet clusters (as Matt describes
in #4637) repeatedly
using the same worker (the new test fails without the added sum of
occupancies check). I understand this is a CPU sensitive area, and perhaps
the sum of occupancies is too expensive a task to use here. Very happy to
hear suggestions to try alternatives.
- Closes #4637
- Tests added / passed
- Passes black distributed / flake8 distributed
------------------------------
You can view, comment on, or merge this pull request online at:
#4638
Commit Summary
- Add protection against repeated use of one worker in a quiet cluster
- move and rename test
File Changes
- *M* distributed/scheduler.py
<https://github.com/dask/distributed/pull/4638/files#diff-bbcf2e505bf2f9dd0dc25de4582115ee4ed4a6e80997affc7b22122912cc6591>
(8)
- *M* distributed/tests/test_scheduler.py
<https://github.com/dask/distributed/pull/4638/files#diff-d4f4e24d830a24ab478c8cdce0a8da2357f57d7dbc08a7181c79328db65dbdff>
(8)
Patch Links:
- https://github.com/dask/distributed/pull/4638.patch
- https://github.com/dask/distributed/pull/4638.diff
|
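To make the idea in the description concrete, here is a minimal sketch of the quiet-cluster fallback, assuming a hypothetical `workers` mapping whose values expose an `occupancy` attribute and a running `n_tasks` counter; these names are illustrative stand-ins, not the scheduler's actual internals:

```python
import operator


def pick_worker(workers: dict, n_tasks: int):
    """Illustrative only: choose a worker for a new task.

    `workers` maps addresses to objects with an `occupancy` attribute;
    `n_tasks` counts tasks scheduled so far (both hypothetical).
    """
    ws_list = list(workers.values())
    # On a busy cluster, prefer the least-occupied worker.
    ws = min(ws_list, key=operator.attrgetter("occupancy"))
    if sum(w.occupancy for w in ws_list) == 0:
        # Quiet cluster: every occupancy is zero, so `min` would keep
        # returning the same worker. Rotate through the pool instead.
        ws = ws_list[n_tasks % len(ws_list)]
    return ws
```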
force-pushed from 56d13f7 to 5635353
|
How are things going here @douglasdavis ? I would be happy to meet up and debrief if you prefer. Sometimes it's nice to talk through these things with someone. |
|
Sounds like a good idea! I pushed a commit yesterday inspired by your idea to zoom to the nth task and search from there; I think it would be useful to chat about it. |
jakirkham
left a comment
Included some comments regarding typing things in the Scheduler for Cythonization. Maybe not the most critical thing at the moment given the code is still being worked out, but I wanted to share this perspective as well. Hopefully this shows how one can approach this when appropriate.
Many thanks for pointing that out @jakirkham - I committed the typing suggestions to make sure to stick with them for the rest of the PR. |
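For reference, a small illustration of the annotation style being suggested. This is a hedged sketch assuming Cython's pure-Python mode (`import cython`), not the scheduler's actual compiled/uncompiled shim; the `count_idle` function and its arguments are made up for the example:

```python
import cython  # assumption: pure-Python-mode Cython annotations


def count_idle(workers: list) -> cython.Py_ssize_t:
    # Typed locals: Cython can lower these to C integers when the module
    # is compiled, while the code still runs as ordinary Python.
    n: cython.Py_ssize_t = 0
    i: cython.Py_ssize_t
    for i in range(len(workers)):
        if workers[i].occupancy == 0:
            n += 1
    return n
```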
use type annotated attribute Co-authored-by: jakirkham <jakirkham@gmail.com>
type annotations Co-authored-by: jakirkham <jakirkham@gmail.com>
type annotations Co-authored-by: jakirkham <jakirkham@gmail.com>
type annotations Co-authored-by: jakirkham <jakirkham@gmail.com>
Co-authored-by: jakirkham <jakirkham@gmail.com>
Co-authored-by: jakirkham <jakirkham@gmail.com>
|
Matthew Rocklin ***@***.***> writes:
@mrocklin commented on this pull request.
----------------------------------------------------------------------
In distributed/scheduler.py:
> @@ -2126,6 +2126,19 @@ def decide_worker(self, ts: TaskState) -> WorkerState:
n_workers: Py_ssize_t = len(worker_pool_dv)
if n_workers < 20: # smart but linear in small case
ws = min(worker_pool.values(), key=operator.attrgetter("occupancy"))
+ if ws._occupancy == 0:
+ # special case to use round-robin; linear search
+ # for next worker with zero occupancy (or just
+ # land back where we started).
+ wp_vals = worker_pool.values()
+ wp_i: WorkerState
+ start: Py_ssize_t = self._n_tasks % n_workers
+ i: Py_ssize_t
+ for i in range(n_workers):
+ wp_i = wp_vals[i + start % n_workers]
⬇️ Suggested change
- wp_i = wp_vals[i + start % n_workers]
+ wp_i = wp_vals[(i + start) % n_workers]
Ah thanks, Matt.
|
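The suggested change above fixes an operator-precedence slip: in Python, `%` binds tighter than `+`, so `i + start % n_workers` groups as `i + (start % n_workers)`. Since `start` is already reduced modulo `n_workers`, the un-parenthesized form is just `i + start`, which can run past the end of the pool. A quick standalone illustration with made-up numbers:

```python
n_workers = 4
start = 3  # already reduced: n_tasks % n_workers
i = 2

# Without parentheses, % applies only to `start`, so this is just i + start
# and can exceed the last valid index of a 4-element pool.
assert i + start % n_workers == 5

# With parentheses, the combined offset wraps correctly around the pool size.
assert (i + start) % n_workers == 1
```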
jakirkham
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are using worker_pool.values() in a few places, maybe it makes sense to construct this object once and reuse it elsewhere? Included a few relevant suggestions below
agreed! |
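A minimal sketch of that suggestion (names are illustrative, not the actual scheduler code): bind the values view to a local once, then reuse that local for both the `min()` scan and the round-robin indexing instead of rebuilding it in each branch.

```python
import operator


def decide_worker_sketch(worker_pool: dict, n_tasks: int):
    # Construct the sequence of workers once and reuse it below.
    wp_vals = list(worker_pool.values())
    n_workers = len(wp_vals)
    ws = min(wp_vals, key=operator.attrgetter("occupancy"))
    if ws.occupancy == 0:
        # Same sequence reused for the round-robin fallback.
        ws = wp_vals[n_tasks % n_workers]
    return ws
```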
jakirkham
left a comment
LGTM. Thanks Doug! 😄
jrbourbeau
left a comment
Thanks @douglasdavis! The test_retries failure here looks like a genuine (non-flaky) test failure. I'll take a look at this (though if you have any thoughts, that would also be welcome : ) )
|
Do we have an issue tracking those flaky tests? Think I'm also seeing these in another PR |
|
The list of current flaky tests is here |
|
Thanks. Do we know why these are affecting all jobs now? |
|
It seems test flakiness has generally increased over the past few weeks (I'll open an issue about this), but it's not at the rate where it would impact all jobs. Can you point me to the PR you're referring to? |
|
By all jobs I mean all jobs in a PR. For example all of the jobs for this PR are failing |
|
Gotcha -- it looks like there are genuine, non-flaky test failures both in this PR and over in #4650 (see #4650 (review)) |
|
@douglasdavis it looks like ... I recommend bumping up the retry counts:

diff --git a/distributed/tests/test_client_executor.py b/distributed/tests/test_client_executor.py
index 555bcb86..3942a2ca 100644
--- a/distributed/tests/test_client_executor.py
+++ b/distributed/tests/test_client_executor.py
@@ -210,26 +210,24 @@ def test_unsupported_arguments(client, s, a, b):
 def test_retries(client):
     args = [ZeroDivisionError("one"), ZeroDivisionError("two"), 42]

-    with client.get_executor(retries=3, pure=False) as e:
+    with client.get_executor(retries=5, pure=False) as e:
         future = e.submit(varying(args))
         assert future.result() == 42

-    with client.get_executor(retries=2) as e:
+    with client.get_executor(retries=4) as e:
         future = e.submit(varying(args))
         result = future.result()
         assert result == 42

-    with client.get_executor(retries=1) as e:
+    with client.get_executor(retries=2) as e:
         future = e.submit(varying(args))
-        with pytest.raises(ZeroDivisionError) as exc_info:
+        with pytest.raises(ZeroDivisionError, match="two"):
             res = future.result()
-        exc_info.match("two")

     with client.get_executor(retries=0) as e:
         future = e.submit(varying(args))
-        with pytest.raises(ZeroDivisionError) as exc_info:
+        with pytest.raises(ZeroDivisionError, match="one"):
             res = future.result()
-        exc_info.match("one")

which should resolve the issue. Alternatively, one could try to use ... |
|
thanks @jrbourbeau! I'm still getting a test failure locally; not exactly sure what's going on. I went ahead and committed your suggested changes to see how the full CI suite goes. More details about the local failure I'm seeing are below. This import error is popping up: there is indeed no dumps_msgpack in distributed/protocol/core.py.

Full report from `pytest distributed/tests -k test_retries`:

(dask-dev) ddavis@strange ~/software/repos/distributed (gh4637) $ pytest distributed/tests -k test_retries -vv
platform linux -- Python 3.9.4, pytest-6.2.3, py-1.10.0, pluggy-0.13.1 -- /home/ddavis/.pyenv/versions/3.9.4/envs/dask-dev/bin/python3.9
rootdir: /home/ddavis/software/repos/distributed, configfile: setup.cfg
plugins: flaky-3.7.0, asyncio-0.14.0
collected 1192 items / 1188 deselected / 4 selected

distributed/tests/test_client.py::test_retries_get FAILED [ 25%]

The same ImportError appears throughout the output for the selected tests:

E   ImportError: cannot import name 'dumps_msgpack' from 'distributed.protocol.core' (/home/ddavis/software/repos/distributed/distributed/protocol/core.py)
../../../.pyenv/versions/3.9.4/envs/dask-dev/lib/python3.9/site-packages/dask/highlevelgraph.py:946: ImportError |
|
May need to merge with latest ... Edit: The other possibility is an older version of ... |
|
Thanks @jakirkham, a fresh env did the trick. Not 100% sure, but I think I had mixed ... |
|
Yeah happens to me periodically as well 🙂 |
jrbourbeau
left a comment
Thanks @douglasdavis! This is in