Refactor ensure_communicating #6165
Conversation
distributed/worker.py
Outdated
    ts = self.tasks[k]
    recommendations[ts] = tuple(msg.values())
    - raise
    + return GatherDepDoneEvent(stimulus_id=stimulus_id)
The exception is now shielded from @log_errors. Previously it was double-reported, as it is already logged by logger.exception(e) on line 3117.
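To make the double-reporting concrete, here is a minimal, self-contained sketch. `log_errors` below is a simplified stand-in for the decorator in distributed.utils, and gather_dep_sketch and its body are invented for illustration; only the control flow mirrors the change.

```python
import logging
from functools import wraps

logger = logging.getLogger(__name__)


def log_errors(func):
    """Simplified stand-in for distributed's @log_errors decorator."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            logger.exception(e)  # second report, if the wrapped function re-raises
            raise
    return wrapper


@log_errors
def gather_dep_sketch(stimulus_id: str) -> str:
    try:
        raise OSError("comm failure")
    except OSError as e:
        logger.exception(e)  # first report
        # Old behaviour: re-raise here, so @log_errors logged the same error again.
        # New behaviour: swallow the exception and hand control back to the state
        # machine via an event (GatherDepDoneEvent in the PR).
        return f"GatherDepDoneEvent(stimulus_id={stimulus_id!r})"
```

With the return, the traceback is logged exactly once and the decorator never sees the exception.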
distributed/worker.py
Outdated
    self.periodic_callbacks[
        "find-missing"
    ].callback_time = self.periodic_callbacks["heartbeat"].callback_time
    self.ensure_communicating()
This is no longer necessary, as _ensure_communicating now runs as part of all transitions to fetch.
@fjetter you may give it a look now or wait until after I've fixed the regressions.
Current status: 8 tests failing. Pending investigation.

- FAILED distributed/dashboard/tests/test_worker_bokeh.py::test_basic - This is because
- FAILED distributed/diagnostics/tests/test_eventstream.py::test_eventstream
- FAILED distributed/tests/test_scheduler.py::test_decide_worker_coschedule_order_neighbors[nthreads1-1]
- FAILED distributed/tests/test_worker.py::test_gather_many_small - This looks like a genuine regression in _select_keys_for_gather - concurrent fetches are being limited to 1 key per worker
- FAILED distributed/tests/test_worker.py::test_acquire_replicas_already_in_flight
- FAILED distributed/tests/test_worker.py::test_missing_released_zombie_tasks_2
- FAILED distributed/tests/test_worker.py::test_gather_dep_cancelled_rescheduled
- FAILED distributed/tests/test_worker.py::test_gather_dep_do_not_handle_response_of_not_requested_tasks
Force-pushed from df6f189 to 738f7c6
Force-pushed from 361b734 to 7569dd8
distributed/worker.py
Outdated
    # compute-task or acquire-replicas command from the scheduler, it allows
    # clustering the transfers into less GatherDep instructions; see
    # _select_keys_for_gather().
    return {}, [EnsureCommunicatingLater(stimulus_id=stimulus_id)]
The alternative to this was to delete _select_keys_for_gather and either

- add logic to _handle_instructions to squash individual GatherDep instructions on the same worker, or
- implement no grouping whatsoever and just rely on rpc pooling (needs performance testing).

Both systems would remove the out-of-priority fetch from workers and imply revisiting the limits for concurrent connections.
Either way it would be a major functional change, so I opted for this somewhat dirty hack instead, which is functionally identical to main.
    return [ev for ev in self.stimulus_log if getattr(ev, "key", None) in keys]

    - def ensure_communicating(self) -> None:
    + def _ensure_communicating(self, *, stimulus_id: str) -> RecsInstrs:
Fetching the stimulus_id from outside means that gather_dep commands will now carry the stimulus_id of the event that triggered them (see the sketch after this list), e.g.
- compute-task
- acquire-replicas
- find_missing
- GatherDepDoneEvent
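A trimmed-down sketch of that flow. The GatherDep fields and the body of _ensure_communicating are invented for illustration (the real method also honours connection and message-size limits); the point is only that the caller's stimulus_id ends up on every emitted instruction.

```python
from dataclasses import dataclass


@dataclass
class GatherDep:
    # reduced version of the PR's instruction, for illustration only
    worker: str
    to_gather: set[str]
    stimulus_id: str  # id of the event that triggered the fetch


def _ensure_communicating(state, *, stimulus_id: str) -> tuple[dict, list[GatherDep]]:
    """Invented body: emit one GatherDep per eligible peer, tagged with the
    caller's stimulus_id (compute-task, acquire-replicas, find_missing, ...)."""
    instructions: list[GatherDep] = []
    for worker, keys in state.data_needed_per_worker.items():
        if not keys or worker in state.in_flight_workers:
            continue  # already fetching from this worker
        instructions.append(
            GatherDep(worker=worker, to_gather=set(keys), stimulus_id=stimulus_id)
        )
    return {}, instructions
```

This makes a fetch triggered by find_missing distinguishable in the stimulus log from one triggered by compute-task.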
    cancelled_keys: set[str] = set()

    def done_event():
        return GatherDepDoneEvent(stimulus_id=f"gather-dep-done-{time()}")
Temp hack, to be removed when refactoring gather_dep
    - self.ensure_communicating()
    + self.handle_stimulus(
    +     GatherDepDoneEvent(stimulus_id=f"readd-busy-worker-{time()}")
    + )
I'll change this method in a later PR to an async instruction
Pedantic, but it feels weird to issue a GatherDepDoneEvent when that isn't actually what happened. Something like

    class BusyWorkerReAddedEvent(GatherDepDoneEvent):
        pass

might make it clearer that they're different things, just happen to be handled in the same way (for now).
But if GatherDepDoneEvent is itself a temporary hack and will be removed soon, then this isn't important.
All tests pass! This is ready for review and merge 🥳
gjoseph92 left a comment
Some naming and design questions, but overall this seems good.
distributed/worker.py
Outdated
    stimulus_id=inst.stimulus_id
    )
    self.transitions(recs, stimulus_id=inst.stimulus_id)
    self._handle_instructions(instructions)
Why do we recurse into _handle_instructions here, instead of adding the new instructions onto the end of the current instructions list (in a safe way)? I'm wondering why the new instructions are treated as "higher priority" than the current ones.
I overhauled the method, please have another look
distributed/worker_state_machine.py
Outdated
    @dataclass
    class EnsureCommunicatingLater(Instruction):
I find the "later" part of EnsureCommunicatingLater a little confusing. EnsureCommunicatingOnce? EnsureCommunicatingIdempotent?
AFAIU the point of doing this as an instruction (instead of calling _ensure_communicating directly in many places) is to allow batching of multiple EnsureCommunicating instructions into one, via special logic in _handle_instructions.
it's not just a matter of doing it once; it must happen after all recommendations to transition to fetch have been enacted.
Renamed to EnsureCommunicatingAfterTransitions.
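A self-contained guess at the shape of that special-casing; the real _handle_instructions is a Worker method and also runs the other instruction types, so execute and ensure_communicating_fn below are stand-ins. What it shows is the deferral: the sentinel is remembered during the pass and only acted on once every other instruction (and therefore every transition to fetch from this batch) has been processed.

```python
from dataclasses import dataclass


@dataclass
class EnsureCommunicatingAfterTransitions:
    stimulus_id: str


def handle_instructions(instructions, execute, ensure_communicating_fn):
    """Toy version of Worker._handle_instructions, for illustration only."""
    instructions = list(instructions)
    while instructions:
        deferred = None
        for inst in instructions:
            if isinstance(inst, EnsureCommunicatingAfterTransitions):
                # Defer: every recommendation that moves a task to `fetch` in
                # this batch must be enacted before data_needed is inspected,
                # so one pass can group all queued keys.
                deferred = inst
            else:
                execute(inst)  # GatherDep, messages to the scheduler, ...
        instructions = []
        if deferred is not None:
            # The instruction (not a plain flag) is kept because its stimulus_id
            # must tag the GatherDep instructions generated by this call.
            _recs, instructions = ensure_communicating_fn(
                stimulus_id=deferred.stimulus_id
            )
```

Several EnsureCommunicatingAfterTransitions in the same batch therefore collapse into a single _ensure_communicating pass.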
distributed/worker.py
Outdated
    stimulus_id=stimulus_id,
    # Note: given n tasks that must be fetched from the same worker, this method
    # may generate anywhere between 1 and n GatherDep instructions, as multiple
    # tasks may be clustered in the same instruction by _select_keys_for_gather
Suggested change:

    - # tasks may be clustered in the same instruction by _select_keys_for_gather
    + # tasks may be clustered in the same instruction by _select_keys_for_gather.
    + # The number will be greater than 1 when the tasks don't all fit in `target_message_size`.
This just took me a few reads to make sense of
It was incorrect to begin with - you'll never have more than one GatherDep from the same worker within the same iteration of ensure_communicating. I rewrote it.
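For context, a simplified take on the selection being described; the signature and data structures are invented. It shows why a single pass produces at most one GatherDep per worker: the keys wanted from that worker are drained into one instruction until a byte budget (target_message_size) is reached, and the remainder waits for a later pass.

```python
def select_keys_for_gather(
    queued: list[tuple[str, int]],  # (key, nbytes) wanted from one worker, priority-ordered
    target_message_size: int,
) -> tuple[set[str], int]:
    """Pick the keys for a single GatherDep, stopping at the byte budget."""
    to_gather: set[str] = set()
    total_nbytes = 0
    while queued:
        key, nbytes = queued[0]
        if to_gather and total_nbytes + nbytes > target_message_size:
            break  # leftovers are fetched by a later pass, not a second GatherDep now
        queued.pop(0)
        to_gather.add(key)
        total_nbytes += nbytes
    return to_gather, total_nbytes
```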
distributed/worker_state_machine.py
Outdated
    @dataclass
    class EnsureCommunicatingLater(Instruction):
I also find it a little odd that unlike other instructions, EnsureCommunicatingLater doesn't contain any data. It's relying on the side effect of _add_to_data_needed having already mutated data_needed and data_needed_per_worker, but the instruction itself is pointless without that side effect having occurred. I can't think of a cleaner design that avoids this and still has batching though.
I agree, but the whole instruction is a hack to begin with
gjoseph92 left a comment
The renaming and _handle_instructions refactor helped, thank you. Just some comment-rewording and type annotations for clarity.
    )

    self.comm_nbytes += total_nbytes
    self.in_flight_workers[worker] = to_gather
Suggested change:

    - self.in_flight_workers[worker] = to_gather
    + assert worker not in self.in_flight_workers, self.in_flight_workers[worker]
    + self.in_flight_workers[worker] = to_gather
Are we guaranteed that in_flight_workers[worker] is not already set? Because we'd be overwriting it if it is.
EDIT: I think we are because of the `if w not in self.in_flight_workers` above. Still might be nice to validate though? If this was not the case, it could probably cause a deadlock.
Yes, it's impossible for it to be there already due to the line you mentioned just above. I think validation would be overkill since the check is directly above.
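A condensed, self-contained illustration of that invariant; the worker addresses and dict contents are made up. Workers already present in in_flight_workers are skipped before the assignment, so the suggested assertion could never fire.

```python
in_flight_workers: dict[str, set[str]] = {"tcp://a:1234": {"x"}}
data_needed_per_worker: dict[str, set[str]] = {
    "tcp://a:1234": {"y"},  # already being fetched from: skipped below
    "tcp://b:1234": {"z"},
}

for worker, keys in data_needed_per_worker.items():
    if worker in in_flight_workers:  # the `if w not in self.in_flight_workers` guard
        continue
    to_gather = set(keys)
    assert worker not in in_flight_workers  # the validation suggested above
    in_flight_workers[worker] = to_gather

print(in_flight_workers)  # {'tcp://a:1234': {'x'}, 'tcp://b:1234': {'z'}}
```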
    # 1. there are many fetches queued because all workers are in flight
    # 2. a single compute-task or acquire-replicas command just sent
    #    many dependencies to fetch at once.
    ensure_communicating = inst
Is it even necessary to store the instruction right now (since it's just a sentinel), or could this just be a bool?
we need the stimulus_id
Partially closes #5896
In scope for this PR
- `ensure_communicating() -> None` to `_ensure_communicating() -> RecsInstrs`
- `self.loop.add_callback(self.gather_dep, ...)`
- `ensure_communicating` is no longer called periodically "just in case" - neither from `every_cycle` nor from `find_missing`

Out of scope for this PR, but in scope for #5896

- `gather_dep`
- `GatherDepDoneEvent` (introduced in this PR)
- `_readd_busy_worker` as an async instruction

Out of scope for #5896

- `find_missing`
- `EnsureCommunicatingLater` (introduced in this PR) and `_select_keys_for_gather`