AMM ReduceReplicas to iterate only on replicated tasks #5297

crusaderky · 2021-09-08T13:54:20Z

Encapsulate changes to ts._who_has / ws._has_what.
Track in-memory tasks with more than one replica apart from all the others.
Change the AMM ReduceReplicas policy to iterate only on this subset, which in the vast majority of cases should be much smaller than the set of all tasks in the cluster.
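
A minimal sketch of the bookkeeping described above, assuming simplified stand-ins for TaskState, WorkerState and SchedulerState. The attribute names drop the leading underscores, and add_replica is a name assumed here for illustration; this is not the code from distributed/scheduler.py. It shows how routing every change to who_has / has_what through a few methods keeps a separate set of tasks with more than one replica up to date.

```python
class TaskState:
    def __init__(self, key, nbytes=0):
        self.key = key
        self.nbytes = nbytes
        self.who_has = set()  # workers holding a replica of this task


class WorkerState:
    def __init__(self, address):
        self.address = address
        self.nbytes = 0
        self.has_what = {}  # tasks held in memory (dict used as ordered set)


class SchedulerState:
    def __init__(self):
        # Only tasks that currently have two or more replicas
        self.replicated_tasks = set()

    def add_replica(self, ts, ws):
        """Note that a worker holds a replica of a task."""
        if ws in ts.who_has:
            return
        ws.nbytes += ts.nbytes
        ws.has_what[ts] = None
        ts.who_has.add(ws)
        if len(ts.who_has) == 2:
            self.replicated_tasks.add(ts)

    def remove_replica(self, ts, ws):
        """Note that a worker no longer holds a replica of a task."""
        ws.nbytes -= ts.nbytes
        del ws.has_what[ts]  # KeyError if the worker did not hold it
        ts.who_has.remove(ws)  # KeyError if the mapping was inconsistent
        if len(ts.who_has) == 1:
            self.replicated_tasks.remove(ts)

    def remove_all_replicas(self, ts):
        """Forget every replica of a task."""
        for ws in ts.who_has:
            ws.nbytes -= ts.nbytes
            del ws.has_what[ts]
        if len(ts.who_has) > 1:
            self.replicated_tasks.remove(ts)
        ts.who_has.clear()
```

A policy such as ReduceReplicas could then loop over replicated_tasks instead of every task known to the scheduler.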

CC @mrocklin @fjetter @jrbourbeau @gjoseph92

crusaderky · 2021-09-14T09:12:42Z

ping @fjetter @gjoseph92 - this has potential performance implications on the whole scheduler, so I'd very much like your review.

fjetter

LGTM. I don't have any performance concerns

fjetter · 2021-09-14T11:27:56Z

distributed/scheduler.py

+    def stimulus_missing_data(self, key=None):
+        """Mark that certain keys have gone missing. Recover."""
        parent: SchedulerState = cast(SchedulerState, self)
        with log_errors():
-            logger.debug("Stimulus missing data %s, %s", key, worker)
-
-            recommendations: dict = {}
-            client_msgs: dict = {}
-            worker_msgs: dict = {}
-
+            logger.debug("Stimulus missing data: %r", key)
            ts: TaskState = parent._tasks.get(key)
            if ts is None or ts._state == "memory":
-                return recommendations, client_msgs, worker_msgs
-            cts: TaskState = parent._tasks.get(cause)
-
-            if cts is not None and cts._state == "memory":  # couldn't find this
-                ws: WorkerState
-                cts_nbytes: Py_ssize_t = cts.get_nbytes()
-                for ws in cts._who_has:  # TODO: this behavior is extreme
-                    del ws._has_what[ts]
-                    ws._nbytes -= cts_nbytes
-                cts._who_has.clear()
-                recommendations[cause] = "released"
-
-            if key:
-                recommendations[key] = "released"
-
-            parent._transitions(recommendations, client_msgs, worker_msgs)
-            recommendations = {}
-
-            if parent._validate:
-                assert cause not in self.who_has
-
-            return recommendations, client_msgs, worker_msgs
+                return {}, {}, {}
+            return {key: "released"}, {}, {}


I found myself more than once in a situation where I wanted to delete this stimulus handler since I believe most of it is dead code. However, I don't see the connection to the current PR. Is this somehow related? Can this change be split off?

Actually I can't. If you read the old code, it does something clearly wrong - on lines 4783:4787, it deletes ts from ws.has_what but it removes cts.nbytes from ws.nbytes; it should have removed ts.nbytes instead. With nothing invoking this code path, I had no idea what the original intent of the logic was, so I opted for cutting the method down to just what is actually invoked by handle_release_data.
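
To illustrate the kind of mismatch described above, here is a toy sketch with plain dicts. The helper names and the sizes are hypothetical and this is not the scheduler code; it only shows that removing one key while subtracting another key's size makes the byte counter drift away from what the worker actually holds.

```python
def drop_key(has_what, nbytes, sizes, removed_key, charged_key):
    """Remove removed_key from has_what, but subtract the size of charged_key."""
    remaining = dict(has_what)
    del remaining[removed_key]
    return remaining, nbytes - sizes[charged_key]


def actual_bytes(has_what, sizes):
    """Bytes really held, recomputed from scratch."""
    return sum(sizes[key] for key in has_what)


sizes = {"ts": 100, "cts": 40}  # hypothetical task sizes
has_what = {"ts": None, "cts": None}
nbytes = actual_bytes(has_what, sizes)

# Mismatched accounting: delete "ts" but subtract the size of "cts"
buggy_has_what, buggy_nbytes = drop_key(has_what, nbytes, sizes, "ts", "cts")

# Consistent accounting: delete "ts" and subtract the size of "ts"
good_has_what, good_nbytes = drop_key(has_what, nbytes, sizes, "ts", "ts")
```

In the mismatched case the counter no longer equals the recomputed total, while in the consistent case it does.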

If you prefer, I can remove the stimulus_missing_data method altogether and inline the surviving code into handle_release_data.

Maybe that's reasonable, since it's not really a stimulus function anymore either and it's only a few lines.

I'm fine with moving the code to handle_release_data

Done; please double-check that the new code is functionally identical to the previous one.

fjetter · 2021-09-14T11:35:26Z

distributed/scheduler.py

-                        ws._nbytes += ts_nbytes
-                        ws._has_what[ts] = None
-                        ts._who_has.add(ws)
+                    if ws not in ts._who_has:


I noticed changes like these in a few different places. You replaced a ts not in ws._has_what with a ws not in ts._who_has. Assuming the state machine is not corrupt, these are equivalent statements. What motivated this change?

Pure aesthetics and consistency. Nothing beyond that.
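
For readers wondering why the two membership tests are interchangeable: they are only equivalent while the two mappings mirror each other. A rough sketch of that invariant, using plain dicts of sets keyed by strings (hypothetical stand-ins, not the scheduler's own validation code):

```python
def is_consistent(who_has, has_what):
    """Return True if who_has (task -> workers) mirrors has_what (worker -> tasks)."""
    forward = {(key, worker) for key, workers in who_has.items() for worker in workers}
    backward = {(key, worker) for worker, keys in has_what.items() for key in keys}
    return forward == backward


who_has = {"x": {"w1", "w2"}, "y": {"w1"}}
has_what = {"w1": {"x", "y"}, "w2": {"x"}}
ok = is_consistent(who_has, has_what)

# Corrupt one side only: the two membership tests now disagree for ("x", "w2")
has_what["w2"].discard("x")
broken = is_consistent(who_has, has_what)
```

As long as the check passes, testing membership on either side gives the same answer.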

gjoseph92

LGTM as well. Nice to have this consolidated into a function anyway.

gjoseph92 · 2021-09-14T19:01:41Z

distributed/scheduler.py

+
+    @ccall
+    def remove_replica(self, ts: TaskState, ws: WorkerState):
+        """Note that a worker no longer holds a replica of a task"""


Nit: maybe add an if self._validate case to this and to remove_all_replicas to catch any issues in future tests?

Both remove_replica and remove_all_replicas already raise KeyError. Replaced the discard() with remove() for extra safety.
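
For context on the discard() versus remove() point, a small standalone illustration with a built-in set (the worker names are made up): discard() silently ignores a missing element, whereas remove() raises KeyError, so an inconsistent state surfaces immediately.

```python
who_has = {"worker-a"}

# discard() is a no-op when the element is absent
who_has.discard("worker-b")

# remove() raises KeyError when the element is absent
raised = False
try:
    who_has.remove("worker-b")
except KeyError:
    raised = True
```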

gjoseph92 · 2021-09-14T19:08:19Z

distributed/scheduler.py

Maybe that's reasonable, since it's not really a stimulus function anymore either and it's only a few lines.

gjoseph92 · 2021-09-14T19:15:54Z

distributed/scheduler.py

+            assert ws not in ts._who_has
+            assert ts not in ws._has_what
+
+        ws._nbytes += ts.get_nbytes()


Perhaps some mysterious magic type annotations are necessary to get this function to Cythonize well? I have no idea.

Added explicit declaration of nbytes.
I also noticed that replacing

    del ws._has_what[ts]
    ts._who_has.remove(ws)
    if len(ts._who_has) == 1:
        self._replicated_tasks.remove(ts)

with

    wh: set = ts._who_has
    hw: dict = ws._has_what
    del hw[ts]
    wh.remove(ws)
    if len(wh) == 1:
        rt: set = self._replicated_tasks
        rt.remove(ts)

marginally improves the Cython code, at the cost of greatly reduced readability, so I'd rather not do it.

crusaderky · 2021-09-15T13:40:37Z

All code review critiques have been incorporated. The new stimulus_task_erred and handle_release_data are supposed to be functionally identical to before but please double check.

fjetter

I reviewed the changes around handle_missing_data / stimulus_missing_data again and agree that they should be identical. The missing-data system evolved quite dramatically over time, and I also discovered several places on the worker side where the code no longer makes a lot of sense. I believe this is genuinely dead code.

Test failures appear to be unrelated as well

AMM ReduceReplicas to iterate only on replicate tasks

eecaf1a

crusaderky requested a review from fjetter September 8, 2021 13:55

fjetter reviewed Sep 14, 2021

gjoseph92 reviewed Sep 14, 2021

Code review

8306013

crusaderky force-pushed the replicated_tasks branch from 8f82a93 to 8306013 Compare September 15, 2021 13:33

Merge branch 'main' into replicated_tasks

96e646e

fjetter approved these changes Sep 16, 2021

fjetter merged commit 54760d8 into dask:main Sep 16, 2021

crusaderky deleted the replicated_tasks branch September 16, 2021 17:22

This was referenced Sep 21, 2021

Revert AMM ReduceReplicas and parallel AMMs updates #5335

Merged

Release 2021.09.1 dask/community#182

Closed

fjetter mentioned this pull request Sep 22, 2021

Revert "Revert AMM ReduceReplicas and parallel AMMs updates (#5335)" #5337

Closed

crusaderky mentioned this pull request Sep 22, 2021

Reinstate: Enhance AMM docstrings #5340

Merged

crusaderky added a commit to crusaderky/distributed that referenced this pull request Sep 22, 2021

AMM ReduceReplicas to iterate only on replicated tasks (dask#5297)

5431f27

This was referenced Sep 22, 2021

Reinstate: AMM ReduceReplicas to iterate only on replicated tasks #5341

Merged

Reinstate: Run multiple AMMs in parallel #5339

Merged

crusaderky added the memory label Mar 25, 2022
