
Conversation


github-actions bot commented Jul 12, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       19 files  −1          19 suites  −1      11h 14m 52s ⏱️ −44m 0s
 3 763 tests  +9        3 650 ✔️ +7         106 💤 ±0      6 ❌ +1    1 🔥 +1
35 034 runs  −1 280    33 312 ✔️ −1 249   1 712 💤 −36     9 ❌ +4    1 🔥 +1

For more details on these failures and errors, see this check.

Results for commit 41f4f8e. ± Comparison against base commit 9255987.

This pull request removes 1 test and adds 10 tests. Note that renamed tests count towards both.

Removed:
distributed.tests.test_worker ‑ test_gather_missing_workers_replicated[True]

Added:
distributed.tests.test_client ‑ test_gather_race_vs_AMM[False]
distributed.tests.test_client ‑ test_gather_race_vs_AMM[True]
distributed.tests.test_utils_comm ‑ test_gather_from_workers_busy
distributed.tests.test_utils_comm ‑ test_gather_from_workers_missing_replicas
distributed.tests.test_utils_comm ‑ test_gather_from_workers_serialization_error[pickle]
distributed.tests.test_utils_comm ‑ test_gather_from_workers_serialization_error[unpickle]
distributed.tests.test_worker ‑ test_gather_missing_workers_replicated[True0]
distributed.tests.test_worker ‑ test_gather_missing_workers_replicated[True1]
distributed.tests.test_worker ‑ test_gather_missing_workers_replicated[True2]
distributed.tests.test_worker ‑ test_gather_missing_workers_replicated[True3]

♻️ This comment has been updated with latest results.

@crusaderky crusaderky force-pushed the gather branch 4 times, most recently from 5869dfd to b0c9f6e Compare July 18, 2023 16:43
Comment on lines 2327 to 2331
response = await retry_operation(
    self.scheduler.gather, keys=missing_keys
)
if response["status"] == "OK":
    response["data"].update(data)
@crusaderky (Collaborator, Author) — Jul 19, 2023:

This PR does not change the previous behaviour. See

fjetter (Member):

This PR introduces significant changes to how we gather data. I strongly suggest separating aesthetic refactoring from functional changes, especially since this diff covers both logical changes (who_has) and aesthetic refactorings (if/else). This makes reviewing much harder than it should be.

crusaderky (Collaborator, Author):

I cannot see any aesthetic refactoring in this PR that could be moved to a separate PR while preserving functionality.

@crusaderky (Collaborator, Author) commented Jul 19, 2023:

This is ready for review

@crusaderky crusaderky marked this pull request as ready for review July 19, 2023 13:49
@crusaderky crusaderky force-pushed the gather branch 4 times, most recently from 90bbdab to fd83b33 Compare July 19, 2023 21:02
@hendrikmakait hendrikmakait self-requested a review July 20, 2023 07:18
Comment on lines 2330 to 2331
if response["status"] == "OK":
    response["data"].update(data)
Member:

While this isn't new, I think it would be good to add a test to ensure that the refactoring works as expected.

crusaderky (Collaborator, Author):

I would like to get rid of the whole functionality in #7993, so it would be throwaway work.

Comment on lines 141 to 143
for key in d[address]:
    missing_keys.add(key)
    del to_gather[key]
Member:

Potentially only one of these keys caused the exception. Would it make sense to attempt gathering them individually, to keep the number of missing keys to a minimum?

We may also consider returning these as erring keys rather than missing ones. However, that is out of scope.

crusaderky (Collaborator, Author):

Changed the code to mark the keys as erring, meaning that gather() won't try again.
However, note that there is no memory->error transition in the scheduler; this is the same problem as with gather_dep (#6705).

Yes, we could try to fetch the keys one by one, but it feels like a substantial complication and ultimately overkill for what should, in theory, be a rare-ish problem that is easily weeded out during development.
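
To make the missing-vs-erring distinction concrete, here is a rough, self-contained sketch; the function, the `get_data` callback, and the return structure are invented for illustration and are not taken from this PR:

```python
from __future__ import annotations

from collections.abc import Awaitable, Callable
from typing import Any


async def fetch_from_worker(
    address: str,
    keys: list[str],
    get_data: Callable[[str, list[str]], Awaitable[dict[str, Any]]],
) -> tuple[dict[str, Any], set[str], set[str]]:
    """Fetch ``keys`` from one worker and classify failures.

    Returns (data, missing_keys, failed_keys):
    - missing_keys: network trouble; the keys may still be available on
      another replica, so the caller can retry or report them to the scheduler
    - failed_keys: a (de)serialization error; retrying won't help, so
      gather() gives up on these ("erring" keys)
    """
    data: dict[str, Any] = {}
    missing: set[str] = set()
    failed: set[str] = set()
    try:
        data.update(await get_data(address, keys))
    except OSError:
        missing.update(keys)
    except Exception:
        failed.update(keys)
    return data, missing, failed
```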

@fjetter (Member) left a comment:

There is a lot happening in this PR. Can we break this up a bit?


Comment on lines 34 to 37
who_has: Callable[
    [list[str]],
    Mapping[str, Collection[str]] | Awaitable[Mapping[str, Collection[str]]],
],
fjetter (Member):

I'm not convinced this callback structure is the right approach here. I find this makes the entire mechanism even harder to reason about. Besides, this bypasses (for better or worse) missing/update_who_has mechanics on the worker.
I believe this kind of logic should be handled a layer further up the stack instead of throwing this all into this function.
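
Not code from this PR, but for illustration: a callback typed like the `who_has` parameter above may return either a plain mapping or an awaitable, so a consumer has to normalise the result, roughly like this:

```python
from __future__ import annotations

import inspect
from collections.abc import Awaitable, Callable, Collection, Mapping


async def resolve_who_has(
    who_has: Callable[
        [list[str]],
        Mapping[str, Collection[str]] | Awaitable[Mapping[str, Collection[str]]],
    ],
    keys: list[str],
) -> Mapping[str, Collection[str]]:
    """Call a who_has callback that may be either synchronous or asynchronous."""
    result = who_has(keys)
    if inspect.isawaitable(result):
        result = await result
    return result
```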

crusaderky (Collaborator, Author):

The method on the worker exclusively serves replicate, which we should plan to replace with AMM anyway. The original method bypasses the whole worker missing/update_who_has system, and I chose not to change that logic since it would be very complicated and ultimately throwaway.

Comment on lines 2873 to 2878
with captured_logger("distributed.scheduler") as sched_logger:
    with captured_logger("distributed.client") as client_logger:
        assert await c.gather(x, direct=False) == 2

assert s.tasks[fin.key].who_has == {s.workers[b.address]}
assert a.state.executed_count == 2
assert b.state.executed_count >= 1
# ^ leave room for a future switch from `remove_worker` to `retire_workers`
assert sched_logger.getvalue() == "Couldn't gather keys: {'x': 'memory'}\n" * 3
assert "Couldn't gather 1 keys, rescheduling" in client_logger.getvalue()
fjetter (Member):

Putting aside for a moment what changes you are proposing here: just reading this test, I believe this behaviour is wrong.

This test is fetching data via the scheduler but is running into connection failures. However, the Worker is still alive (otherwise the BatchedSend would have broken and the Worker would have been removed). Despite knowing that the Worker is alive and merely struggling to connect, it reschedules the key? This feels like an unwarranted escalation.

crusaderky (Collaborator, Author):

This is the previous behaviour; I did not touch or investigate it.
More specifically, I exclusively amended Client._gather_remote and didn't go into the several layers' worth of wrappers around it.
Happy to look into them, but that should be left to a later PR.

@crusaderky (Collaborator, Author):

As discussed online and offline:

@crusaderky (Collaborator, Author):

@hendrikmakait @fjetter
This is in theory complete and ready for a second round of review.
However, I'm seeing a lot of instability in CI.

These failures I've never seen before; I'll investigate:

  • 1 out of 10 runs failed: test_TaskStreamPlugin (distributed.diagnostics.tests.test_task_stream)
  • 1 out of 10 runs failed: test_gather_after_failed_worker (distributed.tests.test_failed_workers)
  • 1 out of 10 runs failed: test_dont_steal_fast_tasks_compute_time (distributed.tests.test_steal)

These are known offenders but potentially impacted by this PR; I'll investigate:

  • 1 out of 10 runs failed: test_gather_then_submit_after_failed_workers (distributed.tests.test_failed_workers)
  • 1 out of 10 runs failed: test_tell_workers_when_peers_have_left (distributed.tests.test_scheduler)
  • 2 out of 10 runs failed: test_chaos_rechunk (distributed.tests.test_stress)

These are known offenders which shouldn't be impacted:

  • 1 out of 8 runs failed: test_closed_input_only_worker_during_transfer (distributed.shuffle.tests.test_shuffle)
  • 1 out of 8 runs failed: test_closed_worker_during_transfer (distributed.shuffle.tests.test_shuffle)
  • 1 out of 8 runs failed: test_crashed_worker_during_transfer (distributed.shuffle.tests.test_shuffle)
  • 2 out of 8 runs failed: test_restarting_during_transfer_raises_killed_worker (distributed.shuffle.tests.test_shuffle)
  • 1 out of 10 runs failed: test_file_descriptors_dont_leak[Nanny] (distributed.tests.test_client)

This reverts commit 3a67188.
@crusaderky (Collaborator, Author):

I've run all tests above 10 times on each CI environment, on main and on this PR, and counted the number of failures, and I can see no regression; this is ready for final review and merge.

On a side note, this was an extremely time-consuming activity and we should urgently pen in the time to fix CI.

| test | Failures in main | Failures in this PR |
| --- | ---: | ---: |
| diagnostics/tests/test_task_stream.py::test_TaskStreamPlugin | 1 | 0 |
| shuffle/tests/test_shuffle.py::test_clean_after_close | 1 | 0 |
| shuffle/tests/test_shuffle.py::test_closed_input_only_worker_during_transfer | 0 | 1 |
| shuffle/tests/test_shuffle.py::test_closed_worker_during_transfer | 29 | 35 |
| shuffle/tests/test_shuffle.py::test_crashed_worker_during_transfer | 6 | 1 |
| shuffle/tests/test_shuffle.py::test_restarting_during_transfer_raises_killed_worker | 38 | 33 |
| tests/test_client.py::test_file_descriptors_dont_leak | 9 | 10 |
| tests/test_failed_workers.py::test_gather_after_failed_worker | 4 | 2 |
| tests/test_failed_workers.py::test_gather_then_submit_after_failed_workers | 9 | 1 |
| tests/test_scheduler.py::test_tell_workers_when_peers_have_left | 1 | 1 |
| tests/test_stress.py::test_chaos_rechunk | 0 | 1 |

@hendrikmakait hendrikmakait self-requested a review August 7, 2023 16:09
@fjetter (Member) left a comment:

The comment below is not essential. I think it's good practice to clean up our asyncio usage, but considering this hasn't bothered us before, I assume this coroutine is just never cancelled.

Comment on lines -89 to +98
-for worker, c in coroutines.items():
+for address, task in tasks.items():
     try:
-        r = await c
+        r = await task
fjetter (Member):

I know this is the same as before, but the task scheduling pattern here is problematic in the event of cancellation.

If gather_from_workers is cancelled after the tasks were created, only the task we're currently awaiting is cancelled; all others will continue running, and we'll get a "never awaited foo" warning.

I guess this coroutine function is never actually cancelled, which is why we never ran into this...

The correct approach here would be to use asyncio.gather (or, even better but not backwards compatible, asyncio's TaskGroup in Python 3.11+).

I think the changes should be straightforward, something like

results = await asyncio.gather(*tasks.values(), return_exceptions=True)
for addr, res in zip(tasks, results):
    if isinstance(res, OSError):
        ...
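
For reference, a minimal runnable sketch of the TaskGroup alternative mentioned above (Python 3.11+ only; `fetch_one`, `fetch_all`, and the addresses are made up for illustration):

```python
import asyncio


async def fetch_one(address: str) -> str:
    """Stand-in for fetching data from a single worker."""
    await asyncio.sleep(0.01)
    return f"data from {address}"


async def fetch_all(addresses: list[str]) -> dict[str, str]:
    # If fetch_all is cancelled while waiting, the TaskGroup cancels every
    # child task, so nothing keeps running in the background.
    # If a child raises, the remaining children are cancelled and the
    # failures are re-raised as an ExceptionGroup.
    async with asyncio.TaskGroup() as tg:
        tasks = {addr: tg.create_task(fetch_one(addr)) for addr in addresses}
    return {addr: task.result() for addr, task in tasks.items()}


print(asyncio.run(fetch_all(["tcp://10.0.0.1:1234", "tcp://10.0.0.2:1234"])))
```

Unlike sequentially awaiting pre-created tasks, cancelling fetch_all here cancels every child task, which avoids the orphaned-task problem described above.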

crusaderky (Collaborator, Author):

I'll address it in a follow-up

@crusaderky crusaderky merged commit 229a16f into dask:main Aug 9, 2023
@crusaderky crusaderky deleted the gather branch August 9, 2023 14:56

Successfully merging this pull request may close these issues:

  • gather() prints confusing error messages
  • Race condition between gather() and AMM
  • Client/Scheduler gather not robust to busy worker
