Overhaul gather() #7997
Conversation
Unit Test Results
See the test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
19 files - 1, 19 suites - 1, 11h 14m 52s ⏱️ - 44m 0s
For more details on these failures and errors, see this check.
Results for commit 41f4f8e. ± Comparison against base commit 9255987.
This pull request removes 1 test and adds 10 tests. Note that renamed tests count towards both.
This comment has been updated with the latest results.
Force-pushed from 5869dfd to b0c9f6e
distributed/client.py (outdated)
response = await retry_operation(
    self.scheduler.gather, keys=missing_keys
)
if response["status"] == "OK":
    response["data"].update(data)
This PR does not change the previous behaviour. See
This PR introduces significant changes to how we are gathering data. I strongly suggest separating aesthetic refactoring from functional changes, especially considering that this diff covers both logical changes (who_has) and aesthetic refactorings (if/else). This makes reviewing much harder than it should be.
I cannot see any aesthetic refactoring in this PR that could be moved to a separate PR while preserving the functionality.
This is ready for review
Force-pushed from 90bbdab to fd83b33
distributed/client.py (outdated)
if response["status"] == "OK":
    response["data"].update(data)
While this isn't new, I think it would be good to add a test to ensure that the refactoring works as expected.
I would like to get rid of the whole functionality in #7993, so it would be throwaway work.
for key in d[address]:
    missing_keys.add(key)
    del to_gather[key]
Potentially only one of these keys caused the exception. Would it make sense to attempt gathering them individually, to keep the number of missing keys to a minimum?
We may also consider returning these as erring keys, not missing ones. However, this is out of scope.
Changed code to mark keys as erring, meaning that gather() won't try again.
However, note that there is no transition memory->error in the scheduler; this is the same problem as with gather_dep (#6705).
Yes, we could try to fetch the keys one by one, but it feels like a substantial complication and ultimately overkill for what, in theory, should be a rare-ish problem easily weeded out during development?
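For illustration, a minimal sketch of the missing-vs-erring split described above; the names (gather_from_worker_sketch, get_data) and the exception classes used to draw the distinction are illustrative assumptions, not the actual PR code:

from __future__ import annotations


async def gather_from_worker_sketch(
    address: str,
    keys: list[str],
    get_data,  # stand-in for the real per-worker RPC
) -> tuple[dict[str, object], set[str], set[str]]:
    data: dict[str, object] = {}
    missing_keys: set[str] = set()  # worth retrying from another replica
    erred_keys: set[str] = set()    # gather() should not try these again
    try:
        response = await get_data(address, keys)
    except OSError:
        # The worker is unreachable: the keys may still exist elsewhere.
        missing_keys.update(keys)
    except Exception:
        # The worker answered but gathering failed (e.g. deserialization);
        # retrying is unlikely to help, so mark the keys as erring.
        erred_keys.update(keys)
    else:
        data.update(response)
    return data, missing_keys, erred_keys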
fjetter left a comment
There is a lot happening in this PR. Can we break this up a bit?
distributed/utils_comm.py (outdated)
who_has: Callable[
    [list[str]],
    Mapping[str, Collection[str]] | Awaitable[Mapping[str, Collection[str]]],
],
I'm not convinced this callback structure is the right approach here. I find this makes the entire mechanism even harder to reason about. Besides, this bypasses (for better or worse) missing/update_who_has mechanics on the worker.
I believe this kind of logic should be handled a layer further up the stack instead of throwing this all into this function.
The method on the worker exclusively serves replicate, which we should pen in time to replace with AMM to begin with. The original method bypasses the whole worker missing/update_who_has system, and I chose not to change that logic since it would be very complicated and ultimately throwaway work.
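As an aside, a callback typed like the who_has parameter quoted above may return either a plain mapping or an awaitable, so the caller has to normalize the result. A minimal sketch of that pattern, using a hypothetical helper name (_call_who_has); this is not the actual PR code:

from __future__ import annotations

from collections.abc import Awaitable, Callable, Collection, Mapping
from inspect import isawaitable


async def _call_who_has(
    who_has: Callable[
        [list[str]],
        Mapping[str, Collection[str]] | Awaitable[Mapping[str, Collection[str]]],
    ],
    keys: list[str],
) -> Mapping[str, Collection[str]]:
    # The callback may be a plain function (e.g. a dict lookup on the
    # scheduler) or a coroutine function (e.g. an RPC from the client),
    # so only await the result when it is awaitable.
    result = who_has(keys)
    if isawaitable(result):
        result = await result
    return result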
distributed/tests/test_scheduler.py (outdated)
with captured_logger("distributed.scheduler") as sched_logger:
    with captured_logger("distributed.client") as client_logger:
        assert await c.gather(x, direct=False) == 2

assert s.tasks[fin.key].who_has == {s.workers[b.address]}
assert a.state.executed_count == 2
assert b.state.executed_count >= 1
# ^ leave room for a future switch from `remove_worker` to `retire_workers`
assert sched_logger.getvalue() == "Couldn't gather keys: {'x': 'memory'}\n" * 3
assert "Couldn't gather 1 keys, rescheduling" in client_logger.getvalue()
Putting aside for a moment what changes you are proposing here: just reading this test, I believe this behavior is wrong.
This test is fetching data via the scheduler but is running into connection failures. However, the Worker is still alive (otherwise the BatchedSend would have been broken and the Worker removed). Despite knowing that the Worker is alive and merely struggling to connect, it reschedules the key? This feels like an unwarranted escalation.
This is the previous behaviour. I did not touch nor investigate it.
More specifically, I exclusively amended Client._gather_remote and didn't go into the several layers worth of wrappers around it.
Happy to look into them, but it should be left to a later PR.
As discussed online and offline:
@hendrikmakait @fjetter These failures I've never seen before; I'll investigate:
These are known offenders but potentially impacted by this PR; I'll investigate:
These are known offenders which shouldn't be impacted:
This reverts commit 3a67188.
I've run all the tests above 10 times in each CI environment, both on main and on this PR, and counted the number of failures; I can see no regression, so this is ready for final review and merge. On a side note, this was an extremely time-consuming activity and we should urgently pen in the time to fix CI.
fjetter left a comment
The above comment is not essential. I think it's good practice to clean up our asyncio usage, but considering this hasn't bothered us before, I assume this thing is just never cancelled.
-for worker, c in coroutines.items():
+for address, task in tasks.items():
     try:
-        r = await c
+        r = await task
I know this is the same as before, but the task scheduling pattern here is problematic in the event of cancellation.
If gather_from_workers is cancelled after the tasks were created, only the task we're currently awaiting is cancelled; all others will continue running and we'll get "never awaited" warnings.
I guess this coroutine function is never actually cancelled, which is why we never ran into this...
The correct approach would be to use asyncio.gather (or, even better but not backwards compatible, asyncio's TaskGroup in Python 3.11+).
I think the change should be straightforward, something like:
results = await asyncio.gather(*tasks.values(), return_exceptions=True)
for addr, res in zip(tasks, results):
    if isinstance(res, OSError):
        ...
I'll address it in a follow-up
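For reference, a minimal sketch of the TaskGroup approach mentioned above (Python 3.11+); fetch_from_worker and gather_with_taskgroup are illustrative stand-ins, not the actual distributed API:

import asyncio


async def fetch_from_worker(address: str, key: str) -> object:
    # Stand-in for the real per-worker RPC; here it just simulates work.
    await asyncio.sleep(0)
    return f"data for {key} from {address}"


async def gather_with_taskgroup(who_has: dict[str, str]) -> dict[str, object]:
    # who_has maps key -> worker address.
    async with asyncio.TaskGroup() as tg:
        tasks = {
            key: tg.create_task(fetch_from_worker(address, key))
            for key, address in who_has.items()
        }
    # Leaving the block waits for all tasks; if one fails, the TaskGroup
    # cancels the rest and re-raises an ExceptionGroup, which also covers
    # the cancellation concern discussed in the comment above.
    return {key: task.result() for key, task in tasks.items()}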