Robust gather in case of connection failures #3246
Conversation
Force-pushed from 5347ccb to 4ea19b1
mrocklin left a comment
In principle this seems fine to me. I'm curious what happens if a worker has genuinely died / stalled in a way where it appears to still be alive. I'm not sure that we actually enforce TTL timeouts on dask workers today.
distributed/scheduler.py (outdated diff excerpt):

```python
iteration = 0
while missing_workers:
    missing_workers_begin = missing_workers.copy()
    missed_data, missing_keys, missing_workers = await gather_from_workers(
```
Why the name `missed_data`? It seems like this is the data that we aren't missing.
There are currently two other places where workers are removed if a connection breaks. As far as I can tell, both rely on the same long-running connection, so it is essentially the same error scenario (the duplication might even cause issues, since one handles the removal immediately while the other schedules a callback... not my point here, though).

No, it isn't enforced, and it isn't even enabled by default. My personal feeling is that we should strongly recommend users enable either the TTL or a worker lifetime so that workers are cycled eventually, but I guess this depends on the use case. Either way, it's hard for me to imagine how the scheduler could close a worker that behaves completely out of line; this is where the nanny or even an external cluster manager must intervene.
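For reference, a minimal sketch of what enabling those safeguards could look like (the config key and CLI flags below are my best recollection of the existing options; the concrete values are placeholders, not recommendations):

```python
import dask

# Ask the scheduler to drop workers whose heartbeat has not been seen for
# longer than the TTL; disabled (null) by default.
dask.config.set({"distributed.scheduler.worker-ttl": "5 minutes"})

# Alternatively, cycle workers after a fixed lifetime from the command line,
# e.g. (flags assumed from the dask-worker CLI):
#   dask-worker <scheduler-address> --lifetime "1 hour" --lifetime-stagger "5 minutes"
```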
Force-pushed from 794f40b to 44df3e7
I pushed the retry logic down to …
Force-pushed from 44df3e7 to f3bbbc5
Thanks for the update @fjetter. The retry logic looks good to me. I hope that this improves things. I'm still uncertain about removing the logic that closes workers we consider missing, though. Thoughts?
If the removal of the worker really is necessary, I'd argue it should be handled by the mechanisms dedicated to dead-worker cleanup (e.g. the worker TTL) rather than in the gather path. I fully agree with the code comment that removing the worker here is extreme.
OK then, I'm happy to defer to your judgement here. My guess is that at this point you're probably as familiar with this code as I am, or more so :). I'm happy to merge this as-is, or wait until you add the test.

It looks like there is a linting failure, and also some intermittent failures.
Force-pushed from 4327ac3 to bd69881
Fixed. I added a rather long and complicated test case to capture the scenario we are facing. The test should cover that the worker is actually allowed to reconnect and may register its keys again, so that they don't need to be computed again. If this test turns out to be too specific or flaky, or causes issues in the future because it relies on specific log messages, we can of course iterate on it. What I also see in this specific scenario is a secondary issue; cause: upon removing the worker from the scheduler, it is also removed from the …
OK. Merging this in. Thank you for this fix @fjetter! It's great having you around.

Thanks for working on this! Since we are seeing this problem quite frequently: when can we expect the next release including this fix?

Dask tends to do a release every couple of weeks. My guess would be this coming Friday.
Thanks for the info, that would be great! |
Issue description
We're observing stability issues when gathering data. What we can currently reconstruct is the following:

1. The connection to a worker fails with an `EnvironmentError` (root cause unknown, the worker seems healthy otherwise).
2. The gather mechanism (`gather_from_workers`) flags the worker for which the connection fails as missing.
3. The scheduler removes the missing worker, together with its state.
4. Since the keys were released earlier on, this amounts to a full data loss and the entire graph is rescheduled.

Root cause
This line removes workers if they are considered missing. As the comment there already notes, this is extreme, and to be honest I don't see the reasoning behind it.

IMHO this is not the place for dead-worker cleanup; that responsibility should be deferred to the mechanisms intended for it, e.g. the worker TTL. If there is a reason for this implementation, I'll happily learn something new and will add a comment there.
Changes in this PR
My intention is to make gathering more robust in failure scenarios. The proposed implementation gets rid of the worker closing and relies on other mechanisms to clean up genuinely dead or misbehaving workers.

If we face a connection issue, we retry the previously failed workers until no further improvement is detected, but at least three times (there is obviously room for refinement); see the sketch below. For happy-path scenarios this behaves just like before, but for connection failures we should be much more robust.

I hard-coded the backoff and retry count since I wasn't sure whether these would make good configuration parameters.
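As a rough illustration of the retry described above, here is a simplified sketch; it is not the merged implementation. `gather_once` is a hypothetical stand-in for a single `gather_from_workers`-style attempt, and the backoff constants are made up:

```python
import asyncio


async def gather_with_retry(gather_once, who_has, min_attempts=3, backoff=0.1):
    """Retry gathering until no further improvement is detected.

    ``gather_once`` is assumed to accept ``{key: [worker addresses]}`` and to
    return ``(data, missing_keys, missing_workers)``, mirroring the shape of
    the call in the diff above.
    """
    data = {}
    remaining = dict(who_has)
    attempt = 0
    while remaining:
        new_data, missing_keys, _missing_workers = await gather_once(remaining)
        data.update(new_data)
        attempt += 1
        # Only the keys we are still missing need another attempt.
        remaining = {k: w for k, w in remaining.items() if k in set(missing_keys)}
        if not remaining:
            break
        # Give up once an attempt made no progress, but only after min_attempts tries.
        if not new_data and attempt >= min_attempts:
            break
        await asyncio.sleep(backoff * attempt)  # simple, hard-coded linear backoff
    return data, remaining
```

On the happy path the loop runs exactly once, so behaviour is unchanged; only when keys are missing do the extra attempts and the backoff kick in.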
Open Question
I'm wondering how deep in the stack we should patch this. In this particular instance we have the call chain

`gather_from_workers` -> `get_data_from_worker` -> `send_recv` / `comm.read` / `comm.write`

where one could argue that even `comm.read`/`comm.write` should already retry, but I'm not sure about the implications, or whether it is intentional that they don't.
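To make that trade-off concrete, here is a hypothetical sketch of what retrying at a lower level could look like, e.g. wrapped around a `send_recv`-style helper that opens a fresh connection per call. The decorator, its parameters and the exception choice are assumptions for illustration, not existing `distributed` APIs:

```python
import asyncio
import functools


def retry_on_connection_error(attempts=3, delay=0.05):
    """Retry an async callable a few times when the connection fails."""

    def decorator(coro_fn):
        @functools.wraps(coro_fn)
        async def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return await coro_fn(*args, **kwargs)
                except EnvironmentError:  # alias of OSError on Python 3
                    if attempt == attempts:
                        raise
                    # Back off a little before the next attempt.
                    await asyncio.sleep(delay * attempt)

        return wrapper

    return decorator
```

Retrying at the `send_recv` level would establish a fresh connection per attempt, whereas retrying `comm.read`/`comm.write` directly would reuse a comm that is already broken, which is presumably part of why it is not done there.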