
Conversation

@fjetter (Member) commented Nov 18, 2019

Issue description

We're observing stability issues when gathering data. What we can currently reconstruct is the following:

  1. During gathering, a connection issue occurs which we see as an EnvironmentError (root cause unknown; the worker seems healthy otherwise).
  2. The data collection logic (gather_from_workers) flags the worker for which the connection fails as missing.
  3. The missing worker is closed forcefully here. Since this worker holds the final result, we lose the result as well. Since the dependencies of the final task were already transitioned to released earlier on, this amounts to a full data loss.
  4. The lost key, i.e. the entire graph, is rescheduled.

Root cause

This line removes workers if they are considered missing. As the comment already notes, this is extreme, and to be honest I don't follow the reasoning behind it.
In my opinion, this is not the place for dead-worker cleanup; that responsibility should be deferred to the mechanisms intended for it, e.g. the worker TTL. If there is a reason for this implementation, I'll happily learn something new and will add a comment there.

Changes in this PR

My intention is to establish better robustness in failure scenarios. The proposed implementation gets rid of the worker closing and relies on another mechanism to clean up workers that are actually dead or misbehaving.
If we face a connection issue, we retry the previously failed workers until "no improvement" is detected, but at least three times (there is obviously room for refinement). For happy-path scenarios this behaves just like before, but for connection failures we should be much more robust. A minimal sketch of the idea is shown below.
I hard-coded the backoff and retry count since I wasn't sure whether this would make a good configuration parameter.
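
For illustration only, a minimal sketch of the retry-until-no-improvement loop, assuming a fetch(who_has) coroutine that returns (data, missing_keys, missing_workers) roughly like gather_from_workers; this is not the actual diff:

    import asyncio

    async def gather_until_no_improvement(fetch, who_has, retries=3, backoff=0.25):
        # Sketch: retry previously failed workers until a full pass brings no
        # improvement for `retries` consecutive attempts.
        data, missing_keys, missing_workers = await fetch(who_has)
        attempts_without_progress = 0
        while missing_keys and attempts_without_progress < retries:
            before = (set(missing_keys), set(missing_workers))
            await asyncio.sleep(backoff)  # hard-coded backoff, as in the PR description
            new_data, missing_keys, missing_workers = await fetch(
                {key: who_has[key] for key in missing_keys}
            )
            data.update(new_data)
            if (set(missing_keys), set(missing_workers)) == before:
                attempts_without_progress += 1  # "no improvement" detected
            else:
                attempts_without_progress = 0  # progress was made, reset the counter
        return data, missing_keys, missing_workers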

Open Question

I'm wondering how deep in the stack we should patch this. In this particular instance we have a call chain of
gather_from_workers -> get_data_from_worker -> send_recv/comm.read/comm.write
where one could argue that even comm.read/comm.write should already retry, but I'm not sure about the implications, or whether it was intentionally not implemented this way.

@mrocklin (Member) left a comment

In principle this seems fine to me. I'm curious what happens if a worker has genuinely died / stalled in a way where it appears to still be alive. I'm not sure that we actually enforce TTL timeouts on dask workers today.

    iteration = 0
    while missing_workers:
        missing_workers_begin = missing_workers.copy()
        missed_data, missing_keys, missing_workers = await gather_from_workers(

Why the name missed_data? It seems like this is the data that we aren't missing.

@fjetter (Member, Author) commented Nov 20, 2019

I'm curious what happens if a worker has genuinely died / stalled in a way where it appears to still be alive.

The remove_worker within gather is the only place in the entire scheduler where, upon a connection issue, the worker is actually requested to close (not only removed from the scheduler).
Genuinely dead workers are probably handled by the methods mentioned below. If the worker is in a weird semi-dead state, I wonder whether the close request would even work. Either way, a more sensible approach seems appropriate.

There are currently two other places where workers are removed if a connection breaks.

As far as I can tell, these two instances rely on the same long-running connection, so it is essentially the same error scenario (the duplication might even cause issues, since one handles the removal immediately while the other schedules a callback... not my point here, though).
The difference, compared to the removal within the gather method, is that they will not close the worker, since closing the worker relies on the stream_comm. Both worker_send and handle_worker only trigger the error handling / worker removal if the stream_comm is broken or nonexistent.
Therefore, the scheduler essentially only closes workers via worker_ttl (or upon explicit request, e.g. restart, retire_worker, ...).

I'm not sure that we actually enforce TTL timeouts on dask workers today

No, it isn't enforced, and it isn't even enabled by default. My personal feeling is that users should be strongly encouraged to enable either the TTL or a worker lifetime so that workers are cycled eventually, but I guess this depends on the use case. An illustrative example of opting in follows below.
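
For reference, one way to opt in to these safeguards; the configuration key and CLI flags below are given as I recall them and should be checked against the installed version:

    import dask

    # Scheduler-side TTL: workers whose heartbeat is older than this get removed.
    dask.config.set({"distributed.scheduler.worker-ttl": "5 minutes"})

    # Alternatively, cycle workers proactively from the command line, e.g.:
    #   dask-worker <scheduler-address> --lifetime "1 hour" --lifetime-restart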

Either way, it's hard for me to imagine how the scheduler could close a worker that behaves completely out of line. This is where the nanny, or even an external cluster manager, must intervene.

@fjetter force-pushed the robust_gather branch 5 times, most recently from 794f40b to 44df3e7 on November 20, 2019 at 17:03
@fjetter (Member, Author) commented Nov 20, 2019

I pushed the retry logic down to worker.get_data_from_worker, where the retry is much simpler to implement and where many call sites benefit from it. A rough sketch of the idea follows below.
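
A rough, hypothetical sketch of that kind of low-level retry; this is not the code that landed in get_data_from_worker, and the names and parameters are placeholders:

    import asyncio

    def retry_comm(retries=3, delay=0.1, exceptions=(OSError,)):
        # Retry a flaky coroutine a few times before letting the error propagate.
        def decorator(coro):
            async def wrapper(*args, **kwargs):
                for attempt in range(retries + 1):
                    try:
                        return await coro(*args, **kwargs)
                    except exceptions:
                        if attempt == retries:
                            raise  # out of retries, surface the original error
                        await asyncio.sleep(delay)  # hard-coded backoff
            return wrapper
        return decorator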

@mrocklin (Member)

Thanks for the update @fjetter . The retry logic looks good to me. I hope that this improves things.

I'm still uncertain about removing the remove_worker call, but that's mostly because I haven't had the time to review the implications of this closely (my apologies also for the delayed response). Given this, I think that we have two options:

  1. We merge this as-is, and you promise to help out if removing those lines ends up harming other use cases in the near-to-moderate future
  2. We keep the remove_worker lines, and just add the retry logic, which will hopefully avoid us having to call it in the future.

Thoughts?

@fjetter (Member, Author) commented Nov 25, 2019

you promise to help out if removing those lines ends up harming other use cases in the near-to-moderate future

If the removal of remove_worker poses issues I will help out. We have a strong interest in this working properly.

We keep the remove_worker lines, and just add the retry logic

I fully agree with the code comment "This is extreme" and think this should not be there in its current form.
Probably a better way to handle these situations would be to replace the current self.remove_worker(address=worker) with self.remove_worker(address=worker, close=False). This would have the following effect (see the sketch after this list):

  • close=False: do not send the close signal to the worker but only remove it from internal scheduler state. This allows the worker to reconnect, and we wouldn't lose state.
  • safe=False (i.e. keep the default): increase the suspicious count for all tasks processing on the worker. Without this, I'm not sure the suspicious count is increased anywhere at all, which effectively disables the "suspicious tasks" feature.
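
A hedged sketch of that proposal; the surrounding error handling is paraphrased, the name missing_workers is taken from the diff snippet above, and logger is assumed, so none of this is the final code:

    # In the scheduler's gather error handling (paraphrased):
    for worker in missing_workers:
        logger.warning("Worker %s failed to serve data, removing it from scheduler state", worker)
        # close=False: drop the worker from internal scheduler state without
        # telling it to shut down, so it can reconnect and re-register its keys.
        # safe stays at its default (False) so tasks processing on this worker
        # still get their suspicious count incremented.
        self.remove_worker(address=worker, close=False)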

@mrocklin (Member)

If the removal of remove_worker poses issues I will help out. We have a strong interest in this working properly.

OK then, I'm happy to defer to your judgement here. My guess is that at this point you're probably as familiar (or more familiar) with this code as I am :). I'm happy to merge this as-is, or wait until you add the close=False/safe=False change, whichever you think is best.

@mrocklin (Member)

It looks like there is a linting failure, and also some intermittent failures. The test_workspace_concurrency and test_nanny_terminate failures are known (my apologies for not fixing these yet), but the test_gather_failing_cnn_error failure is new.

@fjetter (Member, Author) commented Nov 26, 2019

but the test_gather_failing_cnn_error failure is new.

Fixed

I added a rather long and complicated test case to capture the scenario we are facing. The test verifies that the worker is actually allowed to reconnect and may register its keys again, so that they don't need to be computed again. If the test turns out to be too specific or flaky, or poses issues in the future due to the specific log messages, we can of course iterate on it.

What I also see is that in this specific scenario we get a
Unexpected worker completed task, likely due to work stealing. Expected: %s, Got: %s, Key: %s
log message. I didn't include this in the test because it feels more like an unwanted side effect than intended behaviour. To my understanding it shouldn't harm anyone, though (see transition_processing_memory).

Cause: upon removing the worker from the scheduler, it is also removed from the TaskState's processing_on attribute. Once the worker reconnects and offers the result, this is "unexpected" to the scheduler, but it doesn't look harmful.

@mrocklin (Member)

OK. Merging this in. Thank you for this fix @fjetter ! It's great having you around.

@mrocklin merged commit 1d9aaac into dask:master on Nov 26, 2019
@amerkel2 (Contributor) commented Dec 2, 2019

Thanks for working on this! Since we are seeing this problem quite frequently: When can we expect the next release including this fix?

@mrocklin (Member) commented Dec 2, 2019

Dask tends to do a release every couple of weeks. My guess would be this coming Friday.

@amerkel2 (Contributor) commented Dec 3, 2019

Thanks for the info, that would be great!
