Skip to content

Conversation

@fjetter
Copy link
Member

@fjetter fjetter commented Jul 21, 2021

Yet another deadlock and some inconsistent behaviour.

If a worker disconnects while it is computing, e.g. connection broke because of , it is allowed to reconnect but the behaviour is currently a bit ill defined and there are multiple issues

  • Tasks currently in memory of the worker are never registered and will neither be forgotten nor will they be used at any point in time
  • If the computation finishes before the rescheduled one, the task is simply ignored. We should either allow the worker to keep that task and transition everything to memory or instruct it to forget it again. Currently we're doing neither.
  • Lastly, since the reconnect submits all tasks it knows via nbytes but not via types this causes an exception on worker side which actually results in another deadlock, see stuck tasks that never complete when running on many SLURM nodes #5078 The scheduler is stuck with a partial registration and the worker waits for the heartbeat response.

I went for the "keep both tasks" once computed, regardless of the worker who is supposed to compute it. IMO this is a sane behaviour for most computations. I see problems with non-pure tasks but I do think we are not treating this case properly for any kind of rescheduling, do we??

@fjetter
Copy link
Member Author

fjetter commented Jul 23, 2021

This builds on #5034

@fjetter
Copy link
Member Author

fjetter commented Jul 23, 2021

Test failures on py3.7 appear to be unrelated

The ubu win-py3.8 test is a timeout of the newly introduced test test_worker_reconnects_mid_compute. will have a look

Ran the test locally about 2k times without encountering the failure. Ran it on windows as well without success of reproduction. I assume this is a funny coincidence. I re-triggered everything and this time a different test fails

@fjetter
Copy link
Member Author

fjetter commented Jul 23, 2021

cc @jrbourbeau @mrocklin

@fjetter fjetter force-pushed the worker_reconnect_deadlock branch from 8393578 to 3af453d Compare July 26, 2021 11:49
gjoseph92 added a commit to gjoseph92/coiled-parameter-sweep-profiling that referenced this pull request Jul 27, 2021
@fjetter
Copy link
Member Author

fjetter commented Jul 27, 2021

Test failures appear to be unrelated. Will give it another day and will merge tomorrow if there are no further objections.

@jakirkham I reverted the changes you mentioned. I'd appreciate a short comment if you are fine with this

@jakirkham
Copy link
Member

jakirkham commented Jul 28, 2021

Yep the new changes seem fine. Marked the thread resolved. Thank you :)

@fjetter fjetter merged commit 85c95be into dask:main Jul 28, 2021
@fjetter fjetter deleted the worker_reconnect_deadlock branch July 28, 2021 10:59
madsbk added a commit to madsbk/distributed that referenced this pull request Aug 16, 2021
madsbk added a commit to madsbk/distributed that referenced this pull request Aug 16, 2021
madsbk added a commit to madsbk/distributed that referenced this pull request Aug 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants