Closed
Description
At the end of long-running jobs, I often see one or two tasks that never finish. On closer inspection, the stuck task appears to be processing on a worker that no longer exists.

Even after closing every worker, the status of this task doesn't change.
Searching through the scheduler logs for this task I found:
2019-11-20T18:58:16.118979408Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing. Expected: tcp://10.47.187.2:42561, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 9507)
Seems pretty fishy that a worker could steal a task from itself? 🤔
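To check how often this happens, one option is to scan the scheduler log for work-stealing messages where the expected and reporting worker are the same address. This is only a rough sketch that assumes the exact message format quoted above; `find_self_steals` and `STEAL_RE` are hypothetical names of my own, not part of distributed.

```python
import re

# Matches the scheduler's work-stealing message, assuming the log format
# quoted in this issue. The names here are illustrative, not a distributed API.
STEAL_RE = re.compile(
    r"Unexpected worker completed task, likely due to work stealing\.\s+"
    r"Expected: (?P<expected>\S+), Got: (?P<got>\S+), Key: (?P<key>.+)$"
)


def find_self_steals(lines):
    """Return (worker, key) pairs where Expected and Got are the same address."""
    hits = []
    for line in lines:
        match = STEAL_RE.search(line.rstrip("\n"))
        if match and match.group("expected") == match.group("got"):
            hits.append((match.group("got"), match.group("key").strip()))
    return hits
```

It can be run over a saved log with something like `find_self_steals(open("scheduler.log"))`, where the file name is a placeholder.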
Some other context:
- Kubernetes cluster with ~1000 workers (before scaling to investigate the deadlock), around 100k tasks total (50k of these are pandas tasks, plus a processing task for each partition)
- Our workers run out of memory (OOM) and get evicted somewhat frequently (every few hours on average), so we use a lot of retries to compensate
- We garbage-collect futures as we go using
dask.distributed.as_completed(futures); del futures
Full scheduler logs related to this renegade worker, which has quite a 🎢 minute at 18:58:
2019-11-20T18:46:57.497189783Z distributed.scheduler - INFO - Register tcp://10.47.187.2:42561
2019-11-20T18:46:57.497605833Z distributed.scheduler - INFO - Starting worker compute stream, tcp://10.47.187.2:42561
2019-11-20T18:57:57.392080977Z distributed.scheduler - INFO - Remove worker tcp://10.47.187.2:42561
2019-11-20T18:57:57.392125386Z distributed.core - INFO - Removing comms to tcp://10.47.187.2:42561
2019-11-20T18:58:01.058431284Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing. Expected: tcp://10.47.193.16:36713, Got: tcp://10.47.187.2:42561, Key: ('to_records-ff4aa1a02b819ae4fa0769294d47599b', 21371)
2019-11-20T18:58:01.059278416Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing. Expected: tcp://10.47.198.7:38975, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 18520)
2019-11-20T18:58:01.059309058Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing. Expected: tcp://10.47.197.2:41001, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 4917)
2019-11-20T18:58:01.05939125Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing. Expected: tcp://10.47.191.7:39843, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 38984)
2019-11-20T18:58:01.059907822Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing. Expected: tcp://10.47.197.16:43383, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 10328)
2019-11-20T18:58:01.060021135Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing. Expected: tcp://10.47.191.4:45047, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 5609)
2019-11-20T18:58:01.060103572Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing. Expected: tcp://10.47.189.7:33727, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 20321)
2019-11-20T18:58:01.060238991Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing. Expected: tcp://10.47.196.17:46667, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 35456)
2019-11-20T18:58:01.060305807Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing. Expected: tcp://10.47.190.3:44043, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 5880)
2019-11-20T18:58:01.060395695Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing. Expected: tcp://10.47.197.9:34817, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 11345)
2019-11-20T18:58:01.060545848Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing. Expected: tcp://10.47.190.4:40125, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 3060)
2019-11-20T18:58:01.060574243Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing. Expected: tcp://10.47.198.11:45013, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 7430)
2019-11-20T18:58:01.060654418Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing. Expected: tcp://10.47.198.12:43271, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 1786)
2019-11-20T18:58:01.06072858Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing. Expected: tcp://10.47.196.16:36097, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 16601)
2019-11-20T18:58:01.060834564Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing. Expected: tcp://10.47.199.2:34239, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 20394)
2019-11-20T18:58:01.06088221Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing. Expected: tcp://10.47.197.17:34623, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 30949)
2019-11-20T18:58:01.06097191Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing. Expected: tcp://10.47.198.6:45873, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 40418)
2019-11-20T18:58:01.06109409Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing. Expected: tcp://10.47.198.13:38995, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 38647)
2019-11-20T18:58:01.061636897Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing. Expected: tcp://10.47.191.5:37251, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 22807)
2019-11-20T18:58:01.064454821Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing. Expected: tcp://10.47.199.9:44443, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 29398)
2019-11-20T18:58:01.064629746Z distributed.scheduler - INFO - Register tcp://10.47.187.2:42561
2019-11-20T18:58:01.064964628Z distributed.scheduler - INFO - Starting worker compute stream, tcp://10.47.187.2:42561
2019-11-20T18:58:16.118979408Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing. Expected: tcp://10.47.187.2:42561, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 9507)
2019-11-21T02:44:03.134375618Z distributed.scheduler - INFO - Remove worker tcp://10.47.187.2:42561
2019-11-21T02:44:03.134422097Z distributed.core - INFO - Removing comms to tcp://10.47.187.2:42561
2019-11-21T18:06:51.328850288Z distributed.utils - ERROR - Timed out trying to connect to 'tcp://10.47.187.2:42561' after 300 s: Timed out trying to connect to 'tcp://10.47.187.2:42561' after 300 s: connect() didn't finish in time
2019-11-21T18:06:51.328916264Z OSError: Timed out trying to connect to 'tcp://10.47.187.2:42561' after 300 s: connect() didn't finish in time
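The suspicious part of the log above is that task completions arrive from the worker after the scheduler has removed it and before it registers again. A small sketch for spotting that pattern in other logs follows; it assumes the log lines are in chronological order and use the message formats shown above, and `reports_from_removed_workers` is a hypothetical helper, not something distributed provides.

```python
import re

# Worker lifecycle events as they appear in the scheduler log above.
EVENT_RE = re.compile(r"INFO - (?P<event>Register|Remove worker) (?P<addr>tcp://\S+)")
# The worker that actually reported a completed task.
GOT_RE = re.compile(r"Got: (?P<addr>tcp://[^,\s]+), Key: (?P<key>.+)$")


def reports_from_removed_workers(lines):
    """Return (timestamp, worker, key) for completions reported by a worker
    that the scheduler had removed and not yet re-registered.

    Assumes lines are chronological and start with a timestamp.
    """
    removed = set()
    stale = []
    for line in lines:
        line = line.rstrip("\n")
        event = EVENT_RE.search(line)
        if event:
            if event.group("event") == "Register":
                removed.discard(event.group("addr"))
            else:
                removed.add(event.group("addr"))
            continue
        got = GOT_RE.search(line)
        if got and got.group("addr") in removed:
            timestamp = line.split(" ", 1)[0]
            stale.append((timestamp, got.group("addr"), got.group("key").strip()))
    return stale
```

On the excerpt above, this should flag the burst of completions at 18:58:01 (between the removal at 18:57:57 and the second registration) but not the one at 18:58:16.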
This seems like it could be related to #3246, but I'm not sure. @fjetter, is any of this reminiscent of the behavior that prompted that PR?