Task stuck processing on non-existent worker #3256

@bnaul

At the end of long-running jobs, I often see one or two tasks that never finish. Looking more closely, the stuck task appears to be processing on a non-existent worker:
[screenshot]

Even after closing every worker, the status of this task doesn't change.

Searching through the scheduler logs for this task, I found:

2019-11-20T18:58:16.118979408Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.187.2:42561, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 9507)

Seems pretty fishy that a worker could steal a task from itself? 🤔
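
To check whether other keys hit the same "stolen from itself" case, a filter along these lines could be run over the scheduler logs. This is only a sketch: the regex is built from the log format quoted above, and `find_self_steals` is a hypothetical helper name, not part of distributed.

```python
import re

# Pattern derived from the "Unexpected worker completed task" log line above.
UNEXPECTED = re.compile(
    r"Unexpected worker completed task.*?"
    r"Expected: (?P<expected>\S+), Got: (?P<got>\S+), Key: (?P<key>.+)$"
)


def find_self_steals(lines):
    """Yield (worker, key) for lines where Expected and Got are the same address."""
    for line in lines:
        m = UNEXPECTED.search(line.rstrip())
        if m and m.group("expected") == m.group("got"):
            yield m.group("got"), m.group("key")


# Two lines copied from the logs in this issue: one ordinary, one self-steal.
sample = [
    "2019-11-20T18:58:01.058431284Z distributed.scheduler - INFO - Unexpected worker "
    "completed task, likely due to work stealing.  Expected: tcp://10.47.193.16:36713, "
    "Got: tcp://10.47.187.2:42561, Key: ('to_records-ff4aa1a02b819ae4fa0769294d47599b', 21371)",
    "2019-11-20T18:58:16.118979408Z distributed.scheduler - INFO - Unexpected worker "
    "completed task, likely due to work stealing.  Expected: tcp://10.47.187.2:42561, "
    "Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 9507)",
]

self_steals = list(find_self_steals(sample))
for worker, key in self_steals:
    print(worker, key)
```

Only the second sample line is reported, since it is the only one where both addresses match.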

Some other context:

  • k8s cluster w/ ~1000 workers (before scaling to investigate the deadlock), around 100k tasks total (50k of these pandas tasks, and a processing task for each partition)
  • Our workers run OOM and get evicted somewhat frequently (every few hrs on average), and we use a lot of retries to compensate
  • We GC futures as we go using dask.distributed.as_completed(futures); del futures

Full scheduler logs related to this renegade worker, which has quite a 🎢 minute at 18:58:

2019-11-20T18:46:57.497189783Z distributed.scheduler - INFO - Register tcp://10.47.187.2:42561
2019-11-20T18:46:57.497605833Z distributed.scheduler - INFO - Starting worker compute stream, tcp://10.47.187.2:42561
2019-11-20T18:57:57.392080977Z distributed.scheduler - INFO - Remove worker tcp://10.47.187.2:42561
2019-11-20T18:57:57.392125386Z distributed.core - INFO - Removing comms to tcp://10.47.187.2:42561
2019-11-20T18:58:01.058431284Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.193.16:36713, Got: tcp://10.47.187.2:42561, Key: ('to_records-ff4aa1a02b819ae4fa0769294d47599b', 21371)
2019-11-20T18:58:01.059278416Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.198.7:38975, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 18520)
2019-11-20T18:58:01.059309058Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.197.2:41001, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 4917)
2019-11-20T18:58:01.05939125Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.191.7:39843, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 38984)
2019-11-20T18:58:01.059907822Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.197.16:43383, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 10328)
2019-11-20T18:58:01.060021135Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.191.4:45047, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 5609)
2019-11-20T18:58:01.060103572Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.189.7:33727, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 20321)
2019-11-20T18:58:01.060238991Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.196.17:46667, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 35456)
2019-11-20T18:58:01.060305807Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.190.3:44043, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 5880)
2019-11-20T18:58:01.060395695Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.197.9:34817, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 11345)
2019-11-20T18:58:01.060545848Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.190.4:40125, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 3060)
2019-11-20T18:58:01.060574243Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.198.11:45013, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 7430)
2019-11-20T18:58:01.060654418Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.198.12:43271, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 1786)
2019-11-20T18:58:01.06072858Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.196.16:36097, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 16601)
2019-11-20T18:58:01.060834564Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.199.2:34239, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 20394)
2019-11-20T18:58:01.06088221Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.197.17:34623, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 30949)
2019-11-20T18:58:01.06097191Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.198.6:45873, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 40418)
2019-11-20T18:58:01.06109409Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.198.13:38995, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 38647)
2019-11-20T18:58:01.061636897Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.191.5:37251, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 22807)
2019-11-20T18:58:01.064454821Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.199.9:44443, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 29398)
2019-11-20T18:58:01.064629746Z distributed.scheduler - INFO - Register tcp://10.47.187.2:42561
2019-11-20T18:58:01.064964628Z distributed.scheduler - INFO - Starting worker compute stream, tcp://10.47.187.2:42561
2019-11-20T18:58:16.118979408Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.187.2:42561, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 9507)
2019-11-21T02:44:03.134375618Z distributed.scheduler - INFO - Remove worker tcp://10.47.187.2:42561
2019-11-21T02:44:03.134422097Z distributed.core - INFO - Removing comms to tcp://10.47.187.2:42561
2019-11-21T18:06:51.328850288Z distributed.utils - ERROR - Timed out trying to connect to 'tcp://10.47.187.2:42561' after 300 s: Timed out trying to connect to 'tcp://10.47.187.2:42561' after 300 s: connect() didn't finish in time
2019-11-21T18:06:51.328916264Z OSError: Timed out trying to connect to 'tcp://10.47.187.2:42561' after 300 s: connect() didn't finish in time
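
Reading the excerpt in order, the 20 "Unexpected worker completed task" messages at 18:58:01 arrive after the `Remove worker` at 18:57:57 and before the second `Register` at 18:58:01.064. A minimal sketch for flagging that pattern in a larger log follows. It assumes the lines are in chronological order and match the format above; the regexes and the function name are hypothetical helpers, not distributed APIs.

```python
import re

REGISTER = re.compile(r"distributed\.scheduler - INFO - Register (\S+)$")
REMOVE = re.compile(r"distributed\.scheduler - INFO - Remove worker (\S+)$")
COMPLETED = re.compile(
    r"Unexpected worker completed task.*?Got: ([^,\s]+), Key: (.+)$"
)


def completions_from_removed_workers(lines):
    """Return (timestamp, worker, key) for completions reported by a worker
    that the scheduler had removed and not yet re-registered."""
    removed = set()  # addresses currently considered removed
    flagged = []
    for line in lines:
        line = line.rstrip()
        timestamp = line.split(" ", 1)[0]
        m = REGISTER.search(line)
        if m:
            removed.discard(m.group(1))
            continue
        m = REMOVE.search(line)
        if m:
            removed.add(m.group(1))
            continue
        m = COMPLETED.search(line)
        if m and m.group(1) in removed:
            flagged.append((timestamp, m.group(1), m.group(2)))
    return flagged


# A shortened copy of the excerpt above (one completion before and one after
# the worker re-registers).
PREFIX = "distributed.scheduler - INFO - "
STEAL = "Unexpected worker completed task, likely due to work stealing.  "
log = [
    "2019-11-20T18:46:57.497189783Z " + PREFIX + "Register tcp://10.47.187.2:42561",
    "2019-11-20T18:57:57.392080977Z " + PREFIX + "Remove worker tcp://10.47.187.2:42561",
    "2019-11-20T18:58:01.058431284Z " + PREFIX + STEAL
    + "Expected: tcp://10.47.193.16:36713, Got: tcp://10.47.187.2:42561, "
    + "Key: ('to_records-ff4aa1a02b819ae4fa0769294d47599b', 21371)",
    "2019-11-20T18:58:01.064629746Z " + PREFIX + "Register tcp://10.47.187.2:42561",
    "2019-11-20T18:58:16.118979408Z " + PREFIX + STEAL
    + "Expected: tcp://10.47.187.2:42561, Got: tcp://10.47.187.2:42561, "
    + "Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 9507)",
]

flagged = completions_from_removed_workers(log)
for entry in flagged:
    print(entry)
```

The sketch relies on line order rather than parsing timestamps, because the fractional seconds in these logs vary in length and do not sort reliably as strings.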

This seems like it could be related to #3246, but I'm not sure. @fjetter, is any of this reminiscent of the behavior that prompted that PR?
