Frequently, large dask computations on large clusters seem to fail with "could not find dependent" errors:
/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
1811 exc = CancelledError(key)
1812 else:
-> 1813 raise exception.with_traceback(traceback)
1814 raise exc
1815 if errors == "skip":
ValueError: Could not find dependent ('broadcast_to-concatenate-where-getitem-4916603adba736a1cf7a63ef1786c8d8', 8, 1, 0, 0). Check worker logs
To check the logs, I searched the master (scheduler) logs for the key and then tracked it down to a specific worker (a rough sketch of that search follows the excerpt below). The worker log looks like this:
distributed.worker - INFO - Start worker at: tls://10.244.47.9:36493
distributed.worker - INFO - Listening to: tls://10.244.47.9:36493
distributed.worker - INFO - dashboard at: 10.244.47.9:8787
distributed.worker - INFO - Waiting to connect to: tls://dask-258014f969ff4b288a395f5e84f2354b.prod:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 8.00 GiB
distributed.worker - INFO - Local Directory: /home/jovyan/dask-worker-space/worker-5bjfqpay
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tls://dask-258014f969ff4b288a395f5e84f2354b.prod:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Can't find dependencies {<Task "('broadcast_to-concatenate-where-getitem-bceb982d462d09c679da971dd8c4d874', 8, 2, 0, 0)" fetch>} for key ('getitem-bceb982d462d09c679da971dd8c4d874', 8, 2, 0, 0)
distributed.worker - INFO - Dependent not found: ('broadcast_to-concatenate-where-getitem-bceb982d462d09c679da971dd8c4d874', 8, 2, 0, 0) 0 . Asking scheduler
distributed.worker - INFO - Can't find dependencies {<Task "('broadcast_to-concatenate-where-getitem-bceb982d462d09c679da971dd8c4d874', 8, 2, 0, 0)" fetch>} for key ('getitem-bceb982d462d09c679da971dd8c4d874', 8, 2, 0, 0)
distributed.worker - INFO - Dependent not found: ('broadcast_to-concatenate-where-getitem-bceb982d462d09c679da971dd8c4d874', 8, 2, 0, 0) 1 . Asking scheduler
... many repeats
distributed.worker - ERROR - Handle missing dep failed, retrying
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/worker.py", line 2550, in handle_missing_dep
    for dep in deps:
RuntimeError: Set changed size during iteration
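For reference, this is roughly how I'm pulling and searching the logs (a minimal sketch, assuming a client connected to the same scheduler; get_scheduler_logs() and get_worker_logs() are standard distributed Client calls, and I stringify each entry because the exact record format can differ between versions):

from distributed import Client

# Assumes the cluster from the logs above is still running; the scheduler
# address is deployment-specific.
client = Client("tls://dask-258014f969ff4b288a395f5e84f2354b.prod:8786")

# Key fragment taken from the traceback above.
key_fragment = "broadcast_to-concatenate-where-getitem-4916603adba736a1cf7a63ef1786c8d8"

# Search the master (scheduler) logs first ...
for entry in client.get_scheduler_logs():
    if key_fragment in str(entry):
        print("scheduler:", entry)

# ... then track the key down to specific workers.
for worker_addr, entries in client.get_worker_logs().items():
    for entry in entries:
        if key_fragment in str(entry):
            print(worker_addr, entry)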
The general situation is that I've stacked a large, multi-tile dataset (25 time steps, 5 bands, 60000 x 40000 pixels), roughly 2.5 TB in total. I've tried cluster sizes ranging from 45 to 300 cores.
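For context, the array is shaped roughly like this (a sketch with made-up data and an illustrative chunking, not my actual pipeline):

import dask.array as da

# 25 time steps x 5 bands x 60000 x 40000 pixels of float64 is ~2.4 TB.
# The chunk shape below is only illustrative.
tiles = [
    da.random.random((5, 60000, 40000), chunks=(1, 4096, 4096))
    for _ in range(25)
]
stack = da.stack(tiles)                 # shape (25, 5, 60000, 40000)
print(stack.nbytes / 1e12, "TB")        # ~2.4

# A reduction across the stack is the kind of computation that fails:
# result = stack.mean(axis=0).compute()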
Can you give me some general guidance on this sort of error? I can't reproduce it reliably, so it's hard to provide a minimal example. Does anyone have guidance on what is happening and how I could explore possible causes and solutions?