
Computations fail with "could not find dependent" #11

@MikeBeller

Description


Large Dask computations on large clusters frequently fail with "could not find dependent" errors:

/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1811                             exc = CancelledError(key)
   1812                         else:
-> 1813                             raise exception.with_traceback(traceback)
   1814                         raise exc
   1815                     if errors == "skip":

ValueError: Could not find dependent ('broadcast_to-concatenate-where-getitem-4916603adba736a1cf7a63ef1786c8d8', 8, 1, 0, 0).  Check worker logs

To check the logs, I searched the scheduler ("master") logs for the key, then tracked it down to a specific worker. That worker's log looks like this:

distributed.worker - INFO - Start worker at: tls://10.244.47.9:36493
distributed.worker - INFO - Listening to: tls://10.244.47.9:36493
distributed.worker - INFO - dashboard at: 10.244.47.9:8787
distributed.worker - INFO - Waiting to connect to: tls://dask-258014f969ff4b288a395f5e84f2354b.prod:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 8.00 GiB
distributed.worker - INFO - Local Directory: /home/jovyan/dask-worker-space/worker-5bjfqpay
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tls://dask-258014f969ff4b288a395f5e84f2354b.prod:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Can't find dependencies {<Task "('broadcast_to-concatenate-where-getitem-bceb982d462d09c679da971dd8c4d874', 8, 2, 0, 0)" fetch>} for key ('getitem-bceb982d462d09c679da971dd8c4d874', 8, 2, 0, 0)
distributed.worker - INFO - Dependent not found: ('broadcast_to-concatenate-where-getitem-bceb982d462d09c679da971dd8c4d874', 8, 2, 0, 0) 0 . Asking scheduler
distributed.worker - INFO - Can't find dependencies {<Task "('broadcast_to-concatenate-where-getitem-bceb982d462d09c679da971dd8c4d874', 8, 2, 0, 0)" fetch>} for key ('getitem-bceb982d462d09c679da971dd8c4d874', 8, 2, 0, 0)
distributed.worker - INFO - Dependent not found: ('broadcast_to-concatenate-where-getitem-bceb982d462d09c679da971dd8c4d874', 8, 2, 0, 0) 1 . Asking scheduler
... many repeats
distributed.worker - ERROR - Handle missing dep failed, retrying
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/worker.py", line 2550, in handle_missing_dep
    for dep in deps:
RuntimeError: Set changed size during iteration
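That last RuntimeError looks like a plain Python failure mode rather than a cluster problem: iterating over a set while something else (e.g. another coroutine on the worker's event loop) mutates it raises exactly this error. A minimal sketch of the pattern:

deps = {"a", "b", "c"}
try:
    for dep in deps:       # handle_missing_dep iterates the dependency set like this
        deps.discard(dep)  # concurrent removal mid-iteration has the same effect
except RuntimeError as err:
    print(err)             # -> Set changed size during iteration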

The general situation: I've built a large, multi-tile stack (25 time steps, 5 bands, 60000 x by 40000 y), about 2.5 TB in total. I've tried cluster sizes ranging from 45 to 300 cores.
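For concreteness, an array of roughly that shape and size can be sketched with dask.array (the chunking here is purely illustrative, not my actual chunking):

import dask.array as da

# Illustrative (time, band, y, x) stack: 25 * 5 * 40000 * 60000 elements,
# which at 8 bytes each is ~2.4 TB, matching the ~2.5 TB figure above.
stack = da.zeros((25, 5, 40000, 60000), dtype="float64", chunks=(1, 1, 4000, 4000))
print(f"{stack.nbytes / 1e12:.1f} TB across {stack.numblocks} blocks per axis")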

Can you give me general guidance on this sort of error? I can't reproduce it reliably, so it's hard to provide a minimal example. Does anyone have guidance on what is happening and how I could explore possible causes and solutions?
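For anyone retracing the log hunt above: this is roughly how I pull logs through the client rather than shelling into pods (the scheduler address is a placeholder, and the exact return shapes of these calls may vary by distributed version):

from distributed import Client

client = Client("tls://scheduler-address:8786")  # placeholder address
key = "('broadcast_to-concatenate-where-getitem-4916603adba736a1cf7a63ef1786c8d8', 8, 1, 0, 0)"

# Find mentions of the failing key in the scheduler log...
for level, msg in client.get_scheduler_logs():
    if key in msg:
        print(level, msg)

# ...then scan each worker's log for the same key.
for worker, logs in client.get_worker_logs().items():
    if any(key in msg for _, msg in logs):
        print("found on", worker)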
