Frequently, large dask computations on large clusters seem to fail with "could not find dependent" errors:
/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
1811 exc = CancelledError(key)
1812 else:
-> 1813 raise exception.with_traceback(traceback)
1814 raise exc
1815 if errors == "skip":
ValueError: Could not find dependent ('broadcast_to-concatenate-where-getitem-4916603adba736a1cf7a63ef1786c8d8', 8, 1, 0, 0). Check worker logs
To check the logs, I searched the master (scheduler) logs for the key and then tracked it down to a specific worker (a rough sketch of that search follows the excerpt below). The worker log looks like this:
distributed.worker - INFO - Start worker at: tls://10.244.47.9:36493
distributed.worker - INFO - Listening to: tls://10.244.47.9:36493
distributed.worker - INFO - dashboard at: 10.244.47.9:8787
distributed.worker - INFO - Waiting to connect to: tls://dask-258014f969ff4b288a395f5e84f2354b.prod:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 8.00 GiB
distributed.worker - INFO - Local Directory: /home/jovyan/dask-worker-space/worker-5bjfqpay
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tls://dask-258014f969ff4b288a395f5e84f2354b.prod:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Can't find dependencies {<Task "('broadcast_to-concatenate-where-getitem-bceb982d462d09c679da971dd8c4d874', 8, 2, 0, 0)" fetch>} for key ('getitem-bceb982d462d09c679da971dd8c4d874', 8, 2, 0, 0)
distributed.worker - INFO - Dependent not found: ('broadcast_to-concatenate-where-getitem-bceb982d462d09c679da971dd8c4d874', 8, 2, 0, 0) 0 . Asking scheduler
distributed.worker - INFO - Can't find dependencies {<Task "('broadcast_to-concatenate-where-getitem-bceb982d462d09c679da971dd8c4d874', 8, 2, 0, 0)" fetch>} for key ('getitem-bceb982d462d09c679da971dd8c4d874', 8, 2, 0, 0)
distributed.worker - INFO - Dependent not found: ('broadcast_to-concatenate-where-getitem-bceb982d462d09c679da971dd8c4d874', 8, 2, 0, 0) 1 . Asking scheduler
... many repeats
distributed.worker - ERROR - Handle missing dep failed, retrying
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/worker.py", line 2550, in handle_missing_dep
    for dep in deps:
RuntimeError: Set changed size during iteration
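For reference, this is roughly how I'm pulling and searching the logs (a minimal sketch, assuming a client connected to the same scheduler; get_scheduler_logs() and get_worker_logs() are standard distributed Client calls, and I stringify each entry because the exact record format can differ between versions):

from distributed import Client

# Assumes the cluster from the logs above is still running; the scheduler
# address is deployment-specific.
client = Client("tls://dask-258014f969ff4b288a395f5e84f2354b.prod:8786")

# Key fragment taken from the traceback above.
key_fragment = "broadcast_to-concatenate-where-getitem-4916603adba736a1cf7a63ef1786c8d8"

# Search the master (scheduler) logs first ...
for entry in client.get_scheduler_logs():
    if key_fragment in str(entry):
        print("scheduler:", entry)

# ... then track the key down to specific workers.
for worker_addr, entries in client.get_worker_logs().items():
    for entry in entries:
        if key_fragment in str(entry):
            print(worker_addr, entry)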
The general situation is that I've stacked a large, multi-tile dataset (25 time steps, 5 bands, 60000 x 40000 pixels), roughly 2.5 TB in total. I've tried cluster sizes ranging from 45 to 300 cores.
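For context, the array is shaped roughly like this (a sketch with made-up data and an illustrative chunking, not my actual pipeline):

import dask.array as da

# 25 time steps x 5 bands x 60000 x 40000 pixels of float64 is ~2.4 TB.
# The chunk shape below is only illustrative.
tiles = [
    da.random.random((5, 60000, 40000), chunks=(1, 4096, 4096))
    for _ in range(25)
]
stack = da.stack(tiles)                 # shape (25, 5, 60000, 40000)
print(stack.nbytes / 1e12, "TB")        # ~2.4

# A reduction across the stack is the kind of computation that fails:
# result = stack.mean(axis=0).compute()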
Can you give me some general guidance on this sort of error? I can't reproduce it reliably, so it's hard to provide a minimal example. Does anyone have guidance on what is happening and how I could explore possible causes and solutions?