Retry get_data_from_worker when worker is busy in gather_from_workers #5692
Conversation
Can one of the admins verify this patch?
fjetter left a comment:
Thanks for your PR @aiudirog
I would prefer that we not mix the connection retry mechanism with busy handling. Busy is perfectly fine and can basically happen an unlimited number of times. It is also not an error case, so it doesn't feel right for it to be coupled to the retry counter, which is off by default; see
distributed/distributed/distributed.yaml, lines 178 to 182 in 4e2c1b3:

```yaml
retry:        # some operations (such as gathering data) are subject to re-tries with the below parameters
  count: 0    # the maximum retry attempts. 0 disables re-trying.
  delay:
    min: 1s   # the first non-zero delay between re-tries
    max: 20s  # the maximum delay between re-tries
```
Instead, I would suggest modifying the line with `response.update(r["data"])` to check for busyness instead of simply using the `data` key. All busy keys can be rescheduled after a short delay (maybe by recursion?).

Would you be interested in looking into this?
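A minimal sketch of that suggestion, assuming simplified signatures (the real `gather_from_workers` takes more arguments, balances requests across candidate workers, and handles missing keys and failed workers; the `get_data_from_worker` import path may vary by version):

```python
import asyncio

from distributed.worker import get_data_from_worker  # location may vary by version


async def gather_from_workers(who_has, rpc, delay=0.1):
    # Hedged sketch: ``who_has`` maps key -> list of candidate worker
    # addresses. Busy responses are not treated as errors; their keys
    # are simply rescheduled after a short pause, here via recursion.
    data, busy = {}, {}
    for key, workers in who_has.items():
        r = await get_data_from_worker(rpc, [key], workers[0])
        if r["status"] == "busy":
            busy[key] = workers              # not an error: try again later
        else:
            data.update(r["data"])
    if busy:
        await asyncio.sleep(delay)           # short delay before the next round
        data.update(await gather_from_workers(busy, rpc, delay=delay))
    return data
```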
That's fair, I'll look into it. Is there a stop case for a perpetually busy worker?
Not directly, but see distributed/distributed/worker.py, lines 1718 to 1731 in 9a66b71. This will eventually free up once the cluster reaches equilibrium. In the other place where we're handling this, we use an exponential backoff to manage it; see distributed/distributed/worker.py, lines 3085 to 3092 in 9a66b71.
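For illustration, a hedged sketch of that kind of exponential backoff (the constants here are made up; the real worker code tracks a per-worker busy counter rather than taking the attempt number as an argument):

```python
import asyncio
import random


async def busy_backoff(attempt, base=0.1, factor=1.5, cap=20.0):
    # Geometric backoff with a cap and a little jitter, so a worker
    # that keeps reporting busy is polled less and less aggressively
    # instead of being hammered in a tight loop.
    delay = min(base * factor ** attempt, cap)
    await asyncio.sleep(delay * random.uniform(0.8, 1.2))
```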
Thanks for the info, I'll see about implementing that and also maybe abstracting it into its own retry function.
Thanks for the info, I'll open a new PR when I have a better solution.
This is an alternative solution to #5546 which applies the standard retry logic when a worker returns `status: busy` inside `utils_comm.gather_from_workers()`, instead of attempting to force a retry inside of `utils_comm.get_data_from_worker()`. It passes the tests that the other PR fails by not raising an error in the other usages of `get_data_from_worker()` where `status: busy` is already handled properly.
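Roughly, the idea can be sketched as below. This is a hedged illustration, not the PR's actual diff: `retry_operation` is the existing helper in `distributed.utils_comm` that honors the `distributed.comm.retry.*` config, while `BusyWorker` and `get_data_or_retry` are hypothetical names introduced here, and the `get_data_from_worker` import path may vary by version.

```python
from distributed.utils_comm import retry_operation
from distributed.worker import get_data_from_worker  # location may vary by version


class BusyWorker(OSError):
    # Hypothetical marker exception: retry_operation retries on OSError
    # by default, so surfacing "status: busy" as one lets the standard
    # distributed.comm.retry.* settings govern the retries.
    pass


async def get_data_or_retry(rpc, keys, worker):
    async def _get():
        r = await get_data_from_worker(rpc, keys, worker)
        if r["status"] == "busy":
            raise BusyWorker(f"{worker} is busy")
        return r

    return await retry_operation(_get, operation="get_data_from_worker")
```

Note that this approach does couple busy handling to the retry counter, which is off by default, and that coupling is exactly what the maintainer pushed back on above.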