
Conversation

@aiudirog

This is an alternative solution to #5546 which applies the standard retry logic when a worker returns status: busy inside utils_comm.gather_from_workers(), instead of attempting to force a retry inside utils_comm.get_data_from_worker(). It passes the tests that the other PR fails because it does not raise an error in the other usages of get_data_from_worker() where status: busy is already handled properly.
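
The diff itself is not shown here; as a rough, hypothetical sketch of the idea (all names below are illustrative, not the actual dask/distributed API), treating a busy reply as a retryable condition inside the gathering loop might look like:

import asyncio

async def gather_with_retry(get_data, worker, keys, retries=3, delay=0.1):
    """Re-request keys from a worker that reports itself as busy."""
    for attempt in range(retries + 1):
        response = await get_data(worker, keys)
        if response.get("status") != "busy":
            return response
        # Treat "busy" like a transient failure and wait before retrying
        await asyncio.sleep(delay * (attempt + 1))
    raise OSError(f"Worker {worker} still busy after {retries} retries")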

@GPUtester
Collaborator

Can one of the admins verify this patch?

@fjetter
Member

Thanks for your PR @aiudirog

I would prefer that we not mix the connection retry mechanism with busy handling. Busy is perfectly fine and can basically happen an unlimited number of times. It is also not an error case, so it doesn't feel right for it to be coupled to the retry counter, which is off by default; see

retry:      # some operations (such as gathering data) are subject to re-tries with the below parameters
  count: 0  # the maximum retry attempts. 0 disables re-trying.
  delay:
    min: 1s   # the first non-zero delay between re-tries
    max: 20s  # the maximum delay between re-tries
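
For reference, these settings live under the distributed.comm.retry key and can be overridden at runtime with dask.config; a small usage sketch, assuming that configuration path:

import dask

# Allow up to 3 retries for retry-aware comm operations such as gathering data
dask.config.set({"distributed.comm.retry.count": 3})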

Instead, I would suggest modifying the line with response.update(r["data"]) to check for a busy status instead of simply using the key data. All busy keys can be rescheduled after a short delay (maybe by recursion?)

Would you be interested in looking into this?
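
A minimal sketch of that suggestion, using hypothetical names (the real logic lives in utils_comm.gather_from_workers, whose signature differs):

import asyncio

async def gather_busy_aware(workers_to_keys, get_data, delay=0.1):
    """Collect data, deferring keys from workers that answer with status: busy."""
    data, busy = {}, {}
    for worker, keys in workers_to_keys.items():
        r = await get_data(worker, keys)
        if r.get("status") == "busy":
            busy[worker] = keys  # do not treat the response as data
        else:
            data.update(r["data"])
    if busy:
        # Reschedule the busy keys after a short pause, via recursion as suggested above
        await asyncio.sleep(delay)
        data.update(await gather_busy_aware(busy, get_data, delay=min(delay * 1.5, 2.0)))
    return data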

@aiudirog
Copy link
Author

That's fair, I'll look into it. Is there a stop case for a perpetually busy worker?

@fjetter
Member

fjetter commented Jan 25, 2022

Is there a stop case for a perpetually busy worker?

Not directly, but the busy response only accounts for the number of incoming get_data requests. See

if (
    max_connections is not False
    and self.outgoing_current_count >= max_connections
):
    logger.debug(
        "Worker %s has too many open connections to respond to data request "
        "from %s (%d/%d).%s",
        self.address,
        who,
        self.outgoing_current_count,
        max_connections,
        throttle_msg,
    )
    return {"status": "busy"}

This will eventually free up once the cluster reaches equilibrium.

In the other place where we handle this, we use an exponential backoff, see

if not busy:
    self.repetitively_busy = 0
else:
    # Exponential backoff to avoid hammering scheduler/worker
    self.repetitively_busy += 1
    await asyncio.sleep(0.100 * 1.5 ** self.repetitively_busy)
    await self.query_who_has(*to_gather_keys, stimulus_id=stimulus_id)
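
With that formula the pause grows geometrically: roughly 0.15 s after the first consecutive busy response, then about 0.23 s, 0.34 s, 0.51 s, and so on, so a persistently busy peer is polled progressively less often.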

@aiudirog
Author

Thanks for the info, I'll see about implementing that and also maybe abstracting it into its own retry function.
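
No such helper exists in the codebase at this point; a hedged sketch of what that abstraction might look like (all names and defaults are illustrative):

import asyncio

async def retry_while_busy(request, initial_delay=0.1, factor=1.5, max_delay=2.0):
    """Await request() until the response is not {"status": "busy"}."""
    delay = initial_delay
    while True:
        response = await request()
        if response.get("status") != "busy":
            return response
        await asyncio.sleep(delay)
        delay = min(delay * factor, max_delay)  # exponential backoff with a cap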

@aiudirog
Author

Thanks for the info, I'll open a new PR when I have a better solution.

@aiudirog aiudirog closed this Jan 29, 2022

Development

Successfully merging this pull request may close these issues.

Client/Scheduler gather not robust to busy worker
