
Conversation

@aiudirog

This is an alternative solution to #5546 which applies the standard retry logic when a worker returns status: busy inside utils_comm.gather_from_workers(), instead of attempting to force a retry inside utils_comm.get_data_from_worker(). It passes the tests that the other PR fails because it does not raise an error in the other usages of get_data_from_worker() where status: busy is already handled properly.
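
The diff itself is not shown here; as a rough, hypothetical sketch of the idea (all names below are illustrative, not the actual dask/distributed API), treating a busy reply as a retryable condition inside the gathering loop might look like:

import asyncio

async def gather_with_retry(get_data, worker, keys, retries=3, delay=0.1):
    """Re-request keys from a worker that reports itself as busy."""
    for attempt in range(retries + 1):
        response = await get_data(worker, keys)
        if response.get("status") != "busy":
            return response
        # Treat "busy" like a transient failure and wait before retrying
        await asyncio.sleep(delay * (attempt + 1))
    raise OSError(f"Worker {worker} still busy after {retries} retries")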

@GPUtester
Collaborator

Can one of the admins verify this patch?

@fjetter
Member

Thanks for your PR @aiudirog

I would prefer that we not mix the connection retry mechanism with busy handling. Busy is perfectly fine and can basically happen an unlimited number of times. It is also not an error case, so it doesn't feel right for it to be coupled to the retry counter, which is off by default; see

retry:      # some operations (such as gathering data) are subject to re-tries with the below parameters
  count: 0  # the maximum retry attempts. 0 disables re-trying.
  delay:
    min: 1s   # the first non-zero delay between re-tries
    max: 20s  # the maximum delay between re-tries
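
For reference, these settings live under the distributed.comm.retry key and can be overridden at runtime with dask.config; a small usage sketch, assuming that configuration path:

import dask

# Allow up to 3 retries for retry-aware comm operations such as gathering data
dask.config.set({"distributed.comm.retry.count": 3})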

Instead, I would suggest modifying the line with response.update(r["data"]) to check for a busy status instead of simply using the key data. All busy keys can be rescheduled after a short delay (maybe by recursion?)

Would you be interested in looking into this?
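
A minimal sketch of that suggestion, using hypothetical names (the real logic lives in utils_comm.gather_from_workers, whose signature differs):

import asyncio

async def gather_busy_aware(workers_to_keys, get_data, delay=0.1):
    """Collect data, deferring keys from workers that answer with status: busy."""
    data, busy = {}, {}
    for worker, keys in workers_to_keys.items():
        r = await get_data(worker, keys)
        if r.get("status") == "busy":
            busy[worker] = keys  # do not treat the response as data
        else:
            data.update(r["data"])
    if busy:
        # Reschedule the busy keys after a short pause, via recursion as suggested above
        await asyncio.sleep(delay)
        data.update(await gather_busy_aware(busy, get_data, delay=min(delay * 1.5, 2.0)))
    return data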

@aiudirog
Copy link
Author

That's fair, I'll look into it. Is there a stop case for a perpetually busy worker?

@fjetter
Member

fjetter commented Jan 25, 2022

Is there a stop case for a perpetually busy worker?

Not directly, but the busy response only accounts for the number of incoming get_data requests. See

if (
    max_connections is not False
    and self.outgoing_current_count >= max_connections
):
    logger.debug(
        "Worker %s has too many open connections to respond to data request "
        "from %s (%d/%d).%s",
        self.address,
        who,
        self.outgoing_current_count,
        max_connections,
        throttle_msg,
    )
    return {"status": "busy"}

This will eventually free up once the cluster reaches equilibrium.

In the other place where we handle this, we use an exponential backoff, see

if not busy:
    self.repetitively_busy = 0
else:
    # Exponential backoff to avoid hammering scheduler/worker
    self.repetitively_busy += 1
    await asyncio.sleep(0.100 * 1.5 ** self.repetitively_busy)
    await self.query_who_has(*to_gather_keys, stimulus_id=stimulus_id)
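
With that formula the pause grows geometrically: roughly 0.15 s after the first consecutive busy response, then about 0.23 s, 0.34 s, 0.51 s, and so on, so a persistently busy peer is polled progressively less often.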

@aiudirog
Author

Thanks for the info, I'll see about implementing that and also maybe abstracting it into its own retry function.
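
No such helper exists in the codebase at this point; a hedged sketch of what that abstraction might look like (all names and defaults are illustrative):

import asyncio

async def retry_while_busy(request, initial_delay=0.1, factor=1.5, max_delay=2.0):
    """Await request() until the response is not {"status": "busy"}."""
    delay = initial_delay
    while True:
        response = await request()
        if response.get("status") != "busy":
            return response
        await asyncio.sleep(delay)
        delay = min(delay * factor, max_delay)  # exponential backoff with a cap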

@aiudirog
Author

Thanks for the info, I'll open a new PR when I have a better solution.

@aiudirog aiudirog closed this Jan 29, 2022

Development

Successfully merging this pull request may close these issues.

Client/Scheduler gather not robust to busy worker
