Recover from broken process pool #42

tswayne · 2022-04-15T18:41:38Z

We hit an unfortunate edge case in this library where one broken process corrupts the entire pool. When that happens, the process pool (self._executor) is broken and the worker neither recovers nor terminates. It keeps picking up new jobs and submitting them to the process pool, where each one immediately raises a BrokenExecutor exception. For us, the corrupt worker essentially drained our queue and funneled every job straight to the dead queue.

It does look like there is an older attempt to handle this here; however, it is unreachable because this line catches all exceptions first.
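For anyone unfamiliar with why the older handler never runs, here is a general illustration of except-clause ordering in Python. It is not the library's actual code (the linked lines are not reproduced in this thread), and the function names are made up for the example. BrokenExecutor is a subclass of Exception, so a broad handler placed first always wins.

```python
from concurrent.futures import BrokenExecutor


def classify_shadowed(exc):
    """Broad handler first: the BrokenExecutor branch can never run."""
    try:
        raise exc
    except Exception:
        # BrokenExecutor derives from Exception, so it is caught here.
        return "generic"
    except BrokenExecutor:
        # Unreachable: the clause above already matched.
        return "broken pool"


def classify_ordered(exc):
    """Specific handler first: BrokenExecutor gets its own branch."""
    try:
        raise exc
    except BrokenExecutor:
        return "broken pool"
    except Exception:
        return "generic"
```

Python evaluates except clauses top to bottom and stops at the first match, so the more specific exception type has to come first.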

The fix in this PR, which we've been using for a while now, recovers from this state by throwing the broken process pool away and letting the next tick create a fresh one. This has been working really well for us: broken processes are very rare, and all the jobs initially impacted by the broken process are retried.
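A minimal sketch of the discard-and-recreate idea, for readers who want the shape of it without opening the diff. The PoolRunner class, its executor_factory parameter, and the method names are hypothetical and are not the worker class in this library; only BrokenExecutor and ProcessPoolExecutor come from the standard library.

```python
from concurrent.futures import BrokenExecutor, ProcessPoolExecutor


class PoolRunner:
    """Hypothetical wrapper that lazily owns a process pool."""

    def __init__(self, executor_factory=ProcessPoolExecutor):
        # The factory is injectable so the behaviour can be exercised
        # without actually killing a worker process.
        self._executor_factory = executor_factory
        self._executor = None

    def _get_executor(self):
        # Create the pool on demand; after a failure this builds a new one.
        if self._executor is None:
            self._executor = self._executor_factory()
        return self._executor

    def run(self, fn, *args):
        executor = self._get_executor()
        try:
            return executor.submit(fn, *args).result()
        except BrokenExecutor:
            # The pool is unusable from here on. Drop it so the next call
            # gets a fresh one, then re-raise so the caller can retry the job.
            executor.shutdown(wait=False)
            self._executor = None
            raise
```

Re-raising keeps the failure visible to whatever retry logic sits above, while resetting the reference stops every later job from hitting the same dead pool.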

This PR also includes a small, useful change that lets the proto client accept all the kwargs that the connection accepts and pass them along. We use that client directly in a few places and need to forward some options to the connection.
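The kwargs change follows the usual pass-through pattern, roughly as sketched below. The Connection and Client classes, their parameters, and the timeout option are all hypothetical stand-ins chosen for illustration, not this library's real signatures.

```python
class Connection:
    """Hypothetical connection that accepts arbitrary extra options."""

    def __init__(self, host="localhost", **options):
        self.host = host
        self.options = options


class Client:
    """Hypothetical client that forwards everything to its connection."""

    def __init__(self, *args, **kwargs):
        # The client does not need to know which options exist; whatever
        # the connection accepts can be supplied through the client.
        self.connection = Connection(*args, **kwargs)


client = Client(host="example.internal", timeout=5)
```

With this shape, new connection options become available through the client without any further change to the client itself.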

tswayne · 2022-04-25T14:10:47Z

Hey @cdrx - just want to bump this to make sure it's on your radar.

cdrx · 2022-04-30T08:40:49Z

Good catch, thank you for opening this PR.

cdrx · 2022-04-30T09:48:28Z

This has been published to PyPI, in version 1.0.0.

tswayne added 4 commits March 10, 2022 12:24

Debugging broken pool

50b27eb

fix missing import

526feb8

pass kwargs from connection manager to client

fddded9

remove info logs

9b5a8d1

cdrx merged commit d83bb6a into cdrx:master Apr 30, 2022

JoaoPedroAssis mentioned this pull request Jun 29, 2023

BrokenProcessPool error #53

Open
