Skip to content

Conversation

@tswayne
Copy link
Contributor

@tswayne tswayne commented Apr 15, 2022

We hit a pretty unfortunate edge case in this library where one broken process will corrupt the entire pool. When that happens, the entire process pool (self._executor) is broken and the worker does not recover, but it also does not terminate. The result is that it continues to pick up new jobs and send them to the process pool to execute and they immediately throw a BrokenExecutor exception. For us, this resulted in the corrupt worker essentially draining our queue and funneling them directly to the dead queue.

It does look like there is an older attempt to handle this here, however it is unreachable because this line catches all exceptions first.

The fix in this pr, which we've been using for a while now, recovers from this state by throwing the broken process pool away and letting the next tick re-create a fresh one. This has been working really well, since broken processes are very rare (for us), and all the initial jobs impacted by the broken process are retried.

This PR also includes a small useful change to enable the proto client to accept and pass all the kwargs that the connection accepts along. We use that client directly in a few places and need to forward some options to the connection.

@tswayne
Copy link
Contributor Author

tswayne commented Apr 25, 2022

Hey @cdrx - just want to bump this to make sure it's on your radar.

@cdrx cdrx merged commit d83bb6a into cdrx:master Apr 30, 2022
@cdrx
Copy link
Owner

cdrx commented Apr 30, 2022

Good catch, thank you for opening this PR.

@cdrx
Copy link
Owner

cdrx commented Apr 30, 2022

This has been published to PyPi, in version 1.0.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants