Couple more stability fixes #6
Merged
https://simplifi.atlassian.net/browse/INT-11129
I know the ticket talks about a race condition, but after poking at it more today I'm not sure a race condition is actually what's causing the problem. A couple of other things seem to affect stability, at least when running the integration tests:
First off, we were trying to restart the WorkerSupervisor's DynamicSupervisor when handling a termination. I don't know that this actually broke anything, but it's messy, so I fixed it.
Added worker_supervisor max_restart and max_seconds options. We were certainly having problems because the defaults for these were too low (see the comments in the code).
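For anyone unfamiliar with the restart limits, here is a rough sketch of how they are typically passed through to a DynamicSupervisor. This is not the code from this PR: the module name `ExampleWorkerSupervisor` and the way the options are read from `opts` are assumptions for illustration only. The fallback values of 3 restarts within 5 seconds are the standard Elixir supervisor defaults, which is the limit that is easy to exceed when many workers terminate close together.

```elixir
defmodule ExampleWorkerSupervisor do
  # Hypothetical module, for illustration only; not the WorkerSupervisor in this repo.
  use DynamicSupervisor

  def start_link(opts) do
    DynamicSupervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(opts) do
    DynamicSupervisor.init(
      strategy: :one_for_one,
      # If more than max_restarts child restarts happen within max_seconds,
      # the supervisor itself shuts down. Elixir defaults are 3 and 5.
      max_restarts: Keyword.get(opts, :max_restarts, 3),
      max_seconds: Keyword.get(opts, :max_seconds, 5)
    )
  end
end
```

With a sketch like this, a caller could loosen the limit with something like `ExampleWorkerSupervisor.start_link(max_restarts: 10, max_seconds: 60)`; those numbers are placeholders, not the values chosen in this PR.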