-
Notifications
You must be signed in to change notification settings - Fork 4
Fix health check loop closes on probe timeout #80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| if counter >= r.healthyThreshold { | ||
| func() { | ||
| ctx, cancel := context.WithTimeout(ctx, r.timeout) | ||
| defer cancel() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously this was within a for loop which I thought usually lint would catch, but by creating a new func this should be scoped properly now
| ctx, cancel := context.WithTimeout(ctx, r.timeout) | ||
| defer cancel() | ||
|
|
||
| result := r.prober.Probe(ctx, conn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't change this, which means the timeout is implicit, relying on Probe to use the context's timeout (we see such a change in the unit test). To handle timeout here, I think we would need to spawn a goroutine here for the probe - wasn't sure if that's overkill or not.
Personally I would move timeout parameter off of the poller and onto the prober's themselves, though it does mean a user has to set both interval and timeout meaningfully themselves so maybe not
| }() | ||
|
|
||
| select { | ||
| case <-ctx.Done(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This return used to trigger when ctx created for the probe is done, meaning the loop ends. Because defer cancel() was essentially deferring accumulated closures to the end of the loop itself, it didn't run into a "probe-only-once" situation, but a timeout would similarly cause the loop to break prematurely.
Now, this should only be triggered by closing the poller / http client, not by timeout.
| process := checker.New(ctx, connection, tracker) | ||
| advance := func(response *http.Response) { | ||
| t.Helper() | ||
| testClock.Advance(interval) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't figure out the pattern from the other test and changed it to advance before providing the response - this seemed to match the flow better, trigger the health check and provide the response. In the flow exercised by this test, the other pattern would deadlock since providing the response before triggering a health check would block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks fine to me. Looks like other tests are using a buffered channel for fakeConn, so they can store a value in the channel w/out blocking. That would have resolved the deadlock here, too. But this approach also makes sense.
jhump
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Thanks so much for the fix! That is indeed a troubling bug (both the use of the wrong context in the ticker select and the use of defer inside a (potentially very long-running) for loop.
When using health checks, I found that most clients would stop functioning with no healthy endpoints. I found that a poll timeout will completely break out of the health checking poller, so this makes sure to separate the probe's timeout from the one managing the poll loop.