Conversation

@anuraaga (Contributor)

When using health checks, I found that most clients would eventually stop functioning because no endpoints were considered healthy. The root cause is that a probe timeout completely breaks out of the health-checking poller, so this change separates the probe's timeout from the context that manages the poll loop.
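
To make the fix concrete, here is a minimal standalone sketch of the intended structure, with hypothetical names rather than the actual httplb types: the probe timeout lives on a per-iteration child context, while the select that keeps the loop alive watches only the poller's own context.

package poller

import (
	"context"
	"time"
)

// pollLoop sketches the fixed shape: each probe gets its own
// timeout-bounded child context, so a slow probe times out without
// tearing down the loop. Only cancelling ctx (closing the poller)
// ends the loop.
func pollLoop(ctx context.Context, interval, timeout time.Duration, probe func(context.Context)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		func() {
			// Child context: its deadline bounds this probe only.
			probeCtx, cancel := context.WithTimeout(ctx, timeout)
			defer cancel() // runs at the end of this iteration, not of the loop
			probe(probeCtx)
		}()
		select {
		case <-ctx.Done(): // poller shutdown, never a probe timeout
			return
		case <-ticker.C:
		}
	}
}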

if counter >= r.healthyThreshold {
	func() {
		ctx, cancel := context.WithTimeout(ctx, r.timeout)
		defer cancel()

@anuraaga (Contributor, Author)

Previously this defer was within a for loop, which I thought lint would usually catch; by creating a new func, the cancel should be scoped properly to each iteration now.
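
For anyone unfamiliar with the pitfall: defer runs when the enclosing function returns, not at the end of each loop iteration, so every iteration's cancel piles up until the whole function exits. A contrived standalone example:

package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	ctx := context.Background()
	for i := 0; i < 3; i++ {
		// Pitfall: these cancels all run when main returns, so each
		// iteration keeps its timeout context alive until the loop ends.
		_, cancel := context.WithTimeout(ctx, time.Second)
		defer cancel()
		fmt.Println("iteration", i)
	}
	// Wrapping the body in func() { ... }() scopes each defer to a
	// single iteration, which is what the change above does.
}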

ctx, cancel := context.WithTimeout(ctx, r.timeout)
defer cancel()

result := r.prober.Probe(ctx, conn)

@anuraaga (Contributor, Author)

I didn't change this, which means the timeout is implicit: it relies on Probe to respect the context's deadline (we see such a change in the unit test). To enforce the timeout here, I think we would need to spawn a goroutine for the probe; I wasn't sure whether that's overkill.

Personally, I would move the timeout parameter off the poller and onto the probers themselves, though that would mean a user has to set both interval and timeout meaningfully on their own, so maybe not.
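
For reference, a rough sketch of the goroutine variant mentioned above, assuming a probe with roughly this shape (hypothetical, not httplb's actual API):

package poller

import (
	"context"
	"time"
)

// probeWithTimeout enforces the timeout even if the probe ignores its
// context, at the cost of a goroutine that may briefly outlive the deadline.
func probeWithTimeout(ctx context.Context, timeout time.Duration, probe func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	done := make(chan error, 1) // buffered so the goroutine never blocks on send
	go func() {
		done <- probe(ctx)
	}()
	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return ctx.Err() // stop waiting; the goroutine will finish into the buffer
	}
}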

}()

select {
case <-ctx.Done():

@anuraaga (Contributor, Author)

This return used to trigger when the ctx created for the probe was done, meaning the loop would end. Because defer cancel() was accumulating closures that only ran when the enclosing function returned, it didn't degrade into a probe-only-once situation, but a probe timeout would still cause the loop to break prematurely.

Now this should only be triggered by closing the poller / HTTP client, not by a timeout.
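
A tiny standalone demonstration of why selecting on the probe's context ended the loop: a context from context.WithTimeout becomes Done when its deadline passes, not only when it is cancelled.

package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	<-ctx.Done()           // fires after ~10ms even though nothing called cancel
	fmt.Println(ctx.Err()) // context deadline exceeded
}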

process := checker.New(ctx, connection, tracker)
advance := func(response *http.Response) {
	t.Helper()
	testClock.Advance(interval)

@anuraaga (Contributor, Author)

I couldn't figure out the pattern from the other test, so I changed it to advance the clock before providing the response. This seemed to match the flow better: trigger the health check, then provide the response. In the flow exercised by this test, the other pattern would deadlock, since providing the response before triggering a health check would block.
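
The ordering matters because of channel semantics; here is the idea in miniature, with an unbuffered channel standing in for the fake connection (hypothetical names, not the real test types):

package main

import "fmt"

func main() {
	responses := make(chan string) // unbuffered, like the fake connection here
	done := make(chan struct{})

	// Stand-in for the health check that only runs once the clock advances.
	startCheck := func() {
		go func() {
			fmt.Println("probe got:", <-responses)
			close(done)
		}()
	}

	// Sending before startCheck() would block forever: no receiver exists yet.
	startCheck()          // advance the clock / trigger the check first...
	responses <- "200 OK" // ...then the waiting probe receives the response
	<-done
}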

Member

This looks fine to me. It looks like the other tests are using a buffered channel for fakeConn, so they can store a value in the channel without blocking. That would have resolved the deadlock here, too. But this approach also makes sense.
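
The buffered-channel alternative mentioned above, in miniature: with capacity 1 the response can be staged before the check runs, since the send completes without a waiting receiver (again with stand-in names):

package main

import "fmt"

func main() {
	responses := make(chan string, 1) // buffered, as in the other tests

	responses <- "200 OK" // staged up front; does not block with capacity 1
	// ...later, the health check triggered by the clock picks it up:
	fmt.Println("probe got:", <-responses)
}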

@jhump (Member) left a comment

LGTM!

Thanks so much for the fix! That is indeed a troubling bug (both the use of the wrong context in the ticker select and the use of defer inside a potentially very long-running for loop).

@jhump merged commit 3d01798 into bufbuild:main on May 12, 2025. 5 checks passed.