Fix health check loop closes on probe timeout #80

anuraaga · 2025-05-12T02:51:22Z

When using health checks, I found that most clients would stop functioning because they had no healthy endpoints. The cause is that a probe timeout completely breaks out of the health-checking poll loop, so this change separates the probe's timeout from the context that manages the poll loop.

anuraaga · 2025-05-12T02:52:28Z

health/polling.go

-				if counter >= r.healthyThreshold {
+			func() {
+				ctx, cancel := context.WithTimeout(ctx, r.timeout)
+				defer cancel()


Previously this was directly inside a for loop, which I thought lint would usually catch. Wrapping it in a new func should scope the defer properly, so cancel runs at the end of each iteration.
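
To illustrate the scoping point, here is a minimal sketch (not the code from this PR; `deferInLoop` and `deferPerIteration` are hypothetical names) that records when deferred cleanup runs in each arrangement. A defer placed directly in a loop body only runs when the enclosing function returns, while a defer inside a function literal runs at the end of each iteration.

```go
package main

import "fmt"

// deferInLoop registers one defer per iteration directly in the loop body.
// None of them run until the enclosing function returns, so cleanup piles up
// for as long as the loop keeps going.
func deferInLoop(n int) []string {
	var events []string
	func() {
		for i := 0; i < n; i++ {
			defer func(i int) {
				events = append(events, fmt.Sprintf("cleanup %d", i))
			}(i)
			events = append(events, fmt.Sprintf("work %d", i))
		}
	}()
	return events
}

// deferPerIteration wraps the loop body in a function literal, so each defer
// runs as soon as that iteration's function returns.
func deferPerIteration(n int) []string {
	var events []string
	for i := 0; i < n; i++ {
		func() {
			defer func() {
				events = append(events, fmt.Sprintf("cleanup %d", i))
			}()
			events = append(events, fmt.Sprintf("work %d", i))
		}()
	}
	return events
}

func main() {
	fmt.Println(deferInLoop(2))       // [work 0 work 1 cleanup 1 cleanup 0]
	fmt.Println(deferPerIteration(2)) // [work 0 cleanup 0 work 1 cleanup 1]
}
```

In a poll loop that may run for the lifetime of a client, the first form never releases the per-probe contexts until the loop exits.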

anuraaga · 2025-05-12T02:54:23Z

health/polling.go

+				ctx, cancel := context.WithTimeout(ctx, r.timeout)
+				defer cancel()
+
+				result := r.prober.Probe(ctx, conn)


I didn't change this, which means the timeout is implicit: it relies on Probe honoring the context's timeout (we see such a change in the unit test). To enforce the timeout here, I think we would need to spawn a goroutine for the probe. I wasn't sure whether that's overkill.

Personally, I would move the timeout parameter off the poller and onto the probers themselves, though that does mean a user has to set both interval and timeout meaningfully themselves, so maybe not.

anuraaga · 2025-05-12T02:55:50Z

health/polling.go

+			}()

 			select {
 			case <-ctx.Done():


This return used to trigger whenever the ctx created for the probe was done, which ended the loop. Because defer cancel() effectively accumulated deferred calls until the function containing the loop returned, it didn't run into a "probe-only-once" situation, but a timeout would still cause the loop to break prematurely.

Now, this should only be triggered by closing the poller / http client, not by timeout.

anuraaga · 2025-05-12T02:57:19Z

health/polling_test.go

+	process := checker.New(ctx, connection, tracker)
+	advance := func(response *http.Response) {
+		t.Helper()
+		testClock.Advance(interval)


I couldn't figure out the pattern from the other test, so I changed it to advance the clock before providing the response. This seemed to match the flow better: trigger the health check, then provide the response. In the flow exercised by this test, the other pattern would deadlock, since providing the response before triggering a health check would block.

jhump · 2025-05-12

This looks fine to me. It looks like the other tests use a buffered channel for fakeConn, so they can store a value in the channel without blocking. That would have resolved the deadlock here, too. But this approach also makes sense.
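
As a small illustration of the buffered-channel point (a generic sketch, not the fakeConn used in these tests; `trySend` is a hypothetical helper), a send on an unbuffered channel blocks until a receiver is ready, while a channel with capacity 1 accepts one value up front:

```go
package main

import "fmt"

// trySend reports whether v could be sent on ch without blocking.
func trySend(ch chan int, v int) bool {
	select {
	case ch <- v:
		return true
	default:
		return false
	}
}

func main() {
	unbuffered := make(chan int)
	buffered := make(chan int, 1)

	fmt.Println(trySend(unbuffered, 1)) // false: no receiver is waiting
	fmt.Println(trySend(buffered, 1))   // true: the buffer holds the value
	fmt.Println(trySend(buffered, 2))   // false: the buffer is already full
}
```

With a plain `ch <- v` instead of the select, the first case would block forever in a single goroutine, which is the kind of deadlock described above.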

jhump

LGTM!

Thanks so much for the fix! That is indeed a troubling bug (both the use of the wrong context in the ticker select and the use of defer inside a potentially very long-running for loop).

Fix health check loop closes on probe timeout

2c3e057

anuraaga commented May 12, 2025

jhump approved these changes May 12, 2025

jhump merged commit 3d01798 into bufbuild:main May 12, 2025
5 checks passed

jhump mentioned this pull request May 12, 2025

Fix flaky TestPollingChecker in health package #81

Merged
