
PC - 821 Replace existing probes with HealthManager-backed endpoints #276

Merged

mutantkeyboard merged 6 commits into main from health_endpoints_grpc on Feb 18, 2026

Conversation

@mutantkeyboard (Contributor) commented Feb 16, 2026

Summary by Gitar

  • Custom health server:
    • HTTP server at :8081 with /healthz and /readyz endpoints returning JSON with component status details
  • Liveness suppression:
    • Grace period mechanism prevents pod termination during planned collector restarts (up to 5 minutes)
  • Thread-safe atomic operations:
    • CheckLiveness() and CheckReadiness() methods prevent TOCTOU races by building report and evaluating health under single lock
  • Architecture improvements:
    • Health logic decoupled from HTTP transport (removed *http.Request parameters from check methods)
  • Production-ready implementation:
    • 21 unit tests, graceful shutdown with 5s timeout, proper error handling with lint directives

This will update automatically on new commits.
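For orientation, here is a minimal sketch of what an HTTP health server with this shape could look like. The `HealthResponse` fields and handler wiring are illustrative assumptions based on the summary above, not the repo's exact code; the real implementation lives in internal/health/server.go and delegates to the HealthManager's atomic check methods discussed later in this thread.

```go
// Minimal sketch: HTTP server on :8081 serving /healthz and /readyz as JSON.
package main

import (
	"encoding/json"
	"net/http"
)

// HealthResponse mirrors the response shape described in the summary:
// an overall status plus per-component details. Field names are assumptions.
type HealthResponse struct {
	Status     string            `json:"status"`
	Error      string            `json:"error,omitempty"`
	Components map[string]string `json:"components,omitempty"`
}

// writeHealth encodes a healthy (200) or unhealthy (503) JSON response.
func writeHealth(w http.ResponseWriter, components map[string]string, err error) {
	resp := HealthResponse{Status: "healthy", Components: components}
	code := http.StatusOK
	if err != nil {
		resp.Status = "unhealthy"
		resp.Error = err.Error()
		code = http.StatusServiceUnavailable // 503, as seen in the probe events below
	}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(code)
	_ = json.NewEncoder(w).Encode(resp)
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		// The real server would delegate to HealthManager.CheckLiveness().
		writeHealth(w, map[string]string{"collector_manager": "healthy"}, nil)
	})
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, _ *http.Request) {
		// The real server would delegate to HealthManager.CheckReadiness().
		writeHealth(w, map[string]string{"collector_manager": "healthy"}, nil)
	})
	_ = (&http.Server{Addr: ":8081", Handler: mux}).ListenAndServe()
}
```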


📚 Description of Changes

Provide an overview of your changes and why they're needed. Link to any related issues (e.g., "Fixes #123"). If your PR fixes a bug, resolves a feature request, or updates documentation, please explain how.

  • What Changed:
    (Describe the modifications, additions, or removals.)

  • Why This Change:
    (Explain the problem this PR addresses or the improvement it provides.)

  • Affected Components:
    (Which components does this change affect? Mark an x for all that apply.)

  • Compose

  • K8s

  • Other (please specify)

❓ Motivation and Context

Why is this change required? What problem does it solve?

  • Context:
    (Provide background information or link to related discussions/issues.)

  • Relevant Tasks/Issues:
    (e.g., Fixes: #GitHub Issue)

🔍 Types of Changes

Indicate which type of changes your code introduces (check all that apply):

  • BUGFIX: Non-breaking fix for an issue.
  • NEW FEATURE: Non-breaking addition of functionality.
  • BREAKING CHANGE: Fix or feature that causes existing functionality to not work as expected.
  • ENHANCEMENT: Improvement to existing functionality.
  • CHORE: Changes that do not affect production (e.g., documentation, build tooling, CI).

🔬 QA / Verification Steps

Describe the steps a reviewer should take to verify your changes:

  1. (Step one: e.g., "Run make test to verify all tests pass.")
  2. (Step two: e.g., "Deploy to a Kind cluster with make create-kind && make deploy.")
  3. (Additional steps as needed.)

✅ Global Checklist

Please check all boxes that apply:

  • I have read and followed the CONTRIBUTING guidelines.
  • My code follows the code style of this project.
  • I have updated the documentation as needed.
  • I have added tests that cover my changes.
  • All new and existing tests have passed locally.
  • I have run this code in a local environment to verify functionality.
  • I have considered the security implications of this change.

Comment thread: cmd/main.go
Comment thread: cmd/main.go (outdated)
@Tzvonimir (Contributor) left a comment:


Generally this looks OK; I would just make sure that this will not cause constant backoffs if, for example, one of the collectors is not working, or similar issues. Also, please wait to merge until the new release, which should be done this evening.

fix(health): add liveness suppression during planned collector restarts

Addresses reviewer concern about constant backoffs when collectors are
restarting. StopAll() sets collector_manager to Unhealthy, which fails
the liveness probe and triggers pod kills. This adds a grace period
mechanism: planned restarts suppress liveness checks for up to 5 minutes,
and StartAll() clears the suppression immediately. Unplanned failures
(cleanupOnFailure) intentionally do NOT suppress, so the pod still
restarts on truly fatal states.
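A rough sketch of how such a grace-period mechanism can be structured is below. ClearLivenessSuppression appears in the PR's file summary; the field name and the SuppressLivenessFor setter are assumptions for illustration, and only the 5-minute cap and the StartAll() behavior come from the commit message.

```go
package health

import (
	"sync"
	"time"
)

// HealthManager sketch: only the suppression-related state is shown.
type HealthManager struct {
	mu                    sync.RWMutex
	suppressLivenessUntil time.Time
}

// SuppressLivenessFor is called before a planned collector restart so the
// liveness probe keeps passing while collectors are down, capped at the
// 5-minute grace period described above.
func (m *HealthManager) SuppressLivenessFor(d time.Duration) {
	const maxGrace = 5 * time.Minute
	if d > maxGrace {
		d = maxGrace
	}
	m.mu.Lock()
	m.suppressLivenessUntil = time.Now().Add(d)
	m.mu.Unlock()
}

// ClearLivenessSuppression is what StartAll() calls once collectors are
// back up, so real failures fail the probe again immediately.
func (m *HealthManager) ClearLivenessSuppression() {
	m.mu.Lock()
	m.suppressLivenessUntil = time.Time{}
	m.mu.Unlock()
}

// livenessSuppressed reports whether the grace period is still active.
func (m *HealthManager) livenessSuppressed() bool {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return time.Now().Before(m.suppressLivenessUntil)
}
```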
Comment thread: internal/health/server.go (outdated)
Comment thread: internal/health/manager.go
Comment thread: internal/health/server.go (outdated)
fix(health): address code review findings

- TOCTOU fix: add CheckLiveness/CheckReadiness that atomically build
  report and evaluate check under a single lock, replacing separate
  BuildReport + LivenessCheck calls in HTTP handlers
- Remove unused *http.Request param from LivenessCheck/ReadinessCheck
  to decouple health logic from net/http transport
- Fix JSON tag: rename Error field from "message" to "error" to avoid
  confusion with ComponentResponse.Message
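A minimal sketch of the resulting single-lock pattern follows, with illustrative type names; note the signature also takes no *http.Request, matching the decoupling fix above.

```go
package health

import (
	"fmt"
	"sync"
)

type Status string

const StatusUnhealthy Status = "unhealthy"

// Report is a snapshot of per-component statuses.
type Report struct {
	Components map[string]Status
}

type HealthManager struct {
	mu         sync.RWMutex
	components map[string]Status
}

// CheckLiveness builds the report and evaluates it under a single lock,
// so the returned component statuses always agree with the returned
// error: the TOCTOU fix described above.
func (m *HealthManager) CheckLiveness() (Report, error) {
	m.mu.RLock()
	defer m.mu.RUnlock()

	report := Report{Components: make(map[string]Status, len(m.components))}
	for name, st := range m.components {
		report.Components[name] = st
	}
	for name, st := range report.Components {
		if st == StatusUnhealthy {
			return report, fmt.Errorf("component %q is unhealthy", name)
		}
	}
	return report, nil
}
```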
@mutantkeyboard enabled auto-merge (squash) on February 18, 2026 at 12:53
@gitar-bot (Bot) commented Feb 18, 2026

🔍 CI failure analysis for 77b7da8: Integration test failed due to a pre-existing logger bug (panic: send on closed channel at logger.go:106). Same issue as previous test failures, completely unrelated to health endpoint changes.

Issue

The "Test Metrics Server Lifecycle on K8s v1.32.3" integration test failed with a panic in the logger component during application execution.

Root Cause

The application crashed with panic: send on closed channel at internal/logger/logger.go:106. This is the same pre-existing concurrency bug that has appeared in multiple test runs. The logger attempts to send on a channel that has already been closed.

Details

Panic Stack Trace

```
panic: send on closed channel

goroutine 243 [running]:
github.com/devzero-inc/zxporter/internal/logger.(*logger).Report(...)
	/workspace/internal/logger/logger.go:106 +0x2fd
github.com/devzero-inc/zxporter/internal/controller.(*CollectionPolicyReconciler).initializeCollectors(...)
	/workspace/internal/controller/collectionpolicy_controller.go:1723 +0x548
```

Sequence of Events

  1. Application starts successfully
  2. Prometheus becomes available (statusCode: 200)
  3. Controller begins periodic reconciliation
  4. During initializeCollectors() execution, the logger attempts to report telemetry
  5. Logger tries to send on a closed channel at line 106
  6. Panic occurs, causing container crash

Why This Is Unrelated to PR Changes

This PR only modifies health endpoint functionality:

  • .gitignore - Added .DS_Store
  • cmd/main.go - Added custom health server startup/shutdown
  • internal/health/manager.go - Added health check methods with liveness suppression
  • internal/health/server.go - New health HTTP server
  • internal/health/*_test.go - Tests for health components
  • internal/collector/manager.go - Added call to ClearLivenessSuppression()
  • internal/controller/collectionpolicy_controller.go - Added liveness suppression before restarts

The panic occurs in internal/logger/logger.go:106, which is not touched by this PR. The logger has a concurrency bug where it attempts to send on a closed channel.

Previous Occurrences

This is a recurring issue that has appeared in multiple CI runs:

  1. Job 63734268267 (earlier commit): a related panic, close of closed channel, at logger.go:153 during shutdown
  2. Current Job 64001912665 (commit c2cd9ac): Panic send on closed channel at logger.go:106 during operation

Both are manifestations of the same underlying logger concurrency bug: improper channel lifecycle management.

Health Endpoints Working Correctly

The Kubernetes events show the health probes are functioning as expected:

```
6m15s  Warning  Unhealthy  pod/devzero-zxporter-controller-manager-5964ff4c5b-h68jt
       Readiness probe failed: Get "http://10.244.0.8:8081/readyz": dial tcp 10.244.0.8:8081: connect: connection refused
82s    Warning  Unhealthy  pod/devzero-zxporter-controller-manager-5964ff4c5b-h68jt
       Readiness probe failed: HTTP probe failed with statuscode: 503
```
  • ✅ Connection refused before server starts (expected)
  • ✅ 503 status when components not ready (expected)
  • ✅ Health server on port 8081 is working correctly

The container restarts are due to the logger panic, not the health system.

Root Cause Analysis

The logger bug appears to have two failure modes:

  1. During shutdown (logger.go:153): Attempting to close an already-closed channel
  2. During operation (logger.go:106): Attempting to send on a closed channel

Both indicate that the logger's channel lifecycle is not properly synchronized with shutdown/restart operations. The bug is triggered when:

  • The application undergoes multiple restarts (visible in LeaderElection events)
  • Logger shutdown occurs while telemetry reporting is active
  • Multiple goroutines access the logger channel concurrently
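The PR does not touch the logger, but for context, one common mitigation for this class of bug is sketched below: senders never close the data channel; shutdown is signalled on a separate done channel, and sends race against it in a select. This is a generic pattern, not the repo's actual fix.

```go
package main

import "sync"

// reporter sketches the pattern: the data channel is never closed by
// senders, so "send on closed channel" cannot occur; shutdown is
// signalled via done instead.
type reporter struct {
	ch        chan string
	done      chan struct{}
	closeOnce sync.Once
}

// Report never panics: once done is closed, messages are dropped rather
// than sent on a channel that might have been closed.
func (r *reporter) Report(msg string) {
	select {
	case <-r.done:
		// shutting down: drop the telemetry instead of panicking
	case r.ch <- msg:
	}
}

// Close is idempotent, which also avoids the "close of closed channel"
// failure mode seen at logger.go:153.
func (r *reporter) Close() {
	r.closeOnce.Do(func() { close(r.done) })
}

func main() {
	r := &reporter{ch: make(chan string, 8), done: make(chan struct{})}
	r.Report("hello")
	r.Close()
	r.Close()         // safe: idempotent
	r.Report("world") // safe: no panic; may be dropped after shutdown
}
```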
Code Review: ✅ Approved (5 of 5 findings resolved)

All three previous findings (TOCTOU race, unused HTTP parameter, JSON field naming) have been properly resolved with atomic check methods, decoupled signatures, and correct field tags. The new liveness suppression mechanism is well-designed with proper safety nets. Clean implementation overall.

✅ 5 resolved
Bug: TOCTOU: report and check are not atomic in handlers

📄 internal/health/server.go:65 📄 internal/health/server.go:84
In both healthzHandler and readyzHandler, BuildReport() and LivenessCheck()/ReadinessCheck() are called as separate operations, each acquiring and releasing the RWMutex independently. A concurrent status update between these two calls could result in the response showing component statuses that don't match the health/readiness decision.

For example: BuildReport() sees collector_manager as Healthy, then another goroutine updates it to Unhealthy, then LivenessCheck() returns an error — but the response components still show "healthy" for collector_manager.

This is a minor issue since health probes are polled frequently and transient inconsistency is unlikely to cause real harm, but it's worth noting. A possible fix would be to have the check methods accept the report, or to combine report building and checking into a single lock acquisition.

Quality: Unused *http.Request parameter couples health to net/http

📄 internal/health/manager.go:115 📄 internal/health/manager.go:128
Both LivenessCheck and ReadinessCheck accept *http.Request but never use it (the parameter is explicitly ignored with _). This unnecessarily imports net/http into manager.go and couples the health check logic to the HTTP transport layer.

The health manager's check methods should be transport-agnostic — just returning an error based on component state. The HTTP handler layer already handles the request and can pass relevant data if needed.

Removing the *http.Request parameter also simplifies testing (no need to pass nil in tests).

Quality: JSON field name "message" used for error field Error

📄 internal/health/server.go:14
In HealthResponse, the struct field is named Error (a string) but its JSON tag is "message". Meanwhile, in ComponentResponse, there's a field named Message also tagged "message". This is confusing for API consumers:

```go
type HealthResponse struct {
    Error      string  `json:"message,omitempty"`  // this is the error
    ...
}
type ComponentResponse struct {
    Message  string  `json:"message,omitempty"`    // this is the message
    ...
}
```

When the top-level status is "unhealthy", the message field contains an error description. Consider either renaming the Go field to Message for consistency, or using "error" as the JSON tag to clearly distinguish error messages from informational messages.
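Per the fix(health): address code review findings commit above, the resolution took the second option; after the change the tag would read roughly:

```go
type HealthResponse struct {
    Error      string  `json:"error,omitempty"`  // was "message"; now distinct from ComponentResponse.Message
    ...
}
```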

Bug: Health server is never gracefully shut down

📄 cmd/main.go:166
The HealthServer has a Stop(ctx) method for graceful shutdown, but it's never called in cmd/main.go. When mgr.Start() returns (on signal or error), the process exits without shutting down the health server. This means in-flight health check requests could be abruptly terminated, and the TCP listener won't be cleanly released.

Since mgr.Start() blocks until the manager is stopped, the health server should be shut down after mgr.Start() returns (or via a signal handler / deferred call). A simple approach is to defer the shutdown right after starting it.
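A sketch of that approach is below, using the stock http.Server as a stand-in for the PR's HealthServer (which, per this finding, exposes a Stop(ctx) method); the 5-second timeout matches the graceful shutdown noted in the summary, and the rest of the wiring is an assumption.

```go
package main

import (
	"context"
	"errors"
	"log"
	"net/http"
	"time"
)

func run() error {
	// Stands in for the PR's HealthServer; the real type exposes Stop(ctx).
	healthServer := &http.Server{Addr: ":8081"}

	go func() {
		if err := healthServer.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
			log.Printf("health server failed: %v", err)
		}
	}()

	// Deferred graceful shutdown with the 5s timeout mentioned in the
	// summary; it runs after the blocking manager Start returns.
	defer func() {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		if err := healthServer.Shutdown(ctx); err != nil {
			log.Printf("health server shutdown failed: %v", err)
		}
	}()

	// mgr.Start(ctrl.SetupSignalHandler()) would block here in cmd/main.go.
	time.Sleep(100 * time.Millisecond)
	return nil
}

func main() {
	// Wrapping the body in run() matters: os.Exit skips defers, so the
	// shutdown must live in a function that returns normally.
	if err := run(); err != nil {
		log.Fatal(err)
	}
}
```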

Bug: Commented-out code placed after blocking mgr.Start() call

📄 cmd/main.go:178
The TODO-commented code (lines 178-186) is placed after mgr.Start(), which is a blocking call. This code is unreachable during normal operation; it only executes after the manager has already stopped. If someone later uncomments this code to "revert" to the old behavior, the health checks would never actually be registered, because the manager would have already been running (and stopped) by that point.

The original code (before this PR) had these health check registrations before mgr.Start(), which was correct. The commented-out code should either be removed entirely or, if kept as reference, moved above mgr.Start() with a clear note that it must be placed before the Start call.


@mutantkeyboard merged commit 61a3802 into main on Feb 18, 2026
25 of 26 checks passed
@mutantkeyboard deleted the health_endpoints_grpc branch on February 18, 2026 at 13:03
Parthiba-Hazra pushed a commit that referenced this pull request May 5, 2026
…276)

* Add healthz and readyz probes as well as server

* feat(health): add graceful shutdown and remove dead code from health server wiring

* Golint fixes

* fix(health): add liveness suppression during planned collector restarts

Addresses reviewer concern about constant backoffs when collectors are
restarting. StopAll() sets collector_manager to Unhealthy, which fails
the liveness probe and triggers pod kills. This adds a grace period
mechanism: planned restarts suppress liveness checks for up to 5 minutes,
and StartAll() clears the suppression immediately. Unplanned failures
(cleanupOnFailure) intentionally do NOT suppress, so the pod still
restarts on truly fatal states.

* fix(health): address code review findings

- TOCTOU fix: add CheckLiveness/CheckReadiness that atomically build
  report and evaluate check under a single lock, replacing separate
  BuildReport + LivenessCheck calls in HTTP handlers
- Remove unused *http.Request param from LivenessCheck/ReadinessCheck
  to decouple health logic from net/http transport
- Fix JSON tag: rename Error field from "message" to "error" to avoid
  confusion with ComponentResponse.Message

---------

Co-authored-by: Antonio Nesic <antonio.nesic@devzero.io>
