
PC - 821 Replace existing probes with HealthManager-backed endpoints #276

Merged

mutantkeyboard merged 6 commits into main from health_endpoints_grpc on Feb 18, 2026

Conversation

@mutantkeyboard (Contributor) commented Feb 16, 2026

Summary by Gitar

  • Custom health server:
    • HTTP server at :8081 with /healthz and /readyz endpoints returning JSON with component status details
  • Liveness suppression:
    • Grace period mechanism prevents pod termination during planned collector restarts (up to 5 minutes)
  • Thread-safe atomic operations:
    • CheckLiveness() and CheckReadiness() methods prevent TOCTOU races by building report and evaluating health under single lock
  • Architecture improvements:
    • Health logic decoupled from HTTP transport (removed *http.Request parameters from check methods)
  • Production-ready implementation:
    • 21 unit tests, graceful shutdown with 5s timeout, proper error handling with lint directives

This will update automatically on new commits.
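For orientation, here is a minimal sketch of what an HTTP health server with this shape could look like. The `HealthResponse` fields and handler wiring are illustrative assumptions based on the summary above, not the repo's exact code; the real implementation lives in internal/health/server.go and delegates to the HealthManager's atomic check methods discussed later in this thread.

```go
// Minimal sketch: HTTP server on :8081 serving /healthz and /readyz as JSON.
package main

import (
	"encoding/json"
	"net/http"
)

// HealthResponse mirrors the response shape described in the summary:
// an overall status plus per-component details. Field names are assumptions.
type HealthResponse struct {
	Status     string            `json:"status"`
	Error      string            `json:"error,omitempty"`
	Components map[string]string `json:"components,omitempty"`
}

// writeHealth encodes a healthy (200) or unhealthy (503) JSON response.
func writeHealth(w http.ResponseWriter, components map[string]string, err error) {
	resp := HealthResponse{Status: "healthy", Components: components}
	code := http.StatusOK
	if err != nil {
		resp.Status = "unhealthy"
		resp.Error = err.Error()
		code = http.StatusServiceUnavailable // 503, as seen in the probe events below
	}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(code)
	_ = json.NewEncoder(w).Encode(resp)
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		// The real server would delegate to HealthManager.CheckLiveness().
		writeHealth(w, map[string]string{"collector_manager": "healthy"}, nil)
	})
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, _ *http.Request) {
		// The real server would delegate to HealthManager.CheckReadiness().
		writeHealth(w, map[string]string{"collector_manager": "healthy"}, nil)
	})
	_ = (&http.Server{Addr: ":8081", Handler: mux}).ListenAndServe()
}
```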


📚 Description of Changes

Provide an overview of your changes and why they're needed. Link to any related issues (e.g., "Fixes #123"). If your PR fixes a bug, resolves a feature request, or updates documentation, please explain how.

  • What Changed:
    (Describe the modifications, additions, or removals.)

  • Why This Change:
    (Explain the problem this PR addresses or the improvement it provides.)

  • Affected Components:
    (Which components does this change affect? Mark an x for all that apply.)

  • Compose

  • K8s

  • Other (please specify)

❓ Motivation and Context

Why is this change required? What problem does it solve?

  • Context:
    (Provide background information or link to related discussions/issues.)

  • Relevant Tasks/Issues:
    (e.g., Fixes: #GitHub Issue)

🔍 Types of Changes

Indicate which type of changes your code introduces (check all that apply):

  • BUGFIX: Non-breaking fix for an issue.
  • NEW FEATURE: Non-breaking addition of functionality.
  • BREAKING CHANGE: Fix or feature that causes existing functionality to not work as expected.
  • ENHANCEMENT: Improvement to existing functionality.
  • CHORE: Changes that do not affect production (e.g., documentation, build tooling, CI).

🔬 QA / Verification Steps

Describe the steps a reviewer should take to verify your changes:

  1. (Step one: e.g., "Run make test to verify all tests pass.")
  2. (Step two: e.g., "Deploy to a Kind cluster with make create-kind && make deploy.")
  3. (Additional steps as needed.)

✅ Global Checklist

Please check all boxes that apply:

  • I have read and followed the CONTRIBUTING guidelines.
  • My code follows the code style of this project.
  • I have updated the documentation as needed.
  • I have added tests that cover my changes.
  • All new and existing tests have passed locally.
  • I have run this code in a local environment to verify functionality.
  • I have considered the security implications of this change.

Comment thread: cmd/main.go
Comment thread: cmd/main.go (outdated)
@Tzvonimir (Contributor) left a comment:


Generally this looks OK; I would just make sure that this will not cause constant backoffs if, for example, one of the collectors is not working, or similar issues. Also, please wait to merge until the new release, which should be done this evening.

fix(health): add liveness suppression during planned collector restarts

Addresses reviewer concern about constant backoffs when collectors are
restarting. StopAll() sets collector_manager to Unhealthy, which fails
the liveness probe and triggers pod kills. This adds a grace period
mechanism: planned restarts suppress liveness checks for up to 5 minutes,
and StartAll() clears the suppression immediately. Unplanned failures
(cleanupOnFailure) intentionally do NOT suppress, so the pod still
restarts on truly fatal states.
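A rough sketch of how such a grace-period mechanism can be structured is below. ClearLivenessSuppression appears in the PR's file summary; the field name and the SuppressLivenessFor setter are assumptions for illustration, and only the 5-minute cap and the StartAll() behavior come from the commit message.

```go
package health

import (
	"sync"
	"time"
)

// HealthManager sketch: only the suppression-related state is shown.
type HealthManager struct {
	mu                    sync.RWMutex
	suppressLivenessUntil time.Time
}

// SuppressLivenessFor is called before a planned collector restart so the
// liveness probe keeps passing while collectors are down, capped at the
// 5-minute grace period described above.
func (m *HealthManager) SuppressLivenessFor(d time.Duration) {
	const maxGrace = 5 * time.Minute
	if d > maxGrace {
		d = maxGrace
	}
	m.mu.Lock()
	m.suppressLivenessUntil = time.Now().Add(d)
	m.mu.Unlock()
}

// ClearLivenessSuppression is what StartAll() calls once collectors are
// back up, so real failures fail the probe again immediately.
func (m *HealthManager) ClearLivenessSuppression() {
	m.mu.Lock()
	m.suppressLivenessUntil = time.Time{}
	m.mu.Unlock()
}

// livenessSuppressed reports whether the grace period is still active.
func (m *HealthManager) livenessSuppressed() bool {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return time.Now().Before(m.suppressLivenessUntil)
}
```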
Comment thread: internal/health/server.go (outdated)
Comment thread: internal/health/manager.go
Comment thread: internal/health/server.go (outdated)
fix(health): address code review findings

- TOCTOU fix: add CheckLiveness/CheckReadiness that atomically build
  report and evaluate check under a single lock, replacing separate
  BuildReport + LivenessCheck calls in HTTP handlers
- Remove unused *http.Request param from LivenessCheck/ReadinessCheck
  to decouple health logic from net/http transport
- Fix JSON tag: rename Error field from "message" to "error" to avoid
  confusion with ComponentResponse.Message
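A minimal sketch of the resulting single-lock pattern follows, with illustrative type names; note the signature also takes no *http.Request, matching the decoupling fix above.

```go
package health

import (
	"fmt"
	"sync"
)

type Status string

const StatusUnhealthy Status = "unhealthy"

// Report is a snapshot of per-component statuses.
type Report struct {
	Components map[string]Status
}

type HealthManager struct {
	mu         sync.RWMutex
	components map[string]Status
}

// CheckLiveness builds the report and evaluates it under a single lock,
// so the returned component statuses always agree with the returned
// error: the TOCTOU fix described above.
func (m *HealthManager) CheckLiveness() (Report, error) {
	m.mu.RLock()
	defer m.mu.RUnlock()

	report := Report{Components: make(map[string]Status, len(m.components))}
	for name, st := range m.components {
		report.Components[name] = st
	}
	for name, st := range report.Components {
		if st == StatusUnhealthy {
			return report, fmt.Errorf("component %q is unhealthy", name)
		}
	}
	return report, nil
}
```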
@mutantkeyboard enabled auto-merge (squash) on February 18, 2026 at 12:53
@gitar-bot (Bot) commented Feb 18, 2026

🔍 CI failure analysis for 77b7da8: Integration test failed due to a pre-existing logger bug (panic: send on closed channel at logger.go:106). Same issue as previous test failures, completely unrelated to health endpoint changes.

Issue

The "Test Metrics Server Lifecycle on K8s v1.32.3" integration test failed with a panic in the logger component during application execution.

Root Cause

The application crashed with panic: send on closed channel at internal/logger/logger.go:106. This is the same pre-existing concurrency bug that has appeared in multiple test runs. The logger attempts to send on a channel that has already been closed.

Details

Panic Stack Trace

```
panic: send on closed channel

goroutine 243 [running]:
github.com/devzero-inc/zxporter/internal/logger.(*logger).Report(...)
	/workspace/internal/logger/logger.go:106 +0x2fd
github.com/devzero-inc/zxporter/internal/controller.(*CollectionPolicyReconciler).initializeCollectors(...)
	/workspace/internal/controller/collectionpolicy_controller.go:1723 +0x548
```

Sequence of Events

  1. Application starts successfully
  2. Prometheus becomes available (statusCode: 200)
  3. Controller begins periodic reconciliation
  4. During initializeCollectors() execution, the logger attempts to report telemetry
  5. Logger tries to send on a closed channel at line 106
  6. Panic occurs, causing container crash

Why This Is Unrelated to PR Changes

This PR only modifies health endpoint functionality:

  • .gitignore - Added .DS_Store
  • cmd/main.go - Added custom health server startup/shutdown
  • internal/health/manager.go - Added health check methods with liveness suppression
  • internal/health/server.go - New health HTTP server
  • internal/health/*_test.go - Tests for health components
  • internal/collector/manager.go - Added call to ClearLivenessSuppression()
  • internal/controller/collectionpolicy_controller.go - Added liveness suppression before restarts

The panic occurs in internal/logger/logger.go:106, which is not touched by this PR. The logger has a concurrency bug where it attempts to send on a closed channel.

Previous Occurrences

This is a recurring issue that has appeared in multiple CI runs:

  1. Job 63734268267 (earlier commit): a related panic, close of closed channel, at logger.go:153 during shutdown
  2. Current Job 64001912665 (commit c2cd9ac): Panic send on closed channel at logger.go:106 during operation

Both are manifestations of the same underlying logger concurrency bug: improper channel lifecycle management.

Health Endpoints Working Correctly

The Kubernetes events show the health probes are functioning as expected:

```
6m15s  Warning  Unhealthy  pod/devzero-zxporter-controller-manager-5964ff4c5b-h68jt
       Readiness probe failed: Get "http://10.244.0.8:8081/readyz": dial tcp 10.244.0.8:8081: connect: connection refused
82s    Warning  Unhealthy  pod/devzero-zxporter-controller-manager-5964ff4c5b-h68jt
       Readiness probe failed: HTTP probe failed with statuscode: 503
```
  • ✅ Connection refused before server starts (expected)
  • ✅ 503 status when components not ready (expected)
  • ✅ Health server on port 8081 is working correctly

The container restarts are due to the logger panic, not the health system.

Root Cause Analysis

The logger bug appears to have two failure modes:

  1. During shutdown (logger.go:153): Attempting to close an already-closed channel
  2. During operation (logger.go:106): Attempting to send on a closed channel

Both indicate that the logger's channel lifecycle is not properly synchronized with shutdown/restart operations. The bug is triggered when:

  • The application undergoes multiple restarts (visible in LeaderElection events)
  • Logger shutdown occurs while telemetry reporting is active
  • Multiple goroutines access the logger channel concurrently
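The PR does not touch the logger, but for context, one common mitigation for this class of bug is sketched below: senders never close the data channel; shutdown is signalled on a separate done channel, and sends race against it in a select. This is a generic pattern, not the repo's actual fix.

```go
package main

import "sync"

// reporter sketches the pattern: the data channel is never closed by
// senders, so "send on closed channel" cannot occur; shutdown is
// signalled via done instead.
type reporter struct {
	ch        chan string
	done      chan struct{}
	closeOnce sync.Once
}

// Report never panics: once done is closed, messages are dropped rather
// than sent on a channel that might have been closed.
func (r *reporter) Report(msg string) {
	select {
	case <-r.done:
		// shutting down: drop the telemetry instead of panicking
	case r.ch <- msg:
	}
}

// Close is idempotent, which also avoids the "close of closed channel"
// failure mode seen at logger.go:153.
func (r *reporter) Close() {
	r.closeOnce.Do(func() { close(r.done) })
}

func main() {
	r := &reporter{ch: make(chan string, 8), done: make(chan struct{})}
	r.Report("hello")
	r.Close()
	r.Close()         // safe: idempotent
	r.Report("world") // safe: no panic; may be dropped after shutdown
}
```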
Code Review: ✅ Approved (5 of 5 findings resolved)

All three previous findings (TOCTOU race, unused HTTP parameter, JSON field naming) have been properly resolved with atomic check methods, decoupled signatures, and correct field tags. The new liveness suppression mechanism is well-designed with proper safety nets. Clean implementation overall.

✅ 5 resolved
Bug: TOCTOU: report and check are not atomic in handlers

📄 internal/health/server.go:65 📄 internal/health/server.go:84
In both healthzHandler and readyzHandler, BuildReport() and LivenessCheck()/ReadinessCheck() are called as separate operations, each acquiring and releasing the RWMutex independently. A concurrent status update between these two calls could result in the response showing component statuses that don't match the health/readiness decision.

For example: BuildReport() sees collector_manager as Healthy, then another goroutine updates it to Unhealthy, then LivenessCheck() returns an error — but the response components still show "healthy" for collector_manager.

This is a minor issue since health probes are polled frequently and transient inconsistency is unlikely to cause real harm, but it's worth noting. A possible fix would be to have the check methods accept the report, or to combine report building and checking into a single lock acquisition.

Quality: Unused *http.Request parameter couples health to net/http

📄 internal/health/manager.go:115 📄 internal/health/manager.go:128
Both LivenessCheck and ReadinessCheck accept *http.Request but never use it (the parameter is explicitly ignored with _). This unnecessarily imports net/http into manager.go and couples the health check logic to the HTTP transport layer.

The health manager's check methods should be transport-agnostic — just returning an error based on component state. The HTTP handler layer already handles the request and can pass relevant data if needed.

Removing the *http.Request parameter also simplifies testing (no need to pass nil in tests).

Quality: JSON field name "message" used for error field Error

📄 internal/health/server.go:14
In HealthResponse, the struct field is named Error (a string) but its JSON tag is "message". Meanwhile, in ComponentResponse, there's a field named Message also tagged "message". This is confusing for API consumers:

```go
type HealthResponse struct {
    Error      string  `json:"message,omitempty"`  // this is the error
    ...
}
type ComponentResponse struct {
    Message  string  `json:"message,omitempty"`    // this is the message
    ...
}
```

When the top-level status is "unhealthy", the message field contains an error description. Consider either renaming the Go field to Message for consistency, or using "error" as the JSON tag to clearly distinguish error messages from informational messages.
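Per the fix(health): address code review findings commit above, the resolution took the second option; after the change the tag would read roughly:

```go
type HealthResponse struct {
    Error      string  `json:"error,omitempty"`  // was "message"; now distinct from ComponentResponse.Message
    ...
}
```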

Bug: Health server is never gracefully shut down

📄 cmd/main.go:166
The HealthServer has a Stop(ctx) method for graceful shutdown, but it's never called in cmd/main.go. When mgr.Start() returns (on signal or error), the process exits without shutting down the health server. This means in-flight health check requests could be abruptly terminated, and the TCP listener won't be cleanly released.

Since mgr.Start() blocks until the manager is stopped, the health server should be shut down after mgr.Start() returns (or via a signal handler / deferred call). A simple approach is to defer the shutdown right after starting it.
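A sketch of that approach is below, using the stock http.Server as a stand-in for the PR's HealthServer (which, per this finding, exposes a Stop(ctx) method); the 5-second timeout matches the graceful shutdown noted in the summary, and the rest of the wiring is an assumption.

```go
package main

import (
	"context"
	"errors"
	"log"
	"net/http"
	"time"
)

func run() error {
	// Stands in for the PR's HealthServer; the real type exposes Stop(ctx).
	healthServer := &http.Server{Addr: ":8081"}

	go func() {
		if err := healthServer.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
			log.Printf("health server failed: %v", err)
		}
	}()

	// Deferred graceful shutdown with the 5s timeout mentioned in the
	// summary; it runs after the blocking manager Start returns.
	defer func() {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		if err := healthServer.Shutdown(ctx); err != nil {
			log.Printf("health server shutdown failed: %v", err)
		}
	}()

	// mgr.Start(ctrl.SetupSignalHandler()) would block here in cmd/main.go.
	time.Sleep(100 * time.Millisecond)
	return nil
}

func main() {
	// Wrapping the body in run() matters: os.Exit skips defers, so the
	// shutdown must live in a function that returns normally.
	if err := run(); err != nil {
		log.Fatal(err)
	}
}
```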

Bug: Commented-out code placed after blocking mgr.Start() call

📄 cmd/main.go:178
The TODO-commented code (lines 178-186) is placed after mgr.Start(), which is a blocking call. This code is unreachable during normal operation; it only executes after the manager has already stopped. If someone later uncomments this code to "revert" to the old behavior, the health checks would never actually be registered, because the manager would have already been running (and stopped) by that point.

The original code (before this PR) had these health check registrations before mgr.Start(), which was correct. The commented-out code should either be removed entirely or, if kept as reference, moved above mgr.Start() with a clear note that it must be placed before the Start call.


@mutantkeyboard merged commit 61a3802 into main on Feb 18, 2026
25 of 26 checks passed
@mutantkeyboard deleted the health_endpoints_grpc branch on February 18, 2026 at 13:03
Parthiba-Hazra pushed a commit that referenced this pull request May 5, 2026
…276)

* Add healthz and readyz probes as well as server

* feat(health): add graceful shutdown and remove dead code from health server wiring

* Golint fixes

* fix(health): add liveness suppression during planned collector restarts

Addresses reviewer concern about constant backoffs when collectors are
restarting. StopAll() sets collector_manager to Unhealthy, which fails
the liveness probe and triggers pod kills. This adds a grace period
mechanism: planned restarts suppress liveness checks for up to 5 minutes,
and StartAll() clears the suppression immediately. Unplanned failures
(cleanupOnFailure) intentionally do NOT suppress, so the pod still
restarts on truly fatal states.

* fix(health): address code review findings

- TOCTOU fix: add CheckLiveness/CheckReadiness that atomically build
  report and evaluate check under a single lock, replacing separate
  BuildReport + LivenessCheck calls in HTTP handlers
- Remove unused *http.Request param from LivenessCheck/ReadinessCheck
  to decouple health logic from net/http transport
- Fix JSON tag: rename Error field from "message" to "error" to avoid
  confusion with ComponentResponse.Message

---------

Co-authored-by: Antonio Nesic <antonio.nesic@devzero.io>
