PC-821 Replace existing probes with HealthManager-backed endpoints #276
Conversation
Tzvonimir
left a comment
Generally this looks OK; I would just make sure that this will not cause constant backoffs if, for example, any of the collectors is not working, or similar issues. Also, please wait to merge until the new release. It should be done today in the evening.
Addresses reviewer concern about constant backoffs when collectors are restarting. StopAll() sets collector_manager to Unhealthy, which fails the liveness probe and triggers pod kills. This adds a grace period mechanism: planned restarts suppress liveness checks for up to 5 minutes, and StartAll() clears the suppression immediately. Unplanned failures (cleanupOnFailure) intentionally do NOT suppress, so the pod still restarts on truly fatal states.
- TOCTOU fix: add CheckLiveness/CheckReadiness that atomically build the report and evaluate the check under a single lock, replacing the separate BuildReport + LivenessCheck calls in the HTTP handlers
- Remove the unused *http.Request param from LivenessCheck/ReadinessCheck to decouple health logic from the net/http transport
- Fix JSON tag: rename the Error field from "message" to "error" to avoid confusion with ComponentResponse.Message
🔍 CI failure analysis for 77b7da8: Integration test failed due to a pre-existing logger bug (panic: send on closed channel at logger.go:106). Same issue as previous test failures, completely unrelated to the health endpoint changes.

Issue
The "Test Metrics Server Lifecycle on K8s v1.32.3" integration test failed with a panic in the logger component during application execution.

Root Cause
The application crashed with panic: send on closed channel at logger.go:106.

Details
Panic Stack Trace
Sequence of Events

Why This Is Unrelated to PR Changes
This PR only modifies health endpoint functionality. The panic occurs in the logger (logger.go:106), which this PR does not touch.

Previous Occurrences
This is a recurring issue that has appeared in multiple CI runs. Both are manifestations of the same underlying logger concurrency bug: improper channel lifecycle management.

Health Endpoints Working Correctly
The Kubernetes events show the health probes are functioning as expected. The container restarts are due to the logger panic, not the health system.

Root Cause Analysis
The logger bug appears to have two failure modes. Both indicate that the logger's channel lifecycle is not properly synchronized with shutdown/restart operations. The bug is triggered when:
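One common way to avoid this class of send-on-closed-channel panic is never to close the message channel at all, and instead to signal shutdown through a separate done channel. The sketch below is a hedged illustration of that pattern, not the repository's actual logger code; type and method names are invented for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// Logger is a hypothetical sketch: msgs is never closed, so writers can
// never panic with "send on closed channel"; shutdown is signaled via done.
type Logger struct {
	msgs chan string
	done chan struct{}
	once sync.Once
	wg   sync.WaitGroup
}

func NewLogger() *Logger {
	l := &Logger{msgs: make(chan string, 16), done: make(chan struct{})}
	l.wg.Add(1)
	go func() {
		defer l.wg.Done()
		for {
			select {
			case <-l.done:
				return // draining buffered messages is omitted for brevity
			case m := <-l.msgs:
				_ = m // write to the real sink here
			}
		}
	}()
	return l
}

// Log reports whether the message was accepted; after Close it drops the
// message instead of panicking.
func (l *Logger) Log(m string) bool {
	select {
	case <-l.done:
		return false
	default:
	}
	select {
	case l.msgs <- m:
		return true
	case <-l.done:
		return false
	}
}

// Close is idempotent: sync.Once prevents a double-close panic on restart.
func (l *Logger) Close() {
	l.once.Do(func() {
		close(l.done)
		l.wg.Wait()
	})
}

func main() {
	l := NewLogger()
	fmt.Println(l.Log("hello")) // true
	l.Close()
	fmt.Println(l.Log("late")) // false: dropped, no panic
}
```

The key design choice is that only the done channel is ever closed, and only once, so concurrent shutdown/restart cycles cannot race a sender against a close.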
Code Review: ✅ Approved (5 resolved / 5 findings)

All three previous findings (TOCTOU race, unused HTTP parameter, JSON field naming) have been properly resolved with atomic check methods, decoupled signatures, and correct field tags. The new liveness suppression mechanism is well designed, with proper safety nets. Clean implementation overall.

✅ 5 resolved
✅ Bug: TOCTOU: report and check are not atomic in handlers
✅ Quality: Unused *http.Request parameter
…276)

* Add healthz and readyz probes as well as server
* feat(health): add graceful shutdown and remove dead code from health server wiring
* Golint fixes
* fix(health): add liveness suppression during planned collector restarts

  Addresses reviewer concern about constant backoffs when collectors are restarting. StopAll() sets collector_manager to Unhealthy, which fails the liveness probe and triggers pod kills. This adds a grace period mechanism: planned restarts suppress liveness checks for up to 5 minutes, and StartAll() clears the suppression immediately. Unplanned failures (cleanupOnFailure) intentionally do NOT suppress, so the pod still restarts on truly fatal states.

* fix(health): address code review findings

  - TOCTOU fix: add CheckLiveness/CheckReadiness that atomically build report and evaluate check under a single lock, replacing separate BuildReport + LivenessCheck calls in HTTP handlers
  - Remove unused *http.Request param from LivenessCheck/ReadinessCheck to decouple health logic from net/http transport
  - Fix JSON tag: rename Error field from "message" to "error" to avoid confusion with ComponentResponse.Message

---------

Co-authored-by: Antonio Nesic <antonio.nesic@devzero.io>
Summary by Gitar
- Health server on :8081 with /healthz and /readyz endpoints returning JSON with component status details
- CheckLiveness() and CheckReadiness() methods prevent TOCTOU races by building the report and evaluating health under a single lock
- Removed unused *http.Request parameters from check methods
[Title]
📚 Description of Changes
Provide an overview of your changes and why they're needed. Link to any related issues (e.g., "Fixes #123"). If your PR fixes a bug, resolves a feature request, or updates documentation, please explain how.
What Changed:
(Describe the modifications, additions, or removals.)
Why This Change:
(Explain the problem this PR addresses or the improvement it provides.)
Affected Components:
(Which component does this change affect? - put x for all components)
Compose
K8s
Other (please specify)
❓ Motivation and Context
Why is this change required? What problem does it solve?
Context:
(Provide background information or link to related discussions/issues.)
Relevant Tasks/Issues:
(e.g., Fixes: #GitHub Issue)
🔍 Types of Changes
Indicate which type of changes your code introduces (check all that apply):
🔬 QA / Verification Steps
Describe the steps a reviewer should take to verify your changes:
(e.g., "Run make test to verify all tests pass.")
(e.g., "Run make create-kind && make deploy.")
✅ Global Checklist
Please check all boxes that apply: