HYPERFLEET-448, HYPERFLEET-449 - feat: add health endpoints and graceful shutdown readiness #29
Conversation
Add HTTP health and metrics endpoints per HyperFleet standards:
- /healthz on port 8080 - liveness probe endpoint (always returns 200)
- /readyz on port 8080 - readiness probe endpoint (503 during startup/shutdown, 200 when ready)
- /metrics on port 9090 - Prometheus metrics endpoint

The readiness state is managed atomically:
- Starts as not ready (503)
- Becomes ready after broker subscription is established (200)
- Returns to not ready when SIGTERM/SIGINT is received (503)

Helm chart updated to enable liveness and readiness probes by default.
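Condensed into a standalone sketch, the lifecycle described above looks roughly like this (illustrative only; the port, timeout, response shapes, and the omission of broker code are assumptions, not the PR's actual cmd/adapter/main.go):

```go
package main

import (
	"context"
	"encoding/json"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

func main() {
	var ready atomic.Bool // starts false: /readyz returns 503 until the broker subscription is up

	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(map[string]string{"status": "ok"}) // liveness: always 200
	})
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		if !ready.Load() {
			w.WriteHeader(http.StatusServiceUnavailable)
			json.NewEncoder(w).Encode(map[string]string{"status": "not ready"})
			return
		}
		json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
	})

	srv := &http.Server{Addr: ":8080", Handler: mux, ReadHeaderTimeout: 5 * time.Second}
	go srv.ListenAndServe() // health endpoints come up before the broker connects

	// ... connect to the broker here; once the subscription is established:
	ready.Store(true)

	// On SIGTERM/SIGINT: flip readiness first so Kubernetes stops routing traffic,
	// then shut the server down gracefully.
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)
	<-sigCh
	ready.Store(false)

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	srv.Shutdown(ctx)
}
```

In the actual adapter the readiness state also tracks per-check statuses (config, broker), as discussed later in the review.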
Walkthrough

Adds a concurrent-safe HTTP health server with /healthz and /readyz, per-check statuses, config-loaded and shutting-down flags, and graceful Start/Shutdown. Adds a Prometheus metrics server exposing /metrics.

Sequence Diagram(s):
sequenceDiagram
participant Main as Application/Main
participant HS as Health Server
participant MS as Metrics Server
participant Broker as Broker
participant Sig as Signal Handler
Main->>HS: NewServer(...) / Start(ctx)
HS-->>Main: listening on :8080 (/healthz, /readyz)
Main->>MS: NewMetricsServer(...) / Start(ctx)
MS-->>Main: listening on :9090 (/metrics)
Main->>Broker: Subscribe()
Broker-->>Main: subscription established
Main->>HS: SetConfigLoaded()
Main->>HS: SetCheck("broker", ok)
Main->>HS: SetBrokerReady(true)
Sig-->>Main: shutdown signal
Main->>HS: SetBrokerReady(false)
Main->>HS: SetShuttingDown(true)
Main->>HS: Shutdown(ctx)
Main->>MS: Shutdown(ctx)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 0
🧹 Nitpick comments (5)
pkg/health/server_test.go (1)
111-129: Minor: Response bodies not closed in this test.

Lines 120 and 128 call w.Result() but don't close the response body, unlike other tests. While this is unlikely to cause issues in tests, it's inconsistent with the pattern used elsewhere in the file.

Suggested fix:

```diff
 req := httptest.NewRequest(http.MethodGet, "/readyz", nil)
 w := httptest.NewRecorder()
 server.readyzHandler(w, req)
-assert.Equal(t, http.StatusOK, w.Result().StatusCode)
+resp := w.Result()
+defer resp.Body.Close()
+assert.Equal(t, http.StatusOK, resp.StatusCode)

 // Set not ready (simulating shutdown)
 server.SetReady(false)
 req = httptest.NewRequest(http.MethodGet, "/readyz", nil)
 w = httptest.NewRecorder()
 server.readyzHandler(w, req)
-assert.Equal(t, http.StatusServiceUnavailable, w.Result().StatusCode)
+resp = w.Result()
+defer resp.Body.Close()
+assert.Equal(t, http.StatusServiceUnavailable, resp.StatusCode)
```

pkg/health/metrics.go (1)
35-47: Start() always returns nil, masking potential startup errors.

The method launches the server in a goroutine and immediately returns nil. If ListenAndServe fails (e.g., port already in use), the error is only logged asynchronously and the caller proceeds as if startup succeeded. This mirrors the pattern in server.go, so it's consistent within this PR.

For improved reliability, consider using a synchronization mechanism (e.g., a channel or sync.WaitGroup) to confirm the server is actually listening before returning, or at least document this behavior.

Example approach using a ready channel:

```go
func (s *MetricsServer) Start(ctx context.Context) error {
	s.log.Infof(ctx, "Starting metrics server on port %s", s.port)

	errCh := make(chan error, 1)
	go func() {
		if err := s.server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			errCtx := logger.WithErrorField(ctx, err)
			s.log.Errorf(errCtx, "Metrics server error")
			select {
			case errCh <- err:
			default:
			}
		}
	}()

	// Give the server a moment to fail fast on port conflicts
	select {
	case err := <-errCh:
		return err
	case <-time.After(100 * time.Millisecond):
		return nil
	}
}
```

pkg/health/server.go (3)
25-25: Unused field: component.

The component field is stored in the struct but never used in this file. Consider removing it if it's not needed, or use it in log messages to identify which component's health is being reported.
40-44: Consider adding ReadTimeout and WriteTimeout for defense in depth.

While ReadHeaderTimeout is set (good for preventing slowloris attacks), adding ReadTimeout and WriteTimeout would provide additional protection against slow clients holding connections open.

💡 Suggested enhancement:

```diff
 s.server = &http.Server{
 	Addr:              ":" + port,
 	Handler:           mux,
 	ReadHeaderTimeout: 5 * time.Second,
+	ReadTimeout:       10 * time.Second,
+	WriteTimeout:      10 * time.Second,
 }
```
50-61: Startup errors are not propagated to the caller.

The goroutine logs errors but doesn't communicate them back. If the server fails to bind (e.g., port already in use), the caller won't know. Consider using a channel to report startup success/failure, or at minimum document this behavior.

💡 Alternative pattern with startup error handling:

```diff
 // Start starts the health server in a goroutine.
-func (s *Server) Start(ctx context.Context) error {
+// Returns an error channel that will receive any server errors.
+func (s *Server) Start(ctx context.Context) <-chan error {
 	s.log.Infof(ctx, "Starting health server on port %s", s.port)
+	errCh := make(chan error, 1)
 	go func() {
 		if err := s.server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
 			errCtx := logger.WithErrorField(ctx, err)
 			s.log.Errorf(errCtx, "Health server error")
+			errCh <- err
 		}
+		close(errCh)
 	}()
-	return nil
+	return errCh
 }
```

Alternatively, if keeping the current signature, document that startup failures are only logged.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
go.sum is excluded by !**/*.sum
📒 Files selected for processing (6)
- charts/values.yaml
- cmd/adapter/main.go
- go.mod
- pkg/health/metrics.go
- pkg/health/server.go
- pkg/health/server_test.go
🧰 Additional context used
🧬 Code graph analysis (3)
pkg/health/server_test.go (3)
- pkg/logger/logger.go (1): Logger (14-40)
- pkg/health/server.go (2): NewServer (29-47), Response (14-17)
- internal/config_loader/types.go (1): Header (172-175)

cmd/adapter/main.go (5)
- pkg/health/server.go (1): NewServer (29-47)
- internal/config_loader/types.go (1): Metadata (67-71)
- pkg/logger/context.go (1): WithErrorField (156-168)
- internal/hyperfleet_api/client.go (1): WithTimeout (70-74)
- pkg/health/metrics.go (1): NewMetricsServer (20-33)

pkg/health/metrics.go (3)
- pkg/health/server.go (1): Server (20-26)
- pkg/logger/logger.go (1): Logger (14-40)
- pkg/logger/context.go (1): WithErrorField (156-168)
🔇 Additional comments (17)
charts/values.yaml (2)
36-43: LGTM! Container port definitions correctly expose the health server (8080) and metrics server (9090) endpoints, aligning with the constants defined in cmd/adapter/main.go.
89-122: LGTM! Probe configurations are well-structured:
- Liveness probe with 30s initial delay allows time for full initialization
- Readiness probe with 5s initial delay and shorter intervals enables quick traffic routing once ready
- Startup probe (disabled by default) with high failureThreshold (30) is useful for slow-starting scenarios
The paths (/healthz, /readyz) and port (8080) correctly match the health server implementation.

go.mod (1)
104-107: LGTM! Prometheus client libraries are correctly added as indirect dependencies to support the metrics server. The versions appear current.
cmd/adapter/main.go (5)
50-60: LGTM! Constants are well-organized and clearly documented. Port values align with the Helm chart configuration.
205-235: Health and metrics server lifecycle is well-implemented. Good practices observed:
- Health server starts early with readiness=false (correct initial state)
- Deferred shutdowns with explicit timeout contexts
- Error handling properly enriches context for logging

One observation: both Start() methods return immediately before the server goroutine actually binds to the port. If port binding fails (e.g., port already in use), the error is only logged asynchronously in the goroutine. Consider verifying the server is actually listening before proceeding, though for most deployments this is acceptable.
281-293: LGTM! Correct graceful shutdown pattern: marking the service as not ready immediately upon receiving a termination signal ensures Kubernetes stops routing traffic before the actual shutdown begins.
338-340: LGTM! Correctly marks the adapter as ready only after the broker subscription is successfully established, ensuring /readyz returns 503 during the initialization phase.
370-376: LGTM! Properly marks the service as not ready before initiating shutdown on fatal subscription errors, maintaining consistency with the graceful shutdown pattern.
pkg/health/server_test.go (3)
15-29: LGTM! The mock logger correctly implements all methods from the logger.Logger interface, providing a clean test double.
31-50: LGTM! Good test coverage for the /healthz endpoint, verifying both the HTTP status code and JSON response structure.
52-94: LGTM! Comprehensive tests for the /readyz endpoint covering both ready and not-ready states with proper JSON response validation.

pkg/health/metrics.go (2)
12-33: LGTM! Clean implementation with proper ReadHeaderTimeout to prevent slowloris attacks. The use of promhttp.Handler() correctly exposes the default Prometheus registry.
49-53: LGTM! Graceful shutdown correctly uses the provided context for timeout control.

pkg/health/server.go (4)
63-67: LGTM! The shutdown implementation correctly uses context for timeout control and propagates errors to the caller.
69-77: LGTM! Atomic operations ensure thread-safe readiness state management.
79-85: LGTM! The liveness handler correctly returns 200 OK unconditionally, which is the expected behavior for Kubernetes liveness probes.
87-104: LGTM! The readiness handler correctly returns 503 when not ready and 200 when ready, which properly implements Kubernetes readiness probe semantics.
Update health endpoints and metrics to comply with HyperFleet standards:
/readyz response format:
- Now returns detailed checks per standard: {"status": "ok", "checks": {"config": "ok", "broker": "ok"}}
- Tracks individual check states (config, broker) instead of single boolean
Probe timing (per health-endpoints.md):
- Liveness: initialDelaySeconds=15, periodSeconds=20
- Readiness: initialDelaySeconds=5, periodSeconds=10
Metrics (per metrics.md):
- Added hyperfleet_adapter_build_info{component, version, commit}
- Added hyperfleet_adapter_up{component, version}
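A rough sketch of how the two gauges named above could be registered and served with the Prometheus client library (label values and the standalone main function are placeholders; the adapter wires this into its own metrics server instead):

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Hypothetical values; in the adapter these would come from build-time ldflags or config.
	component, version, commit := "hyperfleet-adapter", "v0.1.0", "abc1234"

	registry := prometheus.NewRegistry()

	buildInfo := prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "hyperfleet_adapter_build_info",
		Help: "Build information for the adapter.",
	}, []string{"component", "version", "commit"})

	up := prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "hyperfleet_adapter_up",
		Help: "Whether the adapter is up.",
	}, []string{"component", "version"})

	registry.MustRegister(buildInfo, up)
	buildInfo.WithLabelValues(component, version, commit).Set(1)
	up.WithLabelValues(component, version).Set(1)

	// Expose only this registry on /metrics, port 9090 per the PR description.
	http.Handle("/metrics", promhttp.HandlerFor(registry, promhttp.HandlerOpts{}))
	_ = http.ListenAndServe(":9090", nil)
}
```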
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @pkg/health/server_test.go:
- Around line 188-193: The test calls w.Result() multiple times which creates
distinct http.Response objects and can cause the body to be read from a
different instance; fix by calling w.Result() once (e.g., res := w.Result()) and
reuse res for decoding into ReadyResponse and for assertions, decoding from
res.Body and then asserting response.Checks["config"] == CheckOK and
response.Checks["broker"] == CheckError so ReadyResponse and the response body
come from the same http.Response.
🧹 Nitpick comments (5)
pkg/health/metrics.go (1)
52-54: MustRegister will panic if called multiple times.

Using prometheus.MustRegister with the default registry will panic if metrics are already registered (e.g., during tests or if NewMetricsServer is called multiple times). Consider using a custom registry to isolate metrics per server instance.

♻️ Suggested refactor using a custom registry:

```diff
 func NewMetricsServer(log logger.Logger, port string, cfg MetricsConfig) *MetricsServer {
+	// Use a custom registry to avoid panics on duplicate registration
+	registry := prometheus.NewRegistry()
+
 	// Create build_info metric per HyperFleet metrics standard
 	buildInfo := prometheus.NewGaugeVec(
 		prometheus.GaugeOpts{
 			Name: "hyperfleet_adapter_build_info",
 			Help: "Build information for the adapter",
 		},
 		[]string{"component", "version", "commit"},
 	)

 	// ... (upGauge creation remains the same)

 	// Register metrics
-	prometheus.MustRegister(buildInfo)
-	prometheus.MustRegister(upGauge)
+	registry.MustRegister(buildInfo)
+	registry.MustRegister(upGauge)

 	// ... (metric initialization remains the same)

 	mux := http.NewServeMux()
-	mux.Handle("/metrics", promhttp.Handler())
+	mux.Handle("/metrics", promhttp.HandlerFor(registry, promhttp.HandlerOpts{}))
```

pkg/health/server_test.go (1)
176-179: Unclosed response body on first request.

The response body from the first request (Line 179) is not closed before making the second request. While this is a test, it's good practice to close response bodies to avoid resource leaks and ensure consistent test behavior.

♻️ Suggested fix:

```diff
 req := httptest.NewRequest(http.MethodGet, "/readyz", nil)
 w := httptest.NewRecorder()
 server.readyzHandler(w, req)
-assert.Equal(t, http.StatusOK, w.Result().StatusCode)
+resp := w.Result()
+assert.Equal(t, http.StatusOK, resp.StatusCode)
+resp.Body.Close()
```

cmd/adapter/main.go (1)
234-236: Metrics server uses HealthServerShutdownTimeout instead of a dedicated constant.

For clarity and future maintainability, consider defining a separate MetricsServerShutdownTimeout constant, even if it has the same value as HealthServerShutdownTimeout.

pkg/health/server.go (2)
127-133: JSON encoding error is not checked.

While unlikely to fail for simple structs, ignoring the error from json.NewEncoder(w).Encode() is not ideal. If encoding fails, the client receives incomplete JSON.

♻️ Suggested improvement:

```diff
 func (s *Server) healthzHandler(w http.ResponseWriter, r *http.Request) {
 	w.Header().Set("Content-Type", "application/json")
 	w.WriteHeader(http.StatusOK)
-	json.NewEncoder(w).Encode(HealthResponse{Status: "ok"})
+	if err := json.NewEncoder(w).Encode(HealthResponse{Status: "ok"}); err != nil {
+		s.log.Warnf(r.Context(), "Failed to encode health response: %v", err)
+	}
 }
```
36-45: component field is stored but never used.

The component field is set in the constructor but not referenced anywhere in the Server methods. Consider removing it or using it in log messages for better observability.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- charts/values.yaml
- cmd/adapter/main.go
- pkg/health/metrics.go
- pkg/health/server.go
- pkg/health/server_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- charts/values.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-11T02:06:33.975Z
Learnt from: xueli181114
Repo: openshift-hyperfleet/hyperfleet-adapter PR: 17
File: internal/config_loader/validator_schema.go:126-133
Timestamp: 2025-12-11T02:06:33.975Z
Learning: In the hyperfleet-adapter repository (Go), environment variable validation for required parameters (those with `env.*` source, `Required=true`, and no default) occurs at config load time in `internal/config_loader/validator_schema.go`. This is intentional design - environment variables must be set before deployment, and the validation ensures fail-fast behavior with clear error messages.
Applied to files:
cmd/adapter/main.go
🧬 Code graph analysis (3)
cmd/adapter/main.go (3)
- pkg/health/server.go (1): NewServer (48-70)
- internal/config_loader/types.go (1): Metadata (67-71)
- pkg/health/metrics.go (2): NewMetricsServer (30-76), MetricsConfig (23-27)

pkg/health/server.go (2)
- pkg/logger/logger.go (1): Logger (14-40)
- pkg/logger/context.go (1): WithErrorField (156-168)

pkg/health/metrics.go (3)
- pkg/health/server.go (1): Server (37-45)
- pkg/logger/logger.go (1): Logger (14-40)
- pkg/logger/context.go (1): WithErrorField (156-168)
🔇 Additional comments (8)
pkg/health/metrics.go (2)
78-90: Start method errors are not propagated to caller.

The Start method always returns nil, even if the server fails to bind to the port. Since ListenAndServe runs in a goroutine, immediate binding errors (like "address already in use") are logged but not returned to the caller. This could cause silent failures during startup. Consider adding a brief startup delay or using a channel to detect early binding failures if this is a concern for your deployment.

62-75: LGTM on server configuration. Good practices applied: ReadHeaderTimeout is set to mitigate Slowloris attacks, and the HTTP mux setup is straightforward.

pkg/health/server_test.go (1)
103-120: Good coverage of readiness state transitions. The test properly validates the multi-check readiness model: config and broker checks must both pass for IsReady() to return true.

cmd/adapter/main.go (3)
205-221: Well-structured health server lifecycle integration. The health server starts early in the initialization sequence, and SetConfigLoaded() is appropriately called after successful config loading. The deferred shutdown with a dedicated timeout context ensures graceful cleanup.
287-293: Good graceful shutdown handling. Marking SetReady(false) immediately upon receiving a shutdown signal ensures Kubernetes stops routing traffic before the pod terminates. This aligns with graceful shutdown best practices.
344-346: Readiness signaling after broker subscription is correct. Setting readiness only after the broker subscription is successfully established ensures the pod won't receive traffic until it can actually process events.

pkg/health/server.go (2)
138-150: Good thread-safe implementation of readiness check aggregation. The read lock is properly acquired before iterating over the checks map, and the map is copied before releasing the lock. This prevents race conditions while keeping the critical section minimal.
47-70: Well-designed constructor with sensible defaults. Initializing checks to CheckError by default ensures the server starts in a "not ready" state, which is the correct behavior for Kubernetes readiness probes during startup.
Add atomic shuttingDown flag to health server that immediately returns 503 Service Unavailable on /readyz when SIGTERM/SIGINT is received. This follows the HyperFleet Graceful Shutdown Standard: mark not ready before context cancellation to allow Kubernetes to remove the pod from Service endpoints before draining in-flight work.
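A condensed sketch of that fast path (field names and the exact response wording are assumptions based on the review comments below, not the PR's verbatim code):

```go
package health

import (
	"encoding/json"
	"net/http"
	"sync/atomic"
)

// Sketch only: the real Server also carries a checks map, mutex, logger, and http.Server.
type Server struct {
	shuttingDown atomic.Bool
}

func (s *Server) SetShuttingDown(v bool) { s.shuttingDown.Store(v) }

func (s *Server) readyzHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	// Fast path: the atomic flag is checked before any lock is taken,
	// so SIGTERM flips /readyz to 503 immediately.
	if s.shuttingDown.Load() {
		w.WriteHeader(http.StatusServiceUnavailable)
		_ = json.NewEncoder(w).Encode(map[string]string{
			"status":  "not ready",
			"message": "server is shutting down",
		})
		return
	}
	// ... otherwise aggregate the per-check statuses and return 200 or 503.
	w.WriteHeader(http.StatusOK)
	_ = json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
}
```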
Actionable comments posted: 0
🧹 Nitpick comments (2)
pkg/health/server.go (1)
69-73: Consider adding ReadTimeout and WriteTimeout for defense in depth.

While ReadHeaderTimeout protects against slowloris attacks on headers, adding ReadTimeout and WriteTimeout would provide additional protection against slow body reads and slow clients. For health endpoints with small payloads, short timeouts (e.g., 10s) are reasonable.

♻️ Suggested enhancement:

```diff
 s.server = &http.Server{
 	Addr:              ":" + port,
 	Handler:           mux,
 	ReadHeaderTimeout: 5 * time.Second,
+	ReadTimeout:       10 * time.Second,
+	WriteTimeout:      10 * time.Second,
 }
```

cmd/adapter/main.go (1)
234-241: Consider using a dedicated MetricsServerShutdownTimeout constant.

The metrics server shutdown uses HealthServerShutdownTimeout. While the value is likely appropriate for both, using a dedicated constant improves clarity and allows independent tuning if needed.

♻️ Suggested change. Add a constant:

```go
// MetricsServerShutdownTimeout is the timeout for gracefully shutting down the metrics server
MetricsServerShutdownTimeout = 5 * time.Second
```

Then use it in the deferred shutdown:

```diff
-shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), HealthServerShutdownTimeout)
+shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), MetricsServerShutdownTimeout)
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- cmd/adapter/main.go
- pkg/health/server.go
- pkg/health/server_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/health/server_test.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-11T02:06:33.975Z
Learnt from: xueli181114
Repo: openshift-hyperfleet/hyperfleet-adapter PR: 17
File: internal/config_loader/validator_schema.go:126-133
Timestamp: 2025-12-11T02:06:33.975Z
Learning: In the hyperfleet-adapter repository (Go), environment variable validation for required parameters (those with `env.*` source, `Required=true`, and no default) occurs at config load time in `internal/config_loader/validator_schema.go`. This is intentional design - environment variables must be set before deployment, and the validation ensures fail-fast behavior with clear error messages.
Applied to files:
cmd/adapter/main.go
🧬 Code graph analysis (1)
pkg/health/server.go (2)
- pkg/logger/logger.go (1): Logger (14-40)
- pkg/logger/context.go (1): WithErrorField (156-168)
🔇 Additional comments (11)
pkg/health/server.go (6)
14-35: LGTM! The type definitions are clean and follow Go conventions. Using a string type for CheckStatus with constants provides good type safety while allowing JSON serialization. The response structs are well-designed with appropriate omitempty tags.
37-51: LGTM! Good use of atomic.Bool for the shutdown flag (lock-free fast path) combined with sync.RWMutex for the checks map. This design allows the shutdown check to be performed without contention.
78-90: Start returns before server is confirmed listening.

Start() returns nil immediately while the server starts in a background goroutine. If ListenAndServe fails (e.g., port already in use), the error is only logged, not returned to the caller. This is acceptable for health probes in Kubernetes (probes start later), but callers should be aware the server may not be listening when Start() returns. If startup failure detection is important, consider using a channel or net.Listen to confirm the port is bound before returning; a sketch of the net.Listen approach follows.
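For illustration, the net.Listen variant could look like this (a sketch under the assumption that Server holds its http.Server and listen address; names are illustrative, not the PR's code):

```go
package health

import (
	"context"
	"net"
	"net/http"
)

type Server struct {
	srv  *http.Server
	addr string
}

// Start binds the listener synchronously so "address already in use" and similar
// errors surface to the caller, then serves in a goroutine.
func (s *Server) Start(ctx context.Context) error {
	ln, err := net.Listen("tcp", s.addr)
	if err != nil {
		return err // port conflict reported immediately instead of only being logged
	}
	go func() {
		if err := s.srv.Serve(ln); err != nil && err != http.ErrServerClosed {
			// log the error; errors after startup remain asynchronous
			_ = err
		}
	}()
	return nil
}
```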
133-149: LGTM! The IsReady implementation correctly checks the atomic shutdown flag first (fast path without locking), then acquires a read lock to check the checks map. This is an efficient and thread-safe pattern.
151-157: LGTM! Simple liveness probe that returns 200 if the process is alive, which is correct behavior. The ignored json.Encode error is acceptable since headers are already written and there's no meaningful recovery.
162-202: LGTM! The handler correctly:
- Checks the atomic shutdown flag first for fast 503 response
- Copies the checks map under read lock, then releases before encoding
- Returns detailed check states for debugging when not ready
Good implementation following the HyperFleet Graceful Shutdown Standard.
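A minimal sketch of that copy-under-read-lock aggregation (struct fields and helper names here are illustrative assumptions, not the PR's exact identifiers):

```go
package health

import (
	"encoding/json"
	"net/http"
	"sync"
)

type CheckStatus string

const (
	CheckOK    CheckStatus = "ok"
	CheckError CheckStatus = "error"
)

type Server struct {
	mu     sync.RWMutex
	checks map[string]CheckStatus
}

// snapshotChecks copies the map while holding the read lock, so the lock is
// released before any JSON encoding or network I/O happens.
func (s *Server) snapshotChecks() (map[string]CheckStatus, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	allOK := true
	out := make(map[string]CheckStatus, len(s.checks))
	for name, st := range s.checks {
		out[name] = st
		if st != CheckOK {
			allOK = false
		}
	}
	return out, allOK
}

func (s *Server) readyzHandler(w http.ResponseWriter, r *http.Request) {
	checks, allOK := s.snapshotChecks()
	status, code := "ok", http.StatusOK
	if !allOK {
		status, code = "not ready", http.StatusServiceUnavailable
	}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(code)
	_ = json.NewEncoder(w).Encode(map[string]any{"status": status, "checks": checks})
}
```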
cmd/adapter/main.go (5)
50-60: LGTM! Well-documented constants for server ports and shutdown timeouts. Using string type for ports aligns with the net/http API expectations.
205-221: LGTM! Health server is correctly initialized early (after logging/OTel setup) so probes are available quickly. SetConfigLoaded() is called immediately after successful start since config was already validated. The deferred shutdown with timeout ensures graceful termination.
287-300: LGTM! Excellent implementation of the HyperFleet Graceful Shutdown Standard. The SetShuttingDown(true) call before cancel() ensures /readyz returns 503 immediately, allowing Kubernetes to stop routing traffic before the context cancellation propagates to in-flight requests.
345-348: LGTM! Readiness is correctly set only after the broker subscription is successfully established, ensuring /readyz returns 503 during the startup phase until the adapter is fully operational.
377-383: LGTM! Consistent with the signal handling pattern: SetShuttingDown(true) is called before cancel() on fatal subscription errors, ensuring the readiness probe correctly reflects the unhealthy state.
…ances Call w.Result() once and reuse the response object for both status code assertion and body decoding to ensure consistency.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @pkg/health/server_test.go:
- Around line 1-284: Add tests covering the Server Start() and Shutdown()
lifecycle: create a Server via NewServer(...), start it in a goroutine with
Start(), verify it begins listening by performing an HTTP request to /healthz or
/readyz, then call Shutdown() and assert the server stops accepting connections
(requests fail or return context canceled) and that IsShuttingDown()/IsReady()
reflect expected state transitions; also test graceful shutdown behavior by
setting SetShuttingDown(true) and ensuring subsequent requests return the
shutdown response.
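A self-contained illustration of the requested lifecycle-test shape, using a plain http.Server as a stand-in (the real test would construct this PR's health.Server and also assert IsReady/IsShuttingDown; the port is arbitrary):

```go
package health_test

import (
	"context"
	"net/http"
	"testing"
	"time"
)

func TestLifecycleSketch(t *testing.T) {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) { w.WriteHeader(http.StatusOK) })
	srv := &http.Server{Addr: "127.0.0.1:18080", Handler: mux}
	go func() { _ = srv.ListenAndServe() }()

	// Poll until the listener is up instead of assuming it started instantly.
	var err error
	deadline := time.Now().Add(2 * time.Second)
	for time.Now().Before(deadline) {
		var resp *http.Response
		resp, err = http.Get("http://127.0.0.1:18080/healthz")
		if err == nil {
			_ = resp.Body.Close()
			break
		}
		time.Sleep(10 * time.Millisecond)
	}
	if err != nil {
		t.Fatalf("server never came up: %v", err)
	}

	// Graceful shutdown: subsequent requests should fail to connect.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		t.Fatalf("shutdown: %v", err)
	}
	if _, err := http.Get("http://127.0.0.1:18080/healthz"); err == nil {
		t.Fatal("expected connection error after shutdown")
	}
}
```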
🧹 Nitpick comments (2)
pkg/health/server_test.go (2)
176-179: Minor: Response body not closed.

The first w.Result() call creates a response whose body is never closed. While this won't cause issues in tests, it's inconsistent with the pattern used elsewhere in this file.

♻️ Suggested fix:

```diff
 req := httptest.NewRequest(http.MethodGet, "/readyz", nil)
 w := httptest.NewRecorder()
 server.readyzHandler(w, req)
-assert.Equal(t, http.StatusOK, w.Result().StatusCode)
+resp := w.Result()
+assert.Equal(t, http.StatusOK, resp.StatusCode)
+_ = resp.Body.Close()
```
222-225: Minor: Same unclosed response body pattern.

Similar to the previous test, the response body from the first assertion is not closed.

♻️ Suggested fix:

```diff
 req := httptest.NewRequest(http.MethodGet, "/readyz", nil)
 w := httptest.NewRecorder()
 server.readyzHandler(w, req)
-assert.Equal(t, http.StatusOK, w.Result().StatusCode)
+resp := w.Result()
+assert.Equal(t, http.StatusOK, resp.StatusCode)
+_ = resp.Body.Close()
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
pkg/health/server_test.go
🔇 Additional comments (11)
pkg/health/server_test.go (11)
1-13: LGTM! Imports are appropriate for the test file. Good use of require for fatal assertions and assert for non-fatal ones.
15-29: LGTM! Clean no-op mock implementation. The self-returning pattern for With/WithFields/Without correctly maintains chainability.
31-50: LGTM! Well-structured test with proper resource cleanup and comprehensive assertions for status code, content type, and response body.
52-75: LGTM! Good coverage of the default not-ready state, including verification of individual check statuses (config and broker).
77-101: LGTM! Properly tests the ready state after both SetConfigLoaded() and SetReady(true) are called.
103-120: LGTM! Good state transition testing with clear progression through the readiness lifecycle.
122-144: LGTM! Good test of the custom check extensibility feature.
146-167: LGTM! Important edge case testing partial readiness where config is loaded but broker connection is not established.
200-213: LGTM! Good coverage of the shutdown flag state transitions.
249-260: LGTM! Critical test verifying that the shutdown flag takes precedence over check statuses in IsReady().
262-284: LGTM! Good test ensuring the shutdown message has priority over the generic "not ready" message in the response.
Add tests covering:
- Server Start() and Shutdown() lifecycle
- HTTP requests to /healthz and /readyz while server is running
- Graceful shutdown state transitions (SetShuttingDown)
- Verify /readyz returns 503 during shutdown while /healthz stays 200
- Verify server stops accepting connections after Shutdown()
Actionable comments posted: 0
🧹 Nitpick comments (2)
pkg/health/server_test.go (2)
170-199: Minor inconsistency in response body handling.

Line 180 calls w.Result() without closing the body, unlike other test methods in this file. While httptest.ResponseRecorder returns a body backed by bytes.Reader (where Close() is a no-op), it's good practice to maintain consistency.

♻️ Suggested fix for consistency:

```diff
-server.readyzHandler(w, req)
-assert.Equal(t, http.StatusOK, w.Result().StatusCode)
+server.readyzHandler(w, req)
+resp := w.Result()
+defer func() { _ = resp.Body.Close() }()
+assert.Equal(t, http.StatusOK, resp.StatusCode)
```
287-325: Consider dynamic port allocation for parallel test safety.

Hardcoded ports (18080-18083) could cause flaky tests if run with -parallel. Using port 0 with net.Listen lets the OS assign a free port.

♻️ Example approach for dynamic port allocation:

```go
// In NewServer or test setup, use port "0" to let the OS assign a free port,
// then retrieve the actual port from the listener.
listener, err := net.Listen("tcp", ":0")
if err != nil {
	t.Fatal(err)
}
actualPort := listener.Addr().(*net.TCPAddr).Port
```

This would require refactoring Server to accept a listener or expose the bound address, which may be out of scope for this PR.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
pkg/health/server_test.go
🧰 Additional context used
🧬 Code graph analysis (1)
pkg/health/server_test.go (2)
- pkg/logger/logger.go (1): Logger (14-40)
- pkg/health/server.go (5): NewServer (54-76), HealthResponse (25-28), ReadyResponse (31-35), CheckError (21-21), CheckOK (19-19)
🔇 Additional comments (5)
pkg/health/server_test.go (5)
1-14: LGTM! Imports are appropriate for testing HTTP handlers with JSON responses. Good use of testify/assert and testify/require for assertions.
16-30: LGTM! The mockLogger correctly implements all methods of the logger.Logger interface with no-op behavior, which is appropriate for unit tests focused on HTTP behavior rather than logging verification.
32-102: LGTM! Good test coverage for the healthz and readyz handlers. Tests correctly verify:
- HTTP status codes (200 OK, 503 Service Unavailable)
- Content-Type header
- JSON response structure and values
- Proper resource cleanup with deferred body close
201-285: LGTM! Excellent coverage of graceful shutdown behavior:
- Tests IsShuttingDown() state transitions
- Verifies /readyz returns 503 with "server is shutting down" message
- Confirms shutdown flag takes priority over check status
- Validates that checks are not included in response during shutdown
364-420: LGTM! TestServerLifecycle_GracefulShutdownStateTransitions provides excellent coverage of the graceful shutdown sequence described in the PR objectives:
- Verifies server starts ready
- Confirms /readyz returns 503 after SetShuttingDown(true)
- Validates /healthz (liveness) remains 200 during shutdown
- Properly asserts state transitions via IsReady() and IsShuttingDown()
charts/values.yaml (outdated diff context)

```yaml
periodSeconds: 20
timeoutSeconds: 5
```
Any particular reason not to go with the k8s defaults?
https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/core/v1/defaults_test.go
Nope. So I've updated it to use the defaults.
Quoted diff context:

```go
// Start health server immediately (readiness starts as false)
healthServer := health.NewServer(log, HealthServerPort, adapterConfig.Metadata.Name)
if err := healthServer.Start(ctx); err != nil {
```
If the healthServer is started with a Context.... should that be used to cancel it during shutdown?
I mean, should we listen for the signals and call the context.cancel function?
The context passed to Start() is intentionally used only for logging, not for lifecycle control.
Using explicit Shutdown() is preferred because:
- http.Server.Shutdown() does graceful shutdown - it stops accepting new connections and waits for in-flight requests to complete
- Context cancellation would be more abrupt (similar to Close())
- Explicit shutdown allows us to control the order: mark not-ready → shutdown health server → shutdown metrics → cleanup
The current pattern follows the standard Go HTTP server lifecycle: start in goroutine, shutdown explicitly when signaled.
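Sketched out, the ordering described in this reply looks like the following (healthSrv and metricsSrv are stand-ins for the PR's wrappers, and the 5s timeout mirrors the constants discussed elsewhere in the review):

```go
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func run(healthSrv, metricsSrv *http.Server, setNotReady func()) {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)
	<-sigCh

	// 1. Mark not ready so Kubernetes removes the pod from Service endpoints.
	setNotReady()

	// 2. Gracefully stop the health server: Shutdown drains in-flight requests,
	//    unlike Close or a cancelled base context.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_ = healthSrv.Shutdown(ctx)

	// 3. Then the metrics server, then any remaining cleanup.
	_ = metricsSrv.Shutdown(ctx)
}
```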
…Ready to SetBrokerReady
- Remove custom probe values (initialDelaySeconds, periodSeconds, etc.) to use K8s defaults
- Rename SetReady to SetBrokerReady for consistency with SetConfigLoaded
- Remove misleading "backward compatibility" comment
Actionable comments posted: 0
🧹 Nitpick comments (3)
cmd/adapter/main.go (1)
223-241: Consider using a dedicated constant for metrics server shutdown timeout.

The metrics server shutdown uses HealthServerShutdownTimeout (line 235). While both servers likely have similar shutdown requirements, using a dedicated MetricsServerShutdownTimeout constant would improve clarity and allow independent tuning if needed.

Suggested change:

```diff
 // Timeout constants
 const (
 	// OTelShutdownTimeout is the timeout for gracefully shutting down the OpenTelemetry TracerProvider
 	OTelShutdownTimeout = 5 * time.Second
 	// HealthServerShutdownTimeout is the timeout for gracefully shutting down the health server
 	HealthServerShutdownTimeout = 5 * time.Second
+	// MetricsServerShutdownTimeout is the timeout for gracefully shutting down the metrics server
+	MetricsServerShutdownTimeout = 5 * time.Second
 )
```

Then use MetricsServerShutdownTimeout at line 235.

pkg/health/server_test.go (1)
287-325: Consider using a helper to wait for server readiness.

Using time.Sleep(50 * time.Millisecond) for server startup is common but fragile. A polling approach would be more reliable if test flakiness becomes an issue.

Polling helper example:

```go
func waitForServer(t *testing.T, url string, timeout time.Duration) {
	t.Helper()
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := http.Get(url)
		if err == nil {
			_ = resp.Body.Close()
			return
		}
		time.Sleep(10 * time.Millisecond)
	}
	t.Fatalf("server did not become ready at %s within %v", url, timeout)
}
```

pkg/health/server.go (1)
78-90: Server startup errors are not propagated to the caller.

If ListenAndServe fails (e.g., port already in use), the error is logged but Start() returns nil. The caller has no indication that the server failed to bind. This is a common pattern in Go but can make debugging harder during deployment. Consider using a ready channel or brief delay to verify startup success.

Alternative with startup verification:

```go
func (s *Server) Start(ctx context.Context) error {
	s.log.Infof(ctx, "Starting health server on port %s", s.port)

	errCh := make(chan error, 1)
	go func() {
		if err := s.server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			errCtx := logger.WithErrorField(ctx, err)
			s.log.Errorf(errCtx, "Health server error")
			select {
			case errCh <- err:
			default:
			}
		}
	}()

	// Brief wait to catch immediate startup failures (e.g., port in use)
	select {
	case err := <-errCh:
		return fmt.Errorf("health server failed to start: %w", err)
	case <-time.After(100 * time.Millisecond):
		return nil
	}
}
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- charts/values.yaml
- cmd/adapter/main.go
- pkg/health/server.go
- pkg/health/server_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- charts/values.yaml
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-12-04T14:06:51.656Z
Learnt from: xueli181114
Repo: openshift-hyperfleet/hyperfleet-adapter PR: 13
File: internal/executor/post_action_executor.go:198-205
Timestamp: 2025-12-04T14:06:51.656Z
Learning: Logger refactor is tracked in HYPERFLEET-304 for the hyperfleet-adapter repository, which will address how CEL evaluation failures and similar errors are handled and logged.
Applied to files:
cmd/adapter/main.go
📚 Learning: 2025-12-11T02:06:33.975Z
Learnt from: xueli181114
Repo: openshift-hyperfleet/hyperfleet-adapter PR: 17
File: internal/config_loader/validator_schema.go:126-133
Timestamp: 2025-12-11T02:06:33.975Z
Learning: In the hyperfleet-adapter repository (Go), environment variable validation for required parameters (those with `env.*` source, `Required=true`, and no default) occurs at config load time in `internal/config_loader/validator_schema.go`. This is intentional design - environment variables must be set before deployment, and the validation ensures fail-fast behavior with clear error messages.
Applied to files:
cmd/adapter/main.go
🧬 Code graph analysis (2)
pkg/health/server_test.go (1)
- pkg/health/server.go (5): NewServer (54-76), HealthResponse (25-28), ReadyResponse (31-35), CheckError (21-21), CheckOK (19-19)

pkg/health/server.go (2)
- pkg/logger/logger.go (1): Logger (14-40)
- pkg/logger/context.go (1): WithErrorField (156-168)
🔇 Additional comments (14)
cmd/adapter/main.go (3)
205-221: LGTM! Health server integration follows proper lifecycle management. The health server initialization correctly:
- Starts before other critical components
- Marks config as loaded after successful initialization
- Uses deferred shutdown with proper timeout context
287-300: LGTM! Graceful shutdown follows HyperFleet standard. The signal handler correctly marks shuttingDown before context cancellation, ensuring /readyz returns 503 immediately upon receiving SIGTERM/SIGINT. This aligns with the HyperFleet Graceful Shutdown Standard documented in the PR objectives.
345-347: LGTM! Broker readiness marking is correctly positioned. Marking the health server as ready only after broker subscription is established ensures /readyz accurately reflects the adapter's ability to process events.

pkg/health/server_test.go (6)
1-14: LGTM! Well-structured test file with comprehensive imports.
16-30: LGTM! Mock logger correctly satisfies the interface. The mock implementation properly handles all methods, including With, WithFields, and Without returning the mock instance for chaining.
32-102: LGTM! Handler tests verify expected behavior. Tests correctly verify:
- /healthz always returns 200 with {"status":"ok"}
- /readyz returns 503 with check details when not ready
- /readyz returns 200 when all checks pass
104-168: LGTM! State transition tests are thorough. Tests properly verify:
- SetBrokerReady and SetConfigLoaded correctly update check statuses
- Partial readiness (config OK, broker error) returns 503
- Custom checks can be added via SetCheck
201-285: LGTM! Shutdown behavior tests are comprehensive. Tests verify critical shutdown requirements:
- SetShuttingDown toggles the atomic flag correctly
- /readyz returns 503 with "server is shutting down" message when flag is set
- Shutdown state takes priority over individual check statuses
- IsReady() returns false when shutting down, even if all checks pass
364-420: LGTM! Integration test validates graceful shutdown behavior end-to-end. This test correctly verifies the HyperFleet Graceful Shutdown Standard:
- Server transitions from ready to not-ready on shutdown flag
- /readyz returns 503 with shutdown message
- /healthz continues returning 200 during shutdown (liveness unaffected)

pkg/health/server.go (5)
14-35: LGTM! Well-defined types for health responses. The CheckStatus type and response structs are properly documented and use correct JSON tags with omitempty where appropriate.
37-51: LGTM! Thread-safe server struct design. Good use of atomic.Bool for the high-frequency shuttingDown flag (avoids lock contention on every /readyz request) and sync.RWMutex for the checks map.
53-76: LGTM! Constructor initializes with safe defaults. Good practices (a sketch follows below):
- Default checks start in error state (fail-safe)
- ReadHeaderTimeout set to 5s (prevents slowloris attacks)
- Routes registered to dedicated ServeMux
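A condensed sketch of a constructor along those lines (field names and handler wiring are illustrative assumptions, not the PR's exact code):

```go
package health

import (
	"net/http"
	"time"
)

type CheckStatus string

const (
	CheckOK    CheckStatus = "ok"
	CheckError CheckStatus = "error"
)

type Server struct {
	server *http.Server
	checks map[string]CheckStatus
}

// NewServer sketch: default checks start in the error state so the pod reports
// "not ready" until each dependency is explicitly marked healthy.
func NewServer(port string) *Server {
	s := &Server{
		checks: map[string]CheckStatus{
			"config": CheckError,
			"broker": CheckError,
		},
	}
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) { w.WriteHeader(http.StatusOK) })
	// /readyz handler omitted; it would aggregate s.checks as sketched earlier.
	s.server = &http.Server{
		Addr:              ":" + port,
		Handler:           mux,
		ReadHeaderTimeout: 5 * time.Second, // mitigates slow-header (slowloris) clients
	}
	return s
}
```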
132-148: LGTM! IsReady correctly prioritizes shutdown flag. The check order (atomic shutdown flag first, then mutex-protected checks) ensures fast-path shutdown detection without lock acquisition.
158-200: LGTM! readyzHandler implementation is correct and efficient. Good implementation details:
- Shutdown check is atomic (no lock needed)
- Checks map is copied under RLock, then lock released before I/O
- Response includes detailed check statuses for debugging
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rh-amarin. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Summary
- /healthz, /readyz and /metrics endpoints per HyperFleet Health Endpoints Standard
- /readyz to return 503 during graceful shutdown per HyperFleet Graceful Shutdown Standard

HYPERFLEET-448: Health Endpoints
- /healthz (liveness) on port 8080 - always returns 200 OK
- /readyz (readiness) on port 8080 - returns 200 when config loaded and broker connected
- /metrics (Prometheus) on port 9090 - exposes build info and standard Go metrics

HYPERFLEET-449: Graceful Shutdown Readiness
- shuttingDown flag added to health server
- /readyz immediately returns 503 when SIGTERM/SIGINT is received

Test plan
- make test passes
- make lint passes

Summary by CodeRabbit
New Features
Infrastructure
Tests