
Conversation

@rafabene
Contributor

@rafabene rafabene commented Jan 12, 2026

Summary

  • Add /healthz, /readyz and /metrics endpoints per HyperFleet Health Endpoints Standard
  • Set /readyz to return 503 during graceful shutdown per HyperFleet Graceful Shutdown Standard

HYPERFLEET-448: Health Endpoints

  • /healthz (liveness) on port 8080 - always returns 200 OK
  • /readyz (readiness) on port 8080 - returns 200 when config loaded and broker connected
  • /metrics (Prometheus) on port 9090 - exposes build info and standard Go metrics

HYPERFLEET-449: Graceful Shutdown Readiness

  • Add atomic shuttingDown flag to health server
  • /readyz immediately returns 503 when SIGTERM/SIGINT is received
  • Shutdown sequence follows standard: mark not ready → stop accepting → drain → cleanup → exit (a sketch of this ordering follows below)
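
A minimal, standard-library sketch of that shutdown ordering (illustrative only; the adapter's actual wiring lives in cmd/adapter/main.go and pkg/health, and the flag here merely stands in for the health server's shutting-down state):

// Sketch only: shows the ordering mark not ready → stop accepting → drain → cleanup → exit.
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

var shuttingDown atomic.Bool // stand-in for the health server's shutting-down flag

func readyz(w http.ResponseWriter, _ *http.Request) {
	if shuttingDown.Load() {
		http.Error(w, "shutting down", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/readyz", readyz)
	srv := &http.Server{Addr: ":8080", Handler: mux, ReadHeaderTimeout: 5 * time.Second}
	go func() { _ = srv.ListenAndServe() }()

	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)
	<-sigCh

	shuttingDown.Store(true) // 1. mark not ready: /readyz now returns 503

	// 2-3. stop accepting new connections and drain in-flight requests
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_ = srv.Shutdown(ctx)

	// 4-5. cleanup (close broker connection, flush telemetry) would go here, then exit
}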

Test plan

  • Unit tests for health endpoints
  • Unit tests for shutdown state transitions
  • make test passes
  • make lint passes

Summary by CodeRabbit

  • New Features

    • Added standalone HTTP health and Prometheus metrics endpoints: /healthz, /readyz and /metrics with up and build-info metrics.
  • Infrastructure

    • Probes enabled and exposed on explicit ports; liveness, readiness, and startup probes are active, and readiness reflects broker/config state and honors shutdown signaling.
  • Tests

    • Comprehensive tests covering health endpoints, readiness/shutdown transitions and server lifecycle.


Add HTTP health and metrics endpoints per HyperFleet standards:
- /healthz on port 8080 - liveness probe endpoint (always returns 200)
- /readyz on port 8080 - readiness probe endpoint (503 during startup/shutdown, 200 when ready)
- /metrics on port 9090 - Prometheus metrics endpoint

The readiness state is managed atomically:
- Starts as not ready (503)
- Becomes ready after broker subscription is established (200)
- Returns to not ready when SIGTERM/SIGINT is received (503)

Helm chart updated to enable liveness and readiness probes by default.
@openshift-ci openshift-ci bot requested review from rh-amarin and vkareh January 12, 2026 15:27
@coderabbitai

coderabbitai bot commented Jan 12, 2026

Walkthrough

Adds a concurrent-safe HTTP health server with /healthz and /readyz, per-check statuses, config-loaded and shutting-down flags, and graceful Start/Shutdown. Adds a Prometheus metrics server exposing /metrics with up and build_info gauges. The adapter's main starts both servers, marks config loaded on startup, updates broker readiness after subscription, marks not-ready before shutdown, and exposes new health/metrics ports and shutdown timeout constants. The Helm chart enables container ports 8080 (http) and 9090 (metrics) and activates liveness, readiness, and startup probes. go.mod gains Prometheus dependencies, and comprehensive unit tests for the health server are added.

Sequence Diagram(s)

sequenceDiagram
    participant Main as Application/Main
    participant HS as Health Server
    participant MS as Metrics Server
    participant Broker as Broker
    participant Sig as Signal Handler

    Main->>HS: NewServer(...) / Start(ctx)
    HS-->>Main: listening on :8080 (/healthz, /readyz)

    Main->>MS: NewMetricsServer(...) / Start(ctx)
    MS-->>Main: listening on :9090 (/metrics)

    Main->>Broker: Subscribe()
    Broker-->>Main: subscription established
    Main->>HS: SetConfigLoaded()
    Main->>HS: SetCheck("broker", ok)
    Main->>HS: SetBrokerReady(true)

    Sig-->>Main: shutdown signal
    Main->>HS: SetBrokerReady(false)
    Main->>HS: SetShuttingDown(true)
    Main->>HS: Shutdown(ctx)
    Main->>MS: Shutdown(ctx)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

lgtm

Suggested reviewers

  • AlexVulaj
  • rh-amarin
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed
❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 15.79%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check ✅ Passed: The title clearly and specifically describes the main changes: adding health endpoints and graceful shutdown readiness. It references the relevant issue numbers and uses appropriate terminology that accurately reflects the changeset.



📜 Recent review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e6d1e66 and 117cdc7.

📒 Files selected for processing (1)
  • charts/templates/deployment.yaml
🔇 Additional comments (3)
charts/templates/deployment.yaml (3)

185-199: LGTM — Clean conditional rendering for liveness probe timing.

Using {{- with }} blocks to conditionally emit probe timing parameters is idiomatic Helm and correctly defers to Kubernetes defaults when values aren't specified. This reduces chart maintenance burden and avoids embedding potentially stale defaults.


222-236: LGTM — Consistent pattern with liveness probe.

Same conditional rendering approach applied consistently to readiness probe timing parameters.


259-273: LGTM — Consistent pattern for startup probe.

Same conditional rendering approach. Startup probes particularly benefit from deferring to Kubernetes defaults, as users can customize failureThreshold for slow-starting applications without the chart imposing constraints.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (5)
pkg/health/server_test.go (1)

111-129: Minor: Response bodies not closed in this test.

Lines 120 and 128 call w.Result() but don't close the response body, unlike other tests. While this is unlikely to cause issues in tests, it's inconsistent with the pattern used elsewhere in the file.

Suggested fix
 	req := httptest.NewRequest(http.MethodGet, "/readyz", nil)
 	w := httptest.NewRecorder()
 	server.readyzHandler(w, req)
-	assert.Equal(t, http.StatusOK, w.Result().StatusCode)
+	resp := w.Result()
+	defer resp.Body.Close()
+	assert.Equal(t, http.StatusOK, resp.StatusCode)

 	// Set not ready (simulating shutdown)
 	server.SetReady(false)

 	req = httptest.NewRequest(http.MethodGet, "/readyz", nil)
 	w = httptest.NewRecorder()
 	server.readyzHandler(w, req)
-	assert.Equal(t, http.StatusServiceUnavailable, w.Result().StatusCode)
+	resp = w.Result()
+	defer resp.Body.Close()
+	assert.Equal(t, http.StatusServiceUnavailable, resp.StatusCode)
pkg/health/metrics.go (1)

35-47: Start() always returns nil, masking potential startup errors.

The method launches the server in a goroutine and immediately returns nil. If ListenAndServe fails (e.g., port already in use), the error is only logged asynchronously and the caller proceeds as if startup succeeded. This mirrors the pattern in server.go, so it's consistent within this PR.

For improved reliability, consider using a synchronization mechanism (e.g., a channel or sync.WaitGroup) to confirm the server is actually listening before returning, or at least document this behavior.

Example approach using a ready channel
func (s *MetricsServer) Start(ctx context.Context) error {
	s.log.Infof(ctx, "Starting metrics server on port %s", s.port)

	errCh := make(chan error, 1)
	go func() {
		if err := s.server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			errCtx := logger.WithErrorField(ctx, err)
			s.log.Errorf(errCtx, "Metrics server error")
			select {
			case errCh <- err:
			default:
			}
		}
	}()

	// Give the server a moment to fail fast on port conflicts
	select {
	case err := <-errCh:
		return err
	case <-time.After(100 * time.Millisecond):
		return nil
	}
}
pkg/health/server.go (3)

25-25: Unused field component.

The component field is stored in the struct but never used in this file. Consider removing it if it's not needed, or use it in log messages to identify which component's health is being reported.


40-44: Consider adding ReadTimeout and WriteTimeout for defense in depth.

While ReadHeaderTimeout is set (good for preventing slowloris attacks), adding ReadTimeout and WriteTimeout would provide additional protection against slow clients holding connections open.

💡 Suggested enhancement
 s.server = &http.Server{
 	Addr:              ":" + port,
 	Handler:           mux,
 	ReadHeaderTimeout: 5 * time.Second,
+	ReadTimeout:       10 * time.Second,
+	WriteTimeout:      10 * time.Second,
 }

50-61: Startup errors are not propagated to the caller.

The goroutine logs errors but doesn't communicate them back. If the server fails to bind (e.g., port already in use), the caller won't know. Consider using a channel to report startup success/failure, or at minimum document this behavior.

💡 Alternative pattern with startup error handling
 // Start starts the health server in a goroutine.
-func (s *Server) Start(ctx context.Context) error {
+// Returns an error channel that will receive any server errors.
+func (s *Server) Start(ctx context.Context) <-chan error {
 	s.log.Infof(ctx, "Starting health server on port %s", s.port)
+	errCh := make(chan error, 1)
 
 	go func() {
 		if err := s.server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
 			errCtx := logger.WithErrorField(ctx, err)
 			s.log.Errorf(errCtx, "Health server error")
+			errCh <- err
 		}
+		close(errCh)
 	}()
 
-	return nil
+	return errCh
 }

Alternatively, if keeping the current signature, document that startup failures are only logged.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4d9156d and 1c49647.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (6)
  • charts/values.yaml
  • cmd/adapter/main.go
  • go.mod
  • pkg/health/metrics.go
  • pkg/health/server.go
  • pkg/health/server_test.go
🧰 Additional context used
🧬 Code graph analysis (3)
pkg/health/server_test.go (3)
pkg/logger/logger.go (1)
  • Logger (14-40)
pkg/health/server.go (2)
  • NewServer (29-47)
  • Response (14-17)
internal/config_loader/types.go (1)
  • Header (172-175)
cmd/adapter/main.go (5)
pkg/health/server.go (1)
  • NewServer (29-47)
internal/config_loader/types.go (1)
  • Metadata (67-71)
pkg/logger/context.go (1)
  • WithErrorField (156-168)
internal/hyperfleet_api/client.go (1)
  • WithTimeout (70-74)
pkg/health/metrics.go (1)
  • NewMetricsServer (20-33)
pkg/health/metrics.go (3)
pkg/health/server.go (1)
  • Server (20-26)
pkg/logger/logger.go (1)
  • Logger (14-40)
pkg/logger/context.go (1)
  • WithErrorField (156-168)
🔇 Additional comments (17)
charts/values.yaml (2)

36-43: LGTM!

Container port definitions correctly expose the health server (8080) and metrics server (9090) endpoints, aligning with the constants defined in cmd/adapter/main.go.


89-122: LGTM!

Probe configurations are well-structured:

  • Liveness probe with 30s initial delay allows time for full initialization
  • Readiness probe with 5s initial delay and shorter intervals enables quick traffic routing once ready
  • Startup probe (disabled by default) with high failureThreshold (30) is useful for slow-starting scenarios

The paths (/healthz, /readyz) and port (8080) correctly match the health server implementation.

go.mod (1)

104-107: LGTM!

Prometheus client libraries are correctly added as indirect dependencies to support the metrics server. The versions appear current.

cmd/adapter/main.go (5)

50-60: LGTM!

Constants are well-organized and clearly documented. Port values align with the Helm chart configuration.


205-235: Health and metrics server lifecycle is well-implemented.

Good practices observed:

  • Health server starts early with readiness=false (correct initial state)
  • Deferred shutdowns with explicit timeout contexts
  • Error handling properly enriches context for logging

One observation: both Start() methods return immediately before the server goroutine actually binds to the port. If port binding fails (e.g., port already in use), the error is only logged asynchronously in the goroutine. Consider verifying the server is actually listening before proceeding, though for most deployments this is acceptable.


281-293: LGTM!

Correct graceful shutdown pattern - marking the service as not ready immediately upon receiving a termination signal ensures Kubernetes stops routing traffic before the actual shutdown begins.


338-340: LGTM!

Correctly marks the adapter as ready only after the broker subscription is successfully established, ensuring /readyz returns 503 during the initialization phase.


370-376: LGTM!

Properly marks the service as not ready before initiating shutdown on fatal subscription errors, maintaining consistency with the graceful shutdown pattern.

pkg/health/server_test.go (3)

15-29: LGTM!

The mock logger correctly implements all methods from the logger.Logger interface, providing a clean test double.


31-50: LGTM!

Good test coverage for the /healthz endpoint, verifying both the HTTP status code and JSON response structure.


52-94: LGTM!

Comprehensive tests for the /readyz endpoint covering both ready and not-ready states with proper JSON response validation.

pkg/health/metrics.go (2)

12-33: LGTM!

Clean implementation with proper ReadHeaderTimeout to prevent slowloris attacks. The use of promhttp.Handler() correctly exposes the default Prometheus registry.


49-53: LGTM!

Graceful shutdown correctly uses the provided context for timeout control.

pkg/health/server.go (4)

63-67: LGTM!

The shutdown implementation correctly uses context for timeout control and propagates errors to the caller.


69-77: LGTM!

Atomic operations ensure thread-safe readiness state management.


79-85: LGTM!

The liveness handler correctly returns 200 OK unconditionally, which is the expected behavior for Kubernetes liveness probes.


87-104: LGTM!

The readiness handler correctly returns 503 when not ready and 200 when ready, which properly implements Kubernetes readiness probe semantics.

Update health endpoints and metrics to comply with HyperFleet standards:

/readyz response format:
- Now returns detailed checks per standard: {"status": "ok", "checks": {"config": "ok", "broker": "ok"}}
- Tracks individual check states (config, broker) instead of single boolean

Probe timing (per health-endpoints.md):
- Liveness: initialDelaySeconds=15, periodSeconds=20
- Readiness: initialDelaySeconds=5, periodSeconds=10

Metrics (per metrics.md):
- Added hyperfleet_adapter_build_info{component, version, commit}
- Added hyperfleet_adapter_up{component, version}
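
A small, self-contained sketch of registering those two gauges with the Prometheus Go client (the component/version/commit values are placeholders and the adapter's real metrics.go may differ):

package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// A dedicated registry keeps the sketch isolated from the default registry.
	reg := prometheus.NewRegistry()

	buildInfo := prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "hyperfleet_adapter_build_info",
		Help: "Build information for the adapter",
	}, []string{"component", "version", "commit"})

	up := prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "hyperfleet_adapter_up",
		Help: "Whether the adapter is up",
	}, []string{"component", "version"})

	reg.MustRegister(buildInfo, up)

	// Set once at startup; placeholder build metadata.
	buildInfo.WithLabelValues("example-adapter", "v0.0.0", "abc1234").Set(1)
	up.WithLabelValues("example-adapter", "v0.0.0").Set(1)

	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))

	srv := &http.Server{Addr: ":9090", Handler: mux, ReadHeaderTimeout: 5 * time.Second}
	_ = srv.ListenAndServe()
}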

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @pkg/health/server_test.go:
- Around line 188-193: The test calls w.Result() multiple times, which creates
distinct http.Response objects and can cause the body to be read from a
different instance than the one asserted on. Fix by calling w.Result() once
(e.g., res := w.Result()) and reusing res for both the assertions and the
decoding: decode ReadyResponse from res.Body, then assert
response.Checks["config"] == CheckOK and response.Checks["broker"] == CheckError,
so the decoded ReadyResponse and the status assertions come from the same http.Response.
🧹 Nitpick comments (5)
pkg/health/metrics.go (1)

52-54: MustRegister will panic if called multiple times.

Using prometheus.MustRegister with the default registry will panic if metrics are already registered (e.g., during tests or if NewMetricsServer is called multiple times). Consider using a custom registry to isolate metrics per server instance.

♻️ Suggested refactor using a custom registry
 func NewMetricsServer(log logger.Logger, port string, cfg MetricsConfig) *MetricsServer {
+	// Use a custom registry to avoid panics on duplicate registration
+	registry := prometheus.NewRegistry()
+
 	// Create build_info metric per HyperFleet metrics standard
 	buildInfo := prometheus.NewGaugeVec(
 		prometheus.GaugeOpts{
 			Name: "hyperfleet_adapter_build_info",
 			Help: "Build information for the adapter",
 		},
 		[]string{"component", "version", "commit"},
 	)

 	// ... (upGauge creation remains the same)

 	// Register metrics
-	prometheus.MustRegister(buildInfo)
-	prometheus.MustRegister(upGauge)
+	registry.MustRegister(buildInfo)
+	registry.MustRegister(upGauge)

 	// ... (metric initialization remains the same)

 	mux := http.NewServeMux()
-	mux.Handle("/metrics", promhttp.Handler())
+	mux.Handle("/metrics", promhttp.HandlerFor(registry, promhttp.HandlerOpts{}))
pkg/health/server_test.go (1)

176-179: Unclosed response body on first request.

The response body from the first request (Line 179) is not closed before making the second request. While this is a test, it's good practice to close response bodies to avoid resource leaks and ensure consistent test behavior.

♻️ Suggested fix
 	req := httptest.NewRequest(http.MethodGet, "/readyz", nil)
 	w := httptest.NewRecorder()
 	server.readyzHandler(w, req)
-	assert.Equal(t, http.StatusOK, w.Result().StatusCode)
+	resp := w.Result()
+	assert.Equal(t, http.StatusOK, resp.StatusCode)
+	resp.Body.Close()
cmd/adapter/main.go (1)

234-236: Metrics server uses HealthServerShutdownTimeout instead of a dedicated constant.

For clarity and future maintainability, consider defining a separate MetricsServerShutdownTimeout constant, even if it has the same value as HealthServerShutdownTimeout.

pkg/health/server.go (2)

127-133: JSON encoding error is not checked.

While unlikely to fail for simple structs, ignoring the error from json.NewEncoder(w).Encode() is not ideal. If encoding fails, the client receives incomplete JSON.

♻️ Suggested improvement
 func (s *Server) healthzHandler(w http.ResponseWriter, r *http.Request) {
 	w.Header().Set("Content-Type", "application/json")
 	w.WriteHeader(http.StatusOK)
-	json.NewEncoder(w).Encode(HealthResponse{Status: "ok"})
+	if err := json.NewEncoder(w).Encode(HealthResponse{Status: "ok"}); err != nil {
+		s.log.Warnf(r.Context(), "Failed to encode health response: %v", err)
+	}
 }

36-45: component field is stored but never used.

The component field is set in the constructor but not referenced anywhere in the Server methods. Consider removing it or using it in log messages for better observability.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1c49647 and 0329312.

📒 Files selected for processing (5)
  • charts/values.yaml
  • cmd/adapter/main.go
  • pkg/health/metrics.go
  • pkg/health/server.go
  • pkg/health/server_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • charts/values.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-11T02:06:33.975Z
Learnt from: xueli181114
Repo: openshift-hyperfleet/hyperfleet-adapter PR: 17
File: internal/config_loader/validator_schema.go:126-133
Timestamp: 2025-12-11T02:06:33.975Z
Learning: In the hyperfleet-adapter repository (Go), environment variable validation for required parameters (those with `env.*` source, `Required=true`, and no default) occurs at config load time in `internal/config_loader/validator_schema.go`. This is intentional design - environment variables must be set before deployment, and the validation ensures fail-fast behavior with clear error messages.

Applied to files:

  • cmd/adapter/main.go
🧬 Code graph analysis (3)
cmd/adapter/main.go (3)
pkg/health/server.go (1)
  • NewServer (48-70)
internal/config_loader/types.go (1)
  • Metadata (67-71)
pkg/health/metrics.go (2)
  • NewMetricsServer (30-76)
  • MetricsConfig (23-27)
pkg/health/server.go (2)
pkg/logger/logger.go (1)
  • Logger (14-40)
pkg/logger/context.go (1)
  • WithErrorField (156-168)
pkg/health/metrics.go (3)
pkg/health/server.go (1)
  • Server (37-45)
pkg/logger/logger.go (1)
  • Logger (14-40)
pkg/logger/context.go (1)
  • WithErrorField (156-168)
🔇 Additional comments (8)
pkg/health/metrics.go (2)

78-90: Start method errors are not propagated to caller.

The Start method always returns nil, even if the server fails to bind to the port. Since ListenAndServe runs in a goroutine, immediate binding errors (like "address already in use") are logged but not returned to the caller. This could cause silent failures during startup.

Consider adding a brief startup delay or using a channel to detect early binding failures if this is a concern for your deployment.


62-75: LGTM on server configuration.

Good practices applied: ReadHeaderTimeout is set to mitigate Slowloris attacks, and the HTTP mux setup is straightforward.

pkg/health/server_test.go (1)

103-120: Good coverage of readiness state transitions.

The test properly validates the multi-check readiness model: config and broker checks must both pass for IsReady() to return true.

cmd/adapter/main.go (3)

205-221: Well-structured health server lifecycle integration.

The health server starts early in the initialization sequence, and SetConfigLoaded() is appropriately called after successful config loading. The deferred shutdown with a dedicated timeout context ensures graceful cleanup.


287-293: Good graceful shutdown handling.

Marking SetReady(false) immediately upon receiving a shutdown signal ensures Kubernetes stops routing traffic before the pod terminates. This aligns with graceful shutdown best practices.


344-346: Readiness signaling after broker subscription is correct.

Setting readiness only after the broker subscription is successfully established ensures the pod won't receive traffic until it can actually process events.

pkg/health/server.go (2)

138-150: Good thread-safe implementation of readiness check aggregation.

The read lock is properly acquired before iterating over the checks map, and the map is copied before releasing the lock. This prevents race conditions while keeping the critical section minimal.
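
For readers unfamiliar with the pattern, a compact copy-under-read-lock sketch (hypothetical names; not the repository's actual pkg/health code):

package health

import "sync"

type checkSet struct {
	mu     sync.RWMutex
	checks map[string]string // e.g. "config" -> "ok", "broker" -> "error"
}

// snapshot copies the map while holding only the read lock, so JSON encoding
// (or any other I/O) happens after the lock is released.
func (c *checkSet) snapshot() map[string]string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	out := make(map[string]string, len(c.checks))
	for name, status := range c.checks {
		out[name] = status
	}
	return out
}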


47-70: Well-designed constructor with sensible defaults.

Initializing checks to CheckError by default ensures the server starts in a "not ready" state, which is the correct behavior for Kubernetes readiness probes during startup.

Add atomic shuttingDown flag to health server that immediately returns
503 Service Unavailable on /readyz when SIGTERM/SIGINT is received.
This follows the HyperFleet Graceful Shutdown Standard: mark not ready
before context cancellation to allow Kubernetes to remove the pod from
Service endpoints before draining in-flight work.
@rafabene rafabene changed the title HYPERFLEET-448 - feat: add /healthz, /readyz and /metrics endpoints HYPERFLEET-448, HYPERFLEET-449 - feat: add health endpoints and graceful shutdown readiness Jan 12, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
pkg/health/server.go (1)

69-73: Consider adding ReadTimeout and WriteTimeout for defense in depth.

While ReadHeaderTimeout protects against slowloris attacks on headers, adding ReadTimeout and WriteTimeout would provide additional protection against slow body reads and slow clients. For health endpoints with small payloads, short timeouts (e.g., 10s) are reasonable.

♻️ Suggested enhancement
 s.server = &http.Server{
 	Addr:              ":" + port,
 	Handler:           mux,
 	ReadHeaderTimeout: 5 * time.Second,
+	ReadTimeout:       10 * time.Second,
+	WriteTimeout:      10 * time.Second,
 }
cmd/adapter/main.go (1)

234-241: Consider using a dedicated MetricsServerShutdownTimeout constant.

The metrics server shutdown uses HealthServerShutdownTimeout. While the value is likely appropriate for both, using a dedicated constant improves clarity and allows independent tuning if needed.

♻️ Suggested change

Add a constant:

// MetricsServerShutdownTimeout is the timeout for gracefully shutting down the metrics server
MetricsServerShutdownTimeout = 5 * time.Second

Then use it in the deferred shutdown:

-	shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), HealthServerShutdownTimeout)
+	shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), MetricsServerShutdownTimeout)
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0329312 and e11d9ba.

📒 Files selected for processing (3)
  • cmd/adapter/main.go
  • pkg/health/server.go
  • pkg/health/server_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/health/server_test.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-11T02:06:33.975Z
Learnt from: xueli181114
Repo: openshift-hyperfleet/hyperfleet-adapter PR: 17
File: internal/config_loader/validator_schema.go:126-133
Timestamp: 2025-12-11T02:06:33.975Z
Learning: In the hyperfleet-adapter repository (Go), environment variable validation for required parameters (those with `env.*` source, `Required=true`, and no default) occurs at config load time in `internal/config_loader/validator_schema.go`. This is intentional design - environment variables must be set before deployment, and the validation ensures fail-fast behavior with clear error messages.

Applied to files:

  • cmd/adapter/main.go
🧬 Code graph analysis (1)
pkg/health/server.go (2)
pkg/logger/logger.go (1)
  • Logger (14-40)
pkg/logger/context.go (1)
  • WithErrorField (156-168)
🔇 Additional comments (11)
pkg/health/server.go (6)

14-35: LGTM!

The type definitions are clean and follow Go conventions. Using a string type for CheckStatus with constants provides good type safety while allowing JSON serialization. The response structs are well-designed with appropriate omitempty tags.


37-51: LGTM!

Good use of atomic.Bool for the shutdown flag (lock-free fast path) combined with sync.RWMutex for the checks map. This design allows the shutdown check to be performed without contention.


78-90: Start returns before server is confirmed listening.

Start() returns nil immediately while the server starts in a background goroutine. If ListenAndServe fails (e.g., port already in use), the error is only logged, not returned to the caller. This is acceptable for health probes in Kubernetes (probes start later), but callers should be aware the server may not be listening when Start() returns.

If startup failure detection is important, consider using a channel or net.Listen to confirm the port is bound before returning.
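
One possible shape of the net.Listen variant, sketched under the assumption that the health Server wraps an *http.Server and with a simplified signature (not the PR's actual Start):

package health

import (
	"fmt"
	"net"
	"net/http"
)

type Server struct {
	server *http.Server
}

// Start binds the listener synchronously so errors such as "address already
// in use" are returned to the caller, then serves in the background.
func (s *Server) Start() error {
	ln, err := net.Listen("tcp", s.server.Addr)
	if err != nil {
		return fmt.Errorf("health server failed to bind %s: %w", s.server.Addr, err)
	}
	go func() {
		if err := s.server.Serve(ln); err != nil && err != http.ErrServerClosed {
			fmt.Println("health server error:", err) // real code would use the structured logger
		}
	}()
	return nil
}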


133-149: LGTM!

The IsReady implementation correctly checks the atomic shutdown flag first (fast path without locking), then acquires a read lock to check the checks map. This is an efficient and thread-safe pattern.


151-157: LGTM!

Simple liveness probe that returns 200 if the process is alive - correct behavior. The ignored json.Encode error is acceptable since headers are already written and there's no meaningful recovery.


162-202: LGTM!

The handler correctly:

  1. Checks the atomic shutdown flag first for fast 503 response
  2. Copies the checks map under read lock, then releases before encoding
  3. Returns detailed check states for debugging when not ready

Good implementation following the HyperFleet Graceful Shutdown Standard.

cmd/adapter/main.go (5)

50-60: LGTM!

Well-documented constants for server ports and shutdown timeouts. Using string type for ports aligns with the net/http API expectations.


205-221: LGTM!

Health server is correctly initialized early (after logging/OTel setup) so probes are available quickly. SetConfigLoaded() is called immediately after successful start since config was already validated. The deferred shutdown with timeout ensures graceful termination.


287-300: LGTM!

Excellent implementation of the HyperFleet Graceful Shutdown Standard. The SetShuttingDown(true) call before cancel() ensures /readyz returns 503 immediately, allowing Kubernetes to stop routing traffic before the context cancellation propagates to in-flight requests.


345-348: LGTM!

Readiness is correctly set only after the broker subscription is successfully established, ensuring /readyz returns 503 during the startup phase until the adapter is fully operational.


377-383: LGTM!

Consistent with the signal handling pattern - SetShuttingDown(true) is called before cancel() on fatal subscription errors, ensuring the readiness probe correctly reflects the unhealthy state.

…ances

Call w.Result() once and reuse the response object for both status code
assertion and body decoding to ensure consistency.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @pkg/health/server_test.go:
- Around line 1-284: Add tests covering the Server Start() and Shutdown()
lifecycle: create a Server via NewServer(...), start it in a goroutine with
Start(), verify it begins listening by performing an HTTP request to /healthz or
/readyz, then call Shutdown() and assert the server stops accepting connections
(requests fail or return context canceled) and that IsShuttingDown()/IsReady()
reflect expected state transitions; also test graceful shutdown behavior by
setting SetShuttingDown(true) and ensuring subsequent requests return the
shutdown response.
🧹 Nitpick comments (2)
pkg/health/server_test.go (2)

176-179: Minor: Response body not closed.

The first w.Result() call creates a response whose body is never closed. While this won't cause issues in tests, it's inconsistent with the pattern used elsewhere in this file.

♻️ Suggested fix
 	req := httptest.NewRequest(http.MethodGet, "/readyz", nil)
 	w := httptest.NewRecorder()
 	server.readyzHandler(w, req)
-	assert.Equal(t, http.StatusOK, w.Result().StatusCode)
+	resp := w.Result()
+	assert.Equal(t, http.StatusOK, resp.StatusCode)
+	_ = resp.Body.Close()

222-225: Minor: Same unclosed response body pattern.

Similar to the previous test, the response body from the first assertion is not closed.

♻️ Suggested fix
 	req := httptest.NewRequest(http.MethodGet, "/readyz", nil)
 	w := httptest.NewRecorder()
 	server.readyzHandler(w, req)
-	assert.Equal(t, http.StatusOK, w.Result().StatusCode)
+	resp := w.Result()
+	assert.Equal(t, http.StatusOK, resp.StatusCode)
+	_ = resp.Body.Close()
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e11d9ba and 2b44655.

📒 Files selected for processing (1)
  • pkg/health/server_test.go
🔇 Additional comments (11)
pkg/health/server_test.go (11)

1-13: LGTM!

Imports are appropriate for the test file. Good use of require for fatal assertions and assert for non-fatal ones.


15-29: LGTM!

Clean no-op mock implementation. The self-returning pattern for With/WithFields/Without correctly maintains chainability.


31-50: LGTM!

Well-structured test with proper resource cleanup and comprehensive assertions for status code, content type, and response body.


52-75: LGTM!

Good coverage of the default not-ready state, including verification of individual check statuses (config and broker).


77-101: LGTM!

Properly tests the ready state after both SetConfigLoaded() and SetReady(true) are called.


103-120: LGTM!

Good state transition testing with clear progression through the readiness lifecycle.


122-144: LGTM!

Good test of the custom check extensibility feature.


146-167: LGTM!

Important edge case testing partial readiness where config is loaded but broker connection is not established.


200-213: LGTM!

Good coverage of the shutdown flag state transitions.


249-260: LGTM!

Critical test verifying that the shutdown flag takes precedence over check statuses in IsReady().


262-284: LGTM!

Good test ensuring the shutdown message has priority over the generic "not ready" message in the response.

Add tests covering:
- Server Start() and Shutdown() lifecycle
- HTTP requests to /healthz and /readyz while server is running
- Graceful shutdown state transitions (SetShuttingDown)
- Verify /readyz returns 503 during shutdown while /healthz stays 200
- Verify server stops accepting connections after Shutdown()

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
pkg/health/server_test.go (2)

170-199: Minor inconsistency in response body handling.

Line 180 calls w.Result() without closing the body, unlike other test methods in this file. While httptest.ResponseRecorder returns a body backed by bytes.Reader (where Close() is a no-op), it's good practice to maintain consistency.

♻️ Suggested fix for consistency
-	server.readyzHandler(w, req)
-	assert.Equal(t, http.StatusOK, w.Result().StatusCode)
+	server.readyzHandler(w, req)
+	resp := w.Result()
+	defer func() { _ = resp.Body.Close() }()
+	assert.Equal(t, http.StatusOK, resp.StatusCode)

287-325: Consider dynamic port allocation for parallel test safety.

Hardcoded ports (18080-18083) could cause flaky tests if run with -parallel. Using port 0 with net.Listen lets the OS assign a free port.

♻️ Example approach for dynamic port allocation
// In NewServer or test setup, use port "0" to let OS assign a free port
// Then retrieve the actual port from the listener
listener, err := net.Listen("tcp", ":0")
if err != nil {
    t.Fatal(err)
}
actualPort := listener.Addr().(*net.TCPAddr).Port

This would require refactoring Server to accept a listener or expose the bound address, which may be out of scope for this PR.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2b44655 and 786f173.

📒 Files selected for processing (1)
  • pkg/health/server_test.go
🧰 Additional context used
🧬 Code graph analysis (1)
pkg/health/server_test.go (2)
pkg/logger/logger.go (1)
  • Logger (14-40)
pkg/health/server.go (5)
  • NewServer (54-76)
  • HealthResponse (25-28)
  • ReadyResponse (31-35)
  • CheckError (21-21)
  • CheckOK (19-19)
🔇 Additional comments (5)
pkg/health/server_test.go (5)

1-14: LGTM!

Imports are appropriate for testing HTTP handlers with JSON responses. Good use of testify/assert and testify/require for assertions.


16-30: LGTM!

The mockLogger correctly implements all methods of the logger.Logger interface with no-op behavior, which is appropriate for unit tests focused on HTTP behavior rather than logging verification.


32-102: LGTM!

Good test coverage for the healthz and readyz handlers. Tests correctly verify:

  • HTTP status codes (200 OK, 503 Service Unavailable)
  • Content-Type header
  • JSON response structure and values
  • Proper resource cleanup with deferred body close

201-285: LGTM!

Excellent coverage of graceful shutdown behavior:

  • Tests IsShuttingDown() state transitions
  • Verifies /readyz returns 503 with "server is shutting down" message
  • Confirms shutdown flag takes priority over check status
  • Validates that checks are not included in response during shutdown

364-420: LGTM!

TestServerLifecycle_GracefulShutdownStateTransitions provides excellent coverage of the graceful shutdown sequence described in the PR objectives:

  • Verifies server starts ready
  • Confirms /readyz returns 503 after SetShuttingDown(true)
  • Validates /healthz (liveness) remains 200 during shutdown
  • Properly asserts state transitions via IsReady() and IsShuttingDown()

Comment on lines 95 to 96
periodSeconds: 20
timeoutSeconds: 5
Contributor


Contributor Author


Nope. So I've updated it to use the defaults.


// Start health server immediately (readiness starts as false)
healthServer := health.NewServer(log, HealthServerPort, adapterConfig.Metadata.Name)
if err := healthServer.Start(ctx); err != nil {
Contributor


If the healthServer is started with a Context.... should that be used to cancel it during shutdown?

I mean, should we listen for the signals and call the context.cancel function?

Contributor Author


The context passed to Start() is intentionally used only for logging, not for lifecycle control.

Using explicit Shutdown() is preferred because:

  1. http.Server.Shutdown() does graceful shutdown - it stops accepting new connections and waits for in-flight requests to complete
  2. Context cancellation would be more abrupt (similar to Close())
  3. Explicit shutdown allows us to control the order: mark not-ready → shutdown health server → shutdown metrics → cleanup

The current pattern follows the standard Go HTTP server lifecycle: start in goroutine, shutdown explicitly when signaled.
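
For illustration, a standard-library-only sketch of that difference (not the adapter's code): Shutdown drains in-flight requests within a timeout, while Close(), roughly what tying the lifetime to context cancellation would amount to, terminates connections immediately.

package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080", Handler: http.NotFoundHandler(), ReadHeaderTimeout: 5 * time.Second}
	go func() { _ = srv.ListenAndServe() }()

	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGTERM, syscall.SIGINT)
	<-sig

	// Graceful: stop accepting new connections and wait (up to the timeout)
	// for in-flight requests to finish.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_ = srv.Shutdown(ctx)

	// Abrupt alternative: srv.Close() would terminate active connections
	// immediately, which is roughly what cancelling a lifecycle context implies.
}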

…Ready to SetBrokerReady

- Remove custom probe values (initialDelaySeconds, periodSeconds, etc.) to use K8s defaults
- Rename SetReady to SetBrokerReady for consistency with SetConfigLoaded
- Remove misleading "backward compatibility" comment

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
cmd/adapter/main.go (1)

223-241: Consider using a dedicated constant for metrics server shutdown timeout.

The metrics server shutdown uses HealthServerShutdownTimeout (line 235). While both servers likely have similar shutdown requirements, using a dedicated MetricsServerShutdownTimeout constant would improve clarity and allow independent tuning if needed.

Suggested change
 // Timeout constants
 const (
 	// OTelShutdownTimeout is the timeout for gracefully shutting down the OpenTelemetry TracerProvider
 	OTelShutdownTimeout = 5 * time.Second
 	// HealthServerShutdownTimeout is the timeout for gracefully shutting down the health server
 	HealthServerShutdownTimeout = 5 * time.Second
+	// MetricsServerShutdownTimeout is the timeout for gracefully shutting down the metrics server
+	MetricsServerShutdownTimeout = 5 * time.Second
 )

Then use MetricsServerShutdownTimeout at line 235.

pkg/health/server_test.go (1)

287-325: Consider using a helper to wait for server readiness.

Using time.Sleep(50 * time.Millisecond) for server startup is common but fragile. A polling approach would be more reliable if test flakiness becomes an issue.

Polling helper example
func waitForServer(t *testing.T, url string, timeout time.Duration) {
    t.Helper()
    deadline := time.Now().Add(timeout)
    for time.Now().Before(deadline) {
        resp, err := http.Get(url)
        if err == nil {
            _ = resp.Body.Close()
            return
        }
        time.Sleep(10 * time.Millisecond)
    }
    t.Fatalf("server did not become ready at %s within %v", url, timeout)
}
pkg/health/server.go (1)

78-90: Server startup errors are not propagated to the caller.

If ListenAndServe fails (e.g., port already in use), the error is logged but Start() returns nil. The caller has no indication that the server failed to bind.

This is a common pattern in Go but can make debugging harder during deployment. Consider using a ready channel or brief delay to verify startup success.

Alternative with startup verification
func (s *Server) Start(ctx context.Context) error {
    s.log.Infof(ctx, "Starting health server on port %s", s.port)

    errCh := make(chan error, 1)
    go func() {
        if err := s.server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
            errCtx := logger.WithErrorField(ctx, err)
            s.log.Errorf(errCtx, "Health server error")
            select {
            case errCh <- err:
            default:
            }
        }
    }()

    // Brief wait to catch immediate startup failures (e.g., port in use)
    select {
    case err := <-errCh:
        return fmt.Errorf("health server failed to start: %w", err)
    case <-time.After(100 * time.Millisecond):
        return nil
    }
}
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 786f173 and e00fa91.

📒 Files selected for processing (4)
  • charts/values.yaml
  • cmd/adapter/main.go
  • pkg/health/server.go
  • pkg/health/server_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • charts/values.yaml
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-12-04T14:06:51.656Z
Learnt from: xueli181114
Repo: openshift-hyperfleet/hyperfleet-adapter PR: 13
File: internal/executor/post_action_executor.go:198-205
Timestamp: 2025-12-04T14:06:51.656Z
Learning: Logger refactor is tracked in HYPERFLEET-304 for the hyperfleet-adapter repository, which will address how CEL evaluation failures and similar errors are handled and logged.

Applied to files:

  • cmd/adapter/main.go
📚 Learning: 2025-12-11T02:06:33.975Z
Learnt from: xueli181114
Repo: openshift-hyperfleet/hyperfleet-adapter PR: 17
File: internal/config_loader/validator_schema.go:126-133
Timestamp: 2025-12-11T02:06:33.975Z
Learning: In the hyperfleet-adapter repository (Go), environment variable validation for required parameters (those with `env.*` source, `Required=true`, and no default) occurs at config load time in `internal/config_loader/validator_schema.go`. This is intentional design - environment variables must be set before deployment, and the validation ensures fail-fast behavior with clear error messages.

Applied to files:

  • cmd/adapter/main.go
🧬 Code graph analysis (2)
pkg/health/server_test.go (1)
pkg/health/server.go (5)
  • NewServer (54-76)
  • HealthResponse (25-28)
  • ReadyResponse (31-35)
  • CheckError (21-21)
  • CheckOK (19-19)
pkg/health/server.go (2)
pkg/logger/logger.go (1)
  • Logger (14-40)
pkg/logger/context.go (1)
  • WithErrorField (156-168)
🔇 Additional comments (14)
cmd/adapter/main.go (3)

205-221: LGTM! Health server integration follows proper lifecycle management.

The health server initialization correctly:

  • Starts before other critical components
  • Marks config as loaded after successful initialization
  • Uses deferred shutdown with proper timeout context

287-300: LGTM! Graceful shutdown follows HyperFleet standard.

The signal handler correctly marks shuttingDown before context cancellation, ensuring /readyz returns 503 immediately upon receiving SIGTERM/SIGINT. This aligns with the HyperFleet Graceful Shutdown Standard documented in the PR objectives.


345-347: LGTM! Broker readiness marking is correctly positioned.

Marking the health server as ready only after broker subscription is established ensures /readyz accurately reflects the adapter's ability to process events.

pkg/health/server_test.go (6)

1-14: LGTM! Well-structured test file with comprehensive imports.


16-30: LGTM! Mock logger correctly satisfies the interface.

The mock implementation properly handles all methods including With, WithFields, and Without returning the mock instance for chaining.


32-102: LGTM! Handler tests verify expected behavior.

Tests correctly verify:

  • /healthz always returns 200 with {"status":"ok"}
  • /readyz returns 503 with check details when not ready
  • /readyz returns 200 when all checks pass

104-168: LGTM! State transition tests are thorough.

Tests properly verify:

  • SetBrokerReady and SetConfigLoaded correctly update check statuses
  • Partial readiness (config OK, broker error) returns 503
  • Custom checks can be added via SetCheck

201-285: LGTM! Shutdown behavior tests are comprehensive.

Tests verify critical shutdown requirements:

  • SetShuttingDown toggles the atomic flag correctly
  • /readyz returns 503 with "server is shutting down" message when flag is set
  • Shutdown state takes priority over individual check statuses
  • IsReady() returns false when shutting down, even if all checks pass

364-420: LGTM! Integration test validates graceful shutdown behavior end-to-end.

This test correctly verifies the HyperFleet Graceful Shutdown Standard:

  • Server transitions from ready to not-ready on shutdown flag
  • /readyz returns 503 with shutdown message
  • /healthz continues returning 200 during shutdown (liveness unaffected)
pkg/health/server.go (5)

14-35: LGTM! Well-defined types for health responses.

The CheckStatus type and response structs are properly documented and use correct JSON tags with omitempty where appropriate.


37-51: LGTM! Thread-safe server struct design.

Good use of atomic.Bool for the high-frequency shuttingDown flag (avoids lock contention on every /readyz request) and sync.RWMutex for the checks map.


53-76: LGTM! Constructor initializes with safe defaults.

Good practices:

  • Default checks start in error state (fail-safe)
  • ReadHeaderTimeout set to 5s (prevents slowloris attacks)
  • Routes registered to dedicated ServeMux

132-148: LGTM! IsReady correctly prioritizes shutdown flag.

The check order (atomic shutdown flag first, then mutex-protected checks) ensures fast-path shutdown detection without lock acquisition.
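
As an illustration of that ordering, a hypothetical readiness check (names mirror the discussion, but this is a sketch, not the repository code):

package health

import (
	"sync"
	"sync/atomic"
)

type readiness struct {
	shuttingDown atomic.Bool
	mu           sync.RWMutex
	checks       map[string]string
}

// IsReady consults the lock-free shutdown flag first, then aggregates the
// mutex-protected checks; every check must report "ok".
func (r *readiness) IsReady() bool {
	if r.shuttingDown.Load() { // fast path: no lock taken during shutdown
		return false
	}
	r.mu.RLock()
	defer r.mu.RUnlock()
	for _, status := range r.checks {
		if status != "ok" {
			return false
		}
	}
	return true
}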


158-200: LGTM! readyzHandler implementation is correct and efficient.

Good implementation details:

  • Shutdown check is atomic (no lock needed)
  • Checks map is copied under RLock, then lock released before I/O
  • Response includes detailed check statuses for debugging

@rh-amarin
Contributor

/lgtm

@openshift-ci

openshift-ci bot commented Jan 12, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rh-amarin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 20c3c13 into main Jan 12, 2026
7 checks passed
