
fix: run zxporter controller-manager as 2 replicas with proper HA guards #352

Merged

sandipanpanda merged 4 commits into main from remove-pbd on Apr 16, 2026
Conversation

@sandipanpanda (Contributor) commented Apr 14, 2026

📚 Description of Changes

Provide an overview of your changes and why they’re needed. Link to any related issues (e.g., "Fixes #123"). If your PR fixes a bug, resolves a feature request, or updates documentation, please explain how.

  • What Changed:
    (Describe the modifications, additions, or removals.)

  • Why This Change:
    (Explain the problem this PR addresses or the improvement it provides.)
zxporter was deployed as a single replica with a PDB requiring minAvailable: 1, which permanently blocked node drains: the lone pod could never be evicted because the PDB required it to stay available. (A sketch of this failure mode follows the component list below.)

  • Affected Components:
(Which components does this change affect? Mark each affected component with an x.)

  • Compose

  • K8s

  • Other (please specify)
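For illustration, a minimal sketch of the pre-fix state (the resource name and selector labels here are hypothetical; only the single replica and minAvailable: 1 come from this PR):

```yaml
# Pre-fix shape (illustrative): a single-replica controller-manager guarded
# by a PDB that requires that same single pod to stay available.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zxporter-controller-manager   # hypothetical name
spec:
  minAvailable: 1   # healthy pods may never drop below 1
  selector:
    matchLabels:
      control-plane: controller-manager   # hypothetical label
# With replicas: 1, disruptionsAllowed = currentHealthy (1) - minAvailable (1) = 0,
# so the eviction API refuses every eviction and `kubectl drain` hangs forever.
```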

❓ Motivation and Context

Why is this change required? What problem does it solve?

  • Context:
    (Provide background information or link to related discussions/issues.)

  • Relevant Tasks/Issues:
    (e.g., Fixes: #GitHub Issue)

🔍 Types of Changes

Indicate which type of changes your code introduces (check all that apply):

  • BUGFIX: Non-breaking fix for an issue.
  • NEW FEATURE: Non-breaking addition of functionality.
  • BREAKING CHANGE: Fix or feature that causes existing functionality to not work as expected.
  • ENHANCEMENT: Improvement to existing functionality.
  • CHORE: Changes that do not affect production (e.g., documentation, build tooling, CI).

🔬 QA / Verification Steps

Describe the steps a reviewer should take to verify your changes:

  1. (Step one: e.g., "Run make test to verify all tests pass.")
  2. (Step two: e.g., "Deploy to a Kind cluster with make create-kind && make deploy.")
  3. (Additional steps as needed.)

✅ Global Checklist

Please check all boxes that apply:

  • I have read and followed the CONTRIBUTING guidelines.
  • My code follows the code style of this project.
  • I have updated the documentation as needed.
  • I have added tests that cover my changes.
  • All new and existing tests have passed locally.
  • I have run this code in a local environment to verify functionality.
  • I have considered the security implications of this change.

Comment thread on helm-chart/zxporter/values.yaml (outdated)
  zxporter was deployed as a single replica with a PDB requiring
  minAvailable: 1, which permanently blocked node drains — the one pod
  could never be evicted because the PDB required it to stay up.

  This commit fixes the full HA setup across all install paths:

  - Set controller-manager replicas to 2 (leader election provides
    instant failover; only one pod is active at a time)
  - Set PDB minAvailable to 1 (allows evicting one pod during drains
    while keeping the standby available)
  - Add standby mode to HealthManager: non-leader pods now return 200
    on /readyz while awaiting leadership, so the PDB correctly sees
    2 available pods and does not block voluntary disruptions (see
    the Go sketch after this list)
  - Clear standby via mgr.Elected() when this pod wins the lease so
    normal component-level readiness checks resume
  - Fix transient startup 503: TelemetrySender.Start() now marks
    ComponentDakrTransport healthy optimistically so readiness does
    not flap between ClearReadinessSuppression() and the first
    successful telemetry send (~15s window)
  - Make Helm deployment replicas conditional on highAvailability.enabled
    (true → 2, false → 1) so Helm installs are self-consistent
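A minimal sketch of the standby gate described in the list above (the internals are assumptions, not the repository's actual HealthManager code; only the SetStandby/readyz behavior is taken from the commit message):

```go
package health

import (
	"net/http"
	"sync"
)

// HealthManager sketch: only the standby flag is shown; the real type
// also tracks per-component health (elided here).
type HealthManager struct {
	mu      sync.RWMutex
	standby bool
}

// SetStandby toggles standby mode. While standby is true, readiness
// reports healthy so the PDB counts this pod as available even though
// it holds no lease and runs no collectors.
func (h *HealthManager) SetStandby(on bool) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.standby = on
}

// ReadyzCheck matches controller-runtime's healthz.Checker signature.
// In standby it short-circuits to success; once the pod is elected and
// SetStandby(false) is called, it falls through to the normal
// component-level evaluation.
func (h *HealthManager) ReadyzCheck(_ *http.Request) error {
	h.mu.RLock()
	defer h.mu.RUnlock()
	if h.standby {
		return nil // /readyz returns 200 while awaiting leadership
	}
	return h.checkComponents()
}

// checkComponents stands in for the existing per-component readiness
// logic, whose details are elided in this sketch.
func (h *HealthManager) checkComponents() error { return nil }
```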
@sandipanpanda changed the title from "Set zxp pdb minAvailable: 0 to unblock node drain" to "fix: run zxporter controller-manager as 2 replicas with proper HA guards" on Apr 14, 2026
Comment thread on cmd/main.go
@@ -135,6 +135,15 @@ func main() {
// reconciling before enforcing readiness checks.
healthManager.SuppressReadiness(2 * time.Minute)
⚠️ Edge Case: Readiness grace period expires while pod is still in standby

The 2-minute readiness grace period (SuppressReadiness(2 * time.Minute)) is set once at startup (line 136), before SetStandby(true) is called. If the standby pod takes longer than 2 minutes to win the leader election (e.g., the current leader is slow to release the lease), the grace period will have already expired by the time SetStandby(false) is called. At that point, readiness checks immediately enforce normal component health requirements, but the collectors haven't had time to start and register as healthy yet — causing the readiness probe to fail and the pod to receive no traffic/be restarted.

This is especially likely in scenarios where the existing leader is being drained (the whole point of this PR) and holds the lease until its graceful shutdown timeout expires.

Suggested fix:

Re-apply the readiness grace period when clearing standby:

```go
go func() {
    select {
    case <-mgr.Elected():
        healthManager.SetStandby(false)
        healthManager.SuppressReadiness(2 * time.Minute)
    case <-ctx.Done():
    }
}()
```


@sandipanpanda requested a review from Tzvonimir on April 14, 2026 20:15

@gitar-bot (Bot) commented Apr 16, 2026

Code Review: ⚠️ Changes requested (1 resolved / 2 findings)

Sets zxp pdb minAvailable to 0 to facilitate node drains, but the readiness grace period currently expires while pods are still in standby, risking traffic loss.

⚠️ Edge Case: Readiness grace period expires while pod is still in standby

📄 cmd/main.go:136 📄 cmd/main.go:143-145 📄 cmd/main.go:206-213

(Same finding as the inline comment on cmd/main.go above; the suggested fix is to re-apply the readiness grace period via SuppressReadiness when clearing standby.)
✅ 1 resolved
Quality: PDB with minAvailable: 0 is a no-op; consider disabling it

📄 helm-chart/zxporter/values.yaml:144-147 📄 config/manager/poddisruptionbudget.yaml:7

Setting minAvailable: 0 means the PDB permits evicting all pods simultaneously, providing zero disruption protection. This effectively makes the PDB a no-op resource that exists in the cluster but does nothing.

For the single-replica default case, a better approach would be to set podDisruptionBudget.enabled: false by default (since there's nothing to protect with 1 replica anyway), and only enable it when highAvailability.enabled: true with a meaningful value like maxUnavailable: 1.

Alternatively, if you want to keep the PDB always present so it's ready when HA is enabled, consider using maxUnavailable: 1 instead — this allows one pod to be evicted at a time (unblocking node drains for a single-replica deployment) while still providing protection when scaled up to multiple replicas.
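A hedged sketch of that alternative (the highAvailability.enabled and podDisruptionBudget.enabled keys are the ones discussed in this thread; the template helper and labels are illustrative, not the chart's actual contents):

```yaml
# values.yaml (illustrative defaults)
highAvailability:
  enabled: false          # true -> 2 replicas, false -> 1
podDisruptionBudget:
  enabled: false          # default off; nothing to protect with 1 replica

# templates/poddisruptionbudget.yaml (sketch)
{{- if .Values.podDisruptionBudget.enabled }}
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: {{ include "zxporter.fullname" . }}   # helper name assumed
spec:
  maxUnavailable: 1   # always allows evicting one pod, so drains proceed
  selector:
    matchLabels:
      control-plane: controller-manager   # label assumed
{{- end }}
```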


@sandipanpanda merged commit 5e87b75 into main on Apr 16, 2026
26 checks passed
@sandipanpanda deleted the remove-pbd branch on April 16, 2026 16:09
Parthiba-Hazra added a commit that referenced this pull request on May 5, 2026:
fix: run zxporter controller-manager as 2 replicas with proper HA guards (#352)
Parthiba-Hazra added a commit that referenced this pull request on May 5, 2026:
fix: run zxporter controller-manager as 2 replicas with proper HA guards (#352)