fix: run zxporter controller-manager as 2 replicas with proper HA guards (#352)
Conversation
zxporter was deployed as a single replica with a PDB requiring
minAvailable: 1, which permanently blocked node drains — the one pod
could never be evicted because the PDB required it to stay up.
This commit fixes the full HA setup across all install paths:
- Set controller-manager replicas to 2 (leader election provides
instant failover; only one pod is active at a time)
- Set PDB minAvailable to 1 (allows evicting one pod during drains
while keeping the standby available)
- Add standby mode to HealthManager: non-leader pods now return 200
on /readyz while awaiting leadership, so the PDB correctly sees
2 available pods and does not block voluntary disruptions
- Clear standby via mgr.Elected() when this pod wins the lease so
normal component-level readiness checks resume (see the sketch after
this list)
- Fix transient startup 503: TelemetrySender.Start() now marks
ComponentDakrTransport healthy optimistically so readiness does
not flap between ClearReadinessSuppression() and the first
successful telemetry send (~15s window)
- Make Helm deployment replicas conditional on highAvailability.enabled
(true → 2, false → 1) so Helm installs are self-consistent
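To make the standby behavior concrete, here is a minimal, self-contained sketch of a standby-aware HealthManager. It is a reconstruction from the bullet points above, not zxporter's actual code: apart from SetStandby, SuppressReadiness, and ComponentDakrTransport, every identifier (NewHealthManager, SetComponentHealthy, the field names) is an assumption for illustration.

```go
// Hypothetical reconstruction of the standby behavior described above.
package health

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// ComponentDakrTransport is named in the PR; the string value is assumed.
const ComponentDakrTransport = "dakr-transport"

type HealthManager struct {
	mu            sync.Mutex
	standby       bool
	suppressUntil time.Time
	components    map[string]bool // component name -> healthy
}

func NewHealthManager() *HealthManager {
	return &HealthManager{components: make(map[string]bool)}
}

// SetStandby toggles standby mode. While standby, /readyz reports ready,
// so the PDB counts the non-leader pod as available and drains proceed.
func (h *HealthManager) SetStandby(on bool) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.standby = on
}

// SuppressReadiness grants a grace period during which component health
// is not enforced, e.g. while collectors start after winning the lease.
func (h *HealthManager) SuppressReadiness(d time.Duration) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.suppressUntil = time.Now().Add(d)
}

// SetComponentHealthy records component health. Per the PR description,
// TelemetrySender.Start() would call this optimistically for
// ComponentDakrTransport so readiness does not flap before the first
// successful telemetry send.
func (h *HealthManager) SetComponentHealthy(name string, healthy bool) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.components[name] = healthy
}

// ReadyzCheck matches controller-runtime's healthz.Checker signature:
// ready while standby or within the grace period; otherwise every
// registered component must be healthy.
func (h *HealthManager) ReadyzCheck(_ *http.Request) error {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.standby || time.Now().Before(h.suppressUntil) {
		return nil
	}
	for name, healthy := range h.components {
		if !healthy {
			return fmt.Errorf("component %q not ready", name)
		}
	}
	return nil
}
```

In main(), both pods would start with SetStandby(true), and a goroutine watching mgr.Elected() would clear it once the lease is won; that goroutine is exactly the spot the review comment below is concerned with.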
```diff
@@ -135,6 +135,15 @@ func main() {
	// reconciling before enforcing readiness checks.
	healthManager.SuppressReadiness(2 * time.Minute)
```
⚠️ Edge Case: Readiness grace period expires while pod is still in standby
The 2-minute readiness grace period (SuppressReadiness(2 * time.Minute)) is set once at startup (line 136), before SetStandby(true) is called. If the standby pod takes longer than 2 minutes to win the leader election (e.g., because the current leader is slow to release the lease), the grace period will have already expired by the time SetStandby(false) is called. At that point, readiness checks immediately enforce normal component health requirements, but the collectors have not yet had time to start and register as healthy, so the readiness probe fails and the pod receives no traffic or gets restarted.
This is especially likely in scenarios where the existing leader is being drained (the whole point of this PR) and holds the lease until its graceful shutdown timeout expires.
Suggested fix:
Re-apply the readiness grace period when clearing standby:
```go
go func() {
	// Re-arm the readiness grace period the moment this pod wins the
	// lease, so collectors get time to start before checks are enforced.
	select {
	case <-mgr.Elected():
		healthManager.SetStandby(false)
		healthManager.SuppressReadiness(2 * time.Minute)
	case <-ctx.Done():
		// Shutting down before ever becoming leader; nothing to re-arm.
	}
}()
```
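One design note on this suggestion (an observation, not part of the PR): re-arming SuppressReadiness inside the Elected() branch mirrors the startup path, so a pod promoted late in its life gets the same 2-minute collector warm-up as a pod that held the lease from the start.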
Code Review
…rds (#352)
* Set zxp pdb minAvailable: 0 to unblock node drain
* fix: run zxporter controller-manager as 2 replicas with proper HA guards
📚 Description of Changes
Provide an overview of your changes and why they’re needed. Link to any related issues (e.g., "Fixes #123"). If your PR fixes a bug, resolves a feature request, or updates documentation, please explain how.
What Changed:
(Describe the modifications, additions, or removals.)
Why This Change:
(Explain the problem this PR addresses or the improvement it provides.)
zxporter was deployed as a single replica with a PDB requiring minAvailable: 1, which permanently blocked node drains — the one pod could never be evicted because the PDB required it to stay up.
Affected Components:
(Which component does this change affect? - put x for all components)
- Compose
- K8s
- Other (please specify)
❓ Motivation and Context
Why is this change required? What problem does it solve?
Context:
(Provide background information or link to related discussions/issues.)
Relevant Tasks/Issues:
(e.g., Fixes: #GitHub Issue)
🔍 Types of Changes
Indicate which type of changes your code introduces (check all that apply):
🔬 QA / Verification Steps
Describe the steps a reviewer should take to verify your changes:
(e.g., "make test to verify all tests pass.")
(e.g., "make create-kind && make deploy.")
✅ Global Checklist
Please check all boxes that apply: