
fix: run zxporter controller-manager as 2 replicas with proper HA guards #352

Merged

sandipanpanda merged 4 commits into main from remove-pbd on Apr 16, 2026
Conversation

@sandipanpanda (Contributor) commented Apr 14, 2026

📚 Description of Changes

Provide an overview of your changes and why they’re needed. Link to any related issues (e.g., "Fixes #123"). If your PR fixes a bug, resolves a feature request, or updates documentation, please explain how.

  • What Changed:
    (Describe the modifications, additions, or removals.)

  • Why This Change:
    (Explain the problem this PR addresses or the improvement it provides.)
zxporter was deployed as a single replica with a PDB requiring minAvailable: 1, which permanently blocked node drains: the lone pod could never be evicted because the PDB required it to stay available. (A sketch of this failure mode follows the component list below.)

  • Affected Components:
(Which components does this change affect? Mark each affected component with an x.)

  • Compose

  • K8s

  • Other (please specify)
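For illustration, a minimal sketch of the pre-fix state (the resource name and selector labels here are hypothetical; only the single replica and minAvailable: 1 come from this PR):

```yaml
# Pre-fix shape (illustrative): a single-replica controller-manager guarded
# by a PDB that requires that same single pod to stay available.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zxporter-controller-manager   # hypothetical name
spec:
  minAvailable: 1   # healthy pods may never drop below 1
  selector:
    matchLabels:
      control-plane: controller-manager   # hypothetical label
# With replicas: 1, disruptionsAllowed = currentHealthy (1) - minAvailable (1) = 0,
# so the eviction API refuses every eviction and `kubectl drain` hangs forever.
```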

❓ Motivation and Context

Why is this change required? What problem does it solve?

  • Context:
    (Provide background information or link to related discussions/issues.)

  • Relevant Tasks/Issues:
    (e.g., Fixes: #GitHub Issue)

🔍 Types of Changes

Indicate which type of changes your code introduces (check all that apply):

  • BUGFIX: Non-breaking fix for an issue.
  • NEW FEATURE: Non-breaking addition of functionality.
  • BREAKING CHANGE: Fix or feature that causes existing functionality to not work as expected.
  • ENHANCEMENT: Improvement to existing functionality.
  • CHORE: Changes that do not affect production (e.g., documentation, build tooling, CI).

🔬 QA / Verification Steps

Describe the steps a reviewer should take to verify your changes:

  1. (Step one: e.g., "Run make test to verify all tests pass.")
  2. (Step two: e.g., "Deploy to a Kind cluster with make create-kind && make deploy.")
  3. (Additional steps as needed.)

✅ Global Checklist

Please check all boxes that apply:

  • I have read and followed the CONTRIBUTING guidelines.
  • My code follows the code style of this project.
  • I have updated the documentation as needed.
  • I have added tests that cover my changes.
  • All new and existing tests have passed locally.
  • I have run this code in a local environment to verify functionality.
  • I have considered the security implications of this change.

Comment thread on helm-chart/zxporter/values.yaml (outdated)
  zxporter was deployed as a single replica with a PDB requiring
  minAvailable: 1, which permanently blocked node drains — the one pod
  could never be evicted because the PDB required it to stay up.

  This commit fixes the full HA setup across all install paths:

  - Set controller-manager replicas to 2 (leader election provides
    instant failover; only one pod is active at a time)
  - Set PDB minAvailable to 1 (allows evicting one pod during drains
    while keeping the standby available)
  - Add standby mode to HealthManager: non-leader pods now return 200
    on /readyz while awaiting leadership, so the PDB correctly sees
    2 available pods and does not block voluntary disruptions (see
    the Go sketch after this list)
  - Clear standby via mgr.Elected() when this pod wins the lease so
    normal component-level readiness checks resume
  - Fix transient startup 503: TelemetrySender.Start() now marks
    ComponentDakrTransport healthy optimistically so readiness does
    not flap between ClearReadinessSuppression() and the first
    successful telemetry send (~15s window)
  - Make Helm deployment replicas conditional on highAvailability.enabled
    (true → 2, false → 1) so Helm installs are self-consistent
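A minimal sketch of the standby gate described in the list above (the internals are assumptions, not the repository's actual HealthManager code; only the SetStandby/readyz behavior is taken from the commit message):

```go
package health

import (
	"net/http"
	"sync"
)

// HealthManager sketch: only the standby flag is shown; the real type
// also tracks per-component health (elided here).
type HealthManager struct {
	mu      sync.RWMutex
	standby bool
}

// SetStandby toggles standby mode. While standby is true, readiness
// reports healthy so the PDB counts this pod as available even though
// it holds no lease and runs no collectors.
func (h *HealthManager) SetStandby(on bool) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.standby = on
}

// ReadyzCheck matches controller-runtime's healthz.Checker signature.
// In standby it short-circuits to success; once the pod is elected and
// SetStandby(false) is called, it falls through to the normal
// component-level evaluation.
func (h *HealthManager) ReadyzCheck(_ *http.Request) error {
	h.mu.RLock()
	defer h.mu.RUnlock()
	if h.standby {
		return nil // /readyz returns 200 while awaiting leadership
	}
	return h.checkComponents()
}

// checkComponents stands in for the existing per-component readiness
// logic, whose details are elided in this sketch.
func (h *HealthManager) checkComponents() error { return nil }
```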
@sandipanpanda changed the title from "Set zxp pdb minAvailable: 0 to unblock node drain" to "fix: run zxporter controller-manager as 2 replicas with proper HA guards" on Apr 14, 2026
Comment thread on cmd/main.go
@@ -135,6 +135,15 @@ func main() {
// reconciling before enforcing readiness checks.
healthManager.SuppressReadiness(2 * time.Minute)
⚠️ Edge Case: Readiness grace period expires while pod is still in standby

The 2-minute readiness grace period (SuppressReadiness(2 * time.Minute)) is set once at startup (line 136), before SetStandby(true) is called. If the standby pod takes longer than 2 minutes to win the leader election (e.g., the current leader is slow to release the lease), the grace period will have already expired by the time SetStandby(false) is called. At that point, readiness checks immediately enforce normal component health requirements, but the collectors haven't had time to start and register as healthy yet — causing the readiness probe to fail and the pod to receive no traffic/be restarted.

This is especially likely in scenarios where the existing leader is being drained (the whole point of this PR) and holds the lease until its graceful shutdown timeout expires.

Suggested fix:

Re-apply the readiness grace period when clearing standby:

```go
go func() {
    select {
    case <-mgr.Elected():
        healthManager.SetStandby(false)
        healthManager.SuppressReadiness(2 * time.Minute)
    case <-ctx.Done():
    }
}()
```


@sandipanpanda requested a review from Tzvonimir on April 14, 2026 20:15

@gitar-bot (Bot) commented Apr 16, 2026

Code Review: ⚠️ Changes requested (1 resolved / 2 findings)

Sets zxp pdb minAvailable to 0 to facilitate node drains, but the readiness grace period currently expires while pods are still in standby, risking traffic loss.

⚠️ Edge Case: Readiness grace period expires while pod is still in standby

📄 cmd/main.go:136 📄 cmd/main.go:143-145 📄 cmd/main.go:206-213

(Same finding as the inline comment on cmd/main.go above; the suggested fix is to re-apply the readiness grace period via SuppressReadiness when clearing standby.)
✅ 1 resolved
Quality: PDB with minAvailable: 0 is a no-op; consider disabling it

📄 helm-chart/zxporter/values.yaml:144-147 📄 config/manager/poddisruptionbudget.yaml:7

Setting minAvailable: 0 means the PDB permits evicting all pods simultaneously, providing zero disruption protection. This effectively makes the PDB a no-op resource that exists in the cluster but does nothing.

For the single-replica default case, a better approach would be to set podDisruptionBudget.enabled: false by default (since there's nothing to protect with 1 replica anyway), and only enable it when highAvailability.enabled: true with a meaningful value like maxUnavailable: 1.

Alternatively, if you want to keep the PDB always present so it's ready when HA is enabled, consider using maxUnavailable: 1 instead — this allows one pod to be evicted at a time (unblocking node drains for a single-replica deployment) while still providing protection when scaled up to multiple replicas.
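A hedged sketch of that alternative (the highAvailability.enabled and podDisruptionBudget.enabled keys are the ones discussed in this thread; the template helper and labels are illustrative, not the chart's actual contents):

```yaml
# values.yaml (illustrative defaults)
highAvailability:
  enabled: false          # true -> 2 replicas, false -> 1
podDisruptionBudget:
  enabled: false          # default off; nothing to protect with 1 replica

# templates/poddisruptionbudget.yaml (sketch)
{{- if .Values.podDisruptionBudget.enabled }}
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: {{ include "zxporter.fullname" . }}   # helper name assumed
spec:
  maxUnavailable: 1   # always allows evicting one pod, so drains proceed
  selector:
    matchLabels:
      control-plane: controller-manager   # label assumed
{{- end }}
```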


@sandipanpanda merged commit 5e87b75 into main on Apr 16, 2026
26 checks passed
@sandipanpanda deleted the remove-pbd branch on April 16, 2026 16:09
Parthiba-Hazra added a commit that referenced this pull request on May 5, 2026:
fix: run zxporter controller-manager as 2 replicas with proper HA guards (#352)
Parthiba-Hazra added a commit that referenced this pull request on May 5, 2026:
fix: run zxporter controller-manager as 2 replicas with proper HA guards (#352)