Pod Disruption Budgets

evan-cz edited this page Sep 16, 2025 · 1 revision

Pod Disruption Budgets (PDBs) in CloudZero Agent

This document explains why CloudZero Agent uses Pod Disruption Budgets (PDBs) and how to handle them during cluster maintenance operations.

What are Pod Disruption Budgets?

Pod Disruption Budgets are Kubernetes resources that limit the number of pods that can be voluntarily disrupted during cluster maintenance operations. They help ensure application availability during:

  • Node maintenance
  • Cluster upgrades
  • Pod evictions
  • Voluntary disruptions

For more information, see the Kubernetes documentation on Pod Disruption Budgets.
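As a concrete illustration, a minimal PDB manifest might look like the following (a sketch; the name and the `app: cloudzero-agent` label selector are hypothetical — match whatever labels your deployment actually uses):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cloudzero-agent-pdb   # hypothetical name
spec:
  minAvailable: 1             # at least one matching pod must stay running
  selector:
    matchLabels:
      app: cloudzero-agent    # hypothetical label; match your pod labels
```

With this in place, voluntary disruptions (evictions triggered by `kubectl drain`, the cluster autoscaler, etc.) will refuse to remove the last matching pod.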

Why CloudZero Agent Uses PDBs

The CloudZero Agent applies PDBs with minAvailable: 1 to all components. Without them, these components could be accidentally scaled down to 0 replicas during cluster operations, causing:

  • Complete loss of metrics collection
  • Loss of cost attribution data
  • Service unavailability

The Single-Replica "Problem"

You may have noticed that some CloudZero Agent components (like the main agent and kube-state-metrics) run with only 1 replica but still have PDBs with minAvailable: 1. This is intentional and follows Kubernetes best practices for single-instance stateful applications.

As noted in the Kubernetes documentation on single-instance stateful applications, such applications may want to set minAvailable: 1 in the PDB to prevent their single pod from being evicted — but this also prevents the node the pod runs on from being drained.
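To see why, consider the arithmetic: with 1 replica and minAvailable: 1, the budget has zero headroom, which shows up directly in the PDB's status (field names are from the policy/v1 API; the numbers below assume a healthy single-replica component):

```yaml
status:
  currentHealthy: 1
  desiredHealthy: 1
  expectedPods: 1
  disruptionsAllowed: 0   # no pod may be voluntarily evicted
```

Any eviction request against this pod is rejected until either the PDB is removed or a second healthy replica exists.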

Why This Matters

The PDB prevents:

  • Accidental scaling to 0: Prevents the component from being accidentally scaled down during cluster operations
  • Data loss: Ensures continuous metrics collection and cost attribution
  • Service disruption: Maintains critical CloudZero functionality

However, this also means:

  • Node draining is blocked: The PDB prevents pods from being evicted during node maintenance
  • Cluster upgrades may be delayed: The autoscaler cannot drain nodes with these pods
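A quick way to spot budgets with no eviction headroom (a sketch; run it against the namespace where the agent is installed, or cluster-wide as shown):

```shell
# List all PDBs with how many disruptions each currently allows;
# rows showing 0 are the ones that will block `kubectl drain`.
kubectl get pdb --all-namespaces \
  -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed
```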

Solution: Temporarily Disable PDBs During Maintenance

The Kubernetes documentation recommends the following approach for this scenario: set the PDB with maxUnavailable: 0, establish an understanding that the cluster operator must consult you before termination, and, when contacted, prepare for downtime, delete the PDB to signal readiness for disruption, and recreate it after the maintenance is complete.

Step 1: Disable PDBs temporarily

You can disable all PDBs in two ways:

Option A: Disable all PDBs globally

# Disable all PDBs for maintenance
defaults:
  podDisruptionBudget:
    enabled: false
kubeStateMetrics:
  podDisruptionBudget: {}

Option B: Disable PDBs individually

# Disable specific component PDBs
components:
  agent:
    podDisruptionBudget:
      enabled: false
  aggregator:
    podDisruptionBudget:
      enabled: false
  webhookServer:
    podDisruptionBudget:
      enabled: false
kubeStateMetrics:
  podDisruptionBudget: {}

Step 2: Apply the changes

helm upgrade <release-name> ./helm -f your-values.yaml

Step 3: Perform your maintenance operation

  • Cluster upgrades
  • Node patching
  • Node draining
  • Any other operations that require pod eviction
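For example, with the PDBs disabled, a node can be drained in the usual way (the node name is a placeholder):

```shell
# Cordon the node and evict its pods; DaemonSet-managed pods are skipped,
# and pods using emptyDir volumes are allowed to lose that data.
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```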

Step 4: Re-enable PDBs after maintenance

# Re-enable PDBs by removing (or commenting out, as shown) the Step 1 overrides
# defaults:
#   podDisruptionBudget:
#     enabled: false
# kubeStateMetrics:
#   podDisruptionBudget: {}
helm upgrade <release-name> ./helm -f your-values.yaml
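After the upgrade, it is worth confirming that the budgets are back in place (the namespace is a placeholder; use the release's namespace):

```shell
# Each CloudZero Agent component should again show a PDB with MIN AVAILABLE 1
kubectl get pdb -n <namespace>
```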

Troubleshooting

PDB Preventing Node Draining

If you encounter issues where PDBs prevent node draining:

  1. Check which PDBs are blocking:

    kubectl get pdb
  2. Temporarily relax the blocking PDB:

    kubectl patch pdb <pdb-name> -p '{"spec":{"minAvailable":0}}'
  3. Perform your maintenance operation

  4. Re-enable the PDB:

    kubectl patch pdb <pdb-name> -p '{"spec":{"minAvailable":1}}'

Best Practices

  1. Plan maintenance windows: Coordinate with your team before disabling PDBs
  2. Re-enable promptly: Always re-enable PDBs after maintenance to prevent accidental scaling
  3. Monitor during maintenance: Ensure CloudZero functionality is restored after PDBs are re-enabled
  4. Document the process: Keep track of when PDBs are disabled and re-enabled
