Cert agent controller avoids locating the agent pod on unschedulable nodes when possible #2143

cfryanr · 2024-11-25T20:53:00Z

With this PR, the Concierge controller that creates the "cert agent" Deployment now pays attention to which nodes are marked as unschedulable.

Background: The goal of this controller is to co-locate a Pinniped "cert agent pod" onto the same node as a Kubernetes controller-manager pod. Previously, it would always choose the newest (by creation date) running controller-manager pod, regardless of which node that pod was running on. This approach causes a problem when a cluster administrator uses kubectl drain or kubectl cordon to cordon a node that they wish to avoid using, e.g. because it needs maintenance. If that happens to be the node on which the cert agent pod is running, the cert agent pod does not move away from the cordoned node, even if the admin deletes the pod to try to force it to be rescheduled. (Note that kubectl drain also has the effect of kubectl cordon, in that both mark the node as unschedulable.)

With this PR, when there are multiple running controller-manager pods to choose from, the controller will prefer to co-locate the cert agent pod with one that is running on a node which allows scheduling pods (spec.unschedulable equal to false).

When an admin uses kubectl cordon to cordon the node on which the cert agent pod is running, the Concierge will notice this within about 5 minutes and will move the pod to another node, assuming that a suitable one is available. The delay is because this controller does not dynamically watch all activity on the nodes. It could, but that would have some performance impact just to handle this rare corner case, so I feel like the delay is acceptable. Also note that just cordoning a node does not mean that its pods should necessarily be rescheduled, because cordoning only marks the node as unschedulable for future pods, which is another reason that the delay is acceptable. To speed things up, the admin can manually delete the cert agent pod, or can use kubectl drain to delete all the pods from the node (including the cert agent pod). Either action will cause the controller to run sooner and move the pod to another node if possible.

Release note:

The Concierge will try to avoid scheduling the cert agent pod onto unschedulable nodes, when possible.
This means that the Concierge will pay attention to when the cert agent pod's node has been cordoned
or drained by the cluster administrator and will move the cert agent pod, if another suitable node is available.

netlify · 2024-11-25T20:53:16Z

✅ Deploy Preview for pinniped-dev canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | `e61afcd` |
| 🔍 Latest deploy log | https://app.netlify.com/sites/pinniped-dev/deploys/67474344ea5aa50008b4c831 |

codecov · 2024-11-25T21:05:29Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 31.33%. Comparing base (6ac5446) to head (e61afcd).
Report is 3 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2143      +/-   ##
==========================================
+ Coverage   31.28%   31.33%   +0.04%     
==========================================
  Files         371      371              
  Lines       61131    61174      +43     
==========================================
+ Hits        19125    19167      +42     
  Misses      41482    41482              
- Partials      524      525       +1


vmwclabot added the cla-not-required label Nov 25, 2024

cfryanr mentioned this pull request Nov 25, 2024

The pinniped-concierge-kube-cert-agent hard coded to specific control plane node causes problems when node is unavailable #2137

Closed

kube cert agent controller avoids unschedulable nodes when possible

e44d70b

cfryanr force-pushed the rr/kube-cert-agent-for-unschedulable-nodes branch from 7a1a096 to e44d70b Compare November 25, 2024 22:20

Merge branch 'main' into rr/kube-cert-agent-for-unschedulable-nodes

e61afcd

joshuatcasey approved these changes Nov 27, 2024


joshuatcasey enabled auto-merge November 27, 2024 17:27

joshuatcasey merged commit 615b60b into main Nov 27, 2024
39 checks passed

joshuatcasey deleted the rr/kube-cert-agent-for-unschedulable-nodes branch November 27, 2024 18:27

cfryanr mentioned this pull request Dec 3, 2024

Release checklist for v0.36.0 #2148

Closed

20 tasks
