Conversation

@cfryanr
Contributor

@cfryanr cfryanr commented Nov 25, 2024

Fixes #2137.

With this PR, the Concierge's controller which creates the "cert agent" Deployment now pays attention to which nodes are marked as unschedulable.

Background: The goal of this controller is to co-locate a Pinniped "cert agent pod" on the same node as a Kubernetes controller-manager pod. Previously, it would always choose the newest (by creation date) running controller-manager pod, regardless of which node that pod was running on. This approach causes a problem when a cluster administrator uses kubectl drain or kubectl cordon to cordon a node that they wish to avoid using, e.g. because it needs maintenance. If that happens to be the node on which the cert agent pod is running, the cert agent pod does not move away from the cordoned node, even if the admin deletes the pod to try to force it to be rescheduled. (Note that kubectl drain includes the effect of kubectl cordon: both mark the node as unschedulable.)

With this PR, when there are multiple running controller-manager pods to choose from, the controller prefers to co-locate the cert agent pod with one that is running on a node which allows scheduling pods (spec.unschedulable is false or unset).

When an admin uses kubectl cordon to cordon the node on which the cert agent pod is running, the Concierge will notice this within about 5 minutes and will move the pod to another node, assuming that a suitable one is available. The delay exists because this controller does not dynamically watch all activity on the nodes. It could, but that would cost some performance just to handle this rare corner case, so I feel like the delay is acceptable. Also note that cordoning a node does not necessarily mean that its pods should be rescheduled, because cordoning only marks the node as unschedulable for future pods, which is another reason that the delay is acceptable. To speed things up, the admin can manually delete the cert agent pod, or can use kubectl drain to delete all the pods from the node (including the cert agent pod). Either action causes the controller to run sooner and move the pod to another node if possible.

Release note:

The Concierge will try to avoid scheduling the cert agent pod onto unschedulable nodes, when possible.
This means that the Concierge will pay attention to when the cert agent pod's node has been cordoned
or drained by the cluster administrator and will move the cert agent pod, if another suitable node is available.

@netlify

netlify bot commented Nov 25, 2024

Deploy Preview for pinniped-dev canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | e61afcd |
| 🔍 Latest deploy log | https://app.netlify.com/sites/pinniped-dev/deploys/67474344ea5aa50008b4c831 |

@codecov

codecov bot commented Nov 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 31.33%. Comparing base (6ac5446) to head (e61afcd).
Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2143      +/-   ##
==========================================
+ Coverage   31.28%   31.33%   +0.04%     
==========================================
  Files         371      371              
  Lines       61131    61174      +43     
==========================================
+ Hits        19125    19167      +42     
  Misses      41482    41482              
- Partials      524      525       +1     


@cfryanr cfryanr force-pushed the rr/kube-cert-agent-for-unschedulable-nodes branch from 7a1a096 to e44d70b Compare November 25, 2024 22:20
@joshuatcasey joshuatcasey merged commit 615b60b into main Nov 27, 2024
39 checks passed
@joshuatcasey joshuatcasey deleted the rr/kube-cert-agent-for-unschedulable-nodes branch November 27, 2024 18:27
@cfryanr cfryanr mentioned this pull request Dec 3, 2024
20 tasks

Development

Successfully merging this pull request may close these issues.

The pinniped-concierge-kube-cert-agent hard coded to specific control plane node causes problems when node is unavailable

4 participants