config/v1/types_cluster_operator: Expand upgradeable inputs to cluster scope #926
…r scope

Consider these cases:

a. Component A is in a state that allows updates, and nothing in the rest of the cluster would break if A updated.
b. Component A is in a state that allows updates, but component B (which is in-cluster, but not part of A) would break if A updated.
c. Component A would break if it updated.

Operator A should pretty clearly be Upgradeable=True for (a) and Upgradeable=False for (c).

Before this commit, a narrow reading of the comment would have operator A be Upgradeable=True for (b). This commit moves it to Upgradeable=False, based on discussion in [1], where it becomes the job of the API-server to set Upgradeable=False if updating the API-server would break nodes running old kubelets. The API-server can say "to unblock minor updates, update your kubelets". The machine-config operator will simultaneously say "hey, your kubelets are old, and here's how to update: $STEPS", but it won't use Upgradeable=False to say that (because the machine-config operator would be _happy_ to have its component nodes updated).

As pointed out in discussion in [1], this is a bit of a bottomless pit. For example, component A may be removing a deprecated feature on update, and there may be user workloads that occasionally depend on that feature but hardly ever use it. Component A might reasonably think "nobody has used $OUTGOING_FEATURE in the last week, so I'm Upgradeable=True", and then post-update, the user-workload would go to hit the removed API and break. And obviously in-cluster components will have even more limited access to any out-of-cluster components that depend on them. So using Upgradeable=False to protect other components from breaking is going to be a best-effort sort of thing. But this commit pivots so that it's more clear that we'll put that effort in when we can.

[1]: openshift/enhancements#762
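To make the kubelet example concrete, here is a rough Go sketch of the kind of check the API-server's operator might run before deciding on Upgradeable. Everything in it is hypothetical: the function name `kubeletSkewBlockers`, the node names, and the `maxSkew` parameter are illustration only, and the supported skew is an assumption passed in by the caller rather than anything this PR defines.

```go
package main

import (
	"fmt"
	"sort"
)

// kubeletSkewBlockers lists the nodes whose kubelet would fall more than
// maxSkew minor versions behind the API server once the API server moves
// from apiMinor to apiMinor+1. maxSkew is an assumed, caller-supplied policy.
func kubeletSkewBlockers(apiMinor int, kubeletMinors map[string]int, maxSkew int) []string {
	var blockers []string
	for node, minor := range kubeletMinors {
		if skew := (apiMinor + 1) - minor; skew > maxSkew {
			blockers = append(blockers,
				fmt.Sprintf("%s: kubelet minor %d would be %d behind", node, minor, skew))
		}
	}
	sort.Strings(blockers) // map iteration order is random; sort for stable messages
	return blockers
}

func main() {
	// Hypothetical cluster: one current node, one node with an old kubelet.
	nodes := map[string]int{"worker-0": 21, "worker-1": 19}
	blockers := kubeletSkewBlockers(21, nodes, 2)
	if len(blockers) == 0 {
		fmt.Println("Upgradeable=True")
		return
	}
	fmt.Println("Upgradeable=False: to unblock minor updates, update your kubelets:", blockers)
}
```

The point of the sketch is only that the check looks at the post-update state of another component (the kubelets), which is case (b) from the perspective of the API-server.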
d7dc014 to 1a82848
I'm fuzzy on this comment about the API-server being a client of the kubelet for exec flows. Perhaps that is sufficient to get the kubelet-skew-guard in under (c)? And maybe we want the godocs to be generic enough that operator B could say "hey, A is going too fast, wait for me to catch up" in cases where A isn't smart enough to notice B falling behind? Wording would be something like:
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: soltysh, wking
Consider these cases:
a. Component A is in a state that allows updates, and nothing in the rest of the cluster would break if A updated.
b. Component A is in a state that allows updates, but component B (which is in-cluster, but not part of A) would break if A updated.
c. Component A would break if it updated.
Operator A should pretty clearly be Upgradeable=True for (a) and Upgradeable=False for (c).

Before this commit, a narrow reading of the comment would have operator A be Upgradeable=True for (b). This commit moves it to Upgradeable=False, based on discussion in openshift/enhancements#762, where it becomes the job of the API-server to set Upgradeable=False if updating the API-server would break nodes running old kubelets. The API-server can say "to unblock minor updates, update your kubelets". The machine-config operator will simultaneously say "hey, your kubelets are old, and here's how to update: $STEPS", but it won't use Upgradeable=False to say that (because the machine-config operator would be happy to have its component nodes updated).

As pointed out in discussion in openshift/enhancements#762, this is a bit of a bottomless pit. For example, component A may be removing a deprecated feature on update, and there may be user workloads that occasionally depend on that feature but hardly ever use it. Component A might reasonably think "nobody has used $OUTGOING_FEATURE in the last week, so I'm Upgradeable=True", and then post-update, the user-workload would go to hit the removed API and break. And obviously in-cluster components will have even more limited access to any out-of-cluster components that depend on them. So using Upgradeable=False to protect other components from breaking is going to be a best-effort sort of thing. But this commit pivots so that it's more clear that we'll put that effort in when we can.
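The three cases can be summarized in a small Go sketch. This is not the code in this PR (which only touches godocs); the `upgradeInputs` struct, the `upgradeable` function, and the local `ConditionStatus` type are hypothetical stand-ins so the example is self-contained. It assumes operator A can collect two lists of reasons: ones where A itself would break, and ones where another in-cluster component would break.

```go
package main

import (
	"fmt"
	"strings"
)

// ConditionStatus is a local stand-in for the True/False condition strings.
type ConditionStatus string

const (
	ConditionTrue  ConditionStatus = "True"
	ConditionFalse ConditionStatus = "False"
)

// upgradeInputs is a hypothetical summary of what operator A knows.
type upgradeInputs struct {
	SelfBlockers    []string // reasons component A itself would break: case (c)
	ClusterBlockers []string // reasons another in-cluster component would break: case (b)
}

// upgradeable returns the Upgradeable status plus a message for the admin.
// With the cluster-scope reading, either kind of blocker yields False.
func upgradeable(in upgradeInputs) (ConditionStatus, string) {
	blockers := append(append([]string{}, in.SelfBlockers...), in.ClusterBlockers...)
	if len(blockers) == 0 {
		return ConditionTrue, "" // case (a)
	}
	return ConditionFalse, strings.Join(blockers, "; ")
}

func main() {
	status, msg := upgradeable(upgradeInputs{
		ClusterBlockers: []string{"component B would break if A updated"},
	})
	fmt.Printf("Upgradeable=%s %s\n", status, msg)
}
```

Because the blocker lists are only as good as what A can observe, an empty list does not prove that nothing will break; that is the best-effort caveat above.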