@@ -117,34 +117,17 @@ use.
 This allows us to *inform* the admin for removals that are more than one minor
 version away and *block* upgrades for removals which are imminent.
 
-### MCO - Enforce OpenShift's defined host component version skew policies
-
-The MCO will set Upgradeable=False whenever any MachineConfigPool has one or
-more nodes present which fall outside of a defined list of constraints. For
-instance, if OpenShift has a defined Kubelet Version Skew of N-1, the node
-constraints enforced by the MCO defined in OCP 4.7 (Kube 1.20) would be as follows:
-
-```yaml
-node.status.nodeInfo.kubeletVersion:
-- v1.20
-```
-
-If the policy were to change, allowing for a version skew of N-2, v1.19 would be
-added to the list of acceptable matches. As a result, a cluster which had been
-upgraded from 4.6 to 4.7 would allow a subsequent upgrade to 4.8 as long as all
-kubelets were either v1.19 or v1.20. The 4.8 MCO would then evaluate the Upgradeable
-condition based on its own constraints; if v1.19 weren't allowed, it would then
-inhibit upgrades to 4.9. This means the MCO must set Upgradeable=False until it
-has confirmed constraints have been met.
-
-```yaml
-node.status.nodeInfo.kubeletVersion:
-- v1.20
-- v1.19
-```
-
-The MCO is not responsible for defining these constraints, and constraints are
-only widened once CI testing proves them to be safe.
+### APIServer - Enforce OpenShift's defined kubelet version skew policies
+
+The API Server Operator will set `Upgradeable=False` whenever any of the nodes
+within the cluster are at the skew limit; that is, when an upgrade of the API
+Server would exceed the allowable kubelet version skew. For instance, if
+OpenShift has a defined kubelet version skew of N-1, the API Server Operator
+would report `Upgradeable=True` if all of the nodes are at N, and
+`Upgradeable=False` if at least one of the nodes is not up to date. If the
+kubelet skew policy were to change, allowing for a version skew of N-2, the API
+Server Operator would report `Upgradeable=True` if all of the nodes are at N or
+N-1, and `Upgradeable=False` if any of the nodes are at N-2.
 
 These changes will need to be backported to 4.7 prior to 4.7 EOL.
 
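As a reading aid for the hunk above, the N-1 / N-2 rule it describes can be sketched in Go. This is only an illustration under stated assumptions, not the API Server Operator's actual code: the function names `minorOf` and `upgradeable`, and the idea of passing the allowed skew as an integer, are hypothetical. It assumes kubelet versions arrive in the form reported by `node.status.nodeInfo.kubeletVersion` (for example `v1.20.0`).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minorOf extracts the minor number from a version string such as "v1.20.0".
// Hypothetical helper, written for this sketch only.
func minorOf(version string) (int, error) {
	parts := strings.SplitN(strings.TrimPrefix(version, "v"), ".", 3)
	if len(parts) < 2 {
		return 0, fmt.Errorf("unexpected version %q", version)
	}
	return strconv.Atoi(parts[1])
}

// upgradeable reports whether moving the API server from minor version
// apiMinor to apiMinor+1 would keep every kubelet within maxSkew minor
// versions of it. Hypothetical function; maxSkew=1 models the N-1 policy.
func upgradeable(apiMinor, maxSkew int, kubeletVersions []string) (bool, error) {
	for _, v := range kubeletVersions {
		m, err := minorOf(v)
		if err != nil {
			return false, err
		}
		// Skew is measured against the API server version after the upgrade.
		if (apiMinor+1)-m > maxSkew {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	// One node is up to date (N), one is a minor version behind (N-1).
	nodes := []string{"v1.20.0", "v1.19.4"}
	for _, skew := range []int{1, 2} {
		ok, err := upgradeable(20, skew, nodes)
		if err != nil {
			panic(err)
		}
		fmt.Printf("maxSkew=%d Upgradeable=%t\n", skew, ok)
	}
}
```

With an N-1 policy the lagging v1.19 node blocks the upgrade; with an N-2 policy the same cluster is upgradeable, matching the two cases in the text.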
@@ -304,8 +287,8 @@ that's broadly scoped as "EUS 4.6 to EUS 4.10 Validator"?
 
 - CI tests are necessary which attempt to upgrade while violating kubelet to API
 compatibility, ie: 4.6 to 4.7 upgrade with MachineConfigPools paused, then check
-for Upgradeable=False condition to be set by MCO assuming that our rules only allow
-for N-1 skew.
+for Upgradeable=False condition to be set by the API Server Operator, assuming
+that our rules only allow for N-1 skew.
 - CI tests are necessary which install an OLM Operator which expresses a maxKubeVersion
 or maxOCPVersion equal to the current cluster version and checks for Upgradeable=False
 on OLM
@@ -393,6 +376,32 @@ The idea is to find the best form of an argument why this enhancement should _no
 
 ## Alternatives
 
+### MCO Kubelet Skew Enforcement
+
+Instead of the API Server Operator enforcing kubelet skew compliance through
+the `Upgradeable` flag, the MCO could provide this functionality. These two
+operators are the obvious candidates for such a check, since between them they
+are responsible for both halves of the kubelet-API Server interaction. It makes
+more sense for the leading component to implement the check, however, since
+it's the leading edge that's going to violate the skew compliance first. In the
+case of OpenShift, that leading edge is the API Server, and it makes more sense
+for it to determine whether a step forward is going to violate the skew policy.
+On top of that, the gating mechanism we have today is the `Upgradeable=False`
+flag, which indicates that a particular operator cannot be upgraded, thereby
+halting the upgrade of the entire cluster. It doesn't make sense for the MCO to
+assert this condition, since an upgrade of the MCO and its operands (RHCOS)
+would actually reduce the skew. If the MCO were to use this mechanism to
+enforce the skew, it would be a reinterpretation of the function of that flag
+to instead indicate that the entire cluster cannot be upgraded. It's a subtle
+but important distinction that preserves low coupling between individual
+operators.
+
+### MCO Rollout Gating
+
+(This section was written assuming that the MCO would be responsible for
+enforcing the node skew policy, but this plan has since been modified to make
+the API Server Operator responsible for this enforcement.)
+
 Rather than having MCO enforce version skew policies between OS managed
 components and operator managed components it could simply set Upgradeable=False
 whenever a rollout is in progress. This would preclude minor version upgrades in