EUS Upgrades MVP #585
Conversation
> We have pre-existing mechanisms to alert admins to their use of APIs which are
> slated for removal in a future release. Currently those mechanisms do not look
> forward far enough to span 4.6 to 4.10, we will need to add the ability to
Experience question: do you envision that we know the intention of the user to move from EUS to EUS, or simply that we provide a way to see forward-looking information if the admin wants to see it?
More concretely, just as an example, is the vision:
- showing the API removals, operator versioning (below), timing (below), or other safeguards for the current, explicitly known step prominently in the existing experience with a link to "Upgrading to the next EUS?" that provides all the info we have
- that we know more explicitly they've picked a future destination (EUS or not) and we need an experience that customizes to that choice?
or is it a phase 1 then 2? :-)
---
Yes, if we want alerting to be different based on EUS vs not, we need to have some mechanism to know that they're EUS. Right now I'd planned on that being a requirement that they upgrade into a 4.6 version that falls after the EUS transition, and we'd only backport these alerting rules to those versions.
Requiring them to upgrade deep into 4.6.z may be a tough sell, but I think if we move on to eventually removing reboots it's a decent exchange because it allows us to assume a higher baseline of maturity: they've got at least all of the quality improvements that happened during the first year of 4.6. This requirement also allows us to limit the upgrade graph, and thus the versions we must test, by forcing them deep into 4.7, 4.8, and 4.9. Only 4.10 will be relatively new at that point, and they'd also have the option of waiting another 3-6 months before completing the 4.9 and 4.10 upgrades if they wish.
---
> Requiring them to upgrade deep into 4.6.z may be a tough sell...
For folks running from 4.6.{shortly-after-EUS-but-before-some-forward-looking-warning}, there's really not all that much difference between the first hop taking them to 4.6.{huge} or 4.7.{huge} and picking up the warning then, if we decide that 4.6.* -> 4.7.{huge} seems safe enough to add an update edge. So this just comes down to our usual "how many update edges are we willing to stretch tests over?", which is a new-release-build-time decision, and doesn't need any enhancement-time decisions.
---
While we know the risk is low, it's still a cluster roll and a lot of customers think "EUS" means "I only have to reboot or upgrade once every 18 months."
Based on the call with the API Server team it sounds like we're going to pursue only Info level alerts whenever we know of an API removal happening in the next four minor releases. We'll also pursue Upgradeable=False whenever the removal is in the next minor, so an API removed in 4.10 would get Info alerts in 4.6, 4.7, 4.8, and 4.9, and Upgradeable=False in 4.9. That way the admin knows while running 4.6 but isn't blocked from forward progress until something would actually break when they take that step.
That means it becomes safe to backport all of the logic related to these alerts to 4.6 regardless of whether or not it's an EUS cluster, therefore we can set a deadline of having this backported to 4.6 when 4.9 goes GA, which will be at least 12 weeks before potential 4.10 upgrades. I believe we can reasonably assert that they must be no more than 12 weeks old at the time we start providing EUS upgrades.
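To make that concrete, here is a minimal sketch of what such an Info-level rule could look like, assuming the kube-apiserver's `apiserver_requested_deprecated_apis` metric is the signal; the alert name, namespace, and release mapping below are illustrative, not the shipped rule:

```yaml
# Hypothetical PrometheusRule; 4.7-4.10 correspond to Kubernetes 1.20-1.23 here.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-removals-before-4-10   # illustrative name
  namespace: openshift-kube-apiserver
spec:
  groups:
  - name: api-usage
    rules:
    - alert: APIRemovedBeforeNextEUSInUse
      # Info only: something in the cluster has requested an API that is gone
      # by 4.10; the admin learns about it on 4.6 without being blocked.
      expr: |
        group by (group, version, resource) (
          apiserver_requested_deprecated_apis{removed_release=~"1.20|1.21|1.22|1.23"}
        )
      for: 1h
      labels:
        severity: info
      annotations:
        summary: An API slated for removal before 4.10 is in use.
```

The Upgradeable=False half of the plan would then key off the same signal, but only for removals landing in the very next minor's Kubernetes version.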
> we will require anyone upgrading along an EUS specific upgrade path to update to
> a 4.6 version which is specific to EUS. This is necessary because only those who
> follow the EUS upgrade path have 4.10 as their next logical target version. We
> should not alert non EUS clusters of removals in 4.10 until they've upgraded to
Maybe this is the intention-driven experience I was asking about above. It isn't that they've picked 4.10 (they still have to pick 4.7 as the next target); it's that they've upgraded to this 4.6 EUS version that has all the notification items backported to it?
In that case would there be an EUS 4.7, 4.8, and 4.9, or would the backported notifications show for people not on the EUS path too? (Or the alternative above that only shows {next release's notifications} with expansion to {next EUS notifications}.)
---
Yes, this is the intent signal.
The backporting to 4.7, 4.8, and 4.9 is a good point to raise. I hadn't thought about that; I would assume we leave the alerting rules in those versions at whatever is appropriate for non-EUS upgrades. We know that 4.6 EUS upgrades will be passing through those versions, but if we raise the red flags in 4.6 EUS versions and they start the upgrade anyway, I feel like they were given enough warning.
I'm also envisioning an escalation of alerts based on when the API is removed: APIs removed in N+1 are warning level, N+2 or greater are info level only. When running an EUS-specific version, N+1 should be treated as equivalent to the alert set of your eventual target minus one, i.e. 4.9 in the case where 4.6 and 4.10 are the two EUS versions.
---
I think now that we know we'll only use Info level alerts for all removals between 4.6 and 4.10 and then Upgradeable=False for removals between 4.N and 4.N+1 we no longer require this intent signal. It's safe to tell all 4.6 customers about removals between 4.6 and 4.10 and it's safe/required to inhibit upgrades from 4.6 to 4.7 when APIs removed in 4.7 are in use.
> #### As an admin preparing for 4.6 to 4.10 EUS upgrade I'm alerted to API removals
>
> We have pre-existing mechanisms to alert admins to their use of APIs which are
I'm assuming this is the notification and alert section.
For UX-related teams: I think we need to clarify whether this experience would be incorporated into the upgrade visualization (in addition to alerting) and, if so, how that API would work.
---
It may be useful if the CVO scraped up alerts of this nature and raised them itself; the console could then display only the alerts firing from the CVO on the upgrades panel.
---
Do we need to consider using Upgradeable=False if we detect usage of an API that is removed in the next version? In other words:
- 4.6
- 4.7
- 4.8: API server goes Upgradeable=False if there are CronJobs in the cluster
- 4.9: CronJobs are gone, but the cluster still has 1000 cron jobs created
- 4.10:
---
Info alerts are planned for 4.7, showing up in the console notification side-bar. We will have to backport to 4.6.
---
@sdodson interesting idea from @sferich888 today: if the APIs are deprecated in what we generate our API docs from, can we use that info to also generate big, scary release notes?
Another suggestion for collaboration: @jwmatthews and the container migration teams may have thoughts on how we can help admins work through deprecations and other ways to make sure they're front and center during our upgrade process.
---
@pweil- FYI, we have some beginning validation work in the context of Migrations from MTC to warn a user when they will face a problem because a GVK they relied on in the source cluster isn't present in the target cluster (https://github.com/konveyor/mig-controller/blob/master/pkg/gvk/gvk.go#L66). This is a problem space we are interested in, we expect to see it with future migrations, and we're open to collaborating. (cc @djwhatle @eriknelson @shawn-hurley @jortel)
---
k8s also has https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/1693-warnings, which kubectl & oc consume to inform users that they are using deprecated APIs. The question is if/how we could also consume that elsewhere.
---
@pweil- @sferich888 Yeah, I definitely agree that we'll want release notes, if not an upgrade section of the docs, that's unique to 4.6 to 4.10 upgrades. I'll add an epic placeholder to ensure that as part of the 4.10 dev cycle we pull this together.
I'll find a slot next week to sync up with @sanchezl @deads2k and you, @jwmatthews.
> - While I believe all of this work to be prerequisite to taking on work to skip
>   upgrade steps that may not be widely agreed upon.
>
> - Do we need to consider CCO for similar treatment that we're providing for API
@dgoodwin @joelddiaz Your thoughts here please
---
As long as we visit each minor release between 4.6->4.10, I think we're okay. We just need to make sure the minor release visited is a release where the CCO knows about the next minor release's requirements. So for example: you can't go from 4.6.10 (which can alert about upgradeability to 4.7) to 4.7.2 (which doesn't know about upgradeability to 4.8 because 4.7.2 was too early to know about what was coming in 4.8).
---
We'd always be respecting minimum version requirements; we'll be upgrading along very late 4.6.z, 4.7.z, and 4.8.z. While 4.9.z will be relatively young, it must meet the minimum version requirements to upgrade to 4.10.
> ### User Stories
>
> #### As an admin preparing for 4.6 to 4.10 EUS upgrade I'm alerted to API removals
>
> backport that API metadata to 4.6 EUS versions once we have a firm understanding
> of all removals between 4.6 and 4.10.
>
> #### As an admin I cannot violate OpenShift's defined Kubelet to API version skew
@darkmuggle PTAL at this proposed epic as well as the enhancement in full
> node_count_per_failure_domain) percentage of pods being rescheduled more than
> once during a rolling restart of a given failure domain.
>
> We should make the MCO topology aware so that it upgrades all of the host in one
The MCO should not be topology-aware, IMHO. Topology is very much platform-specific. Really, what is being asked is that the MCO would have to infer sub-pools based on cloud regions. The understanding of cloud-fault domains is well above the level of the MCO. The MCO would need a platform-agnostic means of enumerating the fault-domain and upgrade priorities.
---
I'm no longer certain this is necessary, but I'm going to leave it in until I hear back from the workloads team that it isn't.
---
If this is not necessary I would prefer to skip it, as this feels like a separate enhancement altogether.
---
This is really a rollup of many different enhancements across many teams, each likely to lead to an enhancement of its own if sufficiently complex to require one. I believe we should try to minimize pod rescheduling and I'm currently asking that the workloads team drive that investigation; we'll see what they come up with. I'll change this just to mention that it may require coordination between the MCO and the scheduler in some form.
> only core platform content?
>
> - Should we externalize the API removal bits? Perhaps via a marketplace operator
>   that's broadly scoped as "EUS 4.6 to EUS 4.10 Validator"?
If we've written our operators correctly, an EUS Validator would be entirely redundant. All operators must be able to reason about (and take action against!) the validity of their operands in the context of the larger system. This is what makes them an operator, as opposed to a controller.
---
I think the idea here is to inform about user workloads' use of deprecated/removed APIs between 4.6 and 4.10, not our own.
---
Correct, OLM is just the delivery mechanism to install an optional component which could decouple the EUS-specific API removal notification from the release payload. I don't think we want to attempt to remediate any issues identified, just inform the admin.
/priority important-soon
> failure domain because anti-affinity scheduling is used in most scenarios. We
> should provide some mechanism to rank upgraded nodes as slightly higher priority
> in scheduler configuration. Perhaps a timestamp of configuration generation that
> the scheduler then biases toward nodes with the greatest timestamps as a last in
This seems achievable and super-useful. Currently the MCO annotates nodes with current and target MachineConfig references. It could also annotate them with a rank or timestamp or some such of the current MachineConfig, and the scheduler could use that rank when breaking ties to increase the chances that drained workloads land on the already-updated nodes. With anti-affinity rules and all that usual stuff protecting workloads from getting overly wedged into a particular failure domain, even if that means temporarily scheduling some workload on still-to-be-updated nodes.
---
I think using a timestamp, while easy enough, is likely going to result in a problem as it does not indicate whether the nodes have the preferred state. I think having the MCO introduce the idea of an "update-epoch" would be clearer and would allow the scheduler to send workloads to the nodes with the highest epoch. Something like `machineconfiguration.openshift.io/update-epoch: 20` (this would be interesting for telemetry, too).
However, if someone has paused their machine config pools, it would have the effect of preferring to run workloads on unpaused pools (since they would have a higher epoch) during normal operations.
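For illustration, a node in an already-updated pool might then look something like the following; the current/desired config annotations exist today, while `update-epoch` is only the hypothetical addition being discussed here:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: worker-1a-example
  annotations:
    machineconfiguration.openshift.io/currentConfig: rendered-worker-0f3a9c1d
    machineconfiguration.openshift.io/desiredConfig: rendered-worker-0f3a9c1d
    # Hypothetical: bumped by the MCO each time the pool converges on a new
    # rendered config; the scheduler would prefer nodes with the highest value
    # when breaking ties for drained workloads.
    machineconfiguration.openshift.io/update-epoch: "20"
```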
> upgrade completes we rebalance so that the last node ends up with proportionate
> workload.
>
> #### ??? - Provide a hosted graph viewer which will assist disconnected admins in mirroring all content between 4.6 and 4.10
I'm still unsure which board this needs to land on, trying to track that down.
> If the policy were to change to allow for a version skew of N-2, v1.18 would be
> added to the list of acceptable matches. The MCO is not responsible for defining
How will these constraints be defined, and how does the MCO retroactively apply them to prevent the upgrade in the first place?
It sounds like the 4.6 MCO would first have to understand what the constraints are in order to prevent the upgrade to 4.7, despite the restrictions being defined in the 4.7 MCO?
---
Actually, on second thought, there is no real need for the MCO to expand on its current block on pools, since a previous upgrade not finishing would block further upgrades anyway. This means we would never have a kubelet version skew > 1, making this check pointless (unless you wanted to initiate a direct update from 4.6->4.8, which, as I understand it, shouldn't happen regardless).
---
The specific implementation needs to be hashed out, but it would be expected to be defined in the release payload of the version of the MCO which is running, most likely as a ConfigMap.
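A rough sketch of what that could look like, assuming a ConfigMap shipped in the payload and read by the running MCO; the name, namespace, and schema are illustrative only, aligned with the quoted example above where v1.18 would only become acceptable under an N-2 policy:

```yaml
# Hypothetical ConfigMap carried in the MCO's release payload; the schema is
# illustrative, not a committed design.
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-version-skew-constraints
  namespace: openshift-machine-config-operator
data:
  constraints.yaml: |
    # Kubelet versions that every node in every MachineConfigPool must be
    # running before the MCO considers the next minor upgrade safe.
    acceptableKubeletVersions:
    - "v1.19"   # N-1
    - "v1.20"   # N
    # A change to an N-2 policy would add v1.18 to this list.
```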
> ### MCO - Enforce OpenShift's defined host component version skew policies
>
> The MCO, will set Upgradeable=False whenever any MachineConfigPool has one more
This feels like a bit of an overreach onto the MCO. We should be setting Upgradeable=False because an upgrade isn't finished or a pool is degraded, and reporting our general status, but we shouldn't have to read into kubelet versions etc. to set anything. This feels like overcomplicating the matter? Why can the CVO not make the decision about the next acceptable upgrade based on operator statuses and known upgrade graphs?
---
Also, for this enhancement we should have the narrowest path possible for these upgrades, which would make them safer and easier to support.
---
The MCO manages the host OS, the host OS "manages" the kubelet, therefore the MCO seems like the right component to enforce constraints on host OS components; it's likely that the kernel will be an additional component with constraints.
If an MCP were paused before an upgrade is initiated, what happens in your scenario where we set Upgradeable=False when a pool isn't finished or is degraded?
The CVO already defines cluster version upgrade paths, but the long term goal is to introduce a decoupling of cluster version from host components while still enforcing version skew policies. This enhancement doesn't propose that, just the systems that enable us to change that policy in the future. The future would look like this (a condition sketch follows the list):
- Start with 4.6 cluster, Pause Worker MCP
- Upgrade 4.6 to 4.7
- Upgrade 4.7 to 4.8
- MCO is now Upgradeable=False because constraints are violated: the kubelet is 1.19 but we're only upgradeable with kubelets at 1.20 or 1.21, so we can't upgrade to 4.9.
- Worker MCP is unpaused
- When all hosts in MCPs now meet the constraints Upgradeable=True
- Pause Worker MCP again
- Upgrade to 4.9
- Upgrade to 4.10
- MCO is now Upgradeable=False because constraints are violated: the kubelet is 1.21 but we're looking for 1.22 or 1.23.
- Worker MCP is unpaused
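A rough sketch of what the MCO's ClusterOperator might report at the Upgradeable=False steps above; the Upgradeable condition is the existing mechanism, while the reason and message wording here are invented for illustration:

```yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  name: machine-config
status:
  conditions:
  - type: Upgradeable
    status: "False"
    reason: KubeletVersionSkew          # hypothetical reason string
    message: >-
      MachineConfigPool "worker" is paused and its nodes still run kubelet
      1.19; upgrading the control plane past 4.8 would exceed the supported
      version skew. Unpause the pool and let it reach 1.20 or 1.21 first.
```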
---
Is the plan then to skip versions for the worker nodes (e.g. 4.6->4.8->4.10 directly), and that is what will be in the separate proposal you mentioned? And this is setting up for that eventuality?
Put another way, this implementation is for safety only and has nothing to do with EUS. Today you could screw up your cluster by pausing the worker pool, upgrading your masters from 4.4->4.7, and unpausing the worker pool only to have it blow up since it can't make that jump directly. In which case, perhaps we should consider adding some sort of additional check to the wait-for-all-pools work, where paused worker pools block upgrades too?
---
@yuqi-zhang Yes, exactly, this closes that gap by defining constraints that all nodes in an MCP must meet before MCO agrees that the cluster can be upgraded to the next minor. We want to decouple that decision from whether or not the pool is up to date in order to provide more flexibility. If someone changed /etc/motd we wouldn't want that to block them from upgrading to the next minor while that rollout is in progress.
---
I think my question is why does the MCO have to even look at kubelet versions to determine this?
---
I'm open to other suggestions. We had thought of maintaining a list of machine-os-content pullspecs, or however that's represented, but that leaves RHEL workers out of the picture, and they are managed via MCPs even if that comes via the assistance of openshift-ansible. I'm also not sure how we'd maintain that list, whereas simple string matching or semantic version comparison would more accurately express the constraints.
Another item I hadn't mentioned: there may come a time when we need to express these constraints on kernel versions, specifically for RHEL workers.
Is the concern the cluster load required to examine node objects, the growth in RBAC necessary to do so, or being able to implement these rules effectively? Anything else?
---
> MCO is now Upgradeable=False because constraints are violated, kubelet is 1.19 but we're only upgradeable in 1.20 and 1.21, can't upgrade to 4.9.

@sdodson Just to add a question: if we add something like this for the general case, what are we doing to determine that the next upgrade will be 4.9 and not another 4.8 version (which should be allowed)? I.e. when your control plane is on 4.8 and your workers are on 4.6, you should block upgrades to 4.9 but not to another 4.8 version, correct? Without pre-flight checks (maybe in the CVO itself) for the overall upgrade, we won't know in the MCO what the next version would be (until an upgrade gets to the MCO, and it's already too late to block by then).
---
@yuqi-zhang The CVO only inhibits minor version upgrades when Upgradeable=False is set; it does not inhibit z-stream/patch upgrades. So an upgrade into a later 4.8.z would be allowed.
We do this so that if we screw up setting Upgradeable=False you're still allowed to upgrade into a later z-stream where we may fix that.
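As a rough illustration of that behavior (the Upgradeable condition on ClusterVersion exists; the version numbers and message wording are invented), the admin in this scenario might see something like:

```yaml
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  name: version
status:
  desired:
    version: 4.8.12            # illustrative
  conditions:
  - type: Upgradeable
    status: "False"
    # Wording invented for illustration.
    message: >-
      Cluster operator machine-config cannot be upgraded: worker nodes are
      still on kubelet 1.19.
  # A request to update to a later 4.8.z is still accepted; a request for a
  # 4.9 payload would be refused by the CVO's Upgradeable precondition.
```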
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: crawford.
> We have pre-existing mechanisms to alert admins to their use of APIs which are
> slated for removal in a future release. Currently those mechanisms do not look
> forward far enough to span 4.6 to 4.10, we will need to add the ability to
> backport that API metadata to 4.6 once we have a firm understanding of all
How does an OLM operator backport the same knowledge so that we can also alert customers to breaking APIs?
---
@shawn-hurley Do you mean CRDs installed and managed via OLM, or OLM-managed operators which use deprecated/removed core APIs? I would expect the latter to generate alerts already once this work lands, but the former would not be covered.
Based on current unresolved comments I believe that this PR is in a good state to merge as the umbrella proposal, with subsequent iterations on any additional guidance this proposal needs to provide and the following sub-topic implementations being continued discussions for owning teams:
- MCO safety checks:
- cluster update-retarget: https://github.com/openshift/enhancements/pull/585/files#r569160529
- Hosted graph viewer: https://github.com/openshift/enhancements/pull/585/files#r569688098
- Eliminate reboots (separate enhancement): https://github.com/openshift/enhancements/pull/585/files#r558461059
- User workload alerting: https://github.com/openshift/enhancements/pull/585/files#r558612148
- OLM Backports: #585 (comment)

/lgtm
/retest

Please review the full test history for this PR and help us cut down flakes.
An enhancement that focuses on general upgrade safety, speed, and a minimal amount of EUS to EUS specific work.
TODO: