HIVE-2819: Lift upgradeable condition from CVO to cluster deployment label#2639
Conversation
wking
left a comment
There was a problem hiding this comment.
I'm not a Hive approver, but looks good to me; thanks!
/lgtm
|
Colon delimiter to help the Jira-linking bots: /retitle HIVE-2819: Lift upgradeable condition from CVO to cluster deployment label |
|
@AlexVulaj: This pull request references HIVE-2819 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
2uasimojo
left a comment
There was a problem hiding this comment.
Please add some unit testing for this. We don't need much. Some helpful references:
- How I added UT in a previous PR in this area.
- The dummy ClusterVersion object you'll have to enhance with some status conditions to make your test work. (You may have to do some refactoring to get both positive and negative code paths covered, since currently this thing is hardcoded and buried inside another object.)
I also want to point out that this controller is by default only Watch()ing ClusterDeployment (on the hub), which means there may be a nontrivial delay between when the spoke ClusterVersion object changes and when the controller runs to update the hub CD labels. I believe @hlipsig struggled with this for ARO, resulting in https://issues.redhat.com//browse/HIVE-2619, which you may end up finding handy when you go to use this thing.
| upgradeableCondition := "" | ||
| for _, condition := range clusterVersion.Status.Conditions { | ||
| if condition.Type == openshiftapiv1.OperatorUpgradeable && condition.Status != openshiftapiv1.ConditionTrue { | ||
| upgradeableCondition = cmp.Or(condition.Message, fmt.Sprintf("%s: %s", condition.Type, condition.Status)) |
There was a problem hiding this comment.
Does this Message change frequently? Updating the label is going to trigger a requeue, which will go query it again. We don't want to end up thrashing.
There was a problem hiding this comment.
The CVO does have some throttling, but with a lower bound of 15s in the minOnFailedPreconditions case, it may be so little you don't care ;). There's also an ignoreThrottlePeriod knob in the CVO from openshift/cluster-version-operator@b6b7345, but at the moment that's still only used for the sync-on-CVO-container-exit, which should be rare. Or maybe 15s is enough that you can squeeze in a requeue, see no change, and back off until your (~2h?) next check on the cluster?
That's not a lot of hard numbers, and without a full survey of ClusterOperator maintainers, it's hard to claim exhaustive coverage. But anecdotally, I'm not aware of folks including mutable, high-churn strings in their messages, and I could see folks filing bugs against churny components to ask them to calm down. I could also see something defensive in Hive about "we realize this could churn, and are not interested in trying to keep up", and adding some kind of per-cluster throttling/back-off to avoid the thrashing you're concerned about.
|
|
||
| upgradeableCondition := "" | ||
| for _, condition := range clusterVersion.Status.Conditions { | ||
| if condition.Type == openshiftapiv1.OperatorUpgradeable && condition.Status != openshiftapiv1.ConditionTrue { |
There was a problem hiding this comment.
I presume there is only ever one status condition with this Type? If so, you could break once you've found it, as iterating through the remainder of the conditions is just a waste.
| @@ -1,6 +1,7 @@ | |||
| package clusterversion | |||
There was a problem hiding this comment.
I don't want the main comment-stream to get too noisy with the latency issue, so pulling this:
I also want to point out that this controller is by default only Watch()ing ClusterDeployment (on the hub), which means there may be a nontrivial delay between when the spoke ClusterVersion object changes and when the controller runs to update the hub CD labels.
out to this random, unrelated line of code to give it a dedicated thread.
I'm personally not concerned about the latency here, because most of the issues that Upgradeable complains about are long-running, slow-changing issues (like "you're on SDN; migrate to OVN to access 4.17"). So ~hours stale is likely to be still accurate in many cases.
And when we miss with a false negative (stale ClusterDeployment data said the update was ok, but turned out ClusterVersion had moved to be Upgradeable=False), it's not terrible. The user could request an update, and the cluster-version operator would reject the request with whatever the Upgradeable=False message was. So the cluster is still safe, it's just a bit more of a rug-pull UX. Having fresher data would improve the UX, but would not increase cluster safety.
When we miss with a false positive (stale ClusterDeployment data said the update was blocked, but turned out ClusterVersion had moved to be Upgradeable=True or unset the Upgradeable condition), it's not terrible either. The user's access to the next 4.(y+1) is delayed by an hour or two until the ClusterDeployment catches up. But it's just a feature update, patch updates pulling in bugfixes and CVEs would not be impacted. And users who want to avoid any risk of delay could just get their Upgradeable ducks lined up more than an hour before they were hoping to launch the update (hopefully nobody is actually trying to cut it that close).
The gap here is flappy issues like PoolUpdating, which I dropped in 4.19 (openshift/machine-config-operator#4760), and I'm happy to backport that (and fixes to any other flappy Upgradeable conditions, although I can't think of any offhand) to older 4.y, if stale ClusterDeployment UX impacts turn out to be an issue.
There was a problem hiding this comment.
Cool. Note that a dummy CD update (like an annotation) could be used to force a resync. That's a thing you can't do through OCM today (that I know of), but would be trivial to implement.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #2639 +/- ##
==========================================
+ Coverage 49.98% 49.99% +0.01%
==========================================
Files 281 281
Lines 33204 33215 +11
==========================================
+ Hits 16596 16607 +11
Misses 15267 15267
Partials 1341 1341
🚀 New features to boost your workflow:
|
This commit lifts the "Upgradeable" status from the ClusterVersion into a new hive.openshift.io/minor-version-upgrade-unavailable label on the ClusterDeployment. Higher level tools can consume this message to warn users about minor version upgrades that CVO would reject with a reason.
96f63f8 to
076760e
Compare
|
The tests look great, thanks @AlexVulaj. /lgtm |
|
@2uasimojo: The Use DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: 2uasimojo, AlexVulaj, wking The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@AlexVulaj: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/retest hive-on-pull-request |
|
@2uasimojo: The Use DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Bah, konflux seems down. Since we're not actually using it yet... /override "Red Hat Konflux / hive-on-pull-request" |
|
@2uasimojo: Overrode contexts on behalf of 2uasimojo: Red Hat Konflux / hive-on-pull-request DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
e9d99a8
into
openshift:master
This PR lifts the
Upgradeablestatus from the ClusterVersion into a newhive.openshift.io/minor-version-upgrade-unavailablelabel on the ClusterDeployment.Higher level tools can consume this message to warn users about minor version upgrades that CVO would reject with a reason.