HIVE-2819: Lift upgradeable condition from CVO to cluster deployment label by AlexVulaj · Pull Request #2639 · openshift/hive

AlexVulaj · 2025-03-28T20:11:43Z

This PR lifts the Upgradeable status from the ClusterVersion into a new hive.openshift.io/minor-version-upgrade-unavailable label on the ClusterDeployment.

Higher-level tools can consume this message to warn users about minor version upgrades that the CVO would reject, along with the reason.

wking

I'm not a Hive approver, but looks good to me; thanks!

/lgtm

wking · 2025-03-28T20:41:22Z

Colon delimiter to help the Jira-linking bots:

/retitle HIVE-2819: Lift upgradeable condition from CVO to cluster deployment label

openshift-ci-robot · 2025-03-28T20:41:29Z

@AlexVulaj: This pull request references HIVE-2819 which is a valid jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

2uasimojo

Please add some unit testing for this. We don't need much. Some helpful references:

How I added UT in a previous PR in this area.
The dummy ClusterVersion object you'll have to enhance with some status conditions to make your test work. (You may have to do some refactoring to get both positive and negative code paths covered, since currently this thing is hardcoded and buried inside another object.)

I also want to point out that this controller is by default only Watch()ing ClusterDeployment (on the hub), which means there may be a nontrivial delay between when the spoke ClusterVersion object changes and when the controller runs to update the hub CD labels. I believe @hlipsig struggled with this for ARO, resulting in https://issues.redhat.com//browse/HIVE-2619, which you may end up finding handy when you go to use this thing.

2uasimojo · 2025-03-28T20:53:22Z

+	upgradeableCondition := ""
+	for _, condition := range clusterVersion.Status.Conditions {
+		if condition.Type == openshiftapiv1.OperatorUpgradeable && condition.Status != openshiftapiv1.ConditionTrue {
+			upgradeableCondition = cmp.Or(condition.Message, fmt.Sprintf("%s: %s", condition.Type, condition.Status))


Does this Message change frequently? Updating the label is going to trigger a requeue, which will go query it again. We don't want to end up thrashing.

The CVO does have some throttling, but with a lower bound of 15s in the minOnFailedPreconditions case, it may be so little you don't care ;). There's also an ignoreThrottlePeriod knob in the CVO from openshift/cluster-version-operator@b6b7345, but at the moment that's still only used for the sync-on-CVO-container-exit, which should be rare. Or maybe 15s is enough that you can squeeze in a requeue, see no change, and back off until your (~2h?) next check on the cluster?

That's not a lot of hard numbers, and without a full survey of ClusterOperator maintainers, it's hard to claim exhaustive coverage. But anecdotally, I'm not aware of folks including mutable, high-churn strings in their messages, and I could see folks filing bugs against churny components to ask them to calm down. I could also see something defensive in Hive about "we realize this could churn, and are not interested in trying to keep up", and adding some kind of per-cluster throttling/back-off to avoid the thrashing you're concerned about.

2uasimojo · 2025-03-28T20:54:24Z


+	upgradeableCondition := ""
+	for _, condition := range clusterVersion.Status.Conditions {
+		if condition.Type == openshiftapiv1.OperatorUpgradeable && condition.Status != openshiftapiv1.ConditionTrue {


I presume there is only ever one status condition with this Type? If so, you could break once you've found it, as iterating through the remainder of the conditions is just a waste.
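The early exit suggested here can be sketched as follows. This is a self-contained illustration rather than the PR's code: `Condition` is a hypothetical stand-in for the OpenShift status condition type, the constants mirror `OperatorUpgradeable` and `ConditionTrue` under the assumption that their string values are "Upgradeable" and "True", and the explicit empty-message check stands in for the `cmp.Or` call in the diff.

```go
package main

import "fmt"

// Condition is a hypothetical stand-in for the OpenShift status condition
// type; only the fields used by this sketch are modeled.
type Condition struct {
	Type    string
	Status  string
	Message string
}

const (
	operatorUpgradeable = "Upgradeable" // assumed value of OperatorUpgradeable
	conditionTrue       = "True"        // assumed value of ConditionTrue
)

// upgradeBlockedMessage returns a non-empty explanation when an Upgradeable
// condition exists and is not True, and "" otherwise. It stops scanning at
// the first Upgradeable condition, on the assumption that there is only one.
func upgradeBlockedMessage(conditions []Condition) string {
	for _, c := range conditions {
		if c.Type != operatorUpgradeable {
			continue
		}
		if c.Status == conditionTrue {
			return ""
		}
		if c.Message != "" {
			return c.Message
		}
		// Fall back to "Type: Status" when the condition has no message.
		return fmt.Sprintf("%s: %s", c.Type, c.Status)
	}
	return ""
}
```

Returning from inside the loop has the same effect as `break`, and it keeps the "not found" and "found but True" cases on the same empty-string path.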

wking · 2025-03-28T21:42:30Z

@@ -1,6 +1,7 @@
 package clusterversion


I don't want the main comment-stream to get too noisy with the latency issue, so pulling this:

I also want to point out that this controller is by default only Watch()ing ClusterDeployment (on the hub), which means there may be a nontrivial delay between when the spoke ClusterVersion object changes and when the controller runs to update the hub CD labels.

out to this random, unrelated line of code to give it a dedicated thread.

I'm personally not concerned about the latency here, because most of the issues that Upgradeable complains about are long-running, slow-changing issues (like "you're on SDN; migrate to OVN to access 4.17"). So ~hours stale is likely to be still accurate in many cases.

And when we miss with a false negative (stale ClusterDeployment data said the update was ok, but it turned out ClusterVersion had moved to Upgradeable=False), it's not terrible. The user could request an update, and the cluster-version operator would reject the request with whatever the Upgradeable=False message was. So the cluster is still safe; it's just a bit more of a rug-pull UX. Having fresher data would improve the UX, but would not increase cluster safety.

When we miss with a false positive (stale ClusterDeployment data said the update was blocked, but it turned out ClusterVersion had moved to Upgradeable=True or unset the Upgradeable condition), it's not terrible either. The user's access to the next 4.(y+1) is delayed by an hour or two until the ClusterDeployment catches up. But that only affects the feature update; patch updates pulling in bug fixes and CVE fixes would not be impacted. And users who want to avoid any risk of delay could just get their Upgradeable ducks lined up more than an hour before they were hoping to launch the update (hopefully nobody is actually trying to cut it that close).

The gap here is flappy issues like PoolUpdating, which I dropped in 4.19 (openshift/machine-config-operator#4760), and I'm happy to backport that (and fixes to any other flappy Upgradeable conditions, although I can't think of any offhand) to older 4.y, if stale ClusterDeployment UX impacts turn out to be an issue.

Cool. Note that a dummy CD update (like an annotation) could be used to force a resync. That's a thing you can't do through OCM today (that I know of), but would be trivial to implement.

codecov · 2025-03-28T21:59:39Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 49.99%. Comparing base (bd97bef) to head (076760e).
Report is 2 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #2639      +/-   ##
==========================================
+ Coverage   49.98%   49.99%   +0.01%     
==========================================
  Files         281      281              
  Lines       33204    33215      +11     
==========================================
+ Hits        16596    16607      +11     
  Misses      15267    15267              
  Partials     1341     1341

Files with missing lines	Coverage Δ
pkg/constants/constants.go	`100.00% <ø> (ø)`
...roller/clusterversion/clusterversion_controller.go	`45.45% <100.00%> (+5.45%)`	⬆️

This commit lifts the "Upgradeable" status from the ClusterVersion into a new hive.openshift.io/minor-version-upgrade-unavailable label on the ClusterDeployment. Higher-level tools can consume this message to warn users about minor version upgrades that the CVO would reject, along with the reason.

2uasimojo · 2025-03-31T15:44:05Z

The tests look great, thanks @AlexVulaj.

/lgtm
/retest hive-on-pull-request

openshift-ci · 2025-03-31T15:44:12Z

@2uasimojo: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

/test coverage

/test e2e

/test e2e-azure

/test e2e-gcp

/test e2e-pool

/test e2e-vsphere

/test images

/test periodic-images

/test security

/test unit

/test verify

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-hive-master-coverage

pull-ci-openshift-hive-master-e2e

pull-ci-openshift-hive-master-e2e-pool

pull-ci-openshift-hive-master-images

pull-ci-openshift-hive-master-periodic-images

pull-ci-openshift-hive-master-security

pull-ci-openshift-hive-master-unit

pull-ci-openshift-hive-master-verify

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci · 2025-03-31T15:45:27Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 2uasimojo, AlexVulaj, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [2uasimojo]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2025-03-31T17:02:24Z

@AlexVulaj: all tests passed!

Full PR test history. Your PR dashboard.

2uasimojo · 2025-03-31T17:17:21Z

/retest hive-on-pull-request

openshift-ci · 2025-03-31T17:17:50Z

@2uasimojo: The /retest command does not accept any targets.

2uasimojo · 2025-03-31T17:19:46Z

Bah, konflux seems down. Since we're not actually using it yet...

/override "Red Hat Konflux / hive-on-pull-request"

openshift-ci · 2025-03-31T17:20:13Z

@2uasimojo: Overrode contexts on behalf of 2uasimojo: Red Hat Konflux / hive-on-pull-request

openshift-ci Bot requested review from dlom and jstuever March 28, 2025 20:12

wking approved these changes Mar 28, 2025

openshift-ci Bot assigned wking Mar 28, 2025

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Mar 28, 2025

AlexVulaj changed the title ~~Lift upgradeable condition from CVO to cluster deployment label.~~ https://issues.redhat.com/browse/HIVE-2819 Lift upgradeable condition from CVO to cluster deployment label. Mar 28, 2025

AlexVulaj changed the title ~~https://issues.redhat.com/browse/HIVE-2819 Lift upgradeable condition from CVO to cluster deployment label.~~ HIVE-2819 | Lift upgradeable condition from CVO to cluster deployment label. Mar 28, 2025

openshift-ci Bot changed the title ~~HIVE-2819 | Lift upgradeable condition from CVO to cluster deployment label.~~ HIVE-2819: Lift upgradeable condition from CVO to cluster deployment label Mar 28, 2025

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 28, 2025

2uasimojo reviewed Mar 28, 2025

wking reviewed Mar 28, 2025

AlexVulaj force-pushed the cluster-deployment-labels-upgradeable branch from 96f63f8 to 076760e Compare March 31, 2025 14:53

openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 31, 2025

openshift-ci Bot assigned 2uasimojo Mar 31, 2025

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Mar 31, 2025

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 31, 2025

openshift-merge-bot Bot merged commit e9d99a8 into openshift:master Mar 31, 2025
10 of 11 checks passed

AlexVulaj deleted the cluster-deployment-labels-upgradeable branch March 31, 2025 17:37

AlexVulaj mentioned this pull request Apr 1, 2025

HIVE-2819: Use annotation instead of label for upgradeable #2650

Merged
