
HIVE-2819: Lift upgradeable condition from CVO to cluster deployment label#2639

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:master from AlexVulaj:cluster-deployment-labels-upgradeable on Mar 31, 2025

Conversation

@AlexVulaj (Contributor)

This PR lifts the Upgradeable status from the ClusterVersion into a new hive.openshift.io/minor-version-upgrade-unavailable label on the ClusterDeployment.

Higher-level tools can consume this message to warn users about minor-version upgrades that the CVO would reject, and surface the reason for the rejection.
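As a rough illustration of the intended consumption pattern, a higher-level tool might check the label like this. The label key comes from this PR; the ClusterDeployment struct and minorUpgradeBlocked helper below are simplified stand-ins, not Hive or OCM code:

```go
package main

import "fmt"

// Label key introduced by this PR.
const upgradeUnavailableLabel = "hive.openshift.io/minor-version-upgrade-unavailable"

// ClusterDeployment is a simplified stand-in for the hive.openshift.io/v1 type.
type ClusterDeployment struct {
	Labels map[string]string
}

// minorUpgradeBlocked reports whether the hub thinks the CVO would reject a
// minor-version upgrade, and returns the recorded reason if so.
func minorUpgradeBlocked(cd ClusterDeployment) (bool, string) {
	reason, ok := cd.Labels[upgradeUnavailableLabel]
	return ok, reason
}

func main() {
	cd := ClusterDeployment{Labels: map[string]string{
		upgradeUnavailableLabel: "ClusterVersionOverridesSet",
	}}
	blocked, reason := minorUpgradeBlocked(cd)
	fmt.Println(blocked, reason) // true ClusterVersionOverridesSet
}
```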

@openshift-ci openshift-ci Bot requested review from dlom and jstuever March 28, 2025 20:12
@wking (Member) left a comment

I'm not a Hive approver, but looks good to me; thanks!

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Mar 28, 2025
@AlexVulaj AlexVulaj changed the title Lift upgradeable condition from CVO to cluster deployment label. https://issues.redhat.com/browse/HIVE-2819 Lift upgradeable condition from CVO to cluster deployment label. Mar 28, 2025
@AlexVulaj AlexVulaj changed the title https://issues.redhat.com/browse/HIVE-2819 Lift upgradeable condition from CVO to cluster deployment label. HIVE-2819 | Lift upgradeable condition from CVO to cluster deployment label. Mar 28, 2025
@wking (Member) commented Mar 28, 2025

Colon delimiter to help the Jira-linking bots:

/retitle HIVE-2819: Lift upgradeable condition from CVO to cluster deployment label

@openshift-ci openshift-ci Bot changed the title HIVE-2819 | Lift upgradeable condition from CVO to cluster deployment label. HIVE-2819: Lift upgradeable condition from CVO to cluster deployment label Mar 28, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 28, 2025
@openshift-ci-robot commented Mar 28, 2025

@AlexVulaj: This pull request references HIVE-2819 which is a valid jira issue.


In response to this:

This PR lifts the Upgradeable status from the ClusterVersion into a new hive.openshift.io/minor-version-upgrade-unavailable label on the ClusterDeployment.

Higher level tools can consume this message to warn users about minor version upgrades that CVO would reject with a reason.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@2uasimojo (Member) left a comment

Please add some unit testing for this. We don't need much. Some helpful references:

  • How I added UT in a previous PR in this area.
  • The dummy ClusterVersion object you'll have to enhance with some status conditions to make your test work. (You may have to do some refactoring to get both positive and negative code paths covered, since currently this thing is hardcoded and buried inside another object.)

I also want to point out that this controller is by default only Watch()ing ClusterDeployment (on the hub), which means there may be a nontrivial delay between when the spoke ClusterVersion object changes and when the controller runs to update the hub CD labels. I believe @hlipsig struggled with this for ARO, resulting in https://issues.redhat.com/browse/HIVE-2619, which you may end up finding handy when you go to use this thing.
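For the unit-testing ask above, a table-driven sketch covering the positive and negative paths might look like this. extractUpgradeableMessage is a hypothetical helper factored out of the controller, and condition is a simplified stand-in for the openshift/api type:

```go
package main

import "fmt"

type condition struct{ Type, Status, Message string }

// extractUpgradeableMessage is a hypothetical helper factored out of the
// controller so both code paths can be covered without a full reconcile:
// it returns the label value to set, or "" when the label should be absent.
func extractUpgradeableMessage(conds []condition) string {
	for _, c := range conds {
		if c.Type == "Upgradeable" && c.Status != "True" {
			if c.Message != "" {
				return c.Message
			}
			return fmt.Sprintf("%s: %s", c.Type, c.Status)
		}
	}
	return ""
}

func main() {
	cases := []struct {
		name  string
		conds []condition
		want  string
	}{
		{"upgradeable false with message", []condition{{"Upgradeable", "False", "pool updating"}}, "pool updating"},
		{"upgradeable false without message", []condition{{"Upgradeable", "False", ""}}, "Upgradeable: False"},
		{"upgradeable true", []condition{{"Upgradeable", "True", ""}}, ""},
		{"condition absent", nil, ""},
	}
	for _, tc := range cases {
		if got := extractUpgradeableMessage(tc.conds); got != tc.want {
			panic(fmt.Sprintf("%s: got %q, want %q", tc.name, got, tc.want))
		}
	}
	fmt.Println("all cases pass")
}
```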

upgradeableCondition := ""
for _, condition := range clusterVersion.Status.Conditions {
	if condition.Type == openshiftapiv1.OperatorUpgradeable && condition.Status != openshiftapiv1.ConditionTrue {
		upgradeableCondition = cmp.Or(condition.Message, fmt.Sprintf("%s: %s", condition.Type, condition.Status))
	}
}

Does this Message change frequently? Updating the label is going to trigger a requeue, which will go query it again. We don't want to end up thrashing.

@wking (Member) Mar 28, 2025

The CVO does have some throttling, but with a lower bound of 15s in the minOnFailedPreconditions case, it may be so little you don't care ;). There's also an ignoreThrottlePeriod knob in the CVO from openshift/cluster-version-operator@b6b7345, but at the moment that's still only used for the sync-on-CVO-container-exit, which should be rare. Or maybe 15s is enough that you can squeeze in a requeue, see no change, and back off until your (~2h?) next check on the cluster?

That's not a lot of hard numbers, and without a full survey of ClusterOperator maintainers, it's hard to claim exhaustive coverage. But anecdotally, I'm not aware of folks including mutable, high-churn strings in their messages, and I could see folks filing bugs against churny components to ask them to calm down. I could also see something defensive in Hive about "we realize this could churn, and are not interested in trying to keep up", and adding some kind of per-cluster throttling/back-off to avoid the thrashing you're concerned about.


upgradeableCondition := ""
for _, condition := range clusterVersion.Status.Conditions {
	if condition.Type == openshiftapiv1.OperatorUpgradeable && condition.Status != openshiftapiv1.ConditionTrue {

I presume there is only ever one status condition with this Type? If so, you could break once you've found it, as iterating through the remainder of the conditions is just a waste.
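Applied to the hunk above, the early exit might look like the following self-contained sketch. The constants and condition struct stand in for the openshift/api types, and cmp.Or is Go 1.22 stdlib, as in the original hunk:

```go
package main

import (
	"cmp"
	"fmt"
)

// Simplified stand-ins for openshiftapiv1.OperatorUpgradeable / ConditionTrue.
const (
	operatorUpgradeable = "Upgradeable"
	conditionTrue       = "True"
)

type condition struct{ Type, Status, Message string }

func upgradeableMessage(conditions []condition) string {
	msg := ""
	for _, c := range conditions {
		if c.Type != operatorUpgradeable {
			continue
		}
		if c.Status != conditionTrue {
			msg = cmp.Or(c.Message, fmt.Sprintf("%s: %s", c.Type, c.Status))
		}
		break // at most one condition per Type, so stop scanning here
	}
	return msg
}

func main() {
	fmt.Println(upgradeableMessage([]condition{
		{Type: "Upgradeable", Status: "False", Message: ""},
		{Type: "Available", Status: "True"},
	})) // Upgradeable: False
}
```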

@@ -1,6 +1,7 @@
package clusterversion
@wking (Member) Mar 28, 2025

I don't want the main comment-stream to get too noisy with the latency issue, so pulling this:

I also want to point out that this controller is by default only Watch()ing ClusterDeployment (on the hub), which means there may be a nontrivial delay between when the spoke ClusterVersion object changes and when the controller runs to update the hub CD labels.

out to this random, unrelated line of code to give it a dedicated thread.

I'm personally not concerned about the latency here, because most of the issues that Upgradeable complains about are long-running, slow-changing issues (like "you're on SDN; migrate to OVN to access 4.17"). So ~hours stale is likely to be still accurate in many cases.

And when we miss with a false negative (stale ClusterDeployment data said the update was ok, but turned out ClusterVersion had moved to be Upgradeable=False), it's not terrible. The user could request an update, and the cluster-version operator would reject the request with whatever the Upgradeable=False message was. So the cluster is still safe, it's just a bit more of a rug-pull UX. Having fresher data would improve the UX, but would not increase cluster safety.

When we miss with a false positive (stale ClusterDeployment data said the update was blocked, but turned out ClusterVersion had moved to be Upgradeable=True or unset the Upgradeable condition), it's not terrible either. The user's access to the next 4.(y+1) is delayed by an hour or two until the ClusterDeployment catches up. But it's just a feature update, patch updates pulling in bugfixes and CVEs would not be impacted. And users who want to avoid any risk of delay could just get their Upgradeable ducks lined up more than an hour before they were hoping to launch the update (hopefully nobody is actually trying to cut it that close).

The gap here is flappy issues like PoolUpdating, which I dropped in 4.19 (openshift/machine-config-operator#4760), and I'm happy to backport that (and fixes to any other flappy Upgradeable conditions, although I can't think of any offhand) to older 4.y, if stale ClusterDeployment UX impacts turn out to be an issue.


Cool. Note that a dummy CD update (like an annotation) could be used to force a resync. That's a thing you can't do through OCM today (that I know of), but would be trivial to implement.

@codecov Bot commented Mar 28, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 49.99%. Comparing base (bd97bef) to head (076760e).
Report is 2 commits behind head on master.


@@            Coverage Diff             @@
##           master    #2639      +/-   ##
==========================================
+ Coverage   49.98%   49.99%   +0.01%     
==========================================
  Files         281      281              
  Lines       33204    33215      +11     
==========================================
+ Hits        16596    16607      +11     
  Misses      15267    15267              
  Partials     1341     1341              
Files with missing lines                                   Coverage Δ
pkg/constants/constants.go                                 100.00% <ø> (ø)
...roller/clusterversion/clusterversion_controller.go      45.45% <100.00%> (+5.45%) ⬆️

This commit lifts the "Upgradeable" status from the ClusterVersion into a new hive.openshift.io/minor-version-upgrade-unavailable label on the ClusterDeployment.

Higher level tools can consume this message to warn users about minor version upgrades that CVO would reject with a reason.
@AlexVulaj AlexVulaj force-pushed the cluster-deployment-labels-upgradeable branch from 96f63f8 to 076760e on March 31, 2025 14:53
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 31, 2025
@2uasimojo (Member)

The tests look great, thanks @AlexVulaj.

/lgtm
/retest hive-on-pull-request

@openshift-ci Bot commented Mar 31, 2025

@2uasimojo: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

/test coverage
/test e2e
/test e2e-azure
/test e2e-gcp
/test e2e-pool
/test e2e-vsphere
/test images
/test periodic-images
/test security
/test unit
/test verify

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-hive-master-coverage
pull-ci-openshift-hive-master-e2e
pull-ci-openshift-hive-master-e2e-pool
pull-ci-openshift-hive-master-images
pull-ci-openshift-hive-master-periodic-images
pull-ci-openshift-hive-master-security
pull-ci-openshift-hive-master-unit
pull-ci-openshift-hive-master-verify

In response to this:

The tests look great, thanks @AlexVulaj.

/lgtm
/retest hive-on-pull-request


@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Mar 31, 2025
@openshift-ci Bot commented Mar 31, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 2uasimojo, AlexVulaj, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 31, 2025
@openshift-ci Bot commented Mar 31, 2025

@AlexVulaj: all tests passed!


@2uasimojo (Member)

/retest hive-on-pull-request

@openshift-ci Bot commented Mar 31, 2025

@2uasimojo: The /retest command does not accept any targets. The available /test commands and automatically triggered jobs are the same as in the bot's earlier reply above.

In response to this:

/retest hive-on-pull-request

@2uasimojo (Member)

Bah, konflux seems down. Since we're not actually using it yet...

/override "Red Hat Konflux / hive-on-pull-request"

@openshift-ci Bot commented Mar 31, 2025

@2uasimojo: Overrode contexts on behalf of 2uasimojo: Red Hat Konflux / hive-on-pull-request


In response to this:

Bah, konflux seems down. Since we're not actually using it yet...

/override "Red Hat Konflux / hive-on-pull-request"


@openshift-merge-bot openshift-merge-bot Bot merged commit e9d99a8 into openshift:master Mar 31, 2025
10 of 11 checks passed
@AlexVulaj AlexVulaj deleted the cluster-deployment-labels-upgradeable branch March 31, 2025 17:37