OCPBUGS-15200: Filter out shallowly `UpdateEffectNone` errors from a `MultipleErrors` message in the Failing condition by DavidHurta · Pull Request #1050 · openshift/cluster-version-operator

DavidHurta · 2024-06-03T14:08:13Z

Various errors get propagated to users, such as the summarized task
graph error, for example in the form of the message in the Failing
condition. However, update errors whose update effect is
UpdateEffectNone can confuse users, because these primarily
informational messages are displayed together with genuine update
errors that heavily impact the update. This can result in a message
such as:

{
  "lastTransitionTime": "2023-06-20T13:40:12Z",
  "message": "Multiple errors are preventing progress:\n* Cluster operator authentication is updating versions\n* Could not update customresourcedefinition \"alertingrules.monitoring.openshift.io\" (512 of 993): the object is invalid, possibly due to local cluster configuration",
  "reason": "MultipleErrors",
  "status": "True",
  "type": "Failing"
}

The Failing condition is not true because of the UpdateEffectNone
error ("Cluster operator authentication is updating versions"), but
its message still gets displayed.

This PR makes sure that update errors that do not heavily affect
the update are removed from the Failing condition message, to an
extent: only UpdateEffectNone errors at the top level of a
MultipleErrors message are filtered out.
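
As a rough illustration of the idea (not the code in this PR), the shallow filtering could be sketched as below. The `updateError` type, the `effectNone`/`effectFail` constants, and `filterShallow` are hypothetical stand-ins invented for this sketch. Returning the input unchanged when nothing would remain is also an assumption.

```go
package main

import "fmt"

// updateEffect is a simplified, hypothetical stand-in for the effect
// attached to an update error.
type updateEffect string

const (
	effectNone updateEffect = "None" // informational; does not make Failing true
	effectFail updateEffect = "Fail" // genuinely blocks the update
)

// updateError is a trimmed-down error type used only in this sketch.
type updateError struct {
	Message string
	Effect  updateEffect
}

// filterShallow drops top-level errors whose effect is effectNone.
// It does not descend into nested errors. If nothing would remain,
// the input is returned unchanged so the message is never emptied
// (an assumption of this sketch).
func filterShallow(errs []updateError) []updateError {
	kept := make([]updateError, 0, len(errs))
	for _, e := range errs {
		if e.Effect != effectNone {
			kept = append(kept, e)
		}
	}
	if len(kept) == 0 {
		return errs
	}
	return kept
}

func main() {
	errs := []updateError{
		{Message: "Cluster operator authentication is updating versions", Effect: effectNone},
		{Message: `Could not update customresourcedefinition "alertingrules.monitoring.openshift.io" (512 of 993): the object is invalid, possibly due to local cluster configuration`, Effect: effectFail},
	}
	// Only the error that actually blocks the update is printed.
	for _, e := range filterShallow(errs) {
		fmt.Println("* " + e.Message)
	}
}
```

Filtering a copy at the point where the message is rendered leaves the original error list intact for logs and other consumers.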

This pull request references https://issues.redhat.com/browse/OCPBUGS-15200

openshift-ci-robot · 2024-06-03T14:09:38Z

@Davoska: This pull request references Jira Issue OCPBUGS-15200, which is invalid:

expected the bug to target the "4.17.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

The summarized task graph error gets propagated to users. For example, in the form of the message in the Failing condition. However, update errors set with the update effect of UpdateEffectNone can confuse users as these informing messages get displayed together with valid update errors impacting the update. This can result in a message such as:
{
 "lastTransitionTime": "2023-06-20T13:40:12Z",
 "message": "Multiple errors are preventing progress:\n* Cluster operator authentication is updating versions\n* Could not update customresourcedefinition \"alertingrules.monitoring.openshift.io\" (512 of 993): the object is invalid, possibly due to local cluster configuration",
 "reason": "MultipleErrors",
 "status": "True",
 "type": "Failing"
}
The Failing condition is not true because of the UpdateEffectNone error ("Cluster operator authentication is updating versions"), but its message still gets displayed.

This commit makes sure that update errors that have no effect on the update are not propagated further, which improves the user experience. However, they are still shown in the logs to help with more precise debugging.

This pull request references https://issues.redhat.com/browse/OCPBUGS-15200

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

DavidHurta · 2024-06-03T14:11:13Z

/jira refresh

openshift-ci-robot · 2024-06-03T14:11:21Z

@Davoska: This pull request references Jira Issue OCPBUGS-15200, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.17.0) matches configured target version for branch (4.17.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jiajliu


DavidHurta · 2024-06-03T14:13:26Z

I would like to test this on a live cluster (edit: and fix the failing CI). Thus, I am putting this PR on hold for the time being.

/hold

DavidHurta · 2024-06-03T14:15:32Z

/uncc LalatenduMohanty
/cc @wking

petr-muller · 2024-06-03T15:12:40Z

Approach SGTM 👍

petr-muller · 2024-06-03T15:24:56Z

I have not looked at the code closely yet, but one piece to check for possible interaction is #1041, which renders all reconciliation problems (including the UpdateEffectNone ones) for external consumption, as a pseudo-API.

We'd like to keep UpdateEffectNone errors there, if possible. I think filtering them out on the producer side would hide them from ReconciliationIssues?

DavidHurta · 2024-06-04T13:25:29Z

/hold

I am re-working the PR.

openshift-ci-robot · 2024-06-11T19:11:24Z

@Davoska: This pull request references Jira Issue OCPBUGS-15200, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.17.0) matches configured target version for branch (4.17.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @dis016


petr-muller

approach lgtm, some code readability nits + what Trevor says ;)

dis016 · 2024-08-23T07:35:13Z

Test scenario: Make a CO (authentication) degraded.

Original Failure: Reason: MultipleErrors; Message: Multiple issues: CO A is degraded, CO B is updating versions

Install a 4.17 cluster and degrade the Cluster operator authentication.

NAME      VERSION                                                AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.17.0-0.test-2024-08-22-131833-ci-ln-ntyc4tb-latest   True        False         99m     Cluster version is 4.17.0-0.test-2024-08-22-131833-ci-ln-ntyc4tb-latest
%
% cat oauth.yaml 
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: oidcidp 
    mappingMethod: claim 
    type: OpenID
    openID:
      clientID: test
      clientSecret: 
        name: test
      claims: 
        preferredUsername:
        - preferred_username
        name:
        - name
        email:
        - email
      issuer: https://www.idp-issuer.example.com 
 % oc apply -f oauth.yaml 
Warning: resource oauths/cluster is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by oc apply. oc apply should only be used on resources created declaratively by either oc create --save-config or oc apply. The missing annotation will be patched automatically.
oauth.config.openshift.io/cluster configured
 %  oc get co authentication
NAME             VERSION                                                AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.17.0-0.test-2024-08-22-131833-ci-ln-ntyc4tb-latest   True        False         True       103m    OAuthServerConfigObservationDegraded: failed to apply IDP oidcidp config: dial tcp: lookup www.idp-issuer.example.com on 172.30.0.10:53: no such host
%

Trigger an upgrade to a version that doesn't contain the PR changes

% oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release@sha256:b8105494ce61dc1f5ba68f173c78adfb834ff70c66e7399b9ae401021517f27f  --allow-explicit-upgrade --force
warning: The requested upgrade image is not one of the available updates.  You have used --allow-explicit-upgrade to the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.ci.openshift.org/ocp/release@sha256:b8105494ce61dc1f5ba68f173c78adfb834ff70c66e7399b9ae401021517f27f
% 
% oc adm upgrade 
Error while reconciling 4.17.0-0.test-2024-08-22-131833-ci-ln-ntyc4tb-latest: the cluster operator authentication is degraded

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

%  oc adm upgrade status 
info: An upgrade is in progress. Working towards 4.17.0-0.nightly-2024-08-19-165854: 6 of 900 done (0% complete)

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.


The upgrade proceeds despite the error, and the CVO reports the error:


% while true; do oc adm upgrade; oc get clusterversion version -o json | jq '.status.conditions[] | select (.type=="Failing")' ; sleep 60; done  

info: An upgrade is in progress. Working towards 4.17.0-0.nightly-2024-08-19-165854: 110 of 900 done (12% complete), waiting on etcd, kube-apiserver

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

{
  "lastTransitionTime": "2024-08-22T15:41:42Z",
  "status": "False",
  "type": "Failing"
}
...
...
info: An upgrade is in progress. Unable to apply 4.17.0-0.nightly-2024-08-19-165854: an unknown error has occurred: MultipleErrors

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

{
  "lastTransitionTime": "2024-08-22T16:06:12Z",
  "message": "Multiple errors are preventing progress:\n* Cluster operator authentication is degraded\n* Cluster operators cluster-autoscaler, console, marketplace, monitoring, node-tuning, openshift-apiserver, openshift-controller-manager are updating versions",
  "reason": "MultipleErrors",
  "status": "True",
  "type": "Failing"
}

Expected/New Failure: Reason: ClusterOperatorDegraded; Message: CO A is degraded
Install a 4.17 Cluster and degrade the CO authentication

% oc get clusterversion 
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.17.0-0.nightly-2024-08-18-131731   True        False         22m     Cluster version is 4.17.0-0.nightly-2024-08-18-131731
%
% cat oauth.yaml 
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: oidcidp 
    mappingMethod: claim 
    type: OpenID
    openID:
      clientID: test
      clientSecret: 
        name: test
      claims: 
        preferredUsername:
        - preferred_username
        name:
        - name
        email:
        - email
      issuer: https://www.idp-issuer.example.com 
% oc apply -f oauth.yaml 
Warning: resource oauths/cluster is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by oc apply. oc apply should only be used on resources created declaratively by either oc create --save-config or oc apply. The missing annotation will be patched automatically.
oauth.config.openshift.io/cluster configured
% 
% oc get co authentication 
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.17.0-0.nightly-2024-08-18-131731   True        False         True       28m     OAuthServerConfigObservationDegraded: failed to apply IDP oidcidp config: dial tcp: lookup www.idp-issuer.example.com on 172.30.0.10:53: no such host
%

Trigger an upgrade to a version that contains the PR changes

% oc adm upgrade --to-image=registry.build05.ci.openshift.org/ci-ln-r73233t/releasesha256:6005ad60e79b21be48536e8574123d9a5c1b698f79622722edf23aca45884732  --allow-explicit-upgrade --force
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates.  You have used --allow-explicit-upgrade to the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.build05.ci.openshift.org/ci-ln-r73233t/releasesha256:6005ad60e79b21be48536e8574123d9a5c1b698f79622722edf23aca45884732
% 
% oc adm upgrade status
Error while reconciling 4.17.0-0.nightly-2024-08-18-131731: the cluster operator authentication is degraded

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

%

~~Upgrade is not triggered and CVO is throwing an error~~. The upgrade was not triggered because of a typo in the `oc adm upgrade` command above (`releasesha256:` instead of `release@sha256:`).

% while true; do oc adm upgrade; oc get clusterversion version -o json | jq '.status.conditions[] | select (.type=="Failing")' ; sleep 60; done  

Error while reconciling 4.17.0-0.nightly-2024-08-18-131731: the cluster operator authentication is degraded

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

{
  "lastTransitionTime": "2024-08-23T07:12:40Z",
  "message": "Cluster operator authentication is degraded",
  "reason": "ClusterOperatorDegraded",
  "status": "True",
  "type": "Failing"
}
Error while reconciling 4.17.0-0.nightly-2024-08-18-131731: the cluster operator authentication is degraded

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

{
  "lastTransitionTime": "2024-08-23T07:12:40Z",
  "message": "Cluster operator authentication is degraded",
  "reason": "ClusterOperatorDegraded",
  "status": "True",
  "type": "Failing"
}
 %

petr-muller · 2024-08-26T14:10:03Z

@dis016 this looks good, right? if so (and unless you plan more testing) can you please drop a /label qe-approved here?

dis016 · 2024-08-27T03:35:21Z

@petr-muller I am looking for more testing scenarios, as @Davoska mentioned:

Break the cluster in a different manner? Update to a version with an invalid release manifest?

dis016 · 2024-08-29T08:36:27Z

Hi @Davoska, after degrading the operator, the upgrade is not triggered. Please take a look when you have time.

DavidHurta · 2024-08-29T11:43:23Z

Oh, I thought that the verification was successful.

It is uncommon for the CVO to not trigger an update and not provide any information. I would expect the ReleaseAccepted condition to contain more information. I have tried to replicate your run. The upgrade is requested, and nothing happens for a few minutes.

$  oc adm upgrade --to-image=registry.build05.ci.openshift.org/ci-ln-r73233t/releasesha256:6005ad60e79b21be48536e8574123d9a5c1b698f79622722edf23aca45884732  --allow-explicit-upgrade --force --allow-upgrade-with-warnings 
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
warning: --allow-upgrade-with-warnings is bypassing: the cluster is experiencing an error reconciling "4.17.0-0.nightly-2024-08-18-131731":

  Reason: ClusterOperatorDegraded
  Message: Cluster operator authentication is degraded
Requested update to release image registry.build05.ci.openshift.org/ci-ln-r73233t/releasesha256:6005ad60e79b21be48536e8574123d9a5c1b698f79622722edf23aca45884732
$ oc adm upgrade
Failing=True:

  Reason: ClusterOperatorDegraded
  Message: Cluster operator authentication is degraded

Error while reconciling 4.17.0-0.nightly-2024-08-18-131731: the cluster operator authentication is degraded

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

Then finally (notice the ReleaseAccepted condition):

$ oc adm upgrade
Failing=True:

  Reason: ClusterOperatorDegraded
  Message: Cluster operator authentication is degraded

Error while reconciling 4.17.0-0.nightly-2024-08-18-131731: the cluster operator authentication is degraded

ReleaseAccepted=False

  Reason: RetrievePayload
  Message: Retrieving payload failed version="" image="registry.build05.ci.openshift.org/ci-ln-r73233t/releasesha256:6005ad60e79b21be48536e8574123d9a5c1b698f79622722edf23aca45884732" failure=Unable to download and prepare the update: deadline exceeded, reason: "DeadlineExceeded", message: "Job was active longer than specified deadline"

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

Is it possible that the release no longer existed in your run as well? Perhaps the DeadlineExceeded error just showed up a minute later. Let's catch up on Slack to speed up the review.

Updating to a freshly created build of this PR is successful:

$ oc adm upgrade
Failing=True:

  Reason: ClusterOperatorDegraded
  Message: Cluster operator authentication is degraded

Error while reconciling 4.17.0-0.nightly-2024-08-18-131731: the cluster operator authentication is degraded

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

$  oc adm upgrade --to-image "registry.build05.ci.openshift.org/ci-ln-xt7559b/release:latest"  --allow-explicit-upgrade --force --allow-upgrade-with-warnings 
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
warning: --allow-upgrade-with-warnings is bypassing: the cluster is experiencing an error reconciling "4.17.0-0.nightly-2024-08-18-131731":

  Reason: ClusterOperatorDegraded
  Message: Cluster operator authentication is degraded
Requested update to release image registry.build05.ci.openshift.org/ci-ln-xt7559b/release:latest
$ oc adm upgrade 
info: An upgrade is in progress. Working towards 4.17.0-0.ci.test-2024-08-29-112841-ci-ln-xt7559b-latest: 3 of 900 done (0% complete)

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

DavidHurta · 2024-08-29T12:08:12Z

Edit: This comment is wrong. It checks the version that does not contain the PR.

Is it possible that the release no longer existed in your run as well?

~~Or the CVO can't simply download the existing release. Same as me locally:~~

$ podman pull registry.ci.openshift.org/ocp/release@sha256:b8105494ce61dc1f5ba68f173c78adfb834ff70c66e7399b9ae401021517f27f
Trying to pull registry.ci.openshift.org/ocp/release@sha256:b8105494ce61dc1f5ba68f173c78adfb834ff70c66e7399b9ae401021517f27f...
Error: initializing source docker://registry.ci.openshift.org/ocp/release@sha256:b8105494ce61dc1f5ba68f173c78adfb834ff70c66e7399b9ae401021517f27f: unable to retrieve auth token: invalid username/password: authentication required

dis016 · 2024-08-29T14:29:42Z

Expected/New Failure: Reason: ClusterOperatorDegraded; Message: CO A is degraded
Install a 4.17 Cluster and degrade the CO authentication

# oc get clusterversion 
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.17.0-0.nightly-2024-08-29-051633   True        False         4m44s   Cluster version is 4.17.0-0.nightly-2024-08-29-051633
# cat oauth.yaml 
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: oidcidp 
    mappingMethod: claim 
    type: OpenID
    openID:
      clientID: test
      clientSecret: 
        name: test
      claims: 
        preferredUsername:
        - preferred_username
        name:
        - name
        email:
        - email
      issuer: https://www.idp-issuer.example.com
 # oc apply -f oauth.yaml 
Warning: resource oauths/cluster is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by oc apply. oc apply should only be used on resources created declaratively by either oc create --save-config or oc apply. The missing annotation will be patched automatically.
oauth.config.openshift.io/cluster configured
# oc get co authentication
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.17.0-0.nightly-2024-08-29-051633   True        False         True       9m9s    OAuthServerConfigObservationDegraded: failed to apply IDP oidcidp config: dial tcp: lookup www.idp-issuer.example.com on 172.30.0.10:53: no such host
# oc get clusterversion 
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.17.0-0.nightly-2024-08-29-051633   True        False         11m     Error while reconciling 4.17.0-0.nightly-2024-08-29-051633: the cluster operator authentication is degraded
# oc adm upgrade 
Failing=True:

  Reason: ClusterOperatorDegraded
  Message: Cluster operator authentication is degraded

Error while reconciling 4.17.0-0.nightly-2024-08-29-051633: the cluster operator authentication is degraded

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

Trigger an upgrade to a version that contains the PR changes

# oc adm upgrade --to-image=registry.build05.ci.openshift.org/ci-ln-r73233t/release@sha256:6005ad60e79b21be48536e8574123d9a5c1b698f79622722edf23aca45884732  --allow-explicit-upgrade --force --allow-upgrade-with-warnings
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
warning: --allow-upgrade-with-warnings is bypassing: the cluster is experiencing an error reconciling "4.17.0-0.nightly-2024-08-29-051633":

  Reason: ClusterOperatorDegraded
  Message: Cluster operator authentication is degraded
Requested update to release image registry.build05.ci.openshift.org/ci-ln-r73233t/release@sha256:6005ad60e79b21be48536e8574123d9a5c1b698f79622722edf23aca45884732

The upgrade is triggered, and after some time the CVO reports the new error.

# while true; do oc adm upgrade; oc get clusterversion version -o json | jq '.status.conditions[] | select (.type=="Failing")' ; sleep 60; done 
info: An upgrade is in progress. Working towards 4.17.0-0.test-2024-08-22-165326-ci-ln-r73233t-latest: 110 of 900 done (12% complete), waiting on etcd, kube-apiserver

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

{
  "lastTransitionTime": "2024-08-29T12:53:29Z",
  "status": "False",
  "type": "Failing"
}
...
...
Failing=True:

  Reason: ClusterOperatorDegraded
  Message: Cluster operator authentication is degraded

info: An upgrade is in progress. Unable to apply 4.17.0-0.test-2024-08-22-165326-ci-ln-r73233t-latest: an unknown error has occurred: MultipleErrors

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

{
  "lastTransitionTime": "2024-08-29T13:19:13Z",
  "message": "Cluster operator authentication is degraded",
  "reason": "ClusterOperatorDegraded",
  "status": "True",
  "type": "Failing"
}
...
...
info: An upgrade is in progress. Working towards 4.17.0-0.test-2024-08-22-165326-ci-ln-r73233t-latest: 704 of 900 done (78% complete), waiting up to 40 minutes on authentication

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

{
  "lastTransitionTime": "2024-08-29T13:24:43Z",
  "status": "False",
  "type": "Failing"
}

After the upgrade gets stuck at "waiting up to 40 minutes on authentication", un-degrade the authentication CO.

# cat oauth.yaml 
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec: {}
# oc apply -f oauth.yaml 
oauth.config.openshift.io/cluster configured
# oc get co authentication 
NAME             VERSION                                                AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.17.0-0.test-2024-08-22-165326-ci-ln-r73233t-latest   True        False         False      57m

Now the CVO error should disappear and the upgrade should resume.

# while true; do oc adm upgrade; oc get clusterversion version -o json | jq '.status.conditions[] | select (.type=="Failing")' ; sleep 60; done 
info: An upgrade is in progress. Working towards 4.17.0-0.test-2024-08-22-165326-ci-ln-r73233t-latest: 726 of 900 done (80% complete), waiting on dns, network

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

{
  "lastTransitionTime": "2024-08-29T13:37:13Z",
  "status": "False",
  "type": "Failing"
}
...
info: An upgrade is in progress. Working towards 4.17.0-0.test-2024-08-22-165326-ci-ln-r73233t-latest: 761 of 900 done (84% complete), waiting on machine-config

Upgradeable=False

  Reason: PoolUpdating
  Message: Cluster operator machine-config should not be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

{
  "lastTransitionTime": "2024-08-29T13:37:13Z",
  "status": "False",
  "type": "Failing"
}
Cluster version is 4.17.0-0.test-2024-08-22-165326-ci-ln-r73233t-latest

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

{
  "lastTransitionTime": "2024-08-29T13:37:13Z",
  "status": "False",
  "type": "Failing"
}

DavidHurta · 2024-08-29T20:21:51Z

To help with the verification, there is another method that combines a degraded CO and another issue.

I have a cluster that contains this PR using the Cluster Bot. I have also set the authentication CO to be degraded.

$ oc adm upgrade
Failing=True:

  Reason: ClusterOperatorDegraded
  Message: Cluster operator authentication is degraded

Error while reconciling 4.17.0-0.ci.test-2024-08-29-152220-ci-ln-bjz45bb-latest: the cluster operator authentication is degraded

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

My goal is to create another issue while upgrading the cluster. In the same run-level as the authentication operator is the openshift-samples operator. I have chosen this CO as my victim.

I have created a custom ValidatingAdmissionPolicy and a ValidatingAdmissionPolicyBinding. I want to prohibit the CVO from updating the openshift-samples operator deployment. This should raise an error by the CVO while upgrading.

The policy and its binding:

$ cat policy.yaml 
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: example
spec:
  matchConstraints:
    namespaceSelector: {}
    objectSelector: {}
    resourceRules:
      - operations:
          - CREATE
          - UPDATE
        apiGroups:
          - apps
        apiVersions:
          - v1
        resources:
          - deployments
        scope: '*'
    matchPolicy: Equivalent
  validations:
    - expression: object.spec.replicas < 0
  failurePolicy: Fail
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: example
spec:
  policyName: example
  matchResources:
    namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: openshift-cluster-samples-operator
    objectSelector: {}
    matchPolicy: Equivalent
  validationActions:
    - Deny

Apply the resources:

$ oc apply -f policy.yaml 
validatingadmissionpolicy.admissionregistration.k8s.io/example created
validatingadmissionpolicybinding.admissionregistration.k8s.io/example created

Request an upgrade to a release that contains this PR:

$  oc adm upgrade --to-image "registry.build05.ci.openshift.org/ci-ln-xt7559b/release:latest"  --allow-explicit-upgrade --force --allow-upgrade-with-warnings 
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
warning: --allow-upgrade-with-warnings is bypassing: the cluster is experiencing an error reconciling "4.17.0-0.ci.test-2024-08-29-152220-ci-ln-bjz45bb-latest":

  Reason: ClusterOperatorDegraded
  Message: Cluster operator authentication is degraded
Requested update to release image registry.build05.ci.openshift.org/ci-ln-xt7559b/release:latest

After a while, we get the MultipleErrors reason inside the Failing condition:

$ oc adm upgrade
Failing=True:

  Reason: MultipleErrors
  Message: Multiple errors are preventing progress:
  * Cluster operator authentication is degraded
  * Could not update deployment "openshift-cluster-samples-operator/cluster-samples-operator" (490 of 900): the object is invalid, possibly due to local cluster configuration

info: An upgrade is in progress. Unable to apply 4.17.0-0.ci.test-2024-08-29-112841-ci-ln-xt7559b-latest: an unknown error has occurred: MultipleErrors

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

We can check the CVO logs to be sure that the message was filtered as expected.

$ oc logs deploy/cluster-version-operator -n openshift-cluster-version | grep Filtered -A 6
I0829 17:41:57.325723       1 status.go:308] Filtered failure message changed from 'Multiple errors are preventing progress:
* Cluster operator authentication is degraded
* Cluster operators cloud-credential, cluster-autoscaler, console, csi-snapshot-controller, image-registry, ingress, insights, kube-storage-version-migrator, machine-approver, marketplace, monitoring, node-tuning, openshift-apiserver, openshift-controller-manager, operator-lifecycle-manager, service-ca, storage are updating versions
* Could not update deployment "openshift-cluster-samples-operator/cluster-samples-operator" (490 of 900): the object is invalid, possibly due to local cluster configuration' to 'Multiple errors are preventing progress:
* Cluster operator authentication is degraded
* Could not update deployment "openshift-cluster-samples-operator/cluster-samples-operator" (490 of 900): the object is invalid, possibly due to local cluster configuration'

As we can see, the "Cluster operators ... are updating versions" entry was successfully filtered out of the Failing message. The reason is unchanged, as there are still multiple errors.
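
The two outcomes seen in this thread (a single remaining error reports its own reason, several remaining errors keep MultipleErrors) can be sketched as below. This is a simplified illustration and not the CVO's actual code: the `issue` type, the `summarize` function, and the `ExampleUpdateFailure` reason are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// issue is a hypothetical reason/message pair left after filtering.
type issue struct {
	Reason  string
	Message string
}

// summarize picks the reason and message to show for the remaining issues:
// a single issue is reported as-is, several are combined under MultipleErrors.
func summarize(issues []issue) (reason, message string) {
	switch len(issues) {
	case 0:
		return "", ""
	case 1:
		return issues[0].Reason, issues[0].Message
	}
	lines := make([]string, 0, len(issues))
	for _, i := range issues {
		lines = append(lines, "* "+i.Message)
	}
	return "MultipleErrors", "Multiple errors are preventing progress:\n" + strings.Join(lines, "\n")
}

func main() {
	degraded := issue{Reason: "ClusterOperatorDegraded", Message: "Cluster operator authentication is degraded"}
	// "ExampleUpdateFailure" is a placeholder reason, not one taken from the CVO.
	invalid := issue{Reason: "ExampleUpdateFailure", Message: `Could not update deployment "openshift-cluster-samples-operator/cluster-samples-operator" (490 of 900): the object is invalid, possibly due to local cluster configuration`}

	reason, message := summarize([]issue{degraded})
	fmt.Printf("Reason: %s\nMessage: %s\n\n", reason, message)

	reason, message = summarize([]issue{degraded, invalid})
	fmt.Printf("Reason: %s\nMessage: %s\n", reason, message)
}
```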

petr-muller · 2024-08-30T09:34:45Z

This is AWESOME

dis016 · 2024-09-04T06:32:31Z

/label qe-approved

openshift-ci-robot · 2024-09-04T06:32:39Z

@Davoska: This pull request references Jira Issue OCPBUGS-15200, which is invalid:

expected the bug to target either version "4.18." or "openshift-4.18.", but it targets "4.17.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


petr-muller · 2024-09-04T10:07:04Z

/jira refresh

Fixed up the target version, we missed 4.17

openshift-ci-robot · 2024-09-04T10:07:08Z

@petr-muller: This pull request references Jira Issue OCPBUGS-15200, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.18.0) matches configured target version for branch (4.18.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @dis016


openshift-ci-robot · 2024-09-04T10:09:28Z

@Davoska: Jira Issue OCPBUGS-15200: All pull requests linked via external trackers have merged:

openshift/cluster-version-operator#1050

Jira Issue OCPBUGS-15200 has been moved to the MODIFIED state.


DavidHurta · 2024-09-04T10:53:01Z

🎉🎉🎉

dis016 · 2024-09-04T13:22:29Z

/cherry-pick release-4.17

openshift-cherrypick-robot · 2024-09-04T13:23:13Z

@dis016: new pull request created: #1082


openshift-bot · 2024-09-04T16:29:16Z

[ART PR BUILD NOTIFIER]

Distgit: cluster-version-operator
This PR has been included in build cluster-version-operator-container-v4.18.0-202409041514.p0.g5915d37.assembly.stream.el9.
All builds following this will include this PR.

jiajliu · 2024-09-05T00:49:56Z

Fixed up the target version, we missed 4.17

@dis016 fyi

DavidHurta · 2024-11-26T17:58:32Z

/cherry-pick release-4.17

openshift-cherrypick-robot · 2024-11-26T17:59:15Z

@DavidHurta: new pull request created: #1114


openshift-ci-robot added the jira/valid-reference label Jun 3, 2024

openshift-ci-robot added the jira/invalid-bug label Jun 3, 2024

openshift-ci-robot added the jira/valid-bug label and removed the jira/invalid-bug label Jun 3, 2024

openshift-ci Bot requested a review from jiajliu June 3, 2024 14:11

openshift-ci Bot added the do-not-merge/hold label Jun 3, 2024

openshift-ci Bot requested review from LalatenduMohanty and petr-muller June 3, 2024 14:13

openshift-ci Bot added the approved label Jun 3, 2024

openshift-ci Bot requested review from wking and removed request for LalatenduMohanty June 3, 2024 14:15

DavidHurta marked this pull request as draft June 4, 2024 13:25

openshift-ci Bot added the do-not-merge/work-in-progress label Jun 4, 2024

DavidHurta force-pushed the OCPBUGS-15200-filter-out-update-effect-none-errors branch 2 times, most recently from e11d635 to 8b4d632 June 11, 2024 19:09

openshift-ci Bot requested a review from dis016 June 11, 2024 19:11

DavidHurta changed the title from "OCPBUGS-15200: Filter out UpdateEffectNone errors from the summarized task graph error" to "OCPBUGS-15200: Filter out shallowly UpdateEffectNone errors from the Failing condition" Jun 11, 2024

wking reviewed Jun 11, 2024

Comment thread pkg/cvo/cvo_scenarios_test.go

wking reviewed Jun 11, 2024

Comment thread pkg/cvo/status.go (outdated)

wking reviewed Jun 11, 2024

Comment thread pkg/cvo/status.go (outdated)

petr-muller reviewed Jun 12, 2024

Comment thread pkg/cvo/status.go (outdated)

Comment thread pkg/cvo/status.go (outdated)

Comment thread pkg/cvo/status.go (outdated)

openshift-ci Bot removed the do-not-merge/hold label Aug 14, 2024

openshift-ci Bot added the qe-approved label Sep 4, 2024

openshift-ci-robot added the jira/invalid-bug label and removed the jira/valid-bug label Sep 4, 2024

openshift-ci-robot added the jira/valid-bug label Sep 4, 2024

openshift-ci-robot removed the jira/invalid-bug label Sep 4, 2024

openshift-merge-bot Bot merged commit 5915d37 into openshift:master Sep 4, 2024

openshift-cherrypick-robot mentioned this pull request Sep 4, 2024

[release-4.17] OCPBUGS-39558: Filter out shallowly UpdateEffectNone errors from a MultipleErrors message in the Failing condition #1082

Closed

openshift-cherrypick-robot mentioned this pull request Nov 26, 2024

[release-4.17] OCPBUGS-39558: Filter out shallowly UpdateEffectNone errors from a MultipleErrors message in the Failing condition #1114

Merged
