
OCPBUGS-15200: Filter out shallowly UpdateEffectNone errors from a MultipleErrors message in the Failing condition#1050

Merged
openshift-merge-bot[bot] merged 3 commits into openshift:master from
DavidHurta:OCPBUGS-15200-filter-out-update-effect-none-errors
Sep 4, 2024

Conversation

@DavidHurta
Contributor

@DavidHurta DavidHurta commented Jun 3, 2024

Various errors get propagated to users, such as the summarized task
graph error, for example in the form of the message in the Failing
condition. However, update errors set with the update effect of
UpdateEffectNone can confuse users, as these primarily informational
messages get displayed together with valid update errors that heavily
impact the update. This can result in a message such as:

{
  "lastTransitionTime": "2023-06-20T13:40:12Z",
  "message": "Multiple errors are preventing progress:\n* Cluster
  operator authentication is updating versions\n* Could not update
  customresourcedefinition \"alertingrules.monitoring.openshift.io\"
  (512 of 993): the object is invalid, possibly due to local cluster
  configuration",
  "reason": "MultipleErrors",
  "status": "True",
  "type": "Failing"
}

The Failing condition is not true because of the UpdateEffectNone
error ("Cluster operator authentication is updating versions"), but
its message still gets displayed.

This PR shallowly filters update errors that do not heavily affect
the update out of the Failing condition message.
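
The filtering described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the UpdateError and UpdateEffectType types are simplified stand-ins, and filterShallow, summarize, and the ClusterOperatorUpdating reason are hypothetical names chosen for the example. Only top-level errors are inspected, which is what "shallowly" is taken to mean here.

```go
package main

import (
	"fmt"
	"strings"
)

// UpdateEffectType is a simplified, hypothetical stand-in for the CVO's update effect.
type UpdateEffectType string

const (
	UpdateEffectNone UpdateEffectType = "None"
	UpdateEffectFail UpdateEffectType = "Fail"
)

// UpdateError is a simplified, hypothetical stand-in for a CVO update error.
type UpdateError struct {
	Reason       string
	Message      string
	UpdateEffect UpdateEffectType
}

// filterShallow returns a new slice without the top-level errors whose effect
// is UpdateEffectNone. It does not look into nested errors and does not
// modify its input.
func filterShallow(errs []UpdateError) []UpdateError {
	var kept []UpdateError
	for _, e := range errs {
		if e.UpdateEffect != UpdateEffectNone {
			kept = append(kept, e)
		}
	}
	return kept
}

// summarize builds a reason and message shaped like the examples in this thread.
func summarize(errs []UpdateError) (reason, message string) {
	switch len(errs) {
	case 0:
		return "", ""
	case 1:
		return errs[0].Reason, errs[0].Message
	}
	lines := make([]string, 0, len(errs))
	for _, e := range errs {
		lines = append(lines, "* "+e.Message)
	}
	return "MultipleErrors", "Multiple errors are preventing progress:\n" + strings.Join(lines, "\n")
}

func main() {
	errs := []UpdateError{
		{Reason: "ClusterOperatorDegraded", Message: "Cluster operator authentication is degraded", UpdateEffect: UpdateEffectFail},
		{Reason: "ClusterOperatorUpdating", Message: "Cluster operators console, monitoring are updating versions", UpdateEffect: UpdateEffectNone},
	}
	reason, message := summarize(filterShallow(errs))
	fmt.Println(reason)
	fmt.Println(message)
}
```

With the sample input, only the degraded error survives the filter, so the summary collapses from MultipleErrors to the single error's own reason and message. The unfiltered slice stays untouched and could still be logged in full.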

This pull request references https://issues.redhat.com/browse/OCPBUGS-15200

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 3, 2024
@openshift-ci-robot
Contributor

@Davoska: This pull request references Jira Issue OCPBUGS-15200, which is invalid:

  • expected the bug to target the "4.17.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

The summarized task graph error gets propagated to users. For example, in the form of the message in the Failing condition. However, update errors set with the update effect of UpdateEffectNone can confuse users as these informing messages get displayed together with valid update errors impacting the update. This can result in a message such as:

{
 "lastTransitionTime": "2023-06-20T13:40:12Z",
 "message": "Multiple errors are preventing progress:\n* Cluster
 operator authentication is updating versions\n* Could not update
 customresourcedefinition \"alertingrules.monitoring.openshift.io\"
 (512 of 993): the object is invalid, possibly due to local cluster
 configuration",
 "reason": "MultipleErrors",
 "status": "True",
 "type": "Failing"
}

The Failing condition is not true because of the UpdateEffectNone error ("Cluster operator authentication is updating versions"), but its message still gets displayed.

This commit makes sure that update errors that do not have an effect on the update will not get propagated further, improving the user experience. However, they will still be shown in the logs to help with more precise debugging.

This pull request references https://issues.redhat.com/browse/OCPBUGS-15200

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jun 3, 2024
@DavidHurta
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 3, 2024
@openshift-ci-robot
Contributor

@Davoska: This pull request references Jira Issue OCPBUGS-15200, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.17.0) matches configured target version for branch (4.17.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jiajliu

Details

In response to this:

/jira refresh


@openshift-ci openshift-ci Bot requested a review from jiajliu June 3, 2024 14:11
@DavidHurta
Contributor Author

DavidHurta commented Jun 3, 2024

I would like to test this on a live cluster (edit: and fix the failing CI). Thus, I am putting this PR on hold for the time being.

/hold

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 3, 2024
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 3, 2024
@DavidHurta
Contributor Author

/uncc LalatenduMohanty
/cc @wking

@openshift-ci openshift-ci Bot requested review from wking and removed request for LalatenduMohanty June 3, 2024 14:15
@petr-muller
Member

petr-muller commented Jun 3, 2024

Approach SGTM 👍

@petr-muller
Member

I have not looked at the code closely yet, but one piece to check for possible interaction is #1041, which renders all reconciliation problems (including the UpdateEffectNone ones) for external consumption, as a pseudo-API.

We'd like to keep UpdateEffectNone errors there, if possible. I think filtering them out on the producer side would hide them from ReconciliationIssues?

@DavidHurta
Contributor Author

/hold

I am re-working the PR.

@DavidHurta DavidHurta marked this pull request as draft June 4, 2024 13:25
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 4, 2024
@DavidHurta DavidHurta force-pushed the OCPBUGS-15200-filter-out-update-effect-none-errors branch 2 times, most recently from e11d635 to 8b4d632 Compare June 11, 2024 19:09
@openshift-ci-robot
Contributor

@Davoska: This pull request references Jira Issue OCPBUGS-15200, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.17.0) matches configured target version for branch (4.17.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @dis016

Details

In response to this:

TBD

This pull request references https://issues.redhat.com/browse/OCPBUGS-15200


@openshift-ci openshift-ci Bot requested a review from dis016 June 11, 2024 19:11
@DavidHurta DavidHurta changed the title from "OCPBUGS-15200: Filter out UpdateEffectNone errors from the summarized task graph error" to "OCPBUGS-15200: Filter out shallowly UpdateEffectNone errors from the Failing condition" Jun 11, 2024
Comment thread pkg/cvo/cvo_scenarios_test.go
Comment thread pkg/cvo/status.go Outdated
Comment thread pkg/cvo/status.go Outdated
Member

@petr-muller petr-muller left a comment

approach lgtm, some code readability nits + what Trevor says ;)

Comment thread pkg/cvo/status.go Outdated
Comment thread pkg/cvo/status.go Outdated
Comment thread pkg/cvo/status.go Outdated
@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 14, 2024
@dis016

dis016 commented Aug 23, 2024

Test Scenario: Degrade a CO (authentication).

Original Failure: Reason: MultipleErrors; Message: Multiple issues: CO A is degraded, CO B is updating versions

Install a 4.17 cluster and degrade the Cluster operator authentication.

NAME      VERSION                                                AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.17.0-0.test-2024-08-22-131833-ci-ln-ntyc4tb-latest   True        False         99m     Cluster version is 4.17.0-0.test-2024-08-22-131833-ci-ln-ntyc4tb-latest
%
% cat oauth.yaml 
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: oidcidp 
    mappingMethod: claim 
    type: OpenID
    openID:
      clientID: test
      clientSecret: 
        name: test
      claims: 
        preferredUsername:
        - preferred_username
        name:
        - name
        email:
        - email
      issuer: https://www.idp-issuer.example.com 
 % oc apply -f oauth.yaml 
Warning: resource oauths/cluster is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by oc apply. oc apply should only be used on resources created declaratively by either oc create --save-config or oc apply. The missing annotation will be patched automatically.
oauth.config.openshift.io/cluster configured
 %  oc get co authentication
NAME             VERSION                                                AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.17.0-0.test-2024-08-22-131833-ci-ln-ntyc4tb-latest   True        False         True       103m    OAuthServerConfigObservationDegraded: failed to apply IDP oidcidp config: dial tcp: lookup www.idp-issuer.example.com on 172.30.0.10:53: no such host
% 

Trigger an upgrade to a version that doesn't contain the PR changes

% oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release@sha256:b8105494ce61dc1f5ba68f173c78adfb834ff70c66e7399b9ae401021517f27f  --allow-explicit-upgrade --force
warning: The requested upgrade image is not one of the available updates.  You have used --allow-explicit-upgrade to the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.ci.openshift.org/ocp/release@sha256:b8105494ce61dc1f5ba68f173c78adfb834ff70c66e7399b9ae401021517f27f
% 
% oc adm upgrade 
Error while reconciling 4.17.0-0.test-2024-08-22-131833-ci-ln-ntyc4tb-latest: the cluster operator authentication is degraded

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

%  oc adm upgrade status 
info: An upgrade is in progress. Working towards 4.17.0-0.nightly-2024-08-19-165854: 6 of 900 done (0% complete)

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

dinesh@Dineshs-MacBook-Pro Downloads % 

The upgrade proceeds despite the error, and the CVO reports the error:


% while true; do oc adm upgrade; oc get clusterversion version -o json | jq '.status.conditions[] | select (.type=="Failing")' ; sleep 60; done  

info: An upgrade is in progress. Working towards 4.17.0-0.nightly-2024-08-19-165854: 110 of 900 done (12% complete), waiting on etcd, kube-apiserver

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

{
  "lastTransitionTime": "2024-08-22T15:41:42Z",
  "status": "False",
  "type": "Failing"
}
...
...
info: An upgrade is in progress. Unable to apply 4.17.0-0.nightly-2024-08-19-165854: an unknown error has occurred: MultipleErrors

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

{
  "lastTransitionTime": "2024-08-22T16:06:12Z",
  "message": "Multiple errors are preventing progress:\n* Cluster operator authentication is degraded\n* Cluster operators cluster-autoscaler, console, marketplace, monitoring, node-tuning, openshift-apiserver, openshift-controller-manager are updating versions",
  "reason": "MultipleErrors",
  "status": "True",
  "type": "Failing"
}
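
The jq filter in the polling loop above selects the Failing condition from the ClusterVersion status. A minimal Go sketch of the same selection follows, assuming only the type, status, reason, and message fields visible in this thread. The struct and function names are hypothetical and are not taken from the CVO or oc code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// condition holds the condition fields that appear in the JSON output in this thread.
type condition struct {
	Type    string `json:"type"`
	Status  string `json:"status"`
	Reason  string `json:"reason,omitempty"`
	Message string `json:"message,omitempty"`
}

// clusterVersion models only the part of the object that the jq filter reads.
type clusterVersion struct {
	Status struct {
		Conditions []condition `json:"conditions"`
	} `json:"status"`
}

// findCondition returns the first condition of the given type, if any.
func findCondition(raw []byte, condType string) (condition, bool, error) {
	var cv clusterVersion
	if err := json.Unmarshal(raw, &cv); err != nil {
		return condition{}, false, err
	}
	for _, c := range cv.Status.Conditions {
		if c.Type == condType {
			return c, true, nil
		}
	}
	return condition{}, false, nil
}

func main() {
	// Sample input shaped like `oc get clusterversion version -o json`, trimmed to the relevant fields.
	raw := []byte(`{"status":{"conditions":[
		{"type":"Available","status":"True"},
		{"type":"Failing","status":"True","reason":"ClusterOperatorDegraded","message":"Cluster operator authentication is degraded"}
	]}}`)
	c, ok, err := findCondition(raw, "Failing")
	if err != nil {
		panic(err)
	}
	if ok {
		fmt.Printf("%s=%s reason=%s message=%q\n", c.Type, c.Status, c.Reason, c.Message)
	}
}
```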

Expected/New Failure: Reason: ClusterOperatorDegraded; Message: CO A is degraded
Install a 4.17 Cluster and degrade the CO authentication

% oc get clusterversion 
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.17.0-0.nightly-2024-08-18-131731   True        False         22m     Cluster version is 4.17.0-0.nightly-2024-08-18-131731
%
% cat oauth.yaml 
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: oidcidp 
    mappingMethod: claim 
    type: OpenID
    openID:
      clientID: test
      clientSecret: 
        name: test
      claims: 
        preferredUsername:
        - preferred_username
        name:
        - name
        email:
        - email
      issuer: https://www.idp-issuer.example.com 
% oc apply -f oauth.yaml 
Warning: resource oauths/cluster is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by oc apply. oc apply should only be used on resources created declaratively by either oc create --save-config or oc apply. The missing annotation will be patched automatically.
oauth.config.openshift.io/cluster configured
% 
% oc get co authentication 
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.17.0-0.nightly-2024-08-18-131731   True        False         True       28m     OAuthServerConfigObservationDegraded: failed to apply IDP oidcidp config: dial tcp: lookup www.idp-issuer.example.com on 172.30.0.10:53: no such host
%

Trigger an upgrade to a version that contains the PR changes

% oc adm upgrade --to-image=registry.build05.ci.openshift.org/ci-ln-r73233t/releasesha256:6005ad60e79b21be48536e8574123d9a5c1b698f79622722edf23aca45884732  --allow-explicit-upgrade --force
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates.  You have used --allow-explicit-upgrade to the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.build05.ci.openshift.org/ci-ln-r73233t/releasesha256:6005ad60e79b21be48536e8574123d9a5c1b698f79622722edf23aca45884732
% 
% oc adm upgrade status
Error while reconciling 4.17.0-0.nightly-2024-08-18-131731: the cluster operator authentication is degraded

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

%

The upgrade is not triggered and the CVO reports an error. The upgrade did not trigger because of a typo in the oc adm upgrade command above: the @ before sha256 is missing from the pull spec.

% while true; do oc adm upgrade; oc get clusterversion version -o json | jq '.status.conditions[] | select (.type=="Failing")' ; sleep 60; done  

Error while reconciling 4.17.0-0.nightly-2024-08-18-131731: the cluster operator authentication is degraded

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

{
  "lastTransitionTime": "2024-08-23T07:12:40Z",
  "message": "Cluster operator authentication is degraded",
  "reason": "ClusterOperatorDegraded",
  "status": "True",
  "type": "Failing"
}
Error while reconciling 4.17.0-0.nightly-2024-08-18-131731: the cluster operator authentication is degraded

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

{
  "lastTransitionTime": "2024-08-23T07:12:40Z",
  "message": "Cluster operator authentication is degraded",
  "reason": "ClusterOperatorDegraded",
  "status": "True",
  "type": "Failing"
}
 % 
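
The pull spec passed to oc adm upgrade in this run has no @ before sha256, which is consistent with oc printing the by-tag warning. A small pre-flight check could catch this before an update is requested. The sketch below is hypothetical: isByDigest is not an oc or CVO function, and the pattern only accepts sha256 digests.

```go
package main

import (
	"fmt"
	"regexp"
)

// byDigest matches pull specs of the form <repository>@sha256:<64 lowercase hex characters>.
var byDigest = regexp.MustCompile(`^[^@\s]+@sha256:[0-9a-f]{64}$`)

// isByDigest reports whether pullSpec pins the image by a sha256 digest.
func isByDigest(pullSpec string) bool {
	return byDigest.MatchString(pullSpec)
}

func main() {
	repo := "registry.build05.ci.openshift.org/ci-ln-r73233t/release"
	digest := "6005ad60e79b21be48536e8574123d9a5c1b698f79622722edf23aca45884732"

	fmt.Println(isByDigest(repo + "@sha256:" + digest)) // well-formed by-digest pull spec
	fmt.Println(isByDigest(repo + "sha256:" + digest))  // the "@" is missing, as in the command above
}
```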

@petr-muller
Member

@dis016 this looks good, right? if so (and unless you plan more testing) can you please drop a /label qe-approved here?

@dis016

dis016 commented Aug 27, 2024

@petr-muller I am looking for more testing scenarios, as @Davoska mentioned:

Break the cluster in a different manner? Update to a version with an invalid release manifest?

@dis016

dis016 commented Aug 29, 2024

Hi @Davoska, after degrading the operator, the upgrade is not triggered. Please check when you have time.

@DavidHurta
Contributor Author

DavidHurta commented Aug 29, 2024

Oh, I thought that the verification was successful.

It is uncommon for the CVO to not trigger an update and not provide any information. I would expect the ReleaseAccepted condition to contain more information. I have tried to replicate your run. The upgrade is requested, and nothing happens for a few minutes.

$  oc adm upgrade --to-image=registry.build05.ci.openshift.org/ci-ln-r73233t/releasesha256:6005ad60e79b21be48536e8574123d9a5c1b698f79622722edf23aca45884732  --allow-explicit-upgrade --force --allow-upgrade-with-warnings 
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
warning: --allow-upgrade-with-warnings is bypassing: the cluster is experiencing an error reconciling "4.17.0-0.nightly-2024-08-18-131731":

  Reason: ClusterOperatorDegraded
  Message: Cluster operator authentication is degraded
Requested update to release image registry.build05.ci.openshift.org/ci-ln-r73233t/releasesha256:6005ad60e79b21be48536e8574123d9a5c1b698f79622722edf23aca45884732
$ oc adm upgrade
Failing=True:

  Reason: ClusterOperatorDegraded
  Message: Cluster operator authentication is degraded

Error while reconciling 4.17.0-0.nightly-2024-08-18-131731: the cluster operator authentication is degraded

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

Then finally (notice the ReleaseAccepted condition):

$ oc adm upgrade
Failing=True:

  Reason: ClusterOperatorDegraded
  Message: Cluster operator authentication is degraded

Error while reconciling 4.17.0-0.nightly-2024-08-18-131731: the cluster operator authentication is degraded

ReleaseAccepted=False

  Reason: RetrievePayload
  Message: Retrieving payload failed version="" image="registry.build05.ci.openshift.org/ci-ln-r73233t/releasesha256:6005ad60e79b21be48536e8574123d9a5c1b698f79622722edf23aca45884732" failure=Unable to download and prepare the update: deadline exceeded, reason: "DeadlineExceeded", message: "Job was active longer than specified deadline"

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

Is it possible that the release no longer existed in your run as well? Maybe the DeadlineExceeded error showed up a minute later? Let's catch up on Slack to speed up the review.

Updating to a freshly created build of this PR is successful:

$ oc adm upgrade
Failing=True:

  Reason: ClusterOperatorDegraded
  Message: Cluster operator authentication is degraded

Error while reconciling 4.17.0-0.nightly-2024-08-18-131731: the cluster operator authentication is degraded

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

$  oc adm upgrade --to-image "registry.build05.ci.openshift.org/ci-ln-xt7559b/release:latest"  --allow-explicit-upgrade --force --allow-upgrade-with-warnings 
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
warning: --allow-upgrade-with-warnings is bypassing: the cluster is experiencing an error reconciling "4.17.0-0.nightly-2024-08-18-131731":

  Reason: ClusterOperatorDegraded
  Message: Cluster operator authentication is degraded
Requested update to release image registry.build05.ci.openshift.org/ci-ln-xt7559b/release:latest
$ oc adm upgrade 
info: An upgrade is in progress. Working towards 4.17.0-0.ci.test-2024-08-29-112841-ci-ln-xt7559b-latest: 3 of 900 done (0% complete)

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

@DavidHurta
Contributor Author

DavidHurta commented Aug 29, 2024

Edit: This comment is wrong. It checks the version that does not contain the PR.

Is it possible that the release no longer existed in your run as well?

Or the CVO simply can't download the existing release, same as me locally:

$ podman pull registry.ci.openshift.org/ocp/release@sha256:b8105494ce61dc1f5ba68f173c78adfb834ff70c66e7399b9ae401021517f27f
Trying to pull registry.ci.openshift.org/ocp/release@sha256:b8105494ce61dc1f5ba68f173c78adfb834ff70c66e7399b9ae401021517f27f...
Error: initializing source docker://registry.ci.openshift.org/ocp/release@sha256:b8105494ce61dc1f5ba68f173c78adfb834ff70c66e7399b9ae401021517f27f: unable to retrieve auth token: invalid username/password: authentication required

@dis016

dis016 commented Aug 29, 2024

Expected/New Failure: Reason: ClusterOperatorDegraded; Message: CO A is degraded
Install a 4.17 Cluster and degrade the CO authentication

# oc get clusterversion 
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.17.0-0.nightly-2024-08-29-051633   True        False         4m44s   Cluster version is 4.17.0-0.nightly-2024-08-29-051633
# cat oauth.yaml 
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: oidcidp 
    mappingMethod: claim 
    type: OpenID
    openID:
      clientID: test
      clientSecret: 
        name: test
      claims: 
        preferredUsername:
        - preferred_username
        name:
        - name
        email:
        - email
      issuer: https://www.idp-issuer.example.com
 # oc apply -f oauth.yaml 
Warning: resource oauths/cluster is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by oc apply. oc apply should only be used on resources created declaratively by either oc create --save-config or oc apply. The missing annotation will be patched automatically.
oauth.config.openshift.io/cluster configured
# oc get co authentication
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.17.0-0.nightly-2024-08-29-051633   True        False         True       9m9s    OAuthServerConfigObservationDegraded: failed to apply IDP oidcidp config: dial tcp: lookup www.idp-issuer.example.com on 172.30.0.10:53: no such host
# oc get clusterversion 
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.17.0-0.nightly-2024-08-29-051633   True        False         11m     Error while reconciling 4.17.0-0.nightly-2024-08-29-051633: the cluster operator authentication is degraded
# oc adm upgrade 
Failing=True:

  Reason: ClusterOperatorDegraded
  Message: Cluster operator authentication is degraded

Error while reconciling 4.17.0-0.nightly-2024-08-29-051633: the cluster operator authentication is degraded

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.     

Trigger an upgrade to a version that contains the PR changes

# oc adm upgrade --to-image=registry.build05.ci.openshift.org/ci-ln-r73233t/release@sha256:6005ad60e79b21be48536e8574123d9a5c1b698f79622722edf23aca45884732  --allow-explicit-upgrade --force --allow-upgrade-with-warnings
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
warning: --allow-upgrade-with-warnings is bypassing: the cluster is experiencing an error reconciling "4.17.0-0.nightly-2024-08-29-051633":

  Reason: ClusterOperatorDegraded
  Message: Cluster operator authentication is degraded
Requested update to release image registry.build05.ci.openshift.org/ci-ln-r73233t/release@sha256:6005ad60e79b21be48536e8574123d9a5c1b698f79622722edf23aca45884732

The upgrade is triggered, and the CVO reports the new error after some time.

# while true; do oc adm upgrade; oc get clusterversion version -o json | jq '.status.conditions[] | select (.type=="Failing")' ; sleep 60; done 
info: An upgrade is in progress. Working towards 4.17.0-0.test-2024-08-22-165326-ci-ln-r73233t-latest: 110 of 900 done (12% complete), waiting on etcd, kube-apiserver

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

{
  "lastTransitionTime": "2024-08-29T12:53:29Z",
  "status": "False",
  "type": "Failing"
}
...
...
Failing=True:

  Reason: ClusterOperatorDegraded
  Message: Cluster operator authentication is degraded

info: An upgrade is in progress. Unable to apply 4.17.0-0.test-2024-08-22-165326-ci-ln-r73233t-latest: an unknown error has occurred: MultipleErrors

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

{
  "lastTransitionTime": "2024-08-29T13:19:13Z",
  "message": "Cluster operator authentication is degraded",
  "reason": "ClusterOperatorDegraded",
  "status": "True",
  "type": "Failing"
}
...
...
info: An upgrade is in progress. Working towards 4.17.0-0.test-2024-08-22-165326-ci-ln-r73233t-latest: 704 of 900 done (78% complete), waiting up to 40 minutes on authentication

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

{
  "lastTransitionTime": "2024-08-29T13:24:43Z",
  "status": "False",
  "type": "Failing"
}

After the upgrade gets stuck at "waiting up to 40 minutes on authentication", un-degrade the CO authentication.

# cat oauth.yaml 
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec: {}
# oc apply -f oauth.yaml 
oauth.config.openshift.io/cluster configured
# oc get co authentication 
NAME             VERSION                                                AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.17.0-0.test-2024-08-22-165326-ci-ln-r73233t-latest   True        False         False      57m 

Now the CVO error should disappear and the upgrade should resume.

# while true; do oc adm upgrade; oc get clusterversion version -o json | jq '.status.conditions[] | select (.type=="Failing")' ; sleep 60; done 
info: An upgrade is in progress. Working towards 4.17.0-0.test-2024-08-22-165326-ci-ln-r73233t-latest: 726 of 900 done (80% complete), waiting on dns, network

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

{
  "lastTransitionTime": "2024-08-29T13:37:13Z",
  "status": "False",
  "type": "Failing"
}
...
info: An upgrade is in progress. Working towards 4.17.0-0.test-2024-08-22-165326-ci-ln-r73233t-latest: 761 of 900 done (84% complete), waiting on machine-config

Upgradeable=False

  Reason: PoolUpdating
  Message: Cluster operator machine-config should not be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

{
  "lastTransitionTime": "2024-08-29T13:37:13Z",
  "status": "False",
  "type": "Failing"
}
Cluster version is 4.17.0-0.test-2024-08-22-165326-ci-ln-r73233t-latest

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

{
  "lastTransitionTime": "2024-08-29T13:37:13Z",
  "status": "False",
  "type": "Failing"
}

@DavidHurta
Contributor Author

DavidHurta commented Aug 29, 2024

To help with the verification, there is another method that combines a degraded CO and another issue.

Using the Cluster Bot, I have launched a cluster that contains this PR. I have also set the authentication CO to be degraded.

$ oc adm upgrade
Failing=True:

  Reason: ClusterOperatorDegraded
  Message: Cluster operator authentication is degraded

Error while reconciling 4.17.0-0.ci.test-2024-08-29-152220-ci-ln-bjz45bb-latest: the cluster operator authentication is degraded

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

My goal is to create another issue while upgrading the cluster. In the same run-level as the authentication operator is the openshift-samples operator. I have chosen this CO as my victim.

I have created a custom ValidatingAdmissionPolicy and a ValidatingAdmissionPolicyBinding. I want to prohibit the CVO from updating the openshift-samples operator deployment. This should cause the CVO to raise an error while upgrading.

The policy and its binding:

$ cat policy.yaml 
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: example
spec:
  matchConstraints:
    namespaceSelector: {}
    objectSelector: {}
    resourceRules:
      - operations:
          - CREATE
          - UPDATE
        apiGroups:
          - apps
        apiVersions:
          - v1
        resources:
          - deployments
        scope: '*'
    matchPolicy: Equivalent
  validations:
    - expression: object.spec.replicas < 0
  failurePolicy: Fail
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: example
spec:
  policyName: example
  matchResources:
    namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: openshift-cluster-samples-operator
    objectSelector: {}
    matchPolicy: Equivalent
  validationActions:
    - Deny
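
In a ValidatingAdmissionPolicy, a request is admitted only when every validation expression evaluates to true. A Deployment's replica count is never negative, so object.spec.replicas < 0 is always false, and with the Deny action every matching CREATE or UPDATE in the bound namespace is rejected. A toy Go model of that logic follows. The function names are hypothetical, and the code mirrors the expression by hand rather than evaluating CEL.

```go
package main

import "fmt"

// validation mirrors the policy's single expression, object.spec.replicas < 0.
func validation(replicas int32) bool {
	return replicas < 0
}

// admit models the Deny action: the request passes only if the validation holds.
func admit(replicas int32) bool {
	return validation(replicas)
}

func main() {
	// Any valid replica count fails the validation, so each request is denied.
	for _, r := range []int32{0, 1, 3} {
		fmt.Printf("replicas=%d admitted=%t\n", r, admit(r))
	}
}
```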

Apply the resources:

$ oc apply -f policy.yaml 
validatingadmissionpolicy.admissionregistration.k8s.io/example created
validatingadmissionpolicybinding.admissionregistration.k8s.io/example created

Request an upgrade to a release that contains this PR:

$  oc adm upgrade --to-image "registry.build05.ci.openshift.org/ci-ln-xt7559b/release:latest"  --allow-explicit-upgrade --force --allow-upgrade-with-warnings 
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
warning: --allow-upgrade-with-warnings is bypassing: the cluster is experiencing an error reconciling "4.17.0-0.ci.test-2024-08-29-152220-ci-ln-bjz45bb-latest":

  Reason: ClusterOperatorDegraded
  Message: Cluster operator authentication is degraded
Requested update to release image registry.build05.ci.openshift.org/ci-ln-xt7559b/release:latest

After a while, we get the MultipleErrors reason inside the Failing condition:

$ oc adm upgrade
Failing=True:

  Reason: MultipleErrors
  Message: Multiple errors are preventing progress:
  * Cluster operator authentication is degraded
  * Could not update deployment "openshift-cluster-samples-operator/cluster-samples-operator" (490 of 900): the object is invalid, possibly due to local cluster configuration

info: An upgrade is in progress. Unable to apply 4.17.0-0.ci.test-2024-08-29-112841-ci-ln-xt7559b-latest: an unknown error has occurred: MultipleErrors

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

We can check the CVO logs to be sure that the message was filtered as expected.

$ oc logs deploy/cluster-version-operator -n openshift-cluster-version | grep Filtered -A 6
I0829 17:41:57.325723       1 status.go:308] Filtered failure message changed from 'Multiple errors are preventing progress:
* Cluster operator authentication is degraded
* Cluster operators cloud-credential, cluster-autoscaler, console, csi-snapshot-controller, image-registry, ingress, insights, kube-storage-version-migrator, machine-approver, marketplace, monitoring, node-tuning, openshift-apiserver, openshift-controller-manager, operator-lifecycle-manager, service-ca, storage are updating versions
* Could not update deployment "openshift-cluster-samples-operator/cluster-samples-operator" (490 of 900): the object is invalid, possibly due to local cluster configuration' to 'Multiple errors are preventing progress:
* Cluster operator authentication is degraded
* Could not update deployment "openshift-cluster-samples-operator/cluster-samples-operator" (490 of 900): the object is invalid, possibly due to local cluster configuration'

As we can see, the "Cluster operators ... are updating versions" entry was filtered out of the Failing message. The reason is unchanged, as there are still multiple errors.
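
The filtering seen in the log can be sketched roughly as follows. This is only an illustration of the idea, not the code from this PR: the effect and updateError types and the failingMessage function are hypothetical stand-ins, the effect assigned to each error is an assumption, and the operator list in the example is abbreviated. Returning an empty string when nothing remains is also an assumption of the sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// effect and updateError are hypothetical stand-ins for this sketch,
// not the CVO's actual types.
type effect string

const (
	effectNone effect = "None"
	effectFail effect = "Fail"
)

type updateError struct {
	Effect  effect
	Message string
}

// failingMessage drops errors whose effect is effectNone and formats
// the remaining ones in the style of the Failing condition message.
func failingMessage(errs []updateError) string {
	var kept []string
	for _, e := range errs {
		if e.Effect != effectNone {
			kept = append(kept, e.Message)
		}
	}
	switch len(kept) {
	case 0:
		return ""
	case 1:
		return kept[0]
	default:
		return "Multiple errors are preventing progress:\n* " + strings.Join(kept, "\n* ")
	}
}

func main() {
	errs := []updateError{
		{Effect: effectFail, Message: "Cluster operator authentication is degraded"},
		// Abbreviated operator list, for illustration only.
		{Effect: effectNone, Message: "Cluster operators console, ingress are updating versions"},
		{Effect: effectFail, Message: `Could not update deployment "openshift-cluster-samples-operator/cluster-samples-operator" (490 of 900): the object is invalid, possibly due to local cluster configuration`},
	}
	fmt.Println(failingMessage(errs))
}
```

With two errors left after filtering, the sketch still produces a "Multiple errors" message, which matches the observation above that the reason stays MultipleErrors.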

@petr-muller
Member

$ oc logs deploy/cluster-version-operator -n openshift-cluster-version | grep Filtered -A 6
I0829 17:41:57.325723       1 status.go:308] Filtered failure message changed from 'Multiple errors are preventing progress:
* Cluster operator authentication is degraded
* Cluster operators cloud-credential, cluster-autoscaler, console, csi-snapshot-controller, image-registry, ingress, insights, kube-storage-version-migrator, machine-approver, marketplace, monitoring, node-tuning, openshift-apiserver, openshift-controller-manager, operator-lifecycle-manager, service-ca, storage are updating versions
* Could not update deployment "openshift-cluster-samples-operator/cluster-samples-operator" (490 of 900): the object is invalid, possibly due to local cluster configuration' to 'Multiple errors are preventing progress:
* Cluster operator authentication is degraded
* Could not update deployment "openshift-cluster-samples-operator/cluster-samples-operator" (490 of 900): the object is invalid, possibly due to local cluster configuration'

This is AWESOME

@dis016

dis016 commented Sep 4, 2024

/label qe-approved

@openshift-ci openshift-ci Bot added the qe-approved Signifies that QE has signed off on this PR label Sep 4, 2024
@openshift-ci-robot openshift-ci-robot added jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. and removed jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Sep 4, 2024
@openshift-ci-robot
Contributor

@Davoska: This pull request references Jira Issue OCPBUGS-15200, which is invalid:

  • expected the bug to target either version "4.18." or "openshift-4.18.", but it targets "4.17.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

@petr-muller
Member

/jira refresh

Fixed up the target version, we missed 4.17

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Sep 4, 2024
@openshift-ci-robot
Contributor

@petr-muller: This pull request references Jira Issue OCPBUGS-15200, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.18.0) matches configured target version for branch (4.18.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @dis016

@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Sep 4, 2024
@openshift-merge-bot openshift-merge-bot Bot merged commit 5915d37 into openshift:master Sep 4, 2024
@openshift-ci-robot
Contributor

@Davoska: Jira Issue OCPBUGS-15200: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-15200 has been moved to the MODIFIED state.

@DavidHurta
Contributor Author

🎉🎉🎉

@dis016

dis016 commented Sep 4, 2024

/cherry-pick release-4.17

@openshift-cherrypick-robot

@dis016: new pull request created: #1082

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: cluster-version-operator
This PR has been included in build cluster-version-operator-container-v4.18.0-202409041514.p0.g5915d37.assembly.stream.el9.
All builds following this will include this PR.

@jiajliu
Contributor

jiajliu commented Sep 5, 2024

Fixed up the target version, we missed 4.17

@dis016 fyi

@DavidHurta
Contributor Author

/cherry-pick release-4.17

@openshift-cherrypick-robot

@DavidHurta: new pull request created: #1114

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. qe-approved Signifies that QE has signed off on this PR
