Skip to content

Revert "OCPBUGS#9448: Add a note regarding checking clusterversion"#65812

Merged
michaelryanpeter merged 1 commit intoopenshift:mainfrom
wking:do-not-ignore-update-errors-or-get-clusterversion
Nov 1, 2023
Merged

Revert "OCPBUGS#9448: Add a note regarding checking clusterversion"#65812
michaelryanpeter merged 1 commit intoopenshift:mainfrom
wking:do-not-ignore-update-errors-or-get-clusterversion

Conversation

@wking
Copy link
Copy Markdown
Member

@wking wking commented Oct 5, 2023

This reverts commit fb0d246, #57136.

The commit message did not explain the motivation, but OCPBUGS-9448 has:

One note is:
"
After the update completes, you can confirm that the cluster version has updated to the new version:
$ oc get clusterversion
"

The thing is that if the upgrade didn't complete, this output may include errors, for example:

" oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.10.26 True True 24m Unable to apply 4.11.0-rc.7: an unknown error has occurred: MultipleErrors
"

We should include a note to disregard these errors as long as the "Progressing" is in True.

But 1143cbe (#55005) was moving us from oc get clusterversion towards oc adm upgrade for "show my current context". Both get ClusterVersion and print a summary of its current status to the console. oc get ... is generic CustomResourceDefinition rendering and the only knobs are things like additionalPrinterColumns, because get needs to work if there are multiple matching resources and fit them each into one line of text. We have more control over the rendering in 'oc adm upgrade', where we know we only care about the ClusterVersion named cluster, and can take as many lines as we need to explain the details of the current cluster state. So basically there's no upside to using oc get clusterversion, unless you are using -o json or something and piping the results into a robot for structured consumption. For direct human consumption, oc adm upgrade will always be better, and in this commit, I'm moving us back to using oc adm upgrade. This also gets the intervening Cluster version is <version> example output back to making sense, since fb0d246 failed to update that example output when it pivoted the suggested command.

I'm also dropping the "you can ignore the error" line. While there are certainly cases when cluster components ask for admin intervention (e.g. by setting Available=False in a ClusterOperator) despite admin intervention not actually being needed, it is absolutely not a good idea to ignore those in general. Cases:

  • Cluster claims it is happy...

    • ... and it actually is happy. Hooray!

    • ... but it is actually unhealthy. Worth a bug to the overly optimistic component about more accurately reporting that it is unhealthy, and should be calling for admin intervention.

  • Cluster claims it is unhealthy...

    • ... and it actually is unhealthy. Sad. But at least it is appropriately calling for help. Error messages should be actionable, so the admin can quickly identify and resolve the issue, and if not, it is worth a bug to the opaquely-failing component about more helpfully guiding the admin through triage and resolution.

    • ... but it is actually pretty healthy. This is the case where the admin could ignore the error message. But it is hard to distinguish from the "actually unhealthy" cases where the admin should not ignore the error message. Certainly:

         an unknown error has occurred: MultipleErrors
      

      is insufficient data to decide that the error report is false-positive noise. Worth digging into the source of the error report, and a bug to the noisy component about not calling for admin assistance when no intervention is actually needed.

There will probably always be some corner cases where the cluster does not ask for help despite needing help, or the cluster asks for help despite not needing help, but as the controllers get smarter, those cases should become more and more rare, to the point where we should not need to discuss them in docs beyond generic comments pointing out that no real-world diagnostic system is completely free of false-positive and false-negative risk.

CC @xenolinux , @achuzhoy , @lahinson , and @jboxman , who were involved in the pull request I'm reverting. If my explaination for the revert don't make sense, please let me know what I'm missing :)

Version(s): #57136 was backported through 4.9 with #57454. Since then, 4.9 and 4.10 have gone end-of-life, so perhaps this revert only needs to get picked back through 4.11?

Link to docs preview:

QE review:

  • QE has approved this change.

Additional information:

Unclear to me how to get more updates-team dev/QE review of these changes. Ideally we'd have the "is this docs change how we want to address this customer issue?" discussion on the original pull request, and not in a revert pull request several months later.

@openshift-ci openshift-ci Bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Oct 5, 2023
@achuzhoy
Copy link
Copy Markdown

achuzhoy commented Oct 5, 2023

@wking The error message suggests the admin should investigate (immediately?) why the issue occurs.
The point I was trying to convey is that no action is necessary as long as the upgrade is in progress.
Id suggest rewording instead of removing completely.
Something along the lines of:
An error message "an unknown error has occurred: MultipleErrors" may appear while the upgrade is in progress and should get resolved upon success.

@ocpdocs-previewbot
Copy link
Copy Markdown

ocpdocs-previewbot commented Oct 5, 2023

🤖 Updated build preview is available at:
https://65812--docspreview.netlify.app

Build log: https://circleci.com/gh/ocpdocs-previewbot/openshift-docs/28103

@wking
Copy link
Copy Markdown
Member Author

wking commented Oct 5, 2023

The point I was trying to convey is that no action is necessary as long as the upgrade is in progress.

Progressing=True is the CVO saying "I'm currently attempting to roll out an update". It is not a statement about whether the CVO feels like it is succeeding or not in that attempt. The CVO gives its feelings about how successful (or not) reconciliation attempts are going via Failing. So while it's certainly possible to have the CVO go Failing=True, complain about MultipleErrors, and subsequently complete the update without needing admin intervention, it's also possible to have the CVO go Failing=True, complain about MultipleErrors, and stick forever until an admin shows up to help resolve the issues (example). The only way to know if admin intervention is needed or not is to respond to the Failing=True and Progressing=True message strings, and dig down until you understand what's actually going wrong.

@achuzhoy
Copy link
Copy Markdown

achuzhoy commented Oct 5, 2023

Can we add with proper phrasing what you wrote to the doc maybe?
i.e. "The only way to know if admin intervention is needed or not is to respond to the Failing=True and Progressing=True message strings, and dig down until you understand what's actually going wrong."

@wking
Copy link
Copy Markdown
Member Author

wking commented Oct 6, 2023

We could add that statement, but presumably folks getting a Failing=True message would have a hint that something might be wrong. I have no problem with docs like "even robots get things wrong sometimes, so remember not to treat them as infallible when investigating reported issues". I'm fine with docs like "robots sometimes get things wrong, but they also often get things right, so if they're telling you the cluster is suffering, at least hear them out to see if they're right that time". I just don't want the current docs that tell people "the robots are probably lying to you; feel free to ignore them" ;)

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 14, 2023
This reverts commit fb0d246, openshift#57136.

The commit message did not explain the motivation, but [1] has:

  One note is:
  "
  After the update completes, you can confirm that the cluster version has updated to the new version:
  $ oc get clusterversion
  "

  The thing is that if the upgrade didn't complete, this output may include errors, for example:

  " oc get clusterversion
  NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
  version 4.10.26 True True 24m Unable to apply 4.11.0-rc.7: an unknown error has occurred: MultipleErrors
  "

  We should include a note to disregard these errors as long as the "Progressing" is in True.

But 1143cbe (Removed jq commands and replaced with oc adm upgrade,
2023-01-20, openshift#55005) was moving us from 'oc get clusterversion' towards
'oc adm upgrade' for "show my current context".  Both get
ClusterVersion and print a summary of its current status to the
console.  'oc get ...' is generic CustomResourceDefinition rendering
and the only knobs are things like additionalPrinterColumns, because
'get' needs to work if there are multiple matching resources and fit
them each into one line of text.  We have more control over the
rendering in 'oc adm upgrade', where we know we only care about the
ClusterVersion named 'cluster', and can take as many lines as we need
to explain the details of the current cluster state.  So basically
there's no upside to using 'oc get clusterversion', unless you are
using '-o json' or something and piping the results into a robot for
structured consumption.  For direct human consumption, 'oc adm
upgrade' will always be better, and in this commit, I'm moving us back
to using 'oc adm upgrade'.  This also gets the intervening 'Cluster
version is <version>' example output back to making sense, since
fb0d246 failed to update that example output when it pivoted
the suggested command.

I'm also dropping the "you can ignore the error" line.  While there
are certainly cases when cluster components ask for admin intervention
(e.g. by setting Available=False in a ClusterOperator) despite admin
intervention not actually being needed, it is absolutely not a good
idea to ignore those in general.  Cases:

* Cluster claims it is happy...
  * ... and it actually is happy.  Hooray!

  * ... but it is actually unhealthy.  Worth a bug to the overly
    optimistic component about more accurately reporting that it is
    unhealthy, and should be calling for admin intervention.

* Cluster claims it is unhealthy...
  * ... and it actually is unhealthy.  Sad.  But at least it is
    appropriately calling for help.  Error messages should be
    actionable, so the admin can quickly identify and resolve the
    issue, and if not, it is worth a bug to the opaquely-failing
    component about more helpfully guiding the admin through
    triage and resolution.

  * ... but it is actually pretty healthy.  This is the case where the
    admin could ignore the error message.  But it is hard to
    distinguish from the "actually unhealthy" cases where the admin
    should not ignore the error message.  Certainly:

      an unknown error has occurred: MultipleErrors

    is insufficient data to decide that the error report is
    false-positive noise.  Worth digging into the source of the error
    report, and a bug to the noisy component about not calling for
    admin assistance when no intervention is actually needed.

There will probably always be some corner cases where the cluster does
not ask for help despite needing help, or the cluster asks for help
despite not needing help, but as the controllers get smarter, those
cases should become more and more rare, to the point where we should
not need to discuss them in docs beyond generic comments pointing out
that no real-world diagnostic system is completely free of
false-positive and false-negative risk.

[1]: https://issues.redhat.com/browse/OCPBUGS-9448
@wking wking force-pushed the do-not-ignore-update-errors-or-get-clusterversion branch from ed8d81d to c71ce8f Compare October 16, 2023 16:58
@wking
Copy link
Copy Markdown
Member Author

wking commented Oct 16, 2023

Rebased around #65618 with ed8d81d -> c71ce8f.

@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 16, 2023
@openshift-ci openshift-ci Bot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 16, 2023
@skopacz1
Copy link
Copy Markdown
Contributor

@evakhoni could you PTAL when you have the chance? Thanks!

@skopacz1
Copy link
Copy Markdown
Contributor

@shellyyang1989 could you PTAL when you have the chance? Thanks

@shellyyang1989
Copy link
Copy Markdown

LGTM

@skopacz1
Copy link
Copy Markdown
Contributor

skopacz1 commented Nov 1, 2023

/label peer-review-needed

@openshift-ci openshift-ci Bot added the peer-review-needed Signifies that the peer review team needs to review this PR label Nov 1, 2023
@mburke5678 mburke5678 added the peer-review-in-progress Signifies that the peer review team is reviewing this PR label Nov 1, 2023
@mburke5678
Copy link
Copy Markdown
Contributor

The actual change LGTM. Whether or not we should make the change, I will leave to other, more informed folks.

@mburke5678 mburke5678 added peer-review-done Signifies that the peer review team has reviewed this PR and removed peer-review-in-progress Signifies that the peer review team is reviewing this PR peer-review-needed Signifies that the peer review team needs to review this PR labels Nov 1, 2023
@wking
Copy link
Copy Markdown
Member Author

wking commented Nov 1, 2023

/label merge-review-needed

@openshift-ci openshift-ci Bot added the merge-review-needed Signifies that the merge review team needs to review this PR label Nov 1, 2023
@michaelryanpeter
Copy link
Copy Markdown
Contributor

/label merge-review-in-progress

@openshift-ci openshift-ci Bot added the merge-review-in-progress Signifies that the merge review team is reviewing this PR label Nov 1, 2023
@michaelryanpeter michaelryanpeter merged commit ba70964 into openshift:main Nov 1, 2023
@michaelryanpeter
Copy link
Copy Markdown
Contributor

/cherrypick enterprise-4.15

@michaelryanpeter
Copy link
Copy Markdown
Contributor

/cherrypick enterprise-4.14

@michaelryanpeter
Copy link
Copy Markdown
Contributor

/cherrypick enterprise-4.13

@michaelryanpeter
Copy link
Copy Markdown
Contributor

/cherrypick enterprise-4.12

@michaelryanpeter
Copy link
Copy Markdown
Contributor

/cherrypick enterprise-4.11

@michaelryanpeter michaelryanpeter added this to the Continuous Release milestone Nov 1, 2023
@openshift-cherrypick-robot
Copy link
Copy Markdown

@michaelryanpeter: new pull request created: #67240

Details

In response to this:

/cherrypick enterprise-4.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot
Copy link
Copy Markdown

@michaelryanpeter: new pull request created: #67241

Details

In response to this:

/cherrypick enterprise-4.14

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot
Copy link
Copy Markdown

@michaelryanpeter: #65812 failed to apply on top of branch "enterprise-4.13":

Applying: Revert "OCPBUGS#9448: Add a note regarding checking clusterversion"
Using index info to reconstruct a base tree...
M	modules/update-upgrading-cli.adoc
Falling back to patching base and 3-way merge...
Auto-merging modules/update-upgrading-cli.adoc
CONFLICT (content): Merge conflict in modules/update-upgrading-cli.adoc
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Revert "OCPBUGS#9448: Add a note regarding checking clusterversion"
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Details

In response to this:

/cherrypick enterprise-4.13

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot
Copy link
Copy Markdown

@michaelryanpeter: #65812 failed to apply on top of branch "enterprise-4.12":

Applying: Revert "OCPBUGS#9448: Add a note regarding checking clusterversion"
Using index info to reconstruct a base tree...
M	modules/update-upgrading-cli.adoc
Falling back to patching base and 3-way merge...
Auto-merging modules/update-upgrading-cli.adoc
CONFLICT (content): Merge conflict in modules/update-upgrading-cli.adoc
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Revert "OCPBUGS#9448: Add a note regarding checking clusterversion"
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Details

In response to this:

/cherrypick enterprise-4.12

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot
Copy link
Copy Markdown

@michaelryanpeter: #65812 failed to apply on top of branch "enterprise-4.11":

Applying: Revert "OCPBUGS#9448: Add a note regarding checking clusterversion"
Using index info to reconstruct a base tree...
M	modules/update-upgrading-cli.adoc
Falling back to patching base and 3-way merge...
Auto-merging modules/update-upgrading-cli.adoc
CONFLICT (content): Merge conflict in modules/update-upgrading-cli.adoc
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Revert "OCPBUGS#9448: Add a note regarding checking clusterversion"
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Details

In response to this:

/cherrypick enterprise-4.11

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@michaelryanpeter
Copy link
Copy Markdown
Contributor

@skopacz1 The bot picks for 4.11 to 4.13 did not pick cleanly. Would you please do a manual cherry pick when you get the chance?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

branch/enterprise-4.11 branch/enterprise-4.12 branch/enterprise-4.13 branch/enterprise-4.14 branch/enterprise-4.15 merge-review-in-progress Signifies that the merge review team is reviewing this PR merge-review-needed Signifies that the merge review team needs to review this PR peer-review-done Signifies that the peer review team has reviewed this PR size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants