Revert "OCPBUGS#9448: Add a note regarding checking clusterversion" #65812
@wking The error message suggests the admin should investigate (immediately?) why the issue occurs.

🤖 Updated build preview is available at: Build log: https://circleci.com/gh/ocpdocs-previewbot/openshift-docs/28103

Could we add what you wrote to the doc, with proper phrasing?

We could add that statement, but presumably folks getting a

This reverts commit fb0d246, openshift#57136.

The commit message did not explain the motivation, but [1] has:

  One note is:

    After the update completes, you can confirm that the cluster version has updated to the new version:

      $ oc get clusterversion

  The thing is that if the upgrade didn't complete, this output may include errors, for example:

    oc get clusterversion
    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.10.26   True        True          24m     Unable to apply 4.11.0-rc.7: an unknown error has occurred: MultipleErrors

  We should include a note to disregard these errors as long as the "Progressing" is in True.

But 1143cbe (Removed jq commands and replaced with oc adm upgrade, 2023-01-20, openshift#55005) was moving us from 'oc get clusterversion' towards 'oc adm upgrade' for "show my current context". Both get ClusterVersion and print a summary of its current status to the console. 'oc get ...' is generic CustomResourceDefinition rendering and the only knobs are things like additionalPrinterColumns, because 'get' needs to work if there are multiple matching resources and fit them each into one line of text. We have more control over the rendering in 'oc adm upgrade', where we know we only care about the ClusterVersion named 'version', and can take as many lines as we need to explain the details of the current cluster state. So basically there's no upside to using 'oc get clusterversion', unless you are using '-o json' or something and piping the results into a robot for structured consumption. For direct human consumption, 'oc adm upgrade' will always be better, and in this commit, I'm moving us back to using 'oc adm upgrade'.

This also gets the intervening 'Cluster version is <version>' example output back to making sense, since fb0d246 failed to update that example output when it pivoted the suggested command.

I'm also dropping the "you can ignore the error" line. While there are certainly cases when cluster components ask for admin intervention (e.g. by setting Available=False in a ClusterOperator) despite admin intervention not actually being needed, it is absolutely not a good idea to ignore those in general. Cases:

* Cluster claims it is happy...
  * ... and it actually is happy. Hooray!
  * ... but it is actually unhealthy. Worth a bug to the overly optimistic component about more accurately reporting that it is unhealthy, and should be calling for admin intervention.
* Cluster claims it is unhealthy...
  * ... and it actually is unhealthy. Sad. But at least it is appropriately calling for help. Error messages should be actionable, so the admin can quickly identify and resolve the issue, and if not, it is worth a bug to the opaquely-failing component about more helpfully guiding the admin through triage and resolution.
  * ... but it is actually pretty healthy. This is the case where the admin could ignore the error message. But it is hard to distinguish from the "actually unhealthy" cases where the admin should not ignore the error message. Certainly:

      an unknown error has occurred: MultipleErrors

    is insufficient data to decide that the error report is false-positive noise. Worth digging into the source of the error report, and a bug to the noisy component about not calling for admin assistance when no intervention is actually needed.

There will probably always be some corner cases where the cluster does not ask for help despite needing help, or the cluster asks for help despite not needing help, but as the controllers get smarter, those cases should become more and more rare, to the point where we should not need to discuss them in docs beyond generic comments pointing out that no real-world diagnostic system is completely free of false-positive and false-negative risk.

[1]: https://issues.redhat.com/browse/OCPBUGS-9448
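
To illustrate the "piping the results into a robot" case mentioned above, here is a minimal sketch of structured consumption. It assumes the input is what `oc get clusterversion version -o json` prints. The embedded `SAMPLE` document is hypothetical and heavily trimmed, with values that only echo the example output quoted above, and the rule for what counts as "concerning" is an illustrative assumption, not documented policy.

```python
import json

# Hypothetical, trimmed ClusterVersion document. A real one has many more fields;
# the condition values here only echo the example output quoted in the commit message.
SAMPLE = """
{
  "kind": "ClusterVersion",
  "metadata": {"name": "version"},
  "status": {
    "desired": {"version": "4.11.0-rc.7"},
    "conditions": [
      {"type": "Available", "status": "True",
       "message": "Done applying 4.10.26"},
      {"type": "Failing", "status": "True", "reason": "MultipleErrors",
       "message": "Multiple errors are preventing progress"},
      {"type": "Progressing", "status": "True", "reason": "MultipleErrors",
       "message": "Unable to apply 4.11.0-rc.7: an unknown error has occurred: MultipleErrors"}
    ]
  }
}
"""


def concerning_conditions(cluster_version):
    """Return (type, reason, message) tuples for conditions worth a closer look.

    Assumption for this sketch: Available should be True and Failing should be
    False; anything else is reported rather than ignored.
    """
    found = []
    for cond in cluster_version.get("status", {}).get("conditions", []):
        ctype = cond.get("type")
        status = cond.get("status")
        if (ctype == "Available" and status != "True") or (
            ctype == "Failing" and status == "True"
        ):
            found.append((ctype, cond.get("reason", ""), cond.get("message", "")))
    return found


if __name__ == "__main__":
    for ctype, reason, message in concerning_conditions(json.loads(SAMPLE)):
        print(f"{ctype} ({reason}): {message}")
```

A script like this only surfaces what the cluster reports; deciding whether a report is a false positive still needs the investigation described above.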
ed8d81d to c71ce8f

@evakhoni could you PTAL when you have the chance? Thanks!

@shellyyang1989 could you PTAL when you have the chance? Thanks

LGTM

/label peer-review-needed

The actual change LGTM. Whether or not we should make the change, I will leave to other, more informed folks.

/label merge-review-needed

/label merge-review-in-progress

/cherrypick enterprise-4.15

/cherrypick enterprise-4.14

/cherrypick enterprise-4.13

/cherrypick enterprise-4.12

/cherrypick enterprise-4.11

@michaelryanpeter: new pull request created: #67240
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@michaelryanpeter: new pull request created: #67241

@michaelryanpeter: #65812 failed to apply on top of branch "enterprise-4.13":

@michaelryanpeter: #65812 failed to apply on top of branch "enterprise-4.12":

@michaelryanpeter: #65812 failed to apply on top of branch "enterprise-4.11":

@skopacz1 The bot picks for 4.11 to 4.13 did not pick cleanly. Would you please do a manual cherry pick when you get the chance?

This reverts commit fb0d246, #57136.

The commit message did not explain the motivation, but OCPBUGS-9448 has:

> One note is: "After the update completes, you can confirm that the cluster version has updated to the new version: `$ oc get clusterversion`"
>
> The thing is that if the upgrade didn't complete, this output may include errors, for example:
>
>     oc get clusterversion
>     NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
>     version   4.10.26   True        True          24m     Unable to apply 4.11.0-rc.7: an unknown error has occurred: MultipleErrors
>
> We should include a note to disregard these errors as long as the "Progressing" is in True.

But 1143cbe (#55005) was moving us from `oc get clusterversion` towards `oc adm upgrade` for "show my current context". Both get ClusterVersion and print a summary of its current status to the console. `oc get ...` is generic CustomResourceDefinition rendering and the only knobs are things like `additionalPrinterColumns`, because `get` needs to work if there are multiple matching resources and fit them each into one line of text. We have more control over the rendering in `oc adm upgrade`, where we know we only care about the ClusterVersion named `version`, and can take as many lines as we need to explain the details of the current cluster state. So basically there's no upside to using `oc get clusterversion`, unless you are using `-o json` or something and piping the results into a robot for structured consumption. For direct human consumption, `oc adm upgrade` will always be better, and in this commit, I'm moving us back to using `oc adm upgrade`. This also gets the intervening `Cluster version is <version>` example output back to making sense, since fb0d246 failed to update that example output when it pivoted the suggested command.

I'm also dropping the "you can ignore the error" line. While there are certainly cases when cluster components ask for admin intervention (e.g. by setting `Available=False` in a ClusterOperator) despite admin intervention not actually being needed, it is absolutely not a good idea to ignore those in general. Cases:

* Cluster claims it is happy...
  * ... and it actually is happy. Hooray!
  * ... but it is actually unhealthy. Worth a bug to the overly optimistic component about more accurately reporting that it is unhealthy, and should be calling for admin intervention.
* Cluster claims it is unhealthy...
  * ... and it actually is unhealthy. Sad. But at least it is appropriately calling for help. Error messages should be actionable, so the admin can quickly identify and resolve the issue, and if not, it is worth a bug to the opaquely-failing component about more helpfully guiding the admin through triage and resolution.
  * ... but it is actually pretty healthy. This is the case where the admin could ignore the error message. But it is hard to distinguish from the "actually unhealthy" cases where the admin should not ignore the error message. Certainly, `an unknown error has occurred: MultipleErrors` is insufficient data to decide that the error report is false-positive noise. Worth digging into the source of the error report, and a bug to the noisy component about not calling for admin assistance when no intervention is actually needed.

There will probably always be some corner cases where the cluster does not ask for help despite needing help, or the cluster asks for help despite not needing help, but as the controllers get smarter, those cases should become more and more rare, to the point where we should not need to discuss them in docs beyond generic comments pointing out that no real-world diagnostic system is completely free of false-positive and false-negative risk.
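
The four cases above amount to a small decision table. The following is only a sketch to make that table explicit: the function name `triage` and the wording of the returned recommendations are hypothetical paraphrases of the list, not anything implemented in `oc` or the cluster.

```python
def triage(claims_healthy: bool, actually_healthy: bool) -> str:
    """Map (what the cluster reports, what is really true) to a suggested response.

    Hypothetical helper that paraphrases the case list; it is not part of any tool.
    """
    if claims_healthy and actually_healthy:
        return "no action needed"
    if claims_healthy and not actually_healthy:
        return "file a bug: the component is overly optimistic and should call for help"
    if not claims_healthy and not actually_healthy:
        return "investigate and resolve; file a bug if the error message is not actionable"
    # Claims unhealthy but is actually healthy: a false positive.
    return "dig into the source of the report; file a bug against the noisy component"


if __name__ == "__main__":
    for claims in (True, False):
        for actual in (True, False):
            print(f"claims_healthy={claims}, actually_healthy={actual}: {triage(claims, actual)}")
```

The catch is that `actually_healthy` is not known up front. An admin only sees what the cluster claims, so both "claims unhealthy" rows start with investigation, which is why a blanket "ignore the error" note is risky.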
CC @xenolinux, @achuzhoy, @lahinson, and @jboxman, who were involved in the pull request I'm reverting. If my explanation for the revert doesn't make sense, please let me know what I'm missing :)
Version(s): #57136 was backported through 4.9 with #57454. Since then, 4.9 and 4.10 have gone end-of-life, so perhaps this revert only needs to get picked back through 4.11?
Link to docs preview:
QE review:
Additional information:
Unclear to me how to get more updates-team dev/QE review of these changes. Ideally we'd have the "is this docs change how we want to address this customer issue?" discussion on the original pull request, and not in a revert pull request several months later.