
OCPBUGS#9448: Add a note regarding checking the cluster version #57136

Merged
jboxman-rh merged 1 commit into openshift:main from xenolinux:clusterversion
Mar 17, 2023

Conversation

@xenolinux
Contributor

@xenolinux xenolinux commented Mar 14, 2023

@openshift-ci openshift-ci Bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Mar 14, 2023
@ocpdocs-previewbot

ocpdocs-previewbot commented Mar 14, 2023

🤖 Updated build preview is available at:
https://57136--docspreview.netlify.app

Build log: https://circleci.com/gh/ocpdocs-previewbot/openshift-docs/11401

@xenolinux xenolinux force-pushed the clusterversion branch 2 times, most recently from 944de6a to ff043df on March 16, 2023
@achuzhoy

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Mar 16, 2023
@xenolinux
Contributor Author

/label peer-review-needed

@openshift-ci openshift-ci Bot added the peer-review-needed Signifies that the peer review team needs to review this PR label Mar 17, 2023
@lahinson
Contributor

/label peer-review-in-progress
/remove-label peer-review-needed

@openshift-ci openshift-ci Bot added the peer-review-in-progress label (Signifies that the peer review team is reviewing this PR) and removed the peer-review-needed label (Signifies that the peer review team needs to review this PR) Mar 17, 2023
Contributor

@lahinson lahinson left a comment


I suggested a slight revision for clarity, but otherwise, LGTM!

Comment thread on modules/update-upgrading-cli.adoc (outdated):
+
[NOTE]
====
You can ignore if the `oc get clusterversion` command gives the following error while `PROGRESSING` is `True`:
Contributor


Suggested change
You can ignore if the `oc get clusterversion` command gives the following error while `PROGRESSING` is `True`:
If the `oc get clusterversion` command displays the following error while the `PROGRESSING` status is `True`, you can ignore the error.
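The check the suggested note describes can be sketched in shell. This is a hypothetical, offline illustration: the sample line is the output quoted in OCPBUGS-9448, standing in for a live `oc get clusterversion --no-headers` call, and the field position assumes the default column order (NAME VERSION AVAILABLE PROGRESSING SINCE STATUS).

```shell
# Hypothetical sketch: decide whether an error in `oc get clusterversion`
# output can be disregarded because the PROGRESSING status is True.
# The sample line below is the output quoted in OCPBUGS-9448; on a live
# cluster you would pipe `oc get clusterversion --no-headers` instead.
sample='version   4.10.26   True   True   24m   Unable to apply 4.11.0-rc.7: an unknown error has occurred: MultipleErrors'
# Column 4 is PROGRESSING in the default output layout.
progressing=$(printf '%s\n' "$sample" | awk '{print $4}')
if [ "$progressing" = "True" ]; then
  echo "update still in progress; this error can be disregarded for now"
else
  echo "update is not progressing; investigate the error"
fi
```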

Contributor Author


Addressed

@lahinson
Contributor

/remove-label peer-review-in-progress
/label peer-review-done

@openshift-ci openshift-ci Bot added the peer-review-done label (Signifies that the peer review team has reviewed this PR) and removed the peer-review-in-progress label (Signifies that the peer review team is reviewing this PR) Mar 17, 2023
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 17, 2023
@openshift-ci

openshift-ci Bot commented Mar 17, 2023

New changes are detected. LGTM label has been removed.

@xenolinux
Contributor Author

/label merge-review-needed

@openshift-ci openshift-ci Bot added the merge-review-needed Signifies that the merge review team needs to review this PR label Mar 17, 2023
@jboxman-rh jboxman-rh added the merge-review-in-progress label (Signifies that the merge review team is reviewing this PR) and the branch/enterprise-4.9, branch/enterprise-4.10, branch/enterprise-4.11, branch/enterprise-4.12, and branch/enterprise-4.13 labels, and removed the merge-review-needed label (Signifies that the merge review team needs to review this PR) Mar 17, 2023
@jboxman-rh jboxman-rh added this to the Continuous Release milestone Mar 17, 2023
@jboxman-rh jboxman-rh merged commit 358b3cc into openshift:main Mar 17, 2023
@jboxman-rh

/cherry-pick enterprise-4.13

@openshift-cherrypick-robot

@jboxman-rh: new pull request created: #57356


In response to this:

/cherry-pick enterprise-4.13

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jboxman-rh

/cherry-pick enterprise-4.12

@jboxman-rh

/cherry-pick enterprise-4.11

@jboxman-rh

/cherry-pick enterprise-4.10

@jboxman-rh

/cherry-pick enterprise-4.9

@openshift-cherrypick-robot

@jboxman-rh: new pull request created: #57361


In response to this:

/cherry-pick enterprise-4.12

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@jboxman-rh: new pull request created: #57362


In response to this:

/cherry-pick enterprise-4.11

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@jboxman-rh: new pull request created: #57363


In response to this:

/cherry-pick enterprise-4.10

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@jboxman-rh: #57136 failed to apply on top of branch "enterprise-4.9":

Applying: OCPBUGS#9448: Add a note regarding checking clusterversion
Using index info to reconstruct a base tree...
M	modules/update-upgrading-cli.adoc
Falling back to patching base and 3-way merge...
Auto-merging modules/update-upgrading-cli.adoc
CONFLICT (content): Merge conflict in modules/update-upgrading-cli.adoc
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 OCPBUGS#9448: Add a note regarding checking clusterversion
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


In response to this:

/cherry-pick enterprise-4.9

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jboxman-rh

@xenolinux, it looks like this doesn't merge cleanly into 4.9. Is this something that you can fix? Thanks!

@xenolinux
Contributor Author

@xenolinux, it looks like this doesn't merge cleanly into 4.9. Is this something that you can fix? Thanks!

Yes. Opened a manual cherry-pick PR #57454
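The `git am` failure the cherry-pick robot reported can be reproduced and resolved end to end in a throwaway repository. The sketch below is hypothetical: the file, branch, and commit names only mimic this PR (the real conflict was in modules/update-upgrading-cli.adoc on enterprise-4.9), and it follows the same resolve/stage/`git am --continue` steps the robot's error message suggests.

```shell
# Hypothetical reproduction of the cherry-pick conflict workflow.
tmp=$(mktemp -d)
cd "$tmp"
git init -q demo
cd demo
git symbolic-ref HEAD refs/heads/main
git config user.email docs@example.com
git config user.name "Docs Demo"

printf 'original paragraph\n' > update-upgrading-cli.adoc
git add update-upgrading-cli.adoc
git commit -qm 'base'

# The change we want to cherry-pick, exported as a mailbox patch.
git checkout -qb clusterversion
printf 'original paragraph, plus a clusterversion note\n' > update-upgrading-cli.adoc
git commit -qam 'OCPBUGS#9448: Add a note regarding checking clusterversion'
git format-patch -1 --stdout > ../note.patch

# The target branch rewrote the same paragraph, so the patch conflicts.
git checkout -q main
printf 'rewritten paragraph\n' > update-upgrading-cli.adoc
git commit -qam 'diverge'

if ! git am -3 ../note.patch >/dev/null 2>&1; then
  # Resolve the conflict by hand, stage the result, and continue --
  # the steps the robot's "git am" error message walks through.
  printf 'rewritten paragraph, plus a clusterversion note\n' > update-upgrading-cli.adoc
  git add update-upgrading-cli.adoc
  git am --continue >/dev/null 2>&1
fi

top=$(git log --format=%s -1)
echo "$top"
```

Either way the branch ends with the cherry-picked subject on top, which is what a manual cherry-pick PR delivers.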

@jboxman-rh jboxman-rh removed the merge-review-in-progress Signifies that the merge review team is reviewing this PR label Apr 3, 2023
wking added a commit to wking/openshift-docs that referenced this pull request Oct 16, 2023
This reverts commit fb0d246, openshift#57136.

The commit message did not explain the motivation, but [1] has:

  One note is:
  "
  After the update completes, you can confirm that the cluster version has updated to the new version:
  $ oc get clusterversion
  "

  The thing is that if the upgrade didn't complete, this output may include errors, for example:

  " oc get clusterversion
  NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
  version 4.10.26 True True 24m Unable to apply 4.11.0-rc.7: an unknown error has occurred: MultipleErrors
  "

  We should include a note to disregard these errors as long as the "Progressing" is in True.

But 1143cbe (Removed jq commands and replaced with oc adm upgrade,
2023-01-20, openshift#55005) was moving us from 'oc get clusterversion' towards
'oc adm upgrade' for "show my current context".  Both get
ClusterVersion and print a summary of its current status to the
console.  'oc get ...' is generic CustomResourceDefinition rendering
and the only knobs are things like additionalPrinterColumns, because
'get' needs to work if there are multiple matching resources and fit
them each into one line of text.  We have more control over the
rendering in 'oc adm upgrade', where we know we only care about the
ClusterVersion named 'cluster', and can take as many lines as we need
to explain the details of the current cluster state.  So basically
there's no upside to using 'oc get clusterversion', unless you are
using '-o json' or something and piping the results into a robot for
structured consumption.  For direct human consumption, 'oc adm
upgrade' will always be better, and in this commit, I'm moving us back
to using 'oc adm upgrade'.  This also gets the intervening 'Cluster
version is <version>' example output back to making sense, since
fb0d246 failed to update that example output when it pivoted
the suggested command.

I'm also dropping the "you can ignore the error" line.  While there
are certainly cases when cluster components ask for admin intervention
(e.g. by setting Available=False in a ClusterOperator) despite admin
intervention not actually being needed, it is absolutely not a good
idea to ignore those in general.  Cases:

* Cluster claims it is happy...
  * ... and it actually is happy.  Hooray!

  * ... but it is actually unhealthy.  Worth a bug to the overly
    optimistic component about more accurately reporting that it is
    unhealthy, and should be calling for admin intervention.

* Cluster claims it is unhealthy...
  * ... and it actually is unhealthy.  Sad.  But at least it is
    appropriately calling for help.  Error messages should be
    actionable, so the admin can quickly identify and resolve the
    issue, and if not, it is worth a bug to the opaquely-failing
    component about more helpfully guiding the admin through
    triage and resolution.

  * ... but it is actually pretty healthy.  This is the case where the
    admin could ignore the error message.  But it is hard to
    distinguish from the "actually unhealthy" cases where the admin
    should not ignore the error message.  Certainly:

      an unknown error has occurred: MultipleErrors

    is insufficient data to decide that the error report is
    false-positive noise.  Worth digging into the source of the error
    report, and a bug to the noisy component about not calling for
    admin assistance when no intervention is actually needed.

There will probably always be some corner cases where the cluster does
not ask for help despite needing help, or the cluster asks for help
despite not needing help, but as the controllers get smarter, those
cases should become more and more rare, to the point where we should
not need to discuss them in docs beyond generic comments pointing out
that no real-world diagnostic system is completely free of
false-positive and false-negative risk.

[1]: https://issues.redhat.com/browse/OCPBUGS-9448
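The commit message above reserves `oc get clusterversion` for structured consumption by robots. A hypothetical offline sketch of that case: on a live cluster the jsonpath query in the comment would read the Progressing condition directly, while here a canned list of conditions stands in so the extraction runs without a cluster.

```shell
# Structured consumption sketch. On a live cluster you would run:
#   oc get clusterversion version \
#     -o jsonpath='{.status.conditions[?(@.type=="Progressing")].status}'
# The canned conditions below stand in for that query's inputs.
conditions='Available=True
Progressing=True
Failing=False'
progressing=$(printf '%s\n' "$conditions" | sed -n 's/^Progressing=//p')
echo "Progressing=$progressing"
```

For a human at a terminal, `oc adm upgrade` remains the better summary, as the commit message argues.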
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/openshift-docs that referenced this pull request Nov 1, 2023
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/openshift-docs that referenced this pull request Nov 1, 2023
skopacz1 pushed a commit to skopacz1/openshift-docs that referenced this pull request Nov 1, 2023
skopacz1 pushed a commit to skopacz1/openshift-docs that referenced this pull request Nov 1, 2023
michaelryanpeter pushed a commit that referenced this pull request Nov 1, 2023
GroceryBoyJr pushed a commit to GroceryBoyJr/openshift-docs that referenced this pull request Nov 17, 2023
This reverts commit fb0d246, openshift#57136.
DCChadwick pushed a commit to DCChadwick/openshift-docs that referenced this pull request Nov 28, 2023
This reverts commit fb0d246, openshift#57136.
skrthomas pushed a commit to skrthomas/openshift-docs that referenced this pull request Dec 6, 2023
This reverts commit fb0d246, openshift#57136.

Labels

branch/enterprise-4.9 branch/enterprise-4.10 branch/enterprise-4.11 branch/enterprise-4.12 branch/enterprise-4.13 peer-review-done Signifies that the peer review team has reviewed this PR size/S Denotes a PR that changes 10-29 lines, ignoring generated files.


6 participants