Conversation

@wking (Member) commented Sep 2, 2021

Builds on #995; consider reviewing that one first.

Some operators have no configured operands (e.g. the bare-metal operator on non-metal platforms, or the image-registry operator when the admins have configured managementState:Removed). Some operators have many configured operands. Operators writing ClusterOperator conditions should not limit the conditions to speak about just the operator, or just a particular operand. Instead, operators should speak about the component as a whole. Is something about the service that the component provides gone? If so, Available=False (midnight admin page). Is something about the service that the component provides not hitting its service-level objectives? If so, Degraded=True (working-hours admin queue). It doesn't matter whether that thing is "the operator is having trouble talking to the API to figure out how the operands are doing" or "the operator is super-happy, and sees that some operand is sad"; all of that can be distinguished in the reason/message.
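
As a hedged illustration of the distinction above (a sketch only, not any operator's actual code): this uses the openshift/api config/v1 types, but the serviceGone/belowSLO inputs and the helper itself are hypothetical summaries an operator would compute from its operator and operand state.

```go
package conditions

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	configv1 "github.com/openshift/api/config/v1"
)

// componentConditions speaks about the component as a whole: the same
// Available/Degraded pair covers "operator cannot reach the API" and
// "operator is fine but an operand is sad"; only reason/message differ.
func componentConditions(serviceGone, belowSLO bool, reason, message string) []configv1.ClusterOperatorStatusCondition {
	now := metav1.Now()
	available := configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorAvailable,
		Status:             configv1.ConditionTrue,
		LastTransitionTime: now,
	}
	if serviceGone { // something the component provides is gone: midnight admin page
		available.Status = configv1.ConditionFalse
		available.Reason = reason
		available.Message = message
	}
	degraded := configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorDegraded,
		Status:             configv1.ConditionFalse,
		LastTransitionTime: now,
	}
	if belowSLO { // service up, but missing its service-level objectives: working-hours admin queue
		degraded.Status = configv1.ConditionTrue
		degraded.Reason = reason
		degraded.Message = message
	}
	return []configv1.ClusterOperatorStatusCondition{available, degraded}
}
```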

@bparees (Contributor) left a comment

I put comments on the Upgradeable message in the other PR.

But the net-new updates in this PR look reasonable to me.

@wking force-pushed the document-operator-conditions-covering-the-component branch 2 times, most recently from 4f64bd6 to a3258e8 on September 13, 2021 at 22:48
…nents

Some operators have no configured operands (e.g. the bare-metal
operator on non-metal platforms, or the image-registry operator when
the admins have configured managementState:Removed [1]).  Some
operators have many configured operands.  Operators writing
ClusterOperator conditions should not limit the conditions to speak
about just the operator, or just a particular operand.  Instead,
operators should speak about the component as a whole.  Is something
about the service that the component provides gone?  If so,
Available=False (midnight admin page).  Is something about the service
that the component provides not hitting its service-level objectives?
If so, Degraded=True (working-hours admin queue).  It doesn't matter
whether that thing is "the operator is having trouble talking to the
API to figure out how the operands are doing" or "the operator is
super-happy, and sees that some operand is sad"; all of that can be
distinguished in the reason/message.

[1]: https://docs.openshift.com/container-platform/4.8/registry/configuring_registry_storage/configuring-registry-storage-baremetal.html#registry-removed_configuring-registry-storage-baremetal
@wking force-pushed the document-operator-conditions-covering-the-component branch from a3258e8 to f99f4bb on September 14, 2021 at 17:36
@wking (Member, Author) commented Sep 14, 2021

Rebased onto master with a3258e8 -> f99f4bb now that #995 has landed.

@bparees (Contributor) commented Sep 14, 2021

Still looks reasonable to me, but this seems worthy of a second set of eyes.

/approve

openshift-ci bot added the approved label on Sep 14, 2021
@asalkeld commented

Looks great to me; really helps clarify things for baremetal 👍

```go
type ClusterStatusConditionType string

const (
	// Available indicates that the operand (eg: openshift-apiserver for the
```

@wking (Member, Author) commented Sep 15, 2021

In case it helps with review, I personally find the output of:

```console
$ git show --word-diff=color
```

easier to read for this commit than GitHub's rendering.

@sadasu (Contributor) commented Sep 15, 2021

@wking Thanks for providing this clarification for the various CO states. Here are some remaining questions specific to the cluster-baremetal-operator.
The Disabled state is not mentioned in the API, so that is probably still not a supported state.

The cluster-baremetal-operator (CBO) is responsible for deploying the metal3 pod when the Provisioning CR is present and the platform is "Baremetal". Let us consider the following scenarios:

  1. CBO running on unsupported platforms - We do not expect the Provisioning CR to be present or the metal3 pod to be deployed. Would this mean that the ClusterOperator for CBO reports its status as "Available=True"? In this case, the operand is not running, but it is not expected to be running either. (CBO currently uses "Disabled=True" and "Available=True" to indicate this state.)
  2. CBO running on supported platforms without a Provisioning CR - When the Provisioning CR is not present, the operand (metal3 pod) cannot be deployed. Based on the current documentation for the ClusterOperator API, should CBO report the CO status as "Available=False"? (CBO currently uses Disabled=False, Available=False to indicate this state.) But this is not a "page-the-admin" situation, because the Provisioning CR can be added on Day 2 to then deploy the operand (metal3 pod) and start provisioning bare-metal hosts.

@bparees (Contributor) commented Sep 15, 2021

> CBO running on unsupported platforms - We do not expect the Provisioning CR to be present or the metal3 pod to be deployed. Would this mean that the ClusterOperator for CBO reports its status as "Available=True"? In this case, the operand is not running, but it is not expected to be running either. (CBO currently uses "Disabled=True" and "Available=True" to indicate this state.)

Yes. CBO is doing exactly what it is expected to be doing in this scenario, so it is Available=True. (Note: if a CR is created but CBO is actively ignoring it because of the platform type, it would be appropriate for CBO to include some sort of message in its status conditions that makes this clear, like "Available=true Reason=NonMetalPlatform Message=operator is functioning normally; although there is a CR, the CR is ignored because the platform is not metal".)

> CBO running on supported platforms without a Provisioning CR - When the Provisioning CR is not present, the operand (metal3 pod) cannot be deployed. Based on the current documentation for the ClusterOperator API, should CBO report the CO status as "Available=False"? (CBO currently uses Disabled=False, Available=False to indicate this state.) But this is not a "page-the-admin" situation, because the Provisioning CR can be added on Day 2 to then deploy the operand (metal3 pod) and start provisioning bare-metal hosts.

I don't see why it would be Available=False. The function that is expected to be provided is being provided.

Available=true, Reason=NoProvisionRequested Message="CBO is running and responding to requests, however no provisioning is currently requested"

It's only when a CR exists but the request can't be fulfilled that Available=False would make sense. That is the point at which CBO (and its operands) are not providing the functionality they are expected to provide. (Or in other cases where things are going wrong with the operator or operand.)
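
A sketch of that decision matrix in Go. Assumptions: the boolean inputs, the helper name, and the ProvisioningNotFulfilled reason/message are all hypothetical rather than CBO's actual code; the other reason/message strings echo this comment.

```go
package conditions

import configv1 "github.com/openshift/api/config/v1"

// availableCondition: CBO stays Available=True whenever the component is
// doing everything it is currently expected to do.
func availableCondition(metalPlatform, provisioningCRExists, operandHealthy bool) configv1.ClusterOperatorStatusCondition {
	cond := configv1.ClusterOperatorStatusCondition{
		Type:   configv1.OperatorAvailable,
		Status: configv1.ConditionTrue,
	}
	switch {
	case !metalPlatform:
		// Unsupported platform: nothing to deploy, and that is fine.
		cond.Reason = "NonMetalPlatform"
		cond.Message = "operator is functioning normally; any Provisioning CR is ignored because the platform is not metal"
	case !provisioningCRExists:
		// Supported platform, but no provisioning requested yet; the CR
		// can still arrive on Day 2.
		cond.Reason = "NoProvisionRequested"
		cond.Message = "CBO is running and responding to requests, however no provisioning is currently requested"
	case !operandHealthy:
		// A CR exists but the request cannot be fulfilled: the component
		// is no longer providing its expected function.
		cond.Status = configv1.ConditionFalse
		cond.Reason = "ProvisioningNotFulfilled" // hypothetical reason string
		cond.Message = "a Provisioning CR exists but the metal3 deployment is not available"
	}
	return cond
}
```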

@sadasu (Contributor) commented Sep 15, 2021

> CBO running on unsupported platforms - We do not expect the Provisioning CR to be present or the metal3 pod to be deployed. Would this mean that the ClusterOperator for CBO reports its status as "Available=True"? In this case, the operand is not running, but it is not expected to be running either. (CBO currently uses "Disabled=True" and "Available=True" to indicate this state.)

> Yes. CBO is doing exactly what it is expected to be doing in this scenario, so it is Available=True. (Note: if a CR is created but CBO is actively ignoring it because of the platform type, it would be appropriate for CBO to include some sort of message in its status conditions that makes this clear, like "Available=true Reason=NonMetalPlatform Message=operator is functioning normally; although there is a CR, the CR is ignored because the platform is not metal".)

+1. We are also currently setting Disabled=True. Do we stop doing that?

> CBO running on supported platforms without a Provisioning CR - When the Provisioning CR is not present, the operand (metal3 pod) cannot be deployed. Based on the current documentation for the ClusterOperator API, should CBO report the CO status as "Available=False"? (CBO currently uses Disabled=False, Available=False to indicate this state.) But this is not a "page-the-admin" situation, because the Provisioning CR can be added on Day 2 to then deploy the operand (metal3 pod) and start provisioning bare-metal hosts.

> I don't see why it would be Available=False. The function that is expected to be provided is being provided.

> Available=true, Reason=NoProvisionRequested Message="CBO is running and responding to requests, however no provisioning is currently requested"

> It's only when a CR exists but the request can't be fulfilled that Available=False would make sense. That is the point at which CBO (and its operands) are not providing the functionality they are expected to provide. (Or in other cases where things are going wrong with the operator or operand.)

+1 for this too. Since the operand wasn't running, we were setting Disabled=False, Available=False. As you pointed out, CBO is behaving exactly as expected, even when the CR is absent. So we will go ahead and set Available=True with an appropriate Reason and not use the Disabled flag at all.

@bparees (Contributor) commented Sep 15, 2021

> +1. We are also currently setting Disabled=True. Do we stop doing that?

It is up to you. It has no official API meaning or implications for upgrades/alerts/etc., so you can set it or not set it as you like.
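
Since ClusterStatusConditionType is just a string (see the excerpt above), a non-standard condition like Disabled remains representable even though nothing in the platform acts on it. A minimal, hypothetical sketch; the function name, Reason, and Message are made up:

```go
package conditions

import configv1 "github.com/openshift/api/config/v1"

// disabledCondition shows that a non-standard type like Disabled is
// representable; no platform machinery interprets it.
func disabledCondition() configv1.ClusterOperatorStatusCondition {
	return configv1.ClusterOperatorStatusCondition{
		Type:    configv1.ClusterStatusConditionType("Disabled"), // not an API-defined constant
		Status:  configv1.ConditionTrue,
		Reason:  "UnsupportedPlatform", // hypothetical
		Message: "the platform does not support provisioning, so the metal3 operand is not deployed",
	}
}
```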

@sadasu (Contributor) commented Sep 15, 2021

/lgtm
Thanks for making these updates and answering all my questions.

openshift-ci bot added the lgtm label on Sep 15, 2021
@sadasu (Contributor) commented Sep 15, 2021

/hold
@bparees, not sure if you were looking for approvals from more teams/individuals.

openshift-ci bot added the do-not-merge/hold label on Sep 15, 2021
@bparees (Contributor) commented Sep 15, 2021

@wking let's give it another day or two in case anyone else cares enough to weigh in, and then I'd say you can remove the hold.

@awolffredhat commented

/lgtm
Thanks for making this more clear

openshift-ci bot commented Sep 19, 2021

@awolffredhat: changing LGTM is restricted to collaborators


In response to this:

> /lgtm
> Thanks for making this more clear

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci bot commented Sep 19, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: awolffredhat, bparees, sadasu, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bparees (Contributor) commented Sep 19, 2021

/hold cancel

openshift-ci bot removed the do-not-merge/hold label on Sep 19, 2021
openshift-merge-robot merged commit cc0db11 into openshift:master on Sep 19, 2021
@wking deleted the document-operator-conditions-covering-the-component branch on September 21, 2021 at 16:47