Conversation

@wking (Member) commented Sep 2, 2021

Builds on #995; consider reviewing that one first.

Some operators have no configured operands (e.g. the bare-metal operator on non-metal platforms, or the image-registry operator when the admins have configured managementState:Removed). Some operators have many configured operands. Operators writing ClusterOperator conditions should not limit the conditions to speak about just the operator, or just a particular operand. Instead, operators should speak about the component as a whole. Is something about the service that the component provides gone? If so, Available=False (midnight admin page). Is something about the service that the component provides not hitting its service-level objectives? If so, Degraded=True (working-hours admin queue). It doesn't matter whether that thing is "the operator is having trouble talking to the API to figure out how the operands are doing" or "the operator is super-happy, and sees that some operand is sad"; all of that can be distinguished in the reason/message.
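
As a hedged illustration of the distinction above (a sketch only, not any operator's actual code): this uses the openshift/api config/v1 types, but the serviceGone/belowSLO inputs and the helper itself are hypothetical summaries an operator would compute from its operator and operand state.

```go
package conditions

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	configv1 "github.com/openshift/api/config/v1"
)

// componentConditions speaks about the component as a whole: the same
// Available/Degraded pair covers "operator cannot reach the API" and
// "operator is fine but an operand is sad"; only reason/message differ.
func componentConditions(serviceGone, belowSLO bool, reason, message string) []configv1.ClusterOperatorStatusCondition {
	now := metav1.Now()
	available := configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorAvailable,
		Status:             configv1.ConditionTrue,
		LastTransitionTime: now,
	}
	if serviceGone { // something the component provides is gone: midnight admin page
		available.Status = configv1.ConditionFalse
		available.Reason = reason
		available.Message = message
	}
	degraded := configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorDegraded,
		Status:             configv1.ConditionFalse,
		LastTransitionTime: now,
	}
	if belowSLO { // service up, but missing its service-level objectives: working-hours admin queue
		degraded.Status = configv1.ConditionTrue
		degraded.Reason = reason
		degraded.Message = message
	}
	return []configv1.ClusterOperatorStatusCondition{available, degraded}
}
```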

@bparees (Contributor) left a comment

I put comments on the Upgradeable message in the other PR.

But the net-new updates in this PR look reasonable to me.

@wking force-pushed the document-operator-conditions-covering-the-component branch 2 times, most recently from 4f64bd6 to a3258e8 on September 13, 2021 at 22:48
…nents

Some operators have no configured operands (e.g. the bare-metal
operator on non-metal platforms, or the image-registry operator when
the admins have configured managementState:Removed [1]).  Some
operators have many configured operands.  Operators writing
ClusterOperator conditions should not limit the conditions to speak
about just the operator, or just a particular operand.  Instead,
operators should speak about the component as a whole.  Is something
about the service that the component provides gone?  If so,
Available=False (midnight admin page).  Is something about the service
that the component provides not hitting its service-level objectives?
If so, Degraded=True (working-hours admin queue).  It doesn't matter
whether that thing is "the operator is having trouble talking to the
API to figure out how the operands are doing" or "the operator is
super-happy, and sees that some operand is sad"; all of that can be
distinguished in the reason/message.

[1]: https://docs.openshift.com/container-platform/4.8/registry/configuring_registry_storage/configuring-registry-storage-baremetal.html#registry-removed_configuring-registry-storage-baremetal
@wking force-pushed the document-operator-conditions-covering-the-component branch from a3258e8 to f99f4bb on September 14, 2021 at 17:36
@wking (Member, Author) commented Sep 14, 2021

Rebased onto master with a3258e8 -> f99f4bb now that #995 has landed.

@bparees (Contributor) commented Sep 14, 2021

Still looks reasonable to me, but this seems worthy of a second set of eyes.

/approve

openshift-ci bot added the approved label on Sep 14, 2021
@asalkeld commented

Looks great to me; really helps clarify things for baremetal 👍

```go
type ClusterStatusConditionType string

const (
	// Available indicates that the operand (eg: openshift-apiserver for the
```

@wking (Member, Author) commented Sep 15, 2021

In case it helps with review, I personally find the output of:

```console
$ git show --word-diff=color
```

easier to read for this commit than GitHub's rendering.

@sadasu (Contributor) commented Sep 15, 2021

@wking Thanks for providing this clarification for the various CO states. Here are some remaining questions specific to the cluster-baremetal-operator.
The Disabled state is not mentioned in the API, so that is probably still not a supported state.

The cluster-baremetal-operator (CBO) is responsible for deploying the metal3 pod when the Provisioning CR is present and the platform is "Baremetal". Let us consider the following scenarios:

  1. CBO running on unsupported platforms - We do not expect the Provisioning CR to be present or the metal3 pod to be deployed. Would this mean that the ClusterOperator for CBO reports its status as "Available=True"? In this case, the operand is not running, but it is not expected to be running either. (CBO currently uses "Disabled=True" and "Available=True" to indicate this state.)
  2. CBO running on supported platforms without a Provisioning CR - When the Provisioning CR is not present, the operand (metal3 pod) cannot be deployed. Based on the current documentation for the ClusterOperator API, should CBO report the CO status as "Available=False"? (CBO currently uses Disabled=False, Available=False to indicate this state.) But this is not a "page-the-admin" situation, because the Provisioning CR can be added on Day 2 to then deploy the operand (metal3 pod) and start provisioning bare-metal hosts.

@bparees (Contributor) commented Sep 15, 2021

> CBO running on unsupported platforms - We do not expect the Provisioning CR to be present or the metal3 pod to be deployed. Would this mean that the ClusterOperator for CBO reports its status as "Available=True"? In this case, the operand is not running, but it is not expected to be running either. (CBO currently uses "Disabled=True" and "Available=True" to indicate this state.)

Yes. CBO is doing exactly what it is expected to be doing in this scenario, so it is Available=True. (Note: if a CR is created but CBO is actively ignoring it because of the platform type, it would be appropriate for CBO to include some sort of message in its status conditions that makes this clear, like "Available=true Reason=NonMetalPlatform Message=operator is functioning normally; although there is a CR, the CR is ignored because the platform is not metal".)

> CBO running on supported platforms without a Provisioning CR - When the Provisioning CR is not present, the operand (metal3 pod) cannot be deployed. Based on the current documentation for the ClusterOperator API, should CBO report the CO status as "Available=False"? (CBO currently uses Disabled=False, Available=False to indicate this state.) But this is not a "page-the-admin" situation, because the Provisioning CR can be added on Day 2 to then deploy the operand (metal3 pod) and start provisioning bare-metal hosts.

I don't see why it would be Available=False. The function that is expected to be provided is being provided.

Available=true, Reason=NoProvisionRequested Message="CBO is running and responding to requests, however no provisioning is currently requested"

It's only when a CR exists but the request can't be fulfilled that Available=False would make sense. That is the point at which CBO (and its operands) are not providing the functionality they are expected to provide. (Or in other cases where things are going wrong with the operator or operand.)
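
A sketch of that decision matrix in Go. Assumptions: the boolean inputs, the helper name, and the ProvisioningNotFulfilled reason/message are all hypothetical rather than CBO's actual code; the other reason/message strings echo this comment.

```go
package conditions

import configv1 "github.com/openshift/api/config/v1"

// availableCondition: CBO stays Available=True whenever the component is
// doing everything it is currently expected to do.
func availableCondition(metalPlatform, provisioningCRExists, operandHealthy bool) configv1.ClusterOperatorStatusCondition {
	cond := configv1.ClusterOperatorStatusCondition{
		Type:   configv1.OperatorAvailable,
		Status: configv1.ConditionTrue,
	}
	switch {
	case !metalPlatform:
		// Unsupported platform: nothing to deploy, and that is fine.
		cond.Reason = "NonMetalPlatform"
		cond.Message = "operator is functioning normally; any Provisioning CR is ignored because the platform is not metal"
	case !provisioningCRExists:
		// Supported platform, but no provisioning requested yet; the CR
		// can still arrive on Day 2.
		cond.Reason = "NoProvisionRequested"
		cond.Message = "CBO is running and responding to requests, however no provisioning is currently requested"
	case !operandHealthy:
		// A CR exists but the request cannot be fulfilled: the component
		// is no longer providing its expected function.
		cond.Status = configv1.ConditionFalse
		cond.Reason = "ProvisioningNotFulfilled" // hypothetical reason string
		cond.Message = "a Provisioning CR exists but the metal3 deployment is not available"
	}
	return cond
}
```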

@sadasu (Contributor) commented Sep 15, 2021

> CBO running on unsupported platforms - We do not expect the Provisioning CR to be present or the metal3 pod to be deployed. Would this mean that the ClusterOperator for CBO reports its status as "Available=True"? In this case, the operand is not running, but it is not expected to be running either. (CBO currently uses "Disabled=True" and "Available=True" to indicate this state.)

> Yes. CBO is doing exactly what it is expected to be doing in this scenario, so it is Available=True. (Note: if a CR is created but CBO is actively ignoring it because of the platform type, it would be appropriate for CBO to include some sort of message in its status conditions that makes this clear, like "Available=true Reason=NonMetalPlatform Message=operator is functioning normally; although there is a CR, the CR is ignored because the platform is not metal".)

+1. We are also currently setting Disabled=True. Do we stop doing that?

> CBO running on supported platforms without a Provisioning CR - When the Provisioning CR is not present, the operand (metal3 pod) cannot be deployed. Based on the current documentation for the ClusterOperator API, should CBO report the CO status as "Available=False"? (CBO currently uses Disabled=False, Available=False to indicate this state.) But this is not a "page-the-admin" situation, because the Provisioning CR can be added on Day 2 to then deploy the operand (metal3 pod) and start provisioning bare-metal hosts.

> I don't see why it would be Available=False. The function that is expected to be provided is being provided.

> Available=true, Reason=NoProvisionRequested Message="CBO is running and responding to requests, however no provisioning is currently requested"

> It's only when a CR exists but the request can't be fulfilled that Available=False would make sense. That is the point at which CBO (and its operands) are not providing the functionality they are expected to provide. (Or in other cases where things are going wrong with the operator or operand.)

+1 for this too. Since the operand wasn't running, we were setting Disabled=False, Available=False. As you pointed out, CBO is behaving exactly as expected, even when the CR is absent. So we will go ahead and set Available=True with an appropriate Reason and not use the Disabled flag at all.

@bparees (Contributor) commented Sep 15, 2021

> +1. We are also currently setting Disabled=True. Do we stop doing that?

It is up to you. It has no official API meaning or implications for upgrades/alerts/etc., so you can set it or not set it as you like.
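
Since ClusterStatusConditionType is just a string (see the excerpt above), a non-standard condition like Disabled remains representable even though nothing in the platform acts on it. A minimal, hypothetical sketch; the function name, Reason, and Message are made up:

```go
package conditions

import configv1 "github.com/openshift/api/config/v1"

// disabledCondition shows that a non-standard type like Disabled is
// representable; no platform machinery interprets it.
func disabledCondition() configv1.ClusterOperatorStatusCondition {
	return configv1.ClusterOperatorStatusCondition{
		Type:    configv1.ClusterStatusConditionType("Disabled"), // not an API-defined constant
		Status:  configv1.ConditionTrue,
		Reason:  "UnsupportedPlatform", // hypothetical
		Message: "the platform does not support provisioning, so the metal3 operand is not deployed",
	}
}
```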

@sadasu (Contributor) commented Sep 15, 2021

/lgtm
Thanks for making these updates and answering all my questions.

openshift-ci bot added the lgtm label on Sep 15, 2021
@sadasu (Contributor) commented Sep 15, 2021

/hold
@bparees, not sure if you were looking for approvals from more teams/individuals.

openshift-ci bot added the do-not-merge/hold label on Sep 15, 2021
@bparees (Contributor) commented Sep 15, 2021

@wking let's give it another day or two in case anyone else cares enough to weigh in, and then I'd say you can remove the hold.

@awolffredhat commented

/lgtm
Thanks for making this more clear

openshift-ci bot commented Sep 19, 2021

@awolffredhat: changing LGTM is restricted to collaborators


In response to this:

> /lgtm
> Thanks for making this more clear

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci bot commented Sep 19, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: awolffredhat, bparees, sadasu, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bparees (Contributor) commented Sep 19, 2021

/hold cancel

openshift-ci bot removed the do-not-merge/hold label on Sep 19, 2021
openshift-merge-robot merged commit cc0db11 into openshift:master on Sep 19, 2021
@wking deleted the document-operator-conditions-covering-the-component branch on September 21, 2021 at 16:47