OCPBUGS-1636: pkg/cvo/sync_worker: Pre-create ClusterOperator in reconciling-mode too #840

wking · 2022-09-23T07:16:37Z

Originally, all component operators were responsible for creating their own ClusterOperator, and we'd just watch to make sure we were happy enough with what they did. However, on install, or when updating to a version that added a new component, we could have timelines like:

CVO creates a namespace for an operator.
CVO creates ... for the operator.
CVO creates the operator Deployment.
Operator deployment never comes up, for whatever reason.
Admin must-gathers.
Must gather uses ClusterOperators for discovering important stuff, and because the ClusterOperator doesn't exist yet, we get no data about why the deployment didn't come up.

So in 2a469e3 (#318), we added ClusterOperator pre-creation to get:

CVO pre-creates ClusterOperator for an operator.
CVO creates the namespace for an operator.
CVO creates ... for the operator.
CVO creates the operator Deployment.
Operator deployment never comes up, for whatever reason.
Admin must-gathers.
Must gather uses ClusterOperators for discovering important stuff, and finds the one the CVO had pre-created with hard-coded relatedObjects, gathers stuff from the referenced operator namespace, and allows us to trouble-shoot the issue.

However, all existing component operators already knew how to create their own ClusterOperator, because that was the only path before the CVO learned about pre-creation. And even since then, most new operators come into the cluster on install or on update, when the CVO is pre-creating. New in 4.12, the platform-operator is coming in, and it has two relevant characteristics:

It does not know how to create the platform-operators-aggregated ClusterOperator.
It is gated behind TechPreviewNoUpgrade.

So we are exposed to:

Admin installs a cluster. No platform-operators-aggregated, because it's not TechPreviewNoUpgrade.
Install complete. CVO transitions to reconciling mode.
Admin enables TechPreviewNoUpgrade.
CVO notices, and reboots fc00c62 (allow more than one featureset #821).
Because we decided to not transition into updating mode for feature-set changes, we stay in reconciling mode.
Because we're in reconciling mode, we skip the ClusterOperator pre-creation, and get right in to the status check.

Because the platform operator didn't create the ClusterOperator either, the CVO's status check fails with:

 45657:E0923 01:43:25.610286       1 task.go:117] error running apply for clusteroperator "openshift-platform-operators/platform-operators-aggregated" (587 of 960): clusteroperator.config.openshift.io "platform-operators-aggregated" not found

With this commit, I stop making the ClusterOperator pre-creation conditional, so the new flow is:

...
6. Even in reconciling mode, we pre-create the ClusterOperator.
7. Because we pre-created the ClusterOperator, the CVO's status check succeeds (at least, after the operator writes acceptable status to the ClusterOperator we've created for it).

This will also help us recover components where a bunch of in-cluster resources had been deleted, assuming the CVO was still alive. There may be other component operators who rely on the CVO for ClusterOperator creation, but which we haven't noticed because they aren't also gated behind TechPreviewNoUpgrade.

openshift-ci-robot · 2022-09-23T07:16:43Z

@wking: This pull request references Jira Issue OCPBUGS-1636, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.12.0) matches configured target version for branch (4.12.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jianzhangbjz

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Originally, all component operators were responsible for creating their own ClusterOperator, and we'd just watch to make sure we were happy enough with what they did. However, on install, or when updating to a version that added a new component, we could have timelines like:

CVO creates a namespace for an operator.

CVO creates ... for the operator.

CVO creates the operator Deployment.

Operator deployment never comes up, for whatever reason.

Admin must-gathers.

Must gather uses ClusterOperators for discovering important stuff, and because the ClusterOperator doesn't exist yet, we get no data about why the deployment didn't come up.

So in 2a469e3 (#318), we added ClusterOperator pre-creation to get:

CVO pre-creates ClusterOperator for an operator.

CVO creates the namespace for an operator.

CVO creates ... for the operator.

CVO creates the operator Deployment.

Operator deployment never comes up, for whatever reason.

Admin must-gathers.

Must gather uses ClusterOperators for discovering important stuff, and finds the one the CVO had pre-created with hard-coded relatedObjects, gathers stuff from the referenced operator namespace, and allows us to trouble-shoot the issue.

However, all existing component operators already knew how to create their own ClusterOperator, because that was the only path before the CVO learned about pre-creation. And even since then, most new operators come into the cluster on install or on update, when the CVO is pre-creating. New in 4.12, the platform-operator is coming in, and it has two relevant characteristics:

It does not know how to create the platform-operators-aggregated ClusterOperator.

It is gated behind TechPreviewNoUpgrade.

So we are exposed to:
Admin installs a cluster. No platform-operators-aggregated, because it's not TechPreviewNoUpgrade.

Install complete. CVO transitions to reconciling mode.

Admin enables TechPreviewNoUpgrade.

CVO notices, and reboots fc00c62 (allow more than one featureset #821).

Because we decided to not transition into updating mode for feature-set changes, we stay in reconciling mode.

Because we're in reconciling mode, we skip the ClusterOperator pre-creation, and get right in to the status check.
Because the platform operator didn't create the ClusterOperator either, the CVO's status check fails with:
45657:E0923 01:43:25.610286       1 task.go:117] error running apply for clusteroperator "openshift-platform-operators/platform-operators-aggregated" (587 of 960): clusteroperator.config.openshift.io "platform-operators-aggregated" not found
With this commit, I stop making the ClusterOperator pre-creation conditional, so the new flow is:

...
6. Even in reconciling mode, we pre-create the ClusterOperator.
7. Because we pre-created the ClusterOperator, the CVO's status check succeeds (at least, after the operator writes acceptable status to the ClusterOperator we've created for it).

This will also help us recover components where a bunch of in-cluster resources had been deleted, assuming the CVO was still alive. There may be other component operators who rely on the CVO for ClusterOperator creation, but which we haven't noticed because they aren't also gated behind TechPreviewNoUpgrade.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Originally, all component operators were responsible for creating their own ClusterOperator, and we'd just watch to make sure we were happy enough with what they did. However, on install, or when updating to a version that added a new component, we could have timelines like: 1. CVO creates a namespace for an operator. 2. CVO creates ... for the operator. 3. CVO creates the operator Deployment. 4. Operator deployment never comes up, for whatever reason. 5. Admin must-gathers. 6. Must gather uses ClusterOperators for discovering important stuff, and because the ClusterOperator doesn't exist yet, we get no data about why the deployment didn't come up. So in 2a469e3 (cvo: When installing or upgrading, fast-fill cluster-operators, 2020-02-07, openshift#318), we added ClusterOperator pre-creation to get: 1. CVO pre-creates ClusterOperator for an operator. 2. CVO creates the namespace for an operator. 3. CVO creates ... for the operator. 4. CVO creates the operator Deployment. 5. Operator deployment never comes up, for whatever reason. 6. Admin must-gathers. 7. Must gather uses ClusterOperators for discovering important stuff, and finds the one the CVO had pre-created with hard-coded relatedObjects, gathers stuff from the referenced operator namespace, and allows us to trouble-shoot the issue. However, all existing component operators already knew how to create their own ClusterOperator, because that was the only path before the CVO learned about pre-creation. And even since then, most new operators come into the cluster on install or on update, when the CVO is pre-creating. New in 4.12, the platform-operator is coming in [1], and it has two relevant characteristics: * It does not know how to create the platform-operators-aggregated ClusterOperator [2]. * It is gated behind TechPreviewNoUpgrade [3]. So we are exposed to: 1. Admin installs a cluster. No platform-operators-aggregated, because it's not TechPreviewNoUpgrade. 2. Install complete. CVO transitions to reconciling mode. 3. Admin enables TechPreviewNoUpgrade. 4. CVO notices, and reboots fc00c62 (update the manifest selection to honor any featureset, 2022-08-17, openshift#821). 5. Because we decided to not transition into updating mode for feature-set changes, we stay in reconciling mode. 6. Because we're in reconciling mode, we skip the ClusterOperator pre-creation, and get right in to the status check. 7. Because the platform operator didn't create the ClusterOperator either, the CVO's status check fails with [2]: 45657:E0923 01:43:25.610286 1 task.go:117] error running apply for clusteroperator "openshift-platform-operators/platform-operators-aggregated" (587 of 960): clusteroperator.config.openshift.io "platform-operators-aggregated" not found With this commit, I stop making the ClusterOperator pre-creation conditional, so the new flow is: ... 6. Even in reconciling mode, we pre-create the ClusterOperator. 7. Because we pre-created the ClusterOperator, the CVO's status check succeeds (at least, after the operator writes acceptable status to the ClusterOperator we've created for it). This will also help us recover components where a bunch of in-cluster resources had been deleted, assuming the CVO was still alive. There may be other component operators who rely on the CVO for ClusterOperator creation, but which we haven't noticed because they aren't also gated behind TechPreviewNoUpgrade. [1]: https://github.com/openshift/enhancements/blob/6e1697418be807d0ae567a9f83ac654a1fd0ee9a/enhancements/olm/platform-operators.md [2]: https://issues.redhat.com/browse/OCPBUGS-1636 [3]: https://github.com/openshift/platform-operators/blob/4ecea427cf5302dfcdf4a5af8d28eadebacc2037/manifests/0000_50_cluster-platform-operator-manager_07-aggregated-clusteroperator.yaml#L8

wking · 2022-09-23T21:33:40Z

/retest

jianzhangbjz · 2022-09-26T09:20:06Z

Passed, test steps see comments: https://issues.redhat.com/browse/OCPBUGS-1636
/label qe-approved

jottofar · 2022-09-26T14:31:42Z

/retest

LalatenduMohanty

/lgtm

openshift-ci · 2022-09-28T14:42:50Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LalatenduMohanty, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [LalatenduMohanty,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2022-09-28T14:56:48Z

/retest-required

Remaining retests: 0 against base HEAD 0ea295b and 2 for PR HEAD 6a0aa99 in total

openshift-ci-robot · 2022-09-28T18:03:29Z

/retest-required

Remaining retests: 0 against base HEAD e660bab and 1 for PR HEAD 6a0aa99 in total

wking · 2022-10-01T05:12:21Z

/retest

wking · 2022-10-01T18:46:47Z

/retest-required

wking · 2022-10-02T14:51:36Z

/retest-required

wking · 2022-10-02T22:37:00Z

Azure CSI SCC failures are unrelated.

/override ci/prow/e2e-agnostic-upgrade-into-change

openshift-ci · 2022-10-02T22:37:15Z

@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-upgrade-into-change

Details

In response to this:

Azure CSI SCC failures are unrelated.

/override ci/prow/e2e-agnostic-upgrade-into-change

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci · 2022-10-02T22:37:17Z

@wking: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-ci-robot · 2022-10-02T22:41:21Z

@wking: All pull requests linked via external trackers have merged:

openshift/cluster-version-operator#840

Jira Issue OCPBUGS-1636 has been moved to the MODIFIED state.

Details

In response to this:

Originally, all component operators were responsible for creating their own ClusterOperator, and we'd just watch to make sure we were happy enough with what they did. However, on install, or when updating to a version that added a new component, we could have timelines like:

CVO creates a namespace for an operator.

CVO creates ... for the operator.

CVO creates the operator Deployment.

Operator deployment never comes up, for whatever reason.

Admin must-gathers.

Must gather uses ClusterOperators for discovering important stuff, and because the ClusterOperator doesn't exist yet, we get no data about why the deployment didn't come up.

So in 2a469e3 (#318), we added ClusterOperator pre-creation to get:

CVO pre-creates ClusterOperator for an operator.

CVO creates the namespace for an operator.

CVO creates ... for the operator.

CVO creates the operator Deployment.

Operator deployment never comes up, for whatever reason.

Admin must-gathers.

Must gather uses ClusterOperators for discovering important stuff, and finds the one the CVO had pre-created with hard-coded relatedObjects, gathers stuff from the referenced operator namespace, and allows us to trouble-shoot the issue.

However, all existing component operators already knew how to create their own ClusterOperator, because that was the only path before the CVO learned about pre-creation. And even since then, most new operators come into the cluster on install or on update, when the CVO is pre-creating. New in 4.12, the platform-operator is coming in, and it has two relevant characteristics:

It does not know how to create the platform-operators-aggregated ClusterOperator.

It is gated behind TechPreviewNoUpgrade.

So we are exposed to:
Admin installs a cluster. No platform-operators-aggregated, because it's not TechPreviewNoUpgrade.

Install complete. CVO transitions to reconciling mode.

Admin enables TechPreviewNoUpgrade.

CVO notices, and reboots fc00c62 (allow more than one featureset #821).

Because we decided to not transition into updating mode for feature-set changes, we stay in reconciling mode.

Because we're in reconciling mode, we skip the ClusterOperator pre-creation, and get right in to the status check.
Because the platform operator didn't create the ClusterOperator either, the CVO's status check fails with:
45657:E0923 01:43:25.610286       1 task.go:117] error running apply for clusteroperator "openshift-platform-operators/platform-operators-aggregated" (587 of 960): clusteroperator.config.openshift.io "platform-operators-aggregated" not found
With this commit, I stop making the ClusterOperator pre-creation conditional, so the new flow is:

...
6. Even in reconciling mode, we pre-create the ClusterOperator.
7. Because we pre-created the ClusterOperator, the CVO's status check succeeds (at least, after the operator writes acceptable status to the ClusterOperator we've created for it).

This will also help us recover components where a bunch of in-cluster resources had been deleted, assuming the CVO was still alive. There may be other component operators who rely on the CVO for ClusterOperator creation, but which we haven't noticed because they aren't also gated behind TechPreviewNoUpgrade.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci bot requested review from LalatenduMohanty, jianzhangbjz and jottofar September 23, 2022 07:16

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 23, 2022

wking force-pushed the always-precreate-cluster-operators branch from 6e5918f to 6a0aa99 Compare September 23, 2022 07:25

openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Sep 26, 2022

LalatenduMohanty approved these changes Sep 28, 2022

View reviewed changes

openshift-ci bot assigned LalatenduMohanty Sep 28, 2022

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 28, 2022

openshift-merge-robot merged commit b1cff4e into openshift:master Oct 2, 2022

wking deleted the always-precreate-cluster-operators branch October 2, 2022 22:49

OCPBUGS-1636: pkg/cvo/sync_worker: Pre-create ClusterOperator in reconciling-mode too #840

OCPBUGS-1636: pkg/cvo/sync_worker: Pre-create ClusterOperator in reconciling-mode too #840

Uh oh!

Conversation

wking commented Sep 23, 2022

Uh oh!

openshift-ci-robot commented Sep 23, 2022

Uh oh!

wking commented Sep 23, 2022

Uh oh!

jianzhangbjz commented Sep 26, 2022

Uh oh!

jottofar commented Sep 26, 2022

Uh oh!

LalatenduMohanty left a comment

Choose a reason for hiding this comment

Uh oh!

openshift-ci bot commented Sep 28, 2022

Uh oh!

openshift-ci-robot commented Sep 28, 2022

Uh oh!

openshift-ci-robot commented Sep 28, 2022

Uh oh!

wking commented Oct 1, 2022

Uh oh!

wking commented Oct 1, 2022

Uh oh!

wking commented Oct 2, 2022

Uh oh!

wking commented Oct 2, 2022

Uh oh!

openshift-ci bot commented Oct 2, 2022

Uh oh!

openshift-ci bot commented Oct 2, 2022

Uh oh!

openshift-ci-robot commented Oct 2, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants