Skip to content

Conversation

@wking
Copy link
Member

@wking wking commented Sep 23, 2022

Originally, all component operators were responsible for creating their own ClusterOperator, and we'd just watch to make sure we were happy enough with what they did. However, on install, or when updating to a version that added a new component, we could have timelines like:

  1. CVO creates a namespace for an operator.
  2. CVO creates ... for the operator.
  3. CVO creates the operator Deployment.
  4. Operator deployment never comes up, for whatever reason.
  5. Admin must-gathers.
  6. Must gather uses ClusterOperators for discovering important stuff, and because the ClusterOperator doesn't exist yet, we get no data about why the deployment didn't come up.

So in 2a469e3 (#318), we added ClusterOperator pre-creation to get:

  1. CVO pre-creates ClusterOperator for an operator.
  2. CVO creates the namespace for an operator.
  3. CVO creates ... for the operator.
  4. CVO creates the operator Deployment.
  5. Operator deployment never comes up, for whatever reason.
  6. Admin must-gathers.
  7. Must gather uses ClusterOperators for discovering important stuff, and finds the one the CVO had pre-created with hard-coded relatedObjects, gathers stuff from the referenced operator namespace, and allows us to trouble-shoot the issue.

However, all existing component operators already knew how to create their own ClusterOperator, because that was the only path before the CVO learned about pre-creation. And even since then, most new operators come into the cluster on install or on update, when the CVO is pre-creating. New in 4.12, the platform-operator is coming in, and it has two relevant characteristics:

So we are exposed to:

  1. Admin installs a cluster. No platform-operators-aggregated, because it's not TechPreviewNoUpgrade.

  2. Install complete. CVO transitions to reconciling mode.

  3. Admin enables TechPreviewNoUpgrade.

  4. CVO notices, and reboots fc00c62 (allow more than one featureset #821).

  5. Because we decided to not transition into updating mode for feature-set changes, we stay in reconciling mode.

  6. Because we're in reconciling mode, we skip the ClusterOperator pre-creation, and get right in to the status check.

  7. Because the platform operator didn't create the ClusterOperator either, the CVO's status check fails with:

     45657:E0923 01:43:25.610286       1 task.go:117] error running apply for clusteroperator "openshift-platform-operators/platform-operators-aggregated" (587 of 960): clusteroperator.config.openshift.io "platform-operators-aggregated" not found
    

With this commit, I stop making the ClusterOperator pre-creation conditional, so the new flow is:

...
6. Even in reconciling mode, we pre-create the ClusterOperator.
7. Because we pre-created the ClusterOperator, the CVO's status check succeeds (at least, after the operator writes acceptable status to the ClusterOperator we've created for it).

This will also help us recover components where a bunch of in-cluster resources had been deleted, assuming the CVO was still alive. There may be other component operators who rely on the CVO for ClusterOperator creation, but which we haven't noticed because they aren't also gated behind TechPreviewNoUpgrade.

@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Sep 23, 2022
@openshift-ci-robot
Copy link
Contributor

@wking: This pull request references Jira Issue OCPBUGS-1636, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.12.0) matches configured target version for branch (4.12.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jianzhangbjz

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Originally, all component operators were responsible for creating their own ClusterOperator, and we'd just watch to make sure we were happy enough with what they did. However, on install, or when updating to a version that added a new component, we could have timelines like:

  1. CVO creates a namespace for an operator.
  2. CVO creates ... for the operator.
  3. CVO creates the operator Deployment.
  4. Operator deployment never comes up, for whatever reason.
  5. Admin must-gathers.
  6. Must gather uses ClusterOperators for discovering important stuff, and because the ClusterOperator doesn't exist yet, we get no data about why the deployment didn't come up.

So in 2a469e3 (#318), we added ClusterOperator pre-creation to get:

  1. CVO pre-creates ClusterOperator for an operator.
  2. CVO creates the namespace for an operator.
  3. CVO creates ... for the operator.
  4. CVO creates the operator Deployment.
  5. Operator deployment never comes up, for whatever reason.
  6. Admin must-gathers.
  7. Must gather uses ClusterOperators for discovering important stuff, and finds the one the CVO had pre-created with hard-coded relatedObjects, gathers stuff from the referenced operator namespace, and allows us to trouble-shoot the issue.

However, all existing component operators already knew how to create their own ClusterOperator, because that was the only path before the CVO learned about pre-creation. And even since then, most new operators come into the cluster on install or on update, when the CVO is pre-creating. New in 4.12, the platform-operator is coming in, and it has two relevant characteristics:

So we are exposed to:

  1. Admin installs a cluster. No platform-operators-aggregated, because it's not TechPreviewNoUpgrade.

  2. Install complete. CVO transitions to reconciling mode.

  3. Admin enables TechPreviewNoUpgrade.

  4. CVO notices, and reboots fc00c62 (allow more than one featureset #821).

  5. Because we decided to not transition into updating mode for feature-set changes, we stay in reconciling mode.

  6. Because we're in reconciling mode, we skip the ClusterOperator pre-creation, and get right in to the status check.

  7. Because the platform operator didn't create the ClusterOperator either, the CVO's status check fails with:

    45657:E0923 01:43:25.610286       1 task.go:117] error running apply for clusteroperator "openshift-platform-operators/platform-operators-aggregated" (587 of 960): clusteroperator.config.openshift.io "platform-operators-aggregated" not found
    

With this commit, I stop making the ClusterOperator pre-creation conditional, so the new flow is:

...
6. Even in reconciling mode, we pre-create the ClusterOperator.
7. Because we pre-created the ClusterOperator, the CVO's status check succeeds (at least, after the operator writes acceptable status to the ClusterOperator we've created for it).

This will also help us recover components where a bunch of in-cluster resources had been deleted, assuming the CVO was still alive. There may be other component operators who rely on the CVO for ClusterOperator creation, but which we haven't noticed because they aren't also gated behind TechPreviewNoUpgrade.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 23, 2022
Originally, all component operators were responsible for creating
their own ClusterOperator, and we'd just watch to make sure we were
happy enough with what they did.  However, on install, or when
updating to a version that added a new component, we could have
timelines like:

1. CVO creates a namespace for an operator.
2. CVO creates ... for the operator.
3. CVO creates the operator Deployment.
4. Operator deployment never comes up, for whatever reason.
5. Admin must-gathers.
6. Must gather uses ClusterOperators for discovering important stuff,
   and because the ClusterOperator doesn't exist yet, we get no data
   about why the deployment didn't come up.

So in 2a469e3 (cvo: When installing or upgrading, fast-fill
cluster-operators, 2020-02-07, openshift#318), we added ClusterOperator
pre-creation to get:

1. CVO pre-creates ClusterOperator for an operator.
2. CVO creates the namespace for an operator.
3. CVO creates ... for the operator.
4. CVO creates the operator Deployment.
5. Operator deployment never comes up, for whatever reason.
6. Admin must-gathers.
7. Must gather uses ClusterOperators for discovering important stuff,
   and finds the one the CVO had pre-created with hard-coded
   relatedObjects, gathers stuff from the referenced operator
   namespace, and allows us to trouble-shoot the issue.

However, all existing component operators already knew how to create
their own ClusterOperator, because that was the only path before the
CVO learned about pre-creation.  And even since then, most new
operators come into the cluster on install or on update, when the CVO
is pre-creating.  New in 4.12, the platform-operator is coming in [1],
and it has two relevant characteristics:

* It does not know how to create the platform-operators-aggregated
  ClusterOperator [2].
* It is gated behind TechPreviewNoUpgrade [3].

So we are exposed to:

1. Admin installs a cluster.  No platform-operators-aggregated,
   because it's not TechPreviewNoUpgrade.
2. Install complete.  CVO transitions to reconciling mode.
3. Admin enables TechPreviewNoUpgrade.
4. CVO notices, and reboots fc00c62 (update the manifest selection
   to honor any featureset, 2022-08-17, openshift#821).
5. Because we decided to not transition into updating mode for
   feature-set changes, we stay in reconciling mode.
6. Because we're in reconciling mode, we skip the ClusterOperator
   pre-creation, and get right in to the status check.
7. Because the platform operator didn't create the ClusterOperator
   either, the CVO's status check fails with [2]:

     45657:E0923 01:43:25.610286       1 task.go:117] error running apply for clusteroperator "openshift-platform-operators/platform-operators-aggregated" (587 of 960): clusteroperator.config.openshift.io "platform-operators-aggregated" not found

With this commit, I stop making the ClusterOperator pre-creation
conditional, so the new flow is:

...
6. Even in reconciling mode, we pre-create the ClusterOperator.
7. Because we pre-created the ClusterOperator, the CVO's status check
   succeeds (at least, after the operator writes acceptable status to
   the ClusterOperator we've created for it).

This will also help us recover components where a bunch of in-cluster
resources had been deleted, assuming the CVO was still alive.  There
may be other component operators who rely on the CVO for
ClusterOperator creation, but which we haven't noticed because they
aren't also gated behind TechPreviewNoUpgrade.

[1]: https://github.com/openshift/enhancements/blob/6e1697418be807d0ae567a9f83ac654a1fd0ee9a/enhancements/olm/platform-operators.md
[2]: https://issues.redhat.com/browse/OCPBUGS-1636
[3]: https://github.com/openshift/platform-operators/blob/4ecea427cf5302dfcdf4a5af8d28eadebacc2037/manifests/0000_50_cluster-platform-operator-manager_07-aggregated-clusteroperator.yaml#L8
@wking wking force-pushed the always-precreate-cluster-operators branch from 6e5918f to 6a0aa99 Compare September 23, 2022 07:25
@wking
Copy link
Member Author

wking commented Sep 23, 2022

/retest

@jianzhangbjz
Copy link
Member

Passed, test steps see comments: https://issues.redhat.com/browse/OCPBUGS-1636
/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Sep 26, 2022
@jottofar
Copy link
Contributor

/retest

Copy link
Member

@LalatenduMohanty LalatenduMohanty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 28, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 28, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LalatenduMohanty, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [LalatenduMohanty,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD 0ea295b and 2 for PR HEAD 6a0aa99 in total

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD e660bab and 1 for PR HEAD 6a0aa99 in total

@wking
Copy link
Member Author

wking commented Oct 1, 2022

/retest

@wking
Copy link
Member Author

wking commented Oct 1, 2022

/retest-required

1 similar comment
@wking
Copy link
Member Author

wking commented Oct 2, 2022

/retest-required

@wking
Copy link
Member Author

wking commented Oct 2, 2022

Azure CSI SCC failures are unrelated.

/override ci/prow/e2e-agnostic-upgrade-into-change

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 2, 2022

@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-upgrade-into-change

Details

In response to this:

Azure CSI SCC failures are unrelated.

/override ci/prow/e2e-agnostic-upgrade-into-change

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 2, 2022

@wking: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit b1cff4e into openshift:master Oct 2, 2022
@openshift-ci-robot
Copy link
Contributor

@wking: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-1636 has been moved to the MODIFIED state.

Details

In response to this:

Originally, all component operators were responsible for creating their own ClusterOperator, and we'd just watch to make sure we were happy enough with what they did. However, on install, or when updating to a version that added a new component, we could have timelines like:

  1. CVO creates a namespace for an operator.
  2. CVO creates ... for the operator.
  3. CVO creates the operator Deployment.
  4. Operator deployment never comes up, for whatever reason.
  5. Admin must-gathers.
  6. Must gather uses ClusterOperators for discovering important stuff, and because the ClusterOperator doesn't exist yet, we get no data about why the deployment didn't come up.

So in 2a469e3 (#318), we added ClusterOperator pre-creation to get:

  1. CVO pre-creates ClusterOperator for an operator.
  2. CVO creates the namespace for an operator.
  3. CVO creates ... for the operator.
  4. CVO creates the operator Deployment.
  5. Operator deployment never comes up, for whatever reason.
  6. Admin must-gathers.
  7. Must gather uses ClusterOperators for discovering important stuff, and finds the one the CVO had pre-created with hard-coded relatedObjects, gathers stuff from the referenced operator namespace, and allows us to trouble-shoot the issue.

However, all existing component operators already knew how to create their own ClusterOperator, because that was the only path before the CVO learned about pre-creation. And even since then, most new operators come into the cluster on install or on update, when the CVO is pre-creating. New in 4.12, the platform-operator is coming in, and it has two relevant characteristics:

So we are exposed to:

  1. Admin installs a cluster. No platform-operators-aggregated, because it's not TechPreviewNoUpgrade.

  2. Install complete. CVO transitions to reconciling mode.

  3. Admin enables TechPreviewNoUpgrade.

  4. CVO notices, and reboots fc00c62 (allow more than one featureset #821).

  5. Because we decided to not transition into updating mode for feature-set changes, we stay in reconciling mode.

  6. Because we're in reconciling mode, we skip the ClusterOperator pre-creation, and get right in to the status check.

  7. Because the platform operator didn't create the ClusterOperator either, the CVO's status check fails with:

    45657:E0923 01:43:25.610286       1 task.go:117] error running apply for clusteroperator "openshift-platform-operators/platform-operators-aggregated" (587 of 960): clusteroperator.config.openshift.io "platform-operators-aggregated" not found
    

With this commit, I stop making the ClusterOperator pre-creation conditional, so the new flow is:

...
6. Even in reconciling mode, we pre-create the ClusterOperator.
7. Because we pre-created the ClusterOperator, the CVO's status check succeeds (at least, after the operator writes acceptable status to the ClusterOperator we've created for it).

This will also help us recover components where a bunch of in-cluster resources had been deleted, assuming the CVO was still alive. There may be other component operators who rely on the CVO for ClusterOperator creation, but which we haven't noticed because they aren't also gated behind TechPreviewNoUpgrade.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking deleted the always-precreate-cluster-operators branch October 2, 2022 22:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged. qe-approved Signifies that QE has signed off on this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants