OCPBUGS-1636: pkg/cvo/sync_worker: Pre-create ClusterOperator in reconciling-mode too #840
@wking: This pull request references Jira Issue OCPBUGS-1636, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
Requesting review from QA contact. The bug has been updated to refer to the pull request using the external bug tracker.
Originally, all component operators were responsible for creating their own ClusterOperator, and we'd just watch to make sure we were happy enough with what they did. However, on install, or when updating to a version that added a new component, we could have timelines like:

1. CVO creates a namespace for an operator.
2. CVO creates ... for the operator.
3. CVO creates the operator Deployment.
4. Operator deployment never comes up, for whatever reason.
5. Admin must-gathers.
6. Must-gather uses ClusterOperators for discovering important stuff, and because the ClusterOperator doesn't exist yet, we get no data about why the deployment didn't come up.

So in 2a469e3 (cvo: When installing or upgrading, fast-fill cluster-operators, 2020-02-07, openshift#318), we added ClusterOperator pre-creation to get:

1. CVO pre-creates ClusterOperator for an operator.
2. CVO creates the namespace for an operator.
3. CVO creates ... for the operator.
4. CVO creates the operator Deployment.
5. Operator deployment never comes up, for whatever reason.
6. Admin must-gathers.
7. Must-gather uses ClusterOperators for discovering important stuff, finds the one the CVO had pre-created with hard-coded relatedObjects, gathers stuff from the referenced operator namespace, and allows us to troubleshoot the issue.

However, all existing component operators already knew how to create their own ClusterOperator, because that was the only path before the CVO learned about pre-creation. And even since then, most new operators come into the cluster on install or on update, when the CVO is pre-creating. New in 4.12, the platform-operator is coming in [1], and it has two relevant characteristics:

* It does not know how to create the platform-operators-aggregated ClusterOperator [2].
* It is gated behind TechPreviewNoUpgrade [3].

So we are exposed to:

1. Admin installs a cluster. No platform-operators-aggregated, because it's not TechPreviewNoUpgrade.
2. Install complete. CVO transitions to reconciling mode.
3. Admin enables TechPreviewNoUpgrade.
4. CVO notices, and reboots (fc00c62, update the manifest selection to honor any featureset, 2022-08-17, openshift#821).
5. Because we decided to not transition into updating mode for feature-set changes, we stay in reconciling mode.
6. Because we're in reconciling mode, we skip the ClusterOperator pre-creation, and go straight to the status check.
7. Because the platform operator didn't create the ClusterOperator either, the CVO's status check fails with [2]:

    45657:E0923 01:43:25.610286 1 task.go:117] error running apply for clusteroperator "openshift-platform-operators/platform-operators-aggregated" (587 of 960): clusteroperator.config.openshift.io "platform-operators-aggregated" not found

With this commit, I stop making the ClusterOperator pre-creation conditional, so the new flow is:

...
6. Even in reconciling mode, we pre-create the ClusterOperator.
7. Because we pre-created the ClusterOperator, the CVO's status check succeeds (at least, after the operator writes acceptable status to the ClusterOperator we've created for it).

This will also help us recover components where a bunch of in-cluster resources had been deleted, assuming the CVO was still alive. There may be other component operators that rely on the CVO for ClusterOperator creation, but which we haven't noticed because they aren't also gated behind TechPreviewNoUpgrade.

[1]: https://github.com/openshift/enhancements/blob/6e1697418be807d0ae567a9f83ac654a1fd0ee9a/enhancements/olm/platform-operators.md
[2]: https://issues.redhat.com/browse/OCPBUGS-1636
[3]: https://github.com/openshift/platform-operators/blob/4ecea427cf5302dfcdf4a5af8d28eadebacc2037/manifests/0000_50_cluster-platform-operator-manager_07-aggregated-clusteroperator.yaml#L8
Force-pushed from 6e5918f to 6a0aa99.
/retest
Passed. For test steps, see the comments on https://issues.redhat.com/browse/OCPBUGS-1636
/retest
LalatenduMohanty left a comment:
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: LalatenduMohanty, wking
/retest
/retest-required
/retest-required
Azure CSI SCC failures are unrelated.
/override ci/prow/e2e-agnostic-upgrade-into-change
@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-upgrade-into-change
@wking: all tests passed!
@wking: All pull requests linked via external trackers have merged. Jira Issue OCPBUGS-1636 has been moved to the MODIFIED state.